Re: [PATCH v2 net-next] liquidio: improve UDP TX performance
On 02/21/2017 01:09 PM, Felix Manlunas wrote: From: VSR Burru <veerasenareddy.bu...@cavium.com>

Improve UDP TX performance by:
* reducing the ring size from 2K to 512
* replacing the numerous streaming DMA allocations for info buffers and gather lists with one large consistent DMA allocation per ring

Netperf benchmark numbers before and after patch:

PF UDP TX
+--------+--------+------------+------------+---------+
|        |        |   Before   |   After    |         |
| Number |        |   Patch    |   Patch    |         |
|   of   | Packet | Throughput | Throughput | Percent |
| Flows  |  Size  |   (Gbps)   |   (Gbps)   | Change  |
+--------+--------+------------+------------+---------+
|        |   360  |    0.52    |    0.93    |  +78.9  |
|    1   |  1024  |    1.62    |    2.84    |  +75.3  |
|        |  1518  |    2.44    |    4.21    |  +72.5  |
+--------+--------+------------+------------+---------+
|        |   360  |    0.45    |    1.59    | +253.3  |
|    4   |  1024  |    1.34    |    5.48    | +308.9  |
|        |  1518  |    2.27    |    8.31    | +266.1  |
+--------+--------+------------+------------+---------+
|        |   360  |    0.40    |    1.61    | +302.5  |
|    8   |  1024  |    1.64    |    4.24    | +158.5  |
|        |  1518  |    2.87    |    6.52    | +127.2  |
+--------+--------+------------+------------+---------+

VF UDP TX
+--------+--------+------------+------------+---------+
|        |        |   Before   |   After    |         |
| Number |        |   Patch    |   Patch    |         |
|   of   | Packet | Throughput | Throughput | Percent |
| Flows  |  Size  |   (Gbps)   |   (Gbps)   | Change  |
+--------+--------+------------+------------+---------+
|        |   360  |    1.28    |    1.49    |  +16.4  |
|    1   |  1024  |    4.44    |    4.39    |   -1.1  |
|        |  1518  |    6.08    |    6.51    |   +7.1  |
+--------+--------+------------+------------+---------+
|        |   360  |    2.35    |    2.35    |    0.0  |
|    4   |  1024  |    6.41    |    8.07    |  +25.9  |
|        |  1518  |    9.56    |    9.54    |   -0.2  |
+--------+--------+------------+------------+---------+
|        |   360  |    3.41    |    3.65    |   +7.0  |
|    8   |  1024  |    9.35    |    9.34    |   -0.1  |
|        |  1518  |    9.56    |    9.57    |   +0.1  |
+--------+--------+------------+------------+---------+

Some good looking numbers there. As one approaches the wire limit for bitrate, the likes of a netperf service demand can be used to demonstrate the performance change - though there isn't an easy way to do that for parallel flows.

happy benchmarking,

rick jones
Re: [PATCH net-next] liquidio: improve UDP TX performance
On 02/16/2017 10:38 AM, Felix Manlunas wrote: From: VSR Burru <veerasenareddy.bu...@cavium.com>

Improve UDP TX performance by:
* reducing the ring size from 2K to 512
* replacing the numerous streaming DMA allocations for info buffers and gather lists with one large consistent DMA allocation per ring

By how much was UDP TX performance improved?

happy benchmarking,

rick jones
Re: [PATCH net-next] virtio: Fix affinity for >32 VCPUs
On 02/03/2017 10:31 AM, Willem de Bruijn wrote: Configuring interrupts and xps from userspace at boot is more robust, as device driver defaults can change. But especially for customers who are unaware of these settings, choosing sane defaults won't hurt. The devil is in finding the sane defaults. For example, the issues we've seen with VMs sending traffic getting reordered when the driver took it upon itself to enable xps. rick jones
Re: [PATCH net-next] virtio: Fix affinity for >32 VCPUs
On 02/03/2017 10:22 AM, Benjamin Serebrin wrote: Thanks, Michael, I'll put this text in the commit log: XPS settings aren't write-able from userspace, so the only way I know to fix XPS is in the driver.

??

root@np-cp1-c0-m1-mgmt:/home/stack# cat /sys/devices/pci0000:00/0000:00:02.0/0000:04:00.0/net/hed0/queues/tx-0/xps_cpus
00000000,00000001
root@np-cp1-c0-m1-mgmt:/home/stack# echo 0 > /sys/devices/pci0000:00/0000:00:02.0/0000:04:00.0/net/hed0/queues/tx-0/xps_cpus
root@np-cp1-c0-m1-mgmt:/home/stack# cat /sys/devices/pci0000:00/0000:00:02.0/0000:04:00.0/net/hed0/queues/tx-0/xps_cpus
00000000,00000000
Re: [PATCH net-next] tcp: accept RST for rcv_nxt - 1 after receiving a FIN
On 01/17/2017 11:13 AM, Eric Dumazet wrote: On Tue, Jan 17, 2017 at 11:04 AM, Rick Jones <rick.jon...@hpe.com> wrote:

Drifting a bit, and it doesn't change the value of dealing with it, but out of curiosity, when you say mostly in CLOSE_WAIT, why aren't the server-side applications reacting to the read return of zero triggered by the arrival of the FIN?

Even if the application reacts and calls close(fd), the kernel will still try to push the data that was queued into the socket write queue prior to receiving the FIN. By allowing this RST, we can flush the whole data and react much faster, avoiding locking memory in the kernel for a very long time.

Understood. I was just wondering if there is also an application bug here.

happy benchmarking,

rick jones
Re: [PATCH net-next] tcp: accept RST for rcv_nxt - 1 after receiving a FIN
On 01/17/2017 10:37 AM, Jason Baron wrote: From: Jason Baron <jba...@akamai.com>

Using a Mac OSX box as a client connecting to a Linux server, we have found that when certain applications (such as 'ab') are abruptly terminated (via ^C), a FIN is sent followed by a RST packet on tcp connections. The FIN is accepted by the Linux stack but the RST is sent with the same sequence number as the FIN, and Linux responds with a challenge ACK per RFC 5961. The OSX client then sometimes (they are rate-limited) does not reply with any RST as would be expected on a closed socket. This results in sockets accumulating on the Linux server left mostly in the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible. This sequence of events can tie up a lot of resources on the Linux server since there may be a lot of data in write buffers at the time of the RST. Accepting a RST equal to rcv_nxt - 1, after we have already successfully processed a FIN, has made a significant difference for us in practice, by freeing up unneeded resources in a more expedient fashion.

Drifting a bit, and it doesn't change the value of dealing with it, but out of curiosity, when you say mostly in CLOSE_WAIT, why aren't the server-side applications reacting to the read return of zero triggered by the arrival of the FIN?

happy benchmarking,

rick jones
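Rick's question points at the usual application-side remedy: a server that treats a read return of zero as EOF and closes promptly will not sit in CLOSE_WAIT. A minimal sketch of such a handler (hypothetical code, not from the thread):

```python
import socket

def serve_echo_once(conn: socket.socket) -> None:
    """Echo until EOF; the prompt close() is what keeps the socket
    from lingering in CLOSE_WAIT after the peer's FIN arrives."""
    try:
        while True:
            data = conn.recv(4096)
            if not data:    # read return of zero: the peer sent a FIN
                break
            conn.sendall(data)
    finally:
        conn.close()        # without this, the socket stays in CLOSE_WAIT
```

As Eric notes elsewhere in the thread, this does not make the kernel-side fix unnecessary: data already queued in the write buffer is still pushed after close().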
Re: [pull request][for-next] Mellanox mlx5 Reorganize core driver directory layout
On 01/13/2017 02:56 PM, Tom Herbert wrote: On Fri, Jan 13, 2017 at 2:45 PM, Saeed Mahameed what configuration are you running ? what traffic ? Nothing fancy. 8 queues and 20 concurrent netperf TCP_STREAMs trips it. Not a lot of them, but I don't think we really should ever see these errors. Straight-up defaults with netperf, or do you use specific -s/S or -m/M options? happy benchmarking, rick jones
Re: [PATCH net-next] udp: under rx pressure, try to condense skbs
On 12/08/2016 07:30 AM, Eric Dumazet wrote: On Thu, 2016-12-08 at 10:46 +0100, Jesper Dangaard Brouer wrote: Hmmm... I'm not thrilled to have such heuristics, that change memory behavior when half of the queue size (sk->sk_rcvbuf) is reached. Well, copybreak drivers do that unconditionally, even under no stress at all, you really should complain then. Isn't that behaviour based (in part?) on the observation/belief that it is fewer cycles to copy the small packet into a small buffer than to send the larger buffer up the stack and have to allocate and map a replacement? rick jones
Re: [PATCH net-next 2/4] mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs
On 12/02/2016 03:23 PM, Martin KaFai Lau wrote: When XDP prog is attached, it is currently limiting MTU to be FRAG_SZ0 - ETH_HLEN - (2 * VLAN_HLEN) which is 1514 in x86. AFAICT, since mlx4 is doing one page per packet for XDP, we can at least raise the MTU limitation up to PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN) which this patch is doing. It will be useful in the next patch which allows XDP program to extend the packet by adding new header(s). Is mlx4 the only driver doing page-per-packet? rick jones
Re: Initial thoughts on TXDP
On 12/01/2016 02:12 PM, Tom Herbert wrote: We have to consider both request size and response size in RPC. Presumably, something like a memcache server is mostly serving data as opposed to reading it, so we are looking at receiving much smaller packets than are being sent. Requests are going to be quite small, say 100 bytes, and unless we are doing a significant amount of pipelining on connections GRO would rarely kick in. Response size will have a lot of variability, anything from a few kilobytes up to a megabyte. I'm sorry I can't be more specific; this is an artifact of datacenters that have 100s of different applications and communication patterns. Maybe 100b request size and 8K, 16K, 64K response sizes might be good for a test.

No worries on the specific sizes, it is a classic "How long is a piece of string?" sort of question. Not surprisingly, as the size of what is being received grows, so too does the delta between GRO on and off.

stack@np-cp1-c0-m1-mgmt:~/rjones2$ HDR="-P 1"; for r in 8K 16K 64K 1M; do for gro in on off; do sudo ethtool -K hed0 gro ${gro}; brand="$r gro $gro"; ./netperf -B "$brand" -c -H np-cp1-c1-m3-mgmt -t TCP_RR $HDR -- -P 12867 -r 128,${r} -o result_brand,throughput,local_sd; HDR="-P 0"; done; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Result Tag,Throughput,Local Service Demand
"8K gro on",9899.84,35.947
"8K gro off",7299.54,61.097
"16K gro on",8119.38,58.367
"16K gro off",5176.87,95.317
"64K gro on",4429.57,110.629
"64K gro off",2128.58,289.913
"1M gro on",887.85,918.447
"1M gro off",335.97,3427.587

So that gives a feel for by how much this alternative mechanism would have to reduce path-length to maintain the CPU overhead, were the mechanism to preclude GRO.

rick
Re: Initial thoughts on TXDP
On 12/01/2016 12:18 PM, Tom Herbert wrote: On Thu, Dec 1, 2016 at 11:48 AM, Rick Jones <rick.jon...@hpe.com> wrote:

Just how much per-packet path-length are you thinking will go away under the likes of TXDP? It is admittedly "just" netperf but losing TSO/GSO does some non-trivial things to effective overhead (service demand) and so throughput:

For plain in-order TCP packets I believe we should be able to process each packet at nearly the same speed as GRO. Most of the protocol processing we do between GRO and the stack is the same; the differences are that we need to do a connection lookup in the stack path (note we now do this in UDP GRO and that hasn't shown up as a major hit). We also need to consider enqueue/dequeue on the socket, which is a major reason to try for lockless sockets in this instance.

So waving hands a bit, and taking the service demand for the GRO-on receive test in my previous message (860 ns/KB), that would be ~ (1448/1024)*860 or ~1.216 usec of CPU time per TCP segment, including ACK generation which, unless an explicit ACK-avoidance heuristic a la HP-UX 11/Solaris 2 is put in place, would be for every other segment. Etc etc.

Sure, but try running something that emulates a more realistic workload than a TCP stream, like an RR test with relatively small payload and many connections.

That is a good point, which of course is why the RR tests are there in netperf :) Don't get me wrong, I *like* seeing path-length reductions.

What would you posit is a relatively small payload? The promotion of IR10 suggests that perhaps 14KB or so is sufficiently common, so I'll grasp at that as the length of a piece of string:

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_RR -- -P 12867 -r 128,14K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size     Size    Time     Rate     local  remote local   remote
bytes  bytes  bytes    bytes   secs.    per sec  % S    % U    us/Tr   us/Tr

16384  87380  128      14336   10.00    8118.31  1.57   -1.00  46.410  -1.000
16384  87380

stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_RR -- -P 12867 -r 128,14K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size     Size    Time     Rate     local  remote local   remote
bytes  bytes  bytes    bytes   secs.    per sec  % S    % U    us/Tr   us/Tr

16384  87380  128      14336   10.00    5837.35  2.20   -1.00  90.628  -1.000
16384  87380

So, losing GRO doubled the service demand. I suppose I could see cutting path-length in half based on the things you listed which would be bypassed? I'm sure mileage will vary with different NICs and CPUs. The ones used here happened to be to hand.

happy benchmarking,

rick

Just to get a crude feel for sensitivity, doubling to 28K unsurprisingly goes to more than doubling, and halving to 7K narrows the delta:

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_RR -- -P 12867 -r 128,28K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size     Size    Time     Rate     local  remote local   remote
bytes  bytes  bytes    bytes   secs.    per sec  % S    % U    us/Tr   us/Tr

16384  87380  128      28672   10.00    6732.32  1.79   -1.00  63.819  -1.000
16384  87380

stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_RR -- -P 12867 -r 128,28K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size     Size    Time     Rate     local  remote local   remote
bytes  bytes  bytes    bytes   secs.    per sec  % S    % U    us/Tr   us/Tr

16384  87380  128      28672   10.00    3780.47  2.32   -1.00  147.280 -1.000
16384  87380

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_RR -- -P 12867 -r 128,7K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size     Size    Time     Rate     local  remote local   remote
bytes  bytes  bytes    bytes   secs.    per sec  % S
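Rick's hand-waving arithmetic is easy to check. A quick sketch, using only numbers quoted in the thread (860 ns/KB GRO-on service demand, and the 14K TCP_RR us/Tr figures):

```python
# Service demand from the GRO-on receive test: 860 ns per KB (1024 bytes).
sd_ns_per_kb = 860.0
mss = 1448  # typical TCP payload bytes per segment with a 1500-byte MTU

# CPU time per TCP segment at that service demand.
ns_per_segment = (mss / 1024) * sd_ns_per_kb
print(round(ns_per_segment))  # ~1216 ns, i.e. ~1.216 usec per segment

# The "losing GRO doubled the service demand" observation, from the
# 14K TCP_RR runs (us/Tr with GRO off vs. GRO on):
gro_off_over_on = 90.628 / 46.410  # ~1.95x
```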
Re: Initial thoughts on TXDP
On 12/01/2016 11:05 AM, Tom Herbert wrote: For GSO and GRO the rationale is that performing the extra SW processing to do the offloads is significantly less expensive than running each packet through the full stack. This is true in a multi-layered generalized stack. In TXDP, however, we should be able to optimize the stack data path such that that would no longer be true. For instance, if we can process the packets received on a connection quickly enough so that it's about the same or just a little more costly than GRO processing, then we might bypass GRO entirely. TSO is probably still relevant in TXDP since it reduces the overhead of processing TX in the device itself.

Just how much per-packet path-length are you thinking will go away under the likes of TXDP? It is admittedly "just" netperf but losing TSO/GSO does some non-trivial things to effective overhead (service demand) and so throughput:

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P 12867
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      9260.24   2.02     -1.00    0.428   -1.000

stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 tso off gso off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P 12867
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      5621.82   4.25     -1.00    1.486   -1.000

And that is still with the stretch-ACKs induced by GRO at the receiver. Losing GRO has quite similar results:

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_MAERTS -- -P 12867
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Recv     Send     Recv    Send
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      9154.02   4.00     -1.00    0.860   -1.000

stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_MAERTS -- -P 12867
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Recv     Send     Recv    Send
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      4212.06   5.36     -1.00    2.502   -1.000

I'm sure there is a very non-trivial "it depends" component here - netperf will get the peak benefit from *SO and so one will see the peak difference in service demands - but even if one gets only 6 segments per *SO that is a lot of path-length to make-up. 4.4 kernel, BE3 NICs ... E5-2640 0 @ 2.50GHz

And even if one does have the CPU cycles to burn, so to speak, the effect on power consumption needs to be included in the calculus.

happy benchmarking,

rick jones
Re: Netperf UDP issue with connected sockets
On 11/30/2016 02:43 AM, Jesper Dangaard Brouer wrote: Notice the "fib_lookup" cost is still present, even when I use option "-- -n -N" to create a connected socket. As Eric taught us, this is because we should use syscalls "send" or "write" on a connected socket.

In theory, once the data socket is connected, the send_data() call in src/nettest_omni.c is supposed to use send() rather than sendto(). And indeed, based on a quick check, send() is what is being called, though it becomes, it seems, a sendto() system call - with the destination information NULL:

write(1, "send\n", 5) = 5
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024
write(1, "send\n", 5) = 5
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024

So I'm not sure what might be going-on there. You can get netperf to use write() instead of send() by adding a test-specific -I option.

happy benchmarking,

rick

My udp_flood tool[1] cycles through the different syscalls:

taskset -c 2 ~/git/network-testing/src/udp_flood 198.18.50.1 --count $((10**7)) --pmtu 2

             ns/pkt  pps         cycles/pkt
 send        473.08  2113816.28  1891
 sendto      558.58  1790265.84  2233
 sendmsg     587.24  1702873.80  2348
 sendMmsg/32 547.57  1826265.90  2189
 write       518.36  1929175.52  2072

Using "send" seems to be the fastest option. Some notes on the test: I've forced TX completions to happen on another CPU (CPU0) and pinned the udp_flood program (to CPU2), as I want to avoid the CPU scheduler moving udp_flood around, as this causes fluctuations in the results (it stresses the memory allocations more). My udp_flood --pmtu option is documented in the --help usage text (see below signature)
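For reference, the connect()-then-send() pattern under discussion - fix the destination once so each transmit avoids per-packet destination handling - can be sketched as follows (loopback addresses purely for illustration):

```python
import socket

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))        # ephemeral port stands in for the sink
rx.settimeout(5)

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.connect(rx.getsockname())     # fix the destination once; the kernel
                                 # can cache the route in the socket

payload = b"netperf\0" * 128     # 1024 bytes, as in the strace above
tx.send(payload)                 # send(), not sendto(): no per-call
                                 # destination argument to process

data, _ = rx.recvfrom(2048)
```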
Re: Netperf UDP issue with connected sockets
On 11/28/2016 10:33 AM, Rick Jones wrote: On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote: time to try IP_MTU_DISCOVER ;) To Rick, maybe you can find a good solution or option with Eric's hint, to send appropriate sized UDP packets with Don't Fragment (DF).

Jesper - Top of trunk has a change adding an omni, test-specific -f option which will set IP_MTU_DISCOVER:IP_PMTUDISC_DO on the data socket. Is that sufficient to your needs? Usage examples:

raj@tardy:~/netperf2_trunk/src$ ./netperf -t UDP_STREAM -l 1 -H raj-folio.americas.hpqcorp.net -- -m 1472 -f
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to raj-folio.americas.hpqcorp.net () port 0 AF_INET
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1472   1.00        77495      0      912.35
212992           1.00        77495             912.35

[1]+  Done                    emacs nettest_omni.c
raj@tardy:~/netperf2_trunk/src$ ./netperf -t UDP_STREAM -l 1 -H raj-folio.americas.hpqcorp.net -- -m 14720 -f
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to raj-folio.americas.hpqcorp.net () port 0 AF_INET
send_data: data send error: Message too long (errno 90)
netperf: send_omni: send_data failed: Message too long

happy benchmarking,

rick jones
Re: Netperf UDP issue with connected sockets
On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote: time to try IP_MTU_DISCOVER ;) To Rick, maybe you can find a good solution or option with Eric's hint, to send appropriate sized UDP packets with Don't Fragment (DF). Jesper - Top of trunk has a change adding an omni, test-specific -f option which will set IP_MTU_DISCOVER:IP_PMTUDISC_DO on the data socket. Is that sufficient to your needs? happy benchmarking, rick
Re: Netperf UDP issue with connected sockets
On 11/17/2016 04:37 PM, Julian Anastasov wrote: On Thu, 17 Nov 2016, Rick Jones wrote:

raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf -F src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472
...
socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0

A connected socket can benefit from a dst cached in the socket, but not if SO_DONTROUTE is set. If we do not want to send packets via a gateway this -l 1 should help, but I don't see an IP_TTL setsockopt in your first example with connect() to 127.0.0.1. Also, maybe there can be another default: if -l is used to specify TTL then SO_DONTROUTE should not be set. I.e. we should avoid SO_DONTROUTE, if possible.

The global -l option specifies the duration of the test. It doesn't specify the TTL of the IP datagrams being generated by the actions of the test.

I resisted setting SO_DONTROUTE for a number of years after the first instance of UDP_STREAM being used in link up/down testing took-out a company's network (including security camera feeds to galactic HQ), but at this point I'm likely to keep it in there because there ended-up being a second such incident. It is set only for UDP_STREAM. It isn't set for UDP_RR or TCP_*. And for UDP_STREAM it can be overridden by the test-specific -R option.

happy benchmarking,

rick jones
Re: Netperf UDP issue with connected sockets
On 11/17/2016 01:44 PM, Eric Dumazet wrote: because netperf sends the same message over and over...

Well, sort of, by default. That can be altered to a degree. The global -F option should cause netperf to fill the buffers in its send ring with data from the specified file. The number of buffers in the send ring can be controlled via the global -W option. The number of elements in the ring will default to one more than the initial SO_SNDBUF size divided by the send size.

raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf -F src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472
...
socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0
setsockopt(4, SOL_IP, IP_RECVERR, [1], 4) = 0
open("src/nettest_omni.c", O_RDONLY) = 5
fstat(5, {st_dev=makedev(8, 2), st_ino=82075297, st_mode=S_IFREG|0664, st_nlink=1, st_uid=1000, st_gid=1000, st_blksize=4096, st_blocks=456, st_size=230027, st_atime=2016/11/16-09:49:29, st_mtime=2016/11/16-09:49:24, st_ctime=2016/11/16-09:49:24}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3099f62000
read(5, "#ifdef HAVE_CONFIG_H\n#include <c"..., 4096) = 4096
read(5, "_INTEGER *intvl_two_ptr = "..., 4096) = 4096
read(5, "interval_count = interval_burst;"..., 4096) = 4096
read(5, ";\n\n/* these will control the wid"..., 4096) = 4096
read(5, "\n LOCAL_SECURITY_ENABLED_NUM,\n "..., 4096) = 4096
read(5, " , \n "..., 4096) = 4096
...
rt_sigaction(SIGALRM, {0x402ea6, [ALRM], SA_RESTORER|SA_INTERRUPT, 0x7f30994a7cb0}, NULL, 8) = 0
rt_sigaction(SIGINT, {0x402ea6, [INT], SA_RESTORER|SA_INTERRUPT, 0x7f30994a7cb0}, NULL, 8) = 0
alarm(1) = 0
sendto(4, "#ifdef HAVE_CONFIG_H\n#include <c"..., 1472, 0, {sa_family=AF_INET, sin_port=htons(58088), sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, " used\\n\\\n-m local,remote S"..., 1472, 0, {sa_family=AF_INET, sin_port=htons(58088), sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, " do here but clear the legacy fl"..., 1472, 0, {sa_family=AF_INET, sin_port=htons(58088), sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, "e before we scan the test-specif"..., 1472, 0, {sa_family=AF_INET, sin_port=htons(58088), sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, "\n\n\tfprintf(where,\n\t\ttput_fmt_1_l"..., 1472, 0, {sa_family=AF_INET, sin_port=htons(58088), sin_addr=inet_addr("127.0.0.1")}, 16) = 1472

Of course, it will continue to send the same messages from the send_ring over and over instead of putting different data into the buffers each time, but if one has a sufficiently large -W option specified...

happy benchmarking,

rick jones
Re: Netperf UDP issue with connected sockets
On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote: time to try IP_MTU_DISCOVER ;) To Rick, maybe you can find a good solution or option with Eric's hint, to send appropriate sized UDP packets with Don't Fragment (DF). Well, I suppose adding another setsockopt() to the data socket creation wouldn't be too difficult, along with another command-line option to cause it to happen. Could we leave things as "make sure you don't need fragmentation when you use this" or would netperf have to start processing ICMP messages? happy benchmarking, rick jones
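The extra setsockopt() Rick is contemplating would look something like the sketch below. The getattr() fallbacks (10 and 2 are the Linux values of the constants) are an assumption for builds of Python whose socket module does not export them:

```python
import socket

# Linux constant values, used where the socket module doesn't export them.
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# Set DF on outgoing datagrams: an oversized send now fails with EMSGSIZE
# (netperf's "Message too long") instead of being fragmented.
s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
```

With IP_PMTUDISC_DO set, "make sure you don't need fragmentation" becomes enforceable: the kernel rejects oversized sends locally, so netperf would not strictly need to process ICMP to avoid fragmenting.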
Re: Netperf UDP issue with connected sockets
On 11/16/2016 02:40 PM, Jesper Dangaard Brouer wrote: On Wed, 16 Nov 2016 09:46:37 -0800 Rick Jones <rick.jon...@hpe.com> wrote: It is a wild guess, but does setting SO_DONTROUTE affect whether or not a connect() would have the desired effect? That is there to protect people from themselves (long story about people using UDP_STREAM to stress improperly air-gapped systems during link up/down testing). It can be disabled with a test-specific -R 1 option, so your netperf command would become: netperf -H 198.18.50.1 -t UDP_STREAM -l 120 -- -m 1472 -n -N -R 1

Using -R 1 does not seem to help remove __ip_select_ident()

Bummer. It was a wild guess anyway, since I was seeing a connect() call on the data socket.

Samples: 56K of event 'cycles', Event count (approx.): 78628132661
  Overhead  Command  Shared Object      Symbol
+    9.11%  netperf  [kernel.vmlinux]   [k] __ip_select_ident
+    6.98%  netperf  [kernel.vmlinux]   [k] _raw_spin_lock
+    6.21%  swapper  [mlx5_core]        [k] mlx5e_poll_tx_cq
+    5.03%  netperf  [kernel.vmlinux]   [k] copy_user_enhanced_fast_string
+    4.69%  netperf  [kernel.vmlinux]   [k] __ip_make_skb
+    4.63%  netperf  [kernel.vmlinux]   [k] skb_set_owner_w
+    4.15%  swapper  [kernel.vmlinux]   [k] __slab_free
+    3.80%  netperf  [mlx5_core]        [k] mlx5e_sq_xmit
+    2.00%  swapper  [kernel.vmlinux]   [k] sock_wfree
+    1.94%  netperf  netperf            [.] send_data
+    1.92%  netperf  netperf            [.] send_omni_inner

Well, the next step I suppose is to have you try a quick netperf UDP_STREAM under strace to see if your netperf binary does what mine did:

strace -v -o /tmp/netperf.strace netperf -H 198.18.50.1 -t UDP_STREAM -l 1 -- -m 1472 -n -N -R 1

And see if you see the connect() I saw. (Note, I make the runtime 1 second)

rick
Re: Netperf UDP issue with connected sockets
On 11/16/2016 04:16 AM, Jesper Dangaard Brouer wrote: [1] Subj: High perf top ip_idents_reserve doing netperf UDP_STREAM - https://www.spinics.net/lists/netdev/msg294752.html Not fixed in version 2.7.0. - ftp://ftp.netperf.org/netperf/netperf-2.7.0.tar.gz Used extra netperf configure compile options: ./configure --enable-histogram --enable-demo

It seems like some fix attempts exist in the SVN repository:

svn checkout http://www.netperf.org/svn/netperf2/trunk/ netperf2-svn
svn log -r709 # A quick stab at getting remote connect going for UDP_STREAM
svn diff -r708:709

Testing with the SVN version still shows __ip_select_ident() in top#1.

Indeed, there was a fix for getting the remote side connect()ed. Looking at what I have for the top of trunk I do though see a connect() call being made at the local end:

socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0
setsockopt(4, SOL_IP, IP_RECVERR, [1], 4) = 0
brk(0xe53000) = 0xe53000
getsockname(4, {sa_family=AF_INET, sin_port=htons(59758), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
sendto(3, "\0\0\0a\377\377\377\377\377\377\377\377\377\377\377\377\0\0\0\10\0\0\0\0\0\0\0\321\377\377\377\377"..., 656, 0, NULL, 0) = 656
select(1024, [3], NULL, NULL, {120, 0}) = 1 (in [3], left {119, 995630})
recvfrom(3, "\0\0\0b\0\0\0\0\0\3@\0\0\3@\0\0\0\0\2\0\3@\0\377\377\377\377\0\0\0\321"..., 656, 0, NULL, NULL) = 656
write(1, "need to connect is 1\n", 21) = 21
rt_sigaction(SIGALRM, {0x402ea6, [ALRM], SA_RESTORER|SA_INTERRUPT, 0x7f2824eb2cb0}, NULL, 8) = 0
rt_sigaction(SIGINT, {0x402ea6, [INT], SA_RESTORER|SA_INTERRUPT, 0x7f2824eb2cb0}, NULL, 8) = 0
alarm(1) = 0
connect(4, {sa_family=AF_INET, sin_port=htons(34832), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024

The only difference there with top of trunk is that "need to connect" write/printf I just put in the code to be a nice marker in the system call trace.

It is a wild guess, but does setting SO_DONTROUTE affect whether or not a connect() would have the desired effect? That is there to protect people from themselves (long story about people using UDP_STREAM to stress improperly air-gapped systems during link up/down testing). It can be disabled with a test-specific -R 1 option, so your netperf command would become:

netperf -H 198.18.50.1 -t UDP_STREAM -l 120 -- -m 1472 -n -N -R 1

(p.s. is netperf ever going to be converted from SVN to git?)

Well my git-fu could use some work (gentle, offline taps with a clueful tutorial bat would be welcome), and at least in the past, going to git was held back because there were a bunch of netperf users on Windows and there wasn't (at the time) support for git under Windows. But I am not against the idea in principle.

happy benchmarking,

rick jones

PS - rick.jo...@hp.com no longer works. rick.jon...@hpe.com should be used instead.
Re: [patch] netlink.7: srcfix Change buffer size in example code about reading netlink message.
Let's change the example so others don't propagate the problem further.

Signed-off-by: David Wilder <dwil...@us.ibm.com>
---
--- man7/netlink.7.orig	2016-11-14 13:30:36.522101156 -0800
+++ man7/netlink.7	2016-11-14 13:30:51.002086354 -0800
@@ -511,7 +511,7 @@
 .in +4n
 .nf
 int len;
-char buf[4096];
+char buf[8192];

Since there doesn't seem to be a define one could use in the user space linux/netlink.h (?), but there are comments in the example code in the manpage, how about also including a brief comment to the effect that using 8192 bytes will avoid message truncation problems on platforms with a large PAGE_SIZE?

/* avoid msg truncation on > 4096 byte PAGE_SIZE platforms */

or something like that.

rick jones
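The truncation hazard behind the patch isn't netlink-specific: any datagram read into a too-small buffer is silently clipped. A stand-in demonstration with an AF_UNIX datagram pair (not netlink itself; the 6000-byte size is arbitrary, chosen to exceed the classic 4096-byte buffer):

```python
import socket

a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)

# Two datagrams larger than the example's original 4096-byte buffer.
a.send(b"x" * 6000)
a.send(b"x" * 6000)

truncated = b.recv(4096)  # excess bytes are silently discarded
whole = b.recv(8192)      # the larger buffer receives the full datagram
```

This is what happens to a netlink dump on a large-PAGE_SIZE platform when the example's 4096-byte buffer is used: the tail of the message is lost with no error.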
Re: [PATCH RFC 0/2] ethtool: Add actual port speed reporting
And besides, one can argue that in the SR-IOV scenario the VF has no business knowing the physical port speed. Good point, but there are more use-cases we should consider. For example, when using Multi-Host/Flex-10/Multi-PF each PF should be able to query both physical port speed and actual speed. Despite my email address, I'm not fully versed on VC/Flex, but I have always been under the impression that the flexnics created were, conceptually, "distinct" NICs considered independently of the physical port over which they operated. Tossing another worm or three into the can, while "back in the day" (when some of the first ethtool changes to report speeds other than the "normal" ones went in) the speed of a flexnic was fixed, today, it can actually operate in a range. From a minimum guarantee to an "if there is bandwidth available" cap. rick jones
Re: [bnx2] [Regression 4.8] Driver loading fails without firmware
On 10/25/2016 08:31 AM, Paul Menzel wrote: To my knowledge, the firmware files haven’t changed since years [1]. Indeed - it looks like I read "bnx2" and thought "bnx2x" Must remember to hold-off on replying until after the morning orange juice is consumed :) rick
Re: [bnx2] [Regression 4.8] Driver loading fails without firmware
On 10/25/2016 07:33 AM, Paul Menzel wrote: Dear Linux folks, A server with the Broadcom devices below, fails to load the drivers because of missing firmware. I have run into the same sort of issue from time to time when going to a newer kernel. A newer version of the driver wants a newer version of the firmware. Usually, finding a package "out there" with the newer version of the firmware, and installing it onto the system is sufficient. happy benchmarking, rick jones
Re: Accelerated receive flow steering (aRFS) for UDP
On 10/10/2016 09:08 AM, Rick Jones wrote: On 10/09/2016 03:33 PM, Eric Dumazet wrote: OK, I am adding/CC Rick Jones, netperf author, since it seems a netperf bug, not a kernel one. I believe I already mentioned fact that "UDP_STREAM -- -N" was not doing a connect() on the receiver side. I can confirm that the receive side of the netperf omni path isn't trying to connect UDP datagrams. I will see what I can put together. I've put something together and pushed it to the netperf top of trunk. It seems to have been successful on a quick loopback UDP_STREAM test. happy benchmarking, rick jones
Re: Accelerated receive flow steering (aRFS) for UDP
On 10/09/2016 03:33 PM, Eric Dumazet wrote: OK, I am adding/CC Rick Jones, netperf author, since it seems a netperf bug, not a kernel one. I believe I already mentioned fact that "UDP_STREAM -- -N" was not doing a connect() on the receiver side. I can confirm that the receive side of the netperf omni path isn't trying to connect UDP datagrams. I will see what I can put together. happy benchmarking, rick jones rick.jon...@hpe.com
Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket
On 09/29/2016 06:18 AM, Eric Dumazet wrote: Well, then what this patch series is solving ? You have a producer of packets running on 8 vcpus in a VM. Packets are exiting the VM and need to be queued on a mq NIC in the hypervisor. Flow X can be scheduled on any of these 8 vcpus, so XPS is currently selecting different TXQ. Just for completeness, in my testing, the VMs were single-vCPU. rick jones
Re: [PATCH RFC 0/4] xfs: Transmit flow steering
Here is a quick look at performance tests for the result of trying the prototype fix for the packet reordering problem with VMs sending over an XPS-configured NIC - in particular, the Emulex/Avago/Broadcom Skyhawk. The fix was applied to a 4.4 kernel.

Before: 3884 Mbit/s
After:  8897 Mbit/s

That was from a VM on a node with a Skyhawk and two E5-2640 processors to a bare-metal E5-2640 with a BE3. Physical MTU was 1500; the VM's vNIC's MTU was 1400. Systems were HPE ProLiants in OS Control Mode for power management, with the "performance" frequency governor loaded, in an OpenStack Mitaka setup with Distributed Virtual Router.

We had some other NIC types in the setup as well. XPS was also enabled on the ConnectX-3 Pro. It was not enabled on the 82599ES (a function of the kernel being used, which had it disabled from the first reports of XPS negatively affecting VM traffic at the beginning of the year).

Average Mbit/s From NIC type To Bare Metal BE3:

NIC Type, CPU on VM Host      Before    After
ConnectX-3 Pro, E5-2670v3       9224     9271
BE3, E5-2640                    9016     9022
82599, E5-2640                  9192     9003
BCM57840, E5-2640               9213     9153
Skyhawk, E5-2640                3884     8897

For completeness:

Average Mbit/s To NIC type from Bare Metal BE3:

NIC Type, CPU on VM Host      Before    After
ConnectX-3 Pro, E5-2670v3       9322     9144
BE3, E5-2640                    9074     9017
82599, E5-2640                  8670     8564
BCM57840, E5-2640             2468 *     7979
Skyhawk, E5-2640                8897     9269

* This is the busted bnx2x NIC FW GRO implementation issue. It was not visible in the "After" because the system was set up to disable the NIC FW GRO by the time it booted on the fix kernel.
Average Transactions/s Between NIC type and Bare Metal BE3:

NIC Type, CPU on VM Host      Before    After
ConnectX-3 Pro, E5-2670v3      12421    12612
BE3, E5-2640                    8178     8484
82599, E5-2640                  8499     8549
BCM57840, E5-2640               8544     8560
Skyhawk, E5-2640                8537     8701

happy benchmarking,

Drew Balliet
Jeurg Haefliger
rick jones

The semi-cooked results with additional statistics:

554M  - BE3
544+M - ConnectX-3 Pro
560M  - 82599ES
630M  - BCM57840
650M  - Skyhawk

(substitute is simply replacing a system name with the model of NIC and CPU)

Bulk To (South) and From (North) VM, Before:

$ ../substitute.sh vxlan_554m_control_performance_gvnr_dvr_northsouth_stream.log | ~/netperf2_trunk/doc/examples/parse_single_stream.py -r -5 -f 1 -f 3 -f 4 -f 7 -f 8
Field1,Field3,Field4,Field7,Field8,Min,P10,Median,Average,P90,P99,Max,Count
North,560M,E5-2640,554FLB,E5-2640,8148.090,9048.830,9235.400,9192.868,9315.980,9338.845,9339.500,113
North,630M,E5-2640,554FLB,E5-2640,8909.980,9113.238,9234.750,9213.140,9299.442,9336.206,9337.830,47
North,544+M,E5-2670v3,554FLB,E5-2640,9013.740,9182.546,9229.620,9224.025,9264.036,9299.206,9301.970,99
North,650M,E5-2640,554FLB,E5-2640,3187.680,3393.724,3796.160,3884.765,4405.096,4941.391,4956.300,129
North,554M,E5-2640,554FLB,E5-2640,8700.930,8855.768,9026.030,9016.061,9158.846,9213.687,9226.150,135
South,554FLB,E5-2640,560M,E5-2640,7754.350,8193.114,8718.540,8670.612,9026.436,9262.355,9285.010,113
South,554FLB,E5-2640,630M,E5-2640,1897.660,2068.290,2514.430,2468.323,2787.162,2942.934,2957.250,53
South,554FLB,E5-2640,544+M,E5-2670v3,9298.260,9314.432,9323.220,9322.207,9328.324,9330.704,9331.080,100
South,554FLB,E5-2640,650M,E5-2640,8407.050,8907.136,9304.390,9206.776,9321.320,9325.347,9326.410,103
South,554FLB,E5-2640,554M,E5-2640,7844.900,8632.530,9199.385,9074.535,9308.070,9319.224,9322.360,132
0 too-short lines ignored.
Bulk To (South) and From (North) VM, After:

$ ../substitute.sh vxlan_554m_control_performance_gvnr_xpsfix_dvr_northsouth_stream.log | ~/netperf2_trunk/doc/examples/parse_single_stream.py -r -5 -f 1 -f 3 -f 4 -f 7 -f 8
Field1,Field3,Field4,Field7,Field8,Min,P10,Median,Average,P90,P99,Max,Count
North,560M,E5-2640,554FLB,E5-2640,7576.790,8213.890,9182.870,9003.190,9295.975,9315.878,9318.160,36
North,630M,E5-2640,554FLB,E5-2640,8811.800,8924.000,9206.660,9153.076,9306.287,9315.152,9315.790,12
North,544+M,E5-2670v3,554FLB,E5-2640,9135.990,9228.520,9277.465,9271.875,9324.545,9339.604,9339.780,46
North,650M,E5-2640,554FLB,E5-2640,8133.420,8483.340,8995.040,8897.779,9129.056,9165.230,9165.860,43
North,554M,E5-2640,554FLB,E5-2640,8438.390,8879.150,9048.590,9022.813,9181.540,9248.650,9297.660,101
South,554FLB,E5-2640,630M,E5-2640,7347.120,7592.565,7951.325,7979.951,8365.400,8575.837,8579.890,16
South,554FLB,E5-2640,560M,E5-2640,7719.510,8044.496,8602.750,8564.741,9172.824,9248.686,9259.070,45
South,554
Re: [PATCH v3 net-next 16/16] tcp_bbr: add BBR congestion control
On 09/19/2016 02:10 PM, Eric Dumazet wrote: On Mon, Sep 19, 2016 at 1:57 PM, Stephen Hemminger <step...@networkplumber.org> wrote: Looks good, but could I suggest a simple optimization. All these parameters are immutable in the version of BBR you are submitting. Why not make the values const? And eliminate the always true long-term bw estimate variable? We could do that. We used to have variables (aka module params) while BBR was cooking in our kernels ;) Are there better than epsilon odds of someone perhaps wanting to poke those values as it gets exposure beyond Google? happy benchmarking, rick jones
Re: [PATCH next 3/3] ipvlan: Introduce l3s mode
On 09/09/2016 02:53 PM, Mahesh Bandewar wrote: @@ -48,6 +48,11 @@ master device for the L2 processing and routing from that instance will be used before packets are queued on the outbound device. In this mode the slaves will not receive nor can send multicast / broadcast traffic. +4.3 L3S mode: + This is very similar to the L3 mode except that iptables conn-tracking +works in this mode and that is why L3-symsetric (L3s) from iptables perspective. +This will have slightly less performance but that shouldn't matter since you +are choosing this mode over plain-L3 mode to make conn-tracking work. What is that first sentence trying to say? It appears to be incomplete, and is that supposed to be "L3-symmetric?" happy benchmarking, rick jones
Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
On 09/08/2016 11:16 AM, Tom Herbert wrote: On Thu, Sep 8, 2016 at 10:19 AM, Jesper Dangaard Brouer <bro...@redhat.com> wrote: On Thu, 8 Sep 2016 09:26:03 -0700 Tom Herbert <t...@herbertland.com> wrote: Shouldn't qdisc bulk size be based on the BQL limit?

What is the simple algorithm to apply to in-flight packets? Maybe the algorithm is not so simple, and we likely also have to take BQL bytes into account. The reason for wanting packets-in-flight is because we are attacking a transaction cost. The tailptr/doorbell costs around 70ns. (Based on data in this patch description: 4.9 Mpps -> 7.5 Mpps, (1/4.9 - 1/7.5) * 1000 = 70.74 ns.) The 10G wirespeed small-packet budget is 67.2ns, so with a fixed overhead of 70ns per packet we can never reach 10G wirespeed.

But you should be able to do this with BQL, and it is more accurate. BQL tells how many bytes need to be sent, and that can be used to create a bulk of packets to send with one doorbell.

With small packets and the "default" ring size for this NIC/driver combination, is the BQL large enough that the ring fills before one hits the BQL?

rick jones
Re: [PATCH] softirq: let ksoftirqd do its job
On 08/31/2016 04:11 PM, Eric Dumazet wrote: On Wed, 2016-08-31 at 15:47 -0700, Rick Jones wrote: With regard to drops, are both of you sure you're using the same socket buffer sizes? Does it really matter ? At least at points in the past I have seen different drop counts at the SO_RCVBUF based on using (sometimes much) larger sizes. The hypothesis I was operating under at the time was that this dealt with those situations where the netserver was held-off from running for "a little while" from time to time. It didn't change things for a sustained overload situation though. In the meantime, is anything interesting happening with TCP_RR or TCP_STREAM? TCP_RR is driven by the network latency, we do not drop packets in the socket itself. I've been of the opinion it (single stream) is driven by path length. Sometimes by NIC latency. But then I'm almost always measuring in the LAN rather than across the WAN. happy benchmarking, rick
Re: [PATCH] softirq: let ksoftirqd do its job
With regard to drops, are both of you sure you're using the same socket buffer sizes? In the meantime, is anything interesting happening with TCP_RR or TCP_STREAM? happy benchmarking, rick jones
Re: [PATCH v2 net-next] documentation: Document issues with VMs and XPS and drivers enabling it on their own
On 08/27/2016 12:41 PM, Tom Herbert wrote: On Fri, Aug 26, 2016 at 9:35 PM, David Miller <da...@davemloft.net> wrote: From: Tom Herbert <t...@herbertland.com> Date: Thu, 25 Aug 2016 16:43:35 -0700 This seems like it will only confuse users even more. You've clearly identified an issue, let's figure out how to fix it. I kinda feel the same way about this situation. I'm working on XFS (as the transmit analogue to RFS). We'll track flows enough so that we should know when it's safe to move them. Is the XFS you are working on going to subsume XPS or will the two continue to exist in parallel a la RPS and RFS? rick jones
[PATCH v2 net-next] documentation: Document issues with VMs and XPS and drivers enabling it on their own
From: Rick Jones <rick.jon...@hpe.com>

Since XPS was first introduced two things have happened. Some drivers have started enabling XPS on their own initiative, and it has been found that when a VM is sending data through a host interface with XPS enabled, that traffic can end-up seriously out of order.

Signed-off-by: Rick Jones <rick.jon...@hpe.com>
Reviewed-by: Alexander Duyck <alexander.h.du...@intel.com>
---
diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
index 59f4db2..50cc888 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -400,15 +400,31 @@ transport layer is responsible for setting ooo_okay appropriately. TCP,
 for instance, sets the flag when all data for a connection has been
 acknowledged.
 
+When the traffic source is a VM running on the host, there is no
+socket structure known to the host. In this case, unless the VM is
+itself CPU-pinned, the traffic being sent from it can end-up queued to
+multiple transmit queues and end-up being transmitted out of order.
+
+In some cases this can result in a considerable loss of performance.
+
+In such situations, XPS should not be enabled at runtime, or
+explicitly disabled if the NIC driver(s) in question enable it on
+their own. Otherwise, if possible, the VMs should be CPU pinned.
+
 XPS Configuration
 
-XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
-default for SMP). The functionality remains disabled until explicitly
-configured. To enable XPS, the bitmap of CPUs that may use a transmit
-queue is configured using the sysfs file entry:
+XPS is available only if the kconfig symbol CONFIG_XPS is enabled
+prior to building the kernel. It is enabled by default for SMP kernel
+configurations. In many cases the functionality remains disabled at
+runtime until explicitly configured by the system administrator. To
+enable XPS, the bitmap of CPUs that may use a transmit queue is
+configured using the sysfs file entry:
 
 /sys/class/net/<dev>/queues/tx-<n>/xps_cpus
 
+However, some NIC drivers will configure XPS at runtime for the
+interfaces they drive, via a call to netif_set_xps_queue.
+
 ==
 Suggested Configuration
 
 For a network device with a single transmission queue, XPS configuration
[PATCH net-next] documentation: Document issues with VMs and XPS and drivers enabling it on their own
From: Rick Jones <rick.jon...@hpe.com>

Since XPS was first introduced two things have happened. Some drivers have started enabling XPS on their own initiative, and it has been found that when a VM is sending data through a host interface with XPS enabled, that traffic can end-up seriously out of order.

Signed-off-by: Rick Jones <rick.jon...@hpe.com>
---
diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
index 59f4db2..50cc888 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -400,15 +400,31 @@ transport layer is responsible for setting ooo_okay appropriately. TCP,
 for instance, sets the flag when all data for a connection has been
 acknowledged.
 
+When the traffic source is a VM running on the host, there is no
+socket structure known to the host. In this case, unless the VM is
+itself CPU-pinned, the traffic being sent from it can end-up queued to
+multiple transmit queues and end-up being transmitted out of order.
+
+In some cases this can result in a considerable loss of performance.
+
+In such situations, XPS should not be enabled at runtime, or
+explicitly disabled if the NIC driver(s) in question enable it on
+their own. Otherwise, if possible, the VMs should be CPU pinned.
+
 XPS Configuration
 
-XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
-default for SMP). The functionality remains disabled until explicitly
-configured. To enable XPS, the bitmap of CPUs that may use a transmit
-queue is configured using the sysfs file entry:
+XPS is available only if the kconfig symbol CONFIG_XPS is enabled
+prior to building the kernel. It is enabled by default for SMP kernel
+configurations. In many cases the functionality remains disabled at
+runtime until explicitly configured by the system administrator. To
+enable XPS, the bitmap of CPUs that may use a transmit queue is
+configured using the sysfs file entry:
 
 /sys/class/net/<dev>/queues/tx-<n>/xps_cpus
 
+However, some NIC drivers will configure XPS at runtime for the
+interfaces they drive, via a call to netif_set_xps_queue.
+
 ==
 Suggested Configuration
 
 For a network device with a single transmission queue, XPS configuration
Re: [RFC PATCH] net: Require socket to allow XPS to set queue mapping
On 08/25/2016 02:08 PM, Eric Dumazet wrote: When XPS was submitted, it was _not_ enabled by default and 'magic' Some NIC vendors decided it was a good thing, you should complain to them ;) I kindasorta am with the emails I've been sending to netdev :) And also hopefully precluding others going down that path. happy benchmarking, rick
Re: [RFC PATCH] net: Require socket to allow XPS to set queue mapping
On 08/25/2016 12:49 PM, Eric Dumazet wrote: On Thu, 2016-08-25 at 12:23 -0700, Alexander Duyck wrote: A simpler approach is provided with this patch. With it we disable XPS any time a socket is not present for a given flow. By doing this we can avoid using XPS for any routing or bridging situations in which XPS is likely more of a hindrance than a help.

Yes, but this will destroy isolation for people properly doing VM cpu pinning.

Why not simply stop enabling XPS by default? Treat it like RPS and RFS (unless I've missed a patch...). The people who are already doing the extra steps to pin VMs can enable XPS in that case. It isn't clear that one should always pin VMs - for example if a (public) cloud needed to oversubscribe the cores.

happy benchmarking,

rick jones
Re: A second case of XPS considerably reducing single-stream performance
On 08/25/2016 12:19 PM, Alexander Duyck wrote: The problem is that there is no socket associated with the guest from the host's perspective. This is resulting in the traffic bouncing between queues because there is no saved socket to lock the interface onto. I was looking into this recently as well and had considered a couple of options. The first is to fall back to just using skb_tx_hash() when skb->sk is null for a given buffer. I have a patch I have been toying around with but I haven't submitted it yet. If you would like I can submit it as an RFC to get your thoughts. The second option is to enforce the use of RPS for any interfaces that do not perform Rx in NAPI context. The correct solution for this is probably some combination of the two as you have to have all queueing done in order at every stage of the packet processing.

I don't know which interfaces would be hit, but just in general, I'm not sure that requiring RPS be enabled is a good solution - picking where traffic is processed based on its addressing is fine in a benchmarking situation, but I think it is better to have the process/thread scheduler decide where something should run and not the addressing of the connections that thread/process is servicing.

I would be interested in seeing the RFC patch you propose. Apart from that, given the prevalence of VMs these days, I wonder if perhaps simply not enabling XPS by default isn't a viable alternative. I've not played with containers to know if they would exhibit this too.

Drifting ever so slightly, if drivers are going to continue to enable XPS by default, Documentation/networking/scaling.txt might use a tweak:

diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
index 59f4db2..8b5537c 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -402,10 +402,12 @@ acknowledged.
 
 XPS Configuration
 
-XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
-default for SMP). The functionality remains disabled until explicitly
-configured. To enable XPS, the bitmap of CPUs that may use a transmit
-queue is configured using the sysfs file entry:
+XPS is available only when the kconfig symbol CONFIG_XPS is enabled
+(on by default for SMP). The drivers for some NICs will enable the
+functionality by default. For others the functionality remains
+disabled until explicitly configured. To enable XPS, the bitmap of
+CPUs that may use a transmit queue is configured using the sysfs file
+entry:
 
 /sys/class/net/<dev>/queues/tx-<n>/xps_cpus

The original wording leaves the impression that XPS is not enabled by default.

rick
Re: A second case of XPS considerably reducing single-stream performance
Also, while it doesn't seem to have the same massive effect on throughput, I can also see out of order behaviour happening when the sending VM is on a node with a ConnectX-3 Pro NIC. Its driver is also enabling XPS it would seem. I'm not *certain* but looking at the traces it appears that with the ConnectX-3 Pro there is more interleaving of the out-of-order traffic than there is with the Skyhawk. The ConnectX-3 Pro happens to be in a newer generation server with a newer processor than the other systems where I've seen this. I do not see the out-of-order behaviour when the NIC at the sending end is a BCM57840. It does not appear that the bnx2x driver in the 4.4 kernel is enabling XPS. So, it would seem that there are three cases of enabling XPS resulting in out-of-order traffic, two of which result in a non-trivial loss of performance. happy benchmarking, rick jones
Re: [PATCH net-next] net: minor optimization in qdisc_qstats_cpu_drop()
On 08/24/2016 10:23 AM, Eric Dumazet wrote: From: Eric Dumazet <eduma...@google.com>

per_cpu_inc() is faster (at least on x86) than per_cpu_ptr(xxx)++;

Is it possible it is non-trivially slower on other architectures?

rick jones

Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 include/net/sch_generic.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 0d501779cc68f9426e58da6d039dd64adc937c20..52a2015667b49c8315edbb26513a98d4c677fee5 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -592,7 +592,7 @@ static inline void qdisc_qstats_drop(struct Qdisc *sch)
 
 static inline void qdisc_qstats_cpu_drop(struct Qdisc *sch)
 {
-	qstats_drop_inc(this_cpu_ptr(sch->cpu_qstats));
+	this_cpu_inc(sch->cpu_qstats->drops);
 }
 
 static inline void qdisc_qstats_overlimit(struct Qdisc *sch)
A second case of XPS considerably reducing single-stream performance
Back in February of this year, I reported some performance issues with the ixgbe driver enabling XPS by default and instance network performance in OpenStack:

http://www.spinics.net/lists/netdev/msg362915.html

I've now seen the same thing with be2net and a Skyhawk. In this case, the magnitude of the delta is even greater: disabling XPS increased the netperf single-stream performance out of the instance from an average of 4108 Mbit/s by roughly 116%.

Should drivers really be enabling XPS by default?

Instance To Outside World, Single-stream netperf, ~30 Samples for Each Statistic, Mbit/s:

            Skyhawk             BE3 #1              BE3 #2
         XPS On  XPS Off    XPS On  XPS Off    XPS On  XPS Off
Median     4192     8883      8930     8853      8917     8695
Average    4108               8940     8859      8885     8671

happy benchmarking,

rick jones

The sample counts below may not fully support the additional statistics, but for the curious:

raj@tardy:/tmp$ ~/netperf2_trunk/doc/examples/parse_single_stream.py -r 6 waxon_performance.log -f 2
Field2,Min,P10,Median,Average,P90,P99,Max,Count
be3-1,8758.850,8811.600,8930.900,8940.555,9096.470,9175.839,9183.690,31
be3-2,8588.450,8736.967,8917.075,8885.322,9017.914,9075.735,9094.620,32
skyhawk,3326.760,3536.008,4192.780,4108.513,4651.164,4723.322,4724.320,27
0 too-short lines ignored.

raj@tardy:/tmp$ ~/netperf2_trunk/doc/examples/parse_single_stream.py -r 6 waxoff_performance.log -f 2
Field2,Min,P10,Median,Average,P90,P99,Max,Count
be3-1,8461.080,8634.690,8853.260,8859.870,9064.480,9247.770,9253.050,31
be3-2,7519.130,8368.564,8695.140,8671.241,9068.588,9200.719,9241.500,27
skyhawk,8071.180,8651.587,8883.340,.411,9135.603,9141.229,9142.010,32
0 too-short lines ignored.

"waxon" is with XPS enabled, "waxoff" is with XPS disabled. The servers are the same models/config as in February.

stack@np-cp1-comp0013-mgmt:~$ sudo ethtool -i hed3
driver: be2net
version: 10.6.0.3
firmware-version: 10.7.110.45
Re: [PATCH net 1/2] tg3: Fix for diasllow rx coalescing time to be 0
On 08/02/2016 09:13 PM, skallam wrote: From: Satish Baddipadige <satish.baddipad...@broadcom.com>

When the rx coalescing time is 0, interrupts are not generated from the controller and the rx path hangs. To avoid this rx hang, update the driver to not allow the rx coalescing time to be 0.

Signed-off-by: Satish Baddipadige <satish.baddipad...@broadcom.com>
Signed-off-by: Siva Reddy Kallam <siva.kal...@broadcom.com>
Signed-off-by: Michael Chan <michael.c...@broadcom.com>
---
 drivers/net/ethernet/broadcom/tg3.c | 1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index ff300f7..f3c6c91 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -14014,6 +14014,7 @@ static int tg3_set_coalesce(struct net_device *dev, struct ethtool_coalesce *ec)
 	}
 
 	if ((ec->rx_coalesce_usecs > MAX_RXCOL_TICKS) ||
+	    (!ec->rx_coalesce_usecs) ||
 	    (ec->tx_coalesce_usecs > MAX_TXCOL_TICKS) ||
 	    (ec->rx_max_coalesced_frames > MAX_RXMAX_FRAMES) ||
 	    (ec->tx_max_coalesced_frames > MAX_TXMAX_FRAMES) ||

Should anything then happen with:

	/* No rx interrupts will be generated if both are zero */
	if ((ec->rx_coalesce_usecs == 0) &&
	    (ec->rx_max_coalesced_frames == 0))
		return -EINVAL;

which is the next block of code? The logic there seems to suggest that it was intended to be able to have an rx_coalesce_usecs of 0 and rely on packet arrival to trigger an interrupt, presumably setting rx_max_coalesced_frames to 1 to disable interrupt coalescing.

happy benchmarking,

rick jones
Re: [iproute PATCH 0/2] Netns performance improvements
On 07/08/2016 01:01 AM, Nicolas Dichtel wrote: Those 300 routers will each have at least one namespace along with the dhcp namespaces. Depending on the nature of the routers (Distributed versus Centralized Virtual Routers - DVR vs CVR) and whether the routers are supposed to be "HA" there can be more than one namespace for a given router. 300 routers is far from the upper limit/goal. Back in HP Public Cloud, we were running as many as 700 routers per network node (*), and more than four network nodes. (back then it was just the one namespace per router and network). Mileage will of course vary based on the "oomph" of one's network node(s). Thank you for the details. Do you have a script or something else to easily reproduce this problem? Do you mean for my much older, slightly different stuff done in HP Public Cloud, or for what Phil (?) is doing presently? I believe Phil posted something several messages back in the thread. happy benchmarking, rick jones
Re: [iproute PATCH 0/2] Netns performance improvements
On 07/07/2016 09:34 AM, Eric W. Biederman wrote: Rick Jones <rick.jon...@hpe.com> writes: 300 routers is far from the upper limit/goal. Back in HP Public Cloud, we were running as many as 700 routers per network node (*), and more than four network nodes. (back then it was just the one namespace per router and network). Mileage will of course vary based on the "oomph" of one's network node(s).

To clarify, processes for these routers and dhcp servers are created with "ip netns exec"?

I believe so, but it would be good to have someone else confirm that, and speak to your paragraph below.

If that is the case and you are using this feature as effectively a lightweight container and not lots of vrfs in a single network stack, then I suspect much larger gains can be had by creating a variant of ip netns exec that avoids the mount propagation.

...

* Didn't want to go much higher than that because each router had a port on a common linux bridge and getting to > 1024 would be an unpleasant day.

* I would have thought all you have to do is bump up the size of the linux neighbour cache: echo $BIGNUM > /proc/sys/net/ipv4/neigh/default/gc_thresh3

We didn't want to hit the 1024 port limit of a (then?) Linux bridge.

rick

Having a bit of deja vu, but I suspect things like commit 0818bf27c05b2de56c5b2bd08cfae2a939bd5f52 are not exactly on the same wavelength - just my brain seeing "namespaces" and "performance" and lighting-up :)
Re: [iproute PATCH 0/2] Netns performance improvements
On 07/07/2016 08:48 AM, Phil Sutter wrote: On Thu, Jul 07, 2016 at 02:59:48PM +0200, Nicolas Dichtel wrote: Le 07/07/2016 13:17, Phil Sutter a écrit : [snip] The issue came up during OpenStack Neutron testing, see this ticket for reference: https://bugzilla.redhat.com/show_bug.cgi?id=1310795

Access to this ticket is not public :(

*Sigh* OK, here are a few quotes:

"OpenStack Neutron controller nodes, when undergoing testing, are locking up specifically during creation and mounting of namespaces. They appear to be blocking behind vfsmount_lock, and contention for the namespace_sem"

"During the scale testing, we have 300 routers, 600 dhcp namespaces spread across four neutron network nodes. We then start one set of standard OpenStack Rally benchmark test cycles against neutron. An example scenario is creating 10x networks, listing them, deleting them, and repeating 10x times. The second set performs an L3 benchmark test between two instances."

Those 300 routers will each have at least one namespace along with the dhcp namespaces. Depending on the nature of the routers (Distributed versus Centralized Virtual Routers - DVR vs CVR) and whether the routers are supposed to be "HA" there can be more than one namespace for a given router.

300 routers is far from the upper limit/goal. Back in HP Public Cloud, we were running as many as 700 routers per network node (*), and more than four network nodes. (back then it was just the one namespace per router and network). Mileage will of course vary based on the "oomph" of one's network node(s).

happy benchmarking,

rick jones

* Didn't want to go much higher than that because each router had a port on a common linux bridge and getting to > 1024 would be an unpleasant day.
Re: strange Mac OSX RST behavior
On 07/01/2016 08:10 AM, Jason Baron wrote: I'm wondering if anybody else has run into this... On Mac OSX 10.11.5 (latest version), we have found that when tcp connections are abruptly terminated (via ^C), a FIN is sent followed by an RST packet.

That just seems, well, silly. If the client application wants to use abortive close (sigh..) it should do so; there shouldn't be this little-bit-pregnant, correct close initiation (FIN) followed by a RST.

The RST is sent with the same sequence number as the FIN, and thus dropped since the stack only accepts RST packets matching rcv_nxt (RFC 5961). This could also be resolved if Mac OSX replied with an RST on the closed socket, but it appears that it does not. The workaround here is then to reset the connection, if the RST is equal to rcv_nxt - 1, if we have already received a FIN. The RST attack surface is limited b/c we only accept the RST after we've accepted a FIN and have not previously sent a FIN and received back the corresponding ACK. In other words RST is only accepted in the tcp states: TCP_CLOSE_WAIT, TCP_LAST_ACK, and TCP_CLOSING. I'm interested if anybody else has run into this issue. It's problematic since it takes up server resources for sockets sitting in TCP_CLOSE_WAIT.

Isn't the server application expected to act on the read return of zero (which is supposed to be) triggered by the receipt of the FIN segment?

rick jones

We are also in the process of contacting Apple to see what can be done here... workaround patch is below.
Re: [PATCH v12 net-next 1/1] hv_sock: introduce Hyper-V Sockets
On 06/28/2016 02:59 AM, Dexuan Cui wrote: The idea here is: IMO the syscalls sys_read()/write() shouldn't return -ENOMEM, so I have to make sure the buffer allocation succeeds? I tried to use kmalloc with __GFP_NOFAIL, but I hit a warning in mm/page_alloc.c: WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); What error code do you think I should return? EAGAIN, ERESTARTSYS, or something else? May I have your suggestion? Thanks!

What happens as far as errno is concerned when an application makes a read() call against a (say TCP) socket associated with a connection which has been reset? Is it limited to those errno values listed in the read() manpage, or does it end-up getting an errno value from those listed in the recv() manpage? Or, perhaps even one not (presently) listed in either?

rick jones
Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
On 06/24/2016 04:43 PM, Tom Herbert wrote: Here's Christoph's slides on TFO in the wild which presents a good summary of the middlebox problem. There is one significant difference in that ECN needs network support whereas TFO didn't. Given that experience, I'm doubtful other new features at L4 could ever be productively used (like EDO or maybe TCP-ENO). https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf Perhaps I am being overly optimistic, but my takeaway from those slides is that Apple was able to come up with ways to deal with the middleboxes and so could indeed productively use TCP FastOpen. "Overall, very good success-rate" though tempered by "But... middleboxes were a big issue in some ISPs..." Though it doesn't get into how big (some connections, many, most, all?) and how many ISPs. rick jones Just an anecdote... Not that I am a "power user" of my iPhone running 9.3.2 (13F69) nor that I know that anything I am using is the Apple Service stated as using TFO (mostly Safari, Mail and Messages) but if it is, I cannot say that any troubles under the covers have been noticed by me.
Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
On 06/24/2016 02:46 PM, Tom Herbert wrote: On Fri, Jun 24, 2016 at 2:36 PM, Rick Jones <rick.jon...@hpe.com> wrote: How would you define "severely?" Has it actually been more severe than for say ECN? Or it was for say SACK or PAWS? ECN is probably even a bigger disappointment in terms of seeing deployment :-( From http://ecn.ethz.ch/ecn-pam15.pdf: "Even though ECN was standardized in 2001, and it is widely implemented in end-systems, it is barely deployed. This is due to a history of problems with severely broken middleboxes shortly after standardization, which led to connectivity failure and guidance to leave ECN disabled." SACK and PAWS seemed to have fared a little better, I believe. The conclusion of that (rather interesting) paper reads: "Our analysis therefore indicates that enabling ECN by default would lead to connections to about five websites per thousand to suffer additional setup latency with RFC 3168 fallback. This represents an order of magnitude fewer than the about forty per thousand which experience transient or permanent connection failure due to other operational issues" Doesn't that then suggest that not enabling ECN is basically a matter of FUD more than remaining assumed broken middleboxes? My main point is that in the past at least, trouble with broken middleboxes didn't lead us to start wrapping all our TCP/transport traffic in UDP to try to hide it from them. We've managed to get SACK and PAWS universal without having to resort to that, and it would seem we could get ECN universal if we could overcome our FUD. Why would TFO for instance be any different? There was an equally interesting second paragraph in the conclusion: "As not all websites are equally popular, failures on five per thousand websites does not by any means imply that five per thousand connection attempts will fail. 
While estimation of connection attempt rate by rank is out of scope of this work, we note that the highest ranked website exhibiting stable connection failure has rank 596, and only 13 such sites appear in the top 5000" rick jones
Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
On 06/24/2016 02:12 PM, Tom Herbert wrote: The client OS side is only part of the story. Middlebox intrusion at L4 is also a major issue we need to address. The "failure" of TFO is a good case study. Both the upgrade issues on clients and the tendency for some middleboxes to drop SYN packets with data have together severely hindered what otherwise should have been a straightforward and useful feature to deploy. How would you define "severely?" Has it actually been more severe than for say ECN? Or it was for say SACK or PAWS? rick jones
Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
On 06/22/2016 04:10 PM, Rick Jones wrote: My systems are presently in the midst of an install but I should be able to demonstrate it in the morning (US Pacific time, modulo the shuttle service of a car repair place) The installs finished sooner than I thought. So, receiver: root@np-cp1-comp0001-mgmt:/home/stack# uname -a Linux np-cp1-comp0001-mgmt 4.4.11-2-amd64-hpelinux #hpelinux1 SMP Mon May 23 15:39:22 UTC 2016 x86_64 GNU/Linux root@np-cp1-comp0001-mgmt:/home/stack# ethtool -i hed2 driver: bnx2x version: 1.712.30-0 firmware-version: bc 7.10.10 bus-info: :05:00.0 supports-statistics: yes supports-test: yes supports-eeprom-access: yes supports-register-dump: yes supports-priv-flags: yes the hed2 interface is a port of an HPE 630M NIC, based on the BCM57840: 05:00.0 Ethernet controller: Broadcom Corporation BCM57840 NetXtreme II 10/20-Gigabit Ethernet (rev 11) Subsystem: Hewlett-Packard Company HP FlexFabric 20Gb 2-port 630M Adapter (The pci.ids entry being from before that 10 GbE IP was purchased from Broadcom by QLogic...) Verify that LRO is disabled (IIRC it is enabled by default): root@np-cp1-comp0001-mgmt:/home/stack# ethtool -k hed2 | grep large large-receive-offload: off Verify that disable_tpa is not set: root@np-cp1-comp0001-mgmt:/home/stack# cat /sys/module/bnx2x/parameters/disable_tpa 0 So this means we will see NIC-firmware GRO. 
Start a tcpdump on the receiver:
root@np-cp1-comp0001-mgmt:/home/stack# tcpdump -s 96 -c 200 -i hed2 -w foo.pcap port 12867
tcpdump: listening on hed2, link-type EN10MB (Ethernet), capture size 96 bytes
Start a netperf test targeting that system, specifying a smaller MSS:
stack@np-cp1-comp0002-mgmt:~$ ./netperf -H np-cp1-comp0001-guest -- -G 1400 -P 12867 -O throughput,transport_mss
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-comp0001-guest () port 12867 AF_INET : demo
Throughput  Transport MSS
            bytes
3372.82     1388
Come back to the receiver and post-process the tcpdump capture to get the average segment size for the data segments:
200 packets captured
2000916 packets received by filter
0 packets dropped by kernel
root@np-cp1-comp0001-mgmt:/home/stack# tcpdump -n -r foo.pcap | fgrep -v "length 0" | awk '{sum += $NF}END{print "Average:",sum/NR}'
reading from file foo.pcap, link-type EN10MB (Ethernet)
Average: 2741.93
and finally a snippet of the capture:
00:37:47.333414 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [S], seq 1236484791, win 28000, options [mss 1400,sackOK,TS val 1491134 ecr 0,nop,wscale 7], length 0
00:37:47.333488 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [S.], seq 134167501, ack 1236484792, win 28960, options [mss 1460,sackOK,TS val 1499053 ecr 1491134,nop,wscale 7], length 0
00:37:47.333731 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], length 0
00:37:47.333788 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 1:2777, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], length 2776
00:37:47.333815 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 2777, win 270, options [nop,nop,TS val 1499053 ecr 1491134], length 0
00:37:47.333822 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 2777:5553, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], length 2776
00:37:47.333837 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 5553, win 313, options [nop,nop,TS val 1499053 ecr 1491134], length 0
00:37:47.333842 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 5553:8329, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], length 2776
00:37:47.333856 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 8329:11105, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], length 2776
00:37:47.333869 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 8329, win 357, options [nop,nop,TS val 1499053 ecr 1491134], length 0
00:37:47.333879 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 11105:13881, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], length 2776
00:37:47.333891 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 11105, win 400, options [nop,nop,TS val 1499053 ecr 1491134], length 0
00:37:47.333911 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 13881, win 444, options [nop,nop,TS val 1499053 ecr 1491134], length 0
00:37:47.333964 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 13881:16657, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], length 2776
00:37:47.333982 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 16657:19433, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], length 2776
00:37:47.333989 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 19433:22209, ack 1, win 219, options [nop,nop,TS val 149
Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
On 06/22/2016 03:56 PM, Alexander Duyck wrote: On Wed, Jun 22, 2016 at 3:47 PM, Eric Dumazet <eric.duma...@gmail.com> wrote: On Wed, 2016-06-22 at 14:52 -0700, Rick Jones wrote: Had the bnx2x-driven NICs' firmware not had that rather unfortunate assumption about MSSes I probably would never have noticed. It could be that you and Rick are running different firmware. I believe you can expose that via "ethtool -i". This is the ugly bit about all this. We are offloading GRO into the firmware of these devices with no idea how any of it works and by linking GRO to LRO on the same device you are stuck having to accept either the firmware offload or nothing at all. That is kind of the point Rick was trying to get at. I think you are typing a bit too far ahead into my keyboard with that last sentence. And I may not have been sufficiently complete in what I wrote. If the bnx2x-driven NICs' firmware had been coalescing more than two segments together, not only would I probably not have noticed, I probably would not have been upset to learn it was NIC-firmware GRO rather than stack. My complaint is the specific bug of coalescing only two segments when their size is unexpected, and the difficulty present in disabling the bnx2x-driven NICs' firmware GRO. I don't have a problem necessarily with the existence of NIC-firmware GRO in general. I just want to be able to enable/disable it easily. rick jones Of course, what I really want are much, Much, MUCH larger MTUs. It isn't for nothing that I used to refer to TSO as "Poor man's Jumbo Frames" :)
Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
On 06/22/2016 03:47 PM, Eric Dumazet wrote: On Wed, 2016-06-22 at 14:52 -0700, Rick Jones wrote: On 06/22/2016 11:22 AM, Yuval Mintz wrote: But seriously, this isn't really anything new but rather a step forward in the direction we've already taken - bnx2x/qede are already performing the same for non-encapsulated TCP. Since you mention bnx2x... I would argue that the NIC firmware on those NICs driven by bnx2x is doing it badly. Not so much from a functional standpoint I suppose, but from a performance one. The NIC-firmware GRO done there has this rather unfortunate assumption about "all MSSes will be directly driven by my own physical MTU" and when it sees segments of a size other than would be suggested by the physical MTU, will coalesce only two segments together. They then do not get further coalesced in the stack. Suffice it to say this does not do well from a performance standpoint. One can disable LRO via ethtool for these NICs, but what that does is disable old-school LRO, not GRO-in-the-NIC. To get that disabled, one must also get the bnx2x module loaded with "disable-tpa=1" so the Linux stack GRO gets used instead. Had the bnx2x-driven NICs' firmware not had that rather unfortunate assumption about MSSes I probably would never have noticed. I do not see this behavior on my bnx2x nics ? ip ro add 10.246.11.52 via 10.246.11.254 dev eth0 mtu 1000 lpk51:~# ./netperf -H 10.246.11.52 -l 1000 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.11.52 () port 0 AF_INET I first saw this with VMs which themselves had 1400 byte MTUs on their vNICs, speaking though bnx2x-driven NICs with a 1500 byte MTU, but I did later reproduce it by tweaking the MTU of my sending side NIC to something like 1400 bytes and running a "bare iron" netperf. I believe you may be able to achieve the same thing by having netperf set a smaller MSS via the test-specific -G option. 
My systems are presently in the midst of an install but I should be able to demonstrate it in the morning (US Pacific time, modulo the shuttle service of a car repair place) On receiver : Paranoid question, but is LRO disabled on the receiver? I don't know that LRO exhibits the behaviour, just GRO-in-the-NIC. rick
15:46:08.296241 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.], ack 303360, win 8192, options [nop,nop,TS val 1245217243 ecr 1245306446], length 0
15:46:08.296430 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.], seq 303360:327060, ack 1, win 229, options [nop,nop,TS val 1245306446 ecr 1245217242], length 23700
15:46:08.296441 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.], ack 327060, win 8192, options [nop,nop,TS val 1245217243 ecr 1245306446], length 0
15:46:08.296644 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.], seq 327060:350760, ack 1, win 229, options [nop,nop,TS val 1245306446 ecr 1245217242], length 23700
15:46:08.296655 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.], ack 350760, win 8192, options [nop,nop,TS val 1245217244 ecr 1245306446], length 0
15:46:08.296854 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.], seq 350760:374460, ack 1, win 229, options [nop,nop,TS val 1245306446 ecr 1245217242], length 23700
15:46:08.296897 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.], ack 374460, win 8192, options [nop,nop,TS val 1245217244 ecr 1245306446], length 0
15:46:08.297054 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.], seq 374460:398160, ack 1, win 229, options [nop,nop,TS val 1245306446 ecr 1245217242], length 23700
15:46:08.297099 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.], ack 398160, win 8192, options [nop,nop,TS val 1245217244 ecr 1245306446], length 0
15:46:08.297258 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.], seq 398160:420912, ack 1, win 229, options [nop,nop,TS val 1245306446 ecr 1245217242], length 22752
15:46:08.297301 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.], ack 420912, win 8192, options [nop,nop,TS val 1245217244 ecr 1245306446], length 0
Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support
On 06/22/2016 11:22 AM, Yuval Mintz wrote: But seriously, this isn't really anything new but rather a step forward in the direction we've already taken - bnx2x/qede are already performing the same for non-encapsulated TCP. Since you mention bnx2x... I would argue that the NIC firmware on those NICs driven by bnx2x is doing it badly. Not so much from a functional standpoint I suppose, but from a performance one. The NIC-firmware GRO done there has this rather unfortunate assumption about "all MSSes will be directly driven by my own physical MTU" and when it sees segments of a size other than would be suggested by the physical MTU, will coalesce only two segments together. They then do not get further coalesced in the stack. Suffice it to say this does not do well from a performance standpoint. One can disable LRO via ethtool for these NICs, but what that does is disable old-school LRO, not GRO-in-the-NIC. To get that disabled, one must also get the bnx2x module loaded with "disable-tpa=1" so the Linux stack GRO gets used instead. Had the bnx2x-driven NICs' firmware not had that rather unfortunate assumption about MSSes I probably would never have noticed. happy benchmarking, rick jones
Re: [PATCH net-next 0/8] tou: Transports over UDP - part I
On 06/16/2016 10:51 AM, Tom Herbert wrote: Note that #1 is really about running a transport stack in userspace applications in clients, not necessarily servers. For servers we intend to modified the kernel stack in order to leverage existing implementation for building scalable serves (hence these patches). Only if there is a v2 for other reasons... I assume that was meant to be "scalable servers." Tested: Various cases of TOU with IPv4, IPv6 using TCP_STREAM and TCP_RR. Also, tested IPIP for comparing TOU encapsulation to IP tunneling. - IPv6 native 1 TCP_STREAM 8394 tps TPS for TCP_STREAM? Is that Mbit/s? 200 TCP_RR 1726825 tps 100/177/361 90/95/99% latencies To enhance the already good comprehensiveness of the numbers, a 1 TCP_RR showing the effect on latency rather than aggregate PPS would be goodness, as would a comparison of the service demands of the different single-stream results. CPU and NIC models would provide excellent context for the numbers. happy benchmarking, rick jones
Re: [PATCH] openvswitch: Add packet truncation support.
On 06/08/2016 09:30 PM, pravin shelar wrote: On Wed, Jun 8, 2016 at 6:18 PM, William Tu <u9012...@gmail.com> wrote: +struct ovs_action_trunc { + uint32_t max_len; /* Max packet size in bytes. */ This could be uint16_t, as it is related to packet len. Is there something limiting MTUs to 65535 bytes? rick jones
Re: [PATCH -next 2/2] virtio_net: Read the advised MTU
On 06/02/2016 10:06 AM, Aaron Conole wrote: Rick Jones <rick.jon...@hpe.com> writes: One of the things I've been doing has been setting-up a cluster (OpenStack) with JumboFrames, and then setting MTUs on instance vNICs by hand to measure different MTU sizes. It would be a shame if such a thing were not possible in the future. Keeping a warning if shrinking the MTU would be good, leave the error (perhaps) to if an attempt is made to go beyond the advised value. This was cut because it didn't make sense for such a warning to be issued, but it seems like perhaps you may want such a feature? I agree with Michael, after thinking about it, that I don't know what sort of use the warning would serve. After all, if you're changing the MTU, you must have wanted such a change to occur? I don't need a warning, was simply willing to live with one when shrinking the MTU. Didn't want an error. happy benchmarking, rick jones
Re: [RFC] net: remove busylock
On 05/19/2016 11:03 AM, Alexander Duyck wrote: On Thu, May 19, 2016 at 10:08 AM, Eric Dumazet <eric.duma...@gmail.com> wrote: With HTB qdisc, here are the numbers for 200 concurrent TCP_RR, on a host with 48 hyperthreads. ... That would be a 8 % increase. The main point of the busy lock is to deal with the bulk throughput case, not the latency case which would be relatively well behaved. The problem wasn't really related to lock bouncing slowing things down. It was the fairness between the threads that was killing us because the dequeue needs to have priority. Quibbledrift... While the origins of the netperf TCP_RR test center on measuring latency, I'm not sure I'd call 200 of them running concurrently a latency test. Indeed it may be neither fish nor fowl, but it will certainly be exercising the basic packet send/receive path rather fully and is likely a reasonable proxy for aggregate small packet performance. happy benchmarking, rick jones
Re: [PATCH] tcp: ensure non-empty connection request queue
On 05/04/2016 10:34 AM, Eric Dumazet wrote: On Wed, 2016-05-04 at 10:24 -0700, Rick Jones wrote: Dropping the connection attempt makes sense, but is entering/claiming synflood really indicated in the case of a zero-length accept queue? This is a one time message. This is how people can learn about their user space bugs, or too small backlog ;) Being totally silent would be not so nice. Assuming Peter's assertion about just drops when syncookies are not enabled is accurate, should there be some one-time message in that case too? rick
Re: [PATCH] tcp: ensure non-empty connection request queue
On 05/03/2016 05:25 PM, Eric Dumazet wrote: On Tue, 2016-05-03 at 23:54 +0200, Peter Wu wrote: When applications use listen() with a backlog of 0, the kernel would set the maximum connection request queue to zero. This causes false reports of SYN flooding (if tcp_syncookies is enabled) or packet drops otherwise. Well, I believe I already gave my opinion on this. listen backlog is not a hint. This is a limit. It is the limit of outstanding children in accept queue. If backlog is 0, no child can be put in the accept queue. It is therefore Working As Intended. Dropping the connection attempt makes sense, but is entering/claiming synflood really indicated in the case of a zero-length accept queue? rick
Re: drop all fragments inside tx queue if one gets dropped
For the "everything old is new again" files, back in the 1990s, it was noticed that on the likes of a netperf UDP_STREAM test on HP-UX, with fragmentation taking place, it was possible to consume 100% of the link bandwidth and have 0% effective throughput because the transmit queue was kept full with IP datagram fragments which could not possibly be reassembled (*) because one or more of the fragments of a datagram were dropped because the transmit queue was full. HP-UX implemented "packet trains" where all the fragments of a fragmented datagram were presented to the driver, which then either queued them all, or none of them. I don't recall seeing similar poor behaviour in Linux; I would have assumed that the intra-stack flow-control "took care" of it. Perhaps there is something specific to wpan which precludes that? happy benchmarking, rick jones
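The HP-UX "packet train" behaviour described above amounts to all-or-nothing admission of a datagram's fragments at the driver queue: either the whole train is queued or none of it. A toy sketch of that policy (names are mine, purely illustrative):

```python
from collections import deque

def enqueue_train(txq, capacity, fragments):
    """All-or-nothing ('packet train') admission: either every fragment of
    the datagram fits in the tx queue, or none is queued -- so the wire is
    never kept busy with fragments that can no longer be reassembled."""
    if len(txq) + len(fragments) > capacity:
        return False  # drop the whole datagram, not a random subset
    txq.extend(fragments)
    return True
```

Without the all-or-nothing check, a full queue drops an arbitrary suffix of each train, which is exactly the 100%-link-utilization/0%-goodput failure mode described.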
Re: Poorer networking performance in later kernels?
On 04/18/2016 04:27 AM, Butler, Peter wrote: Hi Rick Thanks for the reply. Here is some hardware information, as requested (the two systems are identical, and are communicating with one another over a 10GB full-duplex Ethernet backplane): - processor type: Intel(R) Xeon(R) CPU C5528 @ 2.13GHz - NIC: Intel 82599EB 10GB XAUI/BX4 - NIC driver: ixgbe version 4.2.1-k (part of 4.4.0 kernel) As for the buffer sizes, those rather large ones work fine for us with the 3.4.2 kernel. However, for the sake of being complete, I have re-tried the tests with the 'standard' 4.4.0 kernel parameters for all /proc/sys/net/* values, and the results still were extremely poor in comparison to the 3.4.2 kernel. Our MTU is actually just the standard 1500 bytes, however the message size was chosen to mimic actual traffic which will be segmented. I ran ethtool -k (indeed I checked all ethtool parameters, not just those via -k) and the only real difference I could find was in "large-receive-offload" which was ON in 3.4.2 but OFF in 4.4.0 - so I used ethtool to change this to match the 3.4.2 settings and re-ran the tests. Didn't help :-( It's possible of course that I have missed a parameter here or there in comparing the 3.4.2 setup to the 4.4.0 setup. I also tried running the ethtool config with the latest and greatest ethtool version (4.5) on the 4.4.0 kernel, as compared to the old 3.1 version on our 3.4.2 kernel. So it would seem the stateless offloads are still enabled. My next question would be to wonder if they are still "effective." To that end, you could run a netperf test specifying a particular port number in the test-specific portion: netperf ... 
-- -P ,12345 and while that is running, something like tcpdump -s 96 -c 20 -w /tmp/foo.pcap -i <interface> port 12345 then post-processed with the likes of: tcpdump -n -r /tmp/foo.pcap | grep -v "length 0" | awk '{sum += $NF}END{print "average",sum/NR}' the intent behind that is to see what the average post-GRO segment size happens to be on the receiver and then to compare it between the two kernels. Grepping-away the "length 0" is to avoid counting ACKs and look only at data segments. The specific port number is to avoid including any other connections which might happen to have traffic passing through at the time. You could I suspect do the same comparison on the sending side. There might I suppose be an easier way to get the average segment size - perhaps something from looking at ethtool stats - but the stone knives and bear skins of tcpdump above would have the added benefit of having a packet trace or three for someone to look at if they felt the need. And for that, I would actually suggest starting the capture *before* the netperf test so the connection establishment is included. I performed the TCP_RR test as requested and in that case, the results are much more comparable. The old kernel is still better, but now only around 10% better as opposed to 2-3x better. Did the service demand change by 10% or just the transaction rate? However I still contend that the *_STREAM tests are giving us more pertinent data, since our product application is only getting 1/3 to 1/2 of the performance on the 4.4.0 kernel, and this is the same thing I see when I use netperf to test. One other note: I tried running our 3.4.2 and 4.4.0 kernels in a VM environment on my workstation, so as to take the 'real' production hardware out of the equation. When I perform the tests in this setup the 3.4.2 and 4.4.0 kernels perform identically - just as you would expect. Running in a VM will likely change things massively and could I suppose mask other behaviour changes. 
happy benchmarking, rick jones raj@tardy:~$ cat signatures/toppost A: Because it fouls the order in which people normally read text. Q: Why is top-posting such a bad thing? A: Top-posting. Q: What is the most annoying thing on usenet and in e-mail? :) Any other ideas? What can I be missing here? Peter -Original Message- From: Rick Jones [mailto:rick.jon...@hpe.com] Sent: April-15-16 6:37 PM To: Butler, Peter <pbut...@sonusnet.com>; netdev@vger.kernel.org Subject: Re: Poorer networking performance in later kernels? On 04/15/2016 02:02 PM, Butler, Peter wrote: (Please keep me CC'd to all comments/responses) I've tried a kernel upgrade from 3.4.2 to 4.4.0 and see a marked drop in networking performance. Nothing was changed on the test systems, other than the kernel itself (and kernel modules). The identical .config used to build the 3.4.2 kernel was brought over into the 4.4.0 kernel source tree, and any configuration differences (e.g. new parameters, etc.) were taken as default values. The testing was performed on the same actual hardware for both kernel versions (i.e. take the existing 3.4.2 physical setup, simply boot into the (new) kernel and run the same test). The netperf utility was used for b
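The tcpdump-plus-awk recipe from the message above can also be written as a few lines of script; this sketch (the helper name is mine) assumes tcpdump's usual one-packet-per-line text output with a trailing "length N" field:

```python
def average_segment_size(lines):
    """Mirror `grep -v "length 0" | awk '{sum += $NF} END {print sum/NR}'`:
    average the trailing length field over data-bearing segments only,
    skipping pure ACKs (length 0)."""
    sizes = [int(line.rsplit("length", 1)[1])
             for line in lines
             if "length" in line and not line.rstrip().endswith("length 0")]
    return sum(sizes) / len(sizes) if sizes else 0.0
```

A post-GRO average well above the MSS on the receiver indicates coalescing is working; an average stuck near the MSS (or, per the bnx2x discussion elsewhere in this digest, near 2x the MSS) points at it being broken or disabled.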
Re: Poorer networking performance in later kernels?
On 04/15/2016 02:02 PM, Butler, Peter wrote: (Please keep me CC'd to all comments/responses) I've tried a kernel upgrade from 3.4.2 to 4.4.0 and see a marked drop in networking performance. Nothing was changed on the test systems, other than the kernel itself (and kernel modules). The identical .config used to build the 3.4.2 kernel was brought over into the 4.4.0 kernel source tree, and any configuration differences (e.g. new parameters, etc.) were taken as default values. The testing was performed on the same actual hardware for both kernel versions (i.e. take the existing 3.4.2 physical setup, simply boot into the (new) kernel and run the same test). The netperf utility was used for benchmarking and the testing was always performed on idle systems. TCP testing yielded the following results, where the 4.4.0 kernel only got about 1/2 of the throughput:

        Recv      Send      Send     Elapsed               Utilization     Service Demand
        Socket    Socket    Message  Time     Throughput   Send    Recv    Send    Recv
        Size      Size      Size                           local   remote  local   remote
        bytes     bytes     bytes    secs.    10^6bits/s   % S     % S     us/KB   us/KB
3.4.2   13631488  13631488  89523    30.01    9370.29      10.14    6.50    0.709   0.454
4.4.0   13631488  13631488  89523    30.02    5314.03       9.14   14.31    1.127   1.765

SCTP testing yielded the following results, where the 4.4.0 kernel only got about 1/3 of the throughput:

        Recv      Send      Send     Elapsed               Utilization     Service Demand
        Socket    Socket    Message  Time     Throughput   Send    Recv    Send    Recv
        Size      Size      Size                           local   remote  local   remote
        bytes     bytes     bytes    secs.    10^6bits/s   % S     % S     us/KB   us/KB
3.4.2   13631488  13631488  89523    30.00    2306.22      13.87   13.19    3.941   3.747
4.4.0   13631488  13631488  89523    30.01     882.74      16.86   19.14   12.516  14.210

The same tests were performed a multitude of times, and are always consistent (within a few percent). I've also tried playing with various run-time kernel parameters (/proc/sys/kernel/net/...) on the 4.4.0 kernel to alleviate the issue but have had no success at all. I'm at a loss as to what could possibly account for such a discrepancy... 
I suspect I am not alone in being curious about the CPU(s) present in the systems and the model/whatnot of the NIC being used. I'm also curious as to why you have what at first glance seem like absurdly large socket buffer sizes. That said, it looks like you have some Really Big (tm) increases in service demand. Many more CPU cycles being consumed per KB of data transferred. Your message size makes me wonder if you were using a 9000 byte MTU. Perhaps in the move from 3.4.2 to 4.4.0 you lost some or all of the stateless offloads for your NIC(s)? Running ethtool -k on both ends under both kernels might be good. Also, if you did have a 9000 byte MTU under 3.4.2 are you certain you still had it under 4.4.0? It would (at least to me) also be interesting to run a TCP_RR test comparing the two kernels. TCP_RR (at least with the default request/response size of one byte) doesn't really care about stateless offloads or MTUs and could show how much difference there is in basic path length (or I suppose in interrupt coalescing behaviour if the NIC in question has a mildly dodgy heuristic for such things). happy benchmarking, rick jones
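For reference, netperf's service demand column can be reproduced from its utilization and throughput columns. The sketch below assumes %S is whole-machine utilization spread across ncpus logical CPUs; ncpus=8 (a plausible count for the Xeon mentioned in this thread, though that is my assumption) is consistent with the figures Peter posted:

```python
def service_demand_us_per_kb(util_pct, throughput_mbit, ncpus):
    """netperf-style service demand: CPU microseconds consumed per KB moved.
    CPU us/sec = (util_pct / 100) * ncpus * 1e6
    KB/sec     = throughput_mbit * 1e6 / (8 * 1024)
    Service demand = CPU us/sec divided by KB/sec."""
    return (util_pct / 100.0) * ncpus * 8 * 1024 / throughput_mbit
```

Plugging in the 3.4.2 TCP row (10.14 %S local, 9370.29 Mbit/s) with ncpus=8 gives roughly 0.709 us/KB, matching the reported local service demand; the 4.4.0 row (9.14 %S, 5314.03 Mbit/s) gives roughly 1.127 us/KB. Throughput halved while CPU cost per KB roughly doubled, which is the "Really Big (tm)" increase noted above.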
Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864
On 04/01/2016 07:21 PM, Eric Dumazet wrote: On Fri, 2016-04-01 at 22:16 -0400, David Miller wrote: From: Alexander Duyck <alexander.du...@gmail.com> Date: Fri, 1 Apr 2016 12:58:41 -0700 RFC 6864 is pretty explicit about this, IPv4 ID used only for fragmentation. https://tools.ietf.org/html/rfc6864#section-4.1 The goal with this change is to try and keep most of the existing behavior intact without violating this rule? I would think the sequence number should give you the ability to infer a drop in the case of TCP. In the case of UDP tunnels we are now getting a bit more data since we were ignoring the outer IP header ID before. When retransmits happen, the sequence numbers are the same. But you can then use the IP ID to see exactly what happened. You can even tell if multiple retransmits got reordered. Eric's use case is extremely useful, and flat out eliminates ambiguity when analyzing TCP traces. Yes, our team (including Van Jacobson ;) ) would be sad to not have sequential IP ID (but then we don't have them for IPv6 ;) ) Your team would not be the only one sad to see that go away. rick jones Since the cost of generating them is pretty small (inet->inet_id counter), we probably should keep them in linux. Their usage will phase out as IPv6 wins the Internet war...
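For anyone doing the kind of trace analysis Eric describes, the IP ID in question is simply the 16-bit Identification field at byte offset 4 of the IPv4 header. A minimal extraction sketch (the helper name is mine):

```python
import struct

def ipv4_id(header: bytes) -> int:
    """Return the 16-bit IPv4 Identification field (header bytes 4-5,
    network byte order) -- the counter whose sequential values make
    retransmit ordering visible in a packet trace."""
    return struct.unpack("!H", header[4:6])[0]
```

When a sender increments inet_id per packet, two retransmits with identical TCP sequence numbers still carry distinct, ordered IDs, which is what removes the ambiguity when reading a trace.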
Re: [RFC net-next 2/2] udp: No longer use SLAB_DESTROY_BY_RCU
On 03/28/2016 01:01 PM, Eric Dumazet wrote: Note : file structures got RCU freeing back in 2.6.14, and I do not think named users ever complained about added cost ;) Couldn't see the tree for the forest I guess :) rick
Re: [RFC net-next 2/2] udp: No longer use SLAB_DESTROY_BY_RCU
On 03/28/2016 11:55 AM, Eric Dumazet wrote: On Mon, 2016-03-28 at 11:44 -0700, Rick Jones wrote: On 03/28/2016 10:00 AM, Eric Dumazet wrote: If you mean that a busy DNS resolver spends _most_ of its time doing : fd = socket() bind(fd port=0) < send and receive one frame > close(fd) Yes. Although it has been a long time, I thought that say the likes of a caching named in the middle between hosts and the rest of the DNS would behave that way as it was looking-up names on behalf those who asked it. I really doubt a modern program would dynamically allocate one UDP port for every in-flight request, as it would limit them to number of ephemeral ports concurrent requests (~3 assuming the process can get them all on the host) I was under the impression that individual DNS queries were supposed to have not only random DNS query IDs but also originate from random UDP source ports. https://tools.ietf.org/html/rfc5452 4.5 at least touches on the topic but I don't see it making it hard and fast. By section 10 though it is more explicit: This document recommends the use of UDP source port number randomization to extend the effective DNS transaction ID beyond the available 16 bits. That being the case, if indeed there were to be 3-odd concurrent requests outstanding "upstream" from that location there'd have to be 3 ephemeral ports in play. rick Managing a pool would be more efficient (The 1.3 usec penalty becomes more like 4 usec in multi threaded programs) Sure, you always can find badly written programs, but they already hit scalability issues anyway. UDP refcounting cost about 2 cache line misses per packet in stress situations, this really has to go, so that well written programs can get full speed.
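The RFC 5452 recommendation under discussion — a distinct, kernel-randomized source port per outstanding query — is easy to observe by binding UDP sockets to port 0. A sketch (the helper name is mine) showing that each concurrently open socket holds its own ephemeral port, which is why N in-flight queries tie up N ports:

```python
import socket

def ephemeral_udp_ports(n):
    """Open n UDP sockets bound to port 0 (kernel-chosen ephemeral port),
    the way a resolver doing per-query source-port randomization would,
    and return the ports actually assigned."""
    socks = [socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
             for _ in range(n)]
    try:
        for s in socks:
            s.bind(("127.0.0.1", 0))  # port 0: let the kernel pick
        return [s.getsockname()[1] for s in socks]
    finally:
        for s in socks:
            s.close()
```

Because the sockets are open simultaneously on the same address, the kernel must hand each one a different port, so the number of concurrent in-flight queries is bounded by the ephemeral port range.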
Re: [RFC net-next 2/2] udp: No longer use SLAB_DESTROY_BY_RCU
On 03/28/2016 10:00 AM, Eric Dumazet wrote: On Mon, 2016-03-28 at 09:15 -0700, Rick Jones wrote: On 03/25/2016 03:29 PM, Eric Dumazet wrote: UDP sockets are not short lived in the high usage case, so the added cost of call_rcu() should not be a concern. Even a busy DNS resolver? If you mean that a busy DNS resolver spends _most_ of its time doing : fd = socket() bind(fd port=0) < send and receive one frame > close(fd) Yes. Although it has been a long time, I thought that say the likes of a caching named in the middle between hosts and the rest of the DNS would behave that way as it was looking-up names on behalf of those who asked it. rick (If this is the case, may I suggest doing something different, and use some kind of caches ? It will be way faster.) Then the result for 10,000,000 loops of <socket()+bind()+close()> are Before patch : real 0m13.665s user 0m0.548s sys 0m12.372s After patch : real 0m20.599s user 0m0.465s sys 0m17.965s So the worst overhead is 700 ns This is roughly the cost for bringing 960 bytes from memory, or 15 cache lines (on x86_64) # grep UDP /proc/slabinfo UDPLITEv6 0 0 1088 7 2 : tunables 24 12 8 : slabdata 0 0 0 UDPv6 24 49 1088 7 2 : tunables 24 12 8 : slabdata 7 7 0 UDP-Lite 0 0 960 4 1 : tunables 54 27 8 : slabdata 0 0 0 UDP 30 36 960 4 1 : tunables 54 27 8 : slabdata 9 9 2 In reality, chances that UDP sockets are re-opened right after being freed and their 15 cache lines are very hot in cpu caches is quite small, so I would not worry at all about this rather stupid benchmark. int main(int argc, char *argv[]) { struct sockaddr_in addr; int i, fd, loops = 10000000; for (i = 0; i < loops; i++) { fd = socket(AF_INET, SOCK_DGRAM, 0); if (fd == -1) { perror("socket"); break; } memset(&addr, 0, sizeof(addr)); addr.sin_family = AF_INET; if (bind(fd, (const struct sockaddr *)&addr, sizeof(addr)) == -1) { perror("bind"); break; } close(fd); } return 0; }
Re: [RFC net-next 2/2] udp: No longer use SLAB_DESTROY_BY_RCU
On 03/25/2016 03:29 PM, Eric Dumazet wrote: UDP sockets are not short lived in the high usage case, so the added cost of call_rcu() should not be a concern. Even a busy DNS resolver? rick jones
Re: [Codel] [RFCv2 0/3] mac80211: implement fq codel
On 03/17/2016 10:00 AM, Dave Taht wrote: netperf's udp_rr is not how much traffic conventionally behaves. It doesn't do tcp slow start or congestion control in particular... Nor would one expect it to need to, unless one were using "burst mode" to have more than one transaction inflight at one time. And unless one uses the test-specific -e option to provide a very crude retransmission mechanism based on a socket read timeout, neither does UDP_RR recover from lost datagrams. happy benchmarking, rick jones http://www.netperf.org/
Re: [RFC v2 -next 0/2] virtio-net: Advised MTU feature
On 03/15/2016 02:04 PM, Aaron Conole wrote: The following series adds the ability for a hypervisor to set an MTU on the guest during feature negotiation phase. This is useful for VM orchestration when, for instance, tunneling is involved and the MTU of the various systems should be homogenous. The first patch adds the feature bit as described in the proposed virtio spec addition found at https://lists.oasis-open.org/archives/virtio-dev/201603/msg1.html The second patch adds a user of the bit, and a warning when the guest changes the MTU from the hypervisor advised MTU. Future patches may add more thorough error handling. How do you see this interacting with VMs getting MTU settings via DHCP? rick jones v2: * Whitespace and code style cleanups from Sergei Shtylyov and Paolo Abeni * Additional test before printing a warning Aaron Conole (2): virtio: Start feature MTU support virtio_net: Read the advised MTU drivers/net/virtio_net.c | 12 ++++++++++++ include/uapi/linux/virtio_net.h | 3 +++ 2 files changed, 15 insertions(+)
Re: [PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
On 03/14/2016 02:15 PM, Eric Dumazet wrote: On Thu, 2016-03-03 at 19:06 +0100, Bendik Rønning Opstad wrote: Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing the latency for applications sending time-dependent data. Latency-sensitive applications or services, such as online games, remote control systems, and VoIP, produce traffic with thin-stream characteristics, characterized by small packets and relatively high inter-transmission times (ITT). When experiencing packet loss, such latency-sensitive applications are heavily penalized by the need to retransmit lost packets, which increases the latency by a minimum of one RTT for the lost packet. Packets coming after a lost packet are held back due to head-of-line blocking, causing increased delays for all data segments until the lost packet has been retransmitted. Acked-by: Eric Dumazet <eduma...@google.com> Note that RDB probably should get some SNMP counters, so that we get an idea of how many times a loss could be repaired. And some idea of the duplication seen by receivers, assuming there isn't already a counter for such a thing in Linux. happy benchmarking, rick jones Ideally, if the path happens to be lossless, all these pro active bundles are overhead. Might be useful to make RDB conditional to tp->total_retrans or something.
Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default
On 02/23/2016 08:47 AM, Tom Herbert wrote: Right, GRO should probably not coalesce packets with non-zero IP identifiers due to the loss of information. Besides that, RFC6848 says the IP identifier should only be set for fragmentation anyway so there shouldn't be any issue and really no need for HW TSO (or LRO) to support that. You sure that is RFC 6848 "Specifying Civic Address Extensions in the Presence Information Data Format Location Object (PIDF-LO)" ? In whichever RFC that may be, is it a SHOULD or a MUST, and just how many "other" stacks might be setting a non-zero IP ID on fragments with DF set? rick jones We need to do increment IP identifier in UFO, but I only see one device (neterion) that advertises NETIF_F_UFO-- honestly, removing that feature might be another good simplification! Tom -- -Ed
Re: Variable download speed
On 02/23/2016 03:24 AM, s...@onet.eu wrote: Hi, I've got a problem with network on one of my embedded boards. I'm testing download speed of 256MB file from my PC to embedded board through 1Gbit ethernet link using ftp. The problem is that sometimes I achieve 25MB/s and sometimes it is only 14MB/s. There are also situations where the transfer speed starts at 14MB/s and after a few seconds achieves 25MB/s. I've caught the second case with tcpdump and I noticed that when the speed is 14MB/s - the tcp window size is 534368 bytes and when the speed achieved 25MB/s the tcp window size is 933888. My question is: what causes such dynamic change in the window size (while transferring data)? Is it some kernel parameter set wrong or something like this? Do I have any influence on such dynamic change in tcp window size? If an application using TCP does not make an explicit setsockopt() call to set the SO_SNDBUF and/or SO_RCVBUF size, then the socket buffer and TCP window size will "autotune" based on what the stack believes to be the correct thing to do. It will be bounded by the values in the tcp_rmem and tcp_wmem sysctl settings: net.ipv4.tcp_rmem = 4096 87380 6291456 net.ipv4.tcp_wmem = 4096 16384 4194304 Those are min, initial, max, units of octets (bytes). If on the other hand an application makes an explicit setsockopt() call, that will be the size of the socket buffer, though it will be "clipped" by the values of: net.core.rmem_max = 4194304 net.core.wmem_max = 4194304 Those sysctls will default to different values based on how much memory is in the system. And I think in the case of those last two, I have tweaked them myself away from their default values. You might also look at the CPU utilization of all the CPUs of your embedded board, as well as the link-level statistics for your interface, and the netstat statistics. You would be looking for saturation, and "excessive" drop rates. I would also suggest testing network performance with something other than FTP. 
While one can try to craft things so there is no storage I/O of note, it would still be better to use a network-specific tool such as netperf or iperf. Minimize the number of variables. happy benchmarking, rick jones
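The clipping behavior described above is simple enough to sketch; the request size below is a made-up example, and the kernel's internal doubling of explicitly set SO_SNDBUF/SO_RCVBUF values is ignored for brevity:

```shell
# Sketch of the "clipped by net.core.wmem_max" rule described above.
# wmem_max uses the value quoted in the message; the application's
# requested size is hypothetical.
wmem_max=4194304      # net.core.wmem_max from the message
request=8388608       # hypothetical setsockopt(SO_SNDBUF) request
if [ "$request" -gt "$wmem_max" ]; then
    effective=$wmem_max
else
    effective=$request
fi
out="effective SO_SNDBUF: $effective"
echo "$out"
```

On a real system the bound would of course be read from /proc/sys/net/core/wmem_max rather than hard-coded.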
Re: [PATCH][net-next] bridge: increase mtu to 9000
On 02/22/2016 01:29 AM, roy.qing...@gmail.com wrote: From: Li RongQing <roy.qing...@gmail.com> A linux bridge always adopts the smallest MTU of the enslaved devices. When no devices are enslaved, it defaults to a MTU of 1500 and refuses to use a larger one. This is problematic when using bridges enslaving only virtual NICs (vnetX) like it's common with KVM guests. Steps to reproduce the problem 1) sudo ip link add br-test0 type bridge # create an empty bridge 2) sudo ip link set br-test0 mtu 9000 # attempt to set MTU > 1500 3) ip link show dev br-test0 # confirm MTU Here, 2) returns "RTNETLINK answers: Invalid argument". One (cumbersome) way around this is: 4) sudo modprobe dummy 5) sudo ip link set dummy0 mtu 9000 master br-test0 Then the bridge's MTU can be changed from anywhere to 9000. This is especially annoying for the virtualization case because the KVM's tap driver will by default adopt the bridge's MTU on startup making it impossible (without the workaround) to use a large MTU on the guest VMs. 
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1399064 Signed-off-by: Li RongQing <roy.qing...@gmail.com> --- net/bridge/br_if.c | 4 ++-- net/bridge/br_private.h | 2 ++ 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c index c367b3e..38ced44 100644 --- a/net/bridge/br_if.c +++ b/net/bridge/br_if.c @@ -390,7 +390,7 @@ int br_del_bridge(struct net *net, const char *name) return ret; } -/* MTU of the bridge pseudo-device: ETH_DATA_LEN or the minimum of the ports */ +/* MTU of the bridge pseudo-device: BR_JUMBO_MTU or the minimum of the ports */ int br_min_mtu(const struct net_bridge *br) { const struct net_bridge_port *p; @@ -399,7 +399,7 @@ int br_min_mtu(const struct net_bridge *br) ASSERT_RTNL(); if (list_empty(&br->port_list)) - mtu = ETH_DATA_LEN; + mtu = BR_JUMBO_MTU; else { list_for_each_entry(p, &br->port_list, list) { if (!mtu || p->dev->mtu < mtu) diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h index 302ab0a..d3c29f6 100644 --- a/net/bridge/br_private.h +++ b/net/bridge/br_private.h @@ -32,6 +32,8 @@ #define BR_VERSION "2.3" +#define BR_JUMBO_MTU 9000 + /* Control of forwarding link local multicast */ #define BR_GROUPFWD_DEFAULT 0 /* Don't allow forwarding of control protocols like STP, MAC PAUSE and LACP */ If you are going to 9000, why not just go ahead and use the maximum size of an IP datagram? rick jones
Re: [PATCH net V1 1/6] net/mlx4_en: Count HW buffer overrun only once
On 02/17/2016 07:24 AM, Or Gerlitz wrote: From: Amir Vadai <a...@vadai.me> RdropOvflw counts overrun of HW buffer, therefore should be used for rx_fifo_errors only. Currently RdropOvflw counter is mistakenly also set into rx_missed_errors and rx_over_errors too, which makes the device total dropped packets accounting to show wrong results. Fix that. Use it for rx_fifo_errors only. Fixes: c27a02cd94d6 ('mlx4_en: Add driver for Mellanox ConnectX 10GbE NIC') Signed-off-by: Amir Vadai <a...@vadai.me> Signed-off-by: Eugenia Emantayev <euge...@mellanox.com> Signed-off-by: Or Gerlitz <ogerl...@mellanox.com> Reviewed-By: Rick Jones <rick.jon...@hpe.com> rick
Re: [PATCH net 1/6] net/mlx4_en: Do not count dropped packets twice
On 02/16/2016 07:01 AM, Or Gerlitz wrote: From: Amir Vadai <a...@vadai.me> RdropOvflw counter was mistakenly copied into rx_missed_errors. Because of that it was counted twice for the device dropped packets accounting. Fixes: c27a02cd94d6 ('mlx4_en: Add driver for Mellanox ConnectX 10GbE NIC') Signed-off-by: Amir Vadai <a...@vadai.me> Signed-off-by: Eugenia Emantayev <euge...@mellanox.com> Signed-off-by: Or Gerlitz <ogerl...@mellanox.com> --- drivers/net/ethernet/mellanox/mlx4/en_port.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/en_port.c b/drivers/net/ethernet/mellanox/mlx4/en_port.c index ee99e67..7b511a5 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_port.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_port.c @@ -242,7 +242,7 @@ int mlx4_en_DUMP_ETH_STATS(struct mlx4_en_dev *mdev, u8 port, u8 reset) stats->rx_crc_errors = be32_to_cpu(mlx4_en_stats->RCRC); stats->rx_frame_errors = 0; stats->rx_fifo_errors = be32_to_cpu(mlx4_en_stats->RdropOvflw); - stats->rx_missed_errors = be32_to_cpu(mlx4_en_stats->RdropOvflw); + stats->rx_missed_errors = 0; stats->tx_aborted_errors = 0; stats->tx_carrier_errors = 0; stats->tx_fifo_errors = 0; I'm still not clear on when an Acked-by is appropriate, but given that this has been a non-trivial frustration for a long time, a hearty endorsement from me. Perhaps not important enough but it would be nice to have it flow back a release or two. That said, should mlx4_en_stats->RdropOvflw still be going into both rx_fifo_errors and rx_over_errors? stats->rx_over_errors = be32_to_cpu(mlx4_en_stats->RdropOvflw); stats->rx_crc_errors = be32_to_cpu(mlx4_en_stats->RCRC); stats->rx_frame_errors = 0; stats->rx_fifo_errors = be32_to_cpu(mlx4_en_stats->RdropOvflw); happy benchmarking, rick jones
Re: Disabling XPS for 4.4.0-1+ixgbe+OpenStack VM over a VLAN means 65% increase in netperf TCP_STREAM
On 02/04/2016 11:38 AM, Tom Herbert wrote: I'd start with verifying the XPS configuration is sane and then trying to reproduce the issue outside of using VMs, if both of those are okay then maybe look at some sort of bad interaction with OpenStack configuration. So, looking at bare-iron, I can see something similar but not to the same degree (well, depending on which is one's metric of interest I guess): XPS being enabled for ixgbe here looks to be increasing receive side service demand by 30% but there is enough CPU available in this setup that it is only a loss of 2.5% or so on throughput. stack@fcperf-cp1-comp0001-mgmt:~$ grep 87380 xps_on_* | awk '{t+=$6;r+=$9;s+=$10}END{print "throughput",t/NR,"recv sd",r/NR,"send sd",s/NR}' throughput 9072.52 recv sd 0.8623 send sd 0.3686 stack@fcperf-cp1-comp0001-mgmt:~$ grep TCPOFO xps_on_* | awk '{sum += $NF}END{print "sum",sum/NR}' sum 1621.1 stack@fcperf-cp1-comp0001-mgmt:~$ grep 87380 xps_off_* | awk '{t+=$6;r+=$9;s+=$10}END{print "throughput",t/NR,"recv sd",r/NR,"send sd",s/NR}' throughput 9300.48 recv sd 0.6543 send sd 0.3606 stack@fcperf-cp1-comp0001-mgmt:~$ grep TCPOFO xps_off_* | awk '{sum += $NF}END{print "sum",sum/NR}' sum 173.9 happy benchmarking, rick jones raw results at ftp://ftp.netperf.org/xps_4.4.0-1_ixgbe.tgz
Re: Disabling XPS for 4.4.0-1+ixgbe+OpenStack VM over a VLAN means 65% increase in netperf TCP_STREAM
Shame on me for not including bare-iron TCP_RR: stack@fcperf-cp1-comp0001-mgmt:~$ grep "1 1" xps_tcp_rr_on_* | awk '{t+=$6;r+=$9;s+=$10}END{print "throughput",t/NR,"recv sd",r/NR,"send sd",s/NR}' throughput 18589.4 recv sd 21.6296 send sd 20.5931 stack@fcperf-cp1-comp0001-mgmt:~$ grep "1 1" xps_tcp_rr_off_* | awk '{t+=$6;r+=$9;s+=$10}END{print "throughput",t/NR,"recv sd",r/NR,"send sd",s/NR}' throughput 20883.6 recv sd 19.6255 send sd 20.0178 So that is 12% on TCP_RR throughput. Looks like XPS shouldn't be enabled by default for ixgbe. happy benchmarking, rick jones
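For what it's worth, the grep/awk averaging idiom used above can be tried on canned netperf-style lines (the numbers here are made up for illustration; fields 6, 9 and 10 stand in for throughput, receive service demand and send service demand):

```shell
# Average fields 6, 9 and 10 across result lines, as in the message's
# pipelines; the two input lines are invented sample data.
out=$(printf 'x x x x x 18000 x x 21.0 20.0\nx x x x x 19000 x x 22.0 21.0\n' |
      awk '{t+=$6;r+=$9;s+=$10}END{print "throughput",t/NR,"recv sd",r/NR,"send sd",s/NR}')
echo "$out"
```

In the real runs the input lines come from `grep`-ing the saved netperf output files, as shown above.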
Disabling XPS for 4.4.0-1+ixgbe+OpenStack VM over a VLAN means 65% increase in netperf TCP_STREAM
Folks - I was doing some performance work with OpenStack Liberty on systems with 2x E5-2650L v3 @ 1.80GHz processors and 560FLR (Intel 82599ES) NICs onto which I'd placed a 4.4.0-1 kernel. I was actually interested in the effect of removing the linux bridge from all the plumbing OpenStack creates (it is there for iptables-based implementation of security group rules because OS Liberty doesn't enable them on the OVS bridge(s) it creates), and I'd noticed that when I removed the linux bridge from the "stack" instance-to-instance (vm-to-vm) performance across a VLAN-based Neutron private network dropped. Quite unexpected. On a lark, I tried explicitly binding the NIC's IRQs and Boom! the single-stream performance shot-up to near link-rate. I couldn't recall explicit binding of IRQs doing that much for single-stream netperf TCP_STREAM before. I asked the Intel folks about that, they suggested I try disabling XPS. So, with that I see the following on single-stream tests between the VMs on that VLAN-based private network as created by OpenStack Liberty (99% confident within +/- 2.5% of "real" average; TCP_RR in Trans/s, TCP_STREAM in Mbit/s):

             XPS Enabled   XPS Disabled   Delta
TCP_STREAM   5353          8841 (*)       65.2%
TCP_RR       8562          9666           12.9%

The Intel folks suggested something about the process scheduler moving the sender around and ultimately causing some packet re-ordering. That could I suppose explain the TCP_STREAM difference, but not the TCP_RR since that has just a single segment in flight at one time. I can try to get perf/whatnot installed on the systems - suggestions as to what metrics to look at are welcome. 
happy benchmarking, rick jones * If I disable XPS on the sending side only, it is more like 7700 Mbit/s netstats from the receiver over a netperf TCP_STREAM test's duration with XPS enabled: $ netperf -H 10.240.50.191 -- -o throughput,local_transport_retrans MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.240.50.191 () port 0 AF_INET : demo Throughput,Local Transport Retransmissions 5292.74,4555 $ ./beforeafter before after Ip: 327837 total packets received 0 with invalid addresses 0 forwarded 0 incoming packets discarded 327837 incoming packets delivered 293438 requests sent out Icmp: 0 ICMP messages received 0 input ICMP message failed. ICMP input histogram: destination unreachable: 0 0 ICMP messages sent 0 ICMP messages failed ICMP output histogram: destination unreachable: 0 IcmpMsg: InType3: 0 OutType3: 0 Tcp: 0 active connections openings 2 passive connection openings 0 failed connection attempts 0 connection resets received 0 connections established 327837 segments received 293438 segments send out 0 segments retransmited 0 bad segments received. 0 resets sent Udp: 0 packets received 0 packets to unknown port received. 0 packet receive errors 0 packets sent IgnoredMulti: 0 UdpLite: TcpExt: 0 TCP sockets finished time wait in fast timer 0 delayed acks sent Quick ack mode was activated 1016 times 50386 packets directly queued to recvmsg prequeue. 
309545872 bytes directly in process context from backlog 2874395424 bytes directly received in process context from prequeue 86591 packet headers predicted 84934 packets header predicted and directly queued to user 6 acknowledgments not containing data payload received 20 predicted acknowledgments 1017 DSACKs sent for old packets TCPRcvCoalesce: 157097 TCPOFOQueue: 78206 TCPOrigDataSent: 24 IpExt: InBcastPkts: 0 InOctets: 6643231012 OutOctets: 17203936 InBcastOctets: 0 InNoECTPkts: 327837 And now with it disabled on both sides: $ netperf -H 10.240.50.191 -- -o throughput,local_transport_retrans MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.240.50.191 () port 0 AF_INET : demo Throughput,Local Transport Retransmissions 8656.84,1903 $ ./beforeafter noxps_before noxps_avter Ip: 251831 total packets received 0 with invalid addresses 0 forwarded 0 incoming packets discarded 251831 incoming packets delivered 218415 requests sent out Icmp: 0 ICMP messages received 0 input ICMP message failed. ICMP input histogram: destination unreachable: 0 0 ICMP messages sent 0 ICMP messages failed ICMP output histogram: destination unreachable: 0 IcmpMsg: InType3: 0 OutType3: 0 Tcp: 0 active connections openings 2 passive connection openings 0 failed connection attempts 0 connection resets received 0 connections established 251831 segments received 218415 segments send out 0 segments retransmited 0 bad segments received. 0 resets sent Udp: 0 pa
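The beforeafter script itself isn't shown in the message; an assumed-equivalent sketch, which subtracts the leading counter of each "before" snapshot line from the matching "after" line, might look like:

```shell
# Hedged sketch of a "beforeafter"-style delta over two netstat snapshots:
# subtract the leading counter of each line in the first file from the
# matching line of the second. The real script is not shown in the message;
# the snapshot values are taken from the message's second run.
before=$(mktemp); after=$(mktemp)
printf '327837 segments received\n293438 segments send out\n' > "$before"
printf '579668 segments received\n511853 segments send out\n' > "$after"
out=$(awk 'NR==FNR { a[FNR] = $1; next }
           { printf "%d %s\n", $1 - a[FNR], substr($0, index($0, " ") + 1) }' \
      "$before" "$after")
rm -f "$before" "$after"
echo "$out"
```

Taking snapshots immediately before and after a netperf run and differencing them this way attributes the counter deltas to that test's traffic.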
Re: [PATCH net-next v5 1/2] ethtool: add speed/duplex validation functions
On 02/04/2016 04:47 AM, Michael S. Tsirkin wrote: On Wed, Feb 03, 2016 at 03:49:04PM -0800, Rick Jones wrote: And even for not-quite-virtual devices - such as a VC/FlexNIC in an HPE blade server there can be just about any speed set. I think we went down a path of patching some things to address that many years ago. It would be a shame to undo that. rick I'm not sure I understand. The question is in defining the UAPI. We currently have: * @speed: Low bits of the speed * @speed_hi: Hi bits of the speed with the assumption that all values come from the defines. So if we allow any value here we need to define what it means. I may be mixing apples and kiwis. Many years ago when HP came-out with their blades and VirtualConnect, they included the ability to create "flex NICs" - "sub-NICs" out of a given interface port on a blade, and to assign each a specific bitrate in increments (IIRC) of 100 Mbit/s. This was reported up through the driver and it became necessary to make ethtool (again, IIRC) not so picky about "valid" speed values. rick
Re: Disabling XPS for 4.4.0-1+ixgbe+OpenStack VM over a VLAN means 65% increase in netperf TCP_STREAM
On 02/04/2016 11:38 AM, Tom Herbert wrote: On Thu, Feb 4, 2016 at 11:13 AM, Rick Jones <rick.jon...@hpe.com> wrote: The Intel folks suggested something about the process scheduler moving the sender around and ultimately causing some packet re-ordering. That could I suppose explain the TCP_STREAM difference, but not the TCP_RR since that has just a single segment in flight at one time. XPS has OOO avoidance for TCP, that should not be a problem. What/how much should I read into: With XPS TCPOFOQueue: 78206 Without XPS TCPOFOQueue: 967 out of the netstat statistics on the receiving VM? I can try to get perf/whatnot installed on the systems - suggestions as to what metrics to look at are welcome. I'd start with verifying the XPS configuration is sane and then trying to reproduce the issue outside of using VMs, if both of those are okay then maybe look at some sort of bad interaction with OpenStack configuration. Fair enough - what is the definition of "sane" for an XPS configuration? Here is what it looks like before I disabled it:
$ for i in `find /sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0 -name xps_cpus`; do echo $i `cat $i`; done
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-0/xps_cpus 0000,00000001
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-1/xps_cpus 0000,00000002
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-2/xps_cpus 0000,00000004
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-3/xps_cpus 0000,00000008
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-4/xps_cpus 0000,00000010
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-5/xps_cpus 0000,00000020
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-6/xps_cpus 0000,00000040
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-7/xps_cpus 0000,00000080
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-8/xps_cpus 0000,00000100
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-9/xps_cpus 0000,00000200
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-10/xps_cpus 0000,00000400
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-11/xps_cpus 0000,00000800
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-12/xps_cpus 0000,00001000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-13/xps_cpus 0000,00002000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-14/xps_cpus 0000,00004000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-15/xps_cpus 0000,00008000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-16/xps_cpus 0000,00010000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-17/xps_cpus 0000,00020000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-18/xps_cpus 0000,00040000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-19/xps_cpus 0000,00080000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-20/xps_cpus 0000,00100000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-21/xps_cpus 0000,00200000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-22/xps_cpus 0000,00400000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-23/xps_cpus 0000,00800000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-24/xps_cpus 0000,01000000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-25/xps_cpus 0000,02000000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-26/xps_cpus 0000,04000000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-27/xps_cpus 0000,08000000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-28/xps_cpus 0000,10000000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-29/xps_cpus 0000,20000000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-30/xps_cpus 0000,40000000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-31/xps_cpus 0000,80000000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-32/xps_cpus 0001,00000000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-33/xps_cpus 0002,00000000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-34/xps_cpus 0004,00000000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-35/xps_cpus 0008,00000000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-36/xps_cpus 0010,00000000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-37/xps_cpus 0020,00000000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/eth0/queues/tx-38/xps_cpus 0040,00000000
/sys/devices/pci0000:00/0000:00:02.2/0000:04:00.0/net/e
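As an aside, disabling XPS amounts to writing an all-zero mask into each tx queue's xps_cpus file. A minimal sketch, exercised here against a throwaway directory tree rather than a live /sys (on a real system the base would be /sys/class/net/eth0/queues and the writes require root):

```shell
# Write a zero mask to every tx queue's xps_cpus file under the given base.
disable_xps() {
    for f in "$1"/tx-*/xps_cpus; do
        echo 0 > "$f"
    done
}

# Demonstrate against a fake queues directory so the sketch is runnable
# anywhere; the two seeded masks mimic the listing above.
base=$(mktemp -d)
mkdir -p "$base/tx-0" "$base/tx-1"
echo "0000,00000001" > "$base/tx-0/xps_cpus"
echo "0000,00000002" > "$base/tx-1/xps_cpus"
disable_xps "$base"
out=$(cat "$base"/tx-0/xps_cpus "$base"/tx-1/xps_cpus)
rm -rf "$base"
echo "$out"
```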
Re: Disabling XPS for 4.4.0-1+ixgbe+OpenStack VM over a VLAN means 65% increase in netperf TCP_STREAM
On 02/04/2016 12:13 PM, Tom Herbert wrote: On Thu, Feb 4, 2016 at 11:57 AM, Rick Jones <rick.jon...@hpe.com> wrote: On 02/04/2016 11:38 AM, Tom Herbert wrote: XPS has OOO avoidance for TCP, that should not be a problem. What/how much should I read into: With XPSTCPOFOQueue: 78206 Without XPS TCPOFOQueue: 967 out of the netstat statistics on the receiving VM? Okay, that makes sense. The OOO avoidance only applies to TCP sockets in the stack, that doesn't cross into VM. Presumably, packets coming from the VM don't have a socket so sk_tx_queue_get always returns -1 and so netdev_pick_tx will steer packet to the queue based on currently running CPU without any memory. Any thoughts as to why explicitly binding the IRQs made things better, or for that matter why the scheduler would be moving the VM (or its vhost-net kernel thread I suppose?) around so much? happy benchmarking, rick jones
Re: [PATCH net-next v5 1/2] ethtool: add speed/duplex validation functions
On 02/03/2016 03:32 PM, Stephen Hemminger wrote: But why check for valid value at all. At some point in the future, there will be yet another speed adopted by some standard body and the switch statement would need another value. Why not accept any value? This is a virtual device. And even for not-quite-virtual devices - such as a VC/FlexNIC in an HPE blade server there can be just about any speed set. I think we went down a path of patching some things to address that many years ago. It would be a shame to undo that. rick
Re: bonding (IEEE 802.3ad) not working with qemu/virtio
On 01/29/2016 10:59 PM, David Miller wrote: There should be a default speed/duplex setting for such devices as well. We can pick one that will be use universally for these kinds of devices. There is at least one monitoring tool - collectl - which gets a trifle upset when the actual speed through an interface is significantly greater than the reported link speed. I have to wonder how unique it is in that regard. Doesn't mean there can't be a default, but does suggest it should be rather high. rick jones
Re: [BUG] net: performance regression on ixgbe (Intel 82599EB 10-Gigabit NIC)
On 12/10/2015 06:18 AM, Otto Sabart wrote: *) Is irqbalance disabled and the IRQs set the same each time, or might there be variability possible there? Each of the five netperf runs will be a different four-tuple which means each may (or may not) get RSS hashed/etc differently. The irqbalance is disabled on all systems. Can you suggest if there is a need to assign IRQs manually? Which IRQs should we pin to which CPU? Likely as not it will depend on your goals. When I want single-stream results, I will tend to disable irqbalance and set all the IRQs to one CPU in the system (often as not CPU0 but that is as much habit as anything else). The idea is to clamp-down on any source of run-to-run variation. I will also sometimes alter where I bind netperf/netserver to show the effects (especially on service demand) when netperf/netserver run on the same CPU as the IRQ, a thread in the same core as the IRQ, a core in the same processor as the IRQ and/or a core in another processor. Unless all the IRQs are pointed at the same CPU (or I always specify the same, full four-tuple for addressing and wait for TIME_WAIT) that can be a challenge to keep straight. When I want to measure aggregate, I either let irqbalance do its thing and run a bunch of warm-up tests, or simply peanut-butter the IRQs across the CPUs with variations on the theme of: grep eth[23] /proc/interrupts | awk -F ":" -v cpus=12 '{mask = 1 * 2^(count++ % cpus);printf("echo %x > /proc/irq/%d/smp_affinity\n",mask,$1)}' | sh How one might structure/alter that pipeline will depend on the CPU enumeration. That one was from a 2x6 core system where I didn't want to hit the second thread of each core, and the enumeration was the first twelve CPUs were on thread 0 of each core of both processors. *) It is perhaps adding duct tape to already-present belt and suspenders, but is power-management set to a fixed state on the systems involved? 
(Since this seems to be ProLiant G7s going by the legends on the charts, either static high perf or static low power I would imagine) Power management is set to OS-Control in bios, which effectively means that _bios_ does not do any power management at all. Probably just as well :) *) What is the difference before/after for the service demands? The netperf tests being run are asking for CPU utilization but I don't see the service demand change being summarized. Unfortunately we do not have any summary chart for service demands; we will add some shortly. *) Does a specific CPU on one side or the other saturate? (LOCAL_CPU_PEAK_UTIL, LOCAL_CPU_PEAK_ID, REMOTE_CPU_PEAK_UTIL, REMOTE_CPU_PEAK_ID output selectors) We are sort of stuck in a stone age. We still use old-fashioned tcp/udp migrated tests, but we plan to switch to omni. Well, you don't have to invoke with -t omni to make use of the output selectors - just add the -O (or -o or -k) test-specific option. *) What are the processors involved? Presumably the "other system" is fixed? 
In this case: hp-dl380g7 - $ lscpu: Architecture: x86_64 CPU op-mode(s):32-bit, 64-bit Byte Order:Little Endian CPU(s):24 On-line CPU(s) list: 0-23 Thread(s) per core:2 Core(s) per socket:6 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family:6 Model: 44 Model name:Intel(R) Xeon(R) CPU X5650 @ 2.67GHz Stepping: 2 CPU MHz: 2660.000 BogoMIPS: 5331.27 Virtualization:VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 12288K NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22 NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23 hp-dl385g7 - $ lscpu: Architecture: x86_64 CPU op-mode(s):32-bit, 64-bit Byte Order:Little Endian CPU(s):24 On-line CPU(s) list: 0-23 Thread(s) per core:1 Core(s) per socket:12 Socket(s): 2 NUMA node(s): 4 Vendor ID: AuthenticAMD CPU family:16 Model: 9 Model name:AMD Opteron(tm) Processor 6172 Stepping: 1 CPU MHz: 2100.000 BogoMIPS: 4200.39 Virtualization:AMD-V L1d cache: 64K L1i cache: 64K L2 cache: 512K L3 cache: 5118K NUMA node0 CPU(s): 0,2,4,6,8,10 NUMA node1 CPU(s): 12,14,16,18,20,22 NUMA node2 CPU(s): 13,15,17,19,21,23 NUMA node3 CPU(s): 1,3,5,7,9,11 I guess that helps explain why there were such large differences in the deltas between TCP_STREAM and TCP_MAERTS since it wasn't the same per-core "horsepower" on either side and so why LRO on/off could have also affected the TCP_STREAM results. (When LRO was off it was off on both sides, and when on was on on b
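Separately, the IRQ-spreading awk pipeline quoted earlier can be dry-run on canned /proc/interrupts-style lines (the interrupt numbers and queue names here are invented, and the trailing "| sh" is omitted so the generated commands are only printed, not executed):

```shell
# The mask-spreading pipeline from the message, fed sample input instead of
# the live /proc/interrupts; cpus=12 as in the original 2x6-core example.
out=$(printf ' 64: 0 0 eth2-TxRx-0\n 65: 0 0 eth2-TxRx-1\n 66: 0 0 eth2-TxRx-2\n' |
      awk -F ":" -v cpus=12 '{mask = 1 * 2^(count++ % cpus);
          printf("echo %x > /proc/irq/%d/smp_affinity\n", mask, $1)}')
echo "$out"
```

Each successive IRQ gets the next single-CPU mask (1, 2, 4, ...), wrapping after `cpus` queues; piping the output to sh, as in the message, applies it.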
Re: [BUG] net: performance regression on ixgbe (Intel 82599EB 10-Gigabit NIC)
On 12/07/2015 03:28 AM, Otto Sabart wrote: Hi Ota, It looks like there were a few changes that went through that could be causing the regression. The most obvious one that jumps out at me is commit 72bfd32d2f84 ("ixgbe: disable LRO by default"). As such one thing you might try doing is turning on LRO support via ethtool -k to see if that is the issue you are seeing. Hi Alex, enabling LRO resolved the problem. So you had the same NIC and CPUs and whatnot on both sides? rick jones -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BUG] net: performance regression on ixgbe (Intel 82599EB 10-Gigabit NIC)
On 12/03/2015 08:26 AM, Otto Sabart wrote:

Hello netdev, I probably found a performance regression on ixgbe (Intel 82599EB 10-Gigabit NIC) on v4.4-rc3. I am able to see this problem since v4.4-rc1. The bug report you can find here [0]. Can somebody take a look at it? [0] https://bugzilla.redhat.com/show_bug.cgi?id=1288124

A few comments/questions based on reading that bug report:

*) It is good to be binding netperf and netserver - it helps with reproducibility - but why the two -T options? A brief look at src/netsh.c suggests it will indeed set the two binding options separately, but that is merely a side-effect of how I wrote the code. It wasn't an intentional thing.

*) Is irqbalance disabled and the IRQs set the same each time, or might there be variability possible there? Each of the five netperf runs will be a different four-tuple, which means each may (or may not) get RSS-hashed/etc differently.

*) It is perhaps adding duct tape to already-present belt and suspenders, but is power management set to a fixed state on the systems involved? (Since this seems to be ProLiant G7s going by the legends on the charts, either static high perf or static low power I would imagine)

*) What is the difference before/after for the service demands? The netperf tests being run are asking for CPU utilization but I don't see the service demand change being summarized.

*) Does a specific CPU on one side or the other saturate? (LOCAL_CPU_PEAK_UTIL, LOCAL_CPU_PEAK_ID, REMOTE_CPU_PEAK_UTIL, REMOTE_CPU_PEAK_ID output selectors)

*) What are the processors involved? Presumably the "other system" is fixed?

*) It is important to remember the socket buffer sizes reported with the default output are *just* what they were when the data socket was created. If you want to see what they became by the end of the test, you need to use the appropriate output selectors (or, IIRC, invoking the tests as "omni" rather than tcp_stream/tcp_maerts will report the end values rather than the start ones).
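As a concrete illustration of adding those peak-utilization output selectors to a classic (non-omni) test, an invocation might look like the following - "remotehost" and the choice of TCP_STREAM are placeholders of mine, not taken from the bug report:

```shell
# Hypothetical invocation: "remotehost" stands in for the system running
# netserver.  The test-specific -O option selects the peak-CPU output
# selectors without needing to invoke the test as "omni".
netperf -H remotehost -t TCP_STREAM -c -C -l 30 -- \
    -O throughput,local_cpu_peak_util,local_cpu_peak_id,remote_cpu_peak_util,remote_cpu_peak_id
```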
happy benchmarking, rick jones
Re: ipsec impact on performance
On 12/02/2015 03:56 AM, David Laight wrote:

From: Sowmini Varadhan Sent: 01 December 2015 18:37 ... I was using esp-null merely to not have the crypto itself perturb the numbers (i.e., just focus on the s/w overhead for now), but here are the numbers for the stock linux kernel stack

                Gbps   peak cpu util
esp-null         1.8       71%
aes-gcm-c-256    1.6       79%
aes-ccm-a-128    0.7       96%

That trend made me think that if we can get esp-null to be as close as possible to GSO/GRO, the rest will follow closely behind.

That's not how I read those figures. They imply to me that there is a massive cost for the actual encryption (particularly for aes-ccm-a-128) - so whatever you do to the esp-null case won't help.

To build on the whole "importance of normalizing throughput and CPU utilization in some way" theme, the following are some non-IPSec netperf TCP_STREAM runs between a pair of 2xIntel E5-2603 v3 systems using Broadcom BCM57810-based NICs, 4.2.0-19 kernel, 7.10.72 firmware and bnx2x driver version 1.710.51-0:

root@htx-scale300-258:~# ./take_numbers.sh
Baseline
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.12.49.1 () port 0 AF_INET : +/-2.500% @ 99% conf.
 : demo : cpu bind

Throughput  Local  Local    Local    Remote  Remote   Remote   Throughput  Local       Remote
            CPU    Service  Peak     CPU     Service  Peak     Confidence  CPU         CPU
            Util   Demand   Per CPU  Util    Demand   Per CPU  Width (%)   Confidence  Confidence
            %               Util %   %                Util %               Width (%)   Width (%)

9414.11     1.87   0.195    26.54    3.70    0.387    45.42    0.002       7.073       1.276

Disable TSO/GSO
5651.25     8.36   1.454    100.00   2.46    0.428    30.35    1.093       1.101       4.889

Disable tx CKO
5287.69     8.46   1.573    100.00   2.34    0.435    29.66    0.428       7.710       3.518

Disable remote LRO/GRO
4148.76     8.32   1.971    99.97    5.95    1.409    71.98    3.656       0.735       3.491

Disable remote rx CKO
4204.49     8.31   1.942    100.00   6.68    1.563    82.05    2.015       0.437       4.921

You can see that as the offloads are disabled, the service demands (usec of CPU time consumed systemwide per KB of data transferred) go up, and until one hits a bottleneck (eg one of the CPUs pegs at 100%), go up faster than the throughputs go down. To aid in reproducibility those tests were with irqbalance disabled, all the IRQs for the NICs pointed at CPU 0, netperf/netserver bound to CPU 0, and the power management set to static high performance.

Assuming I've created a "matching" ipsec.conf, here is what I see with esp=null-null on the TCP_STREAM test - again, keeping all the binding in place etc:

3077.37     8.01   2.560    97.78    8.21    2.625    99.41    4.869       1.876       0.955

You can see that even with the null-null, there is a rather large increase in service demand.

And this is what I see when I run netperf TCP_RR (first is without ipsec, second is with. I didn't ask for confidence intervals this time around and I didn't try to tweak interrupt coalescing settings)

# HDR="-P 1";for i in 10.12.49.1 192.168.0.2; do ./netperf -H $i -t TCP_RR -c -C -l 30 -T 0 $HDR; HDR="-P 0"; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.12.49.1 () port 0 AF_INET : demo : first burst 0 : cpu bind
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.    CPU    CPU    S.dem   S.dem
Send   Recv   Size     Size    Time     Rate      local  remote local   remote
bytes  bytes  bytes    bytes   secs.    per sec   % S    % S    us/Tr   us/Tr
16384  87380  1        1       30.00    30419.75  1.72   1.68   6.783   6.617
16384  87380
16384  87380  1        1       30.00    20711.39  2.15   2.05   12.450  11.882
16384  87380

The service demand increases ~83% on the netperf side and almost 80% on the netserver side. That is pure "effective" path-length increase.

happy benchmarking, rick jones

PS - the netperf commands were variations on this theme:

./netperf -P 0 -T 0 -H 10.12.49.1 -c -C -l 30 -i 30,3 -- -O throughput,local_cpu_util,local_sd,local_cpu_peak_util,remote_cpu_util,remote_sd,remote_cpu_peak_util,throughput_confid,local_cpu_confid,remote_cpu_confid

altering IP address or test as appropriate. -P 0 disables printing the test banner/headers. -T 0 binds netperf and netserver to CPU0 on their respective systems. -H sets the destination, -c and -C ask for local and remote CPU measurements respectively. -l 30 says each test iteration should be 30 seconds long and -i 30,3 says to run at least three iterations and no more than 30 when trying to hit the confidence interval - by default 99% confident the average reported is within +/- 2.5% of the "actual" average. The -O stuff is selecting specific values to be emitted.
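For what it's worth, the service demand figures above can be sanity-checked by hand. Taking the baseline TCP_STREAM row (9414.11 Mbit/s at 1.87% local utilization) and assuming the utilization is spread across 12 CPUs - my assumption for a pair of 6-core E5-2603 v3 processors; netperf itself uses the real count - a little awk reproduces the printed 0.195 usec/KB:

```shell
# Recompute netperf's local service demand (usec of CPU per KB of data)
# from the baseline row above.  ncpus=12 is an assumption, not netperf
# output (2 sockets x 6 cores of E5-2603 v3, no hyperthreading).
awk -v tput_mbps=9414.11 -v util_pct=1.87 -v ncpus=12 'BEGIN {
    kb_per_sec       = tput_mbps * 1e6 / 8 / 1024       # 10^6 bits/s -> KB/s
    cpu_usec_per_sec = (util_pct / 100) * ncpus * 1e6   # CPU-usec consumed per wall second
    printf "%.3f usec/KB\n", cpu_usec_per_sec / kb_per_sec
}'
# prints: 0.195 usec/KB
```

Swapping in the TSO/GSO-disabled row (5651.25 Mbit/s at 8.36%) yields its 1.454 figure the same way.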
Re: ipsec impact on performance
On 12/01/2015 09:59 AM, Sowmini Varadhan wrote:

But these are all still relatively small things - tweaking them doesn't get me significantly past the 3 Gbps limit. Any suggestions on how to make this budge (or design criticism of the patch) would be welcome.

What do the perf profiles show? Presumably, loss of TSO/GSO means an increase in the per-packet costs, but if the ipsec path significantly increases the per-byte costs... Short of a perf profile, I suppose one way to probe for per-packet versus per-byte would be to up the MTU. That should reduce the per-packet costs while keeping the per-byte roughly the same.

You could also compare the likes of a single-byte netperf TCP_RR test between ipsec enabled and not to get an idea of the basic path length differences without TSO/GSO/whatnot muddying the waters.

happy benchmarking, rick jones
Re: ipsec impact on performance
On 12/01/2015 10:45 AM, Sowmini Varadhan wrote: On (12/01/15 10:17), Rick Jones wrote:

What do the perf profiles show? Presumably, loss of TSO/GSO means an increase in the per-packet costs, but if the ipsec path significantly increases the per-byte costs...

For ESP-null, there's actually very little work to do - we just need to add the 8 byte ESP header with an spi and a seq#.. no crypto work to do.. so the overhead *should* be minimal, else we've painted ourselves into a corner where we can't touch anything, including TCP options like md5.

Something of a longshot, but are you sure you are still getting effective CKO/GRO on the receiver?

rick jones
Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)
On 11/24/2015 07:49 AM, Eric Dumazet wrote: But in the end, latencies were bigger, because the application had to copy from kernel to user (read()) the full message in one go. While if you wake up application for every incoming GRO message, we prefill cpu caches, and the last read() only has to copy the remaining part and benefit from hot caches (RFS up2date state, TCP socket structure, but also data in the application) You can see something similar (at least in terms of latency) when messing about with MTU sizes. For some message sizes - 8KB being a popular one - you will see higher latency on the likes of netperf TCP_RR with JumboFrames than you would with the standard 1500 byte MTU. Something I saw on GbE links years back anyway. I chalked it up to getting better parallelism between the NIC and the host. Of course the service demands were lower with JumboFrames... rick jones
Re: [PATCH net-next RFC 2/2] vhost_net: basic polling support
On 10/22/2015 02:33 AM, Michael S. Tsirkin wrote: On Thu, Oct 22, 2015 at 01:27:29AM -0400, Jason Wang wrote:

This patch tries to poll for newly added tx buffers for a while at the end of tx processing. The maximum time spent on polling was limited through a module parameter. To avoid blocking rx, the loop will end when there are other works queued on vhost, so in fact the socket receive queue is also polled. busyloop_timeout = 50 gives us the following improvement on TCP_RR test:

size/session/+thu%/+normalize%
 1/ 1/  +5%/ -20%
 1/50/ +17%/  +3%

Is there a measurable increase in cpu utilization with busyloop_timeout = 0?

And since a netperf TCP_RR test is involved, be careful about what netperf reports for CPU util if that increase isn't in the context of the guest OS. For completeness, looking at the effect on TCP_STREAM and TCP_MAERTS, aggregate _RR and even aggregate _RR/packets per second for many VMs on the same system would be in order.

happy benchmarking, rick jones
Re: list of all network namespaces
On 09/16/2015 05:46 PM, Ani Sinha wrote: Hi guys just a stupid question. Is it possible to get a list of all active network namespaces in the kernel through /proc or some other interface? Presumably you could copy what "ip netns" does, which appears to be to look in /var/run/netns . At least that is what an strace of that command suggests. rick jones
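A quick sketch of that observation - this just mimics the directory listing "ip netns list" appears to do; a scratch directory stands in for /var/run/netns here so the sketch runs unprivileged:

```shell
# Mimic what "ip netns list" appears to do: enumerate the files under
# /var/run/netns.  A scratch directory with two stand-in namespace names
# is used so this can run without root; on a real system list
# /var/run/netns itself.
NETNS_DIR=$(mktemp -d)
touch "$NETNS_DIR/blue" "$NETNS_DIR/red"   # stand-ins for two named namespaces
ls -1 "$NETNS_DIR"
rm -rf "$NETNS_DIR"
```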
vethpair creation performance, 3.14 versus 4.2.0
On 08/29/2015 10:59 PM, Raghavendra K T wrote:
> Please note that similar overhead was also reported while creating
> veth pairs https://lkml.org/lkml/2013/3/19/556

That got me curious, so I took the veth pair creation script from there, and started running it out to 10K pairs, comparing a 3.14.44 kernel with a 4.2.0-rc4+ from net-next and then net-next after pulling to get the snmp stat aggregation perf change (4.2.0-rc8+). Indeed, the 4.2.0-rc8+ kernel with the change was faster than the 4.2.0-rc4+ kernel without it, but both were slower than the 3.14.44 kernel. I've put a spreadsheet with the results at:

ftp://ftp.netperf.org/vethpair/vethpair_compare.ods

A perf top for the 4.2.0-rc8+ kernel from the net-next tree looks like this out around 10K pairs:

   PerfTop:   11155 irqs/sec  kernel:94.2%  exact:  0.0% [4000Hz cycles],  (all, 32 CPUs)
---
    23.44%  [kernel]  [k] vsscanf
     7.32%  [kernel]  [k] mutex_spin_on_owner.isra.4
     5.63%  [kernel]  [k] __memcpy
     5.27%  [kernel]  [k] __dev_alloc_name
     3.46%  [kernel]  [k] format_decode
     3.44%  [kernel]  [k] vsnprintf
     3.16%  [kernel]  [k] acpi_os_write_port
     2.71%  [kernel]  [k] number.isra.13
     1.50%  [kernel]  [k] strncmp
     1.21%  [kernel]  [k] _parse_integer
     0.93%  [kernel]  [k] filemap_map_pages
     0.82%  [kernel]  [k] put_dec_trunc8
     0.82%  [kernel]  [k] unmap_single_vma
     0.78%  [kernel]  [k] native_queued_spin_lock_slowpath
     0.71%  [kernel]  [k] menu_select
     0.65%  [kernel]  [k] clear_page
     0.64%  [kernel]  [k] _raw_spin_lock
     0.62%  [kernel]  [k] page_fault
     0.60%  [kernel]  [k] find_busiest_group
     0.53%  [kernel]  [k] snprintf
     0.52%  [kernel]  [k] int_sqrt
     0.46%  [kernel]  [k] simple_strtoull
     0.44%  [kernel]  [k] page_remove_rmap

My attempts to get a call-graph have been met with very limited success. Even though I've installed the dbg package from "make deb-pkg" the symbol resolution doesn't seem to be working.
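For reference, the creation loop being timed is roughly this shape - the exact script behind the lkml link may differ, and this sketch is guarded by a RUN=echo prefix since actually creating the pairs needs CAP_NET_ADMIN:

```shell
# Rough shape of a veth-pair creation loop (the script referenced above
# may differ in detail).  RUN=echo makes this a dry run that just prints
# the commands; clear RUN to really create the pairs (needs CAP_NET_ADMIN),
# and raise the seq bound toward 10000 to reproduce the timing runs.
RUN=echo
for i in $(seq 1 3); do
    $RUN ip link add veth$i type veth peer name veth${i}p
done
```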
happy benchmarking, rick jones
Re: vethpair creation performance, 3.14 versus 4.2.0
On 08/31/2015 02:29 PM, David Ahern wrote: On 8/31/15 1:48 PM, Rick Jones wrote: My attempts to get a call-graph have been met with very limited success. Even though I've installed the dbg package from "make deb-pkg" the symbol resolution doesn't seem to be working. Looks like Debian does not enable framepointers by default: $ grep FRAME /boot/config-3.2.0-4-amd64 ... # CONFIG_FRAME_POINTER is not set Similar result for jessie. And indeed, my config file has a Debian lineage. rick
Re: Low throughput in VMs using VxLAN
On 08/24/2015 09:19 AM, Santosh R wrote: Hi, Earlier I was seeing lower throughput in VMs using VxLan as GRO was not happening in VM. Tom Herbert suggested to use vxlan: GRO support at tunnel layer patch series. With today's net-next (4.2.0-rc7) in host and VM, I could see GRO happening for vxlan, macvtap and virtual interface in VM. The throughput is still low between VMs (around 4Gbps compared to 9Gbps without VxLAN). Out of curiosity, have you tried tweaking gro_flush_timeout (gro_flush_interval?) for the VMs eth interface? Say perhaps a value of 1000? (I'm assuming the VM is using virtio_net) Does the behaviour change if vhost-net is loaded into the host and used by the VM? rick jones For completeness, it would also be good to compare the likes of netperf TCP_RR between VxLAN and without.
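For the record, the tweak being suggested would look something like the following inside the VM - 1000 is just the value floated above, not a recommendation, eth0 is assumed to be the VM's virtio_net interface, and the sysfs attribute requires root and a kernel recent enough to expose it:

```shell
# gro_flush_timeout lives under the interface's sysfs node; writing it
# requires root.  eth0 here is an assumption about the VM's interface name.
echo 1000 > /sys/class/net/eth0/gro_flush_timeout
```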
Re: [PATCH v2 net-next] documentation: bring vxlan documentation more up-to-date
On 08/12/2015 04:46 PM, David Miller wrote: From: r...@tardy.usa.hp.com (Rick Jones) Date: Wed, 12 Aug 2015 10:23:14 -0700 (PDT) From: Rick Jones rick.jon...@hp.com A few things have changed since the previous version of the vxlan documentation was written, so update it and correct some grammer and such while we are at it. Signed-off-by: Rick Jones rick.jon...@hp.com Applied with grammar misspelling fixed. Thanks. rick
[PATCH v2 net-next] documentation: bring vxlan documentation more up-to-date
From: Rick Jones rick.jon...@hp.com

A few things have changed since the previous version of the vxlan documentation was written, so update it and correct some grammer and such while we are at it.

Signed-off-by: Rick Jones rick.jon...@hp.com
---
v2: Stephen Hemminger feedback to include dstport 4789 in command line example. Also some further refinements from other sources.

diff --git a/Documentation/networking/vxlan.txt b/Documentation/networking/vxlan.txt
index 6d99351..89ee11b 100644
--- a/Documentation/networking/vxlan.txt
+++ b/Documentation/networking/vxlan.txt
@@ -1,32 +1,36 @@
 Virtual eXtensible Local Area Networking documentation
 ==
-The VXLAN protocol is a tunnelling protocol that is designed to
-solve the problem of limited number of available VLAN's (4096).
-With VXLAN identifier is expanded to 24 bits.
-
-It is a draft RFC standard, that is implemented by Cisco Nexus,
-Vmware and Brocade. The protocol runs over UDP using a single
-destination port (still not standardized by IANA).
-This document describes the Linux kernel tunnel device,
-there is also an implantation of VXLAN for Openvswitch.
-
-Unlike most tunnels, a VXLAN is a 1 to N network, not just point
-to point. A VXLAN device can either dynamically learn the IP address
-of the other end, in a manner similar to a learning bridge, or the
-forwarding entries can be configured statically.
-
-The management of vxlan is done in a similar fashion to it's
-too closest neighbors GRE and VLAN. Configuring VXLAN requires
-the version of iproute2 that matches the kernel release
-where VXLAN was first merged upstream.
+The VXLAN protocol is a tunnelling protocol designed to solve the
+problem of limited VLAN IDs (4096) in IEEE 802.1q. With VXLAN the
+size of the identifier is expanded to 24 bits (16777216).
+
+VXLAN is described by IETF RFC 7348, and has been implemented by a
+number of vendors. The protocol runs over UDP using a single
+destination port. This document describes the Linux kernel tunnel
+device, there is also a separate implementation of VXLAN for
+Openvswitch.
+
+Unlike most tunnels, a VXLAN is a 1 to N network, not just point to
+point. A VXLAN device can learn the IP address of the other endpoint
+either dynamically in a manner similar to a learning bridge, or make
+use of statically-configured forwarding entries.
+
+The management of vxlan is done in a manner similar to its two closest
+neighbors GRE and VLAN. Configuring VXLAN requires the version of
+iproute2 that matches the kernel release where VXLAN was first merged
+upstream.

 1. Create vxlan device
-  # ip li add vxlan0 type vxlan id 42 group 239.1.1.1 dev eth1
-
-This creates a new device (vxlan0). The device uses the
-the multicast group 239.1.1.1 over eth1 to handle packets where
-no entry is in the forwarding table.
+  # ip link add vxlan0 type vxlan id 42 group 239.1.1.1 dev eth1 dstport 4789
+
+This creates a new device named vxlan0. The device uses the multicast
+group 239.1.1.1 over eth1 to handle traffic for which there is no
+entry in the forwarding table. The destination port number is set to
+the IANA-assigned value of 4789. The Linux implementation of VXLAN
+pre-dates the IANA's selection of a standard destination port number
+and uses the Linux-selected value by default to maintain backwards
+compatibility.

 2. Delete vxlan device
   # ip link delete vxlan0
[PATCH net-next] documentation: bring vxlan documentation more up-to-date
From: Rick Jones rick.jon...@hp.com

A few things have changed since the previous version of the vxlan documentation was written, so update it and correct some grammer and such while we are at it.

Signed-off-by: Rick Jones rick.jon...@hp.com

diff --git a/Documentation/networking/vxlan.txt b/Documentation/networking/vxlan.txt
index 6d99351..4126031 100644
--- a/Documentation/networking/vxlan.txt
+++ b/Documentation/networking/vxlan.txt
@@ -1,32 +1,38 @@
 Virtual eXtensible Local Area Networking documentation
 ==
-The VXLAN protocol is a tunnelling protocol that is designed to
-solve the problem of limited number of available VLAN's (4096).
-With VXLAN identifier is expanded to 24 bits.
+The VXLAN protocol is a tunnelling protocol that is designed to solve
+the problem of the limited number of available VLAN IDs (4096) in IEEE
+802.1q. With VXLAN the size of the identifier is expanded to 24 bits
+(16777216).

-It is a draft RFC standard, that is implemented by Cisco Nexus,
-Vmware and Brocade. The protocol runs over UDP using a single
-destination port (still not standardized by IANA).
-This document describes the Linux kernel tunnel device,
-there is also an implantation of VXLAN for Openvswitch.
+VXLAN is described by IETF RFC 7348, and has been implemented by a
+number of vendors. The protocol runs over UDP using a single
+destination port. This document describes the Linux kernel tunnel
+device, there is also a separate implementation of VXLAN for
+Openvswitch.

 Unlike most tunnels, a VXLAN is a 1 to N network, not just point
 to point. A VXLAN device can either dynamically learn the IP address
 of the other end, in a manner similar to a learning bridge, or the
 forwarding entries can be configured statically.

-The management of vxlan is done in a similar fashion to it's
-too closest neighbors GRE and VLAN. Configuring VXLAN requires
-the version of iproute2 that matches the kernel release
-where VXLAN was first merged upstream.
+The management of vxlan is done in a similar fashion to its two
+closest neighbors GRE and VLAN. Configuring VXLAN requires the version
+of iproute2 that matches the kernel release where VXLAN was first
+merged upstream.

 1. Create vxlan device
-  # ip li add vxlan0 type vxlan id 42 group 239.1.1.1 dev eth1
-
-This creates a new device (vxlan0). The device uses the
-the multicast group 239.1.1.1 over eth1 to handle packets where
-no entry is in the forwarding table.
+  # ip link add vxlan0 type vxlan id 42 group 239.1.1.1 dev eth1
+
+This creates a new device named vxlan0. The device uses the
+multicast group 239.1.1.1 over eth1 to handle traffic for which there
+is no entry is in the forwarding table. The Linux implementation of
+VXLAN pre-dates the IANA's selection of a standard destination port
+number and uses the Linux-selected value by default to maintain
+backwards compatibility. If you wish to use the IANA-assigned
+destination port number of 4789 you can add dstport 4789 to the
+command line above.

 2. Delete vxlan device
   # ip link delete vxlan0