Re: Netperf UDP issue with connected sockets
From: Rick Jones
Date: Wed, 30 Nov 2016 09:42:40 -0800

> And indeed, based on a quick check, send() is what is being called,
> though it seems it becomes a sendto() system call - with the
> destination information NULL:
>
> write(1, "send\n", 5) = 5
> sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024
> write(1, "send\n", 5) = 5
> sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024
>
> So I'm not sure what might be going-on there.

It's because of glibc's implementation of send(), which is:

  ssize_t
  __libc_send (int sockfd, const void *buffer, size_t len, int flags)
  {
    return SYSCALL_CANCEL (sendto, sockfd, buffer, len, flags, NULL, 0);
  }
  strong_alias (__libc_send, __send)
  weak_alias (__libc_send, send)
Re: Netperf UDP issue with connected sockets
On 11/30/2016 02:43 AM, Jesper Dangaard Brouer wrote:
> Notice the "fib_lookup" cost is still present, even when I use option
> "-- -n -N" to create a connected socket. As Eric taught us, this is
> because we should use syscalls "send" or "write" on a connected socket.

In theory, once the data socket is connected, the send_data() call in
src/nettest_omni.c is supposed to use send() rather than sendto(). And
indeed, based on a quick check, send() is what is being called, though
it seems it becomes a sendto() system call - with the destination
information NULL:

 write(1, "send\n", 5) = 5
 sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024
 write(1, "send\n", 5) = 5
 sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024

So I'm not sure what might be going-on there. You can get netperf to
use write() instead of send() by adding a test-specific -I option.

happy benchmarking,
rick

> My udp_flood tool[1] cycles through the different syscalls:
>
>  taskset -c 2 ~/git/network-testing/src/udp_flood 198.18.50.1 --count $((10**7)) --pmtu 2
>
>               ns/pkt         pps  cycles/pkt
>  send         473.08  2113816.28        1891
>  sendto       558.58  1790265.84        2233
>  sendmsg      587.24  1702873.80        2348
>  sendMmsg/32  547.57  1826265.90        2189
>  write        518.36  1929175.52        2072
>
> Using "send" seems to be the fastest option.
>
> Some notes on test: I've forced TX completions to happen on another
> CPU (CPU0) and pinned the udp_flood program (to CPU2), as I want to
> avoid the CPU scheduler moving udp_flood around, as this causes
> fluctuations in the results (as it stresses the memory allocations
> more).
>
> My udp_flood --pmtu option is documented in the --help usage text
> (see below signature)
Re: Netperf UDP issue with connected sockets
On Mon, 28 Nov 2016 10:33:49 -0800 Rick Jones wrote:

> On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote:
> >> time to try IP_MTU_DISCOVER ;)
> >
> > To Rick, maybe you can find a good solution or option with Eric's hint,
> > to send appropriate sized UDP packets with Don't Fragment (DF).
>
> Jesper -
>
> Top of trunk has a change adding an omni, test-specific -f option which
> will set IP_MTU_DISCOVER:IP_PMTUDISC_DO on the data socket.  Is that
> sufficient to your needs?

The "-- -f" option makes the __ip_select_ident lookup go away. So,
confirming your new option works.

Notice the "fib_lookup" cost is still present, even when I use option
"-- -n -N" to create a connected socket. As Eric taught us, this is
because we should use syscalls "send" or "write" on a connected socket.

My udp_flood tool[1] cycles through the different syscalls:

 taskset -c 2 ~/git/network-testing/src/udp_flood 198.18.50.1 --count $((10**7)) --pmtu 2

              ns/pkt         pps  cycles/pkt
 send         473.08  2113816.28        1891
 sendto       558.58  1790265.84        2233
 sendmsg      587.24  1702873.80        2348
 sendMmsg/32  547.57  1826265.90        2189
 write        518.36  1929175.52        2072

Using "send" seems to be the fastest option.

Some notes on test: I've forced TX completions to happen on another CPU
(CPU0) and pinned the udp_flood program (to CPU2), as I want to avoid
the CPU scheduler moving udp_flood around, as this causes fluctuations
in the results (as it stresses the memory allocations more).
My udp_flood --pmtu option is documented in the --help usage text (see
below signature)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

$ uname -a
Linux canyon 4.9.0-rc6-page_pool07-baseline+ #185 SMP PREEMPT Wed Nov 30 10:07:51 CET 2016 x86_64

[1] udp_flood tool:
 https://github.com/netoptimizer/network-testing/blob/master/src/udp_flood.c

Quick command used for verifying __ip_select_ident is removed:

 # First run benchmark
 sudo perf record -g -a ~/tools/netperf2-svn/src/netperf -H 198.18.50.1 \
   -t UDP_STREAM -l 3 -- -m 1472 -f

 # Second grep in perf output for functions
 sudo perf report --no-children --call-graph none --stdio | \
   egrep -e '__ip_select_ident|fib_table_lookup'

$ ./udp_flood --help
DOCUMENTATION:
 This tool is a UDP flood that measures the outgoing packet rate.
 Default cycles through tests with different send system calls.
 What function-call to invoke can also be specified as a command
 line option (see below).
 Default transmit 100 packets per test, adjust via --count

Usage: ./udp_flood (options-see-below) IPADDR
Listing options:
  --help      short-option: -h
  --ipv4      short-option: -4
  --ipv6      short-option: -6
  --sendmsg   short-option: -u
  --sendmmsg  short-option: -U
  --sendto    short-option: -t
  --write     short-option: -T
  --send      short-option: -S
  --batch     short-option: -b
  --count     short-option: -c
  --port      short-option: -p
  --payload   short-option: -m
  --pmtu      short-option: -d
  --verbose   short-option: -v

Multiple tests can be selected:
  default: all tests
  -u -U -t -T -S: run any combination of sendmsg/sendmmsg/sendto/write/send

Option --pmtu for Path MTU discover socket option IP_MTU_DISCOVER
This affects the DF (Don't-Fragment) bit setting.
Following values are selectable:
  0 = IP_PMTUDISC_DONT
  1 = IP_PMTUDISC_WANT
  2 = IP_PMTUDISC_DO
  3 = IP_PMTUDISC_PROBE
  4 = IP_PMTUDISC_INTERFACE
  5 = IP_PMTUDISC_OMIT
Documentation see under IP_MTU_DISCOVER in 'man 7 ip'
Re: Netperf UDP issue with connected sockets
On 11/28/2016 10:33 AM, Rick Jones wrote:
> On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote:
> >> time to try IP_MTU_DISCOVER ;)
> >
> > To Rick, maybe you can find a good solution or option with Eric's hint,
> > to send appropriate sized UDP packets with Don't Fragment (DF).
>
> Jesper -
>
> Top of trunk has a change adding an omni, test-specific -f option which
> will set IP_MTU_DISCOVER:IP_PMTUDISC_DO on the data socket.  Is that
> sufficient to your needs?

Usage examples:

raj@tardy:~/netperf2_trunk/src$ ./netperf -t UDP_STREAM -l 1 -H raj-folio.americas.hpqcorp.net -- -m 1472 -f
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to raj-folio.americas.hpqcorp.net () port 0 AF_INET
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1472   1.00       77495      0      912.35
212992           1.00       77495             912.35

[1]+  Done                    emacs nettest_omni.c
raj@tardy:~/netperf2_trunk/src$ ./netperf -t UDP_STREAM -l 1 -H raj-folio.americas.hpqcorp.net -- -m 14720 -f
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to raj-folio.americas.hpqcorp.net () port 0 AF_INET
send_data: data send error: Message too long (errno 90)
netperf: send_omni: send_data failed: Message too long

happy benchmarking,
rick jones
Re: Netperf UDP issue with connected sockets
On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote:
> > time to try IP_MTU_DISCOVER ;)
>
> To Rick, maybe you can find a good solution or option with Eric's hint,
> to send appropriate sized UDP packets with Don't Fragment (DF).

Jesper -

Top of trunk has a change adding an omni, test-specific -f option which
will set IP_MTU_DISCOVER:IP_PMTUDISC_DO on the data socket.  Is that
sufficient to your needs?

happy benchmarking,
rick
Re: Netperf UDP issue with connected sockets
On Mon, 2016-11-21 at 17:03 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 17 Nov 2016 10:51:23 -0800 Eric Dumazet wrote:
>
> > On Thu, 2016-11-17 at 19:30 +0100, Jesper Dangaard Brouer wrote:
> >
> > > The point is I can see a socket Send-Q forming, thus we do know the
> > > application has something to send. Thus, a possibility for
> > > non-opportunistic bulking. Allowing/implementing bulk enqueue from
> > > socket layer into qdisc layer should be fairly simple (and the rest
> > > of xmit_more is already in place).
> >
> > As I said, you are fooled by TX completions.
>
> Obviously TX completions play a role, yes, and I bet I can adjust the
> TX completion to cause xmit_more to happen, at the expense of
> introducing added latency.
>
> The point is the "bloated" spinlock in __dev_queue_xmit is still caused
> by the MMIO tailptr/doorbell. The added cost occurs when enqueueing
> packets, and results in the inability to get enough packets into the
> qdisc to get xmit_more going (on my system). I argue that a bulk
> enqueue API would allow us to get past the hurdle of transitioning into
> xmit_more mode more easily.

This is very nice, but we already have bulk enqueue, it is called
xmit_more.

The kernel does not know your application is sending a packet after the
one you send.

xmit_more is not often used when applications/stacks send many small
packets: the qdisc is empty (one enqueued packet is immediately
dequeued, so skb->xmit_more is 0), and even bypassed (TCQ_F_CAN_BYPASS).

Not sure if this has been tried before, but the doorbell avoidance
could be done by the driver itself, because it knows a TX completion
will come shortly (well... if softirqs are not delayed too much!)

Doorbell would be forced only if:

 ("skb->xmit_more is not set" AND "TX engine is not 'started yet'")
OR
 (too many [1] packets were put in TX ring buffer, no point deferring more)

Start the pump, but once it is started, let the doorbells be done by
TX completion.
ndo_start_xmit and the TX completion handler would have to maintain a
shared state describing whether packets were ready but the doorbell was
deferred.

Note that TX completion means "at least one packet was drained",
otherwise busy polling, constantly calling napi->poll(), would force a
doorbell too soon for devices sharing a NAPI for both RX and TX.

But then, maybe busy poll would like to force a doorbell...

I could try these ideas on mlx4 shortly.

[1] limit could be derived from active "ethtool -c" params, eg tx-frames
Re: Netperf UDP issue with connected sockets
On Thu, 17 Nov 2016 10:51:23 -0800 Eric Dumazet wrote:

> On Thu, 2016-11-17 at 19:30 +0100, Jesper Dangaard Brouer wrote:
>
> > The point is I can see a socket Send-Q forming, thus we do know the
> > application has something to send. Thus, a possibility for
> > non-opportunistic bulking. Allowing/implementing bulk enqueue from
> > socket layer into qdisc layer should be fairly simple (and the rest
> > of xmit_more is already in place).
>
> As I said, you are fooled by TX completions.

Obviously TX completions play a role, yes, and I bet I can adjust the
TX completion to cause xmit_more to happen, at the expense of
introducing added latency.

The point is the "bloated" spinlock in __dev_queue_xmit is still caused
by the MMIO tailptr/doorbell. The added cost occurs when enqueueing
packets, and results in the inability to get enough packets into the
qdisc to get xmit_more going (on my system). I argue that a bulk
enqueue API would allow us to get past the hurdle of transitioning into
xmit_more mode more easily.

> Please make sure to increase the sndbuf limits !
>
> echo 2129920 >/proc/sys/net/core/wmem_default

Testing with this makes no difference.

 $ grep -H . /proc/sys/net/core/wmem_default
 /proc/sys/net/core/wmem_default:2129920

> lpaa23:~# sar -n DEV 1 10|grep eth1
>             IFACE   rxpck/s    txpck/s  rxkB/s     txkB/s  rxcmp/s  txcmp/s  rxmcst/s
> 10:49:25     eth1      7.00 9273283.00    0.61 2187214.90     0.00     0.00      0.00
> 10:49:26     eth1      1.00 9230795.00    0.06 2176787.57     0.00     0.00      1.00
> 10:49:27     eth1      2.00 9247906.00    0.17 2180915.45     0.00     0.00      0.00
> 10:49:28     eth1      3.00 9246542.00    0.23 2180790.38     0.00     0.00      1.00
> Average:     eth1      2.50 9018045.70    0.25 2126893.82     0.00     0.00      0.50

Very impressive numbers, 9.2 Mpps TX. What is this test? What kind of
traffic? Multiple CPUs?

> lpaa23:~# ethtool -S eth1|grep more; sleep 1; ethtool -S eth1|grep more
>      xmit_more: 2251366909
>      xmit_more: 2256011392
>
> lpaa23:~# echo 2256011392-2251366909 | bc
> 4644483

The xmit_more definitely works on your system, but I cannot get it to
"kick-in" on my setup. Once xmit_more is active, the "bloated" spinlock
problem should go away. (Tests with "udp_flood --pmtu 3 --send")

Forcing TX completion to happen on the same CPU, no xmit_more:

 ~/git/network-testing/bin/ethtool_stats.pl --sec 2 --dev mlx5p2
 Show adapter(s) (mlx5p2) statistics (ONLY that changed!)
 Ethtool(mlx5p2) stat: 104592908 (104,592,908) <= tx0_bytes /sec
 Ethtool(mlx5p2) stat:     39059 (     39,059) <= tx0_nop /sec
 Ethtool(mlx5p2) stat:   1743215 (  1,743,215) <= tx0_packets /sec
 Ethtool(mlx5p2) stat: 104719986 (104,719,986) <= tx_bytes /sec
 Ethtool(mlx5p2) stat: 111774540 (111,774,540) <= tx_bytes_phy /sec
 Ethtool(mlx5p2) stat:   1745333 (  1,745,333) <= tx_csum_partial /sec
 Ethtool(mlx5p2) stat:   1745333 (  1,745,333) <= tx_packets /sec
 Ethtool(mlx5p2) stat:   1746477 (  1,746,477) <= tx_packets_phy /sec
 Ethtool(mlx5p2) stat: 111483434 (111,483,434) <= tx_prio1_bytes /sec
 Ethtool(mlx5p2) stat:   1741928 (  1,741,928) <= tx_prio1_packets /sec

Forcing TX completion to happen on a remote CPU, some xmit_more:

 Show adapter(s) (mlx5p2) statistics (ONLY that changed!)
 Ethtool(mlx5p2) stat: 128485892 (128,485,892) <= tx0_bytes /sec
 Ethtool(mlx5p2) stat:     31840 (     31,840) <= tx0_nop /sec
 Ethtool(mlx5p2) stat:   2141432 (  2,141,432) <= tx0_packets /sec
 Ethtool(mlx5p2) stat:       350 (        350) <= tx0_xmit_more /sec
 Ethtool(mlx5p2) stat: 128486459 (128,486,459) <= tx_bytes /sec
 Ethtool(mlx5p2) stat: 137052191 (137,052,191) <= tx_bytes_phy /sec
 Ethtool(mlx5p2) stat:   2141441 (  2,141,441) <= tx_csum_partial /sec
 Ethtool(mlx5p2) stat:   2141441 (  2,141,441) <= tx_packets /sec
 Ethtool(mlx5p2) stat:   2141441 (  2,141,441) <= tx_packets_phy /sec
 Ethtool(mlx5p2) stat: 137051300 (137,051,300) <= tx_prio1_bytes /sec
 Ethtool(mlx5p2) stat:   2141427 (  2,141,427) <= tx_prio1_packets /sec
 Ethtool(mlx5p2) stat:       350 (        350) <= tx_xmit_more /sec

> PerfTop: 76969 irqs/sec  kernel:96.6%  exact: 100.0% [4000Hz cycles:pp], (all, 48 CPUs)
> ---
>   11.64%  [kernel]  [k] skb_set_owner_w
>    6.21%  [kernel]  [k] queued_spin_lock_slowpath
>    4.76%  [kernel]  [k] _raw_spin_lock
>    4.40%  [kernel]  [k] __ip_make_skb
>    3.10%  [kernel]  [k] sock_wfree
>    2.87%  [kernel]  [k] ipt_do_table
>    2.76%  [kernel]  [k] fq_dequeue
Re: Netperf UDP issue with connected sockets
On Thu, 17 Nov 2016 13:44:02 -0800 Eric Dumazet wrote:

> On Thu, 2016-11-17 at 22:19 +0100, Jesper Dangaard Brouer wrote:
>
> > Maybe you can share your udp flood "udpsnd" program source?
>
> Very ugly. This is based on what I wrote when tracking the UDP v6
> checksum bug (4f2e4ad56a65f3b7d64c258e373cb71e8d2499f4 net: mangle zero
> checksum in skb_checksum_help()), because netperf sends the same message
> over and over...

Thanks a lot, hope you don't mind; I added the code to my github repo:
 https://github.com/netoptimizer/network-testing/blob/master/src/udp_snd.c

So I identified the difference, and the reason behind the route lookups.
Your program is using send() and I was using sendmsg(). Given udp_flood
is designed to test different calls, I simply added --send as a new
possibility.
 https://github.com/netoptimizer/network-testing/commit/16166c2cd1fa8

If I use --write instead, then I can also avoid the fib_table_lookup
and __ip_route_output_key_hash calls.

> Use -d 2 to remove the ip_idents_reserve() overhead.
#define IP_PMTUDISC_DO 2 /* Always DF */

Added a --pmtu option to my udp_flood program
 https://github.com/netoptimizer/network-testing/commit/23a78caf4bb5b

(Note: the original email's #include lines were stripped by the list
archive; the headers below are the ones the program needs.)

> #define _GNU_SOURCE
>
> #include <errno.h>
> #include <error.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/socket.h>
> #include <netinet/in.h>
> #include <arpa/inet.h>
>
> char buffer[1400];
>
> int main(int argc, char** argv) {
>     int fd, i;
>     struct sockaddr_in6 addr;
>     char *host = "2002:af6:798::1";
>     int family = AF_INET6;
>     int discover = -1;
>
>     while ((i = getopt(argc, argv, "4H:d:")) != -1) {
>         switch (i) {
>         case 'H': host = optarg; break;
>         case '4': family = AF_INET; break;
>         case 'd': discover = atoi(optarg); break;
>         }
>     }
>     fd = socket(family, SOCK_DGRAM, 0);
>     if (fd < 0)
>         error(1, errno, "failed to create socket");
>     if (discover != -1)
>         setsockopt(fd, SOL_IP, IP_MTU_DISCOVER,
>                    &discover, sizeof(discover));
>
>     memset(&addr, 0, sizeof(addr));
>     if (family == AF_INET6) {
>         addr.sin6_family = AF_INET6;
>         addr.sin6_port = htons(9);
>         inet_pton(family, host, (void *)&addr.sin6_addr.s6_addr);
>     } else {
>         struct sockaddr_in *in = (struct sockaddr_in *)&addr;
>         in->sin_family = family;
>         in->sin_port = htons(9);
>         inet_pton(family, host, &in->sin_addr);
>     }
>     connect(fd, (struct sockaddr *)&addr,
>             (family == AF_INET6) ? sizeof(addr) :
>                                    sizeof(struct sockaddr_in));
>     memset(buffer, 1, 1400);
>     for (i = 0; i < 65536; i++) {
>         memcpy(buffer, &i, sizeof(i));
>         send(fd, buffer, 100 + rand() % 200, 0);

Using send() avoids the fib_table_lookup, on a connected UDP socket.

>     }
>     return 0;
> }

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
Re: Netperf UDP issue with connected sockets
On 11/17/2016 04:37 PM, Julian Anastasov wrote:
> On Thu, 17 Nov 2016, Rick Jones wrote:
>
> > raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf -F
> > src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472
> >
> > ...
> >
> > socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
> > getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
> > getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
> > setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> > setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0
>
> A connected socket can benefit from the dst cached in the socket, but
> not if SO_DONTROUTE is set. If we do not want to send packets via a
> gateway, this -l 1 should help, but I don't see an IP_TTL setsockopt in
> your first example with connect() to 127.0.0.1. Also, maybe there can
> be another default: if -l is used to specify TTL, then SO_DONTROUTE
> should not be set. I.e. we should avoid SO_DONTROUTE, if possible.

The global -l option specifies the duration of the test. It doesn't
specify the TTL of the IP datagrams being generated by the actions of
the test.

I resisted setting SO_DONTROUTE for a number of years after the first
instance of UDP_STREAM being used in link up/down testing took-out a
company's network (including security camera feeds to galactic HQ), but
at this point I'm likely to keep it in there, because there ended-up
being a second such incident.

It is set only for UDP_STREAM. It isn't set for UDP_RR or TCP_*. And
for UDP_STREAM it can be overridden by the test-specific -R option.

happy benchmarking,
rick jones
Re: Netperf UDP issue with connected sockets
Hello,

On Thu, 17 Nov 2016, Rick Jones wrote:

> raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf -F
> src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472
>
> ...
>
> socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
> getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
> getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
> setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0

A connected socket can benefit from the dst cached in the socket, but
not if SO_DONTROUTE is set. If we do not want to send packets via a
gateway, this -l 1 should help, but I don't see an IP_TTL setsockopt in
your first example with connect() to 127.0.0.1. Also, maybe there can
be another default: if -l is used to specify TTL, then SO_DONTROUTE
should not be set. I.e. we should avoid SO_DONTROUTE, if possible.

Regards

--
Julian Anastasov
Re: Netperf UDP issue with connected sockets
On 11/17/2016 01:44 PM, Eric Dumazet wrote:
> because netperf sends the same message over and over...

Well, sort of, by default. That can be altered to a degree.

The global -F option should cause netperf to fill the buffers in its
send ring with data from the specified file. The number of buffers in
the send ring can be controlled via the global -W option. The number of
elements in the ring will default to one more than the initial
SO_SNDBUF size divided by the send size.

raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf -F src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472

...

socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0
setsockopt(4, SOL_IP, IP_RECVERR, [1], 4) = 0
open("src/nettest_omni.c", O_RDONLY) = 5
fstat(5, {st_dev=makedev(8, 2), st_ino=82075297, st_mode=S_IFREG|0664, st_nlink=1, st_uid=1000, st_gid=1000, st_blksize=4096, st_blocks=456, st_size=230027, st_atime=2016/11/16-09:49:29, st_mtime=2016/11/16-09:49:24, st_ctime=2016/11/16-09:49:24}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3099f62000
read(5, "#ifdef HAVE_CONFIG_H\n#include "...
rt_sigaction(SIGALRM, {0x402ea6, [ALRM], SA_RESTORER|SA_INTERRUPT, 0x7f30994a7cb0}, NULL, 8) = 0
rt_sigaction(SIGINT, {0x402ea6, [INT], SA_RESTORER|SA_INTERRUPT, 0x7f30994a7cb0}, NULL, 8) = 0
alarm(1) = 0
sendto(4, "#ifdef HAVE_CONFIG_H\n#include "..., 1472, 0, {sa_family=AF_INET, sin_port=htons(58088), sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, " used\\n\\\n-m local,remote S"..., 1472, 0, {sa_family=AF_INET, sin_port=htons(58088), sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, " do here but clear the legacy fl"..., 1472, 0, {sa_family=AF_INET, sin_port=htons(58088), sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, "e before we scan the test-specif"..., 1472, 0, {sa_family=AF_INET, sin_port=htons(58088), sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, "\n\n\tfprintf(where,\n\t\ttput_fmt_1_l"..., 1472, 0, {sa_family=AF_INET, sin_port=htons(58088), sin_addr=inet_addr("127.0.0.1")}, 16) = 1472

Of course, it will continue to send the same messages from the
send_ring over and over, instead of putting different data into the
buffers each time, but if one has a sufficiently large -W option
specified...

happy benchmarking,
rick jones
Re: Netperf UDP issue with connected sockets
On Thu, Nov 17, 2016 at 9:34 AM, David Laight wrote:
> From: Jesper Dangaard Brouer
>> Sent: 17 November 2016 14:58
>> On Thu, 17 Nov 2016 06:17:38 -0800 Eric Dumazet wrote:
>>
>> > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
>> >
>> > > I can see that qdisc layer does not activate xmit_more in this case.
>> >
>> > Sure. Not enough pressure from the sender(s).
>> >
>> > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
>> > limit is kept at a small value.
>> >
>> > (BTW not all NIC have expensive doorbells)
>>
>> I believe this NIC mlx5 (50G edition) does.
>>
>> I'm seeing UDP TX of 1656017.55 pps, which is per packet:
>>  2414 cycles(tsc) 603.86 ns
>>
>> Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
>>
>>  Samples: 56K of event 'cycles', Event count (approx.): 51613832267
>>    Overhead  Command    Shared Object     Symbol
>>  +    8.92%  udp_flood  [kernel.vmlinux]  [k] _raw_spin_lock
>>     - _raw_spin_lock
>>        + 90.78% __dev_queue_xmit
>>        +  7.83% dev_queue_xmit
>>        +  1.30% ___slab_alloc
>>  +    5.59%  udp_flood  [kernel.vmlinux]  [k] skb_set_owner_w
>>  +    4.77%  udp_flood  [mlx5_core]       [k] mlx5e_sq_xmit
>>  +    4.09%  udp_flood  [kernel.vmlinux]  [k] fib_table_lookup
>>  +    4.00%  swapper    [mlx5_core]       [k] mlx5e_poll_tx_cq
>>  +    3.11%  udp_flood  [kernel.vmlinux]  [k] __ip_route_output_key_hash
>>  +    2.49%  swapper    [kernel.vmlinux]  [k] __slab_free
>>
>> In this setup the spinlock in __dev_queue_xmit should be uncongested.
>> An uncongested spin_lock+unlock costs 32 cycles(tsc) 8.198 ns on this
>> system.
>>
>> But 8.92% of the time is spent on it, which corresponds to a cost of
>> 215 cycles (2414*0.0892). This cost is too high, thus something else
>> is going on... I claim this mysterious extra cost is the
>> tailptr/doorbell.
>
> Try adding code to ring the doorbell twice.
> If this doesn't slow things down then it isn't (likely to be) responsible
> for the delay you are seeing.
>
> David

The problem isn't only the doorbell. It is that a doorbell plus a
locked transaction on x86 results in a long wait until the doorbell
write has been completed. You could batch a bunch of doorbell writes
together, and it isn't an issue, unless you do something like writel(),
wmb(), writel(), wmb(); then you will see the effect double, since the
write memory barrier is what is forcing the delays.

- Alex
Re: Netperf UDP issue with connected sockets
On Thu, 2016-11-17 at 22:19 +0100, Jesper Dangaard Brouer wrote:
>
> Maybe you can share your udp flood "udpsnd" program source?

Very ugly. This is based on what I wrote when tracking the UDP v6
checksum bug (4f2e4ad56a65f3b7d64c258e373cb71e8d2499f4 net: mangle zero
checksum in skb_checksum_help()), because netperf sends the same message
over and over...

Use -d 2 to remove the ip_idents_reserve() overhead.

(Note: the original email's #include lines were stripped by the list
archive; the headers below are the ones the program needs.)

#define _GNU_SOURCE

#include <errno.h>
#include <error.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

char buffer[1400];

int main(int argc, char** argv) {
    int fd, i;
    struct sockaddr_in6 addr;
    char *host = "2002:af6:798::1";
    int family = AF_INET6;
    int discover = -1;

    while ((i = getopt(argc, argv, "4H:d:")) != -1) {
        switch (i) {
        case 'H': host = optarg; break;
        case '4': family = AF_INET; break;
        case 'd': discover = atoi(optarg); break;
        }
    }
    fd = socket(family, SOCK_DGRAM, 0);
    if (fd < 0)
        error(1, errno, "failed to create socket");
    if (discover != -1)
        setsockopt(fd, SOL_IP, IP_MTU_DISCOVER,
                   &discover, sizeof(discover));

    memset(&addr, 0, sizeof(addr));
    if (family == AF_INET6) {
        addr.sin6_family = AF_INET6;
        addr.sin6_port = htons(9);
        inet_pton(family, host, (void *)&addr.sin6_addr.s6_addr);
    } else {
        struct sockaddr_in *in = (struct sockaddr_in *)&addr;
        in->sin_family = family;
        in->sin_port = htons(9);
        inet_pton(family, host, &in->sin_addr);
    }
    connect(fd, (struct sockaddr *)&addr,
            (family == AF_INET6) ? sizeof(addr) :
                                   sizeof(struct sockaddr_in));
    memset(buffer, 1, 1400);
    for (i = 0; i < 65536; i++) {
        memcpy(buffer, &i, sizeof(i));
        send(fd, buffer, 100 + rand() % 200, 0);
    }
    return 0;
}
Re: Netperf UDP issue with connected sockets
On Thu, 17 Nov 2016 10:51:23 -0800 Eric Dumazet wrote:

> On Thu, 2016-11-17 at 19:30 +0100, Jesper Dangaard Brouer wrote:
>
> > The point is I can see a socket Send-Q forming, thus we do know the
> > application has something to send. Thus, a possibility for
> > non-opportunistic bulking. Allowing/implementing bulk enqueue from
> > socket layer into qdisc layer should be fairly simple (and the rest
> > of xmit_more is already in place).
>
> As I said, you are fooled by TX completions.
>
> Please make sure to increase the sndbuf limits !
>
> echo 2129920 >/proc/sys/net/core/wmem_default
>
> lpaa23:~# sar -n DEV 1 10|grep eth1
> 10:49:25     eth1      7.00 9273283.00    0.61 2187214.90     0.00     0.00      0.00
> 10:49:26     eth1      1.00 9230795.00    0.06 2176787.57     0.00     0.00      1.00
> 10:49:27     eth1      2.00 9247906.00    0.17 2180915.45     0.00     0.00      0.00
> 10:49:28     eth1      3.00 9246542.00    0.23 2180790.38     0.00     0.00      1.00
> 10:49:29     eth1      1.00 9239218.00    0.06 2179044.83     0.00     0.00      0.00
> 10:49:30     eth1      3.00 9248775.00    0.23 2181257.84     0.00     0.00      1.00
> 10:49:31     eth1      4.00 9225471.00    0.65 2175772.75     0.00     0.00      0.00
> 10:49:32     eth1      2.00 9253536.00    0.33 2182666.44     0.00     0.00      1.00
> 10:49:33     eth1      1.00 9265900.00    0.06 2185598.40     0.00     0.00      0.00
> 10:49:34     eth1      1.00 6949031.00    0.06 1638889.63     0.00     0.00      1.00
> Average:     eth1      2.50 9018045.70    0.25 2126893.82     0.00     0.00      0.50
>
> lpaa23:~# ethtool -S eth1|grep more; sleep 1; ethtool -S eth1|grep more
>      xmit_more: 2251366909
>      xmit_more: 2256011392
>
> lpaa23:~# echo 2256011392-2251366909 | bc
> 4644483

xmit_more does not happen that frequently for my setup; it does happen
sometimes. And I do monitor with "ethtool -S".

 ~/git/network-testing/bin/ethtool_stats.pl --sec 2 --dev mlx5p2
 Show adapter(s) (mlx5p2) statistics (ONLY that changed!)
 Ethtool(mlx5p2) stat:  92900913 ( 92,900,913) <= tx0_bytes /sec
 Ethtool(mlx5p2) stat:     36073 (     36,073) <= tx0_nop /sec
 Ethtool(mlx5p2) stat:   1548349 (  1,548,349) <= tx0_packets /sec
 Ethtool(mlx5p2) stat:         1 (          1) <= tx0_xmit_more /sec
 Ethtool(mlx5p2) stat:  92884899 ( 92,884,899) <= tx_bytes /sec
 Ethtool(mlx5p2) stat:  99297696 ( 99,297,696) <= tx_bytes_phy /sec
 Ethtool(mlx5p2) stat:   1548082 (  1,548,082) <= tx_csum_partial /sec
 Ethtool(mlx5p2) stat:   1548082 (  1,548,082) <= tx_packets /sec
 Ethtool(mlx5p2) stat:   1551527 (  1,551,527) <= tx_packets_phy /sec
 Ethtool(mlx5p2) stat:  99076658 ( 99,076,658) <= tx_prio1_bytes /sec
 Ethtool(mlx5p2) stat:   1548073 (  1,548,073) <= tx_prio1_packets /sec
 Ethtool(mlx5p2) stat:  92936078 ( 92,936,078) <= tx_vport_unicast_bytes /sec
 Ethtool(mlx5p2) stat:   1548934 (  1,548,934) <= tx_vport_unicast_packets /sec
 Ethtool(mlx5p2) stat:         1 (          1) <= tx_xmit_more /sec

(after several attempts I got:)

 $ ethtool -S mlx5p2|grep more; sleep 1; ethtool -S mlx5p2|grep more
      tx_xmit_more: 14048
      tx0_xmit_more: 14048
      tx_xmit_more: 14049
      tx0_xmit_more: 14049

This was with:

 $ grep -H . /proc/sys/net/core/wmem_default
 /proc/sys/net/core/wmem_default:2129920

> PerfTop: 76969 irqs/sec  kernel:96.6%  exact: 100.0% [4000Hz cycles:pp], (all, 48 CPUs)
> ---
>   11.64%  [kernel]  [k] skb_set_owner_w
>    6.21%  [kernel]  [k] queued_spin_lock_slowpath
>    4.76%  [kernel]  [k] _raw_spin_lock
>    4.40%  [kernel]  [k] __ip_make_skb
>    3.10%  [kernel]  [k] sock_wfree
>    2.87%  [kernel]  [k] ipt_do_table
>    2.76%  [kernel]  [k] fq_dequeue
>    2.71%  [kernel]  [k] mlx4_en_xmit
>    2.50%  [kernel]  [k] __dev_queue_xmit
>    2.29%  [kernel]  [k] __ip_append_data.isra.40
>    2.28%  [kernel]  [k] udp_sendmsg
>    2.01%  [kernel]  [k] __alloc_skb
>    1.90%  [kernel]  [k] napi_consume_skb
>    1.63%  [kernel]  [k] udp_send_skb
>    1.62%  [kernel]  [k] skb_release_data
>    1.62%  [kernel]  [k] entry_SYSCALL_64_fastpath
>    1.56%  [kernel]  [k] dev_hard_start_xmit
>    1.55%  udpsnd    [.] __libc_send
>    1.48%  [kernel]  [k] netif_skb_features
>    1.42%  [kernel]  [k] __qdisc_run
>    1.35%  [kernel]  [k] sk_dst_check
Re: Netperf UDP issue with connected sockets
On Thu, 2016-11-17 at 19:30 +0100, Jesper Dangaard Brouer wrote:

> The point is I can see a socket Send-Q forming, thus we do know the
> application has something to send. Thus, a possibility for
> non-opportunistic bulking. Allowing/implementing bulk enqueue from
> socket layer into qdisc layer should be fairly simple (and the rest of
> xmit_more is already in place).

As I said, you are fooled by TX completions.

Please make sure to increase the sndbuf limits !

echo 2129920 >/proc/sys/net/core/wmem_default

lpaa23:~# sar -n DEV 1 10|grep eth1
10:49:25     eth1      7.00 9273283.00    0.61 2187214.90     0.00     0.00      0.00
10:49:26     eth1      1.00 9230795.00    0.06 2176787.57     0.00     0.00      1.00
10:49:27     eth1      2.00 9247906.00    0.17 2180915.45     0.00     0.00      0.00
10:49:28     eth1      3.00 9246542.00    0.23 2180790.38     0.00     0.00      1.00
10:49:29     eth1      1.00 9239218.00    0.06 2179044.83     0.00     0.00      0.00
10:49:30     eth1      3.00 9248775.00    0.23 2181257.84     0.00     0.00      1.00
10:49:31     eth1      4.00 9225471.00    0.65 2175772.75     0.00     0.00      0.00
10:49:32     eth1      2.00 9253536.00    0.33 2182666.44     0.00     0.00      1.00
10:49:33     eth1      1.00 9265900.00    0.06 2185598.40     0.00     0.00      0.00
10:49:34     eth1      1.00 6949031.00    0.06 1638889.63     0.00     0.00      1.00
Average:     eth1      2.50 9018045.70    0.25 2126893.82     0.00     0.00      0.50

lpaa23:~# ethtool -S eth1|grep more; sleep 1; ethtool -S eth1|grep more
     xmit_more: 2251366909
     xmit_more: 2256011392

lpaa23:~# echo 2256011392-2251366909 | bc
4644483

 PerfTop: 76969 irqs/sec  kernel:96.6%  exact: 100.0% [4000Hz cycles:pp], (all, 48 CPUs)
---
   11.64%  [kernel]  [k] skb_set_owner_w
    6.21%  [kernel]  [k] queued_spin_lock_slowpath
    4.76%  [kernel]  [k] _raw_spin_lock
    4.40%  [kernel]  [k] __ip_make_skb
    3.10%  [kernel]  [k] sock_wfree
    2.87%  [kernel]  [k] ipt_do_table
    2.76%  [kernel]  [k] fq_dequeue
    2.71%  [kernel]  [k] mlx4_en_xmit
    2.50%  [kernel]  [k] __dev_queue_xmit
    2.29%  [kernel]  [k] __ip_append_data.isra.40
    2.28%  [kernel]  [k] udp_sendmsg
    2.01%  [kernel]  [k] __alloc_skb
    1.90%  [kernel]  [k] napi_consume_skb
    1.63%  [kernel]  [k] udp_send_skb
    1.62%  [kernel]  [k] skb_release_data
    1.62%  [kernel]  [k] entry_SYSCALL_64_fastpath
    1.56%  [kernel]  [k] dev_hard_start_xmit
    1.55%  udpsnd    [.] __libc_send
    1.48%  [kernel]  [k] netif_skb_features
    1.42%  [kernel]  [k] __qdisc_run
    1.35%  [kernel]  [k] sk_dst_check
    1.33%  [kernel]  [k] sock_def_write_space
    1.30%  [kernel]  [k] kmem_cache_alloc_node_trace
    1.29%  [kernel]  [k] __local_bh_enable_ip
    1.21%  [kernel]  [k] copy_user_enhanced_fast_string
    1.08%  [kernel]  [k] __kmalloc_reserve.isra.40
    1.08%  [kernel]  [k] SYSC_sendto
    1.07%  [kernel]  [k] kmem_cache_alloc_node
    0.95%  [kernel]  [k] ip_finish_output2
    0.95%  [kernel]  [k] ktime_get
    0.91%  [kernel]  [k] validate_xmit_skb
    0.88%  [kernel]  [k] sock_alloc_send_pskb
    0.82%  [kernel]  [k] sock_sendmsg
Re: Netperf UDP issue with connected sockets
On Thu, 17 Nov 2016 08:21:19 -0800 Eric Dumazet wrote:

> On Thu, 2016-11-17 at 15:57 +0100, Jesper Dangaard Brouer wrote:
> > On Thu, 17 Nov 2016 06:17:38 -0800
> > Eric Dumazet wrote:
> >
> > > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
> > >
> > > > I can see that qdisc layer does not activate xmit_more in this case.
> > >
> > > Sure. Not enough pressure from the sender(s).
> > >
> > > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
> > > limit is kept at a small value.
> > >
> > > (BTW not all NIC have expensive doorbells)
> >
> > I believe this NIC mlx5 (50G edition) does.
> >
> > I'm seeing UDP TX of 1656017.55 pps, which is per packet:
> > 2414 cycles(tsc) 603.86 ns
> >
> > Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
> >
> >  Samples: 56K of event 'cycles', Event count (approx.): 51613832267
> >  Overhead  Command    Shared Object     Symbol
> >  + 8.92%  udp_flood  [kernel.vmlinux]  [k] _raw_spin_lock
> >    - _raw_spin_lock
> >      + 90.78% __dev_queue_xmit
> >      +  7.83% dev_queue_xmit
> >      +  1.30% ___slab_alloc
> >  + 5.59%  udp_flood  [kernel.vmlinux]  [k] skb_set_owner_w
>
> Does TX completion happens on same cpu ?
>
> >  + 4.77%  udp_flood  [mlx5_core]       [k] mlx5e_sq_xmit
> >  + 4.09%  udp_flood  [kernel.vmlinux]  [k] fib_table_lookup
>
> Why fib_table_lookup() is used with connected UDP flow ???

Don't know. I would be interested in hints on how to avoid this!...
I also see it with netperf, and my udp_flood code is here:
https://github.com/netoptimizer/network-testing/blob/master/src/udp_flood.c

> >  + 4.00%  swapper    [mlx5_core]       [k] mlx5e_poll_tx_cq
> >  + 3.11%  udp_flood  [kernel.vmlinux]  [k] __ip_route_output_key_hash
>
> Same here, this is suspect.

It is the function calling fib_table_lookup(), and it is called by
ip_route_output_flow().

> >  + 2.49%  swapper    [kernel.vmlinux]  [k] __slab_free
> >
> > In this setup the spinlock in __dev_queue_xmit should be uncongested.
> > An uncongested spin_lock+unlock costs 32 cycles(tsc) 8.198 ns on this system.
> >
> > But 8.92% of the time is spent on it, which corresponds to a cost of 215
> > cycles (2414*0.0892). This cost is too high, thus something else is
> > going on... I claim this mysterious extra cost is the tailptr/doorbell.
>
> Well, with no pressure, doorbell is triggered for each packet.
>
> Since we can not predict that your application is going to send yet
> another packet one usec later, we can not avoid this.

The point is I can see a socket Send-Q forming, thus we do know the
application has something to send. Thus, a possibility for
non-opportunistic bulking. Allowing/implementing bulk enqueue from
socket layer into qdisc layer should be fairly simple (and the rest of
xmit_more is already in place).

> Note that with the patches I am working on (busypolling extensions),
> we could decide to let the busypoller do the doorbells, say one every
> 10 usec. (or after ~16 packets were queued)

Sounds interesting! But that is an opportunistic approach.

> But mlx4 uses two different NAPI for TX and RX, maybe mlx5 has the same
> strategy.

It is a possibility that TX completions were happening on another CPU
(but I don't think so for mlx5).

To rule that out, I reran the experiment, making sure to pin everything
to CPU-0, and the results are the same.

 sudo ethtool -L mlx5p2 combined 1

 sudo sh -c '\
 for x in /proc/irq/*/mlx5*/../smp_affinity; do \
   echo 01 > $x; grep . -H $x; done \
 '

 $ taskset -c 0 ./udp_flood --sendto 198.18.50.1 --count $((10**9))

Perf report validating CPU in use:

 $ perf report -g --no-children --sort cpu,comm,dso,symbol --stdio --call-graph none

 # Overhead  CPU  Command    Shared Object     Symbol
 # ........  ...  .........  ................  ......
 #  9.97%  000  udp_flood  [kernel.vmlinux]  [k] _raw_spin_lock
    4.37%  000  udp_flood  [kernel.vmlinux]  [k] fib_table_lookup
    3.97%  000  udp_flood  [mlx5_core]       [k] mlx5e_sq_xmit
    3.06%  000  udp_flood  [kernel.vmlinux]  [k] __ip_route_output_key_hash
    2.51%  000  udp_flood  [mlx5_core]       [k] mlx5e_poll_tx_cq
    2.48%  000  udp_flood  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
    2.47%  000  udp_flood  [kernel.vmlinux]  [k] entry_SYSCALL_64
    2.42%  000  udp_flood  [kernel.vmlinux]  [k] udp_sendmsg
    2.39%  000  udp_flood  [kernel.vmlinux]  [k] __ip_append_data.isra.47
    2.29%  000  udp_flood  [k
Re: Netperf UDP issue with connected sockets
On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote:
>> time to try IP_MTU_DISCOVER ;)
>
> To Rick, maybe you can find a good solution or option with Eric's hint,
> to send appropriately sized UDP packets with Don't Fragment (DF).

Well, I suppose adding another setsockopt() to the data socket creation
wouldn't be too difficult, along with another command-line option to
cause it to happen. Could we leave things as "make sure you don't need
fragmentation when you use this" or would netperf have to start
processing ICMP messages?

happy benchmarking,

rick jones
RE: Netperf UDP issue with connected sockets
From: Jesper Dangaard Brouer
> Sent: 17 November 2016 14:58
>
> On Thu, 17 Nov 2016 06:17:38 -0800
> Eric Dumazet wrote:
>
> > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
> >
> > > I can see that qdisc layer does not activate xmit_more in this case.
> >
> > Sure. Not enough pressure from the sender(s).
> >
> > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
> > limit is kept at a small value.
> >
> > (BTW not all NIC have expensive doorbells)
>
> I believe this NIC mlx5 (50G edition) does.
>
> I'm seeing UDP TX of 1656017.55 pps, which is per packet:
> 2414 cycles(tsc) 603.86 ns
>
> Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
>
>  Samples: 56K of event 'cycles', Event count (approx.): 51613832267
>  Overhead  Command    Shared Object     Symbol
>  + 8.92%  udp_flood  [kernel.vmlinux]  [k] _raw_spin_lock
>    - _raw_spin_lock
>      + 90.78% __dev_queue_xmit
>      +  7.83% dev_queue_xmit
>      +  1.30% ___slab_alloc
>  + 5.59%  udp_flood  [kernel.vmlinux]  [k] skb_set_owner_w
>  + 4.77%  udp_flood  [mlx5_core]       [k] mlx5e_sq_xmit
>  + 4.09%  udp_flood  [kernel.vmlinux]  [k] fib_table_lookup
>  + 4.00%  swapper    [mlx5_core]       [k] mlx5e_poll_tx_cq
>  + 3.11%  udp_flood  [kernel.vmlinux]  [k] __ip_route_output_key_hash
>  + 2.49%  swapper    [kernel.vmlinux]  [k] __slab_free
>
> In this setup the spinlock in __dev_queue_xmit should be uncongested.
> An uncongested spin_lock+unlock costs 32 cycles(tsc) 8.198 ns on this system.
>
> But 8.92% of the time is spent on it, which corresponds to a cost of 215
> cycles (2414*0.0892). This cost is too high, thus something else is
> going on... I claim this mysterious extra cost is the tailptr/doorbell.

Try adding code to ring the doorbell twice.
If this doesn't slow things down, then it isn't (likely to be)
responsible for the delay you are seeing.

	David
Re: Netperf UDP issue with connected sockets
On Thu, 17 Nov 2016 06:17:38 -0800 Eric Dumazet wrote:

> On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
>
> > I can see that qdisc layer does not activate xmit_more in this case.
>
> Sure. Not enough pressure from the sender(s).
>
> The bottleneck is not the NIC or qdisc in your case, meaning that BQL
> limit is kept at a small value.
>
> (BTW not all NIC have expensive doorbells)

I believe this NIC mlx5 (50G edition) does.

I'm seeing UDP TX of 1656017.55 pps, which is per packet:
2414 cycles(tsc) 603.86 ns

Perf top shows (with my own udp_flood, that avoids __ip_select_ident):

 Samples: 56K of event 'cycles', Event count (approx.): 51613832267
 Overhead  Command    Shared Object     Symbol
 + 8.92%  udp_flood  [kernel.vmlinux]  [k] _raw_spin_lock
   - _raw_spin_lock
     + 90.78% __dev_queue_xmit
     +  7.83% dev_queue_xmit
     +  1.30% ___slab_alloc
 + 5.59%  udp_flood  [kernel.vmlinux]  [k] skb_set_owner_w
 + 4.77%  udp_flood  [mlx5_core]       [k] mlx5e_sq_xmit
 + 4.09%  udp_flood  [kernel.vmlinux]  [k] fib_table_lookup
 + 4.00%  swapper    [mlx5_core]       [k] mlx5e_poll_tx_cq
 + 3.11%  udp_flood  [kernel.vmlinux]  [k] __ip_route_output_key_hash
 + 2.49%  swapper    [kernel.vmlinux]  [k] __slab_free

In this setup the spinlock in __dev_queue_xmit should be uncongested.
An uncongested spin_lock+unlock costs 32 cycles(tsc) 8.198 ns on this system.

But 8.92% of the time is spent on it, which corresponds to a cost of 215
cycles (2414*0.0892). This cost is too high, thus something else is
going on... I claim this mysterious extra cost is the tailptr/doorbell.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
Re: Netperf UDP issue with connected sockets
On Thu, 17 Nov 2016 05:20:50 -0800 Eric Dumazet wrote:

> On Thu, 2016-11-17 at 09:16 +0100, Jesper Dangaard Brouer wrote:
> >
> > I noticed there is a Send-Q, and the perf-top #2 is _raw_spin_lock, which
> > looks like it comes from __dev_queue_xmit(), but we know from
> > experience that this stall is actually caused by writing the
> > tailptr/doorbell in the HW. Thus, this could benefit a lot from
> > bulk/xmit_more into the qdisc layer.
>
> The Send-Q is there because of TX-completions being delayed a bit,
> because of IRQ mitigation.
>
> (ethtool -c eth0)
>
> It happens even if you do not have a qdisc in the first place.
>
> And we do have xmit_more in the qdisc layer already.

I can see that qdisc layer does not activate xmit_more in this case.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

$ ethtool -c mlx5p4
Coalesce parameters for mlx5p4:
Adaptive RX: on  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 3
rx-frames: 32
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 16
tx-frames: 32
tx-usecs-irq: 0
tx-frames-irq: 0

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0
Re: Netperf UDP issue with connected sockets
On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:

> I can see that qdisc layer does not activate xmit_more in this case.

Sure. Not enough pressure from the sender(s).

The bottleneck is not the NIC or qdisc in your case, meaning that BQL
limit is kept at a small value.

(BTW not all NIC have expensive doorbells)
Re: Netperf UDP issue with connected sockets
On Thu, 2016-11-17 at 09:16 +0100, Jesper Dangaard Brouer wrote:
>
> I noticed there is a Send-Q, and the perf-top #2 is _raw_spin_lock, which
> looks like it comes from __dev_queue_xmit(), but we know from
> experience that this stall is actually caused by writing the
> tailptr/doorbell in the HW. Thus, this could benefit a lot from
> bulk/xmit_more into the qdisc layer.

The Send-Q is there because of TX-completions being delayed a bit,
because of IRQ mitigation.

(ethtool -c eth0)

It happens even if you do not have a qdisc in the first place.

And we do have xmit_more in the qdisc layer already.
Re: Netperf UDP issue with connected sockets
On Thu, 2016-11-17 at 15:57 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 17 Nov 2016 06:17:38 -0800
> Eric Dumazet wrote:
>
> > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
> >
> > > I can see that qdisc layer does not activate xmit_more in this case.
> >
> > Sure. Not enough pressure from the sender(s).
> >
> > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
> > limit is kept at a small value.
> >
> > (BTW not all NIC have expensive doorbells)
>
> I believe this NIC mlx5 (50G edition) does.
>
> I'm seeing UDP TX of 1656017.55 pps, which is per packet:
> 2414 cycles(tsc) 603.86 ns
>
> Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
>
>  Samples: 56K of event 'cycles', Event count (approx.): 51613832267
>  Overhead  Command    Shared Object     Symbol
>  + 8.92%  udp_flood  [kernel.vmlinux]  [k] _raw_spin_lock
>    - _raw_spin_lock
>      + 90.78% __dev_queue_xmit
>      +  7.83% dev_queue_xmit
>      +  1.30% ___slab_alloc
>  + 5.59%  udp_flood  [kernel.vmlinux]  [k] skb_set_owner_w

Does TX completion happens on same cpu ?

>  + 4.77%  udp_flood  [mlx5_core]       [k] mlx5e_sq_xmit
>  + 4.09%  udp_flood  [kernel.vmlinux]  [k] fib_table_lookup

Why fib_table_lookup() is used with connected UDP flow ???

>  + 4.00%  swapper    [mlx5_core]       [k] mlx5e_poll_tx_cq
>  + 3.11%  udp_flood  [kernel.vmlinux]  [k] __ip_route_output_key_hash

Same here, this is suspect.

>  + 2.49%  swapper    [kernel.vmlinux]  [k] __slab_free
>
> In this setup the spinlock in __dev_queue_xmit should be uncongested.
> An uncongested spin_lock+unlock costs 32 cycles(tsc) 8.198 ns on this system.
>
> But 8.92% of the time is spent on it, which corresponds to a cost of 215
> cycles (2414*0.0892). This cost is too high, thus something else is
> going on... I claim this mysterious extra cost is the tailptr/doorbell.

Well, with no pressure, doorbell is triggered for each packet.

Since we can not predict that your application is going to send yet
another packet one usec later, we can not avoid this.

Note that with the patches I am working on (busypolling extensions),
we could decide to let the busypoller do the doorbells, say one every
10 usec (or after ~16 packets were queued).

But mlx4 uses two different NAPI for TX and RX; maybe mlx5 has the same
strategy.
Re: Netperf UDP issue with connected sockets
On Wed, 16 Nov 2016 16:34:09 -0800 Eric Dumazet wrote:

> On Wed, 2016-11-16 at 23:40 +0100, Jesper Dangaard Brouer wrote:
>
> > Using -R 1 does not seem to help remove __ip_select_ident()
> >
> >  Samples: 56K of event 'cycles', Event count (approx.): 78628132661
> >  Overhead  Command  Shared Object     Symbol
> >  + 9.11%  netperf  [kernel.vmlinux]  [k] __ip_select_ident
> >  + 6.98%  netperf  [kernel.vmlinux]  [k] _raw_spin_lock
> >  + 6.21%  swapper  [mlx5_core]       [k] mlx5e_poll_tx_cq
> >  + 5.03%  netperf  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
> >  + 4.69%  netperf  [kernel.vmlinux]  [k] __ip_make_skb
> >  + 4.63%  netperf  [kernel.vmlinux]  [k] skb_set_owner_w
> >  + 4.15%  swapper  [kernel.vmlinux]  [k] __slab_free
> >  + 3.80%  netperf  [mlx5_core]       [k] mlx5e_sq_xmit
> >  + 2.00%  swapper  [kernel.vmlinux]  [k] sock_wfree
> >  + 1.94%  netperf  netperf           [.] send_data
> >  + 1.92%  netperf  netperf           [.] send_omni_inner
>
> Check "ss -nu" ?
>
> You will see if sockets are connected (present in ss output or not)

Tested different versions of netperf; commands used below signature:

 netperf-2.6.0: connected "broken"
 netperf-2.7.0: connected works
 SVN-r709     : connected works

I noticed there is a Send-Q, and the perf-top #2 is _raw_spin_lock, which
looks like it comes from __dev_queue_xmit(), but we know from
experience that this stall is actually caused by writing the
tailptr/doorbell in the HW. Thus, this could benefit a lot from
bulk/xmit_more into the qdisc layer.

> UDP being connected does not prevent __ip_select_ident() being used.
>
> if ((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) {
>
> So you need IP_DF being set, and skb->ignore_df being 0

Thanks for explaining that! :-)

 http://lxr.free-electrons.com/source/include/net/ip.h?v=4.8#L332
 http://lxr.free-electrons.com/source/net/ipv4/ip_output.c?v=4.8#L449

Netperf UDP_STREAM by default sends 64K packets that get fragmented...
which actually is very unfortunate, because people end up testing a code
path in the kernel they didn't expect. That is why I use the option
"-- -m 1472".

> time to try IP_MTU_DISCOVER ;)

To Rick, maybe you can find a good solution or option with Eric's hint,
to send appropriately sized UDP packets with Don't Fragment (DF).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

Testing with ss -nua:

$ /usr/local/stow/netperf-2.6.0-demo/bin/netperf -H 198.18.50.1 -t UDP_STREAM -l 3 -- -m 1472 -n -N > /dev/null & sleep 1; ss -una
State   Recv-Q  Send-Q  Local Address:Port   Peer Address:Port
UNCONN  0       11520   *:54589              *:*

$ /usr/local/stow/netperf-2.7.0-demo/bin/netperf -H 198.18.50.1 -t UDP_STREAM -l 3 -- -m 1472 -n -N > /dev/null & sleep 1; ss -una
State   Recv-Q  Send-Q  Local Address:Port   Peer Address:Port
ESTAB   0       18432   198.18.50.3:46803    198.18.50.1:51851

$ ~/tools/netperf2-svn/src/netperf -H 198.18.50.1 -t UDP_STREAM -l 3 -- -m 1472 -n -N > /dev/null & sleep 1; ss -una
State   Recv-Q  Send-Q  Local Address:Port   Peer Address:Port
ESTAB   0       43776   198.18.50.3:42965    198.18.50.1:51948
Re: Netperf UDP issue with connected sockets
On Wed, 2016-11-16 at 23:40 +0100, Jesper Dangaard Brouer wrote:

> Using -R 1 does not seem to help remove __ip_select_ident()
>
>  Samples: 56K of event 'cycles', Event count (approx.): 78628132661
>  Overhead  Command  Shared Object     Symbol
>  + 9.11%  netperf  [kernel.vmlinux]  [k] __ip_select_ident
>  + 6.98%  netperf  [kernel.vmlinux]  [k] _raw_spin_lock
>  + 6.21%  swapper  [mlx5_core]       [k] mlx5e_poll_tx_cq
>  + 5.03%  netperf  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
>  + 4.69%  netperf  [kernel.vmlinux]  [k] __ip_make_skb
>  + 4.63%  netperf  [kernel.vmlinux]  [k] skb_set_owner_w
>  + 4.15%  swapper  [kernel.vmlinux]  [k] __slab_free
>  + 3.80%  netperf  [mlx5_core]       [k] mlx5e_sq_xmit
>  + 2.00%  swapper  [kernel.vmlinux]  [k] sock_wfree
>  + 1.94%  netperf  netperf           [.] send_data
>  + 1.92%  netperf  netperf           [.] send_omni_inner

Check "ss -nu" ?

You will see if sockets are connected (present in ss output or not)

UDP being connected does not prevent __ip_select_ident() being used.

if ((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) {

So you need IP_DF being set, and skb->ignore_df being 0

-> time to try IP_MTU_DISCOVER ;)
Re: Netperf UDP issue with connected sockets
On 11/16/2016 02:40 PM, Jesper Dangaard Brouer wrote:
> On Wed, 16 Nov 2016 09:46:37 -0800
> Rick Jones wrote:
>
>> It is a wild guess, but does setting SO_DONTROUTE affect whether or not
>> a connect() would have the desired effect? That is there to protect
>> people from themselves (long story about people using UDP_STREAM to
>> stress improperly air-gapped systems during link up/down testing).
>> It can be disabled with a test-specific -R 1 option, so your netperf
>> command would become:
>>
>> netperf -H 198.18.50.1 -t UDP_STREAM -l 120 -- -m 1472 -n -N -R 1
>
> Using -R 1 does not seem to help remove __ip_select_ident()

Bummer. It was a wild guess anyway, since I was seeing a connect() call
on the data socket.

>  Samples: 56K of event 'cycles', Event count (approx.): 78628132661
>  Overhead  Command  Shared Object     Symbol
>  + 9.11%  netperf  [kernel.vmlinux]  [k] __ip_select_ident
>  + 6.98%  netperf  [kernel.vmlinux]  [k] _raw_spin_lock
>  + 6.21%  swapper  [mlx5_core]       [k] mlx5e_poll_tx_cq
>  + 5.03%  netperf  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
>  + 4.69%  netperf  [kernel.vmlinux]  [k] __ip_make_skb
>  + 4.63%  netperf  [kernel.vmlinux]  [k] skb_set_owner_w
>  + 4.15%  swapper  [kernel.vmlinux]  [k] __slab_free
>  + 3.80%  netperf  [mlx5_core]       [k] mlx5e_sq_xmit
>  + 2.00%  swapper  [kernel.vmlinux]  [k] sock_wfree
>  + 1.94%  netperf  netperf           [.] send_data
>  + 1.92%  netperf  netperf           [.] send_omni_inner

Well, the next step I suppose is to have you try a quick netperf
UDP_STREAM under strace to see if your netperf binary does what mine did:

strace -v -o /tmp/netperf.strace netperf -H 198.18.50.1 -t UDP_STREAM -l 1 -- -m 1472 -n -N -R 1

And see if you see the connect() I saw. (Note, I make the runtime 1 second.)

rick
Re: Netperf UDP issue with connected sockets
On Wed, 16 Nov 2016 09:46:37 -0800 Rick Jones wrote:

> On 11/16/2016 04:16 AM, Jesper Dangaard Brouer wrote:
> > [1] Subj: High perf top ip_idents_reserve doing netperf UDP_STREAM
> >  - https://www.spinics.net/lists/netdev/msg294752.html
> >
> > Not fixed in version 2.7.0.
> >  - ftp://ftp.netperf.org/netperf/netperf-2.7.0.tar.gz
> >
> > Used extra netperf configure compile options:
> >  ./configure --enable-histogram --enable-demo
> >
> > It seems like some fix attempts exist in the SVN repository::
> >
> >  svn checkout http://www.netperf.org/svn/netperf2/trunk/ netperf2-svn
> >  svn log -r709
> >  # A quick stab at getting remote connect going for UDP_STREAM
> >  svn diff -r708:709
> >
> > Testing with the SVN version still shows __ip_select_ident() at top #1.
>
> Indeed, there was a fix for getting the remote side connect()ed.
> Looking at what I have for the top of trunk I do though see a connect()
> call being made at the local end:
>
> socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
> getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
> getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
> setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(4, {sa_family=AF_INET, sin_port=htons(0),
> sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0
> setsockopt(4, SOL_IP, IP_RECVERR, [1], 4) = 0
> brk(0xe53000) = 0xe53000
> getsockname(4, {sa_family=AF_INET, sin_port=htons(59758),
> sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
> sendto(3,
> "\0\0\0a\377\377\377\377\377\377\377\377\377\377\377\377\0\0\0\10\0\0\0\0\0\0\0\321\377\377\377\377"...,
> 656, 0, NULL, 0) = 656
> select(1024, [3], NULL, NULL, {120, 0}) = 1 (in [3], left {119, 995630})
> recvfrom(3,
> "\0\0\0b\0\0\0\0\0\3@\0\0\3@\0\0\0\0\2\0\3@\0\377\377\377\377\0\0\0\321"...,
> 656, 0, NULL, NULL) = 656
> write(1, "need to connect is 1\n", 21) = 21
> rt_sigaction(SIGALRM, {0x402ea6, [ALRM], SA_RESTORER|SA_INTERRUPT,
> 0x7f2824eb2cb0}, NULL, 8) = 0
> rt_sigaction(SIGINT, {0x402ea6, [INT], SA_RESTORER|SA_INTERRUPT,
> 0x7f2824eb2cb0}, NULL, 8) = 0
> alarm(1) = 0
> connect(4, {sa_family=AF_INET, sin_port=htons(34832),
> sin_addr=inet_addr("127.0.0.1")}, 16) = 0
> sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024
> sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024
> sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024
>
> the only difference there with top of trunk is that "need to connect"
> write/printf I just put in the code to be a nice marker in the system
> call trace.
>
> It is a wild guess, but does setting SO_DONTROUTE affect whether or not
> a connect() would have the desired effect? That is there to protect
> people from themselves (long story about people using UDP_STREAM to
> stress improperly air-gapped systems during link up/down testing).
> It can be disabled with a test-specific -R 1 option, so your netperf
> command would become:
>
> netperf -H 198.18.50.1 -t UDP_STREAM -l 120 -- -m 1472 -n -N -R 1

Using -R 1 does not seem to help remove __ip_select_ident()

 Samples: 56K of event 'cycles', Event count (approx.): 78628132661
 Overhead  Command  Shared Object     Symbol
 + 9.11%  netperf  [kernel.vmlinux]  [k] __ip_select_ident
 + 6.98%  netperf  [kernel.vmlinux]  [k] _raw_spin_lock
 + 6.21%  swapper  [mlx5_core]       [k] mlx5e_poll_tx_cq
 + 5.03%  netperf  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
 + 4.69%  netperf  [kernel.vmlinux]  [k] __ip_make_skb
 + 4.63%  netperf  [kernel.vmlinux]  [k] skb_set_owner_w
 + 4.15%  swapper  [kernel.vmlinux]  [k] __slab_free
 + 3.80%  netperf  [mlx5_core]       [k] mlx5e_sq_xmit
 + 2.00%  swapper  [kernel.vmlinux]  [k] sock_wfree
 + 1.94%  netperf  netperf           [.] send_data
 + 1.92%  netperf  netperf           [.] send_omni_inner

> > (p.s. is netperf ever going to be converted from SVN to git?)
>
> Well my git-fu could use some work (gentle, offline taps with a
> clueful tutorial bat would be welcome), and at least in the past, going
> to git was held back because there were a bunch of netperf users on
> Windows and there wasn't (at the time) support for git under Windows.
>
> But I am not against the idea in principle.

Once you have learned git, you will never go back to SVN. Just do it! :-)
Here is even a nice writeup of how to convert and preserve history:
 http://john.albin.net/git/convert-subversion-to-git

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
Re: Netperf UDP issue with connected sockets
On 11/16/2016 04:16 AM, Jesper Dangaard Brouer wrote:
> [1] Subj: High perf top ip_idents_reserve doing netperf UDP_STREAM
>  - https://www.spinics.net/lists/netdev/msg294752.html
>
> Not fixed in version 2.7.0.
>  - ftp://ftp.netperf.org/netperf/netperf-2.7.0.tar.gz
>
> Used extra netperf configure compile options:
>  ./configure --enable-histogram --enable-demo
>
> It seems like some fix attempts exist in the SVN repository::
>
>  svn checkout http://www.netperf.org/svn/netperf2/trunk/ netperf2-svn
>  svn log -r709
>  # A quick stab at getting remote connect going for UDP_STREAM
>  svn diff -r708:709
>
> Testing with the SVN version still shows __ip_select_ident() at top #1.

Indeed, there was a fix for getting the remote side connect()ed.
Looking at what I have for the top of trunk I do though see a connect()
call being made at the local end:

socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(4, {sa_family=AF_INET, sin_port=htons(0),
sin_addr=inet_addr("0.0.0.0")}, 16) = 0
setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0
setsockopt(4, SOL_IP, IP_RECVERR, [1], 4) = 0
brk(0xe53000) = 0xe53000
getsockname(4, {sa_family=AF_INET, sin_port=htons(59758),
sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
sendto(3,
"\0\0\0a\377\377\377\377\377\377\377\377\377\377\377\377\0\0\0\10\0\0\0\0\0\0\0\321\377\377\377\377"...,
656, 0, NULL, 0) = 656
select(1024, [3], NULL, NULL, {120, 0}) = 1 (in [3], left {119, 995630})
recvfrom(3,
"\0\0\0b\0\0\0\0\0\3@\0\0\3@\0\0\0\0\2\0\3@\0\377\377\377\377\0\0\0\321"...,
656, 0, NULL, NULL) = 656
write(1, "need to connect is 1\n", 21) = 21
rt_sigaction(SIGALRM, {0x402ea6, [ALRM], SA_RESTORER|SA_INTERRUPT,
0x7f2824eb2cb0}, NULL, 8) = 0
rt_sigaction(SIGINT, {0x402ea6, [INT], SA_RESTORER|SA_INTERRUPT,
0x7f2824eb2cb0}, NULL, 8) = 0
alarm(1) = 0
connect(4, {sa_family=AF_INET, sin_port=htons(34832),
sin_addr=inet_addr("127.0.0.1")}, 16) = 0
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024

the only difference there with top of trunk is that "need to connect"
write/printf I just put in the code to be a nice marker in the system
call trace.

It is a wild guess, but does setting SO_DONTROUTE affect whether or not
a connect() would have the desired effect? That is there to protect
people from themselves (long story about people using UDP_STREAM to
stress improperly air-gapped systems during link up/down testing).
It can be disabled with a test-specific -R 1 option, so your netperf
command would become:

netperf -H 198.18.50.1 -t UDP_STREAM -l 120 -- -m 1472 -n -N -R 1

> (p.s. is netperf ever going to be converted from SVN to git?)

Well my git-fu could use some work (gentle, offline taps with a
clueful tutorial bat would be welcome), and at least in the past, going
to git was held back because there were a bunch of netperf users on
Windows and there wasn't (at the time) support for git under Windows.

But I am not against the idea in principle.

happy benchmarking,

rick jones

PS - rick.jo...@hp.com no longer works. rick.jon...@hpe.com should be
used instead.