Re: Netperf UDP issue with connected sockets

2016-11-30 Thread David Miller
From: Rick Jones 
Date: Wed, 30 Nov 2016 09:42:40 -0800

> And indeed, based on a quick check, send() is what is being called,
> though it seems to become a sendto() system call - with the
> destination information NULL:
> 
> write(1, "send\n", 5)   = 5
> sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0)
> = 1024
> write(1, "send\n", 5)   = 5
> sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0)
> = 1024
> 
> So I'm not sure what might be going on there.

It's because of glibc's implementation of send() which is:

ssize_t
__libc_send (int sockfd, const void *buffer, size_t len, int flags)
{
  return SYSCALL_CANCEL (sendto, sockfd, buffer, len, flags, NULL, 0);
}
strong_alias (__libc_send, __send)
weak_alias (__libc_send, send)


Re: Netperf UDP issue with connected sockets

2016-11-30 Thread Rick Jones

On 11/30/2016 02:43 AM, Jesper Dangaard Brouer wrote:

Notice the "fib_lookup" cost is still present, even when I use
option "-- -n -N" to create a connected socket.  As Eric taught us,
this is because we should use syscalls "send" or "write" on a connected
socket.


In theory, once the data socket is connected, the send_data() call in 
src/nettest_omni.c is supposed to use send() rather than sendto().


And indeed, based on a quick check, send() is what is being called, 
though it seems to become a sendto() system call - with the destination 
information NULL:


write(1, "send\n", 5)   = 5
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024

write(1, "send\n", 5)   = 5
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024


So I'm not sure what might be going on there.

You can get netperf to use write() instead of send() by adding a 
test-specific -I option.


happy benchmarking,

rick



My udp_flood tool[1] cycles through the different syscalls:

taskset -c 2 ~/git/network-testing/src/udp_flood 198.18.50.1 --count $((10**7)) --pmtu 2

             ns/pkt   pps         cycles/pkt
send         473.08   2113816.28  1891
sendto       558.58   1790265.84  2233
sendmsg      587.24   1702873.80  2348
sendMmsg/32  547.57   1826265.90  2189
write        518.36   1929175.52  2072

Using "send" seems to be the fastest option.

Some notes on the test: I've forced TX completions to happen on another
CPU (CPU0) and pinned the udp_flood program (to CPU2), as I want to keep
the CPU scheduler from moving udp_flood around, since that causes
fluctuations in the results (it stresses the memory allocator more).

My udp_flood --pmtu option is documented in the --help usage text (see below the signature).





Re: Netperf UDP issue with connected sockets

2016-11-30 Thread Jesper Dangaard Brouer

On Mon, 28 Nov 2016 10:33:49 -0800 Rick Jones  wrote:

> On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote:
> >> time to try IP_MTU_DISCOVER ;)  
> >
> > To Rick, maybe you can find a good solution or option with Eric's hint,
> > to send appropriate sized UDP packets with Don't Fragment (DF).  
> 
> Jesper -
> 
> Top of trunk has a change adding an omni, test-specific -f option which 
> will set IP_MTU_DISCOVER:IP_PMTUDISC_DO on the data socket.  Is that 
> sufficient to your needs?

The "-- -f" option makes the __ip_select_ident lookup go away,
confirming your new option works.

Notice the "fib_lookup" cost is still present, even when I use
option "-- -n -N" to create a connected socket.  As Eric taught us,
this is because we should use syscalls "send" or "write" on a connected
socket.

My udp_flood tool[1] cycles through the different syscalls:

taskset -c 2 ~/git/network-testing/src/udp_flood 198.18.50.1 --count $((10**7)) --pmtu 2

             ns/pkt   pps         cycles/pkt
send         473.08   2113816.28  1891
sendto       558.58   1790265.84  2233
sendmsg      587.24   1702873.80  2348
sendMmsg/32  547.57   1826265.90  2189
write        518.36   1929175.52  2072

Using "send" seems to be the fastest option.

Some notes on the test: I've forced TX completions to happen on another
CPU (CPU0) and pinned the udp_flood program (to CPU2), as I want to keep
the CPU scheduler from moving udp_flood around, since that causes
fluctuations in the results (it stresses the memory allocator more).

My udp_flood --pmtu option is documented in the --help usage text (see below the signature).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

$ uname -a
Linux canyon 4.9.0-rc6-page_pool07-baseline+ #185 SMP PREEMPT Wed Nov 30 10:07:51 CET 2016 x86_64

[1] udp_flood tool:
 https://github.com/netoptimizer/network-testing/blob/master/src/udp_flood.c

Quick commands used for verifying that __ip_select_ident is removed:

 # First run benchmark
 sudo perf record -g -a ~/tools/netperf2-svn/src/netperf -H 198.18.50.1 \
  -t UDP_STREAM -l 3 -- -m 1472 -f

 # Second grep in perf output for functions
 sudo perf report --no-children --call-graph none --stdio |\
  egrep -e '__ip_select_ident|fib_table_lookup'


$ ./udp_flood --help

DOCUMENTATION:
 This tool is a UDP flood that measures the outgoing packet rate.
 Default cycles through tests with different send system calls.
 What function-call to invoke can also be specified as a command
 line option (see below).

 Default transmit 100 packets per test, adjust via --count

 Usage: ./udp_flood (options-see-below) IPADDR
 Listing options:
 --help short-option: -h
 --ipv4 short-option: -4
 --ipv6 short-option: -6
 --sendmsg  short-option: -u
 --sendmmsg short-option: -U
 --sendto   short-option: -t
 --writeshort-option: -T
 --send short-option: -S
 --batchshort-option: -b
 --countshort-option: -c
 --port short-option: -p
 --payload  short-option: -m
 --pmtu short-option: -d
 --verbose  short-option: -v

 Multiple tests can be selected:
 default: all tests
 -u -U -t -T -S: run any combination of sendmsg/sendmmsg/sendto/write/send

Option --pmtu   for Path MTU discover socket option IP_MTU_DISCOVER
 This affects the DF(Don't-Fragment) bit setting.
 Following values are selectable:
  0 = IP_PMTUDISC_DONT
  1 = IP_PMTUDISC_WANT
  2 = IP_PMTUDISC_DO
  3 = IP_PMTUDISC_PROBE
  4 = IP_PMTUDISC_INTERFACE
  5 = IP_PMTUDISC_OMIT
 Documentation see under IP_MTU_DISCOVER in 'man 7 ip'



Re: Netperf UDP issue with connected sockets

2016-11-28 Thread Rick Jones

On 11/28/2016 10:33 AM, Rick Jones wrote:

On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote:

time to try IP_MTU_DISCOVER ;)


To Rick, maybe you can find a good solution or option with Eric's hint,
to send appropriate sized UDP packets with Don't Fragment (DF).


Jesper -

Top of trunk has a change adding an omni, test-specific -f option which
will set IP_MTU_DISCOVER:IP_PMTUDISC_DO on the data socket.  Is that
sufficient to your needs?


Usage examples:

raj@tardy:~/netperf2_trunk/src$ ./netperf -t UDP_STREAM -l 1 -H raj-folio.americas.hpqcorp.net -- -m 1472 -f
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to raj-folio.americas.hpqcorp.net () port 0 AF_INET

Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1472   1.00       77495      0     912.35
212992           1.00       77495            912.35

[1]+  Done                    emacs nettest_omni.c
raj@tardy:~/netperf2_trunk/src$ ./netperf -t UDP_STREAM -l 1 -H raj-folio.americas.hpqcorp.net -- -m 14720 -f
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to raj-folio.americas.hpqcorp.net () port 0 AF_INET

send_data: data send error: Message too long (errno 90)
netperf: send_omni: send_data failed: Message too long

happy benchmarking,

rick jones


Re: Netperf UDP issue with connected sockets

2016-11-28 Thread Rick Jones

On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote:

time to try IP_MTU_DISCOVER ;)


To Rick, maybe you can find a good solution or option with Eric's hint,
to send appropriate sized UDP packets with Don't Fragment (DF).


Jesper -

Top of trunk has a change adding an omni, test-specific -f option which 
will set IP_MTU_DISCOVER:IP_PMTUDISC_DO on the data socket.  Is that 
sufficient to your needs?


happy benchmarking,

rick



Re: Netperf UDP issue with connected sockets

2016-11-21 Thread Eric Dumazet
On Mon, 2016-11-21 at 17:03 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 17 Nov 2016 10:51:23 -0800
> Eric Dumazet  wrote:
> 
> > On Thu, 2016-11-17 at 19:30 +0100, Jesper Dangaard Brouer wrote:
> > 
> > > The point is I can see a socket Send-Q forming, thus we do know the
> > > application has something to send, and thus a possibility for
> > > non-opportunistic bulking. Allowing/implementing bulk enqueue from
> > > the socket layer into the qdisc layer should be fairly simple (and
> > > the rest of xmit_more is already in place).
> > 
> > 
> > As I said, you are fooled by TX completions.
> 
> Obviously TX completions play a role yes, and I bet I can adjust the
> TX completion to cause xmit_more to happen, at the expense of
> introducing added latency.
> 
> The point is the "bloated" spinlock in __dev_queue_xmit is still caused
> by the MMIO tailptr/doorbell.  The added cost occurs when enqueueing
> packets, and results in the inability to get enough packets into the
> qdisc for xmit_more to kick in (on my system).  I argue that a bulk
> enqueue API would allow us to get past the hurdle of transitioning into
> xmit_more mode more easily.
> 


This is very nice, but we already have bulk enqueue, it is called
xmit_more.

Kernel does not know your application is sending a packet after the one
you send.

xmit_more is not often used, as applications/stacks send many small
packets one at a time.

The qdisc is empty (an enqueued packet is immediately dequeued, so
skb->xmit_more is 0), and often even bypassed (TCQ_F_CAN_BYPASS).

Not sure if this has been tried before, but the doorbell avoidance could
be done by the driver itself, because it knows a TX completion will come
shortly (well... if softirqs are not delayed too much !)

Doorbell would be forced only if :

("skb->xmit_more is not set" AND "TX engine is not 'started yet'" )
OR
( too many [1] packets were put in TX ring buffer, no point deferring
more)

Start the pump, but once it is started, let the doorbells be done by the
TX completion.

ndo_start_xmit and TX completion handler would have to maintain a shared
state describing if packets were ready but doorbell deferred.


Note that "TX completion" here means "at least one packet was drained";
otherwise busy polling, constantly calling napi->poll(), would force a
doorbell too soon for devices sharing a NAPI for both RX and TX.

But then, maybe busy poll would like to force a doorbell...

I could try these ideas on mlx4 shortly.


[1] limit could be derived from active "ethtool -c" params, eg tx-frames





Re: Netperf UDP issue with connected sockets

2016-11-21 Thread Jesper Dangaard Brouer

On Thu, 17 Nov 2016 10:51:23 -0800
Eric Dumazet  wrote:

> On Thu, 2016-11-17 at 19:30 +0100, Jesper Dangaard Brouer wrote:
> 
> > The point is I can see a socket Send-Q forming, thus we do know the
> > application has something to send, and thus a possibility for
> > non-opportunistic bulking. Allowing/implementing bulk enqueue from
> > the socket layer into the qdisc layer should be fairly simple (and
> > the rest of xmit_more is already in place).
> 
> 
> As I said, you are fooled by TX completions.

Obviously TX completions play a role yes, and I bet I can adjust the
TX completion to cause xmit_more to happen, at the expense of
introducing added latency.

The point is the "bloated" spinlock in __dev_queue_xmit is still caused
by the MMIO tailptr/doorbell.  The added cost occurs when enqueueing
packets, and results in the inability to get enough packets into the
qdisc for xmit_more to kick in (on my system).  I argue that a bulk
enqueue API would allow us to get past the hurdle of transitioning into
xmit_more mode more easily.


> Please make sure to increase the sndbuf limits !
> 
> echo 2129920 >/proc/sys/net/core/wmem_default

Testing with this makes no difference.

 $ grep -H . /proc/sys/net/core/wmem_default
 /proc/sys/net/core/wmem_default:2129920


> lpaa23:~# sar -n DEV 1 10|grep eth1
>              IFACE  rxpck/s    txpck/s  rxkB/s     txkB/s  rxcmp/s  txcmp/s  rxmcst/s
> 10:49:25      eth1     7.00 9273283.00    0.61 2187214.90     0.00     0.00      0.00
> 10:49:26      eth1     1.00 9230795.00    0.06 2176787.57     0.00     0.00      1.00
> 10:49:27      eth1     2.00 9247906.00    0.17 2180915.45     0.00     0.00      0.00
> 10:49:28      eth1     3.00 9246542.00    0.23 2180790.38     0.00     0.00      1.00
> Average:      eth1     2.50 9018045.70    0.25 2126893.82     0.00     0.00      0.50

Very impressive numbers: 9.2 Mpps TX.

What is this test?  What kind of traffic? Multiple CPUs?


> lpaa23:~# ethtool -S eth1|grep more; sleep 1;ethtool -S eth1|grep more
>  xmit_more: 2251366909
>  xmit_more: 2256011392
> 
> lpaa23:~# echo 2256011392-2251366909 | bc
> 4644483

The xmit_more definitely works on your system, but I cannot get it to
"kick in" on my setup.  Once xmit_more is active, the "bloated"
spinlock problem should go away.


(Tests with "udp_flood --pmtu 3 --send")

Forcing TX completion to happen on the same CPU, no xmit_more:

 ~/git/network-testing/bin/ethtool_stats.pl --sec 2 --dev mlx5p2
 Show adapter(s) (mlx5p2) statistics (ONLY that changed!)
 Ethtool(mlx5p2  ) stat:104592908 (104,592,908) <= tx0_bytes /sec
 Ethtool(mlx5p2  ) stat:39059 ( 39,059) <= tx0_nop /sec
 Ethtool(mlx5p2  ) stat:  1743215 (  1,743,215) <= tx0_packets /sec 
 Ethtool(mlx5p2  ) stat:104719986 (104,719,986) <= tx_bytes /sec
 Ethtool(mlx5p2  ) stat:111774540 (111,774,540) <= tx_bytes_phy /sec
 Ethtool(mlx5p2  ) stat:  1745333 (  1,745,333) <= tx_csum_partial /sec
 Ethtool(mlx5p2  ) stat:  1745333 (  1,745,333) <= tx_packets /sec
 Ethtool(mlx5p2  ) stat:  1746477 (  1,746,477) <= tx_packets_phy /sec
 Ethtool(mlx5p2  ) stat:111483434 (111,483,434) <= tx_prio1_bytes /sec
 Ethtool(mlx5p2  ) stat:  1741928 (  1,741,928) <= tx_prio1_packets /sec

Forcing TX completion to happen on remote CPU, some xmit_more:

 Show adapter(s) (mlx5p2) statistics (ONLY that changed!)
 Ethtool(mlx5p2  ) stat:128485892 (128,485,892) <= tx0_bytes /sec
 Ethtool(mlx5p2  ) stat:31840 ( 31,840) <= tx0_nop /sec
 Ethtool(mlx5p2  ) stat:  2141432 (  2,141,432) <= tx0_packets /sec
 Ethtool(mlx5p2  ) stat:  350 (350) <= tx0_xmit_more /sec
 Ethtool(mlx5p2  ) stat:128486459 (128,486,459) <= tx_bytes /sec
 Ethtool(mlx5p2  ) stat:137052191 (137,052,191) <= tx_bytes_phy /sec
 Ethtool(mlx5p2  ) stat:  2141441 (  2,141,441) <= tx_csum_partial /sec
 Ethtool(mlx5p2  ) stat:  2141441 (  2,141,441) <= tx_packets /sec
 Ethtool(mlx5p2  ) stat:  2141441 (  2,141,441) <= tx_packets_phy /sec
 Ethtool(mlx5p2  ) stat:137051300 (137,051,300) <= tx_prio1_bytes /sec
 Ethtool(mlx5p2  ) stat:  2141427 (  2,141,427) <= tx_prio1_packets /sec
 Ethtool(mlx5p2  ) stat:  350 (350) <= tx_xmit_more /sec



>    PerfTop:   76969 irqs/sec  kernel:96.6%  exact: 100.0% [4000Hz cycles:pp], (all, 48 CPUs)
> ------------------------------------------------------------------------------
> 11.64%  [kernel]  [k] skb_set_owner_w   
>  6.21%  [kernel]  [k] queued_spin_lock_slowpath 
>  4.76%  [kernel]  [k] _raw_spin_lock
>  4.40%  [kernel]  [k] __ip_make_skb 
>  3.10%  [kernel]  [k] sock_wfree
>  2.87%  [kernel]  [k] ipt_do_table  
>  2.76%  [kernel]  [k] fq_dequeue  

Re: Netperf UDP issue with connected sockets

2016-11-18 Thread Jesper Dangaard Brouer
On Thu, 17 Nov 2016 13:44:02 -0800
Eric Dumazet  wrote:

> On Thu, 2016-11-17 at 22:19 +0100, Jesper Dangaard Brouer wrote:
> 
> > 
> > Maybe you can share your udp flood "udpsnd" program source?  
> 
> Very ugly. This is based on what I wrote when tracking the UDP v6
> checksum bug (4f2e4ad56a65f3b7d64c258e373cb71e8d2499f4 net: mangle zero
> checksum in skb_checksum_help()), because netperf sends the same message
> over and over...

Thanks a lot, hope you don't mind; I added the code to my github repo:
 https://github.com/netoptimizer/network-testing/blob/master/src/udp_snd.c

So I identified the difference, and the reason behind the route lookups.
Your program is using send() and I was using sendmsg().  Given that
udp_flood is designed to test different calls, I simply added --send as
a new possibility:
 https://github.com/netoptimizer/network-testing/commit/16166c2cd1fa8

If I use --write instead, then I can also avoid the fib_table_lookup
and __ip_route_output_key_hash calls.


> Use -d 2   to remove the ip_idents_reserve() overhead.

#define IP_PMTUDISC_DO  2 /* Always DF  */

Added a --pmtu option to my udp_flood program
 https://github.com/netoptimizer/network-testing/commit/23a78caf4bb5b

 
> #define _GNU_SOURCE
> 
> #include <arpa/inet.h>
> #include <errno.h>
> #include <error.h>
> #include <netinet/in.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/socket.h>
> #include <sys/types.h>
> #include <unistd.h>
> 
> char buffer[1400];
> 
> int main(int argc, char** argv) {
>   int fd, i;
>   struct sockaddr_in6 addr;
>   char *host = "2002:af6:798::1";
>   int family = AF_INET6;
>   int discover = -1;
> 
>   while ((i = getopt(argc, argv, "4H:d:")) != -1) {
> switch (i) {
> case 'H': host = optarg; break;
> case '4': family = AF_INET; break;
> case 'd': discover = atoi(optarg); break;
> }
>   }
>   fd = socket(family, SOCK_DGRAM, 0);
>   if (fd < 0)
> error(1, errno, "failed to create socket");
>   if (discover != -1)
> setsockopt(fd, SOL_IP, IP_MTU_DISCOVER,
>&discover, sizeof(discover));
> 
>   memset(&addr, 0, sizeof(addr));
>   if (family == AF_INET6) {
> addr.sin6_family = AF_INET6;
> addr.sin6_port = htons(9);
>   inet_pton(family, host, (void *)&addr.sin6_addr.s6_addr);
>   } else {
> struct sockaddr_in *in = (struct sockaddr_in *)&addr;
> in->sin_family = family;
> in->sin_port = htons(9);
>   inet_pton(family, host, &in->sin_addr);
>   }
>   connect(fd, (struct sockaddr *)&addr,
>   (family == AF_INET6) ? sizeof(addr) :
>  sizeof(struct sockaddr_in));
>   memset(buffer, 1, 1400);
>   for (i = 0; i < 65536; i++) {
> memcpy(buffer, &i, sizeof(i));
> send(fd, buffer, 100 + rand() % 200, 0);

Using send() avoids the fib_table_lookup, on a connected UDP socket.

>   }
>   return 0;
> }


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Rick Jones

On 11/17/2016 04:37 PM, Julian Anastasov wrote:

On Thu, 17 Nov 2016, Rick Jones wrote:


raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf -F src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472

...

socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0


connected socket can benefit from dst cached in socket
but not if SO_DONTROUTE is set. If we do not want to send packets
via gateway this -l 1 should help but I don't see IP_TTL setsockopt
in your first example with connect() to 127.0.0.1.

Also, maybe there can be another default: if -l is used to
specify TTL then SO_DONTROUTE should not be set. I.e. we should
avoid SO_DONTROUTE, if possible.


The global -l option specifies the duration of the test.  It doesn't 
specify the TTL of the IP datagrams being generated by the actions of 
the test.


I resisted setting SO_DONTROUTE for a number of years after the first 
instance of UDP_STREAM being used in link up/down testing took out a 
company's network (including security camera feeds to galactic HQ), but 
at this point I'm likely to keep it in there because there ended up 
being a second such incident.  It is set only for UDP_STREAM.  It isn't 
set for UDP_RR or TCP_*.  And for UDP_STREAM it can be overridden by the 
test-specific -R option.


happy benchmarking,

rick jones


Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Julian Anastasov

Hello,

On Thu, 17 Nov 2016, Rick Jones wrote:

> raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf -F src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472
> 
> ...
> 
> socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
> getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
> getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
> setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0

connected socket can benefit from dst cached in socket
but not if SO_DONTROUTE is set. If we do not want to send packets
via gateway this -l 1 should help but I don't see IP_TTL setsockopt
in your first example with connect() to 127.0.0.1.

Also, maybe there can be another default: if -l is used to
specify TTL then SO_DONTROUTE should not be set. I.e. we should
avoid SO_DONTROUTE, if possible.

Regards

--
Julian Anastasov 


Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Rick Jones

On 11/17/2016 01:44 PM, Eric Dumazet wrote:

because netperf sends the same message
over and over...


Well, sort of, by default.  That can be altered to a degree.

The global -F option should cause netperf to fill the buffers in its 
send ring with data from the specified file.  The number of buffers in 
the send ring can be controlled via the global -W option.  The number of 
elements in the ring will default to one more than the initial SO_SNDBUF 
size divided by the send size.


raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf -F src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472


...

socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0

setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0
setsockopt(4, SOL_IP, IP_RECVERR, [1], 4) = 0
open("src/nettest_omni.c", O_RDONLY)= 5
fstat(5, {st_dev=makedev(8, 2), st_ino=82075297, st_mode=S_IFREG|0664, st_nlink=1, st_uid=1000, st_gid=1000, st_blksize=4096, st_blocks=456, st_size=230027, st_atime=2016/11/16-09:49:29, st_mtime=2016/11/16-09:49:24, st_ctime=2016/11/16-09:49:24}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3099f62000

read(5, "#ifdef HAVE_CONFIG_H\n#include "..., ...) = ...
rt_sigaction(SIGALRM, {0x402ea6, [ALRM], SA_RESTORER|SA_INTERRUPT, 0x7f30994a7cb0}, NULL, 8) = 0
rt_sigaction(SIGINT, {0x402ea6, [INT], SA_RESTORER|SA_INTERRUPT, 0x7f30994a7cb0}, NULL, 8) = 0

alarm(1)= 0
sendto(4, "#ifdef HAVE_CONFIG_H\n#include "..., 1472, 0, {sa_family=AF_INET, sin_port=htons(58088), sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, " used\\n\\\n-m local,remote   S"..., 1472, 0, {sa_family=AF_INET, sin_port=htons(58088), sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, " do here but clear the legacy fl"..., 1472, 0, {sa_family=AF_INET, sin_port=htons(58088), sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, "e before we scan the test-specif"..., 1472, 0, {sa_family=AF_INET, sin_port=htons(58088), sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, "\n\n\tfprintf(where,\n\t\ttput_fmt_1_l"..., 1472, 0, {sa_family=AF_INET, sin_port=htons(58088), sin_addr=inet_addr("127.0.0.1")}, 16) = 1472


Of course, it will continue to send the same messages from the send_ring 
over and over instead of putting different data into the buffers each 
time, but if one has a sufficiently large -W option specified...

happy benchmarking,

rick jones


Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Alexander Duyck
On Thu, Nov 17, 2016 at 9:34 AM, David Laight  wrote:
> From: Jesper Dangaard Brouer
>> Sent: 17 November 2016 14:58
>> On Thu, 17 Nov 2016 06:17:38 -0800
>> Eric Dumazet  wrote:
>>
>> > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
>> >
>> > > I can see that qdisc layer does not activate xmit_more in this case.
>> > >
>> >
>> > Sure. Not enough pressure from the sender(s).
>> >
>> > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
>> > limit is kept at a small value.
>> >
>> > (BTW not all NIC have expensive doorbells)
>>
>> I believe this NIC mlx5 (50G edition) does.
>>
>> I'm seeing UDP TX of 1656017.55 pps, which is per packet:
>> 2414 cycles(tsc) 603.86 ns
>>
>> Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
>>
>>  Samples: 56K of event 'cycles', Event count (approx.): 51613832267
>>Overhead  CommandShared ObjectSymbol
>>  +8.92%  udp_flood  [kernel.vmlinux] [k] _raw_spin_lock
>>- _raw_spin_lock
>>   + 90.78% __dev_queue_xmit
>>   + 7.83% dev_queue_xmit
>>   + 1.30% ___slab_alloc
>>  +5.59%  udp_flood  [kernel.vmlinux] [k] skb_set_owner_w
>>  +4.77%  udp_flood  [mlx5_core]  [k] mlx5e_sq_xmit
>>  +4.09%  udp_flood  [kernel.vmlinux] [k] fib_table_lookup
>>  +4.00%  swapper[mlx5_core]  [k] mlx5e_poll_tx_cq
>>  +3.11%  udp_flood  [kernel.vmlinux] [k] __ip_route_output_key_hash
>>  +2.49%  swapper[kernel.vmlinux] [k] __slab_free
>>
>> In this setup the spinlock in __dev_queue_xmit should be uncongested.
>> An uncongested spin_lock+unlock cost 32 cycles(tsc) 8.198 ns on this system.
>>
>> But 8.92% of the time is spend on it, which corresponds to a cost of 215
>> cycles (2414*0.0892).  This cost is too high, thus something else is
>> going on... I claim this mysterious extra cost is the tailptr/doorbell.
>
> Try adding code to ring the doorbell twice.
> If this doesn't slow things down then it isn't (likely to be) responsible
> for the delay you are seeing.
>
> David
>

The problem isn't only the doorbell.  It is that the doorbell plus a
locked transaction on x86 results in a long wait until the doorbell
write has been completed.

You could batch a bunch of doorbell writes together and it isn't an
issue unless you do something like writel(), wmb(), writel(), wmb(),
then you will see the effect double since the write memory barrier is
what is forcing the delays.

- Alex


Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Eric Dumazet
On Thu, 2016-11-17 at 22:19 +0100, Jesper Dangaard Brouer wrote:

> 
> Maybe you can share your udp flood "udpsnd" program source?

Very ugly. This is based on what I wrote when tracking the UDP v6
checksum bug (4f2e4ad56a65f3b7d64c258e373cb71e8d2499f4 net: mangle zero
checksum in skb_checksum_help()), because netperf sends the same message
over and over...


Use -d 2   to remove the ip_idents_reserve() overhead.


#define _GNU_SOURCE

#include <arpa/inet.h>
#include <errno.h>
#include <error.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

char buffer[1400];

int main(int argc, char** argv) {
  int fd, i;
  struct sockaddr_in6 addr;
  char *host = "2002:af6:798::1";
  int family = AF_INET6;
  int discover = -1;

  while ((i = getopt(argc, argv, "4H:d:")) != -1) {
switch (i) {
case 'H': host = optarg; break;
case '4': family = AF_INET; break;
case 'd': discover = atoi(optarg); break;
}
  }
  fd = socket(family, SOCK_DGRAM, 0);
  if (fd < 0)
error(1, errno, "failed to create socket");
  if (discover != -1)
setsockopt(fd, SOL_IP, IP_MTU_DISCOVER,
   &discover, sizeof(discover));

  memset(&addr, 0, sizeof(addr));
  if (family == AF_INET6) {
  addr.sin6_family = AF_INET6;
  addr.sin6_port = htons(9);
  inet_pton(family, host, (void *)&addr.sin6_addr.s6_addr);
  } else {
struct sockaddr_in *in = (struct sockaddr_in *)&addr;
in->sin_family = family;
in->sin_port = htons(9);
  inet_pton(family, host, &in->sin_addr);
  }
  connect(fd, (struct sockaddr *)&addr,
  (family == AF_INET6) ? sizeof(addr) :
 sizeof(struct sockaddr_in));
  memset(buffer, 1, 1400);
  for (i = 0; i < 65536; i++) {
memcpy(buffer, &i, sizeof(i));
send(fd, buffer, 100 + rand() % 200, 0);
  }
  return 0;
}




Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Jesper Dangaard Brouer
On Thu, 17 Nov 2016 10:51:23 -0800
Eric Dumazet  wrote:

> On Thu, 2016-11-17 at 19:30 +0100, Jesper Dangaard Brouer wrote:
> 
> > The point is I can see a socket Send-Q forming, thus we do know the
> > application has something to send, and thus a possibility for
> > non-opportunistic bulking. Allowing/implementing bulk enqueue from
> > the socket layer into the qdisc layer should be fairly simple (and
> > the rest of xmit_more is already in place).
> 
> 
> As I said, you are fooled by TX completions.
> 
> Please make sure to increase the sndbuf limits !
> 
> echo 2129920 >/proc/sys/net/core/wmem_default
> 
> lpaa23:~# sar -n DEV 1 10|grep eth1
> 10:49:25 eth1  7.00 9273283.00  0.61 2187214.90  0.00  0.00  0.00
> 10:49:26 eth1  1.00 9230795.00  0.06 2176787.57  0.00  0.00  1.00
> 10:49:27 eth1  2.00 9247906.00  0.17 2180915.45  0.00  0.00  0.00
> 10:49:28 eth1  3.00 9246542.00  0.23 2180790.38  0.00  0.00  1.00
> 10:49:29 eth1  1.00 9239218.00  0.06 2179044.83  0.00  0.00  0.00
> 10:49:30 eth1  3.00 9248775.00  0.23 2181257.84  0.00  0.00  1.00
> 10:49:31 eth1  4.00 9225471.00  0.65 2175772.75  0.00  0.00  0.00
> 10:49:32 eth1  2.00 9253536.00  0.33 2182666.44  0.00  0.00  1.00
> 10:49:33 eth1  1.00 9265900.00  0.06 2185598.40  0.00  0.00  0.00
> 10:49:34 eth1  1.00 6949031.00  0.06 1638889.63  0.00  0.00  1.00
> Average: eth1  2.50 9018045.70  0.25 2126893.82  0.00  0.00  0.50
> 
> 
> lpaa23:~# ethtool -S eth1|grep more; sleep 1;ethtool -S eth1|grep more
>  xmit_more: 2251366909
>  xmit_more: 2256011392
> 
> lpaa23:~# echo 2256011392-2251366909 | bc
> 4644483

xmit_more does not happen that frequently in my setup, though it does
happen sometimes.  And I do monitor with "ethtool -S".

~/git/network-testing/bin/ethtool_stats.pl --sec 2 --dev mlx5p2
Show adapter(s) (mlx5p2) statistics (ONLY that changed!)
Ethtool(mlx5p2  ) stat: 92900913 ( 92,900,913) <= tx0_bytes /sec
Ethtool(mlx5p2  ) stat:36073 ( 36,073) <= tx0_nop /sec
Ethtool(mlx5p2  ) stat:  1548349 (  1,548,349) <= tx0_packets /sec
Ethtool(mlx5p2  ) stat:1 (  1) <= tx0_xmit_more /sec
Ethtool(mlx5p2  ) stat: 92884899 ( 92,884,899) <= tx_bytes /sec
Ethtool(mlx5p2  ) stat: 99297696 ( 99,297,696) <= tx_bytes_phy /sec
Ethtool(mlx5p2  ) stat:  1548082 (  1,548,082) <= tx_csum_partial /sec
Ethtool(mlx5p2  ) stat:  1548082 (  1,548,082) <= tx_packets /sec
Ethtool(mlx5p2  ) stat:  1551527 (  1,551,527) <= tx_packets_phy /sec
Ethtool(mlx5p2  ) stat: 99076658 ( 99,076,658) <= tx_prio1_bytes /sec
Ethtool(mlx5p2  ) stat:  1548073 (  1,548,073) <= tx_prio1_packets /sec
Ethtool(mlx5p2  ) stat: 92936078 ( 92,936,078) <= tx_vport_unicast_bytes /sec
Ethtool(mlx5p2  ) stat:  1548934 (  1,548,934) <= tx_vport_unicast_packets /sec
Ethtool(mlx5p2  ) stat:1 (  1) <= tx_xmit_more /sec

(after several attempts I got:)
$ ethtool -S mlx5p2|grep more; sleep 1;ethtool -S mlx5p2|grep more
 tx_xmit_more: 14048
 tx0_xmit_more: 14048
 tx_xmit_more: 14049
 tx0_xmit_more: 14049

This was with:
 $ grep -H . /proc/sys/net/core/wmem_default
 /proc/sys/net/core/wmem_default:2129920

>    PerfTop:   76969 irqs/sec  kernel:96.6%  exact: 100.0% [4000Hz cycles:pp], (all, 48 CPUs)
> ------------------------------------------------------------------------------
> 
> 11.64%  [kernel]  [k] skb_set_owner_w   
>  6.21%  [kernel]  [k] queued_spin_lock_slowpath 
>  4.76%  [kernel]  [k] _raw_spin_lock
>  4.40%  [kernel]  [k] __ip_make_skb 
>  3.10%  [kernel]  [k] sock_wfree
>  2.87%  [kernel]  [k] ipt_do_table  
>  2.76%  [kernel]  [k] fq_dequeue
>  2.71%  [kernel]  [k] mlx4_en_xmit  
>  2.50%  [kernel]  [k] __dev_queue_xmit  
>  2.29%  [kernel]  [k] __ip_append_data.isra.40  
>  2.28%  [kernel]  [k] udp_sendmsg   
>  2.01%  [kernel]  [k] __alloc_skb   
>  1.90%  [kernel]  [k] napi_consume_skb  
>  1.63%  [kernel]  [k] udp_send_skb  
>  1.62%  [kernel]  [k] skb_release_data  
>  1.62%  [kernel]  [k] entry_SYSCALL_64_fastpath 
>  1.56%  [kernel]  [k] dev_hard_start_xmit   
>  1.55%  udpsnd[.] __libc_send   
>  1.48%  [kernel]  [k] netif_skb_features
>  1.42%  [kernel]  [k] __qdisc_run   
>  1.35%  [kernel]  [k] sk_dst_check  
> 

Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Eric Dumazet
On Thu, 2016-11-17 at 19:30 +0100, Jesper Dangaard Brouer wrote:

> The point is I can see a socket Send-Q forming, thus we do know the
> application has something to send, and thus a possibility for
> non-opportunistic bulking. Allowing/implementing bulk enqueue from the
> socket layer into the qdisc layer should be fairly simple (and the rest
> of xmit_more is already in place).


As I said, you are fooled by TX completions.

Please make sure to increase the sndbuf limits !

echo 2129920 >/proc/sys/net/core/wmem_default

lpaa23:~# sar -n DEV 1 10|grep eth1
10:49:25 eth1  7.00 9273283.00  0.61 2187214.90  0.00  0.00  0.00
10:49:26 eth1  1.00 9230795.00  0.06 2176787.57  0.00  0.00  1.00
10:49:27 eth1  2.00 9247906.00  0.17 2180915.45  0.00  0.00  0.00
10:49:28 eth1  3.00 9246542.00  0.23 2180790.38  0.00  0.00  1.00
10:49:29 eth1  1.00 9239218.00  0.06 2179044.83  0.00  0.00  0.00
10:49:30 eth1  3.00 9248775.00  0.23 2181257.84  0.00  0.00  1.00
10:49:31 eth1  4.00 9225471.00  0.65 2175772.75  0.00  0.00  0.00
10:49:32 eth1  2.00 9253536.00  0.33 2182666.44  0.00  0.00  1.00
10:49:33 eth1  1.00 9265900.00  0.06 2185598.40  0.00  0.00  0.00
10:49:34 eth1  1.00 6949031.00  0.06 1638889.63  0.00  0.00  1.00
Average: eth1  2.50 9018045.70  0.25 2126893.82  0.00  0.00  0.50


lpaa23:~# ethtool -S eth1|grep more; sleep 1;ethtool -S eth1|grep more
 xmit_more: 2251366909
 xmit_more: 2256011392

lpaa23:~# echo 2256011392-2251366909 | bc
4644483

   PerfTop:   76969 irqs/sec  kernel:96.6%  exact: 100.0% [4000Hz cycles:pp], (all, 48 CPUs)
---

11.64%  [kernel]  [k] skb_set_owner_w   
 6.21%  [kernel]  [k] queued_spin_lock_slowpath 
 4.76%  [kernel]  [k] _raw_spin_lock
 4.40%  [kernel]  [k] __ip_make_skb 
 3.10%  [kernel]  [k] sock_wfree
 2.87%  [kernel]  [k] ipt_do_table  
 2.76%  [kernel]  [k] fq_dequeue
 2.71%  [kernel]  [k] mlx4_en_xmit  
 2.50%  [kernel]  [k] __dev_queue_xmit  
 2.29%  [kernel]  [k] __ip_append_data.isra.40  
 2.28%  [kernel]  [k] udp_sendmsg   
 2.01%  [kernel]  [k] __alloc_skb   
 1.90%  [kernel]  [k] napi_consume_skb  
 1.63%  [kernel]  [k] udp_send_skb  
 1.62%  [kernel]  [k] skb_release_data  
 1.62%  [kernel]  [k] entry_SYSCALL_64_fastpath 
 1.56%  [kernel]  [k] dev_hard_start_xmit   
 1.55%  udpsnd[.] __libc_send   
 1.48%  [kernel]  [k] netif_skb_features
 1.42%  [kernel]  [k] __qdisc_run   
 1.35%  [kernel]  [k] sk_dst_check  
 1.33%  [kernel]  [k] sock_def_write_space  
 1.30%  [kernel]  [k] kmem_cache_alloc_node_trace   
 1.29%  [kernel]  [k] __local_bh_enable_ip  
 1.21%  [kernel]  [k] copy_user_enhanced_fast_string
 1.08%  [kernel]  [k] __kmalloc_reserve.isra.40 
 1.08%  [kernel]  [k] SYSC_sendto   
 1.07%  [kernel]  [k] kmem_cache_alloc_node 
 0.95%  [kernel]  [k] ip_finish_output2 
 0.95%  [kernel]  [k] ktime_get 
 0.91%  [kernel]  [k] validate_xmit_skb 
 0.88%  [kernel]  [k] sock_alloc_send_pskb  
 0.82%  [kernel]  [k] sock_sendmsg  





Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Jesper Dangaard Brouer
On Thu, 17 Nov 2016 08:21:19 -0800
Eric Dumazet  wrote:

> On Thu, 2016-11-17 at 15:57 +0100, Jesper Dangaard Brouer wrote:
> > On Thu, 17 Nov 2016 06:17:38 -0800
> > Eric Dumazet  wrote:
> >   
> > > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
> > >   
> > > > I can see that qdisc layer does not activate xmit_more in this case.
> > > > 
> > > 
> > > Sure. Not enough pressure from the sender(s).
> > > 
> > > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
> > > limit is kept at a small value.
> > > 
> > > (BTW not all NIC have expensive doorbells)  
> > 
> > I believe this NIC mlx5 (50G edition) does.
> > 
> > I'm seeing UDP TX of 1656017.55 pps, which is per packet:
> > 2414 cycles(tsc) 603.86 ns
> > 
> > Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
> > 
> >  Samples: 56K of event 'cycles', Event count (approx.): 51613832267
> >Overhead  CommandShared ObjectSymbol
> >  +8.92%  udp_flood  [kernel.vmlinux] [k] _raw_spin_lock
> >- _raw_spin_lock
> >   + 90.78% __dev_queue_xmit
> >   + 7.83% dev_queue_xmit
> >   + 1.30% ___slab_alloc
> >  +5.59%  udp_flood  [kernel.vmlinux] [k] skb_set_owner_w  
> 
> Does TX completion happens on same cpu ?
> 
> >  +4.77%  udp_flood  [mlx5_core]  [k] mlx5e_sq_xmit
> >  +4.09%  udp_flood  [kernel.vmlinux] [k] fib_table_lookup  
> 
> Why fib_table_lookup() is used with connected UDP flow ???

Don't know. I would be interested in hints on how to avoid this!

I also see it with netperf, and my udp_flood code is here:
 https://github.com/netoptimizer/network-testing/blob/master/src/udp_flood.c


> >  +4.00%  swapper[mlx5_core]  [k] mlx5e_poll_tx_cq
> >  +3.11%  udp_flood  [kernel.vmlinux] [k] 
> > __ip_route_output_key_hash  
> 
> Same here, this is suspect.

It is the function calling fib_table_lookup(), and it is called by 
ip_route_output_flow().
 
> >  +2.49%  swapper[kernel.vmlinux] [k] __slab_free
> > 
> > In this setup the spinlock in __dev_queue_xmit should be uncongested.
> > An uncongested spin_lock+unlock cost 32 cycles(tsc) 8.198 ns on this system.
> > 
> > But 8.92% of the time is spend on it, which corresponds to a cost of 215
> > cycles (2414*0.0892).  This cost is too high, thus something else is
> > going on... I claim this mysterious extra cost is the tailptr/doorbell.  
> 
> Well, with no pressure, doorbell is triggered for each packet.
> 
> Since we can not predict that your application is going to send yet
> another packet one usec later, we can not avoid this.

The point is I can see a socket Send-Q forming, thus we do know the
application has something to send, and thus a possibility for
non-opportunistic bulking. Allowing/implementing bulk enqueue from the
socket layer into the qdisc layer should be fairly simple (and the rest
of xmit_more is already in place).


> Note that with the patches I am working on (busypolling extensions),
> we could decide to let the busypoller do the doorbells, say one every
> 10 usec (or after ~16 packets were queued).

Sounds interesting, but that is an opportunistic approach.

> But mlx4 uses two different NAPI for TX and RX, maybe mlx5 has the same
> strategy .

It is a possibility that TX completions were happening on another CPU
(but I don't think so for mlx5).

To rule that out, I reran the experiment making sure to pin everything
to CPU-0 and the results are the same.

sudo ethtool -L mlx5p2 combined 1

sudo sh -c '\
 for x in /proc/irq/*/mlx5*/../smp_affinity; do \
   echo 01 > $x; grep . -H $x; done \
'

$ taskset -c 0 ./udp_flood --sendto 198.18.50.1 --count $((10**9))

Perf report validating CPU in use:

$ perf report -g --no-children --sort cpu,comm,dso,symbol --stdio --call-graph none

# Overhead  CPU  Command    Shared Object     Symbol
# ........  ...  .........  ................  ......
#
 9.97%  000  udp_flood   [kernel.vmlinux] [k] _raw_spin_lock
 4.37%  000  udp_flood   [kernel.vmlinux] [k] fib_table_lookup
 3.97%  000  udp_flood   [mlx5_core]  [k] mlx5e_sq_xmit
 3.06%  000  udp_flood   [kernel.vmlinux] [k] __ip_route_output_key_hash
 2.51%  000  udp_flood   [mlx5_core]  [k] mlx5e_poll_tx_cq
 2.48%  000  udp_flood   [kernel.vmlinux] [k] copy_user_enhanced_fast_string
 2.47%  000  udp_flood   [kernel.vmlinux] [k] entry_SYSCALL_64
 2.42%  000  udp_flood   [kernel.vmlinux] [k] udp_sendmsg
 2.39%  000  udp_flood   [kernel.vmlinux] [k] __ip_append_data.isra.47
 2.29%  000  udp_flood   [k

Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Rick Jones

On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote:

time to try IP_MTU_DISCOVER ;)


To Rick, maybe you can find a good solution or option with Eric's hint,
to send appropriate sized UDP packets with Don't Fragment (DF).


Well, I suppose adding another setsockopt() to the data socket creation 
wouldn't be too difficult, along with another command-line option to 
cause it to happen.


Could we leave things as "make sure you don't need fragmentation when 
you use this" or would netperf have to start processing ICMP messages?


happy benchmarking,

rick jones



RE: Netperf UDP issue with connected sockets

2016-11-17 Thread David Laight
From: Jesper Dangaard Brouer
> Sent: 17 November 2016 14:58
> On Thu, 17 Nov 2016 06:17:38 -0800
> Eric Dumazet  wrote:
> 
> > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
> >
> > > I can see that qdisc layer does not activate xmit_more in this case.
> > >
> >
> > Sure. Not enough pressure from the sender(s).
> >
> > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
> > limit is kept at a small value.
> >
> > (BTW not all NIC have expensive doorbells)
> 
> I believe this NIC mlx5 (50G edition) does.
> 
> I'm seeing UDP TX of 1656017.55 pps, which is per packet:
> 2414 cycles(tsc) 603.86 ns
> 
> Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
> 
>  Samples: 56K of event 'cycles', Event count (approx.): 51613832267
>Overhead  CommandShared ObjectSymbol
>  +8.92%  udp_flood  [kernel.vmlinux] [k] _raw_spin_lock
>- _raw_spin_lock
>   + 90.78% __dev_queue_xmit
>   + 7.83% dev_queue_xmit
>   + 1.30% ___slab_alloc
>  +5.59%  udp_flood  [kernel.vmlinux] [k] skb_set_owner_w
>  +4.77%  udp_flood  [mlx5_core]  [k] mlx5e_sq_xmit
>  +4.09%  udp_flood  [kernel.vmlinux] [k] fib_table_lookup
>  +4.00%  swapper[mlx5_core]  [k] mlx5e_poll_tx_cq
>  +3.11%  udp_flood  [kernel.vmlinux] [k] __ip_route_output_key_hash
>  +2.49%  swapper[kernel.vmlinux] [k] __slab_free
> 
> In this setup the spinlock in __dev_queue_xmit should be uncongested.
> An uncongested spin_lock+unlock cost 32 cycles(tsc) 8.198 ns on this system.
> 
> But 8.92% of the time is spend on it, which corresponds to a cost of 215
> cycles (2414*0.0892).  This cost is too high, thus something else is
> going on... I claim this mysterious extra cost is the tailptr/doorbell.

Try adding code to ring the doorbell twice.
If this doesn't slow things down then it isn't (likely to be) responsible
for the delay you are seeing.

David



Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Jesper Dangaard Brouer
On Thu, 17 Nov 2016 06:17:38 -0800
Eric Dumazet  wrote:

> On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
> 
> > I can see that qdisc layer does not activate xmit_more in this case.
> >   
> 
> Sure. Not enough pressure from the sender(s).
> 
> The bottleneck is not the NIC or qdisc in your case, meaning that BQL
> limit is kept at a small value.
> 
> (BTW not all NIC have expensive doorbells)

I believe this NIC mlx5 (50G edition) does.

I'm seeing UDP TX of 1656017.55 pps, which is per packet:
2414 cycles(tsc) 603.86 ns

Perf top shows (with my own udp_flood, that avoids __ip_select_ident):

 Samples: 56K of event 'cycles', Event count (approx.): 51613832267
   Overhead  CommandShared ObjectSymbol
 +8.92%  udp_flood  [kernel.vmlinux] [k] _raw_spin_lock
   - _raw_spin_lock
  + 90.78% __dev_queue_xmit
  + 7.83% dev_queue_xmit
  + 1.30% ___slab_alloc
 +5.59%  udp_flood  [kernel.vmlinux] [k] skb_set_owner_w
 +4.77%  udp_flood  [mlx5_core]  [k] mlx5e_sq_xmit
 +4.09%  udp_flood  [kernel.vmlinux] [k] fib_table_lookup
 +4.00%  swapper[mlx5_core]  [k] mlx5e_poll_tx_cq
 +3.11%  udp_flood  [kernel.vmlinux] [k] __ip_route_output_key_hash
 +2.49%  swapper[kernel.vmlinux] [k] __slab_free

In this setup the spinlock in __dev_queue_xmit should be uncongested.
An uncongested spin_lock+unlock cost 32 cycles(tsc) 8.198 ns on this system.

But 8.92% of the time is spend on it, which corresponds to a cost of 215
cycles (2414*0.0892).  This cost is too high, thus something else is
going on... I claim this mysterious extra cost is the tailptr/doorbell.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Jesper Dangaard Brouer
On Thu, 17 Nov 2016 05:20:50 -0800
Eric Dumazet  wrote:

> On Thu, 2016-11-17 at 09:16 +0100, Jesper Dangaard Brouer wrote:
> 
> > 
> > I noticed there is a Send-Q, and the perf-top2 is _raw_spin_lock, which
> > looks like it comes from __dev_queue_xmit(), but we know from
> > experience that this stall is actually caused by writing the
> > tailptr/doorbell in the HW.  Thus, this could benefit a lot from
> > bulk/xmit_more into the qdisc layer.  
> 
> The Send-Q is there because of TX-completions being delayed a bit,
> because of IRQ mitigation.
> 
> (ethtool -c eth0)
> 
> It happens even if you do not have a qdisc in the first place.
> 
> And we do have xmit_more in the qdisc layer already.

I can see that qdisc layer does not activate xmit_more in this case.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


$ ethtool -c mlx5p4
Coalesce parameters for mlx5p4:
Adaptive RX: on  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 3
rx-frames: 32
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 16
tx-frames: 32
tx-usecs-irq: 0
tx-frames-irq: 0

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0



Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Eric Dumazet
On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:

> I can see that qdisc layer does not activate xmit_more in this case.
> 

Sure. Not enough pressure from the sender(s).

The bottleneck is not the NIC or qdisc in your case, meaning that BQL
limit is kept at a small value.

(BTW not all NIC have expensive doorbells)






Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Eric Dumazet
On Thu, 2016-11-17 at 09:16 +0100, Jesper Dangaard Brouer wrote:

> 
> I noticed there is a Send-Q, and the perf-top2 is _raw_spin_lock, which
> looks like it comes from __dev_queue_xmit(), but we know from
> experience that this stall is actually caused by writing the
> tailptr/doorbell in the HW.  Thus, this could benefit a lot from
> bulk/xmit_more into the qdisc layer.

The Send-Q is there because of TX-completions being delayed a bit,
because of IRQ mitigation.

(ethtool -c eth0)

It happens even if you do not have a qdisc in the first place.

And we do have xmit_more in the qdisc layer already.




Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Eric Dumazet
On Thu, 2016-11-17 at 15:57 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 17 Nov 2016 06:17:38 -0800
> Eric Dumazet  wrote:
> 
> > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
> > 
> > > I can see that qdisc layer does not activate xmit_more in this case.
> > >   
> > 
> > Sure. Not enough pressure from the sender(s).
> > 
> > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
> > limit is kept at a small value.
> > 
> > (BTW not all NIC have expensive doorbells)
> 
> I believe this NIC mlx5 (50G edition) does.
> 
> I'm seeing UDP TX of 1656017.55 pps, which is per packet:
> 2414 cycles(tsc) 603.86 ns
> 
> Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
> 
>  Samples: 56K of event 'cycles', Event count (approx.): 51613832267
>Overhead  CommandShared ObjectSymbol
>  +8.92%  udp_flood  [kernel.vmlinux] [k] _raw_spin_lock
>- _raw_spin_lock
>   + 90.78% __dev_queue_xmit
>   + 7.83% dev_queue_xmit
>   + 1.30% ___slab_alloc
>  +5.59%  udp_flood  [kernel.vmlinux] [k] skb_set_owner_w

Does TX completion happens on same cpu ?

>  +4.77%  udp_flood  [mlx5_core]  [k] mlx5e_sq_xmit
>  +4.09%  udp_flood  [kernel.vmlinux] [k] fib_table_lookup

Why fib_table_lookup() is used with connected UDP flow ???

>  +4.00%  swapper[mlx5_core]  [k] mlx5e_poll_tx_cq
>  +3.11%  udp_flood  [kernel.vmlinux] [k] __ip_route_output_key_hash

Same here, this is suspect.

>  +2.49%  swapper[kernel.vmlinux] [k] __slab_free
> 
> In this setup the spinlock in __dev_queue_xmit should be uncongested.
> An uncongested spin_lock+unlock cost 32 cycles(tsc) 8.198 ns on this system.
> 
> But 8.92% of the time is spend on it, which corresponds to a cost of 215
> cycles (2414*0.0892).  This cost is too high, thus something else is
> going on... I claim this mysterious extra cost is the tailptr/doorbell.

Well, with no pressure, doorbell is triggered for each packet.

Since we can not predict that your application is going to send yet
another packet one usec later, we can not avoid this.

Note that with the patches I am working on (busypolling extensions),
we could decide to let the busypoller do the doorbells, say one every
10 usec (or after ~16 packets were queued).

But mlx4 uses two different NAPI contexts for TX and RX; maybe mlx5 has
the same strategy.





Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Jesper Dangaard Brouer
On Wed, 16 Nov 2016 16:34:09 -0800
Eric Dumazet  wrote:

> On Wed, 2016-11-16 at 23:40 +0100, Jesper Dangaard Brouer wrote:
> 
> > Using -R 1 does not seem to help remove __ip_select_ident()
> > 
> > Samples: 56K of event 'cycles', Event count (approx.): 78628132661
> >   Overhead  CommandShared ObjectSymbol
> > +9.11%  netperf[kernel.vmlinux] [k] __ip_select_ident
> > +6.98%  netperf[kernel.vmlinux] [k] _raw_spin_lock
> > +6.21%  swapper[mlx5_core]  [k] mlx5e_poll_tx_cq
> > +5.03%  netperf[kernel.vmlinux] [k] copy_user_enhanced_fast_string
> > +4.69%  netperf[kernel.vmlinux] [k] __ip_make_skb
> > +4.63%  netperf[kernel.vmlinux] [k] skb_set_owner_w
> > +4.15%  swapper[kernel.vmlinux] [k] __slab_free
> > +3.80%  netperf[mlx5_core]  [k] mlx5e_sq_xmit
> > +2.00%  swapper[kernel.vmlinux] [k] sock_wfree
> > +1.94%  netperfnetperf  [.] send_data
> > +1.92%  netperfnetperf  [.] send_omni_inner  
> 
> Check "ss -nu"  ?
> 
> You will see if sockets are connected (present in ss output or not)

Tested different versions of netperf, commands used below signature:

 netperf-2.6.0: connected "broken"
 netperf-2.7.0: connected works
 SVN-r709 : connected works

I noticed there is a Send-Q, and the #2 perf top entry is _raw_spin_lock,
which looks like it comes from __dev_queue_xmit(), but we know from
experience that this stall is actually caused by writing the
tailptr/doorbell in the HW.  Thus, this could benefit a lot from
bulk/xmit_more into the qdisc layer.


> UDP being connected does not prevent __ip_select_ident() being used.
> 
> if ((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) {
> 
> So you need IP_DF being set, and skb->ignore_df being 0

Thanks for explaining that! :-)

http://lxr.free-electrons.com/source/include/net/ip.h?v=4.8#L332
http://lxr.free-electrons.com/source/net/ipv4/ip_output.c?v=4.8#L449

Netperf UDP_STREAM by default sends 64K packets that get fragmented,
which is unfortunate because people end up testing a kernel code path
they didn't expect.  That is why I use the option "-- -m 1472".


> time to try IP_MTU_DISCOVER ;)  

To Rick, maybe you can find a good solution or option with Eric's hint,
to send appropriate sized UDP packets with Don't Fragment (DF).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

Testing with ss -nua

$ /usr/local/stow/netperf-2.6.0-demo/bin/netperf -H 198.18.50.1 -t UDP_STREAM -l 3 -- -m 1472 -n -N > /dev/null & sleep 1; ss -una

State  Recv-Q Send-Q   Local Address:Port  Peer Address:Port
UNCONN 0  11520*:54589*:*

$ /usr/local/stow/netperf-2.7.0-demo/bin/netperf -H 198.18.50.1 -t UDP_STREAM -l 3 -- -m 1472 -n -N > /dev/null & sleep 1; ss -una
State  Recv-Q Send-Q   Local Address:Port  Peer Address:Port
ESTAB  0  18432  198.18.50.3:46803  198.18.50.1:51851

$ ~/tools/netperf2-svn/src/netperf -H 198.18.50.1 -t UDP_STREAM -l 3 -- -m 1472 -n -N > /dev/null & sleep 1; ss -una
State  Recv-Q Send-Q   Local Address:Port  Peer Address:Port
ESTAB  0  43776  198.18.50.3:42965  198.18.50.1:51948



Re: Netperf UDP issue with connected sockets

2016-11-16 Thread Eric Dumazet
On Wed, 2016-11-16 at 23:40 +0100, Jesper Dangaard Brouer wrote:

> Using -R 1 does not seem to help remove __ip_select_ident()
> 
> Samples: 56K of event 'cycles', Event count (approx.): 78628132661
>   Overhead  CommandShared ObjectSymbol
> +9.11%  netperf[kernel.vmlinux] [k] __ip_select_ident
> +6.98%  netperf[kernel.vmlinux] [k] _raw_spin_lock
> +6.21%  swapper[mlx5_core]  [k] mlx5e_poll_tx_cq
> +5.03%  netperf[kernel.vmlinux] [k] copy_user_enhanced_fast_string
> +4.69%  netperf[kernel.vmlinux] [k] __ip_make_skb
> +4.63%  netperf[kernel.vmlinux] [k] skb_set_owner_w
> +4.15%  swapper[kernel.vmlinux] [k] __slab_free
> +3.80%  netperf[mlx5_core]  [k] mlx5e_sq_xmit
> +2.00%  swapper[kernel.vmlinux] [k] sock_wfree
> +1.94%  netperfnetperf  [.] send_data
> +1.92%  netperfnetperf  [.] send_omni_inner

Check "ss -nu"  ?

You will see if sockets are connected (present in ss output or not)

UDP being connected does not prevent __ip_select_ident() being used.

if ((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) {

So you need IP_DF being set, and skb->ignore_df being 0

-> time to try IP_MTU_DISCOVER ;)





Re: Netperf UDP issue with connected sockets

2016-11-16 Thread Rick Jones

On 11/16/2016 02:40 PM, Jesper Dangaard Brouer wrote:

On Wed, 16 Nov 2016 09:46:37 -0800
Rick Jones  wrote:

It is a wild guess, but does setting SO_DONTROUTE affect whether or not
a connect() would have the desired effect?  That is there to protect
people from themselves (long story about people using UDP_STREAM to
stress improperly air-gapped systems during link up/down testing)
It can be disabled with a test-specific -R 1 option, so your netperf
command would become:

netperf -H 198.18.50.1 -t UDP_STREAM -l 120 -- -m 1472 -n -N -R 1


Using -R 1 does not seem to help remove __ip_select_ident()


Bummer.  It was a wild guess anyway, since I was seeing a connect() call 
on the data socket.



Samples: 56K of event 'cycles', Event count (approx.): 78628132661
  Overhead  CommandShared ObjectSymbol
+9.11%  netperf[kernel.vmlinux] [k] __ip_select_ident
+6.98%  netperf[kernel.vmlinux] [k] _raw_spin_lock
+6.21%  swapper[mlx5_core]  [k] mlx5e_poll_tx_cq
+5.03%  netperf[kernel.vmlinux] [k] copy_user_enhanced_fast_string
+4.69%  netperf[kernel.vmlinux] [k] __ip_make_skb
+4.63%  netperf[kernel.vmlinux] [k] skb_set_owner_w
+4.15%  swapper[kernel.vmlinux] [k] __slab_free
+3.80%  netperf[mlx5_core]  [k] mlx5e_sq_xmit
+2.00%  swapper[kernel.vmlinux] [k] sock_wfree
+1.94%  netperfnetperf  [.] send_data
+1.92%  netperfnetperf  [.] send_omni_inner


Well, the next step I suppose is to have you try a quick netperf 
UDP_STREAM under strace to see if your netperf binary does what mine did:


strace -v -o /tmp/netperf.strace netperf -H 198.18.50.1 -t UDP_STREAM -l 1 -- -m 1472 -n -N -R 1


And see if you see the connect() I saw. (Note, I make the runtime 1 second)

rick


Re: Netperf UDP issue with connected sockets

2016-11-16 Thread Jesper Dangaard Brouer
On Wed, 16 Nov 2016 09:46:37 -0800
Rick Jones  wrote:

> On 11/16/2016 04:16 AM, Jesper Dangaard Brouer wrote:
> > [1] Subj: High perf top ip_idents_reserve doing netperf UDP_STREAM
> >  - https://www.spinics.net/lists/netdev/msg294752.html
> >
> > Not fixed in version 2.7.0.
> >  - ftp://ftp.netperf.org/netperf/netperf-2.7.0.tar.gz
> >
> > Used extra netperf configure compile options:
> >  ./configure  --enable-histogram --enable-demo
> >
> > It seems like some fix attempts exists in the SVN repository::
> >
> >  svn checkout http://www.netperf.org/svn/netperf2/trunk/ netperf2-svn
> >  svn log -r709
> >  # A quick stab at getting remote connect going for UDP_STREAM
> >  svn diff -r708:709
> >
> > Testing with SVN version, still show __ip_select_ident() in top#1.  
> 
> Indeed, there was a fix for getting the remote side connect()ed. 
> Looking at what I have for the top of trunk I do though see a connect() 
> call being made at the local end:
> 
> socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
> getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
> getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
> setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(4, {sa_family=AF_INET, sin_port=htons(0), 
> sin_addr=inet_addr("0.0.0.0")}, 16) = 0
> setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0
> setsockopt(4, SOL_IP, IP_RECVERR, [1], 4) = 0
> brk(0xe53000)   = 0xe53000
> getsockname(4, {sa_family=AF_INET, sin_port=htons(59758), 
> sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
> sendto(3, 
> "\0\0\0a\377\377\377\377\377\377\377\377\377\377\377\377\0\0\0\10\0\0\0\0\0\0\0\321\377\377\377\377"...,
>  
> 656, 0, NULL, 0) = 656
> select(1024, [3], NULL, NULL, {120, 0}) = 1 (in [3], left {119, 995630})
> recvfrom(3, 
> "\0\0\0b\0\0\0\0\0\3@\0\0\3@\0\0\0\0\2\0\3@\0\377\377\377\377\0\0\0\321"..., 
> 656, 0, NULL, NULL) = 656
> write(1, "need to connect is 1\n", 21)  = 21
> rt_sigaction(SIGALRM, {0x402ea6, [ALRM], SA_RESTORER|SA_INTERRUPT, 
> 0x7f2824eb2cb0}, NULL, 8) = 0
> rt_sigaction(SIGINT, {0x402ea6, [INT], SA_RESTORER|SA_INTERRUPT, 
> 0x7f2824eb2cb0}, NULL, 8) = 0
> alarm(1)= 0
> connect(4, {sa_family=AF_INET, sin_port=htons(34832), 
> sin_addr=inet_addr("127.0.0.1")}, 16) = 0
> sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 
> 1024
> sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 
> 1024
> sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 
> 1024
> 
> the only difference there with top of trunk is that "need to connect" 
> write/printf I just put in the code to be a nice marker in the system 
> call trace.
> 
> It is a wild guess, but does setting SO_DONTROUTE affect whether or not 
> a connect() would have the desired effect?  That is there to protect 
> people from themselves (long story about people using UDP_STREAM to 
> stress improperly air-gapped systems during link up/down testing) 
> It can be disabled with a test-specific -R 1 option, so your netperf 
> command would become:
> 
> netperf -H 198.18.50.1 -t UDP_STREAM -l 120 -- -m 1472 -n -N -R 1

Using -R 1 does not seem to help remove __ip_select_ident()

Samples: 56K of event 'cycles', Event count (approx.): 78628132661
  Overhead  CommandShared ObjectSymbol
+9.11%  netperf[kernel.vmlinux] [k] __ip_select_ident
+6.98%  netperf[kernel.vmlinux] [k] _raw_spin_lock
+6.21%  swapper[mlx5_core]  [k] mlx5e_poll_tx_cq
+5.03%  netperf[kernel.vmlinux] [k] copy_user_enhanced_fast_string
+4.69%  netperf[kernel.vmlinux] [k] __ip_make_skb
+4.63%  netperf[kernel.vmlinux] [k] skb_set_owner_w
+4.15%  swapper[kernel.vmlinux] [k] __slab_free
+3.80%  netperf[mlx5_core]  [k] mlx5e_sq_xmit
+2.00%  swapper[kernel.vmlinux] [k] sock_wfree
+1.94%  netperfnetperf  [.] send_data
+1.92%  netperfnetperf  [.] send_omni_inner


> >
> > (p.s. is netperf ever going to be converted from SVN to git?)
> >  
> 
> Well, my git-fu could use some work (gentle, offline taps with a 
> clueful tutorial bat would be welcome), and at least in the past, going 
> to git was held back because there were a bunch of netperf users on 
> Windows and there wasn't (at the time) support for git under Windows.
> 
> But I am not against the idea in principle.

Once you have learned git, you will never go back to SVN. Just do it! :-)

Here are even nice writeups of how to convert and preserve history:
 http://john.albin.net/git/convert-subversion-to-git

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: Netperf UDP issue with connected sockets

2016-11-16 Thread Rick Jones

On 11/16/2016 04:16 AM, Jesper Dangaard Brouer wrote:

[1] Subj: High perf top ip_idents_reserve doing netperf UDP_STREAM
 - https://www.spinics.net/lists/netdev/msg294752.html

Not fixed in version 2.7.0.
 - ftp://ftp.netperf.org/netperf/netperf-2.7.0.tar.gz

Used extra netperf configure compile options:
 ./configure  --enable-histogram --enable-demo

It seems like some fix attempts exists in the SVN repository::

 svn checkout http://www.netperf.org/svn/netperf2/trunk/ netperf2-svn
 svn log -r709
 # A quick stab at getting remote connect going for UDP_STREAM
 svn diff -r708:709

Testing with the SVN version still shows __ip_select_ident() at top #1.


Indeed, there was a fix for getting the remote side connect()ed. 
Looking at what I have for the top of trunk I do though see a connect() 
call being made at the local end:


socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0

setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0
setsockopt(4, SOL_IP, IP_RECVERR, [1], 4) = 0
brk(0xe53000)   = 0xe53000
getsockname(4, {sa_family=AF_INET, sin_port=htons(59758), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
sendto(3, "\0\0\0a\377\377\377\377\377\377\377\377\377\377\377\377\0\0\0\10\0\0\0\0\0\0\0\321\377\377\377\377"..., 656, 0, NULL, 0) = 656

select(1024, [3], NULL, NULL, {120, 0}) = 1 (in [3], left {119, 995630})
recvfrom(3, "\0\0\0b\0\0\0\0\0\3@\0\0\3@\0\0\0\0\2\0\3@\0\377\377\377\377\0\0\0\321"..., 656, 0, NULL, NULL) = 656

write(1, "need to connect is 1\n", 21)  = 21
rt_sigaction(SIGALRM, {0x402ea6, [ALRM], SA_RESTORER|SA_INTERRUPT, 0x7f2824eb2cb0}, NULL, 8) = 0
rt_sigaction(SIGINT, {0x402ea6, [INT], SA_RESTORER|SA_INTERRUPT, 0x7f2824eb2cb0}, NULL, 8) = 0

alarm(1) = 0
connect(4, {sa_family=AF_INET, sin_port=htons(34832), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 1024


the only difference there with top of trunk is that "need to connect" 
write/printf I just put in the code to be a nice marker in the system 
call trace.


It is a wild guess, but does setting SO_DONTROUTE affect whether or not 
a connect() would have the desired effect?  That is there to protect 
people from themselves (long story about people using UDP_STREAM to 
stress improperly air-gapped systems during link up/down testing) 
It can be disabled with a test-specific -R 1 option, so your netperf 
command would become:


netperf -H 198.18.50.1 -t UDP_STREAM -l 120 -- -m 1472 -n -N -R 1



(p.s. is netperf ever going to be converted from SVN to git?)



Well, my git-fu could use some work (gentle, offline taps with a 
clueful tutorial bat would be welcome), and at least in the past, going 
to git was held back because there were a bunch of netperf users on 
Windows and there wasn't (at the time) support for git under Windows.


But I am not against the idea in principle.

happy benchmarking,

rick jones

PS - rick.jo...@hp.com no longer works.  rick.jon...@hpe.com should be 
used instead.


Netperf UDP issue with connected sockets

2016-11-16 Thread Jesper Dangaard Brouer

While optimizing the kernel RX path, I've run into an issue where I
cannot use netperf UDP_STREAM for testing, because the sender is
slower than the receiver.  Thus, it cannot show my receiver improvements
(as the receiver has idle cycles).

Eric Dumazet previously told me[1] this was related to netperf needing
to use a connected socket for UDP.  The netperf options "-- -n -N"
should enable connected UDP sockets, but they never worked!  The options
are documented, but netperf seems to have a bug.

Called like:
 netperf -H 198.18.50.1 -t UDP_STREAM -l 120 -- -m 1472 -n -N

Problem on sender-side is "__ip_select_ident".

 Samples: 15K of event 'cycles', Event count (approx.): 16409681913
   Overhead  Command Shared Object Symbol
 +   11.18%  netperf [kernel.vmlinux]  [k] __ip_select_ident
 +6.93%  netperf [kernel.vmlinux]  [k] _raw_spin_lock
 +6.12%  netperf [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
 +4.31%  netperf [kernel.vmlinux]  [k] __ip_make_skb
 +3.97%  netperf [kernel.vmlinux]  [k] fib_table_lookup
 +3.51%  netperf [mlx5_core]   [k] mlx5e_sq_xmit
 +2.43%  netperf [kernel.vmlinux]  [k] __ip_route_output_key_hash
 +2.24%  netperf netperf   [.] send_omni_inner
 +2.17%  netperf netperf   [.] send_data

[1] Subj: High perf top ip_idents_reserve doing netperf UDP_STREAM
 - https://www.spinics.net/lists/netdev/msg294752.html

Not fixed in version 2.7.0.
 - ftp://ftp.netperf.org/netperf/netperf-2.7.0.tar.gz

Used extra netperf configure compile options:
 ./configure  --enable-histogram --enable-demo

It seems like some fix attempts exist in the SVN repository:

 svn checkout http://www.netperf.org/svn/netperf2/trunk/ netperf2-svn
 svn log -r709
 # A quick stab at getting remote connect going for UDP_STREAM
 svn diff -r708:709

Testing with the SVN version still shows __ip_select_ident() as the top (#1) entry.

(p.s. is netperf ever going to be converted from SVN to git?)
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

(old email below)

On Wed, 03 Sep 2014 08:17:06 -0700
Eric Dumazet  wrote:

> On Wed, 2014-09-03 at 16:59 +0200, Jesper Dangaard Brouer wrote:
> > Hi Eric,
> > 
> > When doing:
> >  super_netperf 120 -H 192.168.8.2 -t UDP_STREAM -l 100 -- -m 256
> > 
> > I'm seeing the function ip_idents_reserve() consuming the most CPU.  Could
> > you help explain what is going on, and how I can avoid this?
> > 
> > Perf top:
> >   11.67%  [kernel]   [k] ip_idents_reserve
> >8.37%  [kernel]   [k] fib_table_lookup
> >4.46%  [kernel]   [k] _raw_spin_lock
> >3.21%  [kernel]   [k] copy_user_enhanced_fast_string
> >2.92%  [kernel]   [k] sock_alloc_send_pskb
> >2.88%  [kernel]   [k] udp_sendmsg
> >   
> 
> Because you use a single destination, all flows compete on a single
> atomic to get their next IP identifier.
> 
> You can try to use the netperf options (-- -N -n) so that netperf uses
> connected UDP sockets.
> 
> In this case, the IP identifier generator is held in each socket.