Re: [PATCH v2 net-next] liquidio: improve UDP TX performance

2017-02-21 Thread Rick Jones

On 02/21/2017 01:09 PM, Felix Manlunas wrote:

From: VSR Burru <veerasenareddy.bu...@cavium.com>

Improve UDP TX performance by:
* reducing the ring size from 2K to 512
* replacing the numerous streaming DMA allocations for info buffers and
  gather lists with one large consistent DMA allocation per ring

Netperf benchmark numbers before and after patch:

PF UDP TX
+--------+--------+------------+------------+---------+
|        |        |   Before   |   After    |         |
| Number |        |   Patch    |   Patch    |         |
|   of   | Packet | Throughput | Throughput | Percent |
| Flows  |  Size  |   (Gbps)   |   (Gbps)   | Change  |
+--------+--------+------------+------------+---------+
|        |   360  |    0.52    |    0.93    |  +78.9  |
|   1    |  1024  |    1.62    |    2.84    |  +75.3  |
|        |  1518  |    2.44    |    4.21    |  +72.5  |
+--------+--------+------------+------------+---------+
|        |   360  |    0.45    |    1.59    | +253.3  |
|   4    |  1024  |    1.34    |    5.48    | +308.9  |
|        |  1518  |    2.27    |    8.31    | +266.1  |
+--------+--------+------------+------------+---------+
|        |   360  |    0.40    |    1.61    | +302.5  |
|   8    |  1024  |    1.64    |    4.24    | +158.5  |
|        |  1518  |    2.87    |    6.52    | +127.2  |
+--------+--------+------------+------------+---------+


VF UDP TX
+--------+--------+------------+------------+---------+
|        |        |   Before   |   After    |         |
| Number |        |   Patch    |   Patch    |         |
|   of   | Packet | Throughput | Throughput | Percent |
| Flows  |  Size  |   (Gbps)   |   (Gbps)   | Change  |
+--------+--------+------------+------------+---------+
|        |   360  |    1.28    |    1.49    |  +16.4  |
|   1    |  1024  |    4.44    |    4.39    |   -1.1  |
|        |  1518  |    6.08    |    6.51    |   +7.1  |
+--------+--------+------------+------------+---------+
|        |   360  |    2.35    |    2.35    |    0.0  |
|   4    |  1024  |    6.41    |    8.07    |  +25.9  |
|        |  1518  |    9.56    |    9.54    |   -0.2  |
+--------+--------+------------+------------+---------+
|        |   360  |    3.41    |    3.65    |   +7.0  |
|   8    |  1024  |    9.35    |    9.34    |   -0.1  |
|        |  1518  |    9.56    |    9.57    |   +0.1  |
+--------+--------+------------+------------+---------+


Some good-looking numbers there.  As one approaches the wire limit for 
bit rate, the likes of netperf's service demand (CPU consumed per unit 
of data transferred) can be used to demonstrate the performance change 
- though there isn't an easy way to do that for parallel flows.


happy benchmarking,

rick jones



Re: [PATCH net-next] liquidio: improve UDP TX performance

2017-02-16 Thread Rick Jones

On 02/16/2017 10:38 AM, Felix Manlunas wrote:

From: VSR Burru <veerasenareddy.bu...@cavium.com>

Improve UDP TX performance by:
* reducing the ring size from 2K to 512
* replacing the numerous streaming DMA allocations for info buffers and
  gather lists with one large consistent DMA allocation per ring


By how much was UDP TX performance improved?

happy benchmarking,

rick jones



Re: [PATCH net-next] virtio: Fix affinity for >32 VCPUs

2017-02-03 Thread Rick Jones

On 02/03/2017 10:31 AM, Willem de Bruijn wrote:

Configuring interrupts and xps from userspace at boot is more robust,
as device driver defaults can change. But especially for customers who
are unaware of these settings, choosing sane defaults won't hurt.


The devil is in finding the sane defaults.  For example, consider the 
issues we've seen with traffic sent from VMs getting reordered when the 
driver took it upon itself to enable XPS.


rick jones


Re: [PATCH net-next] virtio: Fix affinity for >32 VCPUs

2017-02-03 Thread Rick Jones

On 02/03/2017 10:22 AM, Benjamin Serebrin wrote:

Thanks, Michael, I'll put this text in the commit log:

XPS settings aren't write-able from userspace, so the only way I know
to fix XPS is in the driver.


??

root@np-cp1-c0-m1-mgmt:/home/stack# cat 
/sys/devices/pci0000:00/0000:00:02.0/0000:04:00.0/net/hed0/queues/tx-0/xps_cpus

00000000,00000001
root@np-cp1-c0-m1-mgmt:/home/stack# echo 0 > 
/sys/devices/pci0000:00/0000:00:02.0/0000:04:00.0/net/hed0/queues/tx-0/xps_cpus
root@np-cp1-c0-m1-mgmt:/home/stack# cat 
/sys/devices/pci0000:00/0000:00:02.0/0000:04:00.0/net/hed0/queues/tx-0/xps_cpus

00000000,00000000



Re: [PATCH net-next] tcp: accept RST for rcv_nxt - 1 after receiving a FIN

2017-01-17 Thread Rick Jones

On 01/17/2017 11:13 AM, Eric Dumazet wrote:

On Tue, Jan 17, 2017 at 11:04 AM, Rick Jones <rick.jon...@hpe.com> wrote:

Drifting a bit, and it doesn't change the value of dealing with it, but out
of curiosity, when you say mostly in CLOSE_WAIT, why aren't the server-side
applications reacting to the read return of zero triggered by the arrival of
the FIN?


Even if the application reacts, and calls close(fd), kernel will still
try to push the data that was queued into socket write queue prior to
receiving the FIN.

By allowing this RST, we can flush the whole data and react much
faster, avoiding locking memory in the kernel for very long time.


Understood.  I was just wondering if there is also an application bug here.

happy benchmarking,

rick jones


Re: [PATCH net-next] tcp: accept RST for rcv_nxt - 1 after receiving a FIN

2017-01-17 Thread Rick Jones

On 01/17/2017 10:37 AM, Jason Baron wrote:

From: Jason Baron <jba...@akamai.com>

Using a Mac OSX box as a client connecting to a Linux server, we have found
that when certain applications (such as 'ab'), are abruptly terminated
(via ^C), a FIN is sent followed by a RST packet on tcp connections. The
FIN is accepted by the Linux stack but the RST is sent with the same
sequence number as the FIN, and Linux responds with a challenge ACK per
RFC 5961. The OSX client then sometimes (they are rate-limited) does not
reply with any RST as would be expected on a closed socket.

This results in sockets accumulating on the Linux server left mostly in
the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible.
This sequence of events can tie up a lot of resources on the Linux server
since there may be a lot of data in write buffers at the time of the RST.
Accepting a RST equal to rcv_nxt - 1, after we have already successfully
processed a FIN, has made a significant difference for us in practice, by
freeing up unneeded resources in a more expedient fashion.


Drifting a bit, and it doesn't change the value of dealing with it, but 
out of curiosity, when you say mostly in CLOSE_WAIT, why aren't the 
server-side applications reacting to the read return of zero triggered 
by the arrival of the FIN?


happy benchmarking,

rick jones


Re: [pull request][for-next] Mellanox mlx5 Reorganize core driver directory layout

2017-01-13 Thread Rick Jones

On 01/13/2017 02:56 PM, Tom Herbert wrote:

On Fri, Jan 13, 2017 at 2:45 PM, Saeed Mahameed

what configuration are you running ? what traffic ?


Nothing fancy. 8 queues and 20 concurrent netperf TCP_STREAMs trips
it. Not a lot of them, but I don't think we really should ever see
these errors.


Straight-up defaults with netperf, or do you use specific -s/S or -m/M 
options?


happy benchmarking,

rick jones



Re: [PATCH net-next] udp: under rx pressure, try to condense skbs

2016-12-08 Thread Rick Jones

On 12/08/2016 07:30 AM, Eric Dumazet wrote:

On Thu, 2016-12-08 at 10:46 +0100, Jesper Dangaard Brouer wrote:


Hmmm... I'm not thrilled to have such heuristics, that change memory
behavior when half of the queue size (sk->sk_rcvbuf) is reached.


Well, copybreak drivers do that unconditionally, even under no stress at
all, you really should complain then.


Isn't that behaviour based (in part?) on the observation/belief that it 
is fewer cycles to copy the small packet into a small buffer than to 
send the larger buffer up the stack and have to allocate and map a 
replacement?
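
For any reader not familiar with the idiom, a rough sketch of a 
copybreak receive path (the foo_* helper and the buffer field names 
here are hypothetical, not any particular driver):

    if (len < rx_copybreak) {
            /* small packet: copy into a fresh small skb and leave
             * the original DMA-mapped buffer in the ring for reuse
             */
            skb = netdev_alloc_skb_ip_align(netdev, len);
            if (skb) {
                    dma_sync_single_for_cpu(dev, rx_buf->dma, len,
                                            DMA_FROM_DEVICE);
                    memcpy(skb_put(skb, len), rx_buf->data, len);
            }
    } else {
            /* large packet: hand the full buffer up the stack and
             * allocate and DMA-map a replacement for the ring
             */
            skb = rx_buf->skb;
            skb_put(skb, len);
            foo_alloc_and_map_rx_buf(ring);     /* hypothetical */
    }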


rick jones



Re: [PATCH net-next 2/4] mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs

2016-12-02 Thread Rick Jones

On 12/02/2016 03:23 PM, Martin KaFai Lau wrote:

When XDP prog is attached, it is currently limiting
MTU to be FRAG_SZ0 - ETH_HLEN - (2 * VLAN_HLEN) which is 1514
in x86.

AFAICT, since mlx4 is doing one page per packet for XDP,
we can at least raise the MTU limitation up to
PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN) which this patch is
doing.  It will be useful in the next patch which allows
XDP program to extend the packet by adding new header(s).


Is mlx4 the only driver doing page-per-packet?

rick jones



Re: Initial thoughts on TXDP

2016-12-01 Thread Rick Jones

On 12/01/2016 02:12 PM, Tom Herbert wrote:

We have consider both request size and response side in RPC.
Presumably, something like a memcache server is most serving data as
opposed to reading it, we are looking to receiving much smaller
packets than being sent. Requests are going to be quite small say 100
bytes and unless we are doing significant amount of pipelining on
connections GRO would rarely kick-in. Response size will have a lot of
variability, anything from a few kilobytes up to a megabyte. I'm sorry
I can't be more specific this is an artifact of datacenters that have
100s of different applications and communication patterns. Maybe 100b
request size, 8K, 16K, 64K response sizes might be good for test.


No worries on the specific sizes, it is a classic "How long is a piece 
of string?" sort of question.


Not surprisingly, as the size of what is being received grows, so too 
the delta between GRO on and off.


stack@np-cp1-c0-m1-mgmt:~/rjones2$ HDR="-P 1"; for r in 8K 16K 64K 1M; 
do for gro in on off; do sudo ethtool -K hed0 gro ${gro}; brand="$r gro 
$gro"; ./netperf -B "$brand" -c -H np-cp1-c1-m3-mgmt -t TCP_RR $HDR -- 
-P 12867 -r 128,${r} -o result_brand,throughput,local_sd; HDR="-P 0"; 
done; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0

Result Tag,Throughput,Local Service Demand
"8K gro on",9899.84,35.947
"8K gro off",7299.54,61.097
"16K gro on",8119.38,58.367
"16K gro off",5176.87,95.317
"64K gro on",4429.57,110.629
"64K gro off",2128.58,289.913
"1M gro on",887.85,918.447
"1M gro off",335.97,3427.587

So that gives a feel for by how much this alternative mechanism would 
have to reduce path-length to maintain the CPU overhead, were the 
mechanism to preclude GRO.


rick




Re: Initial thoughts on TXDP

2016-12-01 Thread Rick Jones

On 12/01/2016 12:18 PM, Tom Herbert wrote:

On Thu, Dec 1, 2016 at 11:48 AM, Rick Jones <rick.jon...@hpe.com> wrote:

Just how much per-packet path-length are you thinking will go away under the
likes of TXDP?  It is admittedly "just" netperf but losing TSO/GSO does some
non-trivial things to effective overhead (service demand) and so throughput:


For plain in order TCP packets I believe we should be able process
each packet at nearly same speed as GRO. Most of the protocol
processing we do between GRO and the stack are the same, the
differences are that we need to do a connection lookup in the stack
path (note we now do this is UDP GRO and that hasn't show up as a
major hit). We also need to consider enqueue/dequeue on the socket
which is a major reason to try for lockless sockets in this instance.


So waving hands a bit, and taking the service demand for the GRO-on 
receive test in my previous message (860 ns/KB), that would be 
~(1448/1024)*860 or ~1.216 usec of CPU time per TCP segment, including 
ACK generation, which, unless an explicit ACK-avoidance heuristic a la 
HP-UX 11/Solaris 2 is put in place, would be for every other segment. 
Etc etc.



Sure, but try running something that emulates a more realistic workload 
than a TCP stream, like an RR test with relatively small payload and 
many connections.


That is a good point, which of course is why the RR tests are there in 
netperf :) Don't get me wrong, I *like* seeing path-length reductions. 
What would you posit is a relatively small payload?  The promotion of 
IR10 suggests that perhaps 14KB or so is sufficiently common, so I'll 
grasp at that as the length of a piece of string:


stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,14K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0

Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size     Size    Time     Rate     local  remote local   remote
bytes  bytes  bytes    bytes   secs.    per sec  % S    % U    us/Tr   us/Tr

16384  87380  128      14336   10.00    8118.31  1.57   -1.00  46.410  -1.000
16384  87380
stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,14K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0

Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size     Size    Time     Rate     local  remote local   remote
bytes  bytes  bytes    bytes   secs.    per sec  % S    % U    us/Tr   us/Tr

16384  87380  128      14336   10.00    5837.35  2.20   -1.00  90.628  -1.000
16384  87380

So, losing GRO doubled the service demand.  I suppose I could see 
cutting path-length in half based on the things you listed which would 
be bypassed?


I'm sure mileage will vary with different NICs and CPUs.  The ones used 
here happened to be to hand.


happy benchmarking,

rick

Just to get a crude feel for sensitivity, doubling to 28K unsurprisingly 
goes to more than doubling, and halving to 7K narrows the delta:


stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,28K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0

Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size     Size    Time     Rate     local  remote local   remote
bytes  bytes  bytes    bytes   secs.    per sec  % S    % U    us/Tr   us/Tr

16384  87380  128      28672   10.00    6732.32  1.79   -1.00  63.819  -1.000
16384  87380
stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,28K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0

Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size     Size    Time     Rate     local  remote local   remote
bytes  bytes  bytes    bytes   secs.    per sec  % S    % U    us/Tr   us/Tr

16384  87380  128      28672   10.00    3780.47  2.32   -1.00  147.280 -1.000
16384  87380



stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,7K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0

Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size     Size    Time     Rate     local  remote local   remote
bytes  bytes  bytes    bytes   secs.    per sec  % S

Re: Initial thoughts on TXDP

2016-12-01 Thread Rick Jones

On 12/01/2016 11:05 AM, Tom Herbert wrote:

For the GSO and GRO the rationale is that performing the extra SW
processing to do the offloads is significantly less expensive than
running each packet through the full stack. This is true in a
multi-layered generalized stack. In TXDP, however, we should be able
to optimize the stack data path such that that would no longer be
true. For instance, if we can process the packets received on a
connection quickly enough so that it's about the same or just a little
more costly than GRO processing then we might bypass GRO entirely.
TSO is probably still relevant in TXDP since it reduces overheads
processing TX in the device itself.


Just how much per-packet path-length are you thinking will go away under 
the likes of TXDP?  It is admittedly "just" netperf but losing TSO/GSO 
does some non-trivial things to effective overhead (service demand) and 
so throughput:


stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- 
-P 12867
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to 
np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      9260.24   2.02     -1.00    0.428   -1.000

stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 tso off gso off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- 
-P 12867
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to 
np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      5621.82   4.25     -1.00    1.486   -1.000


And that is still with the stretch-ACKs induced by GRO at the receiver.

Losing GRO has quite similar results:
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_MAERTS -- -P 12867
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to 
np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Recv     Send     Recv    Send
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      9154.02   4.00     -1.00    0.860   -1.000

stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_MAERTS -- -P 12867
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to 
np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Recv     Send     Recv    Send
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      4212.06   5.36     -1.00    2.502   -1.000


I'm sure there is a very non-trivial "it depends" component here - 
netperf will get the peak benefit from *SO and so one will see the peak 
difference in service demands - but even if one gets only 6 segments per 
*SO that is a lot of path-length to make-up.


4.4 kernel, BE3 NICs ... E5-2640 0 @ 2.50GHz

And even if one does have the CPU cycles to burn so to speak, the effect 
on power consumption needs to be included in the calculus.


happy benchmarking,

rick jones


Re: Netperf UDP issue with connected sockets

2016-11-30 Thread Rick Jones

On 11/30/2016 02:43 AM, Jesper Dangaard Brouer wrote:

Notice the "fib_lookup" cost is still present, even when I use
option "-- -n -N" to create a connected socket.  As Eric taught us,
this is because we should use syscalls "send" or "write" on a connected
socket.


In theory, once the data socket is connected, the send_data() call in 
src/nettest_omni.c is supposed to use send() rather than sendto().


And indeed, based on a quick check, send() is what is being called, 
though it seems to become a sendto() system call - with the destination 
information NULL:


write(1, "send\n", 5)   = 5
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 
1024

write(1, "send\n", 5)   = 5
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 
1024


So I'm not sure what might be going-on there.

You can get netperf to use write() instead of send() by adding a 
test-specific -I option.


happy benchmarking,

rick



My udp_flood tool[1] cycle through the different syscalls:

taskset -c 2 ~/git/network-testing/src/udp_flood 198.18.50.1 --count $((10**7)) 
--pmtu 2
             ns/pkt  pps         cycles/pkt
send         473.08  2113816.28  1891
sendto       558.58  1790265.84  2233
sendmsg      587.24  1702873.80  2348
sendMmsg/32  547.57  1826265.90  2189
write        518.36  1929175.52  2072

Using "send" seems to be the fastest option.

Some notes on test: I've forced TX completions to happen on another CPU0
and pinned the udp_flood program (to CPU2) as I want to avoid the CPU
scheduler to move udp_flood around as this cause fluctuations in the
results (as it stress the memory allocations more).

My udp_flood --pmtu option is documented in the --help usage text (see below 
signature)





Re: Netperf UDP issue with connected sockets

2016-11-28 Thread Rick Jones

On 11/28/2016 10:33 AM, Rick Jones wrote:

On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote:

time to try IP_MTU_DISCOVER ;)


To Rick, maybe you can find a good solution or option with Eric's hint,
to send appropriate sized UDP packets with Don't Fragment (DF).


Jesper -

Top of trunk has a change adding an omni, test-specific -f option which
will set IP_MTU_DISCOVER:IP_PMTUDISC_DO on the data socket.  Is that
sufficient to your needs?


Usage examples:

raj@tardy:~/netperf2_trunk/src$ ./netperf -t UDP_STREAM -l 1 -H 
raj-folio.americas.hpqcorp.net -- -m 1472 -f
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
raj-folio.americas.hpqcorp.net () port 0 AF_INET

Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1472   1.00        77495      0     912.35
212992           1.00        77495            912.35

[1]+  Done                    emacs nettest_omni.c
raj@tardy:~/netperf2_trunk/src$ ./netperf -t UDP_STREAM -l 1 -H 
raj-folio.americas.hpqcorp.net -- -m 14720 -f
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
raj-folio.americas.hpqcorp.net () port 0 AF_INET

send_data: data send error: Message too long (errno 90)
netperf: send_omni: send_data failed: Message too long

happy benchmarking,

rick jones


Re: Netperf UDP issue with connected sockets

2016-11-28 Thread Rick Jones

On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote:

time to try IP_MTU_DISCOVER ;)


To Rick, maybe you can find a good solution or option with Eric's hint,
to send appropriate sized UDP packets with Don't Fragment (DF).


Jesper -

Top of trunk has a change adding an omni, test-specific -f option which 
will set IP_MTU_DISCOVER:IP_PMTUDISC_DO on the data socket.  Is that 
sufficient to your needs?
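
Presumably under the hood that amounts to little more than (a sketch, 
not the actual netperf code):

    /* set DF; sends which would require fragmentation then fail */
    int val = IP_PMTUDISC_DO;           /* from <netinet/in.h> */
    if (setsockopt(data_socket, IPPROTO_IP, IP_MTU_DISCOVER,
                   &val, sizeof(val)) < 0)
            perror("setsockopt IP_MTU_DISCOVER");

which is why the too-large send in the usage example elsewhere in the 
thread fails with "Message too long" (EMSGSIZE) rather than being 
fragmented.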


happy benchmarking,

rick



Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Rick Jones

On 11/17/2016 04:37 PM, Julian Anastasov wrote:

On Thu, 17 Nov 2016, Rick Jones wrote:


raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf -F
src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472

...

socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")},
16) = 0
setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0


connected socket can benefit from dst cached in socket
but not if SO_DONTROUTE is set. If we do not want to send packets
via gateway this -l 1 should help but I don't see IP_TTL setsockopt
in your first example with connect() to 127.0.0.1.

Also, may be there can be another default, if -l is used to
specify TTL then SO_DONTROUTE should not be set. I.e. we should
avoid SO_DONTROUTE, if possible.


The global -l option specifies the duration of the test.  It doesn't 
specify the TTL of the IP datagrams being generated by the actions of 
the test.


I resisted setting SO_DONTROUTE for a number of years after the first 
instance of UDP_STREAM being used in link up/down testing took-out a 
company's network (including security camera feeds to galactic HQ) but 
at this point I'm likely to keep it in there because there ended-up 
being a second such incident.  It is set only for UDP_STREAM.  It isn't 
set for UDP_RR or TCP_*.  And for UDP_STREAM it can be overridden by the 
test-specific -R option.
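
For reference, the protection itself is just one setsockopt() on the 
data socket (a sketch):

    /* confine UDP_STREAM to directly-connected networks; sends that
     * would have to go via a gateway fail rather than leaving the link
     */
    int one = 1;
    setsockopt(data_socket, SOL_SOCKET, SO_DONTROUTE, &one, sizeof(one));

which is the SO_DONTROUTE call visible in the strace output elsewhere 
in these threads.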


happy benchmarking,

rick jones


Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Rick Jones

On 11/17/2016 01:44 PM, Eric Dumazet wrote:

because netperf sends the same message
over and over...


Well, sort of, by default.  That can be altered to a degree.

The global -F option should cause netperf to fill the buffers in its 
send ring with data from the specified file.  The number of buffers in 
the send ring can be controlled via the global -W option.  The number of 
elements in the ring will default to one more than the initial SO_SNDBUF 
size divided by the send size.
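
For example, with the 212992-byte SO_SNDBUF visible in the strace below 
and 1472-byte sends, that default works out to 212992/1472 + 1, or on 
the order of 145 buffers in the send ring.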


raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf 
-F src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472


...

socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(4, {sa_family=AF_INET, sin_port=htons(0), 
sin_addr=inet_addr("0.0.0.0")}, 16) = 0

setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0
setsockopt(4, SOL_IP, IP_RECVERR, [1], 4) = 0
open("src/nettest_omni.c", O_RDONLY)= 5
fstat(5, {st_dev=makedev(8, 2), st_ino=82075297, st_mode=S_IFREG|0664, 
st_nlink=1, st_uid=1000, st_gid=1000, st_blksize=4096, st_blocks=456, 
st_size=230027, st_atime=2016/11/16-09:49:29, 
st_mtime=2016/11/16-09:49:24, st_ctime=2016/11/16-09:49:24}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) 
= 0x7f3099f62000

read(5, "#ifdef HAVE_CONFIG_H\n#include <c"..., 4096) = 4096
read(5, "_INTEGER *intvl_two_ptr = "..., 4096) = 4096
read(5, "interval_count = interval_burst;"..., 4096) = 4096
read(5, ";\n\n/* these will control the wid"..., 4096) = 4096
read(5, "\n  LOCAL_SECURITY_ENABLED_NUM,\n "..., 4096) = 4096
read(5, "  ,  \n  "..., 4096) = 4096

...

rt_sigaction(SIGALRM, {0x402ea6, [ALRM], SA_RESTORER|SA_INTERRUPT, 
0x7f30994a7cb0}, NULL, 8) = 0
rt_sigaction(SIGINT, {0x402ea6, [INT], SA_RESTORER|SA_INTERRUPT, 
0x7f30994a7cb0}, NULL, 8) = 0

alarm(1)= 0
sendto(4, "#ifdef HAVE_CONFIG_H\n#include <c"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, " used\\n\\\n-m local,remote   S"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, " do here but clear the legacy fl"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, "e before we scan the test-specif"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, "\n\n\tfprintf(where,\n\t\ttput_fmt_1_l"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472


Of course, it will continue to send the same messages from the send_ring 
over and over instead of putting different data into the buffers each 
time, but if one has a sufficiently large -W option specified...

happy benchmarking,

rick jones


Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Rick Jones

On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote:

time to try IP_MTU_DISCOVER ;)


To Rick, maybe you can find a good solution or option with Eric's hint,
to send appropriate sized UDP packets with Don't Fragment (DF).


Well, I suppose adding another setsockopt() to the data socket creation 
wouldn't be too difficult, along with another command-line option to 
cause it to happen.


Could we leave things as "make sure you don't need fragmentation when 
you use this" or would netperf have to start processing ICMP messages?
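
If it came to the latter, with IP_RECVERR already set on the data 
socket (it shows up in the straces elsewhere in these threads) the 
errors would arrive on the socket error queue rather than as raw ICMP. 
Roughly (a sketch; needs <linux/errqueue.h>):

    struct msghdr msg = { 0 };
    char cbuf[512];
    struct cmsghdr *cm;

    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);
    if (recvmsg(data_socket, &msg, MSG_ERRQUEUE) >= 0) {
            for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                    if (cm->cmsg_level == IPPROTO_IP &&
                        cm->cmsg_type == IP_RECVERR) {
                            struct sock_extended_err *ee =
                                (struct sock_extended_err *)CMSG_DATA(cm);
                            if (ee->ee_origin == SO_EE_ORIGIN_ICMP)
                                    /* ee->ee_info carries the reported
                                     * next-hop MTU for frag-needed */ ;
                    }
            }
    }

Whether that bookkeeping is worth adding is the question.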


happy benchmarking,

rick jones



Re: Netperf UDP issue with connected sockets

2016-11-16 Thread Rick Jones

On 11/16/2016 02:40 PM, Jesper Dangaard Brouer wrote:

On Wed, 16 Nov 2016 09:46:37 -0800
Rick Jones <rick.jon...@hpe.com> wrote:

It is a wild guess, but does setting SO_DONTROUTE affect whether or not
a connect() would have the desired effect?  That is there to protect
people from themselves (long story about people using UDP_STREAM to
stress improperly air-gapped systems during link up/down testing)
It can be disabled with a test-specific -R 1 option, so your netperf
command would become:

netperf -H 198.18.50.1 -t UDP_STREAM -l 120 -- -m 1472 -n -N -R 1


Using -R 1 does not seem to help remove __ip_select_ident()


Bummer.  It was a wild guess anyway, since I was seeing a connect() call 
on the data socket.



Samples: 56K of event 'cycles', Event count (approx.): 78628132661
  Overhead  Command  Shared Object     Symbol
+    9.11%  netperf  [kernel.vmlinux]  [k] __ip_select_ident
+    6.98%  netperf  [kernel.vmlinux]  [k] _raw_spin_lock
+    6.21%  swapper  [mlx5_core]       [k] mlx5e_poll_tx_cq
+    5.03%  netperf  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
+    4.69%  netperf  [kernel.vmlinux]  [k] __ip_make_skb
+    4.63%  netperf  [kernel.vmlinux]  [k] skb_set_owner_w
+    4.15%  swapper  [kernel.vmlinux]  [k] __slab_free
+    3.80%  netperf  [mlx5_core]       [k] mlx5e_sq_xmit
+    2.00%  swapper  [kernel.vmlinux]  [k] sock_wfree
+    1.94%  netperf  netperf           [.] send_data
+    1.92%  netperf  netperf           [.] send_omni_inner


Well, the next step I suppose is to have you try a quick netperf 
UDP_STREAM under strace to see if your netperf binary does what mine did:


strace -v -o /tmp/netperf.strace netperf -H 198.18.50.1 -t UDP_STREAM -l 
1 -- -m 1472 -n -N -R 1


And see if you see the connect() I saw. (Note, I make the runtime 1 second)

rick


Re: Netperf UDP issue with connected sockets

2016-11-16 Thread Rick Jones

On 11/16/2016 04:16 AM, Jesper Dangaard Brouer wrote:

[1] Subj: High perf top ip_idents_reserve doing netperf UDP_STREAM
 - https://www.spinics.net/lists/netdev/msg294752.html

Not fixed in version 2.7.0.
 - ftp://ftp.netperf.org/netperf/netperf-2.7.0.tar.gz

Used extra netperf configure compile options:
 ./configure  --enable-histogram --enable-demo

It seems like some fix attempts exists in the SVN repository::

 svn checkout http://www.netperf.org/svn/netperf2/trunk/ netperf2-svn
 svn log -r709
 # A quick stab at getting remote connect going for UDP_STREAM
 svn diff -r708:709

Testing with SVN version, still show __ip_select_ident() in top#1.


Indeed, there was a fix for getting the remote side connect()ed. 
Looking at what I have for the top of trunk I do though see a connect() 
call being made at the local end:


socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(4, {sa_family=AF_INET, sin_port=htons(0), 
sin_addr=inet_addr("0.0.0.0")}, 16) = 0

setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0
setsockopt(4, SOL_IP, IP_RECVERR, [1], 4) = 0
brk(0xe53000)   = 0xe53000
getsockname(4, {sa_family=AF_INET, sin_port=htons(59758), 
sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
sendto(3, 
"\0\0\0a\377\377\377\377\377\377\377\377\377\377\377\377\0\0\0\10\0\0\0\0\0\0\0\321\377\377\377\377"..., 
656, 0, NULL, 0) = 656

select(1024, [3], NULL, NULL, {120, 0}) = 1 (in [3], left {119, 995630})
recvfrom(3, 
"\0\0\0b\0\0\0\0\0\3@\0\0\3@\0\0\0\0\2\0\3@\0\377\377\377\377\0\0\0\321"..., 
656, 0, NULL, NULL) = 656

write(1, "need to connect is 1\n", 21)  = 21
rt_sigaction(SIGALRM, {0x402ea6, [ALRM], SA_RESTORER|SA_INTERRUPT, 
0x7f2824eb2cb0}, NULL, 8) = 0
rt_sigaction(SIGINT, {0x402ea6, [INT], SA_RESTORER|SA_INTERRUPT, 
0x7f2824eb2cb0}, NULL, 8) = 0

alarm(1)= 0
connect(4, {sa_family=AF_INET, sin_port=htons(34832), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 0
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 
1024
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 
1024
sendto(4, "netperf\0netperf\0netperf\0netperf\0"..., 1024, 0, NULL, 0) = 
1024


The only difference there from the top of trunk is the "need to 
connect" write/printf I just put in the code to be a nice marker in the 
system call trace.


It is a wild guess, but does setting SO_DONTROUTE affect whether or not 
a connect() would have the desired effect?  That is there to protect 
people from themselves (long story about people using UDP_STREAM to 
stress improperly air-gapped systems during link up/down testing) 
It can be disabled with a test-specific -R 1 option, so your netperf 
command would become:


netperf -H 198.18.50.1 -t UDP_STREAM -l 120 -- -m 1472 -n -N -R 1



(p.s. is netperf ever going to be converted from SVN to git?)



Well, my git-fu could use some work (gentle, offline taps with a 
clueful tutorial bat would be welcome), and at least in the past, going 
to git was held back because there were a bunch of netperf users on 
Windows and there wasn't (at the time) support for git under Windows.


But I am not against the idea in principle.

happy benchmarking,

rick jones

PS - rick.jo...@hp.com no longer works.  rick.jon...@hpe.com should be 
used instead.


Re: [patch] netlink.7: srcfix Change buffer size in example code about reading netlink message.

2016-11-14 Thread Rick Jones

Let's change the example so others don't propagate the problem further.

Signed-off-by: David Wilder <dwil...@us.ibm.com>

--- man7/netlink.7.orig 2016-11-14 13:30:36.522101156 -0800
+++ man7/netlink.7  2016-11-14 13:30:51.002086354 -0800
@@ -511,7 +511,7 @@
 .in +4n
 .nf
 int len;
-char buf[4096];
+char buf[8192];


There doesn't seem to be a define one could use in the user-space 
linux/netlink.h (?), but since there are already comments in the 
example code in the manpage, how about also including a brief comment 
to the effect that using 8192 bytes will avoid message truncation 
problems on platforms with a large PAGE_SIZE?


/* avoid msg truncation on > 4096 byte PAGE_SIZE platforms */

or something like that.
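
That is, have the example read:

    int len;
    /* avoid msg truncation on > 4096 byte PAGE_SIZE platforms */
    char buf[8192];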

rick jones


Re: [PATCH RFC 0/2] ethtool: Add actual port speed reporting

2016-11-03 Thread Rick Jones

And besides, one can argue that in the SR-IOV scenario the VF has no business
knowing the physical port speed.



Good point, but there are more use-cases we should consider.
For example, when using Multi-Host/Flex-10/Multi-PF each PF should
be able to query both physical port speed and actual speed.


Despite my email address, I'm not fully versed on VC/Flex, but I have 
always been under the impression that the flexnics created were, 
conceptually, "distinct" NICs considered independently of the physical 
port over which they operated.  Tossing another worm or three into the 
can: while "back in the day" (when some of the first ethtool changes to 
report speeds other than the "normal" ones went in) the speed of a 
flexnic was fixed, today it can actually operate in a range - from a 
minimum guarantee to an "if there is bandwidth available" cap.


rick jones



Re: [bnx2] [Regression 4.8] Driver loading fails without firmware

2016-10-25 Thread Rick Jones

On 10/25/2016 08:31 AM, Paul Menzel wrote:

To my knowledge, the firmware files haven’t changed since years [1].


Indeed - it looks like I read "bnx2" and thought "bnx2x".  Must remember 
to hold-off on replying until after the morning orange juice is consumed :)


rick


Re: [bnx2] [Regression 4.8] Driver loading fails without firmware

2016-10-25 Thread Rick Jones

On 10/25/2016 07:33 AM, Paul Menzel wrote:

Dear Linux folks,

A server with the Broadcom devices below, fails to load the drivers
because of missing firmware.


I have run into the same sort of issue from time to time when going to a 
newer kernel.  A newer version of the driver wants a newer version of 
the firmware.  Usually, finding a package "out there" with the newer 
version of the firmware, and installing it onto the system is sufficient.


happy benchmarking,

rick jones


Re: Accelerated receive flow steering (aRFS) for UDP

2016-10-10 Thread Rick Jones

On 10/10/2016 09:08 AM, Rick Jones wrote:

On 10/09/2016 03:33 PM, Eric Dumazet wrote:

OK, I am adding/CC Rick Jones, netperf author, since it seems a netperf
bug, not a kernel one.

I believe I already mentioned the fact that "UDP_STREAM -- -N" was not doing
a connect() on the receiver side.


I can confirm that the receive side of the netperf omni path isn't
trying to connect UDP datagrams.  I will see what I can put together.


I've put something together and pushed it to the netperf top of trunk. 
It seems to have been successful on a quick loopback UDP_STREAM test.


happy benchmarking,

rick jones



Re: Accelerated receive flow steering (aRFS) for UDP

2016-10-10 Thread Rick Jones

On 10/09/2016 03:33 PM, Eric Dumazet wrote:

OK, I am adding/CC Rick Jones, netperf author, since it seems a netperf
bug, not a kernel one.

I believe I already mentioned the fact that "UDP_STREAM -- -N" was not doing
a connect() on the receiver side.


I can confirm that the receive side of the netperf omni path isn't 
trying to connect UDP datagrams.  I will see what I can put together.


happy benchmarking,

rick jones
rick.jon...@hpe.com



Re: [PATCH v2 net-next 4/5] xps_flows: XPS for packets that don't have a socket

2016-09-29 Thread Rick Jones

On 09/29/2016 06:18 AM, Eric Dumazet wrote:

Well, then what this patch series is solving ?

You have a producer of packets running on 8 vcpus in a VM.

Packets are exiting the VM and need to be queued on a mq NIC in the
hypervisor.

Flow X can be scheduled on any of these 8 vcpus, so XPS is currently
selecting different TXQ.


Just for completeness, in my testing, the VMs were single-vCPU.

rick jones


Re: [PATCH RFC 0/4] xfs: Transmit flow steering

2016-09-28 Thread Rick Jones


Here is a quick look at performance tests for the result of trying the
prototype fix for the packet reordering problem with VMs sending over
an XPS-configured NIC.  In particular, the Emulex/Avago/Broadcom
Skyhawk.  The fix was applied to a 4.4 kernel.

Before: 3884 Mbit/s
After: 8897 Mbit/s

That was from a VM on a node with a Skyhawk and 2 E5-2640 processors
to baremetal E5-2640 with a BE3.  Physical MTU was 1500, the VM's
vNIC's MTU was 1400.  Systems were HPE ProLiants in OS Control Mode
for power management, with the "performance" frequency governor
loaded. An OpenStack Mitaka setup with Distributed Virtual Router.

We had some other NIC types in the setup as well.  XPS was also
enabled on the ConnectX3-Pro.  It was not enabled on the 82599ES (a
function of the kernel being used, which had it disabled from the
first reports of XPS negatively affecting VM traffic at the beginning
of the year)

Average Mbit/s From NIC type To Bare Metal BE3:
NIC Type, CPU on VM Host      Before   After

ConnectX-3 Pro, E5-2670v3       9224    9271
BE3, E5-2640                    9016    9022
82599, E5-2640                  9192    9003
BCM57840, E5-2640               9213    9153
Skyhawk, E5-2640                3884    8897

For completeness:
Average Mbit/s To NIC type from Bare Metal BE3:
NIC Type, CPU on VM Host      Before   After

ConnectX-3 Pro, E5-2670v3       9322    9144
BE3, E5-2640                    9074    9017
82599, E5-2640                  8670    8564
BCM57840, E5-2640               2468 *  7979
Skyhawk, E5-2640                8897    9269

* This is the Busted bnx2x NIC FW GRO implementation issue.  It was
  not visible in the "After" because the system was setup to disable
  the NIC FW GRO by the time it booted on the fix kernel.

Average Transactions/s Between NIC type and Bare Metal BE3:
NIC Type, CPU on VM Host      Before   After

ConnectX-3 Pro, E5-2670v3      12421   12612
BE3, E5-2640                    8178    8484
82599, E5-2640                  8499    8549
BCM57840, E5-2640               8544    8560
Skyhawk, E5-2640                8537    8701

happy benchmarking,

Drew Balliet
Juerg Haefliger
rick jones

The semi-cooked results with additional statistics:

554M  - BE3
544+M - ConnectX-3 Pro
560M - 82599ES
630M - BCM57840
650M - Skyhawk

(substitute is simply replacing a system name with the model of NIC and CPU)
Bulk To (South) and From (North) VM, Before:
$ ../substitute.sh 
vxlan_554m_control_performance_gvnr_dvr_northsouth_stream.log | 
~/netperf2_trunk/doc/examples/parse_single_stream.py -r -5 -f 1 -f 3 -f 
4 -f 7 -f 8

Field1,Field3,Field4,Field7,Field8,Min,P10,Median,Average,P90,P99,Max,Count
North,560M,E5-2640,554FLB,E5-2640,8148.090,9048.830,9235.400,9192.868,9315.980,9338.845,9339.500,113
North,630M,E5-2640,554FLB,E5-2640,8909.980,9113.238,9234.750,9213.140,9299.442,9336.206,9337.830,47
North,544+M,E5-2670v3,554FLB,E5-2640,9013.740,9182.546,9229.620,9224.025,9264.036,9299.206,9301.970,99
North,650M,E5-2640,554FLB,E5-2640,3187.680,3393.724,3796.160,3884.765,4405.096,4941.391,4956.300,129
North,554M,E5-2640,554FLB,E5-2640,8700.930,8855.768,9026.030,9016.061,9158.846,9213.687,9226.150,135
South,554FLB,E5-2640,560M,E5-2640,7754.350,8193.114,8718.540,8670.612,9026.436,9262.355,9285.010,113
South,554FLB,E5-2640,630M,E5-2640,1897.660,2068.290,2514.430,2468.323,2787.162,2942.934,2957.250,53
South,554FLB,E5-2640,544+M,E5-2670v3,9298.260,9314.432,9323.220,9322.207,9328.324,9330.704,9331.080,100
South,554FLB,E5-2640,650M,E5-2640,8407.050,8907.136,9304.390,9206.776,9321.320,9325.347,9326.410,103
South,554FLB,E5-2640,554M,E5-2640,7844.900,8632.530,9199.385,9074.535,9308.070,9319.224,9322.360,132
0 too-short lines ignored.

Bulk To (South) and From (North) VM, After:

$ ../substitute.sh 
vxlan_554m_control_performance_gvnr_xpsfix_dvr_northsouth_stream.log | 
~/netperf2_trunk/doc/examples/parse_single_stream.py -r -5 -f 1 -f 3 -f 
4 -f 7 -f 8

Field1,Field3,Field4,Field7,Field8,Min,P10,Median,Average,P90,P99,Max,Count
North,560M,E5-2640,554FLB,E5-2640,7576.790,8213.890,9182.870,9003.190,9295.975,9315.878,9318.160,36
North,630M,E5-2640,554FLB,E5-2640,8811.800,8924.000,9206.660,9153.076,9306.287,9315.152,9315.790,12
North,544+M,E5-2670v3,554FLB,E5-2640,9135.990,9228.520,9277.465,9271.875,9324.545,9339.604,9339.780,46
North,650M,E5-2640,554FLB,E5-2640,8133.420,8483.340,8995.040,8897.779,9129.056,9165.230,9165.860,43
North,554M,E5-2640,554FLB,E5-2640,8438.390,8879.150,9048.590,9022.813,9181.540,9248.650,9297.660,101
South,554FLB,E5-2640,630M,E5-2640,7347.120,7592.565,7951.325,7979.951,8365.400,8575.837,8579.890,16
South,554FLB,E5-2640,560M,E5-2640,7719.510,8044.496,8602.750,8564.741,9172.824,9248.686,9259.070,45
South,554

Re: [PATCH v3 net-next 16/16] tcp_bbr: add BBR congestion control

2016-09-19 Thread Rick Jones

On 09/19/2016 02:10 PM, Eric Dumazet wrote:

On Mon, Sep 19, 2016 at 1:57 PM, Stephen Hemminger
<step...@networkplumber.org> wrote:


Looks good, but could I suggest a simple optimization.
All these parameters are immutable in the version of BBR you are submitting.
Why not make the values const? And eliminate the always true long-term bw 
estimate
variable?



We could do that.

We used to have variables (aka module params) while BBR was cooking in
our kernels ;)


Are there better than epsilon odds of someone perhaps wanting to poke 
those values as it gets exposure beyond Google?


happy benchmarking,

rick jones


Re: [PATCH next 3/3] ipvlan: Introduce l3s mode

2016-09-09 Thread Rick Jones

On 09/09/2016 02:53 PM, Mahesh Bandewar wrote:


@@ -48,6 +48,11 @@ master device for the L2 processing and routing from that 
instance will be
 used before packets are queued on the outbound device. In this mode the slaves
 will not receive nor can send multicast / broadcast traffic.

+4.3 L3S mode:
+   This is very similar to the L3 mode except that iptables conn-tracking
+works in this mode and that is why L3-symsetric (L3s) from iptables 
perspective.
+This will have slightly less performance but that shouldn't matter since you
+are choosing this mode over plain-L3 mode to make conn-tracking work.


What is that first sentence trying to say?  It appears to be incomplete, 
and is that supposed to be "L3-symmetric?"


happy benchmarking,

rick jones


Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more

2016-09-08 Thread Rick Jones

On 09/08/2016 11:16 AM, Tom Herbert wrote:

On Thu, Sep 8, 2016 at 10:19 AM, Jesper Dangaard Brouer
<bro...@redhat.com> wrote:

On Thu, 8 Sep 2016 09:26:03 -0700
Tom Herbert <t...@herbertland.com> wrote:

Shouldn't qdisc bulk size be based on the BQL limit? What is the
simple algorithm to apply to in-flight packets?


Maybe the algorithm is not so simple, and we likely also have to take
BQL bytes into account.

The reason for wanting packets-in-flight is because we are attacking a
transaction cost.  The tailptr/doorbell cost around 70ns.  (Based on
data in this patch desc, 4.9Mpps -> 7.5Mpps (1/4.90-1/7.5)*1000 =
70.74). The 10G wirespeed small packets budget is 67.2ns, this with
fixed overhead per packet of 70ns we can never reach 10G wirespeed.


But you should be able to do this with BQL and it is more accurate.
BQL tells how many bytes need to be sent and that can be used to
create a bulk of packets to send with one doorbell.


With small packets and the "default" ring size for this NIC/driver 
combination, is the BQL limit large enough that the ring fills before 
one hits it?


rick jones



Re: [PATCH] softirq: let ksoftirqd do its job

2016-08-31 Thread Rick Jones

On 08/31/2016 04:11 PM, Eric Dumazet wrote:

On Wed, 2016-08-31 at 15:47 -0700, Rick Jones wrote:

With regard to drops, are both of you sure you're using the same socket
buffer sizes?


Does it really matter ?


At least at points in the past I have seen different drop counts at the 
SO_RCVBUF when using (sometimes much) larger sizes.  The hypothesis I 
was operating under at the time was that this dealt with those 
situations where the netserver was held-off from running for "a little 
while" from time to time.  It didn't change things for a sustained 
overload situation though.



In the meantime, is anything interesting happening with TCP_RR or
TCP_STREAM?


TCP_RR is driven by the network latency, we do not drop packets in the
socket itself.


I've been of the opinion it (single stream) is driven by path length. 
Sometimes by NIC latency.  But then I'm almost always measuring in the 
LAN rather than across the WAN.


happy benchmarking,

rick


Re: [PATCH] softirq: let ksoftirqd do its job

2016-08-31 Thread Rick Jones
With regard to drops, are both of you sure you're using the same socket 
buffer sizes?


In the meantime, is anything interesting happening with TCP_RR or 
TCP_STREAM?


happy benchmarking,

rick jones


Re: [PATCH v2 net-next] documentation: Document issues with VMs and XPS and drivers enabling it on their own

2016-08-29 Thread Rick Jones

On 08/27/2016 12:41 PM, Tom Herbert wrote:

On Fri, Aug 26, 2016 at 9:35 PM, David Miller <da...@davemloft.net> wrote:

From: Tom Herbert <t...@herbertland.com>
Date: Thu, 25 Aug 2016 16:43:35 -0700


This seems like it will only confuse users even more. You've clearly
identified an issue, let's figure out how to fix it.


I kinda feel the same way about this situation.


I'm working on XFS (as the transmit analogue to RFS). We'll track
flows enough so that we should know when it's safe to move them.


Is the XFS you are working on going to subsume XPS or will the two 
continue to exist in parallel a la RPS and RFS?


rick jones



[PATCH v2 net-next] documentation: Document issues with VMs and XPS and drivers enabling it on their own

2016-08-25 Thread Rick Jones
From: Rick Jones <rick.jon...@hpe.com>

Since XPS was first introduced two things have happened.  Some drivers
have started enabling XPS on their own initiative, and it has been
found that when a VM is sending data through a host interface with XPS
enabled, that traffic can end-up seriously out of order.

Signed-off-by: Rick Jones <rick.jon...@hpe.com>
Reviewed-by: Alexander Duyck <alexander.h.du...@intel.com>
---

diff --git a/Documentation/networking/scaling.txt 
b/Documentation/networking/scaling.txt
index 59f4db2..50cc888 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -400,15 +400,31 @@ transport layer is responsible for setting ooo_okay 
appropriately. TCP,
 for instance, sets the flag when all data for a connection has been
 acknowledged.
 
+When the traffic source is a VM running on the host, there is no
+socket structure known to the host.  In this case, unless the VM is
+itself CPU-pinned, the traffic being sent from it can end-up queued to
+multiple transmit queues and end-up being transmitted out of order.
+
+In some cases this can result in a considerable loss of performance.
+
+In such situations, XPS should not be enabled at runtime, or
+explicitly disabled if the NIC driver(s) in question enable it on
+their own.  Otherwise, if possible, the VMs should be CPU pinned.
+
 ==== XPS Configuration
 
-XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
-default for SMP). The functionality remains disabled until explicitly
-configured. To enable XPS, the bitmap of CPUs that may use a transmit
-queue is configured using the sysfs file entry:
+XPS is available only if the kconfig symbol CONFIG_XPS is enabled
+prior to building the kernel.  It is enabled by default for SMP kernel
+configurations.  In many cases the functionality remains disabled at
+runtime until explicitly configured by the system administrator. To
+enable XPS, the bitmap of CPUs that may use a transmit queue is
+configured using the sysfs file entry:
 
 /sys/class/net/<dev>/queues/tx-<n>/xps_cpus
 
+However, some NIC drivers will configure XPS at runtime for the
+interfaces they drive, via a call to netif_set_xps_queue.
+
 == Suggested Configuration
 
 For a network device with a single transmission queue, XPS configuration


[PATCH net-next] documentation: Document issues with VMs and XPS and drivers enabling it on their own

2016-08-25 Thread Rick Jones
From: Rick Jones <rick.jon...@hpe.com>

Since XPS was first introduced two things have happened.  Some drivers
have started enabling XPS on their own initiative, and it has been
found that when a VM is sending data through a host interface with XPS
enabled, that traffic can end-up seriously out of order.

Signed-off-by: Rick Jones <rick.jon...@hpe.com>

---

diff --git a/Documentation/networking/scaling.txt 
b/Documentation/networking/scaling.txt
index 59f4db2..50cc888 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -400,15 +400,31 @@ transport layer is responsible for setting ooo_okay 
appropriately. TCP,
 for instance, sets the flag when all data for a connection has been
 acknowledged.
 
+When the traffic source is a VM running on the host, there is no
+socket structure known to the host.  In this case, unless the VM is
+itself CPU-pinned, the traffic being sent from it can end-up queued to
+multiple transmit queues and end-up being transmitted out of order.
+
+In some cases this can result in a considerable loss of performance.
+
+In such situations, XPS should not be enabled at runtime, or
+explicitly disabled if the NIC driver(s) in question enable it on
+their own.  Otherwise, if possible, the VMs should be CPU pinned.
+
 ==== XPS Configuration
 
-XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
-default for SMP). The functionality remains disabled until explicitly
-configured. To enable XPS, the bitmap of CPUs that may use a transmit
-queue is configured using the sysfs file entry:
+XPS is available only if the kconfig symbol CONFIG_XPS is enabled
+prior to building the kernel.  It is enabled by default for SMP kernel
+configurations.  In many cases the functionality remains disabled at
+runtime until explicitly configured by the system administrator. To
+enable XPS, the bitmap of CPUs that may use a transmit queue is
+configured using the sysfs file entry:
 
 /sys/class/net/<dev>/queues/tx-<n>/xps_cpus
 
+However, some NIC drivers will configure XPS at runtime for the
+interfaces they drive, via a call to netif_set_xps_queue.
+
 == Suggested Configuration
 
 For a network device with a single transmission queue, XPS configuration


Re: [RFC PATCH] net: Require socket to allow XPS to set queue mapping

2016-08-25 Thread Rick Jones

On 08/25/2016 02:08 PM, Eric Dumazet wrote:

When XPS was submitted, it was _not_ enabled by default and 'magic'

Some NIC vendors decided it was a good thing, you should complain to
them ;)


I kindasorta am with the emails I've been sending to netdev :)  And also 
hopefully precluding others going down that path.


happy benchmarking,

rick



Re: [RFC PATCH] net: Require socket to allow XPS to set queue mapping

2016-08-25 Thread Rick Jones

On 08/25/2016 12:49 PM, Eric Dumazet wrote:

On Thu, 2016-08-25 at 12:23 -0700, Alexander Duyck wrote:

A simpler approach is provided with this patch.  With it we disable XPS any
time a socket is not present for a given flow.  By doing this we can avoid
using XPS for any routing or bridging situations in which XPS is likely
more of a hinderance than a help.


Yes, but this will destroy isolation for people properly doing VM cpu
pining.


Why not simply stop enabling XPS by default?  Treat it like RPS and RFS 
(unless I've missed a patch...). The people who are already doing the 
extra steps to pin VMs can enable XPS in that case.  It isn't clear that 
one should always pin VMs - for example if a (public) cloud needed to 
oversubscribe the cores.


happy benchmarking,

rick jones


Re: A second case of XPS considerably reducing single-stream performance

2016-08-25 Thread Rick Jones

On 08/25/2016 12:19 PM, Alexander Duyck wrote:

The problem is that there is no socket associated with the guest from
the host's perspective.  This is resulting in the traffic bouncing
between queues because there is no saved socket  to lock the interface
onto.

I was looking into this recently as well and had considered a couple
of options.  The first is to fall back to just using skb_tx_hash()
when skb->sk is null for a given buffer.  I have a patch I have been
toying around with but I haven't submitted it yet.  If you would like
I can submit it as an RFC to get your thoughts.  The second option is
to enforce the use of RPS for any interfaces that do not perform Rx in
NAPI context.  The correct solution for this is probably some
combination of the two as you have to have all queueing done in order
at every stage of the packet processing.


I don't know which interfaces would be hit, but just in general, I'm not 
sure that requiring RPS be enabled is a good solution - picking where 
traffic is processed based on its addressing is fine in a benchmarking 
situation, but I think it is better to have the process/thread scheduler 
decide where something should run and not the addressing of the 
connections that thread/process is servicing.


I would be interested in seeing the RFC patch you propose.

Apart from that, given the prevalence of VMs these days I wonder if 
perhaps simply not enabling XPS by default isn't a viable alternative. 
I've not played with containers to know if they would exhibit this too.


Drifting ever so slightly, if drivers are going to continue to enable 
XPS by default, Documentation/networking/scaling.txt might use a tweak:


diff --git a/Documentation/networking/scaling.txt 
b/Documentation/networking/scaling.txt

index 59f4db2..8b5537c 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -402,10 +402,12 @@ acknowledged.

 ==== XPS Configuration

-XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
-default for SMP). The functionality remains disabled until explicitly
-configured. To enable XPS, the bitmap of CPUs that may use a transmit
-queue is configured using the sysfs file entry:
+XPS is available only when the kconfig symbol CONFIG_XPS is enabled
+(on by default for SMP). The drivers for some NICs will enable the
+functionality by default.  For others the functionality remains
+disabled until explicitly configured. To enable XPS, the bitmap of
+CPUs that may use a transmit queue is configured using the sysfs file
+entry:

 /sys/class/net/<dev>/queues/tx-<n>/xps_cpus


The original wording leaves the impression that XPS is not enabled by 
default.


rick


Re: A second case of XPS considerably reducing single-stream performance

2016-08-24 Thread Rick Jones
Also, while it doesn't seem to have the same massive effect on 
throughput, I can also see out of order behaviour happening when the 
sending VM is on a node with a ConnectX-3 Pro NIC.  Its driver is also 
enabling XPS it would seem.  I'm not *certain* but looking at the traces 
it appears that with the ConnectX-3 Pro there is more interleaving of 
the out-of-order traffic than there is with the Skyhawk.  The ConnectX-3 
Pro happens to be in a newer generation server with a newer processor 
than the other systems where I've seen this.


I do not see the out-of-order behaviour when the NIC at the sending end 
is a BCM57840.  It does not appear that the bnx2x driver in the 4.4 
kernel is enabling XPS.


So, it would seem that there are three cases of enabling XPS resulting 
in out-of-order traffic, two of which result in a non-trivial loss of 
performance.


happy benchmarking,

rick jones


Re: [PATCH net-next] net: minor optimization in qdisc_qstats_cpu_drop()

2016-08-24 Thread Rick Jones

On 08/24/2016 10:23 AM, Eric Dumazet wrote:

From: Eric Dumazet <eduma...@google.com>

per_cpu_inc() is faster (at least on x86) than per_cpu_ptr(xxx)++;


Is it possible it is non-trivially slower on other architectures?

rick jones



Signed-off-by: Eric Dumazet <eduma...@google.com>
---
 include/net/sch_generic.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 
0d501779cc68f9426e58da6d039dd64adc937c20..52a2015667b49c8315edbb26513a98d4c677fee5
 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -592,7 +592,7 @@ static inline void qdisc_qstats_drop(struct Qdisc *sch)

 static inline void qdisc_qstats_cpu_drop(struct Qdisc *sch)
 {
-   qstats_drop_inc(this_cpu_ptr(sch->cpu_qstats));
+   this_cpu_inc(sch->cpu_qstats->drops);
 }

 static inline void qdisc_qstats_overlimit(struct Qdisc *sch)





A second case of XPS considerably reducing single-stream performance

2016-08-24 Thread Rick Jones
Back in February of this year, I reported some performance issues with 
the ixgbe driver enabling XPS by default and instance network 
performance in OpenStack:


http://www.spinics.net/lists/netdev/msg362915.html

I've now seen the same thing with be2net and Skyhawk.  In this case, the 
magnitude of the delta is even greater.  Disabling XPS increased the 
netperf single-stream performance out of the instance from an average of 
4108 Mbit/s to 8888 Mbit/s, or 116%.


Should drivers really be enabling XPS by default?

  Instance To Outside World
Single-stream netperf
~30 Samples for Each Statistic
  Mbit/s

            Skyhawk           BE3 #1            BE3 #2
         XPS On   XPS Off  XPS On   XPS Off  XPS On   XPS Off
Median    4192     8883     8930     8853     8917     8695
Average   4108     8888     8940     8859     8885     8671

happy benchmarking,

rick jones

The sample counts below may not fully support the additional statistics 
but for the curious:


raj@tardy:/tmp$ ~/netperf2_trunk/doc/examples/parse_single_stream.py -r 
6 waxon_performance.log  -f 2

Field2,Min,P10,Median,Average,P90,P99,Max,Count
be3-1,8758.850,8811.600,8930.900,8940.555,9096.470,9175.839,9183.690,31
be3-2,8588.450,8736.967,8917.075,8885.322,9017.914,9075.735,9094.620,32
skyhawk,3326.760,3536.008,4192.780,4108.513,4651.164,4723.322,4724.320,27
0 too-short lines ignored.
raj@tardy:/tmp$ ~/netperf2_trunk/doc/examples/parse_single_stream.py -r 
6 waxoff_performance.log  -f 2

Field2,Min,P10,Median,Average,P90,P99,Max,Count
be3-1,8461.080,8634.690,8853.260,8859.870,9064.480,9247.770,9253.050,31
be3-2,7519.130,8368.564,8695.140,8671.241,9068.588,9200.719,9241.500,27
skyhawk,8071.180,8651.587,8883.340,8888.411,9135.603,9141.229,9142.010,32
0 too-short lines ignored.

"waxon" is with XPS enabled, "waxoff" is with XPS disabled.  The servers 
are the same models/config as in February.


stack@np-cp1-comp0013-mgmt:~$ sudo ethtool -i hed3
driver: be2net
version: 10.6.0.3
firmware-version: 10.7.110.45


Re: [PATCH net 1/2] tg3: Fix for diasllow rx coalescing time to be 0

2016-08-03 Thread Rick Jones

On 08/02/2016 09:13 PM, skallam wrote:

From: Satish Baddipadige <satish.baddipad...@broadcom.com>

When the rx coalescing time is 0, interrupts
are not generated from the controller and rx path hangs.
To avoid this rx hang, updating the driver to not allow
rx coalescing time to be 0.

Signed-off-by: Satish Baddipadige <satish.baddipad...@broadcom.com>
Signed-off-by: Siva Reddy Kallam <siva.kal...@broadcom.com>
Signed-off-by: Michael Chan <michael.c...@broadcom.com>
---
 drivers/net/ethernet/broadcom/tg3.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c 
b/drivers/net/ethernet/broadcom/tg3.c
index ff300f7..f3c6c91 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -14014,6 +14014,7 @@ static int tg3_set_coalesce(struct net_device *dev, 
struct ethtool_coalesce *ec)
}

if ((ec->rx_coalesce_usecs > MAX_RXCOL_TICKS) ||
+   (!ec->rx_coalesce_usecs) ||
(ec->tx_coalesce_usecs > MAX_TXCOL_TICKS) ||
(ec->rx_max_coalesced_frames > MAX_RXMAX_FRAMES) ||
(ec->tx_max_coalesced_frames > MAX_TXMAX_FRAMES) ||



Should anything then happen with:

/* No rx interrupts will be generated if both are zero */
if ((ec->rx_coalesce_usecs == 0) &&
(ec->rx_max_coalesced_frames == 0))
return -EINVAL;


which is the next block of code?  The logic there seems to suggest that 
it was intended to be able to have an rx_coalesce_usecs of 0 and rely on 
packet arrival to trigger an interrupt.  Presumably setting 
rx_max_coalesced_frames to 1 to disable interrupt coalescing.
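
From userspace that would presumably have been requested with something 
along these lines (interface name illustrative):

   # no time-based coalescing, interrupt on every received frame
   ethtool -C eth0 rx-usecs 0 rx-frames 1

which is exactly the combination the new check above would now reject.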


happy benchmarking,

rick jones


Re: [iproute PATCH 0/2] Netns performance improvements

2016-07-08 Thread Rick Jones

On 07/08/2016 01:01 AM, Nicolas Dichtel wrote:

Those 300 routers will each have at least one namespace along with the dhcp
namespaces.  Depending on the nature of the routers (Distributed versus
Centralized Virtual Routers - DVR vs CVR) and whether the routers are supposed
to be "HA" there can be more than one namespace for a given router.

300 routers is far from the upper limit/goal.  Back in HP Public Cloud, we were
running as many as 700 routers per network node (*), and more than four network
nodes. (back then it was just the one namespace per router and network). Mileage
will of course vary based on the "oomph" of one's network node(s).

Thank you for the details.

Do you have a script or something else to easily reproduce this problem?


Do you mean for my much older, slightly different stuff done in HP 
Public Cloud, or for what Phil (?) is doing presently?  I believe Phil 
posted something several messages back in the thread.


happy benchmarking,

rick jones


Re: [iproute PATCH 0/2] Netns performance improvements

2016-07-07 Thread Rick Jones

On 07/07/2016 09:34 AM, Eric W. Biederman wrote:

Rick Jones <rick.jon...@hpe.com> writes:

300 routers is far from the upper limit/goal.  Back in HP Public
Cloud, we were running as many as 700 routers per network node (*),
and more than four network nodes. (back then it was just the one
namespace per router and network). Mileage will of course vary based
on the "oomph" of one's network node(s).


To clarify processes for these routers and dhcp servers are created
with "ip netns exec"?


I believe so, but it would be good to have someone else confirm that, 
and speak to your paragraph below.



If that is the case and you are using this feature as effectively a
lightweight container and not lots vrfs in a single network stack
then I suspect much larger gains can be had by creating a variant
of ip netns exec avoids the mount propagation.



...


* Didn't want to go much higher than that because each router had a
port on a common linux bridge and getting to > 1024 would be an
unpleasant day.


* I would have thought all you have to do is bump of the size
   of the linux neighbour cache.  echo $BIGNUM > 
/proc/sys/net/ipv4/neigh/default/gc_thresh3


We didn't want to hit the 1024 port limit of a (then?) Linux bridge.

rick

Having a bit of deja vu but I suspect things like commit 
0818bf27c05b2de56c5b2bd08cfae2a939bd5f52  are not exactly on the same 
wavelength, just my brain seeing "namespaces" and "performance" and 
lighting-up :)


Re: [iproute PATCH 0/2] Netns performance improvements

2016-07-07 Thread Rick Jones

On 07/07/2016 08:48 AM, Phil Sutter wrote:

On Thu, Jul 07, 2016 at 02:59:48PM +0200, Nicolas Dichtel wrote:

Le 07/07/2016 13:17, Phil Sutter a Ă©crit :
[snip]

The issue came up during OpenStack Neutron testing, see this ticket for
reference:

https://bugzilla.redhat.com/show_bug.cgi?id=1310795

Access to this ticket is not public :(


*Sigh* OK, here are a few quotes:

"OpenStack Neutron controller nodes, when undergoing testing, are
locking up specifically during creation and mounting of namespaces.
They appear to be blocking behind vfsmount_lock, and contention for the
namespace_sem"

"During the scale testing, we have 300 routers, 600 dhcp namespaces
spread across four neutron network nodes. When then start as one set of
standard Openstack Rally benchmark test cycle against neutron. An
example scenario is creating 10x networks, list them, delete them and
repeat 10x times. The second set performs an L3 benchmark test between
two instances."



Those 300 routers will each have at least one namespace along with the 
dhcp namespaces.  Depending on the nature of the routers (Distributed 
versus Centralized Virtual Routers - DVR vs CVR) and whether the routers 
are supposed to be "HA" there can be more than one namespace for a given 
router.


300 routers is far from the upper limit/goal.  Back in HP Public Cloud, 
we were running as many as 700 routers per network node (*), and more 
than four network nodes. (back then it was just the one namespace per 
router and network). Mileage will of course vary based on the "oomph" of 
one's network node(s).


happy benchmarking,

rick jones

* Didn't want to go much higher than that because each router had a port 
on a common linux bridge and getting to > 1024 would be an unpleasant day.


Re: strange Mac OSX RST behavior

2016-07-01 Thread Rick Jones

On 07/01/2016 08:10 AM, Jason Baron wrote:

I'm wondering if anybody else has run into this...

On Mac OSX 10.11.5 (latest version), we have found that when tcp
connections are abruptly terminated (via ^C), a FIN is sent followed
by an RST packet.


That just seems, well, silly.  If the client application wants to use 
abortive close (sigh..) it should do so, there shouldn't be this 
little-bit-pregnant, correct close initiation (FIN) followed by a RST.



The RST is sent with the same sequence number as the
FIN, and thus dropped since the stack only accepts RST packets matching
rcv_nxt (RFC 5961). This could also be resolved if Mac OSX replied with
an RST on the closed socket, but it appears that it does not.

The workaround here is then to reset the connection, if the RST is
is equal to rcv_nxt - 1, if we have already received a FIN.

The RST attack surface is limited b/c we only accept the RST after we've
accepted a FIN and have not previously sent a FIN and received back the
corresponding ACK. In other words RST is only accepted in the tcp
states: TCP_CLOSE_WAIT, TCP_LAST_ACK, and TCP_CLOSING.

I'm interested if anybody else has run into this issue. Its problematic
since it takes up server resources for sockets sitting in TCP_CLOSE_WAIT.


Isn't the server application expected to act on the read return of zero 
(which is supposed to be) triggered by the receipt of the FIN segment?


rick jones


We are also in the process of contacting Apple to see what can be done
here...workaround patch is below.


Re: [PATCH v12 net-next 1/1] hv_sock: introduce Hyper-V Sockets

2016-06-28 Thread Rick Jones

On 06/28/2016 02:59 AM, Dexuan Cui wrote:

The idea here is: IMO the syscalls sys_read()/write() shouldn't return
-ENOMEM, so I have to make sure the buffer allocation succeeds?

I tried to use kmalloc with __GFP_NOFAIL, but I hit a warning in
in mm/page_alloc.c:
WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));

What error code do you think I should return?
EAGAIN, ERESTARTSYS, or something else?

May I have your suggestion? Thanks!


What happens as far as errno is concerned when an application makes a 
read() call against a (say TCP) socket associated with a connection 
which has been reset?  Is it limited to those errno values listed in the 
read() manpage, or does it end-up getting an errno value from those 
listed in the recv() manpage?  Or, perhaps even one not (presently) 
listed in either?


rick jones



Re: [PATCH net-next 0/8] tou: Transports over UDP - part I

2016-06-24 Thread Rick Jones

On 06/24/2016 04:43 PM, Tom Herbert wrote:

Here's Christoph's slides on TFO in the wild which presents a good
summary of the middlebox problem. There is one significant difference
in that ECN needs network support whereas TFO didn't. Given that
experience, I'm doubtful other new features at L4 could ever be
productively use (like EDO or maybe TCP-ENO).

https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf


Perhaps I am being overly optimistic, but my takeaway from those slides 
is Apple were able to come-up with ways to deal with the middleboxes and 
so could indeed productively use TCP FastOpen.


"Overall, very good success-rate"
though tempered by
"But... middleboxes were a big issue in some ISPs..."

Though it doesn't get into how big (some connections, many, most, all?) 
and how many ISPs.


rick jones

Just an anecdote...  Not that I am a "power user" of my iPhone running 
9.3.2 (13F69) nor that I know that anything I am using is the Apple 
Service stated as using TFO (mostly Safari, Mail and Messages) but if it 
is, I cannot say that any troubles under the covers have been noticed by me.


Re: [PATCH net-next 0/8] tou: Transports over UDP - part I

2016-06-24 Thread Rick Jones

On 06/24/2016 02:46 PM, Tom Herbert wrote:

On Fri, Jun 24, 2016 at 2:36 PM, Rick Jones <rick.jon...@hpe.com> wrote:

How would you define "severely?"  Has it actually been more severe than for
say ECN?  Or it was for say SACK or PAWS?


ECN is probably even a bigger disappointment in terms of seeing
deployment :-( From http://ecn.ethz.ch/ecn-pam15.pdf:

"Even though ECN was standardized in 2001, and it is widely
implemented in end-systems, it is barely deployed. This is due to a
history of problems with severely broken middleboxes shortly after
standardization, which led to connectivity failure and guidance to
leave ECN disabled."

SACK and PAWS seemed to have faired a little better I believe.


The conclusion of that (rather interesting) paper reads:

"Our analysis therefore indicates that enabling ECN by default would
lead to connections to about five websites per thousand to suffer
additional setup latency with RFC 3168 fallback. This represents an
order of magnitude fewer than the about forty per thousand which
experience transient or permanent connection failure due to other
operational issues"

Doesn't that then suggest that not enabling ECN is basically a matter of 
FUD more than remaining assumed broken middleboxes?


My main point is that in the past at least, trouble with broken 
middleboxes didn't lead us to start wrapping all our TCP/transport 
traffic in UDP to try to hide it from them.  We've managed to get SACK 
and PAWS universal without having to resort to that, and it would seem 
we could get ECN universal if we could overcome our FUD.  Why would TFO 
for instance be any different?
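
Both are, after all, just a sysctl away on Linux - settings illustrative 
of "fully on":

   sysctl -w net.ipv4.tcp_ecn=1       # request ECN on outgoing connections too
   sysctl -w net.ipv4.tcp_fastopen=3  # TFO for both client and server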


There was an equally interesting second paragraph in the conclusion:

"As not all websites are equally popular, failures on five per thousand
websites does not by any means imply that five per thousand connection 
attempts will fail. While estimation of connection attempt rate by rank 
is out of scope of this work, we note that the highest ranked website 
exhibiting stable connection failure has rank 596, and only 13 such 
sites appear in the top 5000"


rick jones


Re: [PATCH net-next 0/8] tou: Transports over UDP - part I

2016-06-24 Thread Rick Jones

On 06/24/2016 02:12 PM, Tom Herbert wrote:

The client OS side is only part of the story. Middlebox intrusion at
L4 is also a major issue we need to address. The "failure" of TFO is a
good case study. Both the upgrade issues on clients and the tendency
for some middleboxes to drop SYN packets with data have together
severely hindered what otherwise should have been straightforward and
useful feature to deploy.


How would you define "severely?"  Has it actually been more severe than 
for say ECN?  Or it was for say SACK or PAWS?


rick jones



Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support

2016-06-22 Thread Rick Jones

On 06/22/2016 04:10 PM, Rick Jones wrote:

My systems are presently in the midst of an install but I should be able
to demonstrate it in the morning (US Pacific time, modulo the shuttle
service of a car repair place)


The installs finished sooner than I thought.  So, receiver:


root@np-cp1-comp0001-mgmt:/home/stack# uname -a
Linux np-cp1-comp0001-mgmt 4.4.11-2-amd64-hpelinux #hpelinux1 SMP Mon 
May 23 15:39:22 UTC 2016 x86_64 GNU/Linux

root@np-cp1-comp0001-mgmt:/home/stack# ethtool -i hed2
driver: bnx2x
version: 1.712.30-0
firmware-version: bc 7.10.10
bus-info: :05:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

the hed2 interface is a port of an HPE 630M NIC, based on the BCM57840:

05:00.0 Ethernet controller: Broadcom Corporation BCM57840 NetXtreme II 
10/20-Gigabit Ethernet (rev 11)

Subsystem: Hewlett-Packard Company HP FlexFabric 20Gb 2-port 630M 
Adapter

(The pci.ids entry being from before that 10 GbE IP was purchased from 
Broadcom by QLogic...)


Verify that LRO is disabled (IIRC it is enabled by default):

root@np-cp1-comp0001-mgmt:/home/stack# ethtool -k hed2 | grep large
large-receive-offload: off

Verify that disable_tpa is not set:

root@np-cp1-comp0001-mgmt:/home/stack# cat 
/sys/module/bnx2x/parameters/disable_tpa

0

So this means we will see NIC-firmware GRO.

Start a tcpdump on the receiver:
root@np-cp1-comp0001-mgmt:/home/stack# tcpdump -s 96 -c 200 -i hed2 
-w foo.pcap port 12867
tcpdump: listening on hed2, link-type EN10MB (Ethernet), capture size 96 
bytes


Start a netperf test targeting that system, specifying a smaller MSS:

stack@np-cp1-comp0002-mgmt:~$ ./netperf -H np-cp1-comp0001-guest -- -G 
1400 -P 12867 -O throughput,transport_mss
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to 
np-cp1-comp0001-guest () port 12867 AF_INET : demo

Throughput Transport
           MSS
           bytes

3372.82    1388

Come back to the receiver and post-process the tcpdump capture to get 
the average segment size for the data segments:


200 packets captured
2000916 packets received by filter
0 packets dropped by kernel
root@np-cp1-comp0001-mgmt:/home/stack# tcpdump -n -r foo.pcap | fgrep -v 
"length 0" | awk '{sum += $NF}END{print "Average:",sum/NR}'

reading from file foo.pcap, link-type EN10MB (Ethernet)
Average: 2741.93

and finally a snippet of the capture:

00:37:47.333414 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [S], seq 
1236484791, win 28000, options [mss 1400,sackOK,TS val 1491134 ecr 
0,nop,wscale 7], length 0
00:37:47.333488 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [S.], 
seq 134167501, ack 1236484792, win 28960, options [mss 1460,sackOK,TS 
val 1499053 ecr 1491134,nop,wscale 7], length 0
00:37:47.333731 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], ack 
1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], length 0
00:37:47.333788 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
1:2777, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], 
length 2776
00:37:47.333815 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 
2777, win 270, options [nop,nop,TS val 1499053 ecr 1491134], length 0
00:37:47.333822 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
2777:5553, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], 
length 2776
00:37:47.333837 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 
5553, win 313, options [nop,nop,TS val 1499053 ecr 1491134], length 0
00:37:47.333842 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
5553:8329, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 1499053], 
length 2776
00:37:47.333856 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
8329:11105, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 
1499053], length 2776
00:37:47.333869 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 
8329, win 357, options [nop,nop,TS val 1499053 ecr 1491134], length 0
00:37:47.333879 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
11105:13881, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 
1499053], length 2776
00:37:47.333891 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 
11105, win 400, options [nop,nop,TS val 1499053 ecr 1491134], length 0
00:37:47.333911 IP 192.168.2.7.12867 > 192.168.2.8.12867: Flags [.], ack 
13881, win 444, options [nop,nop,TS val 1499053 ecr 1491134], length 0
00:37:47.333964 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
13881:16657, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 
1499053], length 2776
00:37:47.333982 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
16657:19433, ack 1, win 219, options [nop,nop,TS val 1491134 ecr 
1499053], length 2776
00:37:47.333989 IP 192.168.2.8.12867 > 192.168.2.7.12867: Flags [.], seq 
19433:22209, ack 1, win 219, options [nop,nop,TS val 149

Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support

2016-06-22 Thread Rick Jones

On 06/22/2016 03:56 PM, Alexander Duyck wrote:

On Wed, Jun 22, 2016 at 3:47 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:

On Wed, 2016-06-22 at 14:52 -0700, Rick Jones wrote:

Had the bnx2x-driven NICs' firmware not had that rather unfortunate
assumption about MSSes I probably would never have noticed.




It could be that you and Rick are running different firmware. I
believe you can expose that via "ethtool -i".  This is the ugly bit
about all this.  We are offloading GRO into the firmware of these
devices with no idea how any of it works and by linking GRO to LRO on
the same device you are stuck having to accept either the firmware
offload or nothing at all.  That is kind of the point Rick was trying
to get at.


I think you are typing a bit too far ahead into my keyboard with that 
last sentence.  And I may not have been sufficiently complete in what I 
wrote.  If the bnx2x-driven NICs' firmware had been coalescing more than 
two segments together, not only would I probably not have noticed, I 
probably would not have been upset to learn it was NIC-firmware GRO 
rather than stack.


My complaint is the specific bug of coalescing only two segments when 
their size is unexpected, and the difficulty present in disabling the 
bnx2x-driven NICs' firmware GRO.  I don't have a problem necessarily 
with the existence of NIC-firmware GRO in general.  I just want to be 
able to enable/disable it easily.


rick jones

Of course, what I really want are much, Much, MUCH larger MTUs.  It 
isn't for nothing that I used to refer to TSO as "Poor man's Jumbo 
Frames" :)


Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support

2016-06-22 Thread Rick Jones

On 06/22/2016 03:47 PM, Eric Dumazet wrote:

On Wed, 2016-06-22 at 14:52 -0700, Rick Jones wrote:

On 06/22/2016 11:22 AM, Yuval Mintz wrote:

But seriously, this isn't really anything new but rather a step forward in
the direction we've already taken - bnx2x/qede are already performing
the same for non-encapsulated TCP.


Since you mention bnx2x...   I would argue that the NIC firmware on
those NICs driven by bnx2x is doing it badly.  Not so much from a
functional standpoint I suppose, but from a performance one.  The
NIC-firmware GRO done there has this rather unfortunate assumption about
"all MSSes will be directly driven by my own physical MTU" and when it
sees segments of a size other than would be suggested by the physical
MTU, will coalesce only two segments together.  They then do not get
further coalesced in the stack.

Suffice it to say this does not do well from a performance standpoint.

One can disable LRO via ethtool for these NICs, but what that does is
disable old-school LRO, not GRO-in-the-NIC.  To get that disabled, one
must also get the bnx2x module loaded with "disable-tpa=1" so the Linux
stack GRO gets used instead.

Had the bnx2x-driven NICs' firmware not had that rather unfortunate
assumption about MSSes I probably would never have noticed.


I do not see this behavior on my bnx2x nics ?

ip ro add 10.246.11.52 via 10.246.11.254 dev eth0 mtu 1000
lpk51:~# ./netperf -H 10.246.11.52 -l 1000
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
10.246.11.52 () port 0 AF_INET


I first saw this with VMs which themselves had 1400 byte MTUs on their 
vNICs, speaking though bnx2x-driven NICs with a 1500 byte MTU, but I did 
later reproduce it by tweaking the MTU of my sending side NIC to 
something like 1400 bytes and running a "bare iron" netperf.  I believe 
you may be able to achieve the same thing by having netperf set a 
smaller MSS via the test-specific -G option.


My systems are presently in the midst of an install but I should be able 
to demonstrate it in the morning (US Pacific time, modulo the shuttle 
service of a car repair place)



On receiver :


Paranoid question, but is LRO disabled on the receiver?  I don't know 
that LRO exhibits the behaviour, just GRO-in-the-NIC.


rick



15:46:08.296241 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
ack 303360, win 8192, options [nop,nop,TS val 1245217243 ecr
1245306446], length 0
15:46:08.296430 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.],
seq 303360:327060, ack 1, win 229, options [nop,nop,TS val 1245306446
ecr 1245217242], length 23700
15:46:08.296441 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
ack 327060, win 8192, options [nop,nop,TS val 1245217243 ecr
1245306446], length 0
15:46:08.296644 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.],
seq 327060:350760, ack 1, win 229, options [nop,nop,TS val 1245306446
ecr 1245217242], length 23700
15:46:08.296655 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
ack 350760, win 8192, options [nop,nop,TS val 1245217244 ecr
1245306446], length 0
15:46:08.296854 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.],
seq 350760:374460, ack 1, win 229, options [nop,nop,TS val 1245306446
ecr 1245217242], length 23700
15:46:08.296897 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
ack 374460, win 8192, options [nop,nop,TS val 1245217244 ecr
1245306446], length 0
15:46:08.297054 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.],
seq 374460:398160, ack 1, win 229, options [nop,nop,TS val 1245306446
ecr 1245217242], length 23700
15:46:08.297099 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
ack 398160, win 8192, options [nop,nop,TS val 1245217244 ecr
1245306446], length 0
15:46:08.297258 IP 10.246.11.51.34131 > 10.246.11.52.46907: Flags [.],
seq 398160:420912, ack 1, win 229, options [nop,nop,TS val 1245306446
ecr 1245217242], length 22752
15:46:08.297301 IP 10.246.11.52.46907 > 10.246.11.51.34131: Flags [.],
ack 420912, win 8192, options [nop,nop,TS val 1245217244 ecr
1245306446], length 0





Re: [PATCH net-next 0/5] qed/qede: Tunnel hardware GRO support

2016-06-22 Thread Rick Jones

On 06/22/2016 11:22 AM, Yuval Mintz wrote:

But seriously, this isn't really anything new but rather a step forward in
the direction we've already taken - bnx2x/qede are already performing
the same for non-encapsulated TCP.


Since you mention bnx2x...   I would argue that the NIC firmware on 
those NICs driven by bnx2x is doing it badly.  Not so much from a 
functional standpoint I suppose, but from a performance one.  The 
NIC-firmware GRO done there has this rather unfortunate assumption about 
"all MSSes will be directly driven by my own physical MTU" and when it 
sees segments of a size other than would be suggested by the physical 
MTU, will coalesce only two segments together.  They then do not get 
further coalesced in the stack.


Suffice it to say this does not do well from a performance standpoint.

One can disable LRO via ethtool for these NICs, but what that does is 
disable old-school LRO, not GRO-in-the-NIC.  To get that disabled, one 
must also get the bnx2x module loaded with "disable-tpa=1" so the Linux 
stack GRO gets used instead.
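
Concretely, that ends up being something along these lines - interface 
name illustrative, and the module reload of course takes the NIC's ports 
down while it happens:

   ethtool -K eth2 lro off       # old-school LRO off
   modprobe -r bnx2x
   modprobe bnx2x disable_tpa=1  # firmware GRO off, stack GRO used instead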


Had the bnx2x-driven NICs' firmware not had that rather unfortunate 
assumption about MSSes I probably would never have noticed.


happy benchmarking,

rick jones


Re: [PATCH net-next 0/8] tou: Transports over UDP - part I

2016-06-16 Thread Rick Jones

On 06/16/2016 10:51 AM, Tom Herbert wrote:


Note that #1 is really about running a transport stack in userspace
applications in clients, not necessarily servers. For servers we
intend to modified the kernel stack in order to leverage existing
implementation for building scalable serves (hence these patches).


Only if there is a v2 for other reasons...  I assume that was meant to 
be "scalable servers."




Tested: Various cases of TOU with IPv4, IPv6 using TCP_STREAM and
TCP_RR. Also, tested IPIP for comparing TOU encapsulation to IP
tunneling.

 - IPv6 native
   1 TCP_STREAM
8394 tps


TPS for TCP_STREAM?  Is that Mbit/s?


   200 TCP_RR
1726825 tps
100/177/361 90/95/99% latencies


To enhance the already good comprehensiveness of the numbers, a 1 TCP_RR 
showing the effect on latency rather than aggregate PPS would be 
goodness, as would a comparison of the service demands of the different 
single-stream results.
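
Something along the lines of the following would show both - host a 
placeholder, output selectors from the omni selection:

   netperf -t TCP_RR -c -C -H <host> -- \
       -O throughput,mean_latency,p99_latency,local_sd,remote_sd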


CPU and NIC models would provide excellent context for the numbers.

happy benchmarking,

rick jones


Re: [PATCH] openvswitch: Add packet truncation support.

2016-06-09 Thread Rick Jones

On 06/08/2016 09:30 PM, pravin shelar wrote:

On Wed, Jun 8, 2016 at 6:18 PM, William Tu <u9012...@gmail.com> wrote:

+struct ovs_action_trunc {
+   uint32_t max_len; /* Max packet size in bytes. */

This could be uint16_t, as it is related to packet len.



Is there something limiting MTUs to 65535 bytes?

rick jones


Re: [PATCH -next 2/2] virtio_net: Read the advised MTU

2016-06-02 Thread Rick Jones

On 06/02/2016 10:06 AM, Aaron Conole wrote:

Rick Jones <rick.jon...@hpe.com> writes:

One of the things I've been doing has been setting-up a cluster
(OpenStack) with JumboFrames, and then setting MTUs on instance vNICs
by hand to measure different MTU sizes.  It would be a shame if such a
thing were not possible in the future.  Keeping a warning if shrinking
the MTU would be good, leave the error (perhaps) to if an attempt is
made to go beyond the advised value.


This was cut because it didn't make sense for such a warning to
be issued, but it seems like perhaps you may want such a feature?  I
agree with Michael, after thinking about it, that I don't know what sort
of use the warning would serve.  After all, if you're changing the MTU,
you must have wanted such a change to occur?


I don't need a warning, was simply willing to live with one when 
shrinking the MTU.  Didn't want an error.


happy benchmarking,

rick jones



Re: [RFC] net: remove busylock

2016-05-19 Thread Rick Jones

On 05/19/2016 11:03 AM, Alexander Duyck wrote:

On Thu, May 19, 2016 at 10:08 AM, Eric Dumazet <eric.duma...@gmail.com> wrote:

With HTB qdisc, here are the numbers for 200 concurrent TCP_RR, on a host with 
48 hyperthreads.

...


That would be a 8 % increase.


The main point of the busy lock is to deal with the bulk throughput
case, not the latency case which would be relatively well behaved.
The problem wasn't really related to lock bouncing slowing things
down.  It was the fairness between the threads that was killing us
because the dequeue needs to have priority.


Quibbledrift... While the origins of the netperf TCP_RR test center on 
measuring latency, I'm not sure I'd call 200 of them running 
concurrently a latency test.  Indeed it may be neither fish nor fowl, 
but it will certainly be exercising the basic packet send/receive path 
rather fully and is likely a reasonable proxy for aggregate small packet 
performance.


happy benchmarking,

rick jones


Re: [PATCH] tcp: ensure non-empty connection request queue

2016-05-04 Thread Rick Jones

On 05/04/2016 10:34 AM, Eric Dumazet wrote:

On Wed, 2016-05-04 at 10:24 -0700, Rick Jones wrote:


Dropping the connection attempt makes sense, but is entering/claiming
synflood really indicated in the case of a zero-length accept queue?


This is a one time message.

This is how people can learn about their user space bugs, or too small
backlog ;)

Being totally silent would be not so nice.



Assuming Peter's assertion about just drops when syncookies are not 
enabled is accurate, should there be some one-time message in that case too?


rick


Re: [PATCH] tcp: ensure non-empty connection request queue

2016-05-04 Thread Rick Jones

On 05/03/2016 05:25 PM, Eric Dumazet wrote:

On Tue, 2016-05-03 at 23:54 +0200, Peter Wu wrote:

When applications use listen() with a backlog of 0, the kernel would
set the maximum connection request queue to zero. This causes false
reports of SYN flooding (if tcp_syncookies is enabled) or packet drops
otherwise.




Well, I believe I already gave my opinion on this.

listen backlog is not a hint. This is a limit.

It is the limit of outstanding children in accept queue.

If backlog is 0, no child can be put in the accept queue.

It is therefore Working As Intented.


Dropping the connection attempt makes sense, but is entering/claiming 
synflood really indicated in the case of a zero-length accept queue?


rick


Re: drop all fragments inside tx queue if one gets dropped

2016-04-20 Thread Rick Jones
For the "everything old is new again" files, back in the 1990s, it was 
noticed that on the likes of a netperf UDP_STREAM test on HP-UX, with 
fragmentation taking place, it was possible to consume 100% of the link 
bandwidth and have 0% effective throughput because the transmit queue 
was kept full with IP datagram fragments which could not possibly be 
reassembled (*) because one or more of the fragments of a datagram were 
dropped because the transmit queue was full.


HP-UX implemented "packet trains" where all the fragments of a 
fragmented datagram were presented to the driver, which then either 
queued them all, or none of them.


I don't recall seeing similar poor behaviour in Linux; I would have 
assumed that the intra-stack flow-control "took care" of it.  Perhaps 
there is something specific to wpan which precludes that?


happy benchmarking,

rick jones



Re: Poorer networking performance in later kernels?

2016-04-18 Thread Rick Jones

On 04/18/2016 04:27 AM, Butler, Peter wrote:

Hi Rick

Thanks for the reply.

Here is some hardware information, as requested (the two systems are
identical, and are communicating with one another over a 10GB
full-duplex Ethernet backplane):

- processor type: Intel(R) Xeon(R) CPU C5528  @ 2.13GHz
- NIC: Intel 82599EB 10GB XAUI/BX4
- NIC driver: ixgbe version 4.2.1-k (part of 4.4.0 kernel)

As for the buffer sizes, those rather large ones work fine for us
with the 3.4.2 kernel.  However, for the sake of being complete, I
have re-tried the tests with the 'standard' 4.4.0 kernel parameters
for all /proc/sys/net/* values, and the results still were extremely
poor in comparison to the 3.4.2 kernel.

Our MTU is actually just the standard 1500 bytes, however the message
size was chosen to mimic actual traffic which will be segmented.

I ran ethtool -k (indeed I checked all ethtool parameters, not just
those via -k) and the only real difference I could find was in
"large-receive-offload" which was ON in 3.4.2 but OFF in 4.4.0 - so I
used ethtool to change this to match the 3.4.2 settings and re-ran
the tests.  Didn't help :-(   It's possible of course that I have
missed a parameter here or there in comparing the 3.4.2 setup to the
4.4.0 setup.  I also tried running the ethtool config with the latest
and greatest ethtool version (4.5) on the 4.4.0 kernel, as compared
to the old 3.1 version on our 3.4.2 kernel.


So it would seem the stateless offloads are still enabled.  My next 
question would be to wonder if they are still "effective."  To that end, 
you could run a netperf test specifying a particular port number in the 
test-specific portion:


netperf ...   -- -P ,12345

and while that is running something like

tcpdump -s 96 -c 20 -w /tmp/foo.pcap -i  port 12345

then post-processed with the likes of:

tcpdump -n -r /tmp/foo.pcap | grep -v "length 0" | awk '{sum += 
$NF}END{print "average",sum/NR}'


the intent behind that is to see what the average post-GRO segment size 
happens to be on the receiver and then to compare it between the two 
kernels.  Grepping-away the "length 0" is to avoid counting ACKs and 
look only at data segments.  The specific port number is to avoid 
including any other connections which might happen to have traffic 
passing through at the time.


You could I suspect do the same comparison on the sending side.

There might I suppose be an easier way to get the average segment size - 
perhaps something from looking at ethtool stats - but the stone knives 
and bear skins of tcpdump above would have the added benefit of having a 
packet trace or three for someone to look at if they felt the need.  And 
for that, I would actually suggest starting the capture *before* the 
netperf test so the connection establishment is included.
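
(If one did want to try the ethtool stats route mentioned above, it would 
be something along the lines of

   ethtool -S <interface> | grep -i -e gro -e lro -e tpa

though which statistics are on offer, and their names, vary from driver 
to driver.)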



I performed the TCP_RR test as requested and in that case, the
results are much more comparable.  The old kernel is still better,
but now only around 10% better as opposed to 2-3x better.


Did the service demand change by 10% or just the transaction rate?


However I still contend that the *_STREAM tests are giving us more
pertinent data, since our product application is only getting 1/3 to
1/2 half of the performance on the 4.4.0 kernel, and this is the same
thing I see when I use netperf to test.

One other note: I tried running our 3.4.2 and 4.4.0 kernels in a VM
environment on my workstation, so as to take the 'real' production
hardware out of the equation.  When I perform the tests in this setup
the 3.4.2 and 4.4.0 kernels perform identically - just as you would
expect.


Running in a VM will likely change things massively and could I suppose 
mask other behaviour changes.


happy benchmarking,

rick jones
raj@tardy:~$ cat signatures/toppost
A: Because it fouls the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

:)



Any other ideas?  What can I be missing here?

Peter




-Original Message-
From: Rick Jones [mailto:rick.jon...@hpe.com]
Sent: April-15-16 6:37 PM
To: Butler, Peter <pbut...@sonusnet.com>; netdev@vger.kernel.org
Subject: Re: Poorer networking performance in later kernels?

On 04/15/2016 02:02 PM, Butler, Peter wrote:

(Please keep me CC'd to all comments/responses)

I've tried a kernel upgrade from 3.4.2 to 4.4.0 and see a marked drop
in networking performance.  Nothing was changed on the test systems,
other than the kernel itself (and kernel modules).  The identical
.config used to build the 3.4.2 kernel was brought over into the
4.4.0 kernel source tree, and any configuration differences (e.g. new
parameters, etc.) were taken as default values.

The testing was performed on the same actual hardware for both kernel
versions (i.e. take the existing 3.4.2 physical setup, simply boot
into the (new) kernel and run the same test).  The netperf utility was
used for b

Re: Poorer networking performance in later kernels?

2016-04-15 Thread Rick Jones

On 04/15/2016 02:02 PM, Butler, Peter wrote:

(Please keep me CC'd to all comments/responses)

I've tried a kernel upgrade from 3.4.2 to 4.4.0 and see a marked drop
in networking performance.  Nothing was changed on the test systems,
other than the kernel itself (and kernel modules).  The identical
.config used to build the 3.4.2 kernel was brought over into the
4.4.0 kernel source tree, and any configuration differences (e.g. new
parameters, etc.) were taken as default values.

The testing was performed on the same actual hardware for both kernel
versions (i.e. take the existing 3.4.2 physical setup, simply boot
into the (new) kernel and run the same test).  The netperf utility
was used for benchmarking and the testing was always performed on
idle systems.

TCP testing yielded the following results, where the 4.4.0 kernel
only got about 1/2 of the throughput:




        Recv     Send     Send                         Utilization       Service Demand
        Socket   Socket   Message  Elapsed             Send     Recv     Send     Recv
        Size     Size     Size     Time    Throughput  local    remote   local    remote
        bytes    bytes    bytes    secs.   10^6bits/s  % S      % S      us/KB    us/KB

3.4.2   13631488 13631488 89523    30.01   9370.29     10.14    6.50     0.709    0.454
4.4.0   13631488 13631488 89523    30.02   5314.03     9.14     14.31    1.127    1.765

SCTP testing yielded the following results, where the 4.4.0 kernel only got
about 1/3 of the throughput:

        Recv     Send     Send                         Utilization       Service Demand
        Socket   Socket   Message  Elapsed             Send     Recv     Send     Recv
        Size     Size     Size     Time    Throughput  local    remote   local    remote
        bytes    bytes    bytes    secs.   10^6bits/s  % S      % S      us/KB    us/KB

3.4.2   13631488 13631488 89523    30.00   2306.22     13.87    13.19    3.941    3.747
4.4.0   13631488 13631488 89523    30.01   882.74      16.86    19.14    12.516   14.210

The same tests were performed a multitude of time, and are always
consistent (within a few percent).  I've also tried playing with
various run-time kernel parameters (/proc/sys/kernel/net/...) on the
4.4.0 kernel to alleviate the issue but have had no success at all.

I'm at a loss as to what could possibly account for such a discrepancy...



I suspect I am not alone in being curious about the CPU(s) present in 
the systems and the model/whatnot of the NIC being used.  I'm also 
curious as to why you have what at first glance seem like absurdly large 
socket buffer sizes.


That said, it looks like you have some Really Big (tm) increases in 
service demand.  Many more CPU cycles being consumed per KB of data 
transferred.


Your message size makes me wonder if you were using a 9000 byte MTU.

Perhaps in the move from 3.4.2 to 4.4.0 you lost some or all of the 
stateless offloads for your NIC(s)?  Running ethtool -k  on 
both ends under both kernels might be good.


Also, if you did have a 9000 byte MTU under 3.4.2 are you certain you 
still had it under 4.4.0?


It would (at least to me) also be interesting to run a TCP_RR test 
comparing the two kernels.  TCP_RR (at least with the default 
request/response size of one byte) doesn't really care about stateless 
offloads or MTUs and could show how much difference there is in basic 
path length (or I suppose in interrupt coalescing behaviour if the NIC 
in question has a mildly dodgy heuristic for such things).
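
For example, with -c -C added so the service demands come along for the 
ride (remote host a placeholder):

   netperf -H <remote> -t TCP_RR -c -C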


happy benchmarking,

rick jones



Re: [net PATCH 2/2] ipv4/GRO: Make GRO conform to RFC 6864

2016-04-02 Thread Rick Jones

On 04/01/2016 07:21 PM, Eric Dumazet wrote:

On Fri, 2016-04-01 at 22:16 -0400, David Miller wrote:

From: Alexander Duyck <alexander.du...@gmail.com>
Date: Fri, 1 Apr 2016 12:58:41 -0700


RFC 6864 is pretty explicit about this, IPv4 ID used only for
fragmentation.  https://tools.ietf.org/html/rfc6864#section-4.1

The goal with this change is to try and keep most of the existing
behavior in tact without violating this rule?  I would think the
sequence number should give you the ability to infer a drop in the
case of TCP.  In the case of UDP tunnels we are now getting a bit more
data since we were ignoring the outer IP header ID before.


When retransmits happen, the sequence numbers are the same.  But you
can then use the IP ID to see exactly what happened.  You can even
tell if multiple retransmits got reordered.

Eric's use case is extremely useful, and flat out eliminates ambiguity
when analyzing TCP traces.


Yes, our team (including Van Jacobson ;) ) would be sad to not have
sequential IP ID (but then we don't have them for IPv6 ;) )


Your team would not be the only one sad to see that go away.

rick jones


Since the cost of generating them is pretty small (inet->inet_id
counter), we probably should keep them in linux. Their usage will phase
out as IPv6 wins the Internet war...






Re: [RFC net-next 2/2] udp: No longer use SLAB_DESTROY_BY_RCU

2016-03-28 Thread Rick Jones

On 03/28/2016 01:01 PM, Eric Dumazet wrote:

Note : file structures got RCU freeing back in 2.6.14, and I do not
think named users ever complained about added cost ;)


Couldn't see the tree for the forest I guess :)

rick



Re: [RFC net-next 2/2] udp: No longer use SLAB_DESTROY_BY_RCU

2016-03-28 Thread Rick Jones

On 03/28/2016 11:55 AM, Eric Dumazet wrote:

On Mon, 2016-03-28 at 11:44 -0700, Rick Jones wrote:

On 03/28/2016 10:00 AM, Eric Dumazet wrote:

If you mean that a busy DNS resolver spends _most_ of its time doing :

fd = socket()
bind(fd  port=0)
< send and receive one frame >
close(fd)


Yes.  Although it has been a long time, I thought that say the likes of
a caching named in the middle between hosts and the rest of the DNS
would behave that way as it was looking up names on behalf of those who 
asked it.


I really doubt a modern program would dynamically allocate one UDP port
for every in-flight request, as it would limit them to the number of
ephemeral ports' worth of concurrent requests (~30,000 assuming the
process can get them all on the host)


I was under the impression that individual DNS queries were supposed to 
have not only random DNS query IDs but also originate from random UDP 
source ports.  https://tools.ietf.org/html/rfc5452 4.5 at least touches 
on the topic but I don't see it making it hard and fast.  By section 10 
though it is more explicit:


   This document recommends the use of UDP source port number
   randomization to extend the effective DNS transaction ID beyond the
   available 16 bits.

That being the case, if indeed there were to be 30,000-odd concurrent 
requests outstanding "upstream" from that location, there'd have to be 
30,000-odd ephemeral ports in play.


rick



Managing a pool would be more efficient (The 1.3 usec penalty becomes
more like 4 usec in multi threaded programs)

Sure, you always can find badly written programs, but they already hit
scalability issues anyway.

UDP refcounting cost about 2 cache line misses per packet in stress
situations, this really has to go, so that well written programs can get
full speed.






Re: [RFC net-next 2/2] udp: No longer use SLAB_DESTROY_BY_RCU

2016-03-28 Thread Rick Jones

On 03/28/2016 10:00 AM, Eric Dumazet wrote:

On Mon, 2016-03-28 at 09:15 -0700, Rick Jones wrote:

On 03/25/2016 03:29 PM, Eric Dumazet wrote:

UDP sockets are not short lived in the high usage case, so the added
cost of call_rcu() should not be a concern.


Even a busy DNS resolver?


If you mean that a busy DNS resolver spends _most_ of its time doing :

fd = socket()
bind(fd  port=0)
   < send and receive one frame >
close(fd)


Yes.  Although it has been a long time, I thought that say the likes of 
a caching named in the middle between hosts and the rest of the DNS 
would behave that way as it was looking up names on behalf of those who 
asked it.


rick



(If this is the case, may I suggest doing something different, and use
some kind of caches ? It will be way faster.)

Then the result for 10,000,000 loops of <socket()+bind()+close()> are

Before patch :

real0m13.665s
user0m0.548s
sys 0m12.372s

After patch :

real0m20.599s
user0m0.465s
sys 0m17.965s

So the worst overhead is 700 ns

This is roughly the cost for bringing 960 bytes from memory, or 15 cache
lines (on x86_64)

# grep UDP /proc/slabinfo
UDPLITEv6      0      0   1088    7    2 : tunables   24   12    8 : slabdata      0      0      0
UDPv6         24     49   1088    7    2 : tunables   24   12    8 : slabdata      7      7      0
UDP-Lite       0      0    960    4    1 : tunables   54   27    8 : slabdata      0      0      0
UDP           30     36    960    4    1 : tunables   54   27    8 : slabdata      9      9      2

In reality, chances that UDP sockets are re-opened right after being
freed and their 15 cache lines are very hot in cpu caches is quite
small, so I would not worry at all about this rather stupid benchmark.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(int argc, char *argv[]) {
        struct sockaddr_in addr;
        int i, fd, loops = 10000000;

        for (i = 0; i < loops; i++) {
                fd = socket(AF_INET, SOCK_DGRAM, 0);
                if (fd == -1) {
                        perror("socket");
                        break;
                }
                /* bind to INADDR_ANY, port 0 - kernel picks an ephemeral port */
                memset(&addr, 0, sizeof(addr));
                addr.sin_family = AF_INET;
                if (bind(fd, (const struct sockaddr *)&addr, sizeof(addr)) == -1) {
                        perror("bind");
                        break;
                }
                close(fd);
        }
        return 0;
}





Re: [RFC net-next 2/2] udp: No longer use SLAB_DESTROY_BY_RCU

2016-03-28 Thread Rick Jones

On 03/25/2016 03:29 PM, Eric Dumazet wrote:

UDP sockets are not short lived in the high usage case, so the added
cost of call_rcu() should not be a concern.


Even a busy DNS resolver?

rick jones


Re: [Codel] [RFCv2 0/3] mac80211: implement fq codel

2016-03-19 Thread Rick Jones

On 03/17/2016 10:00 AM, Dave Taht wrote:

netperf's udp_rr is not how much traffic conventionally behaves. It
doesn't do tcp slow start or congestion control in particular...


Nor would one expect it to need to, unless one were using "burst mode" 
to have more than one transaction inflight at one time.


And unless one uses the test-specific -e option to provide a very crude 
retransmission mechanism based on a socket read timeout, neither does 
UDP_RR recover from lost datagrams.


happy benchmarking,

rick jones
http://www.netperf.org/


Re: [RFC v2 -next 0/2] virtio-net: Advised MTU feature

2016-03-15 Thread Rick Jones

On 03/15/2016 02:04 PM, Aaron Conole wrote:

The following series adds the ability for a hypervisor to set an MTU on the
guest during feature negotiation phase. This is useful for VM orchestration
when, for instance, tunneling is involved and the MTU of the various systems
should be homogenous.

The first patch adds the feature bit as described in the proposed VFIO spec
addition found at
https://lists.oasis-open.org/archives/virtio-dev/201603/msg1.html

The second patch adds a user of the bit, and a warning when the guest changes
the MTU from the hypervisor advised MTU. Future patches may add more thorough
error handling.


How do you see this interacting with VMs getting MTU settings via DHCP?

rick jones



v2:
* Whitespace and code style cleanups from Sergei Shtylyov and Paolo Abeni
* Additional test before printing a warning

Aaron Conole (2):
   virtio: Start feature MTU support
   virtio_net: Read the advised MTU

  drivers/net/virtio_net.c| 12 
  include/uapi/linux/virtio_net.h |  3 +++
  2 files changed, 15 insertions(+)





Re: [PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)

2016-03-14 Thread Rick Jones

On 03/14/2016 02:15 PM, Eric Dumazet wrote:

On Thu, 2016-03-03 at 19:06 +0100, Bendik Rønning Opstad wrote:

Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
the latency for applications sending time-dependent data.

Latency-sensitive applications or services, such as online games,
remote control systems, and VoIP, produce traffic with thin-stream
characteristics, characterized by small packets and relatively high
inter-transmission times (ITT). When experiencing packet loss, such
latency-sensitive applications are heavily penalized by the need to
retransmit lost packets, which increases the latency by a minimum of
one RTT for the lost packet. Packets coming after a lost packet are
held back due to head-of-line blocking, causing increased delays for
all data segments until the lost packet has been retransmitted.


Acked-by: Eric Dumazet <eduma...@google.com>

Note that RDB probably should get some SNMP counters,
so that we get an idea of how many times a loss could be repaired.


And some idea of the duplication seen by receivers, assuming there isn't 
already a counter for such a thing in Linux.


happy benchmarking,

rick jones



Ideally, if the path happens to be lossless, all these pro active
bundles are overhead. Might be useful to make RDB conditional to
tp->total_retrans or something.






Re: [net-next PATCH 0/2] GENEVE/VXLAN: Enable outer Tx checksum by default

2016-02-23 Thread Rick Jones

On 02/23/2016 08:47 AM, Tom Herbert wrote:

Right, GRO should probably not coalesce packets with non-zero IP
identifiers due to the loss of information. Besides that, RFC6848 says
the IP identifier should only be set for fragmentation anyway so there
shouldn't be any issue and really no need for HW TSO (or LRO) to
support that.


You sure that is RFC 6848 "Specifying Civic Address Extensions in the 
Presence Information Data Format Location Object (PIDF-LO)" ?


In whichever RFC that may be, is it a SHOULD or a MUST, and just how 
many "other" stacks might be setting a non-zero IP ID on fragments with 
DF set?


rick jones


We need to do increment IP identifier in UFO, but I only see one
device (neterion) that advertises NETIF_F_UFO-- honestly, removing
that feature might be another good simplification!

Tom


--
-Ed




Re: Variable download speed

2016-02-23 Thread Rick Jones

On 02/23/2016 03:24 AM, s...@onet.eu wrote:

Hi,

I've got a problem with network on one of my embedded boards.
I'm testing download speed of 256MB file from my PC to embedded board
through 1Gbit ethernet link using ftp.

The problem is that sometimes I achieve 25MB/s and sometimes it is only
14MB/s. There are also situations where the transfer speed starts at
14MB/s and after a few seconds achieves 25MB/s.
I've caught the second case with tcpdump and I noticed that when the speed
is 14MB/s - the tcp window size is 534368 bytes and when the speed
achieved 25MB/s the tcp window size is 933888.

My question is: what causes such dynamic change in the window size (while
transferring data)?  Is it some kernel parameter wrong set or something
like this?
Do I have any influence on such dynamic change in tcp window size?



If an application using TCP does not make an explicit setsockopt() call 
to set the SO_SNDBUF and/or SO_RCVBUF size, then the socket buffer and 
TCP window size will "autotune" based on what the stack believes to be 
the correct thing to do.  It will be bounded by the values in the 
tcp_rmem and tcp_wmem sysctl settings:



net.ipv4.tcp_rmem = 4096    87380   6291456
net.ipv4.tcp_wmem = 4096    16384   4194304

Those are min, initial, max, units of octets (bytes).

If on the other hand an application makes an explicit setsockopt() call, 
 that will be the size of the socket buffer, though it will be 
"clipped" by the values of:


net.core.rmem_max = 4194304
net.core.wmem_max = 4194304

Those sysctls will default to different values based on how much memory 
is in the system.  And I think in the case of those last two, I have 
tweaked them myself away from their default values.
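
Raising those ceilings at runtime is just a pair of sysctls - values 
here purely illustrative:

   sysctl -w net.core.rmem_max=8388608
   sysctl -w net.ipv4.tcp_rmem="4096 87380 8388608"

with the equivalent wmem settings for the send side.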


You might also look at the CPU utilization of all the CPUs of your 
embedded board, as well as the link-level statistics for your interface, 
and the netstat statistics.  You would be looking for saturation, and 
"excessive" drop rates.  I would also suggest testing network 
performance with something other than FTP.  While one can try to craft 
things so there is no storage I/O of note, it would still be better to 
use a network-specific tool such as netperf or iperf.  Minimize the 
number of variables.
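
For instance, something as simple as the following - board address a 
placeholder - would give throughput plus CPU utilization in one go:

   netperf -H <board-ip> -t TCP_STREAM -c -C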


happy benchmarking,

rick jones


Re: [PATCH][net-next] bridge: increase mtu to 9000

2016-02-22 Thread Rick Jones

On 02/22/2016 01:29 AM, roy.qing...@gmail.com wrote:

From: Li RongQing <roy.qing...@gmail.com>

A linux bridge always adopts the smallest MTU of the enslaved devices.
When no device are enslaved, it defaults to a MTU of 1500 and refuses to
use a larger one. This is problematic when using bridges enslaving only
virtual NICs (vnetX) like it's common with KVM guests.

Steps to reproduce the problem

1) sudo ip link add br-test0 type bridge # create an empty bridge
2) sudo ip link set br-test0 mtu 9000 # attempt to set MTU > 1500
3) ip link show dev br-test0 # confirm MTU

Here, 2) returns "RTNETLINK answers: Invalid argument". One (cumbersome)
way around this is:

4) sudo modprobe dummy
5) sudo ip link set dummy0 mtu 9000 master br-test0

Then the bridge's MTU can be changed from anywhere to 9000.

This is especially annoying for the virtualization case because the
KVM's tap driver will by default adopt the bridge's MTU on startup
making it impossible (without the workaround) to use a large MTU on the
guest VMs.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1399064

Signed-off-by: Li RongQing <roy.qing...@gmail.com>
---
  net/bridge/br_if.c  | 4 ++--
  net/bridge/br_private.h | 2 ++
  2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index c367b3e..38ced44 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -390,7 +390,7 @@ int br_del_bridge(struct net *net, const char *name)
return ret;
  }

-/* MTU of the bridge pseudo-device: ETH_DATA_LEN or the minimum of the ports */
+/* MTU of the bridge pseudo-device: BR_JUMBO_MTU or the minimum of the ports */
  int br_min_mtu(const struct net_bridge *br)
  {
const struct net_bridge_port *p;
@@ -399,7 +399,7 @@ int br_min_mtu(const struct net_bridge *br)
ASSERT_RTNL();

	if (list_empty(&br->port_list))
-   mtu = ETH_DATA_LEN;
+   mtu = BR_JUMBO_MTU;
else {
		list_for_each_entry(p, &br->port_list, list) {
if (!mtu  || p->dev->mtu < mtu)
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 302ab0a..d3c29f6 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -32,6 +32,8 @@

  #define BR_VERSION"2.3"

+#define BR_JUMBO_MTU   9000
+
  /* Control of forwarding link local multicast */
  #define BR_GROUPFWD_DEFAULT   0
  /* Don't allow forwarding of control protocols like STP, MAC PAUSE and LACP */



If you are going to 9000. why not just go ahead and use the maximum size 
of an IP datagram?


rick jones


Re: [PATCH net V1 1/6] net/mlx4_en: Count HW buffer overrun only once

2016-02-17 Thread Rick Jones

On 02/17/2016 07:24 AM, Or Gerlitz wrote:

From: Amir Vadai <a...@vadai.me>

RdropOvflw counts overrun of HW buffer, therefore should
be used for rx_fifo_errors only.

Currently RdropOvflw counter is mistakenly also set into
rx_missed_errors and rx_over_errors too, which makes the
device total dropped packets accounting to show wrong results.

Fix that. Use it for rx_fifo_errors only.

Fixes: c27a02cd94d6 ('mlx4_en: Add driver for Mellanox ConnectX 10GbE NIC')
Signed-off-by: Amir Vadai <a...@vadai.me>
Signed-off-by: Eugenia Emantayev <euge...@mellanox.com>
Signed-off-by: Or Gerlitz <ogerl...@mellanox.com>


Reviewed-By: Rick Jones <rick.jon...@hpe.com>

rick



Re: [PATCH net 1/6] net/mlx4_en: Do not count dropped packets twice

2016-02-16 Thread Rick Jones

On 02/16/2016 07:01 AM, Or Gerlitz wrote:

From: Amir Vadai <a...@vadai.me>

RdropOvflw counter was mistakenly copied into rx_missed_errors. Because
of that it was counted twice for the device dropped packets accounting.

Fixes: c27a02cd94d6 ('mlx4_en: Add driver for Mellanox ConnectX 10GbE NIC')
Signed-off-by: Amir Vadai <a...@vadai.me>
Signed-off-by: Eugenia Emantayev <euge...@mellanox.com>
Signed-off-by: Or Gerlitz <ogerl...@mellanox.com>
---
  drivers/net/ethernet/mellanox/mlx4/en_port.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_port.c 
b/drivers/net/ethernet/mellanox/mlx4/en_port.c
index ee99e67..7b511a5 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_port.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_port.c
@@ -242,7 +242,7 @@ int mlx4_en_DUMP_ETH_STATS(struct mlx4_en_dev *mdev, u8 
port, u8 reset)
stats->rx_crc_errors = be32_to_cpu(mlx4_en_stats->RCRC);
stats->rx_frame_errors = 0;
stats->rx_fifo_errors = be32_to_cpu(mlx4_en_stats->RdropOvflw);
-   stats->rx_missed_errors = be32_to_cpu(mlx4_en_stats->RdropOvflw);
+   stats->rx_missed_errors = 0;
stats->tx_aborted_errors = 0;
stats->tx_carrier_errors = 0;
stats->tx_fifo_errors = 0;



I'm still not clear on when an Acked-by is appropriate, but given that
this has been a non-trivial frustration for a long time, a hearty
endorsement from me.  Perhaps it is not important enough, but it would
be nice to have it flow back a release or two.


That said, should mlx4_en_stats->RdropOvflw still be going into both 
rx_fifo_errors and rx_over_errors?


stats->rx_over_errors = be32_to_cpu(mlx4_en_stats->RdropOvflw);
stats->rx_crc_errors = be32_to_cpu(mlx4_en_stats->RCRC);
stats->rx_frame_errors = 0;
stats->rx_fifo_errors = be32_to_cpu(mlx4_en_stats->RdropOvflw);

happy benchmarking,

rick jones


Re: Disabling XPS for 4.4.0-1+ixgbe+OpenStack VM over a VLAN means 65% increase in netperf TCP_STREAM

2016-02-08 Thread Rick Jones

On 02/04/2016 11:38 AM, Tom Herbert wrote:

I'd start with verifying the XPS configuration is sane and then trying
to reproduce the issue outside of using VMs, if both of those are okay
then maybe look at some sort of bad interaction with OpenStack
configuration.


So, looking at bare-iron, I can see something similar, but not to the
same degree (well, depending on one's metric of interest, I guess):



XPS being enabled for ixgbe here looks to be increasing receive side 
service demand by 30% but there is enough CPU available in this setup 
that it is only a loss of 2.5% or so on throughput.


stack@fcperf-cp1-comp0001-mgmt:~$ grep 87380 xps_on_* | awk 
'{t+=$6;r+=$9;s+=$10}END{print "throughput",t/NR,"recv sd",r/NR,"send 
sd",s/NR}'

throughput 9072.52 recv sd 0.8623 send sd 0.3686
stack@fcperf-cp1-comp0001-mgmt:~$ grep TCPOFO xps_on_* | awk '{sum += 
$NF}END{print "sum",sum/NR}'

sum 1621.1
stack@fcperf-cp1-comp0001-mgmt:~$ grep 87380 xps_off_* | awk 
'{t+=$6;r+=$9;s+=$10}END{print "throughput",t/NR,"recv sd",r/NR,"send 
sd",s/NR}'

throughput 9300.48 recv sd 0.6543 send sd 0.3606
stack@fcperf-cp1-comp0001-mgmt:~$ grep TCPOFO xps_off_* | awk '{sum += 
$NF}END{print "sum",sum/NR}'

sum 173.9

happy benchmarking,

rick jones

raw results at ftp://ftp.netperf.org/xps_4.4.0-1_ixgbe.tgz



Re: Disabling XPS for 4.4.0-1+ixgbe+OpenStack VM over a VLAN means 65% increase in netperf TCP_STREAM

2016-02-08 Thread Rick Jones


Shame on me for not including bare-iron TCP_RR:

stack@fcperf-cp1-comp0001-mgmt:~$ grep "1   1" xps_tcp_rr_on_* | awk 
'{t+=$6;r+=$9;s+=$10}END{print "throughput",t/NR,"recv sd",r/NR,"send 
sd",s/NR}'

throughput 18589.4 recv sd 21.6296 send sd 20.5931
stack@fcperf-cp1-comp0001-mgmt:~$ grep "1   1" xps_tcp_rr_off_* | 
awk '{t+=$6;r+=$9;s+=$10}END{print "throughput",t/NR,"recv 
sd",r/NR,"send sd",s/NR}'

throughput 20883.6 recv sd 19.6255 send sd 20.0178

So that is 12% on TCP_RR throughput.

Looks like XPS shouldn't be enabled by default for ixgbe.
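For anyone wanting to make the same comparison, disabling XPS at
runtime is just a matter of zeroing the per-tx-queue CPU masks - a
minimal sketch, assuming the interface is named eth0:

  for q in /sys/class/net/eth0/queues/tx-*/xps_cpus; do
      echo 0 > $q    # an all-zero mask disables XPS for that queue
  done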

happy benchmarking,

rick jones


Disabling XPS for 4.4.0-1+ixgbe+OpenStack VM over a VLAN means 65% increase in netperf TCP_STREAM

2016-02-04 Thread Rick Jones

Folks -

I was doing some performance work with OpenStack Liberty on systems with 
2x E5-2650L v3 @ 1.80GHz processors and 560FLR (Intel 82599ES) NICs onto 
which I'd placed a 4.4.0-1 kernel.  I was actually interested in the 
effect of removing the linux bridge from all the plumbing OpenStack 
creates (it is there for iptables-based implementation of security group 
rules because OS Liberty doesn't enable them on the OVS bridge(s) it 
creates), and I'd noticed that when I removed the linux bridge from the 
"stack" instance-to-instance (vm-to-vm) performance across a VLAN-based 
Neutron private network dropped.  Quite unexpected.


On a lark, I tried explicitly binding the NIC's IRQs and Boom! the 
single-stream performance shot-up to near link-rate.  I couldn't recall 
explicit binding of IRQs doing that much for single-stream netperf 
TCP_STREAM before.


I asked the Intel folks about that, they suggested I try disabling XPS. 
 So, with that I see the following on single-stream tests between the 
VMs on that VLAN-based private network as created by OpenStack Liberty:



   99% Confident within +/- 2.5% of "real" average
   TCP_RR in Trans/s, TCP_STREAM in Mbit/s

             XPS Enabled   XPS Disabled   Delta
TCP_STREAM   5353          8841 (*)       65.2%
TCP_RR       8562          9666           12.9%

The Intel folks suggested something about the process scheduler moving
the sender around and ultimately causing some packet re-ordering.  That
could, I suppose, explain the TCP_STREAM difference, but not the
TCP_RR, since that has just a single segment in flight at one time.


I can try to get perf/whatnot installed on the systems - suggestions as 
to what metrics to look at are welcome.


happy benchmarking,

rick jones
* If I disable XPS on the sending side only, it is more like 7700 Mbit/s

netstats from the receiver over a netperf TCP_STREAM test's duration 
with XPS enabled:


$ netperf -H 10.240.50.191 -- -o throughput,local_transport_retrans
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
10.240.50.191 () port 0 AF_INET : demo

Throughput,Local Transport Retransmissions
5292.74,4555


$ ./beforeafter before after
Ip:
327837 total packets received
0 with invalid addresses
0 forwarded
0 incoming packets discarded
327837 incoming packets delivered
293438 requests sent out
Icmp:
0 ICMP messages received
0 input ICMP message failed.
ICMP input histogram:
destination unreachable: 0
0 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 0
IcmpMsg:
InType3: 0
OutType3: 0
Tcp:
0 active connections openings
2 passive connection openings
0 failed connection attempts
0 connection resets received
0 connections established
327837 segments received
293438 segments send out
0 segments retransmited
0 bad segments received.
0 resets sent
Udp:
0 packets received
0 packets to unknown port received.
0 packet receive errors
0 packets sent
IgnoredMulti: 0
UdpLite:
TcpExt:
0 TCP sockets finished time wait in fast timer
0 delayed acks sent
Quick ack mode was activated 1016 times
50386 packets directly queued to recvmsg prequeue.
309545872 bytes directly in process context from backlog
2874395424 bytes directly received in process context from prequeue
86591 packet headers predicted
84934 packets header predicted and directly queued to user
6 acknowledgments not containing data payload received
20 predicted acknowledgments
1017 DSACKs sent for old packets
TCPRcvCoalesce: 157097
TCPOFOQueue: 78206
TCPOrigDataSent: 24
IpExt:
InBcastPkts: 0
InOctets: 6643231012
OutOctets: 17203936
InBcastOctets: 0
InNoECTPkts: 327837

And now with it disabled on both sides:
$ netperf -H 10.240.50.191 -- -o throughput,local_transport_retrans
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 
10.240.50.191 () port 0 AF_INET : demo

Throughput,Local Transport Retransmissions
8656.84,1903
$ ./beforeafter noxps_before noxps_avter
Ip:
251831 total packets received
0 with invalid addresses
0 forwarded
0 incoming packets discarded
251831 incoming packets delivered
218415 requests sent out
Icmp:
0 ICMP messages received
0 input ICMP message failed.
ICMP input histogram:
destination unreachable: 0
0 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 0
IcmpMsg:
InType3: 0
OutType3: 0
Tcp:
0 active connections openings
2 passive connection openings
0 failed connection attempts
0 connection resets received
0 connections established
251831 segments received
218415 segments send out
0 segments retransmited
0 bad segments received.
0 resets sent
Udp:
0 pa

Re: [PATCH net-next v5 1/2] ethtool: add speed/duplex validation functions

2016-02-04 Thread Rick Jones

On 02/04/2016 04:47 AM, Michael S. Tsirkin wrote:

On Wed, Feb 03, 2016 at 03:49:04PM -0800, Rick Jones wrote:

And even for not-quite-virtual devices - such as a VC/FlexNIC in an HPE
blade server there can be just about any speed set.  I think we went down a
path of patching some things to address that many years ago.  It would be a
shame to undo that.

rick


I'm not sure I understand. The question is in defining the UAPI.
We currently have:

  * @speed: Low bits of the speed
  * @speed_hi: Hi bits of the speed

with the assumption that all values come from the defines.

So if we allow any value here we need to define what it means.


I may be mixing apples and kiwis.  Many years ago when HP came-out with 
their blades and VirtualConnect, they included the ability to create 
"flex NICs" - "sub-NICs" out of a given interface port on a blade, and 
to assign each a specific bitrate in increments (IIRC) of 100 Mbit/s. 
This was reported up through the driver and it became necessary to make 
ethtool (again, IIRC) not so picky about "valid" speed values.


rick



Re: Disabling XPS for 4.4.0-1+ixgbe+OpenStack VM over a VLAN means 65% increase in netperf TCP_STREAM

2016-02-04 Thread Rick Jones

On 02/04/2016 11:38 AM, Tom Herbert wrote:

On Thu, Feb 4, 2016 at 11:13 AM, Rick Jones <rick.jon...@hpe.com> wrote:

The Intel folks suggested something about the process scheduler moving the
sender around and ultimately causing some packet re-ordering.  That could, I
suppose, explain the TCP_STREAM difference, but not the TCP_RR, since that
has just a single segment in flight at one time.


XPS has OOO avoidance for TCP, that should not be a problem.


What/how much should I read into:

With XPS      TCPOFOQueue: 78206
Without XPS   TCPOFOQueue: 967

out of the netstat statistics on the receiving VM?


I can try to get perf/whatnot installed on the systems - suggestions as to
what metrics to look at are welcome.


I'd start with verifying the XPS configuration is sane and then trying
to reproduce the issue outside of using VMs, if both of those are okay
then maybe look at some sort of bad interaction with OpenStack
configuration.


Fair enough - what is the definition of "sane" for an XPS configuration?

Here is what it looks like before I disabled it:

$ for i in `find 
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0 -name 
xps_cpus`; do echo $i `cat $i`; done
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-0/xps_cpus 
,0001
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-1/xps_cpus 
,0002
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-2/xps_cpus 
,0004
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-3/xps_cpus 
,0008
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-4/xps_cpus 
,0010
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-5/xps_cpus 
,0020
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-6/xps_cpus 
,0040
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-7/xps_cpus 
,0080
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-8/xps_cpus 
,0100
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-9/xps_cpus 
,0200
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-10/xps_cpus 
,0400
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-11/xps_cpus 
,0800
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-12/xps_cpus 
,1000
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-13/xps_cpus 
,2000
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-14/xps_cpus 
,4000
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-15/xps_cpus 
,8000
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-16/xps_cpus 
,0001
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-17/xps_cpus 
,0002
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-18/xps_cpus 
,0004
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-19/xps_cpus 
,0008
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-20/xps_cpus 
,0010
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-21/xps_cpus 
,0020
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-22/xps_cpus 
,0040
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-23/xps_cpus 
,0080
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-24/xps_cpus 
,0100
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-25/xps_cpus 
,0200
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-26/xps_cpus 
,0400
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-27/xps_cpus 
,0800
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-28/xps_cpus 
,1000
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-29/xps_cpus 
,2000
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-30/xps_cpus 
,4000
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-31/xps_cpus 
,8000
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-32/xps_cpus 
0001,
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-33/xps_cpus 
0002,
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-34/xps_cpus 
0004,
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-35/xps_cpus 
0008,
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-36/xps_cpus 
0010,
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-37/xps_cpus 
0020,
/sys/devices/pci:00/:00:02.2/:04:00.0/net/eth0/queues/tx-38/xps_cpus 
0040,
/sys/devices/pci:00/:00:02.2/:04:00.0/net/e

Re: Disabling XPS for 4.4.0-1+ixgbe+OpenStack VM over a VLAN means 65% increase in netperf TCP_STREAM

2016-02-04 Thread Rick Jones

On 02/04/2016 12:13 PM, Tom Herbert wrote:

On Thu, Feb 4, 2016 at 11:57 AM, Rick Jones <rick.jon...@hpe.com> wrote:

On 02/04/2016 11:38 AM, Tom Herbert wrote:

XPS has OOO avoidance for TCP, that should not be a problem.



What/how much should I read into:

With XPS      TCPOFOQueue: 78206
Without XPS   TCPOFOQueue: 967

out of the netstat statistics on the receiving VM?


Okay, that makes sense. The OOO avoidance only applies to TCP sockets
in the stack, that doesn't cross into VM. Presumably, packets coming
from the VM don't have a socket so sk_tx_queue_get always returns -1
and so netdev_pick_tx will steer packet to the queue based on
currently running CPU without any memory.


Any thoughts as to why explicitly binding the IRQs made things better, 
or for that matter why the scheduler would be moving the VM (or its 
vhost-net kernel thread I suppose?) around so much?


happy benchmarking,

rick jones



Re: [PATCH net-next v5 1/2] ethtool: add speed/duplex validation functions

2016-02-03 Thread Rick Jones

On 02/03/2016 03:32 PM, Stephen Hemminger wrote:


But why check for valid value at all. At some point in the
future, there will be yet another speed adopted by some standard body
and the switch statement would need another value.

Why not accept any value? This is a virtual device.



And even for not-quite-virtual devices - such as a VC/FlexNIC in an HPE 
blade server there can be just about any speed set.  I think we went 
down a path of patching some things to address that many years ago.  It 
would be a shame to undo that.


rick


Re: bonding (IEEE 802.3ad) not working with qemu/virtio

2016-02-01 Thread Rick Jones

On 01/29/2016 10:59 PM, David Miller wrote:

There should be a default speed/duplex setting for such devices as well.
We can pick one that will be use universally for these kinds of devices.


There is at least one monitoring tool - collectl - which gets a trifle 
upset when the actual speed through an interface is significantly 
greater than the reported link speed.  I have to wonder how unique it is 
in that regard.


Doesn't mean there can't be a default, but does suggest it should be 
rather high.


rick jones



Re: [BUG] net: performance regression on ixgbe (Intel 82599EB 10-Gigabit NIC)

2015-12-10 Thread Rick Jones

On 12/10/2015 06:18 AM, Otto Sabart wrote:

*) Is irqbalance disabled and the IRQs set the same each time, or might
there be variability possible there?  Each of the five netperf runs will be
a different four-tuple which means each may (or may not) get RSS hashed/etc
differently.


The irqbalance is disabled on all systems.

Can you suggest, if there is a need to assign irqs manually? Which irqs
we should pin to which CPU?


Likely as not it will depend on your goals.  When I want single-stream 
results, I will tend to disable irqbalance and set all the IRQs to one 
CPU in the system (often as not CPU0 but that is as much habit as 
anything else).  The idea is to clamp-down on any source of run-to-run 
variation.  I will also sometimes alter where I bind netperf/netserver 
to show the effects (especially on service demand) when 
netperf/netserver run on the same CPU as the IRQ, a thread in the same 
core as the IRQ, a core in the same processor as the IRQ and/or a core 
in another processor.  Unless all the IRQs are pointed at the same CPU 
(or I always specify the same, full four-tuple for addressing and wait 
for TIME_WAIT) that can be a challenge to keep straight.
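In shell terms, pointing all of a NIC's IRQs at CPU0 is something like
this (a sketch; eth2 is a placeholder interface name):

  for irq in $(grep eth2 /proc/interrupts | awk -F: '{print $1}'); do
      echo 1 > /proc/irq/$irq/smp_affinity    # mask 0x1 == CPU0
  done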


When I want to measure aggregate, I either let irqbalance do its thing 
and run a bunch of warm-up tests, or simply peanut-butter the IRQs 
across the CPUs with variations on the theme of:


grep eth[23] /proc/interrupts | awk -F ":" -v cpus=12 '{mask = 1 * 
2^(count++ % cpus);printf("echo %x > 
/proc/irq/%d/smp_affinity\n",mask,$1)}' | sh


How one might structure/alter that pipeline will depend on the CPU 
enumeration.  That one was from a 2x6 core system where I didn't want to 
hit the second thread of each core, and the enumeration was the first 
twelve CPUs were on thread 0 of each core of both processors.



*) It is perhaps adding duct tape to already-present belt and suspenders,
but is power-management set to a fixed state on the systems involved? (Since
this seems to be ProLiant G7s going by the legends on the charts, either
static high perf or static low power I would imagine)


Power management is set to OS-Control in bios, which effectively means,
that _bios_ does not do any power management at all.


Probably just as well :)


*) What is the difference before/after for the service demands?  The netperf
tests being run are asking for CPU utilization but I don't see the service
demand change being summarized.


Unfortunatelly we does not have any summary chart for service demands,
we will add some shortly.


*) Does a specific CPU on one side or the other saturate?
(LOCAL_CPU_PEAK_UTIL, LOCAL_CPU_PEAK_ID, REMOTE_CPU_PEAK_UTIL,
REMOTE_CPU_PEAK_ID output selectors)


We are sort of stuck in the stone age. We still use the old-fashioned
tcp/udp migrated tests, but we plan to switch to omni.


Well, you don't have to invoke with -t omni to make use of the output 
selectors - just add the -O (or -o or -k) test-specific option.
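For instance (a sketch):

  ./netperf -H <host> -t TCP_STREAM -c -C -- -O throughput,local_sd,remote_sd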





*) What are the processors involved?  Presumably the "other system" is
fixed?


In this case:

hp-dl380g7 - $ lscpu:
Architecture:  x86_64
CPU op-mode(s):32-bit, 64-bit
Byte Order:Little Endian
CPU(s):24
On-line CPU(s) list:   0-23
Thread(s) per core:2
Core(s) per socket:6
Socket(s): 2
NUMA node(s):  2
Vendor ID: GenuineIntel
CPU family:6
Model: 44
Model name:Intel(R) Xeon(R) CPU   X5650  @ 2.67GHz
Stepping:  2
CPU MHz:   2660.000
BogoMIPS:  5331.27
Virtualization:VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache:  256K
L3 cache:  12288K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23


hp-dl385g7 - $ lscpu:
Architecture:  x86_64
CPU op-mode(s):32-bit, 64-bit
Byte Order:Little Endian
CPU(s):24
On-line CPU(s) list:   0-23
Thread(s) per core:1
Core(s) per socket:12
Socket(s): 2
NUMA node(s):  4
Vendor ID: AuthenticAMD
CPU family:16
Model: 9
Model name:AMD Opteron(tm) Processor 6172
Stepping:  1
CPU MHz:   2100.000
BogoMIPS:  4200.39
Virtualization:AMD-V
L1d cache: 64K
L1i cache: 64K
L2 cache:  512K
L3 cache:  5118K
NUMA node0 CPU(s): 0,2,4,6,8,10
NUMA node1 CPU(s): 12,14,16,18,20,22
NUMA node2 CPU(s): 13,15,17,19,21,23
NUMA node3 CPU(s): 1,3,5,7,9,11


I guess that helps explain why there were such large differences in the 
deltas between TCP_STREAM and TCP_MAERTS since it wasn't the same 
per-core "horsepower" on either side and so why LRO on/off could have 
also affected the TCP_STREAM results. (When LRO was off it was off on 
both sides, and when on, it was on on both sides.)

Re: [BUG] net: performance regression on ixgbe (Intel 82599EB 10-Gigabit NIC)

2015-12-07 Thread Rick Jones

On 12/07/2015 03:28 AM, Otto Sabart wrote:

Hi Ota,

It looks like there were a few changes that went through that could be
causing the regression.  The most obvious one that jumps out at me is commit
72bfd32d2f84 ("ixgbe: disable LRO by default").  As such one thing you might
try doing is turning on LRO support via ethtool -k to see if that is the
issue you are seeing.



Hi Alex,
enabling LRO resolved the problem.


So you had the same NIC and CPUs and whatnot on both sides?

rick jones



Re: [BUG] net: performance regression on ixgbe (Intel 82599EB 10-Gigabit NIC)

2015-12-04 Thread Rick Jones

On 12/03/2015 08:26 AM, Otto Sabart wrote:

Hello netdev,
I probably found a performance regression on ixgbe (Intel 82599EB
10-Gigabit NIC) on v4.4-rc3. I am able to see this problem since
v4.4-rc1.

The bug report you can find here [0].

Can somebody take a look at it?

[0] https://bugzilla.redhat.com/show_bug.cgi?id=1288124


A few of comments/questions  based on reading that bug report:

*)  It is good to be binding netperf and netserver - helps with 
reproducibility, but why the two -T options?  A brief look at 
src/netsh.c suggests it will indeed set the two binding options 
separately but that is merely a side-effect of how I wrote the code.  It 
wasn't an intentional thing.


*) Is irqbalance disabled and the IRQs set the same each time, or might 
there be variability possible there?  Each of the five netperf runs will 
be a different four-tuple which means each may (or may not) get RSS 
hashed/etc differently.


*) It is perhaps adding duct tape to already-present belt and 
suspenders, but is power-management set to a fixed state on the systems 
involved? (Since this seems to be ProLiant G7s going by the legends on 
the charts, either static high perf or static low power I would imagine)


*) What is the difference before/after for the service demands?  The 
netperf tests being run are asking for CPU utilization but I don't see 
the service demand change being summarized.


*) Does a specific CPU on one side or the other saturate? 
(LOCAL_CPU_PEAK_UTIL, LOCAL_CPU_PEAK_ID, REMOTE_CPU_PEAK_UTIL, 
REMOTE_CPU_PEAK_ID output selectors)


*) What are the processors involved?  Presumably the "other system" is 
fixed?


*) It is important to remember that the socket buffer sizes reported
with the default output are *just* what they were when the data socket
was created.  If you want to see what they became by the end of the
test, you need to use the appropriate output selectors (or, IIRC,
invoking the tests as "omni" rather than tcp_stream/tcp_maerts will
report the end values rather than the start ones).
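For example (a sketch; IIRC the end-of-test selector names are of the
form lss_size_end and rsr_size_end for the local send and remote
receive socket buffers):

  ./netperf -H <host> -t TCP_STREAM -- -o throughput,lss_size_end,rsr_size_end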


happy benchmarking,

rick jones


Re: ipsec impact on performance

2015-12-02 Thread Rick Jones

On 12/02/2015 03:56 AM, David Laight wrote:

From: Sowmini Varadhan

Sent: 01 December 2015 18:37

...

I was using esp-null merely to not have the crypto itself perturb
the numbers (i.e., just focus on the s/w overhead for now), but here
are the numbers for the stock linux kernel stack
               Gbps   peak cpu util
esp-null       1.8    71%
aes-gcm-c-256  1.6    79%
aes-ccm-a-128  0.7    96%

That trend made me think that if we can get esp-null to be as close
as possible to GSO/GRO, the rest will follow closely behind.


That's not how I read those figures.
They imply to me that there is a massive cost for the actual encryption
(particularly for aes-ccm-a-128) - so whatever you do to the esp-null
case won't help.



To build on the whole "importance of normalizing throughput and CPU 
utilization in some way" theme, the following are some non-IPSec netperf 
TCP_STREAM runs between a pair of 2xIntel E5-2603 v3 systems using 
Broadcom BCM57810-based NICs, 4.2.0-19 kernel, 7.10.72 firmware and 
bnx2x driver version 1.710.51-0:



root@htx-scale300-258:~# ./take_numbers.sh
Baseline
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
10.12.49.1 () port 0 AF_INET : +/-2.500% @ 99% conf.  : demo : cpu bind
Throughput  Local   Local    Local    Remote  Remote   Remote   Throughput  Local       Remote
            CPU     Service  Peak     CPU     Service  Peak     Confidence  CPU         CPU
            Util    Demand   Per CPU  Util    Demand   Per CPU  Width (%)   Confidence  Confidence
            %                Util %   %                Util %               Width (%)   Width (%)
9414.11     1.87    0.195    26.54    3.70    0.387    45.42    0.002       7.073       1.276

Disable TSO/GSO
5651.25     8.36    1.454    100.00   2.46    0.428    30.35    1.093       1.101       4.889

Disable tx CKO
5287.69     8.46    1.573    100.00   2.34    0.435    29.66    0.428       7.710       3.518

Disable remote LRO/GRO
4148.76     8.32    1.971    99.97    5.95    1.409    71.98    3.656       0.735       3.491

Disable remote rx CKO
4204.49     8.31    1.942    100.00   6.68    1.563    82.05    2.015       0.437       4.921


You can see that as the offloads are disabled, the service demands (usec 
of CPU time consumed systemwide per KB of data transferred) go up, and 
until one hits a bottleneck (eg one of the CPUs pegs at 100%), go up 
faster than the throughputs go down.


To aid in reproducibility those tests were with irqbalance disabled, all 
the IRQs for the NICs pointed at CPU 0, netperf/netserver bound to CPU 
0, and the power management set to static high performance.


Assuming I've created a "matching" ipsec.conf, here is what I see with 
esp=null-null on the TCP_STREAM test - again, keeping all the binding in 
place etc:


3077.37     8.01    2.560    97.78    8.21    2.625    99.41    4.869       1.876       0.955


You can see that even with the null-null, there is a rather large 
increase in service demand.


And this is what I see when I run netperf TCP_RR (first is without
ipsec, second is with; I didn't ask for confidence intervals this time
around, and I didn't try to tweak interrupt coalescing settings):


# HDR="-P 1";for i in 10.12.49.1 192.168.0.2; do ./netperf -H $i -t 
TCP_RR -c -C -l 30 -T 0 $HDR; HDR="-P 0"; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET 
to 10.12.49.1 () port 0 AF_INET : demo : first burst 0 : cpu bind

Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.    CPU    CPU     S.dem   S.dem
Send   Recv   Size     Size    Time     Rate      local  remote  local   remote
bytes  bytes  bytes    bytes   secs.    per sec   % S    % S     us/Tr   us/Tr

16384  87380  1   1  30.00   30419.75  1.72   1.68   6.783   6.617
16384  87380
16384  87380  1   1  30.00   20711.39  2.15   2.05   12.450  11.882
16384  87380

The service demand increases ~83% on the netperf side and almost 80% on 
the netserver side.  That is pure "effective" path-length increase.


happy benchmarking,

rick jones

PS - the netperf commands were varations on this theme:
./netperf -P 0 -T 0 -H 10.12.49.1 -c -C -l 30 -i 30,3 -- -O 
throughput,local_cpu_util,local_sd,local_cpu_peak_util,remote_cpu_util,remote_sd,remote_cpu_peak_util,throughput_confid,local_cpu_confid,remote_cpu_confid
altering IP address or test as appropriate.  -P 0 disables printing the 
test banner/headers.  -T 0 binds netperf and netserver to CPU0 on their 
respective systems.  -H sets the destination, -c and -C ask for local 
and remote CPU measurements respectively.  -l 30 says each test 
iteration should be 30 seconds long and -i 30,3 says to run at least 
three iterations and no more than 30 when trying to hit the confidence 
interval - by default 99% confident the average reported is within +/- 
2.5% of the "actual" average.  The -O stuff is selecting specific values 
to be emitted.



Re: ipsec impact on performance

2015-12-01 Thread Rick Jones

On 12/01/2015 09:59 AM, Sowmini Varadhan wrote:

But these are all still relatively small things - tweaking them
doesnt get me significantly past the 3 Gbps limit. Any suggestions
on how to make this budge (or design criticism of the patch) would
be welcome.


What do the perf profiles show?  Presumably, loss of TSO/GSO means an 
increase in the per-packet costs, but if the ipsec path significantly 
increases the per-byte costs...


Short of a perf profile, I suppose one way to probe for per-packet 
versus per-byte would be to up the MTU.  That should reduce the 
per-packet costs while keeping the per-byte roughly the same.


You could also compare the likes of a single-byte netperf TCP_RR test 
between ipsec enabled and not to get an idea of the basic path length 
differences without TSO/GSO/whatnot muddying the waters.
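Something along those lines perhaps (a sketch; interface name and peer
address are placeholders):

  ip link set dev eth0 mtu 9000            # probe per-packet vs per-byte cost
  netperf -H <peer> -t TCP_RR -- -r 1,1    # single-byte round-trips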


happy benchmarking,

rick jones


Re: ipsec impact on performance

2015-12-01 Thread Rick Jones

On 12/01/2015 10:45 AM, Sowmini Varadhan wrote:

On (12/01/15 10:17), Rick Jones wrote:


What do the perf profiles show?  Presumably, loss of TSO/GSO means
an increase in the per-packet costs, but if the ipsec path
significantly increases the per-byte costs...


For ESP-null, there's actually very little work to do - we just
need to add the 8 byte ESP header with an spi and a seq#.. no
crypto work to do.. so the overhead *should* be minimal, else
we've painted ourself into a corner where we can't touch anything
including TCP options like md5.


Something of a longshot, but are you sure you are still getting 
effective CKO/GRO on the receiver?


rick jones


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Rick Jones

On 11/24/2015 07:49 AM, Eric Dumazet wrote:

But in the end, latencies were bigger, because the application had to
copy from kernel to user (read()) the full message in one go. While if
you wake up application for every incoming GRO message, we prefill cpu
caches, and the last read() only has to copy the remaining part and
benefit from hot caches (RFS up2date state, TCP socket structure, but
also data in the application)


You can see something similar (at least in terms of latency) when 
messing about with MTU sizes.  For some message sizes - 8KB being a 
popular one - you will see higher latency on the likes of netperf TCP_RR 
with JumboFrames than you would with the standard 1500 byte MTU. 
Something I saw on GbE links years back anyway.  I chalked it up to 
getting better parallelism between the NIC and the host.


Of course the service demands were lower with JumboFrames...

rick jones


Re: [PATCH net-next RFC 2/2] vhost_net: basic polling support

2015-10-22 Thread Rick Jones

On 10/22/2015 02:33 AM, Michael S. Tsirkin wrote:

On Thu, Oct 22, 2015 at 01:27:29AM -0400, Jason Wang wrote:

This patch tries to poll for a newly added tx buffer for a while at the
end of tx processing. The maximum time spent on polling is limited
through a module parameter. To avoid blocking rx, the loop will end if
there is other work queued on vhost, so in fact the socket receive
queue is also polled.

busyloop_timeout = 50 gives us following improvement on TCP_RR test:

size/session/+thu%/+normalize%
 1/ 1/   +5%/  -20%
 1/50/  +17%/   +3%


Is there a measurable increase in cpu utilization
with busyloop_timeout = 0?


And since a netperf TCP_RR test is involved, be careful about what 
netperf reports for CPU util if that increase isn't in the context of 
the guest OS.


For completeness, looking at the effect on TCP_STREAM and TCP_MAERTS, 
aggregate _RR and even aggregate _RR/packets per second for many VMs on 
the same system would be in order.
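For the aggregate _RR piece, one way (a sketch) is to run several
concurrent instances and sum their throughputs:

  for i in $(seq 1 8); do
      netperf -H <host> -t TCP_RR -P 0 -- -o throughput &
  done
  wait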


happy benchmarking,

rick jones



Re: list of all network namespaces

2015-09-16 Thread Rick Jones

On 09/16/2015 05:46 PM, Ani Sinha wrote:

Hi guys

just a stupid question. Is it possible to get a list of all active
network namespaces in the kernel through /proc or some other
interface?


Presumably you could copy what "ip netns" does, which appears to be to
look in /var/run/netns.  At least that is what an strace of that
command suggests.
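For instance (a sketch):

  ls /var/run/netns   # one entry per namespace named via 'ip netns add'

Note this only shows namespaces that were given a name there (by 'ip
netns add' or an equivalent bind mount); anonymous namespaces created
directly via clone()/unshare() won't appear.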


rick jones



vethpair creation performance, 3.14 versus 4.2.0

2015-08-31 Thread Rick Jones

On 08/29/2015 10:59 PM, Raghavendra K T wrote:
> Please note that similar overhead was also reported while creating
> veth pairs  https://lkml.org/lkml/2013/3/19/556


That got me curious, so I took the veth pair creation script from there, 
and started running it out to 10K pairs, comparing a 3.14.44 kernel with 
a 4.2.0-rc4+ from net-next and then net-next after pulling to get the 
snmp stat aggregation perf change (4.2.0-rc8+).


Indeed, the 4.2.0-rc8+ kernel with the change was faster than the 
4.2.0-rc4+ kernel without it, but both were slower than the 3.14.44 kernel.


I've put a spreadsheet with the results at:

ftp://ftp.netperf.org/vethpair/vethpair_compare.ods

A perf top for the 4.20-rc8+ kernel from the net-next tree looks like 
this out around 10K pairs:


   PerfTop:   11155 irqs/sec  kernel:94.2%  exact:  0.0% [4000Hz 
cycles],  (all, 32 CPUs)

---

23.44%  [kernel]   [k] vsscanf
 7.32%  [kernel]   [k] mutex_spin_on_owner.isra.4
 5.63%  [kernel]   [k] __memcpy
 5.27%  [kernel]   [k] __dev_alloc_name
 3.46%  [kernel]   [k] format_decode
 3.44%  [kernel]   [k] vsnprintf
 3.16%  [kernel]   [k] acpi_os_write_port
 2.71%  [kernel]   [k] number.isra.13
 1.50%  [kernel]   [k] strncmp
 1.21%  [kernel]   [k] _parse_integer
 0.93%  [kernel]   [k] filemap_map_pages
 0.82%  [kernel]   [k] put_dec_trunc8
 0.82%  [kernel]   [k] unmap_single_vma
 0.78%  [kernel]   [k] native_queued_spin_lock_slowpath
 0.71%  [kernel]   [k] menu_select
 0.65%  [kernel]   [k] clear_page
 0.64%  [kernel]   [k] _raw_spin_lock
 0.62%  [kernel]   [k] page_fault
 0.60%  [kernel]   [k] find_busiest_group
 0.53%  [kernel]   [k] snprintf
 0.52%  [kernel]   [k] int_sqrt
 0.46%  [kernel]   [k] simple_strtoull
 0.44%  [kernel]   [k] page_remove_rmap

My attempts to get a call-graph have been met with very limited success. 
 Even though I've installed the dbg package from "make deb-pkg" the 
symbol resolution doesn't seem to be working.


happy benchmarking,

rick jones


Re: vethpair creation performance, 3.14 versus 4.2.0

2015-08-31 Thread Rick Jones

On 08/31/2015 02:29 PM, David Ahern wrote:

On 8/31/15 1:48 PM, Rick Jones wrote:

My attempts to get a call-graph have been met with very limited success.
  Even though I've installed the dbg package from "make deb-pkg" the
symbol resolution doesn't seem to be working.


Looks like Debian does not enable framepointers by default:

$ grep FRAME /boot/config-3.2.0-4-amd64
...
# CONFIG_FRAME_POINTER is not set

Similar result for jessie.


And indeed, my config file has a Debian lineage.
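For what it's worth, when frame pointers are compiled out, perf can
often still unwind via DWARF (a sketch, assuming perf was built with
libunwind support):

  perf record --call-graph dwarf -- <command>

at the cost of noticeably larger perf.data files.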

rick



Re: Low throughput in VMs using VxLAN

2015-08-24 Thread Rick Jones

On 08/24/2015 09:19 AM, Santosh R wrote:

  Hi,

Earlier I was seeing lower throughput in VMs using VxLAN, as GRO was
not happening in the VM.
Tom Herbert suggested using the "vxlan: GRO support at tunnel layer"
patch series.
With today's net-next (4.2.0-rc7) in the host and VM, I could see GRO
happening for the vxlan, macvtap and virtual interfaces in the VM.
The throughput is still low between VMs (around 4 Gbps compared to
9 Gbps without VxLAN).


Out of curiosity, have you tried tweaking gro_flush_timeout for the
VM's eth interface?  Say, perhaps, a value of 1000?  (I'm assuming the
VM is using virtio_net.)  Does the behaviour change if vhost-net is
loaded into the host and used by the VM?
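For instance, a minimal sketch (assuming the guest's virtio_net device
is eth0; the value is in nanoseconds):

  echo 1000 > /sys/class/net/eth0/gro_flush_timeout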


rick jones

For completeness, it would also be good to compare the likes of netperf 
TCP_RR between VxLAN and without.



Re: [PATCH v2 net-next] documentation: bring vxlan documentation more up-to-date

2015-08-12 Thread Rick Jones

On 08/12/2015 04:46 PM, David Miller wrote:

From: r...@tardy.usa.hp.com (Rick Jones)
Date: Wed, 12 Aug 2015 10:23:14 -0700 (PDT)


From: Rick Jones rick.jon...@hp.com

A few things have changed since the previous version of the vxlan
documentation was written, so update it and correct some grammer and
such while we are at it.

Signed-off-by: Rick Jones rick.jon...@hp.com


Applied with grammar misspelling fixed.


Thanks.

rick


[PATCH v2 net-next] documentation: bring vxlan documentation more up-to-date

2015-08-12 Thread Rick Jones
From: Rick Jones rick.jon...@hp.com

A few things have changed since the previous version of the vxlan
documentation was written, so update it and correct some grammer and
such while we are at it.

Signed-off-by: Rick Jones rick.jon...@hp.com

---

v2: Stephen Hemminger feedback to include dstport 4789 in command line
example.  Also some further refinements from other sources.

diff --git a/Documentation/networking/vxlan.txt 
b/Documentation/networking/vxlan.txt
index 6d99351..89ee11b 100644
--- a/Documentation/networking/vxlan.txt
+++ b/Documentation/networking/vxlan.txt
@@ -1,32 +1,36 @@
 Virtual eXtensible Local Area Networking documentation
 ==
 
-The VXLAN protocol is a tunnelling protocol that is designed to
-solve the problem of limited number of available VLAN's (4096).
-With VXLAN identifier is expanded to 24 bits.
-
-It is a draft RFC standard, that is implemented by Cisco Nexus,
-Vmware and Brocade. The protocol runs over UDP using a single
-destination port (still not standardized by IANA).
-This document describes the Linux kernel tunnel device,
-there is also an implantation of VXLAN for Openvswitch.
-
-Unlike most tunnels, a VXLAN is a 1 to N network, not just point
-to point. A VXLAN device can either dynamically learn the IP address
-of the other end, in a manner similar to a learning bridge, or the
-forwarding entries can be configured statically.
-
-The management of vxlan is done in a similar fashion to it's
-too closest neighbors GRE and VLAN. Configuring VXLAN requires
-the version of iproute2 that matches the kernel release
-where VXLAN was first merged upstream.
+The VXLAN protocol is a tunnelling protocol designed to solve the
+problem of limited VLAN IDs (4096) in IEEE 802.1q.  With VXLAN the
+size of the identifier is expanded to 24 bits (16777216).
+
+VXLAN is described by IETF RFC 7348, and has been implemented by a
+number of vendors.  The protocol runs over UDP using a single
+destination port.  This document describes the Linux kernel tunnel
+device, there is also a separate implementation of VXLAN for
+Openvswitch.
+
+Unlike most tunnels, a VXLAN is a 1 to N network, not just point to
+point. A VXLAN device can learn the IP address of the other endpoint
+either dynamically in a manner similar to a learning bridge, or make
+use of statically-configured forwarding entries.
+
+The management of vxlan is done in a manner similar to its two closest
+neighbors GRE and VLAN. Configuring VXLAN requires the version of
+iproute2 that matches the kernel release where VXLAN was first merged
+upstream.
 
 1. Create vxlan device
-  # ip li add vxlan0 type vxlan id 42 group 239.1.1.1 dev eth1
-
-This creates a new device (vxlan0). The device uses the
-the multicast group 239.1.1.1 over eth1 to handle packets where
-no entry is in the forwarding table.
+ # ip link add vxlan0 type vxlan id 42 group 239.1.1.1 dev eth1 dstport 4789
+
+This creates a new device named vxlan0.  The device uses the multicast
+group 239.1.1.1 over eth1 to handle traffic for which there is no
+entry in the forwarding table.  The destination port number is set to
+the IANA-assigned value of 4789.  The Linux implementation of VXLAN
+pre-dates the IANA's selection of a standard destination port number
+and uses the Linux-selected value by default to maintain backwards
+compatibility.
 
 2. Delete vxlan device
   # ip link delete vxlan0


[PATCH net-next] documentation: bring vxlan documentation more up-to-date

2015-08-11 Thread Rick Jones
From: Rick Jones rick.jon...@hp.com

A few things have changed since the previous version of the vxlan
documentation was written, so update it and correct some grammer and
such while we are at it.

Signed-off-by: Rick Jones rick.jon...@hp.com

diff --git a/Documentation/networking/vxlan.txt 
b/Documentation/networking/vxlan.txt
index 6d99351..4126031 100644
--- a/Documentation/networking/vxlan.txt
+++ b/Documentation/networking/vxlan.txt
@@ -1,32 +1,38 @@
 Virtual eXtensible Local Area Networking documentation
 ==
 
-The VXLAN protocol is a tunnelling protocol that is designed to
-solve the problem of limited number of available VLAN's (4096).
-With VXLAN identifier is expanded to 24 bits.
+The VXLAN protocol is a tunnelling protocol that is designed to solve
+the problem of the limited number of available VLAN IDs (4096) in IEEE
+802.1q.  With VXLAN the size of the identifier is expanded to 24 bits
+(16777216).
 
-It is a draft RFC standard, that is implemented by Cisco Nexus,
-Vmware and Brocade. The protocol runs over UDP using a single
-destination port (still not standardized by IANA).
-This document describes the Linux kernel tunnel device,
-there is also an implantation of VXLAN for Openvswitch.
+VXLAN is described by IETF RFC 7348, and has been implemented by a
+number of vendors.  The protocol runs over UDP using a single
+destination port.  This document describes the Linux kernel tunnel
+device, there is also a separate implementation of VXLAN for
+Openvswitch.
 
 Unlike most tunnels, a VXLAN is a 1 to N network, not just point
 to point. A VXLAN device can either dynamically learn the IP address
 of the other end, in a manner similar to a learning bridge, or the
 forwarding entries can be configured statically.
 
-The management of vxlan is done in a similar fashion to it's
-too closest neighbors GRE and VLAN. Configuring VXLAN requires
-the version of iproute2 that matches the kernel release
-where VXLAN was first merged upstream.
+The management of vxlan is done in a similar fashion to its two
+closest neighbors GRE and VLAN. Configuring VXLAN requires the version
+of iproute2 that matches the kernel release where VXLAN was first
+merged upstream.
 
 1. Create vxlan device
-  # ip li add vxlan0 type vxlan id 42 group 239.1.1.1 dev eth1
-
-This creates a new device (vxlan0). The device uses the
-the multicast group 239.1.1.1 over eth1 to handle packets where
-no entry is in the forwarding table.
+  # ip link add vxlan0 type vxlan id 42 group 239.1.1.1 dev eth1
+
+This creates a new device named vxlan0.  The device uses the
+multicast group 239.1.1.1 over eth1 to handle traffic for which there
+is no entry is in the forwarding table.  The Linux implementation of
+VXLAN pre-dates the IANA's selection of a standard destination port
+number and uses the Linux-selected value by default to maintain
+backwards compatibility.  If you wish to use the IANA-assigned
+destination port number of 4789 you can add dstport 4789 to the
+command line above.
 
 2. Delete vxlan device
   # ip link delete vxlan0

