Re: Initial thoughts on TXDP

2016-12-02 Thread Tom Herbert
On Fri, Dec 2, 2016 at 6:36 AM, Edward Cree  wrote:
> On 01/12/16 23:46, Tom Herbert wrote:
>> The only time we
>> _really_ need to allocate an skbuff is when we need to put the packet onto a
>> queue. All the other use cases are really just to pass a structure
>> containing a packet from function to function. For that purpose we
>> should be able to just pass a much smaller structure in a stack
>> argument and only allocate an skbuff when we need to enqueue. In cases
>> where we don't ever queue a packet we might never need to allocate any
>> skbuff
> Now this intrigues me, because one of the objections to bundling (vs GRO)
> was the memory usage of all those SKBs.  IIRC we already do a 'GRO-like'
> coalescing when packets reach a TCP socket anyway (or at least in some
> cases, not sure if all the different ways we can enqueue a TCP packet for
> RX do it), but if we could carry the packets from NIC to socket without
> SKBs, doing so in lists rather than one-at-a-time wouldn't cost any extra
> memory (the packet-pages are all already allocated on the NIC RX ring).
> Possibly combine the two, so that rather than having potentially four
> versions of each function (skb, skbundle, void*, void* bundle) you just
> have the two 'ends'.
>
Yep, seems like a good idea to incorporate bundling into TXDP from the get-go.

Tom

> -Ed


Re: Initial thoughts on TXDP

2016-12-02 Thread Edward Cree
On 01/12/16 23:46, Tom Herbert wrote:
> The only time we
> _really_ need to allocate an skbuff is when we need to put the packet onto a
> queue. All the other use cases are really just to pass a structure
> containing a packet from function to function. For that purpose we
> should be able to just pass a much smaller structure in a stack
> argument and only allocate an skbuff when we need to enqueue. In cases
> where we don't ever queue a packet we might never need to allocate any
> skbuff
Now this intrigues me, because one of the objections to bundling (vs GRO)
was the memory usage of all those SKBs.  IIRC we already do a 'GRO-like'
coalescing when packets reach a TCP socket anyway (or at least in some
cases, not sure if all the different ways we can enqueue a TCP packet for
RX do it), but if we could carry the packets from NIC to socket without
SKBs, doing so in lists rather than one-at-a-time wouldn't cost any extra
memory (the packet-pages are all already allocated on the NIC RX ring).
Possibly combine the two, so that rather than having potentially four
versions of each function (skb, skbundle, void*, void* bundle) you just
have the two 'ends'.

-Ed


Re: Initial thoughts on TXDP

2016-12-02 Thread Jesper Dangaard Brouer
On Thu, 1 Dec 2016 23:47:44 +0100
Hannes Frederic Sowa  wrote:

> Side note:
> 
> On 01.12.2016 20:51, Tom Herbert wrote:
> >> > E.g. "mini-skb": Even if we assume that this provides a speedup
> >> > (where does that come from? should make no difference if a 32 or
> >> >  320 byte buffer gets allocated).

Yes, the size of the allocation from the SLUB allocator does not change
base performance/cost much (at least for small objects, if < 1024).

Do notice that the base SLUB alloc+free cost is fairly high (compared to a
201 cycles budget), especially for networking, as the free side is very
likely to hit a slow path.  The SLUB fast-path is 53 cycles, and the
slow-path around 100 cycles (data from [1]).  I've tried to address this
with the kmem_cache bulk APIs, which reduce the cost to approx 30 cycles.
(Something we have not fully reaped the benefit from yet!)

[1] https://git.kernel.org/torvalds/c/ca257195511
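
As a rough illustration of what using the bulk APIs looks like (a minimal
sketch only -- the batch size and the rx_batch_* wrapper names are
hypothetical; kmem_cache_alloc_bulk()/kmem_cache_free_bulk() are the actual
interfaces declared in include/linux/slab.h):

#include <linux/slab.h>

#define RX_BULK 16	/* hypothetical batch size */

/* Refill a batch of objects with one call instead of RX_BULK separate
 * kmem_cache_alloc() calls; returns the number allocated (0 on failure). */
static int rx_batch_refill(struct kmem_cache *cache, void **objs)
{
	return kmem_cache_alloc_bulk(cache, GFP_ATOMIC, RX_BULK, objs);
}

/* Return a whole batch at once, so the free side amortizes the slow path. */
static void rx_batch_drain(struct kmem_cache *cache, void **objs, size_t n)
{
	kmem_cache_free_bulk(cache, n, objs);
}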

> >> >  
> > It's the zero'ing of three cache lines. I believe we talked about that
> > at netdev.

Actually 4 cache-lines, but with some cleanup I believe we can get down
to clearing 192 bytes (3 cache-lines).

> 
> Jesper and I played with that again very recently:
> 
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c#L590
> 
> In micro-benchmarks we saw a pretty good speedup from not using the rep
> stosb generated by the gcc builtin but plain movq's instead. Probably the
> cost model for __builtin_memset in gcc is wrong?

Yes, I believe so.
 
> When Jesper is free we wanted to benchmark this and maybe come up with an
> arch-specific way of clearing if it turns out to really improve throughput.
> 
> SIMD instructions seem even faster but the kernel_fpu_begin/end() kill
> all the benefits.

One strange thing was that on my Skylake CPU (i7-6700K @ 4.00GHz),
Hannes's hand-optimized MOVQ ASM code didn't go past 8 bytes per cycle,
or 32 cycles for 256 bytes.

Talking to Alex and John during netdev, and reading up on the Intel
architecture, I thought that this CPU should be able to perform 16 bytes
per cycle.  The CPU clearly can, as the rep-stos numbers show once the
size gets large enough.

On this CPU the memset rep stos starts to win around 512 bytes:

 192/35 =  5.5 bytes/cycle
 256/36 =  7.1 bytes/cycle
 512/40 = 12.8 bytes/cycle
 768/46 = 16.7 bytes/cycle
1024/52 = 19.7 bytes/cycle
2048/84 = 24.4 bytes/cycle
4096/148= 27.7 bytes/cycle

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: Initial thoughts on TXDP

2016-12-02 Thread Jesper Dangaard Brouer

On Thu, 1 Dec 2016 11:51:42 -0800 Tom Herbert  wrote:
> On Wed, Nov 30, 2016 at 6:44 PM, Florian Westphal  wrote:
> > Tom Herbert  wrote:  
[...]
> >>   - Call into TCP/IP stack with page data directly from driver-- no
> >> skbuff allocation or interface. This is essentially provided by the
> >> XDP API although we would need to generalize the interface to call
> >> stack functions (I previously posted patches for that). We will also
> >> need a new action, XDP_HELD?, that indicates the XDP function held the
> >> packet (put on a socket for instance).  
> >
> > Seems this will not work at all with the planned page pool thing when
> > pages start to be held indefinitely.

It is quite the opposite: the page pool supports pages being held for
longer times than drivers do today.  The current driver page-recycle
tricks cannot, as they depend on the page refcnt being decremented
quickly (while pages are still mapped in their recycle queue).

> > You can also never get even close to userspace offload stacks once you
> > need/do this; allocations in hotpath are too expensive.

Yes. It is important to understand that once the number of outstanding
pages gets large, the driver recycling stops working, meaning page
allocations start to go through the page allocator.  I've documented[1]
that the bare alloc+free cost[2] (231 cycles order-0/4K) is higher than
the 10G wirespeed budget (201 cycles).

Thus, the driver recycle tricks are nice for benchmarking, as they hide
the page allocator overhead. But this optimization might disappear for
Tom's and Eric's more real-world use-cases, e.g. something like 10,000
sockets.  The page pool doesn't have these issues.

[1] 
http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf
[2] 
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c
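
(For reference on where those budget numbers come from, assuming ~3 GHz and
minimum-sized frames: 10G wirespeed is ~14.88 Mpps, i.e. 67.2 ns per packet,
and 67.2 ns * 3 GHz ~= 201 cycles -- so the 231-cycle order-0 page
alloc+free already exceeds the per-packet budget on its own.)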

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: Initial thoughts on TXDP

2016-12-01 Thread Rick Jones

On 12/01/2016 02:12 PM, Tom Herbert wrote:

We have to consider both request size and response size in RPC.
Presumably, something like a memcache server is mostly serving data as
opposed to reading it, so we are looking at receiving much smaller
packets than are being sent. Requests are going to be quite small, say 100
bytes, and unless we are doing a significant amount of pipelining on
connections GRO would rarely kick in. Response size will have a lot of
variability, anything from a few kilobytes up to a megabyte. I'm sorry
I can't be more specific; this is an artifact of datacenters that have
100s of different applications and communication patterns. Maybe 100b
request size and 8K, 16K, 64K response sizes might be good for a test.


No worries on the specific sizes, it is a classic "How long is a piece 
of string?" sort of question.


Not surprisingly, as the size of what is being received grows, so too
does the delta between GRO on and off.


stack@np-cp1-c0-m1-mgmt:~/rjones2$ HDR="-P 1"; for r in 8K 16K 64K 1M; do \
    for gro in on off; do \
        sudo ethtool -K hed0 gro ${gro}; brand="$r gro $gro"; \
        ./netperf -B "$brand" -c -H np-cp1-c1-m3-mgmt -t TCP_RR $HDR -- \
            -P 12867 -r 128,${r} -o result_brand,throughput,local_sd; \
        HDR="-P 0"; \
    done; \
done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0

Result Tag,Throughput,Local Service Demand
"8K gro on",9899.84,35.947
"8K gro off",7299.54,61.097
"16K gro on",8119.38,58.367
"16K gro off",5176.87,95.317
"64K gro on",4429.57,110.629
"64K gro off",2128.58,289.913
"1M gro on",887.85,918.447
"1M gro off",335.97,3427.587

So that gives a feel for how much this alternative mechanism would
have to reduce path-length to maintain the same CPU overhead, were the
mechanism to preclude GRO.


rick




Re: Initial thoughts on TXDP

2016-12-01 Thread Tom Herbert
On Thu, Dec 1, 2016 at 2:47 PM, Hannes Frederic Sowa
 wrote:
> Side note:
>
> On 01.12.2016 20:51, Tom Herbert wrote:
>>> > E.g. "mini-skb": Even if we assume that this provides a speedup
>>> > (where does that come from? should make no difference if a 32 or
>>> >  320 byte buffer gets allocated).
>>> >
>> It's the zero'ing of three cache lines. I believe we talked about that
>> at netdev.
>
> Jesper and I played with that again very recently:
>
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c#L590
>
> In micro-benchmarks we saw a pretty good speedup from not using the rep
> stosb generated by the gcc builtin but plain movq's instead. Probably the
> cost model for __builtin_memset in gcc is wrong?
>
> When Jesper is free we wanted to benchmark this and maybe come up with an
> arch-specific way of clearing if it turns out to really improve throughput.
>
> SIMD instructions seem even faster but the kernel_fpu_begin/end() kill
> all the benefits.
>
One nice direction of XDP is that it forces drivers to defer
allocating (and hence zero'ing) skbs. In the receive path I think we
can exploit this property deeper into the stack. The only time we
_really_ need to allocate an skbuff is when we need to put the packet onto a
queue. All the other use cases are really just to pass a structure
containing a packet from function to function. For that purpose we
should be able to just pass a much smaller structure in a stack
argument and only allocate an skbuff when we need to enqueue. In cases
where we don't ever queue a packet we might never need to allocate any
skbuff -- this includes pure ACKs and packets that end up being dropped.
But even more than that, if a received packet generates a TX packet
(like a SYN causes a SYN-ACK) then we might even be able to just
recycle the received packet and avoid needing any skbuff allocation on
transmit (XDP_TX already does this in a limited context)--  this could
be a win to handle SYN attacks for instance. Also, since we don't
queue on the socket buffer for UDP it's conceivable we could avoid
skbuffs in an expedited UDP TX path.

Currently, nearly the whole stack depends on packets always being
passed in skbuffs; however, __skb_flow_dissect is an interesting
exception as it can handle packets passed in either an skbuff or
just a void *-- so we know that this "dual mode" is at least possible.
Trying to retrain the whole stack to be able to handle both skbuffs
and raw pages is probably untenable at this point, but selectively
augmenting some critical performance functions for dual mode (ip_rcv,
tcp_rcv, udp_rcv functions for instance) might work.
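
To make that concrete, a minimal sketch of the kind of small on-stack
descriptor this could mean (all names here -- txdp_desc, txdp_tcp_rcv,
XDP_HELD -- are hypothetical, just illustrating the shape of a dual-mode
entry point, not an existing interface):

#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/bpf.h>		/* enum xdp_action / XDP_PASS */

/* Hypothetical small descriptor passed on the stack instead of an
 * skbuff while the packet still lives in the driver's RX page. */
struct txdp_desc {
	void		*data;		/* frame data in the RX ring page */
	unsigned int	len;		/* frame length */
	u32		rxhash;		/* RSS hash from the RX descriptor */
	u16		vlan_tci;
	bool		csum_ok;	/* HW already verified the checksum */
};

/* Hypothetical dual-mode receive entry point: an skbuff is only built
 * if the packet actually has to be queued somewhere. */
static int txdp_tcp_rcv(struct net_device *dev, struct txdp_desc *d)
{
	/* 1. Lockless established-connection lookup.			*/
	/* 2. Anything funky (no match, unexpected state): return	*/
	/*    XDP_PASS so the packet takes the normal skbuff path.	*/
	/* 3. New data that must sit on the socket: only now do	*/
	/*    something like build_skb(d->data, PAGE_SIZE), queue it,	*/
	/*    and return the proposed XDP_HELD action.			*/
	/* 4. Pure ACK: update TCP state straight from d->data and	*/
	/*    return without any allocation at all.			*/
	return XDP_PASS;
}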

Thanks,
Tom

> Bye,
> Hannes
>


Re: Initial thoughts on TXDP

2016-12-01 Thread Hannes Frederic Sowa
On 01.12.2016 21:13, Sowmini Varadhan wrote:
> On (12/01/16 11:05), Tom Herbert wrote:
>>
>> Polling does not necessarily imply that networking monopolizes the CPU
>> except when the CPU is otherwise idle. Presumably the application
>> drives the polling when it is ready to receive work.
> 
> I'm not grokking that- "if the cpu is idle, we want to busy-poll
> and make it 0% idle"?  Keeping CPU 0% idle has all sorts
> of issues, see slide 20 of
>  http://www.slideshare.net/shemminger/dpdk-performance
>
>>> and one other critical difference from the hot-potato-forwarding
>>> model (the sort of OVS model that DPDK etc might arguably be a fit for)
>>> does not apply: in order to figure out the ethernet and IP headers
>>> in the response correctly at all times (in the face of things like VRRP,
>>> gw changes, gw's mac addr changes etc) the application should really
>>> be listening on NETLINK sockets for modifications to the networking
>>> state - again points to needing a select() socket set where you can
>>> have both the I/O fds and the netlink socket,
>>>
>> I would think that that sort of management would not be implemented in a
>> fast path processing thread for an application.
> 
> sure, but my point was that *XDP and other stack-bypass methods need
> to provide a select()able socket: when your use-case is not about just
> networking, you have to snoop on changes to the control plane, and update
> your data path. In the OVS case (pure networking) the OVS control plane
> updates are intrinsic to OVS. For the rest of the request/response world,
> we need a select()able socket set to do this elegantly (not really
> possible in DPDK, for example)

Busy-poll on steroids is what Windows does by mapping the user space
"doorbell" into a vDSO and letting user space loop on that, maybe with
MONITOR/MWAIT. The interesting thing is that you can map other events to
this notification event, too. It sounds like a usable idea to me and
resembles what we already do with futexes.

Bye,
Hannes



Re: Initial thoughts on TXDP

2016-12-01 Thread Hannes Frederic Sowa
Side note:

On 01.12.2016 20:51, Tom Herbert wrote:
>> > E.g. "mini-skb": Even if we assume that this provides a speedup
>> > (where does that come from? should make no difference if a 32 or
>> >  320 byte buffer gets allocated).
>> >
> It's the zero'ing of three cache lines. I believe we talked about that
> at netdev.

Jesper and I played with that again very recently:

https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c#L590

In micro-benchmarks we saw a pretty good speedup from not using the rep
stosb generated by the gcc builtin but plain movq's instead. Probably the
cost model for __builtin_memset in gcc is wrong?

When Jesper is free we wanted to benchmark this and maybe come up with an
arch-specific way of clearing if it turns out to really improve throughput.

SIMD instructions seem even faster but the kernel_fpu_begin/end() kill
all the benefits.
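
For reference, a minimal standalone sketch of the two variants being
compared (illustrative only -- the real harness is the time_bench_memset.c
linked above; the fixed 256-byte size and function names are just for the
example, and in practice the explicit-store loop may need unrolling or
inline asm to keep gcc from turning it back into a memset call):

#include <stdint.h>
#include <string.h>

#define CLEAR_SIZE 256	/* roughly the sk_buff area discussed here */

/* Variant 1: let gcc expand __builtin_memset, which for a constant size
 * may be emitted as "rep stosb". */
static inline void clear_builtin(void *p)
{
	memset(p, 0, CLEAR_SIZE);
}

/* Variant 2: explicit 8-byte stores (plain movq's on x86-64), assuming
 * an 8-byte-aligned buffer and CLEAR_SIZE a multiple of 8. */
static inline void clear_movq(void *p)
{
	uint64_t *q = p;
	unsigned int i;

	for (i = 0; i < CLEAR_SIZE / 8; i++)
		q[i] = 0;
}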

Bye,
Hannes



Re: Initial thoughts on TXDP

2016-12-01 Thread Tom Herbert
On Thu, Dec 1, 2016 at 1:47 PM, Rick Jones  wrote:
> On 12/01/2016 12:18 PM, Tom Herbert wrote:
>>
>> On Thu, Dec 1, 2016 at 11:48 AM, Rick Jones  wrote:
>>>
>>> Just how much per-packet path-length are you thinking will go away under
>>> the
>>> likes of TXDP?  It is admittedly "just" netperf but losing TSO/GSO does
>>> some
>>> non-trivial things to effective overhead (service demand) and so
>>> throughput:
>>>
>> For plain in-order TCP packets I believe we should be able to process
>> each packet at nearly the same speed as GRO. Most of the protocol
>> processing we do between GRO and the stack is the same; the
>> difference is that we need to do a connection lookup in the stack
>> path (note we now do this in UDP GRO and that hasn't shown up as a
>> major hit). We also need to consider enqueue/dequeue on the socket,
>> which is a major reason to try for lockless sockets in this instance.
>
>
> So waving hands a bit, and taking the service demand for the GRO-on receive
> test in my previous message (860 ns/KB), that would be ~ (1448/1024)*860 or
> ~1.216 usec of CPU time per TCP segment, including ACK generation which
> unless an explicit ACK-avoidance heuristic a la HP-UX 11/Solaris 2 is put in
> place would be for every-other segment. Etc etc.
>
>> Sure, but try running something that emulates a more realistic workload
>> than a TCP stream, like an RR test with relatively small payloads and many
>> connections.
>
>
> That is a good point, which of course is why the RR tests are there in
> netperf :) Don't get me wrong, I *like* seeing path-length reductions. What
> would you posit is a relatively small payload?  The promotion of IR10
> suggests that perhaps 14KB or so is sufficiently common, so I'll grasp at
> that as the length of a piece of string:
>
We have to consider both request size and response size in RPC.
Presumably, something like a memcache server is mostly serving data as
opposed to reading it, so we are looking at receiving much smaller
packets than are being sent. Requests are going to be quite small, say 100
bytes, and unless we are doing a significant amount of pipelining on
connections GRO would rarely kick in. Response size will have a lot of
variability, anything from a few kilobytes up to a megabyte. I'm sorry
I can't be more specific; this is an artifact of datacenters that have
100s of different applications and communication patterns. Maybe 100b
request size and 8K, 16K, 64K response sizes might be good for a test.

> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t
> TCP_RR -- -P 12867 -r 128,14K
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET
> to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
> Local /Remote
> Socket Size   Request Resp.   Elapsed  Trans.    CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size    Time     Rate      local  remote local   remote
> bytes  bytes  bytes   bytes   secs.    per sec   % S    % U    us/Tr   us/Tr
>
> 16384  87380  128 14336  10.00   8118.31  1.57   -1.00  46.410  -1.000
> 16384  87380
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t
> TCP_RR -- -P 12867 -r 128,14K
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET
> to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
> Local /Remote
> Socket Size   Request Resp.   Elapsed  Trans.    CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size    Time     Rate      local  remote local   remote
> bytes  bytes  bytes   bytes   secs.    per sec   % S    % U    us/Tr   us/Tr
>
> 16384  87380  128 14336  10.00   5837.35  2.20   -1.00  90.628  -1.000
> 16384  87380
>
> So, losing GRO doubled the service demand.  I suppose I could see cutting
> path-length in half based on the things you listed which would be bypassed?
>
> I'm sure mileage will vary with different NICs and CPUs.  The ones used here
> happened to be to hand.
>
This is also biased because you're using a single connection, but it is
consistent with data we've seen in the past. To be clear, I'm not
saying GRO is bad; the fact that GRO has such a visible impact in your
test means that the GRO path is significantly more efficient. Closing
the gap seen in your numbers would be a benefit, since that means we
have improved per-packet processing.

Tom

> happy benchmarking,
>
> rick
>
> Just to get a crude feel for sensitivity, doubling to 28K unsurprisingly
> goes to more than doubling, and halving to 7K narrows the delta:
>
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t
> TCP_RR -- -P 12867 -r 128,28K
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET
> to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
> Local /Remote
> Socket Size   Request Resp.   Elapsed  Trans.    CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size    Time     Rate      local  remote local   remote
> bytes  bytes  bytes   bytes   secs.    per sec   %

Re: Initial thoughts on TXDP

2016-12-01 Thread Rick Jones

On 12/01/2016 12:18 PM, Tom Herbert wrote:

On Thu, Dec 1, 2016 at 11:48 AM, Rick Jones  wrote:

Just how much per-packet path-length are you thinking will go away under the
likes of TXDP?  It is admittedly "just" netperf but losing TSO/GSO does some
non-trivial things to effective overhead (service demand) and so throughput:


For plain in-order TCP packets I believe we should be able to process
each packet at nearly the same speed as GRO. Most of the protocol
processing we do between GRO and the stack is the same; the
difference is that we need to do a connection lookup in the stack
path (note we now do this in UDP GRO and that hasn't shown up as a
major hit). We also need to consider enqueue/dequeue on the socket,
which is a major reason to try for lockless sockets in this instance.


So waving hands a bit, and taking the service demand for the GRO-on 
receive test in my previous message (860 ns/KB), that would be ~ 
(1448/1024)*860 or ~1.216 usec of CPU time per TCP segment, including 
ACK generation which unless an explicit ACK-avoidance heuristic a la 
HP-UX 11/Solaris 2 is put in place would be for every-other segment. Etc 
etc.



Sure, but try running something that emulates a more realistic workload
than a TCP stream, like an RR test with relatively small payloads and many
connections.


That is a good point, which of course is why the RR tests are there in 
netperf :) Don't get me wrong, I *like* seeing path-length reductions. 
What would you posit is a relatively small payload?  The promotion of 
IR10 suggests that perhaps 14KB or so is sufficiently common, so I'll
grasp at that as the length of a piece of string:


stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,14K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0

Local /Remote
Socket Size   Request Resp.   Elapsed  Trans.    CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size    Time     Rate      local  remote local   remote
bytes  bytes  bytes   bytes   secs.    per sec   % S    % U    us/Tr   us/Tr

16384  87380  128 14336  10.00   8118.31  1.57   -1.00  46.410  -1.000
16384  87380
stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,14K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0

Local /Remote
Socket Size   Request Resp.   Elapsed  Trans.    CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size    Time     Rate      local  remote local   remote
bytes  bytes  bytes   bytes   secs.    per sec   % S    % U    us/Tr   us/Tr

16384  87380  128 14336  10.00   5837.35  2.20   -1.00  90.628  -1.000
16384  87380

So, losing GRO doubled the service demand.  I suppose I could see 
cutting path-length in half based on the things you listed which would 
be bypassed?


I'm sure mileage will vary with different NICs and CPUs.  The ones used 
here happened to be to hand.


happy benchmarking,

rick

Just to get a crude feel for sensitivity, doubling to 28K unsurprisingly 
goes to more than doubling, and halving to 7K narrows the delta:


stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,28K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0

Local /Remote
Socket Size   Request Resp.   Elapsed  Trans.    CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size    Time     Rate      local  remote local   remote
bytes  bytes  bytes   bytes   secs.    per sec   % S    % U    us/Tr   us/Tr

16384  87380  128 28672  10.00   6732.32  1.79   -1.00  63.819  -1.000
16384  87380
stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,28K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0

Local /Remote
Socket Size   Request Resp.   Elapsed  Trans.    CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size    Time     Rate      local  remote local   remote
bytes  bytes  bytes   bytes   secs.    per sec   % S    % U    us/Tr   us/Tr

16384  87380  128 28672  10.00   3780.47  2.32   -1.00  147.280  -1.000
16384  87380



stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,7K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0

Local /Remote
Socket Size   Request Resp.   Elapsed  Trans.    CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size    Time     Rate      local  remote local   remote
bytes  bytes  bytes   bytes   secs.    per sec   % S    % U    us/Tr   us/Tr

16384  87380  

Re: Initial thoughts on TXDP

2016-12-01 Thread Tom Herbert
On Thu, Dec 1, 2016 at 12:13 PM, Sowmini Varadhan
 wrote:
> On (12/01/16 11:05), Tom Herbert wrote:
>>
>> Polling does not necessarily imply that networking monopolizes the CPU
>> except when the CPU is otherwise idle. Presumably the application
>> drives the polling when it is ready to receive work.
>
> I'm not grokking that- "if the cpu is idle, we want to busy-poll
> and make it 0% idle"?  Keeping CPU 0% idle has all sorts
> of issues, see slide 20 of
>  http://www.slideshare.net/shemminger/dpdk-performance
>
>> > and one other critical difference from the hot-potato-forwarding
>> > model (the sort of OVS model that DPDK etc might arguably be a fit for)
>> > does not apply: in order to figure out the ethernet and IP headers
>> > in the response correctly at all times (in the face of things like VRRP,
>> > gw changes, gw's mac addr changes etc) the application should really
>> > be listening on NETLINK sockets for modifications to the networking
>> > state - again points to needing a select() socket set where you can
>> > have both the I/O fds and the netlink socket,
>> >
>> I would think that that sort of management would not be implemented in a
>> fast path processing thread for an application.
>
> sure, but my point was that *XDP and other stack-bypass methods need
> to provide a select()able socket: when your use-case is not about just
> networking, you have to snoop on changes to the control plane, and update
> your data path. In the OVS case (pure networking) the OVS control plane
> updates are intrinsic to OVS. For the rest of the request/response world,
> we need a select()able socket set to do this elegantly (not really
> possible in DPDK, for example)
>
I'm not sure that TXDP can be reconciled to help OVS. The point of
TXDP is to drive applications closer to bare metal performance; as I
mentioned, this is only going to be worth it if the fast path can be
kept simple and not complicated by a requirement for generalization.
It seems like the second we put OVS in we're doubling the data path
and accepting the performance consequences of a complex path anyway.

TXDP can't cover the whole system (any more than DPDK can) and needs to
work in concert with other mechanisms-- the key is how to steer the
work amongst the CPUs. For instance, if a latency critical thread is
running on some CPU we either need a dedicated queue for the connections
of the thread (e.g. ntuple filtering or aRFS support) or we need a fast
way to move unrelated packets received on a queue processed by that CPU
to other CPUs (less efficient, but no special HW support is needed
either).

Tom

>
>> The *SOs are always an interesting question. They make for great
>> benchmarks, but in real life the amount of benefit is somewhat
>> unclear. Under the wrong conditions, like all cwnds have collapsed or
>
> I think Rick's already bringing up this one.
>
> --Sowmini
>


Re: Initial thoughts on TXDP

2016-12-01 Thread Tom Herbert
On Thu, Dec 1, 2016 at 11:48 AM, Rick Jones  wrote:
> On 12/01/2016 11:05 AM, Tom Herbert wrote:
>>
>> For the GSO and GRO the rationale is that performing the extra SW
>> processing to do the offloads is significantly less expensive than
>> running each packet through the full stack. This is true in a
>> multi-layered generalized stack. In TXDP, however, we should be able
>> to optimize the stack data path such that that would no longer be
>> true. For instance, if we can process the packets received on a
>> connection quickly enough so that it's about the same or just a little
>> more costly than GRO processing then we might bypass GRO entirely.
>> TSO is probably still relevant in TXDP since it reduces overheads
>> processing TX in the device itself.
>
>
> Just how much per-packet path-length are you thinking will go away under the
> likes of TXDP?  It is admittedly "just" netperf but losing TSO/GSO does some
> non-trivial things to effective overhead (service demand) and so throughput:
>
For plain in-order TCP packets I believe we should be able to process
each packet at nearly the same speed as GRO. Most of the protocol
processing we do between GRO and the stack is the same; the
difference is that we need to do a connection lookup in the stack
path (note we now do this in UDP GRO and that hasn't shown up as a
major hit). We also need to consider enqueue/dequeue on the socket,
which is a major reason to try for lockless sockets in this instance.

> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P
> 12867
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to
> np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
> Recv   Send    Send                       Utilization      Service Demand
> Socket Socket  Message  Elapsed           Send    Recv     Send    Recv
> Size   Size    Size     Time   Throughput local   remote   local   remote
> bytes  bytes   bytes    secs.  10^6bits/s % S     % U      us/KB   us/KB
>
>  87380  16384  16384    10.00    9260.24  2.02    -1.00    0.428   -1.000
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 tso off gso off
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P
> 12867
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to
> np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
> Recv   Send    Send                       Utilization      Service Demand
> Socket Socket  Message  Elapsed           Send    Recv     Send    Recv
> Size   Size    Size     Time   Throughput local   remote   local   remote
> bytes  bytes   bytes    secs.  10^6bits/s % S     % U      us/KB   us/KB
>
>  87380  16384  16384    10.00    5621.82  4.25    -1.00    1.486   -1.000
>
> And that is still with the stretch-ACKs induced by GRO at the receiver.
>
Sure, but try running something that emulates a more realistic workload
than a TCP stream, like an RR test with relatively small payloads and many
connections.

> Losing GRO has quite similar results:
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t
> TCP_MAERTS -- -P 12867
> MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to
> np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
> Recv   Send    Send                       Utilization      Service Demand
> Socket Socket  Message  Elapsed           Recv    Send     Recv    Send
> Size   Size    Size     Time   Throughput local   remote   local   remote
> bytes  bytes   bytes    secs.  10^6bits/s % S     % U      us/KB   us/KB
>
>  87380  16384  16384    10.00    9154.02  4.00    -1.00    0.860   -1.000
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
>
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t
> TCP_MAERTS -- -P 12867
> MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to
> np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
> Recv   Send    Send                       Utilization      Service Demand
> Socket Socket  Message  Elapsed           Recv    Send     Recv    Send
> Size   Size    Size     Time   Throughput local   remote   local   remote
> bytes  bytes   bytes    secs.  10^6bits/s % S     % U      us/KB   us/KB
>
>  87380  16384  16384    10.00    4212.06  5.36    -1.00    2.502   -1.000
>
> I'm sure there is a very non-trivial "it depends" component here - netperf
> will get the peak benefit from *SO and so one will see the peak difference
> in service demands - but even if one gets only 6 segments per *SO that is a
> lot of path-length to make-up.
>
True, but I think there's a lot of path we'll be able to cut out. In
this mode we don't need iptables, Netfilter, the input route, the IPvlan
check, or other similar lookups. Once we've successfully matched an
established TCP state, anything related to policy on both TX and RX for
that connection is inferred from that state. We want the processing
path in this case to be concerned with just protocol processing
an

Re: Initial thoughts on TXDP

2016-12-01 Thread Sowmini Varadhan
On (12/01/16 11:05), Tom Herbert wrote:
> 
> Polling does not necessarily imply that networking monopolizes the CPU
> except when the CPU is otherwise idle. Presumably the application
> drives the polling when it is ready to receive work.

I'm not grokking that- "if the cpu is idle, we want to busy-poll
and make it 0% idle"?  Keeping CPU 0% idle has all sorts
of issues, see slide 20 of
 http://www.slideshare.net/shemminger/dpdk-performance

> > and one other critical difference from the hot-potato-forwarding
> > model (the sort of OVS model that DPDK etc might arguably be a fit for)
> > does not apply: in order to figure out the ethernet and IP headers
> > in the response correctly at all times (in the face of things like VRRP,
> > gw changes, gw's mac addr changes etc) the application should really
> > be listening on NETLINK sockets for modifications to the networking
> > state - again points to needing a select() socket set where you can
> > have both the I/O fds and the netlink socket,
> >
> I would think that that sort of management would not be implemented in a
> fast path processing thread for an application.

sure, but my point was that *XDP and other stack-bypass methods need
to provide a select()able socket: when your use-case is not about just
networking, you have to snoop on changes to the control plane, and update
your data path. In the OVS case (pure networking) the OVS control plane
updates are intrinsic to OVS. For the rest of the request/response world,
we need a select()able socket set to do this elegantly (not really
possible in DPDK, for example)


> The *SOs are always an interesting question. They make for great
> benchmarks, but in real life the amount of benefit is somewhat
> unclear. Under the wrong conditions, like all cwnds have collapsed or

I think Rick's already bringing up this one.

--Sowmini



Re: Initial thoughts on TXDP

2016-12-01 Thread Tom Herbert
On Wed, Nov 30, 2016 at 6:44 PM, Florian Westphal  wrote:
> Tom Herbert  wrote:
>> Posting for discussion
>
> Warning: You are not going to like this reply...
>
>> Now that XDP seems to be nicely gaining traction
>
> Yes, I regret to see that.  XDP seems useful to create impressive
> benchmark numbers (and little else).
>
> I will send a separate email to keep that flamebait part away from
> this thread though.
>
> [..]
>
>> addresses the performance gap for stateless packet processing). The
>> problem statement is analogous to that which we had for XDP, namely
>> can we create a mode in the kernel that offers the same performance
>> that is seen with L4 protocols over kernel bypass
>
> Why?  If you want to bypass the kernel, then DO IT.
>
I don't want kernel bypass. I want the Linux stack to provide
something close to bare metal performance for TCP/UDP for some latency
sensitive applications we run.

> There is nothing wrong with DPDK.  The ONLY problem is that the kernel
> does not offer a userspace fastpath like Windows RIO or FreeBSD's netmap.
>
> But even without that it's not difficult to get DPDK running.
>
That is not true for large scale deployments. Also, TXDP is about
accelerating transport layers like TCP; DPDK is just the interface
from userspace to device queues. We would need a whole lot more on top
of DPDK, a userspace TCP/IP stack for instance, to consider that we
have equivalent functionality.

> (T)XDP seems born from spite, not technical rationale.
> IMO everyone would be better off if we'd just have something netmap-esque
> in the network core (also see below).
>
>> I imagine there are a few reasons why userspace TCP stacks can get
>> good performance:
>>
>> - Spin polling (we already can do this in kernel)
>> - Lockless, I would assume that threads typically have exclusive
>> access to a queue pair for a connection
>> - Minimal TCP/IP stack code
>> - Zero copy TX/RX
>> - Light weight structures for queuing
>> - No context switches
>> - Fast data path for in order, uncongested flows
>> - Silo'ing between application and device queues
>
> I only see two cases:
>
> 1. Many applications running (standard Os model) that need to
> send/receive data
> -> Linux Network Stack
>
> 2. Single dedicated application that does all rx/tx
>
> -> no queueing needed (can block network rx completely if receiver
> is slow)
> -> no allocations needed at runtime at all
> -> no locking needed (single producer, single consumer)
>
> If you have #2 and you need to be fast etc then full userspace
> bypass is fine.  We will -- no matter what we do in kernel -- never
> be able to keep up with the speed you can get with that
> because we have to deal with #1.  (Plus the ease of use/freedom of doing
> userspace programming).  And yes, I think that #2 is something we
> should address solely by providing netmap or something similar.
>
> But even considering #1 there are ways to speed stack up:
>
> I'd kill RPS/RFS so we don't have IPI anymore and skb stays
> on same cpu up to where it gets queued (ofo or rx queue).
>
The reference to RPS and RFS is only to move packets that are not in
the datapath off the hot CPU. For instance, if we get a FIN for a
connection we can put this into a slow path, since FIN processing is
not latency sensitive but may take a considerable amount of CPU to
process.

> Then we could tell driver what happened with the skb it gave us, e.g.
> we can tell driver it can do immediate page/dma reuse, for example
> in pure ack case as opposed to skb sitting in ofo or receive queue.
>
> (RPS/RFS functionality could still be provided via one of the gazillion
>  hooks we now have in the stack for those that need/want it), so I do
> not think we would lose functionality.
>
>>   - Call into TCP/IP stack with page data directly from driver-- no
>> skbuff allocation or interface. This is essentially provided by the
>> XDP API although we would need to generalize the interface to call
>> stack functions (I previously posted patches for that). We will also
>> need a new action, XDP_HELD?, that indicates the XDP function held the
>> packet (put on a socket for instance).
>
> Seems this will not work at all with the planned page pool thing when
> pages start to be held indefinitely.
>
The processing needed to gift a page to the stack shouldn't be very
different from what a driver needs to do when an skbuff is created
after XDP_PASS is returned. We would probably want to pass additional
metadata, things like checksum and VLAN information from the received
descriptor, to the stack. A callback can be included if the stack
decides it wants to hold on to the buffer and the driver needs to do
dma_sync etc. for that.

> You can also never get even close to userspace offload stacks once you
> need/do this; allocations in hotpath are too expensive.
>
> [..]
>
>>   - When we transmit, it would be nice to go straight from TCP
>> connection to an XDP device queue and in particular skip the qdisc
>> layer. This follows the principle of low

Re: Initial thoughts on TXDP

2016-12-01 Thread Rick Jones

On 12/01/2016 11:05 AM, Tom Herbert wrote:

For the GSO and GRO the rationale is that performing the extra SW
processing to do the offloads is significantly less expensive than
running each packet through the full stack. This is true in a
multi-layered generalized stack. In TXDP, however, we should be able
to optimize the stack data path such that that would no longer be
true. For instance, if we can process the packets received on a
connection quickly enough so that it's about the same or just a little
more costly than GRO processing then we might bypass GRO entirely.
TSO is probably still relevant in TXDP since it reduces overheads
processing TX in the device itself.


Just how much per-packet path-length are you thinking will go away under 
the likes of TXDP?  It is admittedly "just" netperf but losing TSO/GSO 
does some non-trivial things to effective overhead (service demand) and 
so throughput:


stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- 
-P 12867
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to 
np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                       Utilization      Service Demand
Socket Socket  Message  Elapsed           Send    Recv     Send    Recv
Size   Size    Size     Time   Throughput local   remote   local   remote
bytes  bytes   bytes    secs.  10^6bits/s % S     % U      us/KB   us/KB

 87380  16384  16384    10.00    9260.24  2.02    -1.00    0.428   -1.000

stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 tso off gso off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- 
-P 12867
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to 
np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                       Utilization      Service Demand
Socket Socket  Message  Elapsed           Send    Recv     Send    Recv
Size   Size    Size     Time   Throughput local   remote   local   remote
bytes  bytes   bytes    secs.  10^6bits/s % S     % U      us/KB   us/KB

 87380  16384  16384    10.00    5621.82  4.25    -1.00    1.486   -1.000


And that is still with the stretch-ACKs induced by GRO at the receiver.

Losing GRO has quite similar results:
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_MAERTS -- -P 12867
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to 
np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                       Utilization      Service Demand
Socket Socket  Message  Elapsed           Recv    Send     Recv    Send
Size   Size    Size     Time   Throughput local   remote   local   remote
bytes  bytes   bytes    secs.  10^6bits/s % S     % U      us/KB   us/KB

 87380  16384  16384    10.00    9154.02  4.00    -1.00    0.860   -1.000

stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_MAERTS -- -P 12867
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to 
np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                       Utilization      Service Demand
Socket Socket  Message  Elapsed           Recv    Send     Recv    Send
Size   Size    Size     Time   Throughput local   remote   local   remote
bytes  bytes   bytes    secs.  10^6bits/s % S     % U      us/KB   us/KB

 87380  16384  16384    10.00    4212.06  5.36    -1.00    2.502   -1.000


I'm sure there is a very non-trivial "it depends" component here - 
netperf will get the peak benefit from *SO and so one will see the peak 
difference in service demands - but even if one gets only 6 segments per 
*SO that is a lot of path-length to make-up.


4.4 kernel, BE3 NICs ... E5-2640 0 @ 2.50GHz

And even if one does have the CPU cycles to burn so to speak, the effect 
on power consumption needs to be included in the calculus.


happy benchmarking,

rick jones


Re: Initial thoughts on TXDP

2016-12-01 Thread Tom Herbert
On Thu, Dec 1, 2016 at 5:55 AM, Sowmini Varadhan
 wrote:
> On (11/30/16 14:54), Tom Herbert wrote:
>>
>> Posting for discussion
>:
>> One simplifying assumption we might make is that TXDP is primarily for
>> optimizing latency, specifically request/response type operations
>> (think HPC, HFT, flash server, or other tightly coupled communications
>> within the datacenter). Notably, I don't think that saving CPU is as
>> relevant to TXDP, in fact we have already seen that CPU utilization
>> can be traded off for lower latency via spin polling. Similar to XDP
>> though, we might assume that single CPU performance is relevant (i.e.
>> on a cache server we'd like to spin as few CPUs as needed and no more
>> to handle the load and maintain throughput and latency requirements).
>> High throughput (ops/sec) and low variance should be side effects of
>> any design.
>
> I'm sending this with some hesitation (esp as the flamebait threads
> are starting up - I have no interest in getting into food-fights!!),
> because it sounds like the HPC/request-response use-case you have in mind
> (HTTP based?) is very likely different from that of the DB use-cases in
> my environment (RDBMS, Cluster req/responses). But to provide some
> perspective from the latter use-case..
>
> We also have request-response transactions, but CPU utilization
> is extremely critical- many DB operations are highly CPU bound,
> so it's not acceptable for the network to hog CPU util by polling.
> In that sense, the DB req/resp model has a lot of overlap with the
> Suricata use-case.
>
Hi Sowmini,

Polling does not necessarily imply that networking monopolizes the CPU
except when the CPU is otherwise idle. Presumably the application
drives the polling when it is ready to receive work.
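
(For reference, a minimal sketch of what the existing per-socket busy-poll
knob looks like from an application today, assuming a kernel with busy
polling enabled; SO_BUSY_POLL covers blocking receives on that socket,
while the net.core.busy_poll sysctl covers select()/poll():)

#include <sys/socket.h>
#include <stdio.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46		/* from include/uapi/asm-generic/socket.h */
#endif

/* Ask the kernel to busy-poll the device queue for up to 'usecs'
 * microseconds on blocking receives for this socket, instead of
 * sleeping and waiting for an interrupt. */
static int enable_busy_poll(int fd, unsigned int usecs)
{
	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs)) < 0) {
		perror("setsockopt(SO_BUSY_POLL)");
		return -1;
	}
	return 0;
}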

> Also we need a select()able socket, because we have to deal with
> input from several sources- network I/O, but also disk, and
> file-system I/O. So need to make sure there is no starvation,
> and that we multiplex between  I/O sources efficiently
>
Yes, that is a requirement.

> and one other critical difference from the hot-potato-forwarding
> model (the sort of OVS model that DPDK etc might aruguably be a fit for)
> does not apply: in order to figure out the ethernet and IP headers
> in the response correctly at all times (in the face of things like VRRP,
> gw changes, gw's mac addr changes etc) the application should really
> be listening on NETLINK sockets for modifications to the networking
> state - again points to needing a select() socket set where you can
> have both the I/O fds and the netlink socket,
>
I would think that that sort of management would not be implemented in a
fast path processing thread for an application.

> For all of these reasons, we are investigating approaches similar to
> Suricata- PF_PACKET with TPACKETV2 (since we need both Tx and Rx,
> and so far, tpacketv2 seems "good enough"). FWIW, we also took
> a look at netmap and so far have not seen any significant benefits
> to netmap over pf_packet.. investigation still ongoing.
>
>>   - Call into TCP/IP stack with page data directly from driver-- no
>> skbuff allocation or interface. This is essentially provided by the
>
> I'm curious- one thing that came out of the IPsec evaluation
> is that TSO is very valuable for performance, and this is most easily
> accessed via the sk_buff interfaces.  I have not had a chance
> to review your patches yet, but isn't that an issue if you bypass
> sk_buff usage? But I should probably go and review your patchset..
>
The *SOs are always an interesting question. They make for great
benchmarks, but in real life the amount of benefit is somewhat
unclear. Under the wrong conditions, like all cwnds have collapsed or
received packets for flows are small or so mixed that we can't get
much aggregation, SO provides no benefit and in fact becomes
overhead. Relying on any amount of segmentation offload in real
deployment is risky; for instance we've seen some video servers
deployed that were able to serve line rate at 90% CPU in testing (SO
was effective) but ended up needing 110% CPU in deployment when a
hiccup caused all cwnds to collapse. Moral of the story is to provision
your servers assuming the worst-case conditions that would render
opportunistic offloads useless.

For the GSO and GRO the rationale is that performing the extra SW
processing to do the offloads is significantly less expensive than
running each packet through the full stack. This is true in a
multi-layered generalized stack. In TXDP, however, we should be able
to optimize the stack data path such that that would no longer be
true. For instance, if we can process the packets received on a
connection quickly enough so that it's about the same or just a little
more costly than GRO processing then we might bypass GRO entirely.
TSO is probably still relevant in TXDP since it reduces overheads
processing TX in the device itself.

Tom

> --Sowmini


Re: Initial thoughts on TXDP

2016-12-01 Thread Sowmini Varadhan
On (11/30/16 14:54), Tom Herbert wrote:
> 
> Posting for discussion
   :
> One simplifying assumption we might make is that TXDP is primarily for
> optimizing latency, specifically request/response type operations
> (think HPC, HFT, flash server, or other tightly coupled communications
> within the datacenter). Notably, I don't think that saving CPU is as
> relevant to TXDP, in fact we have already seen that CPU utilization
> can be traded off for lower latency via spin polling. Similar to XDP
> though, we might assume that single CPU performance is relevant (i.e.
> on a cache server we'd like to spin as few CPUs as needed and no more
> to handle the load and maintain throughput and latency requirements).
> High throughput (ops/sec) and low variance should be side effects of
> any design.

I'm sending this with some hesitation (esp as the flamebait threads
are starting up - I have no interest in getting into food-fights!!), 
because it sounds like the HPC/request-response use-case you have in mind
(HTTP based?) is very likely different from that of the DB use-cases in
my environment (RDBMS, Cluster req/responses). But to provide some
perspective from the latter use-case..

We also have request-response transactions, but CPU utilization
is extremely critical- many DB operations are highly CPU bound,
so it's not acceptable for the network to hog CPU util by polling.
In that sense, the DB req/resp model has a lot of overlap with the
Suricata use-case.

Also we need a select()able socket, because we have to deal with
input from several sources- network I/O, but also disk, and 
file-system I/O. So need to make sure there is no starvation,
and that we multiplex between  I/O sources efficiently

and one other critical difference from the hot-potato-forwarding
model (the sort of OVS model that DPDK etc might arguably be a fit for)
does not apply: in order to figure out the ethernet and IP headers
in the response correctly at all times (in the face of things like VRRP,
gw changes, gw's mac addr changes etc) the application should really
be listening on NETLINK sockets for modifications to the networking
state - again points to needing a select() socket set where you can
have both the I/O fds and the netlink socket, 
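
(As an illustration of that last point -- a minimal sketch, with error
handling omitted, of putting an rtnetlink socket into the same fd set as
the data-path sockets so route/neighbour/link changes are picked up in the
normal event loop:)

#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <string.h>

static int open_rtnl_monitor(void)
{
	struct sockaddr_nl sa;
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

	memset(&sa, 0, sizeof(sa));
	sa.nl_family = AF_NETLINK;
	/* subscribe to link, neighbour and IPv4 route change notifications */
	sa.nl_groups = RTMGRP_LINK | RTMGRP_NEIGH | RTMGRP_IPV4_ROUTE;
	bind(fd, (struct sockaddr *)&sa, sizeof(sa));
	return fd;
}

/* The returned fd goes into the same select()/poll() set as the I/O fds;
 * when it becomes readable, re-resolve the gateway MAC and rebuild any
 * cached ethernet/IP headers before transmitting again. */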

For all of these reasons, we are investigating approaches similar to
Suricata- PF_PACKET with TPACKETV2 (since we need both Tx and Rx,
and so far, tpacketv2 seems "good enough"). FWIW, we also took
a look at netmap and so far have not seen any significant benefits
to netmap over pf_packet.. investigation still ongoing.
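
(And a minimal sketch of the PF_PACKET/TPACKET_V2 RX setup mentioned above,
with arbitrary ring geometry and no error handling, just to show the shape
of the mmap()ed ring approach:)

#include <sys/socket.h>
#include <sys/mman.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <arpa/inet.h>

static void *setup_tpacket_v2_rx(int *out_fd)
{
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	int ver = TPACKET_V2;
	struct tpacket_req req = {
		.tp_block_size = 1 << 16,	/* 64 KiB blocks */
		.tp_block_nr   = 64,
		.tp_frame_size = 1 << 11,	/* 2 KiB frames  */
		.tp_frame_nr   = ((1 << 16) / (1 << 11)) * 64,
	};

	setsockopt(fd, SOL_PACKET, PACKET_VERSION, &ver, sizeof(ver));
	setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

	*out_fd = fd;
	/* Frames are consumed by walking the mapped ring and checking each
	 * tpacket2_hdr's tp_status for TP_STATUS_USER. */
	return mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
		    PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}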

>   - Call into TCP/IP stack with page data directly from driver-- no
> skbuff allocation or interface. This is essentially provided by the

I'm curious- one thing that came out of the IPsec evaluation
is that TSO is very valuable for performance, and this is most easily
accessed via the sk_buff interfaces.  I have not had a chance
to review your patches yet, but isn't that an issue if you bypass
sk_buff usage? But I should probably go and review your patchset..

--Sowmini


Re: Initial thoughts on TXDP

2016-11-30 Thread Florian Westphal
Tom Herbert  wrote:
> Posting for discussion

Warning: You are not going to like this reply...

> Now that XDP seems to be nicely gaining traction

Yes, I regret to see that.  XDP seems useful to create impressive
benchmark numbers (and little else).

I will send a separate email to keep that flamebait part away from
this thread though.

[..]

> addresses the performance gap for stateless packet processing). The
> problem statement is analogous to that which we had for XDP, namely
> can we create a mode in the kernel that offers the same performance
> that is seen with L4 protocols over kernel bypass

Why?  If you want to bypass the kernel, then DO IT.

There is nothing wrong with DPDK.  The ONLY problem is that the kernel
does not offer a userspace fastpath like Windows RIO or FreeBSD's netmap.

But even without that it's not difficult to get DPDK running.

(T)XDP seems born from spite, not technical rationale.
IMO everyone would be better off if we'd just have something netmap-esque
in the network core (also see below).

> I imagine there are a few reasons why userspace TCP stacks can get
> good performance:
> 
> - Spin polling (we already can do this in kernel)
> - Lockless, I would assume that threads typically have exclusive
> access to a queue pair for a connection
> - Minimal TCP/IP stack code
> - Zero copy TX/RX
> - Light weight structures for queuing
> - No context switches
> - Fast data path for in order, uncongested flows
> - Silo'ing between application and device queues

I only see two cases:

1. Many applications running (standard Os model) that need to
send/receive data
-> Linux Network Stack

2. Single dedicated application that does all rx/tx

-> no queueing needed (can block network rx completely if receiver
is slow)
-> no allocations needed at runtime at all
-> no locking needed (single producer, single consumer)

If you have #2 and you need to be fast etc then full userspace
bypass is fine.  We will -- no matter what we do in kernel -- never
be able to keep up with the speed you can get with that
because we have to deal with #1.  (Plus the ease of use/freedom of doing
userspace programming).  And yes, I think that #2 is something we
should address solely by providing netmap or something similar.
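
(To make the "single producer, single consumer" point concrete, a minimal
sketch of the kind of lockless ring such a dedicated application would sit
on -- C11 atomics, with size and names purely illustrative:)

#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 1024			/* must be a power of two */

struct spsc_ring {
	void		*slot[RING_SIZE];
	_Atomic unsigned head;		/* written only by the producer */
	_Atomic unsigned tail;		/* written only by the consumer */
};

/* Producer side (e.g. the RX path): no locks, just one release store. */
static bool spsc_push(struct spsc_ring *r, void *pkt)
{
	unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
	unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head - tail == RING_SIZE)
		return false;			/* full: back-pressure the NIC */
	r->slot[head & (RING_SIZE - 1)] = pkt;
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

/* Consumer side (the single dedicated application thread). */
static void *spsc_pop(struct spsc_ring *r)
{
	unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
	void *pkt;

	if (head == tail)
		return NULL;			/* empty */
	pkt = r->slot[tail & (RING_SIZE - 1)];
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return pkt;
}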

But even considering #1 there are ways to speed stack up:

I'd kill RPS/RFS so we don't have IPI anymore and skb stays
on same cpu up to where it gets queued (ofo or rx queue).

Then we could tell driver what happened with the skb it gave us, e.g.
we can tell driver it can do immediate page/dma reuse, for example
in pure ack case as opposed to skb sitting in ofo or receive queue.

(RPS/RFS functionality could still be provided via one of the gazillion
 hooks we now have in the stack for those that need/want it), so I do
not think we would lose functionality.

>   - Call into TCP/IP stack with page data directly from driver-- no
> skbuff allocation or interface. This is essentially provided by the
> XDP API although we would need to generalize the interface to call
> stack functions (I previously posted patches for that). We will also
> need a new action, XDP_HELD?, that indicates the XDP function held the
> packet (put on a socket for instance).

Seems this will not work at all with the planned page pool thing when
pages start to be held indefinitely.

You can also never get even close to userspace offload stacks once you
need/do this; allocations in hotpath are too expensive.

[..]

>   - When we transmit, it would be nice to go straight from TCP
> connection to an XDP device queue and in particular skip the qdisc
> layer. This follows the principle of low latency being first criteria.

It will never be lower than userspace offloads so anyone with serious
"low latency" requirement (trading) will use that instead.

What's your target audience?

> longer latencies in effect which likely means TXDP isn't appropriate
> in such cases. BQL is also out, however we would want the TX
> batching of XDP.

Right, congestion control and buffer bloat are totally overrated .. 8-(

So far I haven't seen anything that would need XDP at all.

What makes it technically impossible to apply these miracles to the
stack...?

E.g. "mini-skb": Even if we assume that this provides a speedup
(where does that come from? should make no difference if a 32 or
 320 byte buffer gets allocated).

If we assume that it's the zeroing of sk_buff (but iirc it made little
to no difference), could add

unsigned long skb_extensions[1];

to sk_buff, then move everything not needed for tcp fastpath
(e.g. secpath, conntrack, nf_bridge, tunnel encap, tc, ...)
below that

Then convert accesses to accessors and init it on demand.

One could probably also split cb[] into a smaller fastpath area
and another one at the end that won't be touched at allocation time.
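
Roughly like this (a hedged sketch of the idea only, not actual kernel
code -- the field placement and accessor shown are illustrative):

struct sk_buff {
	/* ... hot TCP-fastpath fields (data, len, protocol, ...) stay up
	 * here and keep being cleared at allocation time ... */

	unsigned long skb_extensions[1];	/* marker: everything from here
						 * on is NOT zeroed at alloc */

	/* ... cold/optional state lives below: secpath, conntrack,
	 * nf_bridge, tunnel encap, tc, ...; it is reached only through
	 * accessors that initialize it on first use, e.g.:
	 *
	 *	static inline struct foo *skb_ext_foo(struct sk_buff *skb)
	 *	{
	 *		maybe_init_cold_area(skb);
	 *		return &skb->cold.foo;
	 *	}
	 */
};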

> Miscellaneous

> contemplating that connections/sockets can be bound to particular
> CPUs and that any operations (socket operations, timers, receive
> processing) must occur on 

Initial thoughts on TXDP

2016-11-30 Thread Tom Herbert
Posting for discussion

Now that XDP seems to be nicely gaining traction we can start to
consider the next logical step which is to apply the principles of XDP
to accelerating transport protocols in the kernel. For lack of a
better name I'll refer to this as Transport eXpress Data Path, or just
TXDP :-). Pulling off TXDP might not be the most trivial of problems
to solve, but if we can, this may address the performance gap between
kernel bypass and the stack for transport layer protocols (XDP
addresses the performance gap for stateless packet processing). The
problem statement is analogous to that which we had for XDP, namely
can we create a mode in the kernel that offers the same performance
that is seen with L4 protocols over kernel bypass (e.g. TCP/OpenOnload
or TCP/DPDK) or perhaps something reasonably close to a full HW
offload solution (such as RDMA)?

TXDP is different from XDP in that we are dealing with stateful
protocols and is part of a full protocol implementation, specifically
this would be an accelerated mode of transport connections (e.g. TCP)
in the kernel. Also, unlike XDP we now need to be concerned with
transmit path (both application generating packets as well as protocol
sourced packets like ACKs, retransmits, clocking out data, etc.).
Another distinction is that the user API needs to be considered, for
instance optimizing the nominal protocol stack but then using an
unmodified socket interface could easily undo the effects of
optimizing the lower layers. This last point actually implies a nice
constraint: if we can't keep the accelerated path simple it's probably
not worth trying to accelerate.

One simplifying assumption we might make is that TXDP is primarily for
optimizing latency, specifically request/response type operations
(think HPC, HFT, flash server, or other tightly coupled communications
within the datacenter). Notably, I don't think that saving CPU is as
relevant to TXDP, in fact we have already seen that CPU utilization
can be traded off for lower latency via spin polling. Similar to XDP
though, we might assume that single CPU performance is relevant (i.e.
on a cache server we'd like to spin as few CPUs as needed and no more
to handle the load and maintain throughput and latency requirements).
High throughput (ops/sec) and low variance should be side effects of
any design.

As with XDP, TXDP is _not_ intended to be a completely generic and
transparent solution. The application may be specifically optimized
for use with TXDP (for instance to implement perfect lockless
silo'ing). So TXDP is not going to be for everyone and it should be as
minimally invasive to the rest of the stack as possible.

I imagine there are a few reasons why userspace TCP stacks can get
good performance:

- Spin polling (we already can do this in kernel)
- Lockless, I would assume that threads typically have exclusive
access to a queue pair for a connection
- Minimal TCP/IP stack code
- Zero copy TX/RX
- Light weight structures for queuing
- No context switches
- Fast data path for in order, uncongested flows
- Silo'ing between application and device queues

Not all of these have cognates in the Linux stack, for instance we
probably can't entirely eliminate context switches for a userspace
application.

So with that the components of TXDP might look something like:

RX

  - Call into TCP/IP stack with page data directly from driver-- no
skbuff allocation or interface. This is essentially provided by the
XDP API although we would need to generalize the interface to call
stack functions (I previously posted patches for that). We will also
need a new action, XDP_HELD?, that indicates the XDP function held the
packet (put on a socket for instance).
  - Perform connection lookup. If we assume a lockless model as
described below then we should be able to perform lockless connection
lookup similar to the work Eric did to optimize UDP lookups for tunnel
processing
  - Call function that implements expedited TCP/IP datapath (something
like Van Jacobson's famous 80 instructions? :-) ).
  - If there is anything funky about the packet or connection state, or
the TCP connection is not being TXDP accelerated, just return XDP_PASS so
that the packet follows normal stack processing. Since we did a connection
lookup we could return an early demux also. Since we're already in an
exception mode, this is where we might want to move packet processing
to a different CPU (can be done by RPS/RFS).
  - If the packet contains new data we can allocate a "mini skbuff" (talked
about that at netdev) for queuing on the socket.
  - If the packet is an ACK we can process it directly without ever creating an skbuff
  - There is also the possibility of avoiding the skbuff allocation
for in-kernel applications. Stream parser might also be taught how to
deal with raw buffers.
  - If we're really ambitious we can also consider putting packets
into a packet ring for user space presuming that packets are typically
in order (might be a little orthogonal to TXDP).

TX

  - Norm