Re: Initial thoughts on TXDP
On Fri, Dec 2, 2016 at 6:36 AM, Edward Cree wrote: > On 01/12/16 23:46, Tom Herbert wrote: >> The only time we >> _really_ need to allocate an skbuff is when we need to put the packet onto a >> queue. All the other use cases are really just to pass a structure >> containing a packet from function to function. For that purpose we >> should be able to just pass a much smaller structure in a stack >> argument and only allocate an skbuff when we need to enqueue. In cases >> where we don't ever queue a packet we might never need to allocate any >> skbuff > Now this intrigues me, because one of the objections to bundling (vs GRO) > was the memory usage of all those SKBs. IIRC we already do a 'GRO-like' > coalescing when packets reach a TCP socket anyway (or at least in some > cases, not sure if all the different ways we can enqueue a TCP packet for > RX do it), but if we could carry the packets from NIC to socket without > SKBs, doing so in lists rather than one-at-a-time wouldn't cost any extra > memory (the packet-pages are all already allocated on the NIC RX ring). > Possibly combine the two, so that rather than having potentially four > versions of each function (skb, skbundle, void*, void* bundle) you just > have the two 'ends'. > -Ed Yep, seems like a good idea to incorporate bundling into TXDP from the get-go. Tom
Re: Initial thoughts on TXDP
On 01/12/16 23:46, Tom Herbert wrote: > The only time we > _really_ need to allocate an skbuff is when we need to put the packet onto a > queue. All the other use cases are really just to pass a structure > containing a packet from function to function. For that purpose we > should be able to just pass a much smaller structure in a stack > argument and only allocate an skbuff when we need to enqueue. In cases > where we don't ever queue a packet we might never need to allocate any > skbuff Now this intrigues me, because one of the objections to bundling (vs GRO) was the memory usage of all those SKBs. IIRC we already do a 'GRO-like' coalescing when packets reach a TCP socket anyway (or at least in some cases, not sure if all the different ways we can enqueue a TCP packet for RX do it), but if we could carry the packets from NIC to socket without SKBs, doing so in lists rather than one-at-a-time wouldn't cost any extra memory (the packet-pages are all already allocated on the NIC RX ring). Possibly combine the two, so that rather than having potentially four versions of each function (skb, skbundle, void*, void* bundle) you just have the two 'ends'. -Ed
Re: Initial thoughts on TXDP
On Thu, 1 Dec 2016 23:47:44 +0100 Hannes Frederic Sowa wrote: > Side note: > > On 01.12.2016 20:51, Tom Herbert wrote: > >> > E.g. "mini-skb": Even if we assume that this provides a speedup > >> > (where does that come from? should make no difference if a 32 or > >> > 320 byte buffer gets allocated). Yes, the size of the allocation from the SLUB allocator does not change base performance/cost much (at least for small objects, if < 1024). Do notice the base SLUB alloc+free cost is fairly high (compared to a 201 cycles budget), especially for networking, as the free-side is very likely to hit a slow path. SLUB fast-path is 53 cycles, and slow-path around 100 cycles (data from [1]). I've tried to address this with the kmem_cache bulk APIs, which reduce the cost to approx 30 cycles. (Something we have not fully reaped the benefit from yet!) [1] https://git.kernel.org/torvalds/c/ca257195511 > >> > > > It's the zero'ing of three cache lines. I believe we talked about that > > at netdev. Actually 4 cache-lines, but with some cleanup I believe we can get down to clearing 192 bytes (3 cache-lines). > > Jesper and I played with that again very recently: > > https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c#L590 > > In micro-benchmarks we saw a pretty good speed-up not using the rep > stosb generated by the gcc builtin but plain movq's. Probably the cost model > for __builtin_memset in gcc is wrong? Yes, I believe so. > When Jesper is free we wanted to benchmark this and maybe come up with an > arch-specific way of clearing if it turns out to really improve throughput. > > SIMD instructions seem even faster but the kernel_fpu_begin/end() kill > all the benefits. One strange thing was that on my Skylake CPU (i7-6700K @4.00GHz), Hannes's hand-optimized MOVQ ASM code didn't go past 8 bytes per cycle, or 32 cycles for 256 bytes.
Talking to Alex and John during netdev, and reading up on the Intel arch, I thought that this CPU should be able to perform 16 bytes per cycle. The CPU can do it, as the rep-stos shows once the size gets large enough. On this CPU the memset rep stos starts to win around 512 bytes:

 192/35  =  5.5 bytes/cycle
 256/36  =  7.1 bytes/cycle
 512/40  = 12.8 bytes/cycle
 768/46  = 16.7 bytes/cycle
1024/52  = 19.7 bytes/cycle
2048/84  = 24.4 bytes/cycle
4096/148 = 27.7 bytes/cycle

-- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: Initial thoughts on TXDP
On Thu, 1 Dec 2016 11:51:42 -0800 Tom Herbert wrote: > On Wed, Nov 30, 2016 at 6:44 PM, Florian Westphal wrote: > > Tom Herbert wrote: [...] > >> - Call into TCP/IP stack with page data directly from driver-- no > >> skbuff allocation or interface. This is essentially provided by the > >> XDP API although we would need to generalize the interface to call > >> stack functions (I previously posted patches for that). We will also > >> need a new action, XDP_HELD?, that indicates the XDP function held the > >> packet (put on a socket for instance). > > > > Seems this will not work at all with the planned page pool thing when > > pages start to be held indefinitely. It is quite the opposite: the page pool supports pages being held for longer times than drivers do today. The current driver page-recycle tricks cannot, as they depend on the page refcnt being decremented quickly (while pages are still mapped in their recycle queue). > > You can also never get even close to userspace offload stacks once you > > need/do this; allocations in hotpath are too expensive. Yes. It is important to understand that once the number of outstanding pages gets large, the driver recycle stops working, meaning the page allocations start to go through the page allocator. I've documented[1] that the bare alloc+free cost[2] (231 cycles order-0/4K) is higher than the 10G wirespeed budget (201 cycles). Thus, the driver recycle tricks are nice for benchmarking, as they hide the page allocator overhead. But this optimization might disappear for Tom's and Eric's more real-world use-cases, e.g. 10,000 sockets. The page pool doesn't have these issues. [1] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: Initial thoughts on TXDP
On 12/01/2016 02:12 PM, Tom Herbert wrote: We have to consider both request size and response size in RPC. Presumably, something like a memcache server is mostly serving data as opposed to reading it, so we are looking at receiving much smaller packets than are being sent. Requests are going to be quite small, say 100 bytes, and unless we are doing a significant amount of pipelining on connections GRO would rarely kick in. Response size will have a lot of variability, anything from a few kilobytes up to a megabyte. I'm sorry I can't be more specific; this is an artifact of datacenters that have 100s of different applications and communication patterns. Maybe 100b request size and 8K, 16K, 64K response sizes might be good for a test. No worries on the specific sizes, it is a classic "How long is a piece of string?" sort of question. Not surprisingly, as the size of what is being received grows, so too does the delta between GRO on and off.

stack@np-cp1-c0-m1-mgmt:~/rjones2$ HDR="-P 1"; for r in 8K 16K 64K 1M; do for gro in on off; do sudo ethtool -K hed0 gro ${gro}; brand="$r gro $gro"; ./netperf -B "$brand" -c -H np-cp1-c1-m3-mgmt -t TCP_RR $HDR -- -P 12867 -r 128,${r} -o result_brand,throughput,local_sd; HDR="-P 0"; done; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Result Tag,Throughput,Local Service Demand
"8K gro on",9899.84,35.947
"8K gro off",7299.54,61.097
"16K gro on",8119.38,58.367
"16K gro off",5176.87,95.317
"64K gro on",4429.57,110.629
"64K gro off",2128.58,289.913
"1M gro on",887.85,918.447
"1M gro off",335.97,3427.587

So that gives a feel for by how much this alternative mechanism would have to reduce path-length to maintain the CPU overhead, were the mechanism to preclude GRO. rick
Re: Initial thoughts on TXDP
On Thu, Dec 1, 2016 at 2:47 PM, Hannes Frederic Sowa wrote: > Side note: > > On 01.12.2016 20:51, Tom Herbert wrote: >>> > E.g. "mini-skb": Even if we assume that this provides a speedup >>> > (where does that come from? should make no difference if a 32 or >>> > 320 byte buffer gets allocated). >>> > >> It's the zero'ing of three cache lines. I believe we talked about that >> at netdev. > > Jesper and I played with that again very recently: > > https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c#L590 > > In micro-benchmarks we saw a pretty good speed-up not using the rep > stosb generated by the gcc builtin but plain movq's. Probably the cost model > for __builtin_memset in gcc is wrong? > > When Jesper is free we wanted to benchmark this and maybe come up with an > arch-specific way of clearing if it turns out to really improve throughput. > > SIMD instructions seem even faster but the kernel_fpu_begin/end() kill > all the benefits. > One nice direction of XDP is that it forces drivers to defer allocating (and hence zero'ing) skbs. In the receive path I think we can exploit this property deeper into the stack. The only time we _really_ need to allocate an skbuff is when we need to put the packet onto a queue. All the other use cases are really just to pass a structure containing a packet from function to function. For that purpose we should be able to just pass a much smaller structure in a stack argument and only allocate an skbuff when we need to enqueue. In cases where we don't ever queue a packet we might never need to allocate any skbuff-- this includes pure ACKs and packets that end up being dropped. But even more than that, if a received packet generates a TX packet (like a SYN causes a SYN-ACK) then we might even be able to just recycle the received packet and avoid needing any skbuff allocation on transmit (XDP_TX already does this in a limited context)-- this could be a win for handling SYN attacks, for instance.
Also, since we don't queue on the socket buffer for UDP, it's conceivable we could avoid skbuffs in an expedited UDP TX path. Currently, nearly the whole stack depends on packets always being passed in skbuffs; however, __skb_flow_dissect is an interesting exception, as it can handle packets passed in either an skbuff or just a void *-- so we know that this "dual mode" is at least possible. Trying to retrain the whole stack to be able to handle both skbuffs and raw pages is probably untenable at this point, but selectively augmenting some critical performance functions for dual mode (ip_rcv, tcp_rcv, udp_rcv functions for instance) might work. Thanks, Tom > Bye, > Hannes >
Re: Initial thoughts on TXDP
On 01.12.2016 21:13, Sowmini Varadhan wrote: > On (12/01/16 11:05), Tom Herbert wrote: >> >> Polling does not necessarily imply that networking monopolizes the CPU >> except when the CPU is otherwise idle. Presumably the application >> drives the polling when it is ready to receive work. > > I'm not grokking that- "if the cpu is idle, we want to busy-poll > and make it 0% idle"? Keeping the CPU 0% idle has all sorts > of issues, see slide 20 of > http://www.slideshare.net/shemminger/dpdk-performance > >>> and one other critical difference from the hot-potato-forwarding >>> model (the sort of OVS model that DPDK etc might arguably be a fit for) >>> does not apply: in order to figure out the ethernet and IP headers >>> in the response correctly at all times (in the face of things like VRRP, >>> gw changes, gw's mac addr changes etc) the application should really >>> be listening on NETLINK sockets for modifications to the networking >>> state - again points to needing a select() socket set where you can >>> have both the I/O fds and the netlink socket, >>> >> I would think that that management would not be implemented in a >> fast path processing thread for an application. > > sure, but my point was that *XDP and other stack-bypass methods need > to provide a select()able socket: when your use-case is not about just > networking, you have to snoop on changes to the control plane, and update > your data path. In the OVS case (pure networking) the OVS control plane > updates are intrinsic to OVS. For the rest of the request/response world, > we need a select()able socket set to do this elegantly (not really > possible in DPDK, for example) Busy-poll on steroids is what Windows does by mapping the user space "doorbell" into a vDSO and letting user space loop on that, maybe with MWAIT/MONITOR. The interesting thing is that you can map other events to this notification event, too. It sounds like a usable idea to me and resembles what we already do with futexes. Bye, Hannes
Re: Initial thoughts on TXDP
Side note: On 01.12.2016 20:51, Tom Herbert wrote: >> > E.g. "mini-skb": Even if we assume that this provides a speedup >> > (where does that come from? should make no difference if a 32 or >> > 320 byte buffer gets allocated). >> > > It's the zero'ing of three cache lines. I believe we talked about that > at netdev. Jesper and I played with that again very recently: https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c#L590 In micro-benchmarks we saw a pretty good speed-up not using the rep stosb generated by the gcc builtin but plain movq's. Probably the cost model for __builtin_memset in gcc is wrong? When Jesper is free we wanted to benchmark this and maybe come up with an arch-specific way of clearing if it turns out to really improve throughput. SIMD instructions seem even faster but the kernel_fpu_begin/end() kill all the benefits. Bye, Hannes
Re: Initial thoughts on TXDP
On Thu, Dec 1, 2016 at 1:47 PM, Rick Jones wrote: > On 12/01/2016 12:18 PM, Tom Herbert wrote: >> >> On Thu, Dec 1, 2016 at 11:48 AM, Rick Jones wrote: >>> >>> Just how much per-packet path-length are you thinking will go away under >>> the >>> likes of TXDP? It is admittedly "just" netperf but losing TSO/GSO does >>> some >>> non-trivial things to effective overhead (service demand) and so >>> throughput: >>> >> For plain in-order TCP packets I believe we should be able to process >> each packet at nearly the same speed as GRO. Most of the protocol >> processing we do between GRO and the stack is the same; the >> differences are that we need to do a connection lookup in the stack >> path (note we now do this in UDP GRO and that hasn't shown up as a >> major hit). We also need to consider enqueue/dequeue on the socket, >> which is a major reason to try for lockless sockets in this instance. > > > So waving hands a bit, and taking the service demand for the GRO-on receive > test in my previous message (860 ns/KB), that would be ~ (1448/1024)*860 or > ~1.216 usec of CPU time per TCP segment, including ACK generation which > unless an explicit ACK-avoidance heuristic a la HP-UX 11/Solaris 2 is put in > place would be for every-other segment. Etc etc. > >> Sure, but try running something that emulates a more realistic workload >> than a TCP stream, like an RR test with relatively small payload and many >> connections. > > > That is a good point, which of course is why the RR tests are there in > netperf :) Don't get me wrong, I *like* seeing path-length reductions. What > would you posit is a relatively small payload? The promotion of IR10 > suggests that perhaps 14KB or so is sufficiently common so I'll grasp at > that as the length of a piece of string: > We have to consider both request size and response size in RPC. Presumably, something like a memcache server is mostly serving data as opposed to reading it, so we are looking at receiving much smaller packets than are being sent.
Requests are going to be quite small, say 100 bytes, and unless we are doing a significant amount of pipelining on connections GRO would rarely kick in. Response size will have a lot of variability, anything from a few kilobytes up to a megabyte. I'm sorry I can't be more specific; this is an artifact of datacenters that have 100s of different applications and communication patterns. Maybe 100b request size and 8K, 16K, 64K response sizes might be good for a test. > stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t > TCP_RR -- -P 12867 -r 128,14K > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET > to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0 > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote > bytes bytes bytes bytes secs. per sec % S % U us/Tr us/Tr > > 16384 87380 128 14336 10.00 8118.31 1.57 -1.00 46.410 -1.000 > 16384 87380 > stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off > stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t > TCP_RR -- -P 12867 -r 128,14K > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET > to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0 > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote > bytes bytes bytes bytes secs. per sec % S % U us/Tr us/Tr > > 16384 87380 128 14336 10.00 5837.35 2.20 -1.00 90.628 -1.000 > 16384 87380 > > So, losing GRO doubled the service demand. I suppose I could see cutting > path-length in half based on the things you listed which would be bypassed? > > I'm sure mileage will vary with different NICs and CPUs. The ones used here > happened to be to hand. > This is also biased because you're using a single connection, but is consistent with data we've seen in the past.
To be clear, I'm not saying GRO is bad; the fact that GRO has such a visible impact in your test means that the GRO path is significantly more efficient. Closing the gap seen in your numbers would be a benefit; it would mean we have improved per-packet processing. Tom > happy benchmarking, > > rick > > Just to get a crude feel for sensitivity, doubling to 28K unsurprisingly > goes to more than doubling, and halving to 7K narrows the delta: > > stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t > TCP_RR -- -P 12867 -r 128,28K > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET > to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0 > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote > bytes bytes bytes bytes secs. per sec %
Re: Initial thoughts on TXDP
On 12/01/2016 12:18 PM, Tom Herbert wrote: On Thu, Dec 1, 2016 at 11:48 AM, Rick Jones wrote: Just how much per-packet path-length are you thinking will go away under the likes of TXDP? It is admittedly "just" netperf but losing TSO/GSO does some non-trivial things to effective overhead (service demand) and so throughput: For plain in-order TCP packets I believe we should be able to process each packet at nearly the same speed as GRO. Most of the protocol processing we do between GRO and the stack is the same; the differences are that we need to do a connection lookup in the stack path (note we now do this in UDP GRO and that hasn't shown up as a major hit). We also need to consider enqueue/dequeue on the socket, which is a major reason to try for lockless sockets in this instance. So waving hands a bit, and taking the service demand for the GRO-on receive test in my previous message (860 ns/KB), that would be ~ (1448/1024)*860 or ~1.216 usec of CPU time per TCP segment, including ACK generation which unless an explicit ACK-avoidance heuristic a la HP-UX 11/Solaris 2 is put in place would be for every-other segment. Etc etc. Sure, but try running something that emulates a more realistic workload than a TCP stream, like an RR test with relatively small payload and many connections. That is a good point, which of course is why the RR tests are there in netperf :) Don't get me wrong, I *like* seeing path-length reductions. What would you posit is a relatively small payload? The promotion of IR10 suggests that perhaps 14KB or so is sufficiently common so I'll grasp at that as the length of a piece of string: stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_RR -- -P 12867 -r 128,14K MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0 Local /Remote Socket Size Request Resp. Elapsed Trans.
CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % U us/Tr us/Tr 16384 87380 128 14336 10.00 8118.31 1.57 -1.00 46.410 -1.000 16384 87380 stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_RR -- -P 12867 -r 128,14K MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0 Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % U us/Tr us/Tr 16384 87380 128 14336 10.00 5837.35 2.20 -1.00 90.628 -1.000 16384 87380 So, losing GRO doubled the service demand. I suppose I could see cutting path-length in half based on the things you listed which would be bypassed? I'm sure mileage will vary with different NICs and CPUs. The ones used here happened to be to hand. happy benchmarking, rick Just to get a crude feel for sensitivity, doubling to 28K unsurprisingly goes to more than doubling, and halving to 7K narrows the delta: stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_RR -- -P 12867 -r 128,28K MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0 Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs.
per sec % S % U us/Tr us/Tr 16384 87380 128 28672 10.00 6732.32 1.79 -1.00 63.819 -1.000 16384 87380 stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_RR -- -P 12867 -r 128,28K MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0 Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % U us/Tr us/Tr 16384 87380 128 28672 10.00 3780.47 2.32 -1.00 147.280 -1.000 16384 87380 stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_RR -- -P 12867 -r 128,7K MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0 Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % U us/Tr us/Tr 16384 87380
Re: Initial thoughts on TXDP
On Thu, Dec 1, 2016 at 12:13 PM, Sowmini Varadhan wrote: > On (12/01/16 11:05), Tom Herbert wrote: >> >> Polling does not necessarily imply that networking monopolizes the CPU >> except when the CPU is otherwise idle. Presumably the application >> drives the polling when it is ready to receive work. > > I'm not grokking that- "if the cpu is idle, we want to busy-poll > and make it 0% idle"? Keeping the CPU 0% idle has all sorts > of issues, see slide 20 of > http://www.slideshare.net/shemminger/dpdk-performance > >> > and one other critical difference from the hot-potato-forwarding >> > model (the sort of OVS model that DPDK etc might arguably be a fit for) >> > does not apply: in order to figure out the ethernet and IP headers >> > in the response correctly at all times (in the face of things like VRRP, >> > gw changes, gw's mac addr changes etc) the application should really >> > be listening on NETLINK sockets for modifications to the networking >> > state - again points to needing a select() socket set where you can >> > have both the I/O fds and the netlink socket, >> > >> I would think that that management would not be implemented in a >> fast path processing thread for an application. > > sure, but my point was that *XDP and other stack-bypass methods need > to provide a select()able socket: when your use-case is not about just > networking, you have to snoop on changes to the control plane, and update > your data path. In the OVS case (pure networking) the OVS control plane > updates are intrinsic to OVS. For the rest of the request/response world, > we need a select()able socket set to do this elegantly (not really > possible in DPDK, for example) > I'm not sure that TXDP can be reconciled to help OVS. The point of TXDP is to drive applications closer to bare-metal performance; as I mentioned, this is only going to be worth it if the fast path can be kept simple and not complicated by a requirement for generalization.
It seems like the second we put OVS in we're doubling the data path and accepting the performance consequences of a complex path anyway. TXDP can't cover the whole system (any more than DPDK can) and needs to work in concert with other mechanisms-- the key is how to steer the work amongst the CPUs. For instance, if a latency-critical thread is running on some CPU we either need a dedicated queue for the thread's connections (e.g. ntuple filtering or aRFS support) or we need a fast way to move unrelated packets received on a queue processed by that CPU to other CPUs (less efficient, but no special HW support is needed either). Tom > >> The *SOs are always an interesting question. They make for great >> benchmarks, but in real life the amount of benefit is somewhat >> unclear. Under the wrong conditions, like all cwnds have collapsed or > > I think Rick's already bringing up this one. > > --Sowmini >
Re: Initial thoughts on TXDP
On Thu, Dec 1, 2016 at 11:48 AM, Rick Jones wrote: > On 12/01/2016 11:05 AM, Tom Herbert wrote: >> >> For the GSO and GRO the rationale is that performing the extra SW >> processing to do the offloads is significantly less expensive than >> running each packet through the full stack. This is true in a >> multi-layered generalized stack. In TXDP, however, we should be able >> to optimize the stack data path such that that would no longer be >> true. For instance, if we can process the packets received on a >> connection quickly enough so that it's about the same or just a little >> more costly than GRO processing then we might bypass GRO entirely. >> TSO is probably still relevant in TXDP since it reduces overheads >> processing TX in the device itself. > > > Just how much per-packet path-length are you thinking will go away under the > likes of TXDP? It is admittedly "just" netperf but losing TSO/GSO does some > non-trivial things to effective overhead (service demand) and so throughput: > For plain in-order TCP packets I believe we should be able to process each packet at nearly the same speed as GRO. Most of the protocol processing we do between GRO and the stack is the same; the differences are that we need to do a connection lookup in the stack path (note we now do this in UDP GRO and that hasn't shown up as a major hit). We also need to consider enqueue/dequeue on the socket, which is a major reason to try for lockless sockets in this instance.
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P > 12867 > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to > np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo > Recv Send Send Utilization Service Demand > Socket Socket Message Elapsed Send Recv Send Recv > Size Size Size Time Throughput local remote local remote > bytes bytes bytes secs. 10^6bits/s % S % U us/KB us/KB > > 87380 16384 16384 10.00 9260.24 2.02 -1.00 0.428 -1.000 > stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 tso off gso off > stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P > 12867 > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to > np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo > Recv Send Send Utilization Service Demand > Socket Socket Message Elapsed Send Recv Send Recv > Size Size Size Time Throughput local remote local remote > bytes bytes bytes secs. 10^6bits/s % S % U us/KB us/KB > > 87380 16384 16384 10.00 5621.82 4.25 -1.00 1.486 -1.000 > > And that is still with the stretch-ACKs induced by GRO at the receiver. > Sure, but try running something that emulates a more realistic workload than a TCP stream, like an RR test with relatively small payload and many connections.
> Losing GRO has quite similar results: > stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t > TCP_MAERTS -- -P 12867 > MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to > np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo > Recv Send Send Utilization Service Demand > Socket Socket Message Elapsed Recv Send Recv Send > Size Size Size Time Throughput local remote local remote > bytes bytes bytes secs. 10^6bits/s % S % U us/KB us/KB > > 87380 16384 16384 10.00 9154.02 4.00 -1.00 0.860 -1.000 > stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off > > stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t > TCP_MAERTS -- -P 12867 > MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to > np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo > Recv Send Send Utilization Service Demand > Socket Socket Message Elapsed Recv Send Recv Send > Size Size Size Time Throughput local remote local remote > bytes bytes bytes secs. 10^6bits/s % S % U us/KB us/KB > > 87380 16384 16384 10.00 4212.06 5.36 -1.00 2.502 -1.000 > > I'm sure there is a very non-trivial "it depends" component here - netperf > will get the peak benefit from *SO and so one will see the peak difference > in service demands - but even if one gets only 6 segments per *SO that is a > lot of path-length to make up. > True, but I think there's a lot of path we'll be able to cut out. In this mode we don't need iptables, Netfilter, input route, IPvlan check, or other similar lookups. Once we've successfully matched an established TCP state, anything related to policy on both TX and RX for that connection is inferred from the state. We want the processing path in this case to be concerned with just protocol processing an
Re: Initial thoughts on TXDP
On (12/01/16 11:05), Tom Herbert wrote: > > Polling does not necessarily imply that networking monopolizes the CPU > except when the CPU is otherwise idle. Presumably the application > drives the polling when it is ready to receive work. I'm not grokking that- "if the cpu is idle, we want to busy-poll and make it 0% idle"? Keeping the CPU 0% idle has all sorts of issues, see slide 20 of http://www.slideshare.net/shemminger/dpdk-performance > > and one other critical difference from the hot-potato-forwarding > > model (the sort of OVS model that DPDK etc might arguably be a fit for) > > does not apply: in order to figure out the ethernet and IP headers > > in the response correctly at all times (in the face of things like VRRP, > > gw changes, gw's mac addr changes etc) the application should really > > be listening on NETLINK sockets for modifications to the networking > > state - again points to needing a select() socket set where you can > > have both the I/O fds and the netlink socket, > > > I would think that that management would not be implemented in a > fast path processing thread for an application. sure, but my point was that *XDP and other stack-bypass methods need to provide a select()able socket: when your use-case is not about just networking, you have to snoop on changes to the control plane, and update your data path. In the OVS case (pure networking) the OVS control plane updates are intrinsic to OVS. For the rest of the request/response world, we need a select()able socket set to do this elegantly (not really possible in DPDK, for example) > The *SOs are always an interesting question. They make for great > benchmarks, but in real life the amount of benefit is somewhat > unclear. Under the wrong conditions, like all cwnds have collapsed or I think Rick's already bringing up this one. --Sowmini
Re: Initial thoughts on TXDP
On Wed, Nov 30, 2016 at 6:44 PM, Florian Westphal wrote:
> Tom Herbert wrote:
>> Posting for discussion
>
> Warning: You are not going to like this reply...
>
>> Now that XDP seems to be nicely gaining traction
>
> Yes, I regret to see that. XDP seems useful to create impressive
> benchmark numbers (and little else).
>
> I will send a separate email to keep that flamebait part away from
> this thread though.
>
> [..]
>
>> addresses the performance gap for stateless packet processing). The
>> problem statement is analogous to that which we had for XDP, namely
>> can we create a mode in the kernel that offers the same performance
>> that is seen with L4 protocols over kernel bypass
>
> Why? If you want to bypass the kernel, then DO IT.

I don't want kernel bypass. I want the Linux stack to provide something
close to bare-metal performance for TCP/UDP for some latency-sensitive
applications we run.

> There is nothing wrong with DPDK. The ONLY problem is that the kernel
> does not offer a userspace fastpath like Windows RIO or FreeBSD's netmap.
>
> But even without that it's not difficult to get DPDK running.

That is not true for large-scale deployments. Also, TXDP is about
accelerating transport layers like TCP; DPDK is just the interface from
userspace to device queues. We would need a whole lot more than DPDK, a
userspace TCP/IP stack for instance, to consider that we have equivalent
functionality.

> (T)XDP seems born from spite, not technical rationale.
> IMO everyone would be better off if we'd just have something netmap-esque
> in the network core (also see below).
>
>> I imagine there are a few reasons why userspace TCP stacks can get
>> good performance:
>>
>> - Spin polling (we already can do this in kernel)
>> - Lockless, I would assume that threads typically have exclusive
>>   access to a queue pair for a connection
>> - Minimal TCP/IP stack code
>> - Zero copy TX/RX
>> - Light weight structures for queuing
>> - No context switches
>> - Fast data path for in order, uncongested flows
>> - Silo'ing between application and device queues
>
> I only see two cases:
>
> 1. Many applications running (standard OS model) that need to
>    send/receive data
>    -> Linux network stack
>
> 2. Single dedicated application that does all rx/tx
>    -> no queueing needed (can block network rx completely if receiver
>       is slow)
>    -> no allocations needed at runtime at all
>    -> no locking needed (single producer, single consumer)
>
> If you have #2 and you need to be fast etc. then full userspace
> bypass is fine. We will -- no matter what we do in kernel -- never
> be able to keep up with the speed you can get with that
> because we have to deal with #1. (Plus the ease of use/freedom of doing
> userspace programming.) And yes, I think that #2 is something we
> should address solely by providing netmap or something similar.
>
> But even considering #1 there are ways to speed the stack up:
>
> I'd kill RPF/RPS so we don't have IPIs anymore and the skb stays
> on the same cpu up to where it gets queued (ofo or rx queue).

The reference to RPS and RFS is only to move packets off the hot CPU that
are not in the datapath. For instance, if we get a FIN for a connection we
can put it into a slow path, since FIN processing is not latency sensitive
but may take a considerable amount of CPU to process.

> Then we could tell the driver what happened with the skb it gave us, e.g.
> we can tell the driver it can do immediate page/dma reuse, for example
> in the pure-ack case as opposed to the skb sitting in the ofo or receive
> queue.
>
> (RPS/RFS functionality could still be provided via one of the gazillion
> hooks we now have in the stack for those that need/want it), so I do
> not think we would lose functionality.
>
>> - Call into TCP/IP stack with page data directly from driver-- no
>> skbuff allocation or interface. This is essentially provided by the
>> XDP API although we would need to generalize the interface to call
>> stack functions (I previously posted patches for that). We will also
>> need a new action, XDP_HELD?, that indicates the XDP function held the
>> packet (put on a socket for instance).
>
> Seems this will not work at all with the planned page pool thing when
> pages start to be held indefinitely.

The processing needed to gift a page to the stack shouldn't be very
different than what a driver needs to do when an skbuff is created after
XDP_PASS is returned. We probably would want to pass additional metadata,
things like checksum and VLAN information from the receive descriptor, to
the stack. A callback can be included for the case where the stack decides
it wants to hold on to the buffer and the driver needs to do dma_sync etc.

> You can also never get even close to userspace offload stacks once you
> need/do this; allocations in the hotpath are too expensive.
>
> [..]
>
>> - When we transmit, it would be nice to go straight from TCP
>> connection to an XDP device queue and in particular skip the qdisc
>> layer. This follows the principle of low latency being the first
>> criteria.
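[None of this interface exists yet; as a strawman, the driver-to-stack handoff with descriptor metadata and a hold callback described above might look something like the following. Every name here is hypothetical.]

```c
/* Strawman for the driver->stack handoff: the stack receives raw page
 * data plus descriptor metadata; if it decides to keep the buffer
 * (e.g. data queued on a socket) it invokes a driver callback so the
 * driver can dma_sync and stop recycling the page.  All names are
 * hypothetical, not an existing API. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct txdp_rx_meta {
    uint32_t csum;          /* checksum info from the RX descriptor */
    uint16_t vlan_tci;      /* VLAN tag, if any */
    /* callback: driver relinquishes the page to the stack; 0 = ok */
    int (*hold)(void *drv_priv, void *page);
    void *drv_priv;
};

enum txdp_verdict { TXDP_PASS, TXDP_HELD, TXDP_DROP };

/* Stack-side receive hook: decide whether to keep the buffer. */
enum txdp_verdict txdp_rx(void *page, size_t len, struct txdp_rx_meta *meta)
{
    if (len == 0)
        return TXDP_DROP;
    /* data would be queued on a socket: take ownership of the page */
    if (meta->hold && meta->hold(meta->drv_priv, page) == 0)
        return TXDP_HELD;
    return TXDP_PASS;       /* fall back to the normal skbuff path */
}
```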
Re: Initial thoughts on TXDP
On 12/01/2016 11:05 AM, Tom Herbert wrote:
> For the GSO and GRO the rationale is that performing the extra SW
> processing to do the offloads is significantly less expensive than
> running each packet through the full stack. This is true in a
> multi-layered generalized stack. In TXDP, however, we should be able
> to optimize the stack data path such that this would no longer be
> true. For instance, if we can process the packets received on a
> connection quickly enough so that it's about the same or just a little
> more costly than GRO processing then we might bypass GRO entirely. TSO
> is probably still relevant in TXDP since it reduces overheads
> processing TX in the device itself.

Just how much per-packet path-length are you thinking will go away under
the likes of TXDP? It is admittedly "just" netperf, but losing TSO/GSO
does some non-trivial things to effective overhead (service demand) and
so throughput:

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P 12867
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      9260.24   2.02     -1.00    0.428   -1.000

stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 tso off gso off

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P 12867
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      5621.82   4.25     -1.00    1.486   -1.000

And that is still with the stretch-ACKs induced by GRO at the receiver.

Losing GRO has quite similar results:

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_MAERTS -- -P 12867
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Recv     Send     Recv    Send
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      9154.02   4.00     -1.00    0.860   -1.000

stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_MAERTS -- -P 12867
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Recv     Send     Recv    Send
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      4212.06   5.36     -1.00    2.502   -1.000

I'm sure there is a very non-trivial "it depends" component here - netperf
will get the peak benefit from *SO and so one will see the peak difference
in service demands - but even if one gets only 6 segments per *SO that is
a lot of path-length to make up.

4.4 kernel, BE3 NICs ... E5-2640 0 @ 2.50GHz

And even if one does have the CPU cycles to burn, so to speak, the effect
on power consumption needs to be included in the calculus.

happy benchmarking,

rick jones
Re: Initial thoughts on TXDP
On Thu, Dec 1, 2016 at 5:55 AM, Sowmini Varadhan wrote:
> On (11/30/16 14:54), Tom Herbert wrote:
>>
>> Posting for discussion
> :
>> One simplifying assumption we might make is that TXDP is primarily for
>> optimizing latency, specifically request/response type operations
>> (think HPC, HFT, flash server, or other tightly coupled communications
>> within the datacenter). Notably, I don't think that saving CPU is as
>> relevant to TXDP; in fact we have already seen that CPU utilization
>> can be traded off for lower latency via spin polling. Similar to XDP
>> though, we might assume that single-CPU performance is relevant (i.e.
>> on a cache server we'd like to spin as few CPUs as needed and no more
>> to handle the load and maintain throughput and latency requirements).
>> High throughput (ops/sec) and low variance should be side effects of
>> any design.
>
> I'm sending this with some hesitation (esp as the flamebait threads
> are starting up - I have no interest in getting into food-fights!!),
> because it sounds like the HPC/request-response use-case you have in
> mind (HTTP based?) is very likely different than the DB use-cases in
> my environment (RDBMS, cluster req/responses). But to provide some
> perspective from the latter use-case..
>
> We also have request-response transactions, but CPU utilization
> is extremely critical - many DB operations are highly CPU bound,
> so it's not acceptable for the network to hog CPU util by polling.
> In that sense, the DB req/resp model has a lot of overlap with the
> Suricata use-case.

Hi Sowmini,

Polling does not necessarily imply that networking monopolizes the CPU
except when the CPU is otherwise idle. Presumably the application drives
the polling when it is ready to receive work.

> Also we need a select()able socket, because we have to deal with
> input from several sources - network I/O, but also disk and
> file-system I/O. So we need to make sure there is no starvation,
> and that we multiplex between I/O sources efficiently.

Yes, that is a requirement.

> and one other critical difference from the hot-potato-forwarding
> model (the sort of OVS model that DPDK etc might arguably be a fit for)
> does not apply: in order to figure out the ethernet and IP headers
> in the response correctly at all times (in the face of things like VRRP,
> gw changes, gw's mac addr changes etc) the application should really
> be listening on NETLINK sockets for modifications to the networking
> state - again points to needing a select() socket set where you can
> have both the I/O fds and the netlink socket,

I would think that management would not be implemented in a fast-path
processing thread for an application.

> For all of these reasons, we are investigating approaches similar to
> Suricata - PF_PACKET with TPACKET_V2 (since we need both Tx and Rx,
> and so far, TPACKET_V2 seems "good enough"). FWIW, we also took
> a look at netmap and so far have not seen any significant benefits
> to netmap over PF_PACKET.. investigation still ongoing.
>
>> - Call into TCP/IP stack with page data directly from driver-- no
>> skbuff allocation or interface. This is essentially provided by the
>
> I'm curious - one thing that came out of the IPsec evaluation
> is that TSO is very valuable for performance, and this is most easily
> accessed via the sk_buff interfaces. I have not had a chance
> to review your patches yet, but isn't that an issue if you bypass
> sk_buff usage? But I should probably go and review your patchset..

The *SOs are always an interesting question. They make for great
benchmarks, but in real life the amount of benefit is somewhat unclear.
Under the wrong conditions, like when all cwnds have collapsed or
received packets for flows are small or so mixed that we can't get much
aggregation, SO provides no benefit and in fact becomes overhead.
Relying on any amount of segmentation offload in a real deployment is
risky; for instance we've seen some video servers deployed that were
able to serve line rate at 90% CPU in testing (SO was effective) but
ended up needing 110% CPU in deployment when a hiccup caused all cwnds
to collapse. The moral of the story is to provision your servers
assuming the worst-case conditions, which would render opportunistic
offloads useless.

For GSO and GRO the rationale is that performing the extra SW processing
to do the offloads is significantly less expensive than running each
packet through the full stack. This is true in a multi-layered
generalized stack. In TXDP, however, we should be able to optimize the
stack data path such that this would no longer be true. For instance, if
we can process the packets received on a connection quickly enough so
that it's about the same or just a little more costly than GRO
processing then we might bypass GRO entirely. TSO is probably still
relevant in TXDP since it reduces the overhead of processing TX in the
device itself.

Tom

> --Sowmini
Re: Initial thoughts on TXDP
On (11/30/16 14:54), Tom Herbert wrote:
>
> Posting for discussion
:
> One simplifying assumption we might make is that TXDP is primarily for
> optimizing latency, specifically request/response type operations
> (think HPC, HFT, flash server, or other tightly coupled communications
> within the datacenter). Notably, I don't think that saving CPU is as
> relevant to TXDP; in fact we have already seen that CPU utilization
> can be traded off for lower latency via spin polling. Similar to XDP
> though, we might assume that single-CPU performance is relevant (i.e.
> on a cache server we'd like to spin as few CPUs as needed and no more
> to handle the load and maintain throughput and latency requirements).
> High throughput (ops/sec) and low variance should be side effects of
> any design.

I'm sending this with some hesitation (esp as the flamebait threads are
starting up - I have no interest in getting into food-fights!!), because
it sounds like the HPC/request-response use-case you have in mind (HTTP
based?) is very likely different than the DB use-cases in my environment
(RDBMS, cluster req/responses). But to provide some perspective from the
latter use-case..

We also have request-response transactions, but CPU utilization is
extremely critical - many DB operations are highly CPU bound, so it's
not acceptable for the network to hog CPU util by polling. In that
sense, the DB req/resp model has a lot of overlap with the Suricata
use-case.

Also we need a select()able socket, because we have to deal with input
from several sources - network I/O, but also disk and file-system I/O.
So we need to make sure there is no starvation, and that we multiplex
between I/O sources efficiently.

And one other critical difference from the hot-potato-forwarding model
(the sort of OVS model that DPDK etc might arguably be a fit for) does
not apply: in order to figure out the ethernet and IP headers in the
response correctly at all times (in the face of things like VRRP, gw
changes, gw's mac addr changes etc) the application should really be
listening on NETLINK sockets for modifications to the networking state -
which again points to needing a select() socket set where you can have
both the I/O fds and the netlink socket.

For all of these reasons, we are investigating approaches similar to
Suricata - PF_PACKET with TPACKET_V2 (since we need both Tx and Rx, and
so far, TPACKET_V2 seems "good enough"). FWIW, we also took a look at
netmap and so far have not seen any significant benefits to netmap over
PF_PACKET.. investigation still ongoing.

> - Call into TCP/IP stack with page data directly from driver-- no
> skbuff allocation or interface. This is essentially provided by the

I'm curious - one thing that came out of the IPsec evaluation is that
TSO is very valuable for performance, and this is most easily accessed
via the sk_buff interfaces. I have not had a chance to review your
patches yet, but isn't that an issue if you bypass sk_buff usage? But I
should probably go and review your patchset..

--Sowmini
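[The fd set described above is expressible with ordinary kernel primitives today: one epoll set carrying both the data-path fds and an rtnetlink socket subscribed to route/neighbour/link change notifications. A sketch, assuming Linux, with error handling abbreviated:]

```c
/* Sketch: one epoll set carrying both I/O fds and an rtnetlink socket,
 * so the data path can react to route/neigh/link changes (VRRP
 * failover, gateway MAC change, ...) without a side channel.
 * Error handling is abbreviated. */
#include <assert.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Create the epoll set and the monitoring netlink socket.
 * Returns 0 on success and fills in *epfd and *nlfd. */
int make_ctrl_set(int *epfd, int *nlfd)
{
    struct sockaddr_nl snl;
    struct epoll_event ev;

    *epfd = epoll_create1(0);
    if (*epfd < 0)
        return -1;

    *nlfd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
    if (*nlfd < 0)
        return -1;

    /* subscribe to routing, neighbour, and link change notifications */
    memset(&snl, 0, sizeof(snl));
    snl.nl_family = AF_NETLINK;
    snl.nl_groups = RTMGRP_IPV4_ROUTE | RTMGRP_NEIGH | RTMGRP_LINK;
    if (bind(*nlfd, (struct sockaddr *)&snl, sizeof(snl)) < 0)
        return -1;

    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;
    ev.data.fd = *nlfd;
    return epoll_ctl(*epfd, EPOLL_CTL_ADD, *nlfd, &ev);
}
```

The application's network and disk fds would be added to the same epoll set, so one epoll_wait() loop multiplexes all sources and picks up control-plane changes as they happen.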
Re: Initial thoughts on TXDP
Tom Herbert wrote:
> Posting for discussion

Warning: You are not going to like this reply...

> Now that XDP seems to be nicely gaining traction

Yes, I regret to see that. XDP seems useful to create impressive
benchmark numbers (and little else).

I will send a separate email to keep that flamebait part away from this
thread though.

[..]

> addresses the performance gap for stateless packet processing). The
> problem statement is analogous to that which we had for XDP, namely
> can we create a mode in the kernel that offers the same performance
> that is seen with L4 protocols over kernel bypass

Why? If you want to bypass the kernel, then DO IT.

There is nothing wrong with DPDK. The ONLY problem is that the kernel
does not offer a userspace fastpath like Windows RIO or FreeBSD's netmap.

But even without that it's not difficult to get DPDK running.

(T)XDP seems born from spite, not technical rationale. IMO everyone
would be better off if we'd just have something netmap-esque in the
network core (also see below).

> I imagine there are a few reasons why userspace TCP stacks can get
> good performance:
>
> - Spin polling (we already can do this in kernel)
> - Lockless, I would assume that threads typically have exclusive
>   access to a queue pair for a connection
> - Minimal TCP/IP stack code
> - Zero copy TX/RX
> - Light weight structures for queuing
> - No context switches
> - Fast data path for in order, uncongested flows
> - Silo'ing between application and device queues

I only see two cases:

1. Many applications running (standard OS model) that need to
   send/receive data
   -> Linux network stack

2. Single dedicated application that does all rx/tx
   -> no queueing needed (can block network rx completely if receiver
      is slow)
   -> no allocations needed at runtime at all
   -> no locking needed (single producer, single consumer)

If you have #2 and you need to be fast etc. then full userspace bypass
is fine. We will -- no matter what we do in kernel -- never be able to
keep up with the speed you can get with that, because we have to deal
with #1 (plus the ease of use/freedom of doing userspace programming).
And yes, I think that #2 is something we should address solely by
providing netmap or something similar.

But even considering #1 there are ways to speed the stack up:

I'd kill RPF/RPS so we don't have IPIs anymore and the skb stays on the
same cpu up to where it gets queued (ofo or rx queue).

Then we could tell the driver what happened with the skb it gave us,
e.g. we can tell the driver it can do immediate page/dma reuse, for
example in the pure-ack case, as opposed to the skb sitting in the ofo
or receive queue.

(RPS/RFS functionality could still be provided via one of the gazillion
hooks we now have in the stack for those that need/want it), so I do
not think we would lose functionality.

> - Call into TCP/IP stack with page data directly from driver-- no
> skbuff allocation or interface. This is essentially provided by the
> XDP API although we would need to generalize the interface to call
> stack functions (I previously posted patches for that). We will also
> need a new action, XDP_HELD?, that indicates the XDP function held the
> packet (put on a socket for instance).

Seems this will not work at all with the planned page pool thing when
pages start to be held indefinitely.

You can also never get even close to userspace offload stacks once you
need/do this; allocations in the hotpath are too expensive.

[..]

> - When we transmit, it would be nice to go straight from TCP
> connection to an XDP device queue and in particular skip the qdisc
> layer. This follows the principle of low latency being the first
> criteria.

It will never be lower than userspace offloads, so anyone with serious
"low latency" requirements (trading) will use that instead. What's your
target audience?

> longer latencies in effect which likely means TXDP isn't appropriate
> in such cases. BQL is also out, however we would want the TX
> batching of XDP.

Right, congestion control and buffer bloat are totally overrated .. 8-(

So far I haven't seen anything that would need XDP at all. What makes
it technically impossible to apply these miracles to the stack...?

E.g. "mini-skb": even if we assume that this provides a speedup (where
does that come from? it should make no difference whether a 32 or 320
byte buffer gets allocated), and if we assume that it's the zeroing of
sk_buff (but iirc it made little to no difference), we could add

    unsigned long skb_extensions[1];

to sk_buff, then move everything not needed for the tcp fastpath (e.g.
secpath, conntrack, nf_bridge, tunnel encap, tc, ...) below that, then
convert accesses to accessors and init it on demand. One could probably
also split cb[] into a smaller fastpath area and another one at the end
that won't be touched at allocation time.

> Miscellaneous
> contemplating that connections/sockets can be bound to particular
> CPUs and that any operations (socket operations, timers, receive
> processing) must occur on
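[A userspace sketch of the skb_extensions idea above: keep only the TCP-fastpath fields in the always-initialized head of the structure, and materialize the cold state through an accessor on first use. The struct layout and names here are illustrative, not the real sk_buff.]

```c
/* Sketch of the "skb_extensions" layout: fastpath fields stay in the
 * zeroed head of the structure; rarely used state (secpath, conntrack,
 * tc, ...) is allocated and zeroed only on first access.  All names
 * are illustrative. */
#include <assert.h>
#include <stdlib.h>

struct skb_ext {            /* cold state, allocated on demand */
    void *secpath;
    void *conntrack;
    void *tc;
};

struct mini_skb {
    unsigned char *data;    /* fastpath fields, zeroed at alloc */
    unsigned int len;
    struct skb_ext *ext;    /* NULL until someone needs cold state */
};

/* Accessor: materialize the extension area on first use. */
struct skb_ext *skb_ext_get(struct mini_skb *skb)
{
    if (!skb->ext)
        skb->ext = calloc(1, sizeof(*skb->ext));
    return skb->ext;
}
```

The fastpath then only ever zeroes the small head; packets that never touch netfilter, IPsec, or tc never pay for that state.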
Initial thoughts on TXDP
Posting for discussion.

Now that XDP seems to be nicely gaining traction we can start to
consider the next logical step, which is to apply the principles of XDP
to accelerating transport protocols in the kernel. For lack of a better
name I'll refer to this as Transport eXpress Data Path, or just TXDP
:-). Pulling off TXDP might not be the most trivial of problems to
solve, but if we can, this may address the performance gap between
kernel bypass and the stack for transport layer protocols (XDP
addresses the performance gap for stateless packet processing). The
problem statement is analogous to that which we had for XDP, namely:
can we create a mode in the kernel that offers the same performance
that is seen with L4 protocols over kernel bypass (e.g. TCP/OpenOnload
or TCP/DPDK), or perhaps something reasonably close to a full HW
offload solution (such as RDMA)?

TXDP is different from XDP in that we are dealing with stateful
protocols and it is part of a full protocol implementation;
specifically, this would be an accelerated mode for transport
connections (e.g. TCP) in the kernel. Also, unlike XDP, we now need to
be concerned with the transmit path (both the application generating
packets as well as protocol-sourced packets like ACKs, retransmits,
clocking out data, etc.). Another distinction is that the user API
needs to be considered; for instance, optimizing the nominal protocol
stack but then using an unmodified socket interface could easily undo
the effects of optimizing the lower layers. This last point actually
implies a nice constraint: if we can't keep the accelerated path
simple, it's probably not worth trying to accelerate.

One simplifying assumption we might make is that TXDP is primarily for
optimizing latency, specifically request/response type operations
(think HPC, HFT, flash server, or other tightly coupled communications
within the datacenter). Notably, I don't think that saving CPU is as
relevant to TXDP; in fact we have already seen that CPU utilization can
be traded off for lower latency via spin polling. Similar to XDP
though, we might assume that single-CPU performance is relevant (i.e.
on a cache server we'd like to spin as few CPUs as needed and no more
to handle the load and maintain throughput and latency requirements).
High throughput (ops/sec) and low variance should be side effects of
any design.

As with XDP, TXDP is _not_ intended to be a completely generic and
transparent solution. The application may be specifically optimized for
use with TXDP (for instance to implement perfect lockless silo'ing). So
TXDP is not going to be for everyone and it should be as minimally
invasive to the rest of the stack as possible.

I imagine there are a few reasons why userspace TCP stacks can get good
performance:

- Spin polling (we already can do this in kernel)
- Lockless, I would assume that threads typically have exclusive
  access to a queue pair for a connection
- Minimal TCP/IP stack code
- Zero copy TX/RX
- Light weight structures for queuing
- No context switches
- Fast data path for in order, uncongested flows
- Silo'ing between application and device queues

Not all of these have cognates in the Linux stack; for instance, we
probably can't entirely eliminate context switches for a userspace
application.

So with that, the components of TXDP might look something like:

RX

- Call into TCP/IP stack with page data directly from driver-- no
  skbuff allocation or interface. This is essentially provided by the
  XDP API, although we would need to generalize the interface to call
  stack functions (I previously posted patches for that). We will also
  need a new action, XDP_HELD?, that indicates the XDP function held
  the packet (put on a socket for instance).
- Perform connection lookup. If we assume the lockless model as
  described below then we should be able to perform lockless connection
  lookup, similar to the work Eric did to optimize UDP lookups for
  tunnel processing.
- Call a function that implements the expedited TCP/IP datapath
  (something like Van Jacobson's famous 80 instructions? :-) ).
- If there is anything funky about the packet or connection state, or
  the TCP connection is not being TXDP accelerated, just return
  XDP_PASS so that the packet follows normal stack processing. Since we
  did connection lookup we could return an early demux also. Since
  we're already in an exception mode, this is where we might want to
  move packet processing to a different CPU (can be done by RPS/RFS).
- If the packet contains new data we can allocate a "mini skbuff"
  (talked about that at netdev) for queuing on the socket.
- If the packet is an ACK we can process it directly without ever
  creating an skbuff.
- There is also the possibility of avoiding the skbuff allocation for
  in-kernel applications. The stream parser might also be taught how to
  deal with raw buffers.
- If we're really ambitious we can also consider putting packets into a
  packet ring for user space, presuming that packets are typically in
  order (might be a little orthogonal to TXDP).

TX

- Norm
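[The connection-lookup step above might be sketched as follows. This is an illustrative userspace model with a plain chained hash and invented names; in-kernel the walk would be RCU-protected, as in Eric's lockless UDP lookup work.]

```c
/* Illustrative 4-tuple connection lookup of the kind the TXDP fast
 * path would do before dispatching to the expedited TCP code.
 * In-kernel this would be an RCU-protected hash walk; here it is a
 * plain chained hash to show the shape.  All names are hypothetical. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CONN_HASH_BITS 8
#define CONN_HASH_SIZE (1u << CONN_HASH_BITS)

struct conn {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
    int accel;              /* TXDP-accelerated? else return XDP_PASS */
    struct conn *next;
};

static struct conn *conn_tbl[CONN_HASH_SIZE];

static uint32_t conn_hash(uint32_t s, uint32_t d, uint16_t sp, uint16_t dp)
{
    uint32_t h = s ^ d ^ ((uint32_t)sp << 16 | dp);
    h *= 0x9e3779b9u;       /* Fibonacci hashing */
    return h >> (32 - CONN_HASH_BITS);
}

void conn_insert(struct conn *c)
{
    uint32_t h = conn_hash(c->saddr, c->daddr, c->sport, c->dport);
    c->next = conn_tbl[h];
    conn_tbl[h] = c;        /* in-kernel: rcu_assign_pointer() */
}

struct conn *conn_lookup(uint32_t s, uint32_t d, uint16_t sp, uint16_t dp)
{
    uint32_t h = conn_hash(s, d, sp, dp);
    for (struct conn *c = conn_tbl[h]; c; c = c->next)
        if (c->saddr == s && c->daddr == d &&
            c->sport == sp && c->dport == dp)
            return c;
    return NULL;            /* miss: fall back to the normal stack */
}
```

A miss (or a hit on a connection not marked accelerated) maps to the XDP_PASS fallback described above.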