Re: [PATCH net-next] udp: under rx pressure, try to condense skbs
From: Eric Dumazet
Date: Wed, 07 Dec 2016 09:19:33 -0800

> From: Eric Dumazet
>
> Under UDP flood, many softirq producers try to add packets to
> UDP receive queue, and one user thread is burning one cpu trying
> to dequeue packets as fast as possible.
>
> Two parts of the per packet cost are :
> - copying payload from kernel space to user space,
> - freeing memory pieces associated with skb.
>
> If socket is under pressure, softirq handler(s) can try to pull in
> skb->head the payload of the packet if it fits.
>
> Meaning the softirq handler(s) can free/reuse the page fragment
> immediately, instead of letting udp_recvmsg() do this hundreds of usec
> later, possibly from another node.
>
>
> Additional gains :
> - We reduce skb->truesize and thus can store more packets per SO_RCVBUF
> - We avoid cache line misses at copyout() time and consume_skb() time,
> and avoid one put_page() with potential alien freeing on NUMA hosts.
>
> This comes at the cost of a copy, bounded to available tail room, which
> is usually small. (We might have to fix GRO_MAX_HEAD which looks bigger
> than necessary)
>
> This patch gave me about 5 % increase in throughput in my tests.
>
> skb_condense() helper could probably used in other contexts.
>
> Signed-off-by: Eric Dumazet

This is isolated to UDP, and would be easy to revert if it causes
problems.

So applied, thanks Eric.
Re: [PATCH net-next] udp: under rx pressure, try to condense skbs
On Thu, 2016-12-08 at 07:36 -0800, Rick Jones wrote:
> On 12/08/2016 07:30 AM, Eric Dumazet wrote:
> > On Thu, 2016-12-08 at 10:46 +0100, Jesper Dangaard Brouer wrote:
> >
> >> Hmmm... I'm not thrilled to have such heuristics, that change memory
> >> behavior when half of the queue size (sk->sk_rcvbuf) is reached.
> >
> > Well, copybreak drivers do that unconditionally, even under no stress at
> > all, you really should complain then.
>
> Isn't that behaviour based (in part?) on the observation/belief that it
> is fewer cycles to copy the small packet into a small buffer than to
> send the larger buffer up the stack and have to allocate and map a
> replacement?

If properly done yes ;)

Some drivers do a copybreak, but throw away the original page frag and
reallocate a fresh one anyway.

For example, with PAGE_SIZE=65536 a page is split in ~32 frags, and
drivers might not bother trying to reuse 1 frag.
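
A note on the frag arithmetic in the message above, as a rough sketch only. The 2048-byte frag size is an assumption, picked because it makes a 65536-byte page yield the ~32 frags mentioned; the helper names are hypothetical and none of this is taken from a real driver.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of fixed-size RX frags carved out of one page. */
static size_t frags_per_page(size_t page_size, size_t frag_size)
{
	assert(frag_size > 0);
	return page_size / frag_size;
}

/* Hypothetical helper: bytes of frag memory thrown away when a driver copies
 * each small packet out (copybreak) but still discards the frag it came from
 * instead of reusing it.
 */
static size_t discarded_frag_bytes(size_t frag_size, size_t small_packets)
{
	return frag_size * small_packets;
}
```

With these assumed sizes, `frags_per_page(65536, 2048)` is 32, and a driver that discards the frag after every copied packet gives up one full frag per packet even though the payload was already copied elsewhere.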
Re: [PATCH net-next] udp: under rx pressure, try to condense skbs
On 12/08/2016 07:30 AM, Eric Dumazet wrote:
> On Thu, 2016-12-08 at 10:46 +0100, Jesper Dangaard Brouer wrote:
>
>> Hmmm... I'm not thrilled to have such heuristics, that change memory
>> behavior when half of the queue size (sk->sk_rcvbuf) is reached.
>
> Well, copybreak drivers do that unconditionally, even under no stress at
> all, you really should complain then.

Isn't that behaviour based (in part?) on the observation/belief that it
is fewer cycles to copy the small packet into a small buffer than to
send the larger buffer up the stack and have to allocate and map a
replacement?

rick jones
Re: [PATCH net-next] udp: under rx pressure, try to condense skbs
On Thu, 2016-12-08 at 10:46 +0100, Jesper Dangaard Brouer wrote:

> Hmmm... I'm not thrilled to have such heuristics, that change memory
> behavior when half of the queue size (sk->sk_rcvbuf) is reached.

Well, copybreak drivers do that unconditionally, even under no stress at
all, you really should complain then.

copybreak is interesting, not only from a performance point of view, but
also for the ability to handle DOS/DDOS : Attackers need to send bigger
packets to eventually force us to consume one page per packet.

My idea (which I already described in the past) is to perform the
(small) copy only in contexts where we know the packet might sit for a
long time in a socket queue, and only if we know we are in stress
conditions.

ACK packets for example do not need copybreak, since they won't be queued
for a long time.

> Most of the win comes from doing a local atomic page-refcnt decrement
> as opposed to doing a remote CPU refcnt-dec. And as you noticed the
> benefit is quite high saving 241 cycles (see [1]). And your patch is
> "using" these cycles to copy the packet instead.

So, just to let you know, I have a patch series which achieves ~100 % perf
increase, without the 2nd queue I envisioned for linux-4.11.

A single thread doing mere recvmsg() system calls can now read ~2Mpps.

This skb_condense() is done before producer cpus are competing using a
busylock array (not a per socket new spinlock, but a shared hashed
array, out of line).

I plan to use it in tcp_add_backlog() to replace the :

	if (!skb->data_len)
		skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));

> This might not be a win in the future. I'm working on a more generic
> solution (page_pool) that (as one objective) targets this remote refcnt.

Very well, when all drivers can use this, we might revert this patch if
proved not beneficial.

But make sure we can hold 1,000,000 pages in skbs stored in ~100,000
TCP/UDP sockets.

Your ideas sound fine in controlled environments, I am sure you will be
able to demonstrate their gains independently of the counter measures we
put in place in the protocol handlers.
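
The "shared hashed array" of busylocks mentioned in the message above could look roughly like the following userspace sketch. Everything here is an assumption made for illustration: the slot count, the multiplicative pointer hash, the use of C11 atomic_flag in place of kernel spinlocks, and all names. The patch series referred to is not shown in this thread and may differ.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BUSYLOCK_LOG   4                     /* hypothetical: 2^4 shared slots */
#define BUSYLOCK_COUNT (1u << BUSYLOCK_LOG)

/* One small array shared by all sockets, instead of one lock per socket. */
static atomic_flag busylocks[BUSYLOCK_COUNT];

static void busylocks_init(void)
{
	for (size_t i = 0; i < BUSYLOCK_COUNT; i++)
		atomic_flag_clear(&busylocks[i]);
}

/* Hash a socket address to a slot; distinct sockets may share a slot. */
static unsigned int busylock_slot(const void *sk)
{
	assert(sk != NULL);
	uint64_t h = (uint64_t)(uintptr_t)sk * 0x9E3779B97F4A7C15ULL;
	return (unsigned int)(h >> (64 - BUSYLOCK_LOG));
}

/* Returns true if the caller now holds the slot for this socket. */
static bool busylock_try_acquire(const void *sk)
{
	return !atomic_flag_test_and_set_explicit(&busylocks[busylock_slot(sk)],
						  memory_order_acquire);
}

static void busylock_release(const void *sk)
{
	atomic_flag_clear_explicit(&busylocks[busylock_slot(sk)],
				   memory_order_release);
}
```

The point of such a layout is that struct sock does not grow by a lock, at the price of occasional false sharing when two busy sockets hash to the same slot.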
Re: [PATCH net-next] udp: under rx pressure, try to condense skbs
On Wed, 07 Dec 2016 09:19:33 -0800 Eric Dumazet wrote:

> From: Eric Dumazet
>
> Under UDP flood, many softirq producers try to add packets to
> UDP receive queue, and one user thread is burning one cpu trying
> to dequeue packets as fast as possible.
>
> Two parts of the per packet cost are :
> - copying payload from kernel space to user space,
> - freeing memory pieces associated with skb.
>
> If socket is under pressure, softirq handler(s) can try to pull in
> skb->head the payload of the packet if it fits.
>
> Meaning the softirq handler(s) can free/reuse the page fragment
> immediately, instead of letting udp_recvmsg() do this hundreds of usec
> later, possibly from another node.
>
>
> Additional gains :
> - We reduce skb->truesize and thus can store more packets per SO_RCVBUF
> - We avoid cache line misses at copyout() time and consume_skb() time,
> and avoid one put_page() with potential alien freeing on NUMA hosts.
>
> This comes at the cost of a copy, bounded to available tail room, which
> is usually small. (We might have to fix GRO_MAX_HEAD which looks bigger
> than necessary)
>
> This patch gave me about 5 % increase in throughput in my tests.

Hmmm... I'm not thrilled to have such heuristics, that change memory
behavior when half of the queue size (sk->sk_rcvbuf) is reached.

Most of the win comes from doing a local atomic page-refcnt decrement
as opposed to doing a remote CPU refcnt-dec. And as you noticed the
benefit is quite high saving 241 cycles (see [1]). And your patch is
"using" these cycles to copy the packet instead.

This might not be a win in the future. I'm working on a more generic
solution (page_pool) that (as one objective) targets this remote refcnt.

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench03.c
 Measured on: i7-4790K CPU @ 4.00GHz
 Same CPU release cost  : 251 cycles
 Remote CPU release cost: 492 cycles

> skb_condense() helper could probably used in other contexts.
>
> Signed-off-by: Eric Dumazet
> Cc: Paolo Abeni
> ---
[...]
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index b45cd1494243fc99686016949f4546dbba11f424..84151cf40aebb973bad5bee3ee4be0758084d83c 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4931,3 +4931,31 @@ struct sk_buff *pskb_extract(struct sk_buff *skb, int off,
>  EXPORT_SYMBOL(pskb_extract);
> +
> +/**
> + * skb_condense - try to get rid of fragments/frag_list if possible
> + * @skb: buffer
> + *
> + * Can be used to save memory before skb is added to a busy queue.
> + * If packet has bytes in frags and enough tail room in skb->head,
> + * pull all of them, so that we can free the frags right now and adjust
> + * truesize.
> + * Notes:
> + *	We do not reallocate skb->head thus can not fail.
> + *	Caller must re-evaluate skb->truesize if needed.
> + */
> +void skb_condense(struct sk_buff *skb)
> +{
> +	if (!skb->data_len ||
> +	    skb->data_len > skb->end - skb->tail ||
> +	    skb_cloned(skb))
> +		return;

So this is only active depending on how the driver constructed the SKB,
but every caller ends up doing a function call (not inlined).

> +	/* Nice, we can free page frag(s) right now */
> +	__pskb_pull_tail(skb, skb->data_len);
> +
> +	/* Now adjust skb->truesize, since __pskb_pull_tail() does
> +	 * not do this.
> +	 */
> +	skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
> +}
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 16d88ba9ff1c402f77063cfb5eea2708d86da2fc..f5628ada47b53f0d92d08210e5d7e4132a107f73 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
[...]
> @@ -1208,6 +1208,16 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
>  	if (rmem > sk->sk_rcvbuf)
>  		goto drop;
>
> +	/* Under mem pressure, it might be helpful to help udp_recvmsg()
> +	 * having linear skbs :
> +	 * - Reduce memory overhead and thus increase receive queue capacity
> +	 * - Less cache line misses at copyout() time
> +	 * - Less work at consume_skb() (less alien page frag freeing)
> +	 */
> +	if (rmem > (sk->sk_rcvbuf >> 1))
> +		skb_condense(skb);
> +	size = skb->truesize;
> +

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
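
The 241-cycle figure in the message above is simply the difference between the two measured release costs. The sketch below spells that out and adds a deliberately simplified break-even test. The break-even model and all names are assumptions for illustration: it ignores the other gains claimed for the patch (lower truesize, fewer cache misses at copyout), so it is not a verdict on the patch.

```c
#include <assert.h>

/* Figures quoted in the message above (i7-4790K CPU @ 4.00GHz). */
enum {
	SAME_CPU_RELEASE_CYCLES   = 251,
	REMOTE_CPU_RELEASE_CYCLES = 492,
};

/* Extra cycles paid when the page reference is dropped on a remote CPU. */
static int remote_release_penalty(void)
{
	return REMOTE_CPU_RELEASE_CYCLES - SAME_CPU_RELEASE_CYCLES;
}

/* Hypothetical, simplified test: copying the payload in softirq context is
 * a net win only if the copy costs fewer cycles than the remote release
 * penalty it avoids.
 */
static int copy_beats_remote_release(int copy_cycles)
{
	assert(copy_cycles >= 0);
	return copy_cycles < remote_release_penalty();
}
```

Under this model, if a generic mechanism removed the remote release cost entirely, the penalty would drop to zero and the copy would no longer pay for itself, which is the concern raised above.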
Re: [PATCH net-next] udp: under rx pressure, try to condense skbs
On Wed, 2016-12-07 at 09:19 -0800, Eric Dumazet wrote:
> From: Eric Dumazet

> This patch gave me about 5 % increase in throughput in my tests.
>

BTW, this reduces the gap with GRO being on or off for me :

The mlx4 driver has a copybreak feature when GRO is off : packets below
256 bytes are simply copied into a linear skb ( done in mlx4_en_rx_skb() )
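
For readers unfamiliar with copybreak, the decision described above can be sketched in userspace as follows. Only the 256-byte threshold comes from the message; the function, its signature, and the buffer handling are hypothetical and are not the mlx4 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define COPYBREAK_BYTES 256	/* threshold quoted above; the name is made up */

/* Copy a small packet out of its RX frag into a linear buffer.
 * Returns true if the packet was copied, meaning the frag can be reused
 * immediately; returns false if the packet should stay in its frag.
 */
static bool rx_copybreak(const unsigned char *frag, size_t len,
			 unsigned char *linear, size_t linear_cap)
{
	assert(frag != NULL && linear != NULL);

	if (len >= COPYBREAK_BYTES || len > linear_cap)
		return false;

	memcpy(linear, frag, len);
	return true;
}
```

The difference from skb_condense() is where the decision is made: a copybreak driver applies it to every small packet, while the patch under discussion copies only when the socket is already more than half full.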
[PATCH net-next] udp: under rx pressure, try to condense skbs
From: Eric Dumazet

Under UDP flood, many softirq producers try to add packets to
UDP receive queue, and one user thread is burning one cpu trying
to dequeue packets as fast as possible.

Two parts of the per packet cost are :
- copying payload from kernel space to user space,
- freeing memory pieces associated with skb.

If socket is under pressure, softirq handler(s) can try to pull in
skb->head the payload of the packet if it fits.

Meaning the softirq handler(s) can free/reuse the page fragment
immediately, instead of letting udp_recvmsg() do this hundreds of usec
later, possibly from another node.

Additional gains :
- We reduce skb->truesize and thus can store more packets per SO_RCVBUF
- We avoid cache line misses at copyout() time and consume_skb() time,
and avoid one put_page() with potential alien freeing on NUMA hosts.

This comes at the cost of a copy, bounded to available tail room, which
is usually small. (We might have to fix GRO_MAX_HEAD which looks bigger
than necessary)

This patch gave me about 5 % increase in throughput in my tests.

skb_condense() helper could probably used in other contexts.

Signed-off-by: Eric Dumazet
Cc: Paolo Abeni
---
 include/linux/skbuff.h |    2 ++
 net/core/skbuff.c      |   28
 net/ipv4/udp.c         |   12 +++-
 3 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9c535fbccf2c7dbfae04cee393460e86d588c26b..0cd92b0f2af5fe5a7c153435b8dc758338180ae3 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1966,6 +1966,8 @@ static inline int pskb_may_pull(struct sk_buff *skb, unsigned int len)
 	return __pskb_pull_tail(skb, len - skb_headlen(skb)) != NULL;
 }
 
+void skb_condense(struct sk_buff *skb);
+
 /**
  *	skb_headroom - bytes at buffer head
  *	@skb: buffer to check
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b45cd1494243fc99686016949f4546dbba11f424..84151cf40aebb973bad5bee3ee4be0758084d83c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4931,3 +4931,31 @@ struct sk_buff *pskb_extract(struct sk_buff *skb, int off,
 	return clone;
 }
 EXPORT_SYMBOL(pskb_extract);
+
+/**
+ * skb_condense - try to get rid of fragments/frag_list if possible
+ * @skb: buffer
+ *
+ * Can be used to save memory before skb is added to a busy queue.
+ * If packet has bytes in frags and enough tail room in skb->head,
+ * pull all of them, so that we can free the frags right now and adjust
+ * truesize.
+ * Notes:
+ *	We do not reallocate skb->head thus can not fail.
+ *	Caller must re-evaluate skb->truesize if needed.
+ */
+void skb_condense(struct sk_buff *skb)
+{
+	if (!skb->data_len ||
+	    skb->data_len > skb->end - skb->tail ||
+	    skb_cloned(skb))
+		return;
+
+	/* Nice, we can free page frag(s) right now */
+	__pskb_pull_tail(skb, skb->data_len);
+
+	/* Now adjust skb->truesize, since __pskb_pull_tail() does
+	 * not do this.
+	 */
+	skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
+}
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 16d88ba9ff1c402f77063cfb5eea2708d86da2fc..f5628ada47b53f0d92d08210e5d7e4132a107f73 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1199,7 +1199,7 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 {
 	struct sk_buff_head *list = &sk->sk_receive_queue;
 	int rmem, delta, amt, err = -ENOMEM;
-	int size = skb->truesize;
+	int size;
 
 	/* try to avoid the costly atomic add/sub pair when the receive
 	 * queue is full; always allow at least a packet
@@ -1208,6 +1208,16 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	if (rmem > sk->sk_rcvbuf)
 		goto drop;
 
+	/* Under mem pressure, it might be helpful to help udp_recvmsg()
+	 * having linear skbs :
+	 * - Reduce memory overhead and thus increase receive queue capacity
+	 * - Less cache line misses at copyout() time
+	 * - Less work at consume_skb() (less alien page frag freeing)
+	 */
+	if (rmem > (sk->sk_rcvbuf >> 1))
+		skb_condense(skb);
+	size = skb->truesize;
+
 	/* we drop only if the receive buf is full and the receive
 	 * queue contains some other skb
 	 */