Re: optimizations to sk_buff handling in rds_tcp_data_ready
On Sat, 2016-04-09 at 20:18 -0400, Sowmini Varadhan wrote:
> On (04/07/16 07:16), Eric Dumazet wrote:
> > Use skb split like TCP in output path ?
> >
> > Really, pskb_expand_head() is not supposed to copy payload ;)
>
> Question- how come skb_split doesn't have to deal with frag_list
> and do a skb_walk_frags()? Couldn't the split-line be somewhere
> in the frag_list? Also, even for skb_split_inside_header, don't
> we have to set
>
>     skb_shinfo(skb1)->frag_list = skb_shinfo(skb)->frag_list;
>
> and cut loose skb_shinfo(skb)->frag_list?
>
> As I try to mimic skb_split in some new set of "skb_carve"
> functions, I'm running into all the various frag_list cases. I'm
> afraid I might end up needing most of the stuff under the "Pure
> masohism" (sic) comment in __pskb_pull_tail().

The helper was written for TCP, I guess.

TCP in the output path never uses skb_shinfo(skb)->frag_list: it is
guaranteed to be NULL for all skbs there.
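[To make the frag_list question above concrete, here is a minimal
userspace sketch. The struct and function names (toy_skb,
toy_split_inside_header) are hypothetical stand-ins, not the real
struct sk_buff or skb_split_inside_header; it only models the
bookkeeping Sowmini describes: when the split point falls inside the
linear header, the linear tail, all page frags, and the frag_list
chain would move to skb1, and the original's frag_list is cut loose.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FRAGS 4

struct toy_skb {
	char   linear[64];          /* linear (header) area             */
	size_t linear_len;          /* bytes in the linear area         */
	int    nr_frags;            /* page frags (lengths only, here)  */
	size_t frag_len[MAX_FRAGS];
	struct toy_skb *frag_list;  /* chained skbs, as in skb_shinfo   */
	size_t len;                 /* total length                     */
};

/* Split at 'at' bytes, where 'at' falls inside the linear area:
 * the tail of the linear data, ALL frags, and the frag_list chain
 * move to skb1 -- mirroring what an inside-header split would have
 * to do if frag_list could be non-NULL. */
static void toy_split_inside_header(struct toy_skb *skb,
				    struct toy_skb *skb1, size_t at)
{
	int i;

	assert(at <= skb->linear_len);

	/* move the linear tail */
	skb1->linear_len = skb->linear_len - at;
	memcpy(skb1->linear, skb->linear + at, skb1->linear_len);
	skb->linear_len = at;

	/* every page frag goes to skb1 */
	for (i = 0; i < skb->nr_frags; i++)
		skb1->frag_len[i] = skb->frag_len[i];
	skb1->nr_frags = skb->nr_frags;
	skb->nr_frags = 0;

	/* ... and so does the frag_list chain ("cut loose") */
	skb1->frag_list = skb->frag_list;
	skb->frag_list = NULL;

	skb1->len = skb->len - at;
	skb->len = at;
}
```

[In the real kernel this case never arises on the TCP output path,
per Eric's point above, because frag_list is guaranteed NULL there.]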
Re: optimizations to sk_buff handling in rds_tcp_data_ready
On (04/07/16 07:16), Eric Dumazet wrote:
> Use skb split like TCP in output path ?
>
> Really, pskb_expand_head() is not supposed to copy payload ;)

Question- how come skb_split doesn't have to deal with frag_list and
do a skb_walk_frags()? Couldn't the split-line be somewhere in the
frag_list? Also, even for skb_split_inside_header, don't we have to set

    skb_shinfo(skb1)->frag_list = skb_shinfo(skb)->frag_list;

and cut loose skb_shinfo(skb)->frag_list?

As I try to mimic skb_split in some new set of "skb_carve" functions,
I'm running into all the various frag_list cases. I'm afraid I might
end up needing most of the stuff under the "Pure masohism" (sic)
comment in __pskb_pull_tail().

--Sowmini
Re: optimizations to sk_buff handling in rds_tcp_data_ready
On (04/07/16 07:16), Eric Dumazet wrote:
> Use skb split like TCP in output path ?

That almost looks like what I want, but skb_split modifies both skb
and skb1, and I want to leave skb untouched (otherwise I will mess up
the book-keeping in tcp_read_sock). But skb_split is a good template-
I think it could even be extended to avoid copying the frags that we'd
later trim off in rds_tcp_data_recv anyway. Let me see what I can come
up with, based on that code.

> Really, pskb_expand_head() is not supposed to copy payload ;)

That would make my world very easy! But it has callers from all over
the kernel, e.g., skb_realloc_headroom, and changing it is obviously
risky.
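[The "leave skb untouched" requirement above can be sketched in
userspace as a read-only carve. The names (toy_buf, toy_carve) are
hypothetical and the flat buffer stands in for an skb; the point is
only that the source is never modified, so the caller's book-keeping
(as in tcp_read_sock) stays valid while the carved-out range
[offset, offset + to_copy) is queued elsewhere.]

```c
#include <stdlib.h>
#include <string.h>

struct toy_buf {
	const char *data;
	size_t      len;
};

/* Return a freshly allocated copy of src's bytes in
 * [offset, offset + to_copy), or NULL if the range does not fit.
 * src is read-only throughout, unlike skb_split, which modifies
 * both of its skb arguments. */
static char *toy_carve(const struct toy_buf *src, size_t offset,
		       size_t to_copy)
{
	char *out;

	if (offset > src->len || to_copy > src->len - offset)
		return NULL;
	out = malloc(to_copy);
	if (out)
		memcpy(out, src->data + offset, to_copy);
	return out;
}
```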
Re: optimizations to sk_buff handling in rds_tcp_data_ready
On Thu, 2016-04-07 at 09:11 -0400, Sowmini Varadhan wrote:
> I was going back to Alexei's comment here:
> http://permalink.gmane.org/gmane.linux.network/387806
>
> In rds-stress profiling, we are indeed seeing the pskb_pull (from
> rds_tcp_data_recv) show up in the perf profile.
>
> At least for rds-tcp, we cannot re-use the skb even if it is not
> shared, because what we need to do is to carve out the interesting
> bits (start at offset, and trim it to to_copy) and queue up those
> interesting bits on the PF_RDS socket, while allowing tcp_read_sock
> to go back and read the next tcp segment (which may be part of the
> next rds dgram).
>
> But, when pskb_expand_head is invoked in the call-stack
> pskb_pull(.., offset) -> ... -> __pskb_pull_tail(.., delta)
> it will memcpy the offset bytes to the start of data. At least for
> rds_tcp_data_recv, we are not interested in being able to do an
> skb_push after the skb_pull, so we don't really care about the
> intactness of these bytes in offset.
>
> Thus what I am finding is that when delta > 0, if we skip the
> following in pskb_expand_head (for the rds-tcp recv case only!)
>
>     /* Copy only real data... and, alas, header. This should be
>      * optimized for the cases when header is void.
>      */
>     memcpy(data + nhead, skb->head, skb_tail_pointer(skb) - skb->head);
>
> and also (only for this case!) this one in __pskb_pull_tail
>
>     if (skb_copy_bits(skb, skb_headlen(skb), skb_tail_pointer(skb), delta))
>         BUG();
>
> I am able to get a 40% improvement in the measured IOPS (req/response
> transactions per second, using 8K byte requests, 256 byte responses,
> 16 concurrent threads), so this optimization seems worth doing.
>
> Does my analysis above make sense? If yes, the question is, how to
> achieve this bypass in a neat way. Clearly there are many callers of
> pskb_expand_head who will expect to find the skb_header_len bytes at
> the start of data, but we also don't want to duplicate code in these
> functions.
>
> One thought I had is to pass a flag around saying "caller doesn't
> care about retaining offset bytes", and use this flag
> - in __pskb_pull_tail, to avoid the skb_copy_bits() above, and to
>   pass delta to pskb_expand_head,
> - in pskb_expand_head, to only do the memcpy listed above
>   if delta <= 0.
>
> Any other ideas for how to achieve this?

Use skb split like TCP in output path ?

Really, pskb_expand_head() is not supposed to copy payload ;)
optimizations to sk_buff handling in rds_tcp_data_ready
I was going back to Alexei's comment here:
http://permalink.gmane.org/gmane.linux.network/387806

In rds-stress profiling, we are indeed seeing the pskb_pull (from
rds_tcp_data_recv) show up in the perf profile.

At least for rds-tcp, we cannot re-use the skb even if it is not
shared, because what we need to do is to carve out the interesting
bits (start at offset, and trim it to to_copy) and queue up those
interesting bits on the PF_RDS socket, while allowing tcp_read_sock
to go back and read the next tcp segment (which may be part of the
next rds dgram).

But, when pskb_expand_head is invoked in the call-stack
pskb_pull(.., offset) -> ... -> __pskb_pull_tail(.., delta)
it will memcpy the offset bytes to the start of data. At least for
rds_tcp_data_recv, we are not interested in being able to do an
skb_push after the skb_pull, so we don't really care about the
intactness of these bytes in offset.

Thus what I am finding is that when delta > 0, if we skip the
following in pskb_expand_head (for the rds-tcp recv case only!)

    /* Copy only real data... and, alas, header. This should be
     * optimized for the cases when header is void.
     */
    memcpy(data + nhead, skb->head, skb_tail_pointer(skb) - skb->head);

and also (only for this case!) this one in __pskb_pull_tail

    if (skb_copy_bits(skb, skb_headlen(skb), skb_tail_pointer(skb), delta))
        BUG();

I am able to get a 40% improvement in the measured IOPS (req/response
transactions per second, using 8K byte requests, 256 byte responses,
16 concurrent threads), so this optimization seems worth doing.

Does my analysis above make sense? If yes, the question is, how to
achieve this bypass in a neat way. Clearly there are many callers of
pskb_expand_head who will expect to find the skb_header_len bytes at
the start of data, but we also don't want to duplicate code in these
functions.
One thought I had is to pass a flag around saying "caller doesn't care
about retaining offset bytes", and use this flag
- in __pskb_pull_tail, to avoid the skb_copy_bits() above, and to pass
  delta to pskb_expand_head,
- in pskb_expand_head, to only do the memcpy listed above if
  delta <= 0.

Any other ideas for how to achieve this?

--Sowmini
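[The proposed flag can be sketched in userspace with flat buffers
standing in for skbs. All names here (toy_expand_and_pull,
keep_offset_bytes, copied_bytes) are hypothetical; the sketch only
shows the intended saving: when the caller does not need the offset
bytes to survive the head reallocation, the copy shrinks from len
bytes to len - offset bytes.]

```c
#include <stdlib.h>
#include <string.h>

/* Reallocate 'data' (len bytes) and pull 'offset' bytes.  With
 * keep_offset_bytes set, behave like today's pskb_expand_head path
 * and preserve everything, so a later push could restore the offset
 * bytes; the pulled data then starts at base + offset.  Without it,
 * copy only the bytes past offset (the proposed fast path); the
 * pulled data starts at base.  *copied_bytes reports how much memcpy
 * work was done, so the saving is observable. */
static char *toy_expand_and_pull(const char *data, size_t len,
				 size_t offset, int keep_offset_bytes,
				 size_t *out_len, size_t *copied_bytes)
{
	char *ndata;

	if (offset > len)
		return NULL;
	ndata = malloc(len ? len : 1);
	if (!ndata)
		return NULL;
	if (keep_offset_bytes) {
		/* today's behaviour: every byte survives */
		memcpy(ndata, data, len);
		*copied_bytes = len;
	} else {
		/* proposed bypass: caller never pushes, so the first
		 * 'offset' bytes need not be preserved */
		memcpy(ndata, data + offset, len - offset);
		*copied_bytes = len - offset;
	}
	*out_len = len - offset;
	return ndata;
}
```

[Both paths yield the same pulled payload; only the amount copied
differs, which is where the reported 40% IOPS gain would come from.]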