Re: optimizations to sk_buff handling in rds_tcp_data_ready

2016-04-10 Thread Eric Dumazet
On Sat, 2016-04-09 at 20:18 -0400, Sowmini Varadhan wrote:
> On (04/07/16 07:16), Eric Dumazet wrote:
> > Use skb split like TCP in output path ?
> > Really, pskb_expand_head() is not supposed to copy payload ;)
> 
> Question: how come skb_split doesn't have to deal with frag_list
> and do a skb_walk_frags()? Couldn't the split line be somewhere
> in the frag_list? Also, even for skb_split_inside_header,
> don't we have to set
>   skb_shinfo(skb1)->frag_list = skb_shinfo(skb)->frag_list;
> and cut loose the skb_shinfo(skb)->frag_list?
> 
> As I try to mimic skb_split in some new set of "skb_carve"
> functions, I'm running into all the various frag_list cases. I'm
> afraid I might end up needing most of the stuff under the "Pure
> masohism" (sic) comment in __pskb_pull_tail().
> 

The helper was written for TCP, I guess. TCP never uses
skb_shinfo(skb)->frag_list in the output path: it is guaranteed to be NULL for all skbs.



Re: optimizations to sk_buff handling in rds_tcp_data_ready

2016-04-09 Thread Sowmini Varadhan
On (04/07/16 07:16), Eric Dumazet wrote:
> Use skb split like TCP in output path ?
> Really, pskb_expand_head() is not supposed to copy payload ;)

Question: how come skb_split doesn't have to deal with frag_list
and do a skb_walk_frags()? Couldn't the split line be somewhere
in the frag_list? Also, even for skb_split_inside_header,
don't we have to set
  skb_shinfo(skb1)->frag_list = skb_shinfo(skb)->frag_list;
and cut loose the skb_shinfo(skb)->frag_list?

As I try to mimic skb_split in some new set of "skb_carve"
functions, I'm running into all the various frag_list cases. I'm
afraid I might end up needing most of the stuff under the "Pure
masohism" (sic) comment in __pskb_pull_tail().

--Sowmini



Re: optimizations to sk_buff handling in rds_tcp_data_ready

2016-04-07 Thread Sowmini Varadhan
On (04/07/16 07:16), Eric Dumazet wrote:
> Use skb split like TCP in output path ?

That almost looks like what I want, but skb_split modifies both
skb and skb1, and I want to leave skb untouched (otherwise 
I will mess up the book-keeping in tcp_read_sock). But skb_split 
is a good template; I think it could even be extended to avoid copying
the frags that we'd later trim off in rds_tcp_data_recv anyway. 
Let me see what I can come up with, based on that code.

> Really, pskb_expand_head() is not supposed to copy payload ;)

That would make my world very easy! But it has callers from all
over the kernel, e.g., skb_realloc_headroom, and changing it is
obviously risky.



Re: optimizations to sk_buff handling in rds_tcp_data_ready

2016-04-07 Thread Eric Dumazet
On Thu, 2016-04-07 at 09:11 -0400, Sowmini Varadhan wrote:
> I was going back to Alexei's comment here:
>   http://permalink.gmane.org/gmane.linux.network/387806
> In rds-stress profiling, we are indeed seeing the pskb_pull
> (from rds_tcp_data_recv) show up in the perf profile.
> 
> At least for rds-tcp, we cannot re-use the skb even if
> it is not shared, because what we need to do is to carve out
> the interesting bits (start at offset, and trim to to_copy)
> and queue up those interesting bits on the PF_RDS socket,
> while allowing tcp_read_sock to go back and read the next
> TCP segment (which may be part of the next RDS dgram).
> 
> But, when pskb_expand_head is invoked in the call stack
>   pskb_pull(.., offset) -> ... -> __pskb_pull_tail(.., delta)
> it will memcpy the offset bytes to the start of data. At least
> for rds_tcp_data_recv, we are not interested in being able
> to do a *skb_push after the *skb_pull, so we don't really care
> about the intactness of these bytes in offset.
> Thus what I am finding is that when delta > 0, if we skip the 
> following in pskb_expand_head (for the rds-tcp recv case only!)
> 
> /* Copy only real data... and, alas, header. This should be
>  * optimized for the cases when header is void.
>  */
> memcpy(data + nhead, skb->head, skb_tail_pointer(skb) - skb->head);
> 
> and also (only for this case!) this one in __pskb_pull_tail
> 
>  if (skb_copy_bits(skb, skb_headlen(skb), skb_tail_pointer(skb), delta))
> BUG();
> 
> I am able to get a 40% improvement in the measured IOPS (req/response 
> transactions per second, using 8K byte requests, 256 byte responses,
> 16 concurrent threads), so this optimization seems worth doing.
> 
> Does my analysis above make sense? If yes, the question is how to
> achieve this bypass in a neat way. Clearly there are many callers of
> pskb_expand_head who will expect to find the skb_header_len bytes at
> the start of data, but we also don't want to duplicate code in these
> functions. One thought I had is to pass a flag around saying "caller
> doesn't care about retaining offset bytes", and use this flag
> - in __pskb_pull_tail, to avoid the skb_copy_bits() above, and to
>   pass delta to pskb_expand_head,
> - in pskb_expand_head, to only do the memcpy listed above
>   if delta <= 0.
> Any other ideas for how to achieve this?

Use skb split like TCP in output path ?

Really, pskb_expand_head() is not supposed to copy payload ;)




optimizations to sk_buff handling in rds_tcp_data_ready

2016-04-07 Thread Sowmini Varadhan

I was going back to Alexei's comment here:
  http://permalink.gmane.org/gmane.linux.network/387806
In rds-stress profiling, we are indeed seeing the pskb_pull
(from rds_tcp_data_recv) show up in the perf profile.

At least for rds-tcp, we cannot re-use the skb even if
it is not shared, because what we need to do is to carve out
the interesting bits (start at offset, and trim to to_copy)
and queue up those interesting bits on the PF_RDS socket,
while allowing tcp_read_sock to go back and read the next
TCP segment (which may be part of the next RDS dgram).

But, when pskb_expand_head is invoked in the call stack
  pskb_pull(.., offset) -> ... -> __pskb_pull_tail(.., delta)
it will memcpy the offset bytes to the start of data. At least
for rds_tcp_data_recv, we are not interested in being able
to do a *skb_push after the *skb_pull, so we don't really care
about the intactness of these bytes in offset.
Thus what I am finding is that when delta > 0, if we skip the 
following in pskb_expand_head (for the rds-tcp recv case only!)

/* Copy only real data... and, alas, header. This should be
 * optimized for the cases when header is void.
 */
memcpy(data + nhead, skb->head, skb_tail_pointer(skb) - skb->head);

and also (only for this case!) this one in __pskb_pull_tail

 if (skb_copy_bits(skb, skb_headlen(skb), skb_tail_pointer(skb), delta))
BUG();

I am able to get a 40% improvement in the measured IOPS (req/response 
transactions per second, using 8K byte requests, 256 byte responses,
16 concurrent threads), so this optimization seems worth doing.

Does my analysis above make sense? If yes, the question is how to
achieve this bypass in a neat way. Clearly there are many callers of
pskb_expand_head who will expect to find the skb_header_len bytes at
the start of data, but we also don't want to duplicate code in these
functions. One thought I had is to pass a flag around saying "caller
doesn't care about retaining offset bytes", and use this flag
- in __pskb_pull_tail, to avoid the skb_copy_bits() above, and to
  pass delta to pskb_expand_head,
- in pskb_expand_head, to only do the memcpy listed above
  if delta <= 0.
Any other ideas for how to achieve this?

--Sowmini