On Thu, Mar 24, 2016 at 6:39 PM, Willem de Bruijn
<willemdebruijn.ker...@gmail.com> wrote:
> > This patch allows the user process to use MSG_EOR during
> > tcp_sendmsg to tell the kernel that it is the last byte
> > of an application response message.
> >
> > The user process can use the new SOF_TIMESTAMPING_TX_EOR to
> > ask the kernel to only track the timestamp of the MSG_EOR byte.
>
> Selective timestamp requests are a useful addition. Soheil (cc-ed) also
> happens to be looking at this. That approach uses cmsg to selectively
> tag send calls, avoiding the need to define a new SOF_ flag.
>
> > The current SOF_TIMESTAMPING_TX_ACK is tracking the last
> > byte appended to a skb during the tcp_sendmsg. It may track
> > multiple bytes if the response spans across multiple skbs.
>
> It only tracks the last byte of the buffer passed in sendmsg. If a
> sendmsg results in multiple skbuffs, only the last skb is tracked. It
> is, however, possible that that skbuff is appended to in a follow-on
> sendmsg call. If multiple calls enable recording on an skbuff, only
> the last record request on an skb is kept.
>
> > it is enough to measure the response latency for application
> > protocol with a single request/response at a time (like HTTP 1.1
> > without pipelining), it does not work well for application protocol
> > with >1 pipelined responses (like HTTP2).

Looks like an interesting and useful patch. Since HTTP2 allows
multiplexing data stream frames from multiple logical streams on a
single socket, how would you instrument this to measure the latency of
each stream? e.g.,
  sendmsg of data_frame_1_of_stream_a
  sendmsg of data_frame_1_of_stream_b
  sendmsg of data_frame_2_of_stream_a
  sendmsg of data_frame_1_of_stream_c
  sendmsg of data_frame_2_of_stream_b

> > Each skb can only track one tskey (which is the seq number of
> > the last byte of the message). To allow tracking the
> > last byte of multiple response messages, this patch takes the
> > approach of not appending to the last skb during tcp_sendmsg if the
> > last skb's tskey would be overwritten. A similar case also happens
> > during retransmit.
> >
> > This approach avoids introducing another list to track the tskey. The
> > downside is that it will have less gso benefit and/or more outgoing
> > packets. Practically, due to the amount of measurement data generated,
> > sampling is usually used in production (i.e. not every connection is
> > tracked).
>
> Agreed. This is the simplest approach to avoiding timestamp request
> overwrites. A list-based approach quickly becomes complex, as skbuffs
> can be split and merged at various points.
>
> Since this use is rare, I would suggest making the code even simpler
> by just jumping to new_segment on a call with this MSG option (or
> cmsg) set, avoiding tcp_tx_ts_noappend_skb() on each new segment.

+1

> > One of our use cases is at the webserver. The webserver tracks
> > the HTTP2 response latency by measuring from when the webserver
> > sends the first byte to the socket until the TCP ACK of the last
> > byte is received. In the cases where we don't have client side
> > measurement, measuring from the server side is the only option.
> > In the cases where we have client side measurement, the server side
> > data can also be used to justify/cross-check with the client
> > side data (e.g. is there slowness at the layer above the client's
> > TCP stack).
> >
> > The TCP PRR paper [1] also measures a similar metric:
> > "The TCP latency of a HTTP response when the server sends the first
> > byte until it receives the acknowledgment (ACK) for the last byte."
> >
> > [1] Proportional Rate Reduction for TCP:
> > http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37486.pdf
> >
> > Signed-off-by: Martin KaFai Lau <ka...@fb.com>
> > Cc: Eric Dumazet <eduma...@google.com>
> > Cc: Neal Cardwell <ncardw...@google.com>
> > Cc: Willem de Bruijn <will...@google.com>
> > Cc: Yuchung Cheng <ych...@google.com>
> > ---
> >  include/uapi/linux/net_tstamp.h |  3 ++-
> >  net/ipv4/tcp.c                  | 23 ++++++++++++++++++-----
> >  net/ipv4/tcp_output.c           |  9 +++++++--
> >  3 files changed, 27 insertions(+), 8 deletions(-)
> >
> > diff --git a/include/uapi/linux/net_tstamp.h b/include/uapi/linux/net_tstamp.h
> > index 6d1abea..5376569 100644
> > --- a/include/uapi/linux/net_tstamp.h
> > +++ b/include/uapi/linux/net_tstamp.h
> > @@ -25,8 +25,9 @@ enum {
> >  	SOF_TIMESTAMPING_TX_ACK = (1<<9),
> >  	SOF_TIMESTAMPING_OPT_CMSG = (1<<10),
> >  	SOF_TIMESTAMPING_OPT_TSONLY = (1<<11),
> > +	SOF_TIMESTAMPING_TX_EOR = (1<<12),
> >
> > -	SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_OPT_TSONLY,
> > +	SOF_TIMESTAMPING_LAST = SOF_TIMESTAMPING_TX_EOR,
> >  	SOF_TIMESTAMPING_MASK = (SOF_TIMESTAMPING_LAST - 1) |
> >  				 SOF_TIMESTAMPING_LAST
> >  };
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 08b8b96..7de96eb 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -428,11 +428,16 @@ void tcp_init_sock(struct sock *sk)
> >  }
> >  EXPORT_SYMBOL(tcp_init_sock);
> >
> > -static void tcp_tx_timestamp(struct sock *sk, struct sk_buff *skb)
> > +static void tcp_tx_timestamp(struct sock *sk, struct sk_buff *skb, int flags)
> >  {
> >  	if (sk->sk_tsflags) {
> > -		struct skb_shared_info *shinfo = skb_shinfo(skb);
> > +		struct skb_shared_info *shinfo;
> >
> > +		if ((sk->sk_tsflags & SOF_TIMESTAMPING_TX_EOR) &&
> > +		    !(flags & MSG_EOR))
> > +			return;
> > +
> > +		shinfo = skb_shinfo(skb);
> >  		sock_tx_timestamp(sk, &shinfo->tx_flags);
> >  		if (shinfo->tx_flags & SKBTX_ANY_TSTAMP)
> >  			shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
> > @@ -957,7 +962,7 @@ new_segment:
> >  		offset += copy;
> >  		size -= copy;
> >  		if (!size) {
> > -			tcp_tx_timestamp(sk, skb);
> > +			tcp_tx_timestamp(sk, skb, flags);
> >  			goto out;
> >  		}
> >
> > @@ -1073,6 +1078,14 @@ static int tcp_sendmsg_fastopen(struct sock *sk, struct msghdr *msg,
> >  	return err;
> >  }
> >
> > +static bool tcp_tx_ts_noappend_skb(const struct sock *sk,
> > +				   const struct sk_buff *last_skb, int flags)
> > +{
> > +	return unlikely((sk->sk_tsflags & SOF_TIMESTAMPING_TX_EOR) &&
> > +			(flags & MSG_EOR) &&
>
> flags seems more likely to be cached than sk->sk_tsflags at this
> point, in which case swap those tests.
>
> > +			(skb_shinfo(last_skb)->tx_flags & SKBTX_ANY_TSTAMP));
> > +}
> > +
> >  int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
>
> for a non-rfc patch, also change do_tcp_sendpages
>
> >  {
> >  	struct tcp_sock *tp = tcp_sk(sk);
> > @@ -1144,7 +1157,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
> >  		copy = max - skb->len;
> >  	}
> >
> > -	if (copy <= 0) {
> > +	if (copy <= 0 || tcp_tx_ts_noappend_skb(sk, skb, flags)) {
> >  new_segment:
>
> This adds a test to every segment for a niche feature. See my point of
> just jumping here on first entering the loop.