Re: [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg

2016-04-19 Thread Eric Dumazet
On Tue, 2016-04-19 at 09:47 +, David Laight wrote:
> From: Eric Dumazet
> > Sent: 19 April 2016 00:18
> ...
> > MSG_EOR should not depend on SKBTX_ANY_TSTAMP
> > 
> > Really, simply using send(fd, ..., len, MSG_EOR) should instruct TCP to
> > mark the cooked skb as a non candidate for future coalescing.
> 
> Isn't that very similar to the inverse of MSG_MORE?

Yes. But not specifying MSG_MORE does not mean we want to force send
tiny packets. TCP is allowed to aggregate since it is a stream protocol,
since it reduces overhead.

> Or a send with Nagle disabled?

No.  tcp_sendmsg() still can coalesce/aggregate data to the last packet
in write queue, if it was not yet sent, regardless of Nagle.

If you enable or disable Nagle, following command will only send first
packets with 724 bytes, but following ones will be very big.

netperf -t TCP_STREAM -- -m 724

05:16:53.707778 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [S], seq 
711876787, win 29200, options [mss 1460,sackOK,TS val 320048472 ecr 
0,nop,wscale 7], length 0
05:16:53.707908 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [S.], seq 
2592965917, ack 711876788, win 28960, options [mss 1460,sackOK,TS val 23660532 
ecr 320048472,nop,wscale 7], length 0
05:16:53.708001 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [.], ack 1, 
win 229, options [nop,nop,TS val 320048473 ecr 23660532], length 0
05:16:53.708045 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [P.], seq 
1:725, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], length 
724
05:16:53.708053 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 725, 
win 238, options [nop,nop,TS val 23660532 ecr 320048473], length 0
05:16:53.708083 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [P.], seq 
725:2173, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], 
length 1448
05:16:53.708097 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 
2173, win 261, options [nop,nop,TS val 23660532 ecr 320048473], length 0
05:16:53.708094 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [P.], seq 
2173:3621, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], 
length 1448
05:16:53.708107 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 
3621, win 283, options [nop,nop,TS val 23660532 ecr 320048473], length 0
05:16:53.708132 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [P.], seq 
3621:6517, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], 
length 2896
05:16:53.708133 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [P.], seq 
6517:7965, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], 
length 1448
05:16:53.708148 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 
6517, win 329, options [nop,nop,TS val 23660532 ecr 320048473], length 0
05:16:53.708152 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 
7965, win 351, options [nop,nop,TS val 23660532 ecr 320048473], length 0
05:16:53.708157 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [P.], seq 
7965:10861, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], 
length 2896
05:16:53.708165 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 
10861, win 396, options [nop,nop,TS val 23660532 ecr 320048473], length 0
05:16:53.708168 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [P.], seq 
10861:12309, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], 
length 1448
05:16:53.708177 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 
12309, win 419, options [nop,nop,TS val 23660532 ecr 320048473], length 0
05:16:53.708220 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [.], seq 
12309:16653, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], 
length 4344
05:16:53.708240 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 
16653, win 487, options [nop,nop,TS val 23660532 ecr 320048473], length 0
05:16:53.708335 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [P.], seq 
16653:28237, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], 
length 11584
05:16:53.708378 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 
28237, win 668, options [nop,nop,TS val 23660532 ecr 320048473], length 0
05:16:53.708453 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [P.], seq 
28237:42717, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], 
length 14480
05:16:53.708496 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 
42717, win 894, options [nop,nop,TS val 23660532 ecr 320048473], length 0
05:16:53.708537 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [.], seq 
42717:49957, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], 
length 7240
05:16:53.708585 IP 10.246.7.133.37427 > 10.246.7.151.42681: Flags [P.], seq 
49957:57197, ack 1, win 229, options [nop,nop,TS val 320048473 ecr 23660532], 
length 7240
05:16:53.708641 IP 10.246.7.151.42681 > 10.246.7.133.37427: Flags [.], ack 
49957, win 1007, options [nop,nop,TS val 

RE: [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg

2016-04-19 Thread David Laight
From: Eric Dumazet
> Sent: 19 April 2016 00:18
...
> MSG_EOR should not depend on SKBTX_ANY_TSTAMP
> 
> Really, simply using send(fd, ..., len, MSG_EOR) should instruct TCP to
> mark the cooked skb as a non candidate for future coalescing.

Isn't that very similar to the inverse of MSG_MORE?
Or a send with Nagle disabled?

David



Re: [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg

2016-04-18 Thread Eric Dumazet
On Mon, 2016-04-18 at 20:18 -0700, Martin KaFai Lau wrote:
> On Mon, Apr 18, 2016 at 07:50:41PM -0700, Eric Dumazet wrote:
> > I believe it is slightly wrong (to do the goto new_segment if there is
> > no data to send)
> Aha. Thanks for pointing it out.
> 
> >
> > I would instead use this fast path, doing the test _when_ we already
> > have an skb to test for.
> The v1 was doing a check in the loop but the feedback was, instead
> of doing this unlikely(test) repeatedly in the loop, do it before
> entering the loop and do a goto new_segment if needed.
> 
> I agree that doing it in the loop is easier to follow/read
> and checking TCP_SKB_CB(skb)->eor is cheaper than my v1.
> I will respin with your suggestion.

I do not remember the suggestion. It is better to do the test in the
loop, as you could have multiple threads doing a sendmsg() on the same
socket, and eventually sleeping in sk_stream_wait_memory()

A bit convoluted, but who knows what can be done by some applications ;)





Re: [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg

2016-04-18 Thread Martin KaFai Lau
On Mon, Apr 18, 2016 at 07:50:41PM -0700, Eric Dumazet wrote:
> I believe it is slightly wrong (to do the goto new_segment if there is
> no data to send)
Aha. Thanks for pointing it out.

>
> I would instead use this fast path, doing the test _when_ we already
> have an skb to test for.
The v1 was doing a check in the loop but the feedback was, instead
of doing this unlikely(test) repeatedly in the loop, do it before
entering the loop and do a goto new_segment if needed.

I agree that doing it in the loop is easier to follow/read
and checking TCP_SKB_CB(skb)->eor is cheaper than my v1.
I will respin with your suggestion.

> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 6451b83d81e9..acfbff81ef47 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1171,6 +1171,8 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, 
> u32 len,
>   TCP_SKB_CB(skb)->tcp_flags = flags & ~(TCPHDR_FIN | TCPHDR_PSH);
>   TCP_SKB_CB(buff)->tcp_flags = flags;
>   TCP_SKB_CB(buff)->sacked = TCP_SKB_CB(skb)->sacked;
> + TCP_SKB_CB(buff)->eor = TCP_SKB_CB(skb)->eor;
> + TCP_SKB_CB(skb)->eor = 0;
>
>   if (!skb_shinfo(skb)->nr_frags && skb->ip_summed != CHECKSUM_PARTIAL) {
>   /* Copy and checksum data tail into the new buffer. */
> @@ -1730,6 +1732,8 @@ static int tso_fragment(struct sock *sk, struct sk_buff 
> *skb, unsigned int len,
>
>   /* This packet was never sent out yet, so no SACK bits. */
>   TCP_SKB_CB(buff)->sacked = 0;
> + TCP_SKB_CB(buff)->eor = TCP_SKB_CB(skb)->eor;
> + TCP_SKB_CB(skb)->eor = 0;
>
>   buff->ip_summed = skb->ip_summed = CHECKSUM_PARTIAL;
>   skb_split(skb, buff, len);
> @@ -2471,6 +2475,7 @@ static void tcp_collapse_retrans(struct sock *sk, 
> struct sk_buff *skb)
>
>   /* Merge over control information. This moves PSH/FIN etc. over */
>   TCP_SKB_CB(skb)->tcp_flags |= TCP_SKB_CB(next_skb)->tcp_flags;
> + TCP_SKB_CB(skb)->eor = TCP_SKB_CB(next_skb)->eor;
>
>   /* All done, get rid of second SKB and account for it so
>* packet counting does not break.
> @@ -2502,7 +2507,8 @@ static bool tcp_can_collapse(const struct sock *sk, 
> const struct sk_buff *skb)
>   /* Some heurestics for collapsing over SACK'd could be invented */
>   if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)
>   return false;
> -
> + if (TCP_SKB_CB(skb)->eor)
> + return false;
>   return true;
>  }
Thanks for this diff.  It confirms that I probably understand your last
suggestion correctly.  I also have similar diff for the sacks handling.


Re: [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg

2016-04-18 Thread Eric Dumazet
On Mon, 2016-04-18 at 19:27 -0700, Martin KaFai Lau wrote:
> On Mon, Apr 18, 2016 at 05:06:57PM -0700, Eric Dumazet wrote:
> > It should only be a request from user space to ask TCP to not aggregate
> > stuff on future sendpage()/sendmsg() on the skb carrying this new flag.
> >
> How about something like this.  Please advise if tcp_sendmsg_noappend can
> be simpler.
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index c0ef054..ac31798 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -762,7 +762,8 @@ struct tcp_skb_cb {
> 
>   __u8ip_dsfield; /* IPv4 tos or IPv6 dsfield */
>   __u8txstamp_ack:1,  /* Record TX timestamp for ack? */
> - unused:7;
> + eor:1,  /* Is skb MSG_EOR marked */
> + unused:6;
>   __u32   ack_seq;/* Sequence number ACK'd*/
>   union {
>   struct inet_skb_parmh4;
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 4d73858..12772be 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -874,6 +874,13 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, 
> int flags)
>   return mss_now;
>  }
> 
> +static bool tcp_sendmsg_noappend(const struct sock *sk)
> +{
> + const struct sk_buff *skb = tcp_write_queue_tail(sk);
> +
> + return (!skb || TCP_SKB_CB(skb)->eor);
> +}
> +
>  static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int 
> offset,
>   size_t size, int flags)
>  {
> @@ -903,6 +910,9 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct 
> page *page, int offset,
>   if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
>   goto out_err;
> 
> + if (tcp_sendmsg_noappend(sk))
> + goto new_segment;
> +
>   while (size > 0) {
>   struct sk_buff *skb = tcp_write_queue_tail(sk);
>   int copy, i;
> @@ -960,6 +970,7 @@ new_segment:
>   size -= copy;
>   if (!size) {
>   tcp_tx_timestamp(sk, sk->sk_tsflags, skb);
> + TCP_SKB_CB(skb)->eor = !!(flags & MSG_EOR);
>   goto out;
>   }
> 
> @@ -1145,6 +1156,9 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, 
> size_t size)
> 
>   sg = !!(sk->sk_route_caps & NETIF_F_SG);
> 
> + if (tcp_sendmsg_noappend(sk))
> + goto new_segment;
> +
>   while (msg_data_left(msg)) {
>   int copy = 0;
>   int max = size_goal;
> @@ -1250,6 +1264,7 @@ new_segment:
>   copied += copy;
>   if (!msg_data_left(msg)) {
>   tcp_tx_timestamp(sk, sockc.tsflags, skb);
> + TCP_SKB_CB(skb)->eor = !!(flags & MSG_EOR);
>   goto out;
>   }

I believe it is slightly wrong (to do the goto new_segment if there is
no data to send)

I would instead use this fast path, doing the test _when_ we already
have an skb to test for.

diff --git a/include/net/tcp.h b/include/net/tcp.h
index fd40f8c64d5f..015851634195 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -760,7 +760,8 @@ struct tcp_skb_cb {
 
__u8ip_dsfield; /* IPv4 tos or IPv6 dsfield */
__u8txstamp_ack:1,  /* Record TX timestamp for ack? */
-   unused:7;
+   eor:1,  /* Is skb MSG_EOR marked */
+   unused:6;
__u32   ack_seq;/* Sequence number ACK'd*/
union {
struct inet_skb_parmh4;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4d73858991af..1944c19885ab 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -908,7 +908,9 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct 
page *page, int offset,
int copy, i;
bool can_coalesce;
 
-   if (!tcp_send_head(sk) || (copy = size_goal - skb->len) <= 0) {
+   if (!tcp_send_head(sk) ||
+   (copy = size_goal - skb->len) <= 0 ||
+   TCP_SKB_CB(skb)->eor) {
 new_segment:
if (!sk_stream_memory_free(sk))
goto wait_for_sndbuf;
@@ -960,6 +962,7 @@ new_segment:
size -= copy;
if (!size) {
tcp_tx_timestamp(sk, sk->sk_tsflags, skb);
+   TCP_SKB_CB(skb)->eor = !!(flags & MSG_EOR);
goto out;
}
 
@@ -1156,7 +1159,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t size)
copy = max - skb->len;
}
 
-   if (copy <= 0) {
+   if (copy <= 0 || TCP_SKB_CB(skb)->eor) {
 new_segment:
/* Allocate new segment. If the interface is SG,
 * allocate skb fitting to single page.
@@ -1250,6 +1253,7 @@ 

Re: [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg

2016-04-18 Thread Martin KaFai Lau
On Mon, Apr 18, 2016 at 05:06:57PM -0700, Eric Dumazet wrote:
> It should only be a request from user space to ask TCP to not aggregate
> stuff on future sendpage()/sendmsg() on the skb carrying this new flag.
>
How about something like this.  Please advise if tcp_sendmsg_noappend can
be simpler.

diff --git a/include/net/tcp.h b/include/net/tcp.h
index c0ef054..ac31798 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -762,7 +762,8 @@ struct tcp_skb_cb {

__u8ip_dsfield; /* IPv4 tos or IPv6 dsfield */
__u8txstamp_ack:1,  /* Record TX timestamp for ack? */
-   unused:7;
+   eor:1,  /* Is skb MSG_EOR marked */
+   unused:6;
__u32   ack_seq;/* Sequence number ACK'd*/
union {
struct inet_skb_parmh4;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4d73858..12772be 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -874,6 +874,13 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, 
int flags)
return mss_now;
 }

+static bool tcp_sendmsg_noappend(const struct sock *sk)
+{
+   const struct sk_buff *skb = tcp_write_queue_tail(sk);
+
+   return (!skb || TCP_SKB_CB(skb)->eor);
+}
+
 static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
size_t size, int flags)
 {
@@ -903,6 +910,9 @@ static ssize_t do_tcp_sendpages(struct sock *sk, struct 
page *page, int offset,
if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
goto out_err;

+   if (tcp_sendmsg_noappend(sk))
+   goto new_segment;
+
while (size > 0) {
struct sk_buff *skb = tcp_write_queue_tail(sk);
int copy, i;
@@ -960,6 +970,7 @@ new_segment:
size -= copy;
if (!size) {
tcp_tx_timestamp(sk, sk->sk_tsflags, skb);
+   TCP_SKB_CB(skb)->eor = !!(flags & MSG_EOR);
goto out;
}

@@ -1145,6 +1156,9 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t size)

sg = !!(sk->sk_route_caps & NETIF_F_SG);

+   if (tcp_sendmsg_noappend(sk))
+   goto new_segment;
+
while (msg_data_left(msg)) {
int copy = 0;
int max = size_goal;
@@ -1250,6 +1264,7 @@ new_segment:
copied += copy;
if (!msg_data_left(msg)) {
tcp_tx_timestamp(sk, sockc.tsflags, skb);
+   TCP_SKB_CB(skb)->eor = !!(flags & MSG_EOR);
goto out;
}

--
2.5.1


Re: [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg

2016-04-18 Thread Eric Dumazet
On Mon, 2016-04-18 at 16:43 -0700, ka...@fb.com wrote:

> >
> > netperf could then get an option to set this MSG_EOR ;)
> Not sure how it is related.  Can you share how netperf can
> benefit from MSG_EOR in TCP tests without any of the
> SOF_TIMESTAMPING_TX_RECORD_MASK.

Simply setting MSG_EOR would be orthogonal to other timestamping stuff.

Maybe the application does not _want_ to be notified when skb is sent or
acknowledged, but would like some kind of "tcpdump awareness" or
something about burst control, who knows...

It should only be a request from user space to ask TCP to not aggregate
stuff on future sendpage()/sendmsg() on the skb carrying this new flag.

We already have other flags to ask for timestamping stuff, and they
could be used at the same time.

If the stack needs to be changed to properly fragment skb (or
aggregating them at retransmit), this is a separate concern.

Note that you do not need to automatically assert MSG_BOR (Begin of
Request) : MSG_EOR should really control the fact that last byte sent
marks the skb as being a non candidate for aggregation.

This would keep tcp_sendmsg() reasonnably fast.
Your tcp_sendmsg_noappend() is quite expensive :(




Re: [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg

2016-04-18 Thread kafai
On Mon, Apr 18, 2016 at 04:18:13PM -0700, Eric Dumazet wrote:
> On Mon, 2016-04-18 at 15:46 -0700, Martin KaFai Lau wrote:
> > This patch allows the user process to use MSG_EOR during
> > tcp_sendmsg to tell the kernel that it is the last byte
> > of an application response message.
> >
> > It is currently useful when the end-user has turned on any bit of the
> > SOF_TIMESTAMPING_TX_RECORD_MASK (either by setsockopt or cmsg).
> > The kernel will then mark the newly added tcb->eor_info bit so
> > that the shinfo->tskey will not be overwritten (i.e. lost) in
> > the later skb append/collapse operation.
> >
> > With selective SOF_TIMESTAMPING_TX_ACK (by cmsg) and MSG_EOR (this
> > patch), the user application can specially tell which outgoing byte
> > it wants to track its ACK and ask the kernel not to lose this
> > tracking info in the later skb append/collapse action.
> >
> > This patch handles the append case in tcp_sendmsg.  The later
> > patches will handle the collapse during retransmission and
> > skb slicing in tcp_fragment()/tso_fragment().
> >
> > One of our use case is at the webserver.  The webserver tracks
> > the HTTP2 response latency by measuring when the webserver sends
> > the first byte to the socket till the TCP ACK of the last byte
> > is received.  In the cases where we don't have client side
> > measurement, measuring from the server side is the only option.
> > In the cases we have the client side measurement, the server side
> > data can also be used to justify/cross-check-with the client
> > side data.
> >
> > Signed-off-by: Martin KaFai Lau 
> > Cc: Eric Dumazet 
> > Cc: Neal Cardwell 
> > Cc: Soheil Hassas Yeganeh 
> > Cc: Willem de Bruijn 
> > Cc: Yuchung Cheng 
> > ---
>
> MSG_EOR should not depend on SKBTX_ANY_TSTAMP
>
> Really, simply using send(fd, ..., len, MSG_EOR) should instruct TCP to
> mark the cooked skb as a non candidate for future coalescing.
It was one of my earlier local attempt.  There are cases that coalescing
will not lose the tskey, so I trashed it.

If we mark eor only based on MSG_EOR, we can still do checks on
prev_skb's tskey and next_skb's tskey before coalescing two skbs
or
you meant simply don't coalesce if the prev_skb has eor marked?

>
> netperf could then get an option to set this MSG_EOR ;)
Not sure how it is related.  Can you share how netperf can
benefit from MSG_EOR in TCP tests without any of the
SOF_TIMESTAMPING_TX_RECORD_MASK.


Re: [RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg

2016-04-18 Thread Eric Dumazet
On Mon, 2016-04-18 at 15:46 -0700, Martin KaFai Lau wrote:
> This patch allows the user process to use MSG_EOR during
> tcp_sendmsg to tell the kernel that it is the last byte
> of an application response message.
> 
> It is currently useful when the end-user has turned on any bit of the
> SOF_TIMESTAMPING_TX_RECORD_MASK (either by setsockopt or cmsg).
> The kernel will then mark the newly added tcb->eor_info bit so
> that the shinfo->tskey will not be overwritten (i.e. lost) in
> the later skb append/collapse operation.
> 
> With selective SOF_TIMESTAMPING_TX_ACK (by cmsg) and MSG_EOR (this
> patch), the user application can specially tell which outgoing byte
> it wants to track its ACK and ask the kernel not to lose this
> tracking info in the later skb append/collapse action.
> 
> This patch handles the append case in tcp_sendmsg.  The later
> patches will handle the collapse during retransmission and
> skb slicing in tcp_fragment()/tso_fragment().
> 
> One of our use case is at the webserver.  The webserver tracks
> the HTTP2 response latency by measuring when the webserver sends
> the first byte to the socket till the TCP ACK of the last byte
> is received.  In the cases where we don't have client side
> measurement, measuring from the server side is the only option.
> In the cases we have the client side measurement, the server side
> data can also be used to justify/cross-check-with the client
> side data.
> 
> Signed-off-by: Martin KaFai Lau 
> Cc: Eric Dumazet 
> Cc: Neal Cardwell 
> Cc: Soheil Hassas Yeganeh 
> Cc: Willem de Bruijn 
> Cc: Yuchung Cheng 
> ---

MSG_EOR should not depend on SKBTX_ANY_TSTAMP

Really, simply using send(fd, ..., len, MSG_EOR) should instruct TCP to
mark the cooked skb as a non candidate for future coalescing.

netperf could then get an option to set this MSG_EOR ;)

I believe Soheil was working on such simple alternative ?




[RFC PATCH v2 net-next 4/7] tcp: Make use of MSG_EOR flag in tcp_sendmsg

2016-04-18 Thread Martin KaFai Lau
This patch allows the user process to use MSG_EOR during
tcp_sendmsg to tell the kernel that it is the last byte
of an application response message.

It is currently useful when the end-user has turned on any bit of the
SOF_TIMESTAMPING_TX_RECORD_MASK (either by setsockopt or cmsg).
The kernel will then mark the newly added tcb->eor_info bit so
that the shinfo->tskey will not be overwritten (i.e. lost) in
the later skb append/collapse operation.

With selective SOF_TIMESTAMPING_TX_ACK (by cmsg) and MSG_EOR (this
patch), the user application can specially tell which outgoing byte
it wants to track its ACK and ask the kernel not to lose this
tracking info in the later skb append/collapse action.

This patch handles the append case in tcp_sendmsg.  The later
patches will handle the collapse during retransmission and
skb slicing in tcp_fragment()/tso_fragment().

One of our use case is at the webserver.  The webserver tracks
the HTTP2 response latency by measuring when the webserver sends
the first byte to the socket till the TCP ACK of the last byte
is received.  In the cases where we don't have client side
measurement, measuring from the server side is the only option.
In the cases we have the client side measurement, the server side
data can also be used to justify/cross-check-with the client
side data.

Signed-off-by: Martin KaFai Lau 
Cc: Eric Dumazet 
Cc: Neal Cardwell 
Cc: Soheil Hassas Yeganeh 
Cc: Willem de Bruijn 
Cc: Yuchung Cheng 
---
 include/net/tcp.h |  5 -
 net/ipv4/tcp.c| 21 +
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index c0ef054..f3c5dcb 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -762,7 +762,10 @@ struct tcp_skb_cb {
 
__u8ip_dsfield; /* IPv4 tos or IPv6 dsfield */
__u8txstamp_ack:1,  /* Record TX timestamp for ack? */
-   unused:7;
+   eor_info:1, /* Any EOR marked info that prevents
+* skbs from merging.
+*/
+   unused:6;
__u32   ack_seq;/* Sequence number ACK'd*/
union {
struct inet_skb_parmh4;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4d73858..2918f42 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -428,15 +428,18 @@ void tcp_init_sock(struct sock *sk)
 }
 EXPORT_SYMBOL(tcp_init_sock);
 
-static void tcp_tx_timestamp(struct sock *sk, u16 tsflags, struct sk_buff *skb)
+static void tcp_tx_timestamp(struct sock *sk, u16 tsflags, struct sk_buff *skb,
+int flags)
 {
if (sk->sk_tsflags || tsflags) {
struct skb_shared_info *shinfo = skb_shinfo(skb);
struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
 
sock_tx_timestamp(sk, tsflags, >tx_flags);
-   if (shinfo->tx_flags & SKBTX_ANY_TSTAMP)
+   if (shinfo->tx_flags & SKBTX_ANY_TSTAMP) {
shinfo->tskey = TCP_SKB_CB(skb)->seq + skb->len - 1;
+   tcb->eor_info = !!(flags & MSG_EOR);
+   }
tcb->txstamp_ack = !!(shinfo->tx_flags & SKBTX_ACK_TSTAMP);
}
 }
@@ -874,6 +877,13 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, 
int flags)
return mss_now;
 }
 
+static bool tcp_sendmsg_noappend(struct sock *sk, u16 tx_tsflags)
+{
+   return unlikely((tx_tsflags & SOF_TIMESTAMPING_TX_RECORD_MASK) &&
+   tcp_send_head(sk) &&
+   TCP_SKB_CB(tcp_write_queue_tail(sk))->eor_info);
+}
+
 static ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
size_t size, int flags)
 {
@@ -959,7 +969,7 @@ new_segment:
offset += copy;
size -= copy;
if (!size) {
-   tcp_tx_timestamp(sk, sk->sk_tsflags, skb);
+   tcp_tx_timestamp(sk, sk->sk_tsflags, skb, 0);
goto out;
}
 
@@ -1145,6 +1155,9 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t size)
 
sg = !!(sk->sk_route_caps & NETIF_F_SG);
 
+   if (tcp_sendmsg_noappend(sk, sockc.tsflags))
+   goto new_segment;
+
while (msg_data_left(msg)) {
int copy = 0;
int max = size_goal;
@@ -1249,7 +1262,7 @@ new_segment:
 
copied += copy;
if (!msg_data_left(msg)) {
-   tcp_tx_timestamp(sk, sockc.tsflags, skb);
+   tcp_tx_timestamp(sk, sockc.tsflags, skb, flags);
goto out;
}
 
-- 
2.5.1