[PATCH net-next] tcp: shrink inet_connection_sock icsk_mtup enabled and probe_size
From: Neal Cardwell

This commit shrinks inet_connection_sock by 4 bytes, by shrinking
icsk_mtup.enabled from 32 bits to 1 bit, and shrinking
icsk_mtup.probe_size from s32 to an unsigned 31 bit field.

This is to save space to compensate for the recent introduction of a
new u32 in inet_connection_sock, icsk_probes_tstamp, in the recent
bug fix commit 9d9b1ee0b2d1 ("tcp: fix TCP_USER_TIMEOUT with zero
window").

This should not change functionality, since icsk_mtup.enabled is only
ever set to 0 or 1, and icsk_mtup.probe_size can only be either 0 or a
positive MTU value returned by tcp_mss_to_mtu().

Signed-off-by: Neal Cardwell
Signed-off-by: Eric Dumazet
---
 include/net/inet_connection_sock.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index c11f80f328f1..10a625760de9 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -120,14 +120,14 @@ struct inet_connection_sock {
 		__u16		  rcv_mss;	/* MSS used for delayed ACK decisions */
 	} icsk_ack;
 	struct {
-		int		  enabled;
-
 		/* Range of MTUs to search */
 		int		  search_high;
 		int		  search_low;
 
 		/* Information on the current probe. */
-		int		  probe_size;
+		u32		  probe_size:31,
+		/* Is the MTUP feature enabled for this connection? */
+				  enabled:1;
 
 		u32		  probe_timestamp;
 	} icsk_mtup;
-- 
2.30.0.365.g02bc693789-goog
[PATCH net] tcp: fix cwnd-limited bug for TSO deferral where we send nothing
From: Neal Cardwell

When cwnd is not a multiple of the TSO skb size of N*MSS, we can get
into persistent scenarios where we have the following sequence:

(1) ACK for full-sized skb of N*MSS arrives
  -> tcp_write_xmit() transmit full-sized skb with N*MSS
  -> move pacing release time forward
  -> exit tcp_write_xmit() because pacing time is in the future

(2) TSQ callback or TCP internal pacing timer fires
  -> try to transmit next skb, but TSO deferral finds remainder of
     available cwnd is not big enough to trigger an immediate send
     now, so we defer sending until the next ACK.

(3) repeat...

So we can get into a case where we never mark ourselves as
cwnd-limited for many seconds at a time, even with bulk/infinite-backlog
senders, because:

o In case (1) above, every time in tcp_write_xmit() we have enough
cwnd to send a full-sized skb, we are not fully using the cwnd
(because cwnd is not a multiple of the TSO skb size). So every time we
send data, we are not cwnd limited, and so in the cwnd-limited
tracking code in tcp_cwnd_validate() we mark ourselves as not
cwnd-limited.

o In case (2) above, every time in tcp_write_xmit() that we try to
transmit the "remainder" of the cwnd but defer, we set the local
variable is_cwnd_limited to true, but we do not send any packets, so
sent_pkts is zero, so we don't call the cwnd-limited logic to update
tp->is_cwnd_limited.

Fixes: ca8a22634381 ("tcp: make cwnd-limited checks measurement-based, and gentler")
Reported-by: Ingemar Johansson
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: Eric Dumazet
---
 net/ipv4/tcp_output.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index bf48cd73e967..99011768c264 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1880,7 +1880,8 @@ static void tcp_cwnd_validate(struct sock *sk, bool is_cwnd_limited)
 	 * window, and remember whether we were cwnd-limited then.
 	 */
 	if (!before(tp->snd_una, tp->max_packets_seq) ||
-	    tp->packets_out > tp->max_packets_out) {
+	    tp->packets_out > tp->max_packets_out ||
+	    is_cwnd_limited) {
 		tp->max_packets_out = tp->packets_out;
 		tp->max_packets_seq = tp->snd_nxt;
 		tp->is_cwnd_limited = is_cwnd_limited;
@@ -2702,6 +2703,10 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 	else
 		tcp_chrono_stop(sk, TCP_CHRONO_RWND_LIMITED);
 
+	is_cwnd_limited |= (tcp_packets_in_flight(tp) >= tp->snd_cwnd);
+	if (likely(sent_pkts || is_cwnd_limited))
+		tcp_cwnd_validate(sk, is_cwnd_limited);
+
 	if (likely(sent_pkts)) {
 		if (tcp_in_cwnd_reduction(sk))
 			tp->prr_out += sent_pkts;
@@ -2709,8 +2714,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		/* Send one loss probe per tail loss episode. */
 		if (push_one != 2)
 			tcp_schedule_loss_probe(sk, false);
-		is_cwnd_limited |= (tcp_packets_in_flight(tp) >= tp->snd_cwnd);
-		tcp_cwnd_validate(sk, is_cwnd_limited);
 		return false;
 	}
 	return !tp->packets_out && !tcp_write_queue_empty(sk);
-- 
2.29.2.576.ga3fc446d84-goog
[PATCH net] tcp: fix to update snd_wl1 in bulk receiver fast path
From: Neal Cardwell

In the header prediction fast path for a bulk data receiver, if no
data is newly acknowledged then we do not call tcp_ack() and do not
call tcp_ack_update_window(). This means that a bulk receiver that
receives large amounts of data can have the incoming sequence numbers
wrap, so that the check in tcp_may_update_window fails:

   after(ack_seq, tp->snd_wl1)

If the incoming receive windows are zero in this state, and then the
connection that was a bulk data receiver later wants to send data,
that connection can find itself persistently rejecting the window
updates in incoming ACKs. This means the connection can persistently
fail to discover that the receive window has opened, which in turn
means that the connection is unable to send anything, and the
connection's sending process can get permanently "stuck".

The fix is to update snd_wl1 in the header prediction fast path for a
bulk data receiver, so that it keeps up and does not see wrapping
problems.

This fix is based on a very nice and thorough analysis and diagnosis
by Apollon Oikonomopoulos (see link below).

This is a stable candidate but there is no Fixes tag here since the
bug predates current git history. Just for fun: looks like the bug
dates back to when header prediction was added in Linux v2.1.8 in Nov
1996. In that version tcp_rcv_established() was added, and the code
only updates snd_wl1 in tcp_ack(), and in the new "Bulk data transfer:
receiver" code path it does not call tcp_ack(). This fix seems to
apply cleanly at least as far back as v3.2.

Signed-off-by: Neal Cardwell
Reported-by: Apollon Oikonomopoulos
Tested-by: Apollon Oikonomopoulos
Link: https://www.spinics.net/lists/netdev/msg692430.html
Acked-by: Soheil Hassas Yeganeh
Acked-by: Yuchung Cheng
Signed-off-by: Eric Dumazet
---
 net/ipv4/tcp_input.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 67f10d3ec240..fc445833b5e5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5827,6 +5827,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 				tcp_data_snd_check(sk);
 				if (!inet_csk_ack_scheduled(sk))
 					goto no_ack;
+			} else {
+				tcp_update_wl(tp, TCP_SKB_CB(skb)->seq);
 			}
 
 			__tcp_ack_snd_check(sk, 0);
-- 
2.29.0.rc1.297.gfa9743e501-goog
[PATCH bpf-next v3 5/5] tcp: simplify tcp_set_congestion_control() load=false case
From: Neal Cardwell

Simplify tcp_set_congestion_control() by removing the initialization
code path for the !load case.

There are only two call sites for tcp_set_congestion_control(). The
EBPF call site is the only one that passes load=false; it also passes
cap_net_admin=true. Because of that, the exact same behavior can be
achieved by removing the special if (!load) branch of the logic. Both
before and after this commit, the EBPF case will call
bpf_try_module_get(); if that succeeds it then calls
tcp_reinit_congestion_control(), and if that fails it returns -EBUSY.

Note that this returns the logic to a structure very similar to the
structure of the logic before commit 91b5b21c7c16 ("bpf: Add support
for changing congestion control"), except that the CAP_NET_ADMIN
status is passed in as a function argument.

This clean-up was suggested by Martin KaFai Lau.

Signed-off-by: Neal Cardwell
Suggested-by: Martin KaFai Lau
Cc: Lawrence Brakmo
Cc: Eric Dumazet
Cc: Yuchung Cheng
Cc: Kevin Yang
---
 net/ipv4/tcp_cong.c | 15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index a9b0fb52a1ec..db47ac24d057 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -362,21 +362,14 @@ int tcp_set_congestion_control(struct sock *sk, const char *name, bool load,
 		goto out;
 	}
 
-	if (!ca) {
+	if (!ca)
 		err = -ENOENT;
-	} else if (!load) {
-		if (bpf_try_module_get(ca, ca->owner)) {
-			tcp_reinit_congestion_control(sk, ca);
-		} else {
-			err = -EBUSY;
-		}
-	} else if (!((ca->flags & TCP_CONG_NON_RESTRICTED) || cap_net_admin)) {
+	else if (!((ca->flags & TCP_CONG_NON_RESTRICTED) || cap_net_admin))
 		err = -EPERM;
-	} else if (!bpf_try_module_get(ca, ca->owner)) {
+	else if (!bpf_try_module_get(ca, ca->owner))
 		err = -EBUSY;
-	} else {
+	else
 		tcp_reinit_congestion_control(sk, ca);
-	}
 out:
 	rcu_read_unlock();
 	return err;
-- 
2.28.0.618.gf4bc123cb7-goog
[PATCH bpf-next v3 0/5] tcp: increase flexibility of EBPF congestion control initialization
From: Neal Cardwell

This patch series reorganizes TCP congestion control initialization
so that if EBPF code called by tcp_init_transfer() sets the congestion
control algorithm by calling setsockopt(TCP_CONGESTION) then the TCP
stack initializes the congestion control module immediately, instead
of having tcp_init_transfer() later initialize the congestion control
module.

This increases flexibility for the EBPF code that runs at connection
establishment time, and simplifies the code.

This has the following benefits:

(1) This allows CC module customizations made by the EBPF called in
    tcp_init_transfer() to persist, and not be wiped out by a later
    call to tcp_init_congestion_control() in tcp_init_transfer().

(2) Does not flip the order of EBPF and CC init, to avoid causing
    bugs for existing code upstream that depends on the current order.

(3) Does not cause 2 initializations for CC in the case where the
    EBPF called in tcp_init_transfer() wants to set the CC to a new
    CC algorithm.

(4) Allows follow-on simplifications to the code in net/core/filter.c
    and net/ipv4/tcp_cong.c, which currently both have some complexity
    to special-case CC initialization to avoid double CC
    initialization if EBPF sets the CC.

changes in v2:
o rebase onto bpf-next
o add another follow-on simplification suggested by Martin KaFai Lau:
  "tcp: simplify tcp_set_congestion_control() load=false case"

changes in v3:
o no change in commits
o resent patch series from @gmail.com, since mail from
  ncardw...@google.com stopped being accepted at netdev@vger.kernel.org
  mid-way through processing the v2 patch series (between patches 2
  and 3), confusing patchwork about which patches belonged to the v2
  patch series

Neal Cardwell (5):
  tcp: only init congestion control if not initialized already
  tcp: simplify EBPF TCP_CONGESTION to always init CC
  tcp: simplify tcp_set_congestion_control(): always reinitialize
  tcp: simplify _bpf_setsockopt(): remove flags argument
  tcp: simplify tcp_set_congestion_control() load=false case

 include/net/inet_connection_sock.h |  3 ++-
 include/net/tcp.h                  |  2 +-
 net/core/filter.c                  | 18 ++++--------------
 net/ipv4/tcp.c                     |  3 ++-
 net/ipv4/tcp_cong.c                | 27 +++++++--------------------
 net/ipv4/tcp_input.c               |  4 +++-
 6 files changed, 19 insertions(+), 38 deletions(-)

-- 
2.28.0.618.gf4bc123cb7-goog
[PATCH bpf-next v3 4/5] tcp: simplify _bpf_setsockopt(): remove flags argument
From: Neal Cardwell

Now that the previous patches have removed the code that uses the
flags argument to _bpf_setsockopt(), we can remove that argument.

Signed-off-by: Neal Cardwell
Acked-by: Yuchung Cheng
Acked-by: Kevin Yang
Signed-off-by: Eric Dumazet
Cc: Lawrence Brakmo
---
 net/core/filter.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index e89d6d7da03c..d266c6941967 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4314,7 +4314,7 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
 };
 
 static int _bpf_setsockopt(struct sock *sk, int level, int optname,
-			   char *optval, int optlen, u32 flags)
+			   char *optval, int optlen)
 {
 	char devname[IFNAMSIZ];
 	int val, valbool;
@@ -4611,9 +4611,7 @@ static int _bpf_getsockopt(struct sock *sk, int level, int optname,
 BPF_CALL_5(bpf_sock_addr_setsockopt, struct bpf_sock_addr_kern *, ctx,
 	   int, level, int, optname, char *, optval, int, optlen)
 {
-	u32 flags = 0;
-	return _bpf_setsockopt(ctx->sk, level, optname, optval, optlen,
-			       flags);
+	return _bpf_setsockopt(ctx->sk, level, optname, optval, optlen);
 }
 
 static const struct bpf_func_proto bpf_sock_addr_setsockopt_proto = {
@@ -4647,9 +4645,7 @@ static const struct bpf_func_proto bpf_sock_addr_getsockopt_proto = {
 BPF_CALL_5(bpf_sock_ops_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 	   int, level, int, optname, char *, optval, int, optlen)
 {
-	u32 flags = 0;
-	return _bpf_setsockopt(bpf_sock->sk, level, optname, optval, optlen,
-			       flags);
+	return _bpf_setsockopt(bpf_sock->sk, level, optname, optval, optlen);
 }
 
 static const struct bpf_func_proto bpf_sock_ops_setsockopt_proto = {
-- 
2.28.0.618.gf4bc123cb7-goog
[PATCH bpf-next v3 2/5] tcp: simplify EBPF TCP_CONGESTION to always init CC
From: Neal Cardwell

Now that the previous patch ensures we don't initialize the congestion
control twice, when EBPF sets the congestion control algorithm at
connection establishment we can simplify the code by simply
initializing the congestion control module at that time.

Signed-off-by: Neal Cardwell
Acked-by: Yuchung Cheng
Acked-by: Kevin Yang
Signed-off-by: Eric Dumazet
Cc: Lawrence Brakmo
---
 net/core/filter.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 47eef9a0be6a..067f6759a68f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4313,8 +4313,6 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
 	.arg1_type      = ARG_PTR_TO_CTX,
 };
 
-#define SOCKOPT_CC_REINIT (1 << 0)
-
 static int _bpf_setsockopt(struct sock *sk, int level, int optname,
 			   char *optval, int optlen, u32 flags)
 {
@@ -4449,13 +4447,12 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
 		    sk->sk_prot->setsockopt == tcp_setsockopt) {
 			if (optname == TCP_CONGESTION) {
 				char name[TCP_CA_NAME_MAX];
-				bool reinit = flags & SOCKOPT_CC_REINIT;
 
 				strncpy(name, optval, min_t(long, optlen,
 							    TCP_CA_NAME_MAX-1));
 				name[TCP_CA_NAME_MAX-1] = 0;
 				ret = tcp_set_congestion_control(sk, name, false,
-								 reinit, true);
+								 true, true);
 			} else {
 				struct inet_connection_sock *icsk = inet_csk(sk);
 				struct tcp_sock *tp = tcp_sk(sk);
@@ -4652,8 +4649,6 @@ BPF_CALL_5(bpf_sock_ops_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 	   int, level, int, optname, char *, optval, int, optlen)
 {
 	u32 flags = 0;
-	if (bpf_sock->op > BPF_SOCK_OPS_NEEDS_ECN)
-		flags |= SOCKOPT_CC_REINIT;
 	return _bpf_setsockopt(bpf_sock->sk, level, optname, optval, optlen,
 			       flags);
 }
-- 
2.28.0.618.gf4bc123cb7-goog
[PATCH bpf-next v3 3/5] tcp: simplify tcp_set_congestion_control(): always reinitialize
From: Neal Cardwell

Now that the previous patches ensure that all call sites for
tcp_set_congestion_control() want to initialize congestion control, we
can simplify tcp_set_congestion_control() by removing the reinit
argument and the code to support it.

Signed-off-by: Neal Cardwell
Acked-by: Yuchung Cheng
Acked-by: Kevin Yang
Signed-off-by: Eric Dumazet
Cc: Lawrence Brakmo
---
 include/net/tcp.h   |  2 +-
 net/core/filter.c   |  3 +--
 net/ipv4/tcp.c      |  2 +-
 net/ipv4/tcp_cong.c | 11 ++---------
 4 files changed, 5 insertions(+), 13 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index e85d564446c6..f857146c17a5 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1104,7 +1104,7 @@ void tcp_get_available_congestion_control(char *buf, size_t len);
 void tcp_get_allowed_congestion_control(char *buf, size_t len);
 int tcp_set_allowed_congestion_control(char *allowed);
 int tcp_set_congestion_control(struct sock *sk, const char *name, bool load,
-			       bool reinit, bool cap_net_admin);
+			       bool cap_net_admin);
 u32 tcp_slow_start(struct tcp_sock *tp, u32 acked);
 void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w, u32 acked);
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 067f6759a68f..e89d6d7da03c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4451,8 +4451,7 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
 				strncpy(name, optval, min_t(long, optlen,
 							    TCP_CA_NAME_MAX-1));
 				name[TCP_CA_NAME_MAX-1] = 0;
-				ret = tcp_set_congestion_control(sk, name, false,
-								 true, true);
+				ret = tcp_set_congestion_control(sk, name, false, true);
 			} else {
 				struct inet_connection_sock *icsk = inet_csk(sk);
 				struct tcp_sock *tp = tcp_sk(sk);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 7360d3db2b61..e58ab9db73ff 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3050,7 +3050,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level, int optname,
 		name[val] = 0;
 
 		lock_sock(sk);
-		err = tcp_set_congestion_control(sk, name, true, true,
+		err = tcp_set_congestion_control(sk, name, true,
 						 ns_capable(sock_net(sk)->user_ns,
 							    CAP_NET_ADMIN));
 		release_sock(sk);
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index d18d7a1ce4ce..a9b0fb52a1ec 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -341,7 +341,7 @@ int tcp_set_allowed_congestion_control(char *val)
  * already initialized.
  */
 int tcp_set_congestion_control(struct sock *sk, const char *name, bool load,
-			       bool reinit, bool cap_net_admin)
+			       bool cap_net_admin)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	const struct tcp_congestion_ops *ca;
@@ -365,15 +365,8 @@ int tcp_set_congestion_control(struct sock *sk, const char *name, bool load,
 	if (!ca) {
 		err = -ENOENT;
 	} else if (!load) {
-		const struct tcp_congestion_ops *old_ca = icsk->icsk_ca_ops;
-
 		if (bpf_try_module_get(ca, ca->owner)) {
-			if (reinit) {
-				tcp_reinit_congestion_control(sk, ca);
-			} else {
-				icsk->icsk_ca_ops = ca;
-				bpf_module_put(old_ca, old_ca->owner);
-			}
+			tcp_reinit_congestion_control(sk, ca);
 		} else {
 			err = -EBUSY;
 		}
-- 
2.28.0.618.gf4bc123cb7-goog
[PATCH bpf-next v3 1/5] tcp: only init congestion control if not initialized already
From: Neal Cardwell

Change tcp_init_transfer() to only initialize congestion control if it
has not been initialized already.

With this new approach, we can arrange things so that if the EBPF code
sets the congestion control by calling setsockopt(TCP_CONGESTION) then
tcp_init_transfer() will not re-initialize the CC module.

This is an approach that has the following beneficial properties:

(1) This allows CC module customizations made by the EBPF called in
    tcp_init_transfer() to persist, and not be wiped out by a later
    call to tcp_init_congestion_control() in tcp_init_transfer().

(2) Does not flip the order of EBPF and CC init, to avoid causing
    bugs for existing code upstream that depends on the current order.

(3) Does not cause 2 initializations for CC in the case where the
    EBPF called in tcp_init_transfer() wants to set the CC to a new
    CC algorithm.

(4) Allows follow-on simplifications to the code in net/core/filter.c
    and net/ipv4/tcp_cong.c, which currently both have some complexity
    to special-case CC initialization to avoid double CC
    initialization if EBPF sets the CC.

Signed-off-by: Neal Cardwell
Acked-by: Yuchung Cheng
Acked-by: Kevin Yang
Signed-off-by: Eric Dumazet
Cc: Lawrence Brakmo
---
 include/net/inet_connection_sock.h | 3 ++-
 net/ipv4/tcp.c                     | 1 +
 net/ipv4/tcp_cong.c                | 3 ++-
 net/ipv4/tcp_input.c               | 4 +++-
 4 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index c738abeb3265..dc763ca9413c 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -96,7 +96,8 @@ struct inet_connection_sock {
 	void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq);
 	struct hlist_node	  icsk_listen_portaddr_node;
 	unsigned int		  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
-	__u8			  icsk_ca_state:6,
+	__u8			  icsk_ca_state:5,
+				  icsk_ca_initialized:1,
 				  icsk_ca_setsockopt:1,
 				  icsk_ca_dst_locked:1;
 	__u8			  icsk_retransmits;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 57a568875539..7360d3db2b61 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2698,6 +2698,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 	if (icsk->icsk_ca_ops->release)
 		icsk->icsk_ca_ops->release(sk);
 	memset(icsk->icsk_ca_priv, 0, sizeof(icsk->icsk_ca_priv));
+	icsk->icsk_ca_initialized = 0;
 	tcp_set_ca_state(sk, TCP_CA_Open);
 	tp->is_sack_reneg = 0;
 	tcp_clear_retrans(tp);
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 62878cf26d9c..d18d7a1ce4ce 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -176,7 +176,7 @@ void tcp_assign_congestion_control(struct sock *sk)
 
 void tcp_init_congestion_control(struct sock *sk)
 {
-	const struct inet_connection_sock *icsk = inet_csk(sk);
+	struct inet_connection_sock *icsk = inet_csk(sk);
 
 	tcp_sk(sk)->prior_ssthresh = 0;
 	if (icsk->icsk_ca_ops->init)
@@ -185,6 +185,7 @@ void tcp_init_congestion_control(struct sock *sk)
 		INET_ECN_xmit(sk);
 	else
 		INET_ECN_dontxmit(sk);
+	icsk->icsk_ca_initialized = 1;
 }
 
 static void tcp_reinit_congestion_control(struct sock *sk,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4337841faeff..0e5ac0d33fd3 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5894,8 +5894,10 @@ void tcp_init_transfer(struct sock *sk, int bpf_op, struct sk_buff *skb)
 	tp->snd_cwnd = tcp_init_cwnd(tp, __sk_dst_get(sk));
 	tp->snd_cwnd_stamp = tcp_jiffies32;
 
+	icsk->icsk_ca_initialized = 0;
 	bpf_skops_established(sk, bpf_op, skb);
-	tcp_init_congestion_control(sk);
+	if (!icsk->icsk_ca_initialized)
+		tcp_init_congestion_control(sk);
 	tcp_init_buffer_space(sk);
 }
 
-- 
2.28.0.618.gf4bc123cb7-goog
tcp_set_congestion_control(sk, name, true, ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)); release_sock(sk); diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c index d18d7a1ce4ce..a9b0fb52a1ec 100644 --- a/net/ipv4/tcp_cong.c +++ b/net/ipv4/tcp_cong.c @@ -341,7 +341,7 @@ int tcp_set_allowed_congestion_control(char *val) * already initialized. */ int tcp_set_congestion_control(struct sock *sk, const char *name, bool load, - bool reinit, bool cap_net_admin) + bool cap_net_admin) { struct inet_connection_sock *icsk = inet_csk(sk); const struct tcp_congestion_ops *ca; @@ -365,15 +365,8 @@ int tcp_set_congestion_control(struct sock *sk, const char *name, bool load, if (!ca) { err = -ENOENT; } else if (!load) { - const struct tcp_congestion_ops *old_ca = icsk->icsk_ca_ops; - if (bpf_try_module_get(ca, ca->owner)) { - if (reinit) { - tcp_reinit_congestion_control(sk, ca); - } else { - icsk->icsk_ca_ops = ca; - bpf_module_put(old_ca, old_ca->owner); - } + tcp_reinit_congestion_control(sk, ca); } else { err = -EBUSY; } -- 2.28.0.526.ge36021eeef-goog
Re: [net-next] tcp: add TCP_INFO status for failed client TFO
On Mon, Oct 21, 2019 at 5:11 PM Jason Baron wrote: > > > > On 10/21/19 4:36 PM, Eric Dumazet wrote: > > On Mon, Oct 21, 2019 at 12:53 PM Christoph Paasch wrote: > >> > > > >> Actually, longterm I hope we would be able to get rid of the > >> blackhole-detection and fallback heuristics. In a far distant future where > >> these middleboxes have been weeded out ;-) > >> > >> So, do we really want to eternalize this as part of the API in tcp_info ? > >> > > > > A getsockopt() option won't be available for netlink diag (ss command). > > > > So it all depends on plans to use this FASTOPEN information ? > > > > The original proposal I had 4 states of interest: > > First, we are interested in knowing when a socket has TFO set but > actually requires a retransmit of a non-TFO syn to become established. > In this case, we'd be better off not using TFO. > > A second case is when the server immediately rejects the DATA and just > acks the syn (but not the data). Again in that case, we don't want to be > sending syn+data. > > The third case was whether or not we sent a cookie. Perhaps, the server > doesn't have TFO enabled in which case, it really doesn't make make > sense to enable TFO in the first place. Or if one also controls the > server its helpful in understanding if the server is mis-configured. So > that was the motivation I had for the original four states that I > proposed (final state was a catch-all state). > > Yuchung suggested dividing the 3rd case into 2 for - no cookie sent > because of blackhole or no cookie sent because its not in cache. And > then dropping the second state because we already have the > TCPI_OPT_SYN_DATA bit. However, the TCPI_OPT_SYN_DATA may not be set > because we may fallback in tcp_send_syn_data() due to send failure. So > I'm inclined to say that the second state is valuable. And since > blackhole is already a global thing via MIB, I didn't see a strong need > for it. 
> But it sounded like Yuchung had an interest in it, and I'd
> obviously like a set of states that is generally useful.

I have not kept up with all the proposals in this thread, but would it work to include all of the cases proposed in this thread? It seems like there are enough bits available in holes in tcp_sock and tcp_info to have up to 7 bits of information saved in tcp_sock and exported to tcp_info. That seems like more than enough?

The pahole tool shows the following small holes in tcp_sock:

        u8                         compressed_ack;        /* 1530    1 */
        /* XXX 1 byte hole, try to pack */
        u32                        chrono_start;          /* 1532    4 */
  ...
        u8                         bpf_sock_ops_cb_flags; /* 2076    1 */
        /* XXX 3 bytes hole, try to pack */
        u32                        rcv_ooopack;           /* 2080    4 */

...and the following hole in tcp_info:

        __u8                       tcpi_delivery_rate_app_limited:1; /*  7: 7  1 */
        /* XXX 7 bits hole, try to pack */
        __u32                      tcpi_rto;              /*    8    4 */

cheers,
neal
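As a sketch of the bit-hole mechanics discussed above: adding a small bitfield after a 1-bit member in a byte-sized hole leaves the structure size unchanged. The field names here are illustrative stand-ins, not the actual tcp_info layout:

```c
#include <assert.h>
#include <stdint.h>

/* A 1-bit field in a byte leaves 7 bits of slack before the next
 * naturally aligned member. */
struct info_old {
	uint8_t  app_limited:1;          /* 7-bit hole follows */
	uint32_t rto;
};

/* A hypothetical new 2-bit field lands in that hole for free. */
struct info_new {
	uint8_t  app_limited:1;
	uint8_t  fastopen_client_fail:2; /* illustrative name */
	uint32_t rto;
};
```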
Re: Crash when receiving FIN-ACK in TCP_FIN_WAIT1 state
On Mon, Oct 21, 2019 at 8:04 PM Subash Abhinov Kasiviswanathan wrote: > > > Interesting! As tcp_input.c summarizes, "packets_out is > > SND.NXT-SND.UNA counted in packets". In the normal operation of a > > socket, tp->packets_out should not be 0 if any of those other fields > > are non-zero. > > > > The tcp_write_queue_purge() function sets packets_out to 0: > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/tree/net/ipv4/tcp.c?h=v4.19#n2526 > > > > So the execution of tcp_write_queue_purge() before this point is one > > way for the socket to end up in this weird state. > > > > In one of the instances, the values are tp->snd_nxt = 1016118098, > tp->snd_una = 1016047820 > > tp->mss_cache = 1378 > > I assume the number of outstanding segments should be > (tp->snd_nxt - tp->snd_una)/tp->mss_cache = 51 That would be a good expectation if all the packets were full-sized. > tp->packets_out = 0 and tp->sacked_out = 158 in this case. OK, thanks. It could be that sacked_out is reasonable and some of the packets were not full-sized. But, as discussed above, typically the packets_out should not be 0 if sacked_out is non-zero (with at least the exception of the tcp_write_queue_purge() case). > >> > Yes, one guess would be that somehow the skbs in the retransmit queue > >> > have been freed, but tp->sacked_out is still non-zero and > >> > tp->highest_sack is still a dangling pointer into one of those freed > >> > skbs. The tcp_write_queue_purge() function is one function that frees > >> > the skbs in the retransmit queue and leaves tp->sacked_out as non-zero > >> > and tp->highest_sack as a dangling pointer to a freed skb, AFAICT, so > >> > that's why I'm wondering about that function. I can't think of a > >> > specific sequence of events that would involve tcp_write_queue_purge() > >> > and then a socket that's still in FIN-WAIT1. Maybe I'm not being > >> > creative enough, or maybe that guess is on the wrong track.
Would you > >> > be able to set a new bit in the tcp_sock in tcp_write_queue_purge() > >> > and log it in your instrumentation point, to see if > >> > tcp_write_queue_purge() was called for these connections that cause > >> > this crash? > > I've queued up a build which logs calls to tcp_write_queue_purge and > clears tp->highest_sack and tp->sacked_out. I will let you know how > it fares by end of week. OK, thanks. That should be a useful data point. cheers, neal
Re: Crash when receiving FIN-ACK in TCP_FIN_WAIT1 state
On Sun, Oct 20, 2019 at 10:45 PM Subash Abhinov Kasiviswanathan wrote: > > > FIN-WAIT1 just means the local application has called close() or > > shutdown() to shut down the sending direction of the socket, and the > > local TCP stack has sent a FIN, and is waiting to receive a FIN and an > > ACK from the other side (in either order, or simultaneously). The > > ASCII art state transition diagram on page 22 of RFC 793 (e.g. > > https://tools.ietf.org/html/rfc793#section-3.2 ) is one source for > > this, though the W. Richard Stevens books have a much more readable > > diagram. > > > > There may still be unacked and SACKed data in the retransmit queue at > > this point. > > > > Thanks for the clarification. > > > Thanks, that is a useful data point. Do you know what particular value > > tp->sacked_out has? Would you be able to capture/log the value of > > tp->packets_out, tp->lost_out, and tp->retrans_out as well? > > > > tp->sacked_out varies per crash instance - 55, 180 etc. > However the other values are always the same - tp->packets_out is 0, > tp->lost_out is 1 and tp->retrans_out is 1. Interesting! As tcp_input.c summarizes, "packets_out is SND.NXT-SND.UNA counted in packets". In the normal operation of a socket, tp->packets_out should not be 0 if any of those other fields are non-zero. The tcp_write_queue_purge() function sets packets_out to 0: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/tree/net/ipv4/tcp.c?h=v4.19#n2526 So the execution of tcp_write_queue_purge() before this point is one way for the socket to end up in this weird state. > > Yes, one guess would be that somehow the skbs in the retransmit queue > > have been freed, but tp->sacked_out is still non-zero and > > tp->highest_sack is still a dangling pointer into one of those freed > > skbs.
The tcp_write_queue_purge() function is one function that frees > > the skbs in the retransmit queue and leaves tp->sacked_out as non-zero > > and tp->highest_sack as a dangling pointer to a freed skb, AFAICT, so > > that's why I'm wondering about that function. I can't think of a > > specific sequence of events that would involve tcp_write_queue_purge() > > and then a socket that's still in FIN-WAIT1. Maybe I'm not being > > creative enough, or maybe that guess is on the wrong track. Would you > > be able to set a new bit in the tcp_sock in tcp_write_queue_purge() > > and log it in your instrumentation point, to see if > > tcp_write_queue_purge() was called for these connections that cause > > this crash? > > Sure, I can try this out. Great! Thanks! neal
Re: Crash when receiving FIN-ACK in TCP_FIN_WAIT1 state
On Sun, Oct 20, 2019 at 7:15 PM Subash Abhinov Kasiviswanathan wrote: > > > Hmm. Random related thought while searching for a possible cause: I > > wonder if tcp_write_queue_purge() should clear tp->highest_sack (and > > possibly tp->sacked_out)? The tcp_write_queue_purge() code is careful > > to call tcp_clear_all_retrans_hints(tcp_sk(sk)) and I would imagine > > that similar considerations would imply that we should clear at least > > tp->highest_sack? > > > > neal > > Hi Neal > > If the socket is in FIN-WAIT1, does that mean that all the segments > corresponding to SACK blocks are sent and ACKed already? FIN-WAIT1 just means the local application has called close() or shutdown() to shut down the sending direction of the socket, and the local TCP stack has sent a FIN, and is waiting to receive a FIN and an ACK from the other side (in either order, or simultaneously). The ASCII art state transition diagram on page 22 of RFC 793 (e.g. https://tools.ietf.org/html/rfc793#section-3.2 ) is one source for this, though the W. Richard Stevens books have a much more readable diagram. There may still be unacked and SACKed data in the retransmit queue at this point. > tp->sacked_out is non zero in all these crashes Thanks, that is a useful data point. Do you know what particular value tp->sacked_out has? Would you be able to capture/log the value of tp->packets_out, tp->lost_out, and tp->retrans_out as well? > (is the SACK information possibly invalid or stale here?). Yes, one guess would be that somehow the skbs in the retransmit queue have been freed, but tp->sacked_out is still non-zero and tp->highest_sack is still a dangling pointer into one of those freed skbs. The tcp_write_queue_purge() function is one function that frees the skbs in the retransmit queue and leaves tp->sacked_out as non-zero and tp->highest_sack as a dangling pointer to a freed skb, AFAICT, so that's why I'm wondering about that function.
I can't think of a specific sequence of events that would involve tcp_write_queue_purge() and then a socket that's still in FIN-WAIT1. Maybe I'm not being creative enough, or maybe that guess is on the wrong track. Would you be able to set a new bit in the tcp_sock in tcp_write_queue_purge() and log it in your instrumentation point, to see if tcp_write_queue_purge() was called for these connections that cause this crash? thanks, neal
Re: Crash when receiving FIN-ACK in TCP_FIN_WAIT1 state
On Sun, Oct 20, 2019 at 4:25 PM Subash Abhinov Kasiviswanathan wrote: > > We are seeing a crash in the TCP ACK codepath often in our regression > racks with an ARM64 device with 4.19 based kernel. > > It appears that the tp->highest_sack is invalid when being accessed when a > FIN-ACK is received. In all the instances of the crash, the tcp socket > is in TCP_FIN_WAIT1 state. Hmm. Random related thought while searching for a possible cause: I wonder if tcp_write_queue_purge() should clear tp->highest_sack (and possibly tp->sacked_out)? The tcp_write_queue_purge() code is careful to call tcp_clear_all_retrans_hints(tcp_sk(sk)) and I would imagine that similar considerations would imply that we should clear at least tp->highest_sack? neal
Re: [net-next] tcp: add TCP_INFO status for failed client TFO
On Fri, Oct 18, 2019 at 3:03 PM Jason Baron wrote: > > The TCPI_OPT_SYN_DATA bit as part of tcpi_options currently reports whether > or not data-in-SYN was ack'd on both the client and server side. We'd like > to gather more information on the client-side in the failure case in order > to indicate the reason for the failure. This can be useful for not only > debugging TFO, but also for creating TFO socket policies. For example, if > a middle box removes the TFO option or drops a data-in-SYN, we can > detect this case, and turn off TFO for these connections saving the > extra retransmits. > > The newly added tcpi_fastopen_client_fail status is 2 bits and has 4 > states: > > 1) TFO_STATUS_UNSPEC > > catch-all. > > 2) TFO_NO_COOKIE_SENT > > If TFO_CLIENT_NO_COOKIE mode is off, this state indicates that no cookie > was sent because we don't have one yet, it's not in cache or black-holing > may be enabled (already indicated by the global > LINUX_MIB_TCPFASTOPENBLACKHOLE). > > 3) TFO_NO_SYN_DATA > > Data was sent with SYN, we received a SYN/ACK but it did not cover the data > portion. Cookie is not accepted by server because the cookie may be invalid > or the server may be overloaded. > > > 4) TFO_NO_SYN_DATA_TIMEOUT > > Data was sent with SYN, we received a SYN/ACK which did not cover the data > after at least 1 additional SYN was sent (without data). It may be the case > that a middle-box is dropping data-in-SYN packets. Thus, it would be more > efficient to not use TFO on this connection to avoid extra retransmits > during connection establishment. > > These new fields certainly do not cover all the cases where TFO may fail, but > other failures, such as SYN/ACK + data being dropped, will result in the > connection not becoming established. And a connection blackhole after > session establishment shows up as a stalled connection. > > Signed-off-by: Jason Baron > Cc: Eric Dumazet > Cc: Neal Cardwell > Cc: Christoph Paasch > --- Thanks for adding this!
It would be good to reset tp->fastopen_client_fail to 0 in tcp_disconnect(). > +/* why fastopen failed from client perspective */ > +enum tcp_fastopen_client_fail { > + TFO_STATUS_UNSPEC, /* catch-all */ > + TFO_NO_COOKIE_SENT, /* if not in TFO_CLIENT_NO_COOKIE mode */ > + TFO_NO_SYN_DATA, /* SYN-ACK did not ack SYN data */ I found the "TFO_NO_SYN_DATA" name a little unintuitive; it sounded to me like this means the client didn't send a SYN+DATA. What about "TFO_DATA_NOT_ACKED", or something like that? If you don't mind, it would be great to cc: Yuchung on the next rev. thanks, neal
Re: [PATCH v2] tcp: Add TCP_INFO counter for packets received out-of-order
On Tue, Sep 17, 2019 at 1:22 PM Eric Dumazet wrote: > > On Tue, Sep 17, 2019 at 10:13 AM Jason Baron wrote: > > > > > > Hi, > > > > I was interested in adding a field to tcp_info around the TFO state of a > > socket. So for the server side it would indicate if TFO was used to > > create the socket and on the client side it would report whether TFO > > worked and if not that it failed with maybe some additional states > > around why it failed. I'm thinking it would be maybe 3 bits. BTW, one aspect of that "did TFO work" info is available already in tcp_info in the tcpi_options field. Kernel side is: if (tp->syn_data_acked) info->tcpi_options |= TCPI_OPT_SYN_DATA; We use this bit in packetdrill tests on client and server side to check that the TFO data-in-SYN succeeded: +0 %{ assert (tcpi_options & TCPI_OPT_SYN_DATA) != 0, tcpi_options }% These TFO bits were added much later than the other bits, so IMHO it would be OK to add more bits somewhere unused in tcp_info to indicate reasons for TFO failure. Especially if, as you suggest, "0" as a code point could indicate that the code point is undefined, and all meaningful code points were non-zero. neal > > My question is whether it's reasonable to use the unused bits of > > __u8 tcpi_delivery_rate_app_limited:1;. Or is this not good because > > the size hasn't changed? What if I avoided using 0 for the new field to > > avoid the possibility of not knowing if 0 because it's the old kernel or > > 0 because that's now its a TFO state? IE the new field could always be > > > 0 for the new kernel. > > > > I guess that storing the 'why it has failed' would need more bits. > > I suggest maybe using an event for this, instead of TCP_INFO ? > > As of using the bits, maybe the monitoring application does not really care > if running on an old kernel where the bits would be zero.
> > Commit eb8329e0a04db0061f714f033b4454326ba147f4 reserved a single > bit and did not bother about making sure the monitoring would detect if this > runs on an old kernel.
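For reference, the tcpi_options check described in the thread above looks like this from user space, once getsockopt(TCP_INFO) has filled in struct tcp_info (that call is omitted here; the TCPI_OPT_SYN_DATA value is copied from the uapi header):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TCPI_OPT_SYN_DATA 32 /* value from include/uapi/linux/tcp.h */

/* Client side: was the data we carried in our SYN acked by the peer,
 * given the tcpi_options field returned by getsockopt(TCP_INFO)? */
static bool tfo_syn_data_acked(uint32_t tcpi_options)
{
	return (tcpi_options & TCPI_OPT_SYN_DATA) != 0;
}
```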
Re: [PATCH v5 2/2] tcp: Add snd_wnd to TCP_INFO
On Fri, Sep 13, 2019 at 7:23 PM Thomas Higdon wrote: > > Neal Cardwell mentioned that snd_wnd would be useful for diagnosing TCP > performance problems -- > > (1) Usually when we're diagnosing TCP performance problems, we do so > > from the sender, since the sender makes most of the > > performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc). > > From the sender-side the thing that would be most useful is to see > > tp->snd_wnd, the receive window that the receiver has advertised to > > the sender. > > This serves the purpose of adding an additional __u32 to avoid the > would-be hole caused by the addition of the tcpi_rcvi_ooopack field. > > Signed-off-by: Thomas Higdon > --- > changes since v4: > - clarify comment Acked-by: Neal Cardwell Thanks! neal
Re: [PATCH v5 1/2] tcp: Add TCP_INFO counter for packets received out-of-order
On Fri, Sep 13, 2019 at 7:23 PM Thomas Higdon wrote: > > For receive-heavy cases on the server-side, we want to track the > connection quality for individual client IPs. This counter, similar to > the existing system-wide TCPOFOQueue counter in /proc/net/netstat, > tracks out-of-order packet reception. By providing this counter in > TCP_INFO, it will allow understanding to what degree receive-heavy > sockets are experiencing out-of-order delivery and packet drops > indicating congestion. > > Please note that this is similar to the counter in NetBSD TCP_INFO, and > has the same name. > > Also note that we avoid increasing the size of the tcp_sock struct by > taking advantage of a hole. > > Signed-off-by: Thomas Higdon > --- > changes since v4: > - optimize placement of rcv_ooopack to avoid increasing tcp_sock struct >size Acked-by: Neal Cardwell Thanks, Thomas, for adding this! After this is merged, would you mind sending a patch to add support to the "ss" command line tool to print these 2 new fields? My favorite recent example of such a patch to ss is Eric's change: https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/commit/misc/ss.c?id=5eead6270a19f00464052d4084f32182cfe027ff thanks, neal
Re: [PATCH v4 2/2] tcp: Add snd_wnd to TCP_INFO
On Fri, Sep 13, 2019 at 5:29 PM Yuchung Cheng wrote: > > What if the comment is shortened up to fit in 80 columns and the units > > (bytes) are added, something like: > > > > __u32 tcpi_snd_wnd;/* peer's advertised recv window > > (bytes) */ > just a thought: will tcpi_peer_rcv_wnd be more self-explanatory? Good suggestion. I'm on the fence about that one. By itself, I agree tcpi_peer_rcv_wnd sounds much more clear. But tcpi_snd_wnd has the virtue of matching both the kernel code (tp->snd_wnd) and RFC 793 (SND.WND). So they both have pros and cons. Maybe someone else feels more strongly one way or the other. neal
Re: [PATCH v4 2/2] tcp: Add snd_wnd to TCP_INFO
On Fri, Sep 13, 2019 at 3:36 PM Thomas Higdon wrote: > > Neal Cardwell mentioned that snd_wnd would be useful for diagnosing TCP > performance problems -- > > (1) Usually when we're diagnosing TCP performance problems, we do so > > from the sender, since the sender makes most of the > > performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc). > > From the sender-side the thing that would be most useful is to see > > tp->snd_wnd, the receive window that the receiver has advertised to > > the sender. > > This serves the purpose of adding an additional __u32 to avoid the > would-be hole caused by the addition of the tcpi_rcvi_ooopack field. > > Signed-off-by: Thomas Higdon > --- > changes from v3: > - changed from rcv_wnd to snd_wnd > > include/uapi/linux/tcp.h | 1 + > net/ipv4/tcp.c | 1 + > 2 files changed, 2 insertions(+) > > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h > index 20237987ccc8..240654f22d98 100644 > --- a/include/uapi/linux/tcp.h > +++ b/include/uapi/linux/tcp.h > @@ -272,6 +272,7 @@ struct tcp_info { > __u32 tcpi_reord_seen; /* reordering events seen */ > > __u32 tcpi_rcv_ooopack;/* Out-of-order packets received */ > + __u32 tcpi_snd_wnd;/* Remote peer's advertised recv window > size */ > }; Thanks for adding this! My run of ./scripts/checkpatch.pl is showing a warning on this line: WARNING: line over 80 characters #19: FILE: include/uapi/linux/tcp.h:273: + __u32 tcpi_snd_wnd;/* Remote peer's advertised recv window size */ What if the comment is shortened up to fit in 80 columns and the units (bytes) are added, something like: __u32 tcpi_snd_wnd;/* peer's advertised recv window (bytes) */ neal
Re: [PATCH v4 1/2] tcp: Add TCP_INFO counter for packets received out-of-order
On Fri, Sep 13, 2019 at 3:37 PM Thomas Higdon wrote: > > For receive-heavy cases on the server-side, we want to track the > connection quality for individual client IPs. This counter, similar to > the existing system-wide TCPOFOQueue counter in /proc/net/netstat, > tracks out-of-order packet reception. By providing this counter in > TCP_INFO, it will allow understanding to what degree receive-heavy > sockets are experiencing out-of-order delivery and packet drops > indicating congestion. > > Please note that this is similar to the counter in NetBSD TCP_INFO, and > has the same name. > > Signed-off-by: Thomas Higdon > --- > > no changes from v3 > > include/linux/tcp.h | 2 ++ > include/uapi/linux/tcp.h | 2 ++ > net/ipv4/tcp.c | 2 ++ > net/ipv4/tcp_input.c | 1 + > 4 files changed, 7 insertions(+) > > diff --git a/include/linux/tcp.h b/include/linux/tcp.h > index f3a85a7fb4b1..a01dc78218f1 100644 > --- a/include/linux/tcp.h > +++ b/include/linux/tcp.h > @@ -393,6 +393,8 @@ struct tcp_sock { > */ > struct request_sock *fastopen_rsk; > u32 *saved_syn; > + > + u32 rcv_ooopack; /* Received out-of-order packets, for tcpinfo */ Thanks for adding this. A thought: putting the new rcv_ooopack field here makes struct tcp_sock bigger, and increases the odds of taking a cache miss (according to "pahole" this field is the only one in a new cache line). I'd suggest putting the new rcv_ooopack field immediately before rcv_rtt_last_tsecr. That would use up a 4-byte hole, and would place it in a cache line already used on TCP receivers (for rcv_rtt logic). This would make it less likely this new field causes more cache misses or uses more space. Details: looking at the output of "pahole" for tcp_sock in various cases: net-next before this patch: - ... 
        u8                         bpf_sock_ops_cb_flags; /* 2076    1 */
        /* XXX 3 bytes hole, try to pack */
        u32                        rcv_rtt_last_tsecr;    /* 2080    4 */
        /* XXX 4 bytes hole, try to pack */
        struct {
                u32                rtt_us;                /* 2088    4 */
                u32                seq;                   /* 2092    4 */
                u64                time;                  /* 2096    8 */
        } rcv_rtt_est;                                    /* 2088   16 */
  ...
        /* size: 2176, cachelines: 34, members: 134 */
        /* sum members: 2164, holes: 4, sum holes: 12 */
        /* paddings: 3, sum paddings: 12 */

net-next with this patch:
-------------------------
  ...
        u32 *                      saved_syn;             /* 2168    8 */
        /* --- cacheline 34 boundary (2176 bytes) --- */
        u32                        rcv_ooopack;           /* 2176    4 */
  ...
        /* size: 2184, cachelines: 35, members: 135 */
        /* sum members: 2168, holes: 4, sum holes: 12 */
        /* padding: 4 */
        /* paddings: 3, sum paddings: 12 */
        /* last cacheline: 8 bytes */

net-next with this field in the suggested spot:
-----------------------------------------------
  ...
        /* XXX 3 bytes hole, try to pack */
        u32                        rcv_ooopack;           /* 2080    4 */
        u32                        rcv_rtt_last_tsecr;    /* 2084    4 */
        struct {
                u32                rtt_us;                /* 2088    4 */
                u32                seq;                   /* 2092    4 */
                u64                time;                  /* 2096    8 */
        } rcv_rtt_est;                                    /* 2088   16 */
  ...
        /* size: 2176, cachelines: 34, members: 135 */
        /* sum members: 2168, holes: 3, sum holes: 8 */
        /* paddings: 3, sum paddings: 12 */

neal
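The layout argument above can be reproduced in miniature: a u32 followed by an 8-byte-aligned member leaves a 4-byte hole, and placing another u32 in front of it consumes the hole without growing the structure. This assumes a typical LP64 ABI; the field names merely echo the tcp_sock members under discussion:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct with_hole {
	uint32_t rcv_rtt_last_tsecr;
	/* 4-byte hole: the next member is 8-byte aligned */
	uint64_t rcv_rtt_est_time;
};

struct hole_filled {
	uint32_t rcv_ooopack;        /* new field lands in the hole */
	uint32_t rcv_rtt_last_tsecr;
	uint64_t rcv_rtt_est_time;
};
```

Both structs are 16 bytes: the new u32 is effectively free, which is the whole point of packing it before rcv_rtt_last_tsecr instead of at the end of tcp_sock.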
Re: [PATCH v3 2/2] tcp: Add rcv_wnd to TCP_INFO
On Fri, Sep 13, 2019 at 10:29 AM Thomas Higdon wrote: > > On Thu, Sep 12, 2019 at 10:14:33AM +0100, Dave Taht wrote: > > On Thu, Sep 12, 2019 at 1:59 AM Neal Cardwell wrote: > > > > > > On Wed, Sep 11, 2019 at 6:32 PM Thomas Higdon wrote: > > > > > > > > Neal Cardwell mentioned that rcv_wnd would be useful for helping > > > > diagnose whether a flow is receive-window-limited at a given instant. > > > > > > > > This serves the purpose of adding an additional __u32 to avoid the > > > > would-be hole caused by the addition of the tcpi_rcvi_ooopack field. > > > > > > > > Signed-off-by: Thomas Higdon > > > > --- > > > > > > Thanks, Thomas. > > > > > > I know that when I mentioned this before I mentioned the idea of both > > > tp->snd_wnd (send-side receive window) and tp->rcv_wnd (receive-side > > > receive window) in tcp_info, and did not express a preference between > > > the two. Now that we are faced with a decision between the two, > > > personally I think it would be a little more useful to start with > > > tp->snd_wnd. :-) > > > > > > Two main reasons: > > > > > > (1) Usually when we're diagnosing TCP performance problems, we do so > > > from the sender, since the sender makes most of the > > > performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc). > > > From the sender-side the thing that would be most useful is to see > > > tp->snd_wnd, the receive window that the receiver has advertised to > > > the sender. > > > > I am under the impression, that particularly in the mobile space, that > > network behavior > > is often governed by rcv_wnd. At least, there's been so many papers on > > this that I'd > > tended to assume so. > > > > Given a desire to do both vars, is there a *third* u32 we could add to > > fill in the next hole? :) > > ecn marks? > > Neal makes some good points -- there is a fair amount of existing > information for deriving receive window. It seems like snd_wnd would be > more valuable at this moment. 
For the purpose of pairing up these __u32s > to get something we can commit, I propose that we go with > the rcv_ooopack/snd_wnd pair for now, and when something comes up later, > one might consider pairing up rcv_wnd. FWIW that sounds like a great plan to me. Thanks, Thomas! neal
Re: [PATCH v3 2/2] tcp: Add rcv_wnd to TCP_INFO
On Wed, Sep 11, 2019 at 6:32 PM Thomas Higdon wrote: > > Neal Cardwell mentioned that rcv_wnd would be useful for helping > diagnose whether a flow is receive-window-limited at a given instant. > > This serves the purpose of adding an additional __u32 to avoid the > would-be hole caused by the addition of the tcpi_rcv_ooopack field. > > Signed-off-by: Thomas Higdon > --- Thanks, Thomas. I know that when I mentioned this before I mentioned the idea of both tp->snd_wnd (send-side receive window) and tp->rcv_wnd (receive-side receive window) in tcp_info, and did not express a preference between the two. Now that we are faced with a decision between the two, personally I think it would be a little more useful to start with tp->snd_wnd. :-) Two main reasons: (1) Usually when we're diagnosing TCP performance problems, we do so from the sender, since the sender makes most of the performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc). From the sender-side the thing that would be most useful is to see tp->snd_wnd, the receive window that the receiver has advertised to the sender. (2) From the receiver side, "ss" can already show a fair amount of info about receive-side buffer/window limits, like: info->tcpi_rcv_ssthresh, info->tcpi_rcv_space, skmeminfo[SK_MEMINFO_RMEM_ALLOC], skmeminfo[SK_MEMINFO_RCVBUF]. Often the rwin can be approximated by combining those. Hopefully Eric, Yuchung, and Soheil can weigh in on the question of snd_wnd vs rcv_wnd. Or we can perhaps think of another field, and add the tcpi_rcv_ooopack, snd_wnd, rcv_wnd, and that final field, all together. thanks, neal
Re: [PATCH net-next] tcp: force a PSH flag on TSO packets
On Tue, Sep 10, 2019 at 5:49 PM Eric Dumazet wrote: > > When tcp sends a TSO packet, adding a PSH flag on it > reduces the sojourn time of GRO packet in GRO receivers. > > This is particularly the case under pressure, since RX queues > receive packets for many concurrent flows. > > A sender can give a hint to GRO engines when it is > appropriate to flush a super-packet, especially when pacing > is in the picture, since next packet is probably delayed by > one ms. > > Having less packets in GRO engine reduces chance > of LRU eviction or inflated RTT, and reduces GRO cost. > > We found recently that we must not set the PSH flag on > individual full-size MSS segments [1] : > > Under pressure (CWR state), we better let the packet sit > for a small delay (depending on NAPI logic) so that the > ACK packet is delayed, and thus next packet we send is > also delayed a bit. Eventually the bottleneck queue can > be drained. DCTCP flows with CWND=1 have demonstrated > the issue. > > This patch allows to slowdown the aggregate traffic without > involving high resolution timers on senders and/or > receivers. > > It has been used at Google for about four years, > and has been discussed at various networking conferences. > > [1] segments smaller than MSS already have PSH flag set > by tcp_sendmsg() / tcp_mark_push(), unless MSG_MORE > has been requested by the user. 
> > Signed-off-by: Eric Dumazet > Cc: Soheil Hassas Yeganeh > Cc: Neal Cardwell > Cc: Yuchung Cheng > Cc: Daniel Borkmann > Cc: Tariq Toukan > --- > net/ipv4/tcp_output.c | 15 +-- > 1 file changed, 13 insertions(+), 2 deletions(-) > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index > 42abc9bd687a5fea627cbc7cfa750d022f393d84..fec6d67bfd146dc78f0f25173fd71b8b8cc752fe > 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -1050,11 +1050,22 @@ static int __tcp_transmit_skb(struct sock *sk, struct > sk_buff *skb, > tcb = TCP_SKB_CB(skb); > memset(&opts, 0, sizeof(opts)); > > - if (unlikely(tcb->tcp_flags & TCPHDR_SYN)) > + if (unlikely(tcb->tcp_flags & TCPHDR_SYN)) { > tcp_options_size = tcp_syn_options(sk, skb, &opts, &md5); > - else > + } else { > tcp_options_size = tcp_established_options(sk, skb, &opts, >&md5); > + /* Force a PSH flag on all (GSO) packets to expedite GRO flush > +* at receiver : This slightly improve GRO performance. > +* Note that we do not force the PSH flag for non GSO packets, > +* because they might be sent under high congestion events, > +* and in this case it is better to delay the delivery of > 1-MSS > +* packets and thus the corresponding ACK packet that would > +* release the following packet. > +*/ > + if (tcp_skb_pcount(skb) > 1) > + tcb->tcp_flags |= TCPHDR_PSH; > + } > tcp_header_size = tcp_options_size + sizeof(struct tcphdr); > > /* if no packet is in qdisc/device queue, then allow XPS to select Acked-by: Neal Cardwell Thanks, Eric! neal
Re: [PATCH v2] tcp: Add TCP_INFO counter for packets received out-of-order
On Tue, Sep 10, 2019 at 4:39 PM Eric Dumazet wrote: > > On Tue, Sep 10, 2019 at 10:11 PM Thomas Higdon wrote: > > > > > ... > > Because an additional 32-bit member in struct tcp_info would cause > > a hole on 64-bit systems, we reserve a struct member '_reserved'. > ... > > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h > > index b3564f85a762..990a5bae3ac1 100644 > > --- a/include/uapi/linux/tcp.h > > +++ b/include/uapi/linux/tcp.h > > @@ -270,6 +270,9 @@ struct tcp_info { > > __u64 tcpi_bytes_retrans; /* RFC4898 tcpEStatsPerfOctetsRetrans > > */ > > __u32 tcpi_dsack_dups; /* RFC4898 tcpEStatsStackDSACKDups */ > > __u32 tcpi_reord_seen; /* reordering events seen */ > > + > > + __u32 _reserved; /* Reserved for future 32-bit member. > > */ > > + __u32 tcpi_rcv_ooopack;/* Out-of-order packets received */ > > }; > > > > Unfortunately we won't be able to use this hole, because the way the > TCP_INFO works, > > The kernel will report the same size after the reserved field is > renamed to something else. > > User space code is able to detect which fields are there or not based > on what the kernel > returns for the size of the structure. If we are looking for something else useful to use for a __u32 to pair up with this new field, I would suggest we export tp->snd_wnd (send-side receive window) and tp->rcv_wnd (receive-side receive window) in tcp_info. We could export one of them now, and the other the next time we need to add a field and need some useful "padding". These fields could help folks diagnose whether a flow is receive-window-limited at a given instant, using "ss", etc. I think there was some interest in this internally in our team a while ago. neal
[PATCH net] tcp: fix tcp_ecn_withdraw_cwr() to clear TCP_ECN_QUEUE_CWR
Fix tcp_ecn_withdraw_cwr() to clear the correct bit: TCP_ECN_QUEUE_CWR. Rationale: basically, TCP_ECN_DEMAND_CWR is a bit that is purely about the behavior of data receivers, and deciding whether to reflect incoming IP ECN CE marks as outgoing TCP th->ece marks. The TCP_ECN_QUEUE_CWR bit is purely about the behavior of data senders, and deciding whether to send CWR. The tcp_ecn_withdraw_cwr() function is only called from tcp_undo_cwnd_reduction() by data senders during an undo, so it should zero the sender-side state, TCP_ECN_QUEUE_CWR. It does not make sense to stop the reflection of incoming CE bits on incoming data packets just because outgoing packets were spuriously retransmitted. The bug has been reproduced with packetdrill; it manifests in a scenario with RFC3168 ECN, with an incoming data packet with the CE bit set and carrying a TCP timestamp value that causes cwnd undo. Before this fix, the IP CE bit was ignored and not reflected in the TCP ECE header bit, and the sender sent a TCP CWR ('W') bit on the next outgoing data packet, even though the cwnd reduction had been undone. After this fix, the sender properly reflects the CE bit and does not set the W bit. Note: the bug actually predates 2005 git history; this Fixes footer is chosen to be the oldest SHA1 I have tested (from Sep 2007) for which the patch applies cleanly (since before this commit the code was in a .h file). 
Fixes: bdf1ee5d3bd3 ("[TCP]: Move code from tcp_ecn.h to tcp*.c and tcp.h & remove it") Signed-off-by: Neal Cardwell Acked-by: Yuchung Cheng Acked-by: Soheil Hassas Yeganeh Cc: Eric Dumazet --- net/ipv4/tcp_input.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index c21e8a22fb3b..8a1cd93dbb09 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -266,7 +266,7 @@ static void tcp_ecn_accept_cwr(struct sock *sk, const struct sk_buff *skb) static void tcp_ecn_withdraw_cwr(struct tcp_sock *tp) { - tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR; + tp->ecn_flags &= ~TCP_ECN_QUEUE_CWR; } static void __tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb) -- 2.23.0.162.g0b9fbb3734-goog
Re: [PATCH net] tcp: remove empty skb from write queue in error cases
On Mon, Aug 26, 2019 at 12:19 PM Eric Dumazet wrote: > > Vladimir Rutsky reported stuck TCP sessions after memory pressure > events. An edge-triggered epoll() user would never receive an EPOLLOUT > notification allowing them to retry a sendmsg(). > > Jason tested the case of sk_stream_alloc_skb() returning NULL, > but there are other paths that could lead both sendmsg() and sendpage() > to return -1 (EAGAIN), with an empty skb queued on the write queue. > > This patch makes sure we remove this empty skb so that > Jason's code can detect that the queue is empty, and > call sk->sk_write_space(sk) accordingly. > > Fixes: ce5ec440994b ("tcp: ensure epoll edge trigger wakeup when write queue > is empty") > Signed-off-by: Eric Dumazet > Cc: Jason Baron > Reported-by: Vladimir Rutsky > Cc: Soheil Hassas Yeganeh > Cc: Neal Cardwell > --- Acked-by: Neal Cardwell Nice detective work. :-) Thanks, Eric! neal
Re: [PATCH net] tcp: make sure EPOLLOUT wont be missed
On Sat, Aug 17, 2019 at 12:26 AM Eric Dumazet wrote: > > As Jason Baron explained in commit 790ba4566c1a ("tcp: set SOCK_NOSPACE > under memory pressure"), it is crucial we properly set SOCK_NOSPACE > when needed. > > However, Jason patch had a bug, because the 'nonblocking' status > as far as sk_stream_wait_memory() is concerned is governed > by MSG_DONTWAIT flag passed at sendmsg() time : > > long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT); > > So it is very possible that tcp sendmsg() calls sk_stream_wait_memory(), > and that sk_stream_wait_memory() returns -EAGAIN with SOCK_NOSPACE > cleared, if sk->sk_sndtimeo has been set to a small (but not zero) > value. > > This patch removes the 'noblock' variable since we must always > set SOCK_NOSPACE if -EAGAIN is returned. > > It also renames the do_nonblock label since we might reach this > code path even if we were in blocking mode. > > Fixes: 790ba4566c1a ("tcp: set SOCK_NOSPACE under memory pressure") > Signed-off-by: Eric Dumazet > Cc: Jason Baron > Reported-by: Vladimir Rutsky > --- > net/core/stream.c | 16 +--- > 1 file changed, 9 insertions(+), 7 deletions(-) Acked-by: Neal Cardwell Thanks, Eric! neal
Re: [PATCH v2 1/2] tcp: add new tcp_mtu_probe_floor sysctl
On Wed, Aug 7, 2019 at 7:55 PM Josh Hunt wrote: > > The current implementation of TCP MTU probing can considerably > underestimate the MTU on lossy connections allowing the MSS to get down to > 48. We have found that in almost all of these cases on our networks these > paths can handle much larger MTUs meaning the connections are being > artificially limited. Even though TCP MTU probing can raise the MSS back up > we have seen this not to be the case causing connections to be "stuck" with > an MSS of 48 when heavy loss is present. > > Prior to pushing out this change we could not keep TCP MTU probing enabled > b/c of the above reasons. Now with a reasonable floor set we've had it > enabled for the past 6 months. > > The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives > administrators the ability to control the floor of MSS probing. > > Signed-off-by: Josh Hunt > --- Acked-by: Neal Cardwell Thanks, Josh. I agree with Eric that it would be great if you are able to share the value that you have found to work well. neal
Re: [PATCH v2 2/2] tcp: Update TCP_BASE_MSS comment
On Thu, Aug 8, 2019 at 2:13 AM Eric Dumazet wrote: > On 8/8/19 1:52 AM, Josh Hunt wrote: > > TCP_BASE_MSS is used as the default initial MSS value when MTU probing is > > enabled. Update the comment to reflect this. > > > > Suggested-by: Neal Cardwell > > Signed-off-by: Josh Hunt > > --- > > Signed-off-by: Eric Dumazet Acked-by: Neal Cardwell Thanks, Josh! neal
Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
On Fri, Aug 2, 2019 at 3:03 PM Bernd wrote: > > Hello, > > While analyzing an aborted upload packet capture I came across an odd > trace where a sender was not responding to a duplicate SACK but > sending further segments until it stalled. > > It took me some time to remember this fix; the problems actually > started when the security fix was applied. > > I see a high counter for TCPWqueueTooBig - and I don’t think that’s an > actual attack. > > Is there a probability of triggering the limit with connections with > big windows and large send buffers and dropped segments? If so what > would be the plan? It does not look like it is configurable. The trace > seems to have 100 (filled) inflight segments. > > Gruss > Bernd > -- > http://bernd.eckenfels.net What's the exact kernel version you are using? Eric submitted a patch recently that may address your issue: tcp: be more careful in tcp_fragment() https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=b617158dc096709d8600c53b6052144d12b89fab Would you be able to test your workload with that commit cherry-picked, and see if the issue still occurs? That commit was targeted to many stable releases, so you may be able to pick up that fix from a stable branch. cheers, neal
Re: [PATCH] tcp: add new tcp_mtu_probe_floor sysctl
On Sun, Jul 28, 2019 at 5:14 PM Josh Hunt wrote: > > On 7/28/19 6:54 AM, Eric Dumazet wrote: > > On Sun, Jul 28, 2019 at 1:21 AM Josh Hunt wrote: > >> > >> On 7/27/19 12:05 AM, Eric Dumazet wrote: > >>> On Sat, Jul 27, 2019 at 4:23 AM Josh Hunt wrote: > > The current implementation of TCP MTU probing can considerably > underestimate the MTU on lossy connections allowing the MSS to get down > to > 48. We have found that in almost all of these cases on our networks these > paths can handle much larger MTUs meaning the connections are being > artificially limited. Even though TCP MTU probing can raise the MSS back > up > we have seen this not to be the case causing connections to be "stuck" > with > an MSS of 48 when heavy loss is present. > > Prior to pushing out this change we could not keep TCP MTU probing > enabled > b/c of the above reasons. Now with a reasonble floor set we've had it > enabled for the past 6 months. > >>> > >>> And what reasonable value have you used ??? > >> > >> Reasonable for some may not be reasonable for others hence the new > >> sysctl :) We're currently running with a fairly high value based off of > >> the v6 min MTU minus headers and options, etc. We went conservative with > >> our setting initially as it seemed a reasonable first step when > >> re-enabling TCP MTU probing since with no configurable floor we saw a # > >> of cases where connections were using severely reduced mss b/c of loss > >> and not b/c of actual path restriction. I plan to reevaluate the setting > >> at some point, but since the probing method is still the same it means > >> the same clients who got stuck with mss of 48 before will land at > >> whatever floor we set. Looking forward we are interested in trying to > >> improve TCP MTU probing so it does not penalize clients like this. > >> > >> A suggestion for a more reasonable floor default would be 512, which is > >> the same as the min_pmtu. 
Given both mechanisms are trying to achieve > >> the same goal it seems like they should have a similar min/floor. > >> > >>> > > The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives > administrators the ability to control the floor of MSS probing. > > Signed-off-by: Josh Hunt > --- > Documentation/networking/ip-sysctl.txt | 6 ++ > include/net/netns/ipv4.h | 1 + > net/ipv4/sysctl_net_ipv4.c | 9 + > net/ipv4/tcp_ipv4.c| 1 + > net/ipv4/tcp_timer.c | 2 +- > 5 files changed, 18 insertions(+), 1 deletion(-) > > diff --git a/Documentation/networking/ip-sysctl.txt > b/Documentation/networking/ip-sysctl.txt > index df33674799b5..49e95f438ed7 100644 > --- a/Documentation/networking/ip-sysctl.txt > +++ b/Documentation/networking/ip-sysctl.txt > @@ -256,6 +256,12 @@ tcp_base_mss - INTEGER > Path MTU discovery (MTU probing). If MTU probing is enabled, > this is the initial MSS used by the connection. > > +tcp_mtu_probe_floor - INTEGER > + If MTU probing is enabled this caps the minimum MSS used for > search_low > + for the connection. > + > + Default : 48 > + > tcp_min_snd_mss - INTEGER > TCP SYN and SYNACK messages usually advertise an ADVMSS option, > as described in RFC 1122 and RFC 6691. 
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h > index bc24a8ec1ce5..c0c0791b1912 100644 > --- a/include/net/netns/ipv4.h > +++ b/include/net/netns/ipv4.h > @@ -116,6 +116,7 @@ struct netns_ipv4 { > int sysctl_tcp_l3mdev_accept; > #endif > int sysctl_tcp_mtu_probing; > + int sysctl_tcp_mtu_probe_floor; > int sysctl_tcp_base_mss; > int sysctl_tcp_min_snd_mss; > int sysctl_tcp_probe_threshold; > diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c > index 0b980e841927..59ded25acd04 100644 > --- a/net/ipv4/sysctl_net_ipv4.c > +++ b/net/ipv4/sysctl_net_ipv4.c > @@ -820,6 +820,15 @@ static struct ctl_table ipv4_net_table[] = { > .extra2 = &tcp_min_snd_mss_max, > }, > { > + .procname = "tcp_mtu_probe_floor", > + .data = > &init_net.ipv4.sysctl_tcp_mtu_probe_floor, > + .maxlen = sizeof(int), > + .mode = 0644, > + .proc_handler = proc_dointvec_minmax, > + .extra1 = &tcp_min_snd_mss_min, > + .extra2 = &tcp_min_snd_mss_max, > + }, > + { > .procname = "tcp_probe_t
Re: [PATCH net] tcp: fix tcp_set_congestion_control() use from bpf hook
On Thu, Jul 18, 2019 at 10:28 PM Eric Dumazet wrote: > > Neal reported incorrect use of ns_capable() from bpf hook. > > bpf_setsockopt(...TCP_CONGESTION...) > -> tcp_set_congestion_control() >-> ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN) > -> ns_capable_common() > -> current_cred() > -> rcu_dereference_protected(current->cred, 1) > > Accessing 'current' in bpf context makes no sense, since packets > are processed from softirq context. > > As Neal stated : The capability check in tcp_set_congestion_control() > was written assuming a system call context, and then was reused from > a BPF call site. > > The fix is to add a new parameter to tcp_set_congestion_control(), > so that the ns_capable() call is only performed under the right > context. > > Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control") > Signed-off-by: Eric Dumazet > Cc: Lawrence Brakmo > Reported-by: Neal Cardwell > --- Acked-by: Neal Cardwell Thanks, Eric! neal
Re: Kernel BUG: epoll_wait() (and epoll_pwait) stall for 206 ms per call on sockets with a small-ish snd/rcv buffer.
On Sat, Jul 6, 2019 at 2:19 PM Carlo Wood wrote: > > While investigating this further, I read on > http://www.masterraghu.com/subjects/np/introduction/unix_network_programming_v1.3/ch07lev1sec5.html > under "SO_RCVBUF and SO_SNDBUF Socket Options": > > When setting the size of the TCP socket receive buffer, the > ordering of the function calls is important. This is because of > TCP's window scale option (Section 2.6), which is exchanged with > the peer on the SYN segments when the connection is established. > For a client, this means the SO_RCVBUF socket option must be set > before calling connect. For a server, this means the socket option > must be set for the listening socket before calling listen. Setting > this option for the connected socket will have no effect whatsoever > on the possible window scale option because accept does not return > with the connected socket until TCP's three-way handshake is > complete. That is why this option must be set for the listening > socket. (The sizes of the socket buffers are always inherited from > the listening socket by the newly created connected socket: pp. > 462–463 of TCPv2.) > > As mentioned in a previous post, I had already discovered about > needing to set the socket buffers before connect, but I didn't know > about setting them before the call to listen() in order to get the > buffer sizes inherited by the accepted sockets. > > After fixing this in my test program, all problems disappeared when > keeping the send and receive buffers the same on both sides. > > However, when only setting the send and receive buffers on the client > socket (not on the (accepted or) listen socket), epoll_wait() still > stalls 43ms. When the SO_SNDBUF is smaller than 33182 bytes. > > Here is the latest version of my test program: > > https://github.com/CarloWood/ai-evio-testsuite/blob/master/src/epoll_bug.c > > I have to retract most of my "bug" report, it might even not really be > a bug then... 
but nevertheless, what remains strange is the fact > that setting the socket buffer sizes on the accepted sockets can lead > to so much crippling effect, while the quoted website states: > > Setting this option for the connected socket will have no effect > whatsoever on the possible window scale option because accept does > not return with the connected socket until TCP's three-way > handshake is complete. > > And when only setting the socket buffer sizes for the client socket > (that I use to send back received data; so this is the sending > side now) then why does epoll_wait() stall 43 ms per call when the > receiving side is using the default (much larger) socket buffer sizes? > > That 43 ms is STILL crippling-- slowing down the transmission of the > data to a trickling speed compared to what it should be. Based on the magic numbers you cite, including the fact that this test program seems to send traffic over a loopback device (presumably MTU=65536), epoll_wait() stalling 43 ms (slightly longer than the typical Linux delayed ACK timer), and the problem only happening if SO_SNDBUF is smaller than 33182 bytes (half the MTU)... a guess would be that when you artificially make the SO_SNDBUF that low, then you are asking the kernel to only allow your sending sockets to buffer less than a single MTU of data, which means that the packets the sender sends are going to be less than the MTU, which means that the receiver may tend to eventually (after the initial quick ACKs expire) delay its ACKs in hopes of eventually receiving a full MSS of data (see __tcp_ack_snd_check()). Since the SO_SNDBUF is so small in this case, the sender then would not be able to write() or transmit anything else until the receiver sends a delayed ACK for the data ~40ms later, allowing the sending socket to free the previously sent data and trigger the sender's next EPOLLOUT and write(). 
You could try grabbing a packet capture of the traffic over your loopback device during the test to see if it matches that theory: tcpdump -i lo -s 100 -w /tmp/test.pcap cheers, neal
Re: tp->copied_seq used before assignment in tcp_check_urg
On Tue, Jun 11, 2019 at 2:46 AM Zhongjie Wang wrote: > > Hi Neal, > > Thanks for your valuable feedback! Yes, I think you are right. > It seems not a problem if tp->urg_data and tp->urg_seq are used together. > From our test results, we can only see there are some paths requiring > specific initial sequence number to reach. > But as you said, it would not cause a difference in the code logic. > We haven't observed any abnormal states. Great. Thanks for confirming! cheers, neal
Re: tp->copied_seq used before assignment in tcp_check_urg
On Mon, Jun 10, 2019 at 7:48 PM Zhongjie Wang wrote: > > Hi Neal, > > Thanks for your reply. Sorry, I made a mistake in my previous email. > After I double checked the source code, I think it should be tp->urg_seq, > which is used before assignment, instead of tp->copied_seq. > Still in the same if-statement: > > 5189 if (tp->urg_seq == tp->copied_seq && tp->urg_data && > 5190 !sock_flag(sk, SOCK_URGINLINE) && tp->copied_seq != tp->rcv_nxt) > { > 5191 struct sk_buff *skb = skb_peek(&sk->sk_receive_queue); > 5192 tp->copied_seq++; > 5193 if (skb && !before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq)) { > 5194 __skb_unlink(skb, &sk->sk_receive_queue); > 5195 __kfree_skb(skb); // wzj(a) > 5196 } > 5197 } > 5198 > 5199 tp->urg_data = TCP_URG_NOTYET; > 5200 tp->urg_seq = ptr; > > It compares tp->copied_seq with tp->urg_seq. > And I found only 1 assignment of tp->urg_seq in the code base, > which is after the if-statement in the same tcp_check_urg() function. > > So it seems tp->urg_seq is not assigned to any sequence number before > its first use. > Is that correct? I agree, it does seem that tp->urg_seq is not assigned to any sequence number before its first use. AFAICT from a quick read of the code, this does not matter. It seems the idea is for tp->urg_data and tp->urg_seq to be set and used together, so that tp->urg_seq is never relied upon to be set to something meaningful unless tp->urg_data has also been verified to be set to something (something non-zero). I suppose it might be more clear to structure the code to check urg_data first: if (tp->urg_data && tp->urg_seq == tp->copied_seq && ...but in practice AFAICT it does not make a difference, since no matter which order the expressions use, both conditions must be true for the code to have any side effects. > P.S. In our symbolic execution tool, we found an execution path that > requires the client initial sequence number (ISN) to be 0xFF FF FF FF. 
> And when it traverse that path, the tp->copied_seq is equal to (client > ISN + 1), and compared with 0 in this if-statatement. > Therefore the client ISN has to be exactly 0xFF FF FF FF to hit this > execution path. > > To trigger this, we first sent a SYN packet, and then an ACK packet > with urgent pointer. Does your test show any invalid behavior by the TCP endpoint? For example, does the state in tcp_sock become incorrect, or is some system call return value or outgoing packet incorrect? AFAICT from the scenario you describe it seems that the "if" condition would fail when the receiver processes the ACK packet with urgent pointer, because tp->urg_data was not yet set at this point. Thus it would seem that in this case it does not matter that tp->urg_seq is not assigned to any sequence number before being first used. cheers, neal
Re: tp->copied_seq used before assignment in tcp_check_urg
On Sun, Jun 9, 2019 at 11:12 PM Zhongjie Wang wrote: ... > It compares tp->copied_seq with tcp->rcv_nxt. > However, tp->copied_seq is only assigned to an appropriate sequence number > when > it copies data to user space. So here tp->copied_seq could be equal to 0, > which is its initial value, if no data are copied yet. I don't believe that's the case. As far as I can see, the tp->copied_seq field is initialized to tp->rcv_nxt in the various places where TCP connections are initialized: tp->copied_seq = tp->rcv_nxt; > In this case, the condition becomes 0 != tp->rcv_nxt, > and it renders this comparison ineffective. > > For example, if we send a SYN packet with initial sequence number 0xFF FF FF > FF, > and after receiving SYN/ACK response, then send a ACK packet with sequence > number 0, it will bypass this if-then block. > > We are not sure how this would affect the TCP logic. Could you please confirm > that tp->copied_seq should be assigned to a sequence number before its use? Yes, the tp->copied_seq ought to be assigned to a sequence number before its use, and AFAICT it is. Can you identify a specific sequence of code execution (and ideally construct a packetdrill script) where tp->copied_seq is somehow read before it is initialized? cheers, neal
Re: [PATCH net] dctcp: more accurate tracking of packets delivery
On Thu, Apr 11, 2019 at 8:55 AM Eric Dumazet wrote: > > After commit e21db6f69a95 ("tcp: track total bytes delivered with ECN CE > marks") > core TCP stack does a very good job tracking ECN signals. > > The "sender's best estimate of CE information" Yuchung mentioned in his > patch is indeed the best we can do. > > DCTCP can use tp->delivered_ce and tp->delivered to not duplicate the logic, > and use the existing best estimate. > > This solves some problems, since current DCTCP logic does not deal with losses > and/or GRO or ack aggregation very well. > > This also removes a dubious use of inet_csk(sk)->icsk_ack.rcv_mss > (this should have been tp->mss_cache), and a 64 bit divide. > > Finally, we can see that the DCTCP logic, calling dctcp_update_alpha() for > every ACK could be done differently, calling it only once per RTT. > > Signed-off-by: Eric Dumazet > Cc: Yuchung Cheng > Cc: Neal Cardwell > Cc: Soheil Hassas Yeganeh > Cc: Florian Westphal > Cc: Daniel Borkmann > Cc: Lawrence Brakmo > Cc: Abdul Kabbani > --- Thanks, Eric! There is a slight change in semantics, switching from bytes to packets. But my sense is that the new semantics of packet-based counting are a better approach, and a closer fit to the design in the DCTCP paper from SIGCOMM 2010, which talks about measuring "the fraction of packets that were marked in the last window of data", and using that to estimate "the probability that the queue size is greater than K". Acked-by: Neal Cardwell thanks, neal
Re: [PATCH v2 bpf-next 5/7] bpf: sysctl for probe_on_drop
On Wed, Apr 3, 2019 at 8:13 PM brakmo wrote: > > When a packet is dropped when calling queue_xmit in __tcp_transmit_skb > and packets_out is 0, it is beneficial to set a small probe timer. > Otherwise, the throughput for the flow can suffer because it may need to > depend on the probe timer to start sending again. The default value for > the probe timer is at least 200ms, this patch sets it to 20ms when a > packet is dropped and there are no other packets in flight. > > This patch introduces a new sysctl, sysctl_tcp_probe_on_drop_ms, that is > used to specify the duration of the probe timer for the case described > earlier. The allowed values are between 0 and TCP_RTO_MIN. A value of 0 > disables setting the probe timer with a small value. > > Signed-off-by: Lawrence Brakmo > --- > include/net/netns/ipv4.h | 1 + > net/ipv4/sysctl_net_ipv4.c | 10 ++ > net/ipv4/tcp_ipv4.c| 1 + > net/ipv4/tcp_output.c | 18 +++--- > 4 files changed, 27 insertions(+), 3 deletions(-) > > diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h > index 7698460a3dd1..26c52d294dea 100644 > --- a/include/net/netns/ipv4.h > +++ b/include/net/netns/ipv4.h > @@ -166,6 +166,7 @@ struct netns_ipv4 { > int sysctl_tcp_wmem[3]; > int sysctl_tcp_rmem[3]; > int sysctl_tcp_comp_sack_nr; > + int sysctl_tcp_probe_on_drop_ms; > unsigned long sysctl_tcp_comp_sack_delay_ns; > struct inet_timewait_death_row tcp_death_row; > int sysctl_max_syn_backlog; > diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c > index 2316c08e9591..b1e4420b6d6c 100644 > --- a/net/ipv4/sysctl_net_ipv4.c > +++ b/net/ipv4/sysctl_net_ipv4.c > @@ -49,6 +49,7 @@ static int ip_ping_group_range_min[] = { 0, 0 }; > static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX }; > static int comp_sack_nr_max = 255; > static u32 u32_max_div_HZ = UINT_MAX / HZ; > +static int probe_on_drop_max = TCP_RTO_MIN; > > /* obsolete */ > static int sysctl_tcp_low_latency __read_mostly; > @@ -1228,6 +1229,15 @@ static struct 
ctl_table ipv4_net_table[] = { > .extra1 = &zero, > .extra2 = &comp_sack_nr_max, > }, > + { > + .procname = "tcp_probe_on_drop_ms", > + .data = &init_net.ipv4.sysctl_tcp_probe_on_drop_ms, > + .maxlen = sizeof(int), > + .mode = 0644, > + .proc_handler = proc_dointvec_minmax, The tcp_reset_xmit_timer () function expects a time in jiffies, so this would probably need to be proc_dointvec_ms_jiffies? > + .extra1 = &zero, > + .extra2 = &probe_on_drop_max, > + }, > { > .procname = "udp_rmem_min", > .data = &init_net.ipv4.sysctl_udp_rmem_min, > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c > index 3979939804b7..3df6735f0f07 100644 > --- a/net/ipv4/tcp_ipv4.c > +++ b/net/ipv4/tcp_ipv4.c > @@ -2686,6 +2686,7 @@ static int __net_init tcp_sk_init(struct net *net) > spin_lock_init(&net->ipv4.tcp_fastopen_ctx_lock); > net->ipv4.sysctl_tcp_fastopen_blackhole_timeout = 60 * 60; > atomic_set(&net->ipv4.tfo_active_disable_times, 0); > + net->ipv4.sysctl_tcp_probe_on_drop_ms = 20; The tcp_reset_xmit_timer () function expects a time in jiffies, so this would probably need to be msecs_to_jiffies(20)? > > /* Reno is always built in */ > if (!net_eq(net, &init_net) && > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index e265d1aeeb66..f101e4d7c1ae 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -1154,9 +1154,21 @@ static int __tcp_transmit_skb(struct sock *sk, struct > sk_buff *skb, > > err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl); > > - if (unlikely(err > 0)) { > - tcp_enter_cwr(sk); > - err = net_xmit_eval(err); > + if (unlikely(err)) { > + if (unlikely(err > 0)) { > + tcp_enter_cwr(sk); > + err = net_xmit_eval(err); > + } > + /* Packet was dropped. If there are no packets out, > +* we may need to depend on probe timer to start sending > +* again. Hence, use a smaller value. 
> +*/ > + if (!tp->packets_out && !inet_csk(sk)->icsk_pending && > + sock_net(sk)->ipv4.sysctl_tcp_probe_on_drop_ms > 0) { > + tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0, > + > sock_net(sk)->ipv4.sysctl_tcp_probe_on_drop_ms, > +TCP_RTO_MAX, NULL); > + } My reading of the code is that any timer that was set here would typically be quickly overridden by the tcp_check_probe_timer() that
Re: [PATCH net] tcp: repaired skbs must init their tso_segs
On Sat, Feb 23, 2019 at 6:51 PM Eric Dumazet wrote: > > syzbot reported a WARN_ON(!tcp_skb_pcount(skb)) > in tcp_send_loss_probe() [1] > > This was caused by TCP_REPAIR-sent skbs that inadvertently > were missing a call to tcp_init_tso_segs() > Acked-by: Neal Cardwell Thanks, Eric! neal
Re: [PATCH net 2/2] tcp: tcp_v4_err() should be more careful
On Fri, Feb 15, 2019 at 4:36 PM Eric Dumazet wrote: > > ICMP handlers are not very often stressed, we should > make them more resilient to bugs that might surface in > the future. > > If there is no packet in retransmit queue, we should > avoid a NULL deref. > > Signed-off-by: Eric Dumazet > Reported-by: soukjin bae > --- > net/ipv4/tcp_ipv4.c | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) Acked-by: Neal Cardwell Thanks! neal
Re: [PATCH net 1/2] tcp: clear icsk_backoff in tcp_write_queue_purge()
On Fri, Feb 15, 2019 at 4:36 PM Eric Dumazet wrote: > > soukjin bae reported a crash in tcp_v4_err() handling > ICMP_DEST_UNREACH after tcp_write_queue_head(sk) > returned a NULL pointer. > > Current logic should have prevented this : > > if (seq != tp->snd_una || !icsk->icsk_retransmits || > !icsk->icsk_backoff || fastopen) > break; > > Problem is the write queue might have been purged > and icsk_backoff has not been cleared. > > Signed-off-by: Eric Dumazet > Reported-by: soukjin bae > --- > net/ipv4/tcp.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) Acked-by: Neal Cardwell Thanks! neal
Re: [PATCH net] tcp: clear tp->retrans_stamp in tcp_finish_connect()
On Wed, Dec 19, 2018 at 4:50 PM Eric Dumazet wrote: > > tcp_connect() does : > > tp->retrans_stamp = tcp_time_stamp(tp); > > When 3WHS completes, we need to clear retrans_stamp otherwise > various things can break. > > This bug became more visible after commit b701a99e431d ("tcp: Add > tcp_clamp_rto_to_user_timeout() helper to improve accuracy"), but > predates git history. > > Signed-off-by: Eric Dumazet > Acked-by: Soheil Hassas Yeganeh > --- Acked-by: Neal Cardwell Thanks! neal
Re: [PATCH v3 net-next 4/4] tcp: implement coalescing on backlog queue
On Tue, Nov 27, 2018 at 5:42 PM Eric Dumazet wrote: > > In case GRO is not as efficient as it should be or disabled, > we might have a user thread trapped in __release_sock() while > the softirq handler floods packets up to the point we have to drop. > > This patch balances work done from user thread and softirq, > to give more chances to __release_sock() to complete its work > before new packets are added to the backlog. > > This also helps if we receive many ACK packets, since GRO > does not aggregate them. > > This patch brings ~60% throughput increase on a receiver > without GRO, but the spectacular gain is really the > 1000x release_sock() latency reduction I have measured. > > Signed-off-by: Eric Dumazet > Cc: Neal Cardwell > Cc: Yuchung Cheng > --- Acked-by: Neal Cardwell Thanks! neal
Re: [PATCH v3 net-next 2/4] tcp: take care of compressed acks in tcp_add_reno_sack()
On Tue, Nov 27, 2018 at 5:42 PM Eric Dumazet wrote: > > Neal pointed out that non sack flows might suffer from ACK compression > added in the following patch ("tcp: implement coalescing on backlog queue") > > Instead of tweaking tcp_add_backlog() we can take into > account how many ACK were coalesced, this information > will be available in skb_shinfo(skb)->gso_segs > > Signed-off-by: Eric Dumazet > --- Acked-by: Neal Cardwell Thanks! neal
Re: [PATCH v2 net-next 4/4] tcp: implement coalescing on backlog queue
On Tue, Nov 27, 2018 at 10:57 AM Eric Dumazet wrote: > > In case GRO is not as efficient as it should be or disabled, > we might have a user thread trapped in __release_sock() while > softirq handler flood packets up to the point we have to drop. > > This patch balances work done from user thread and softirq, > to give more chances to __release_sock() to complete its work > before new packets are added to the backlog. > > This also helps if we receive many ACK packets, since GRO > does not aggregate them. > > This patch brings ~60% throughput increase on a receiver > without GRO, but the spectacular gain is really on > 1000x release_sock() latency reduction I have measured. > > Signed-off-by: Eric Dumazet > Cc: Neal Cardwell > Cc: Yuchung Cheng > --- ... > + if (TCP_SKB_CB(tail)->end_seq != TCP_SKB_CB(skb)->seq || > + TCP_SKB_CB(tail)->ip_dsfield != TCP_SKB_CB(skb)->ip_dsfield || > +#ifdef CONFIG_TLS_DEVICE > + tail->decrypted != skb->decrypted || > +#endif > + thtail->doff != th->doff || > + memcmp(thtail + 1, th + 1, hdrlen - sizeof(*th))) > + goto no_coalesce; > + > + __skb_pull(skb, hdrlen); > + if (skb_try_coalesce(tail, skb, &fragstolen, &delta)) { > + thtail->window = th->window; > + > + TCP_SKB_CB(tail)->end_seq = TCP_SKB_CB(skb)->end_seq; > + > + if (after(TCP_SKB_CB(skb)->ack_seq, > TCP_SKB_CB(tail)->ack_seq)) > + TCP_SKB_CB(tail)->ack_seq = TCP_SKB_CB(skb)->ack_seq; > + > + TCP_SKB_CB(tail)->tcp_flags |= TCP_SKB_CB(skb)->tcp_flags; I wonder if technically perhaps the logic should skip coalescing if the tail or skb has the TCP_FLAG_URG bit set? It seems if skbs are coalesced, and some have urgent data and some do not, then the TCP_FLAG_URG bit will be accumulated into the tail header, but there will be no way to ensure the correct urgent offsets for the one or more skbs with urgent data are passed along.
Thinking out loud, I guess if this is ECN/DCTCP and some ACKs have TCP_FLAG_ECE and some don't, this will effectively have all ACKed bytes be treated as ECN-marked. Probably OK, since if this coalescing path is being hit the sender may be overloaded and slowing down might be a good thing. Otherwise, looks great to me. Thanks for doing this! neal
Re: [PATCH v2 net-next 3/4] tcp: make tcp_space() aware of socket backlog
On Tue, Nov 27, 2018 at 10:57 AM Eric Dumazet wrote: > > Jean-Louis Dupond reported poor iscsi TCP receive performance > that we tracked to backlog drops. > > Apparently we fail to send window updates reflecting the > fact that we are under stress. > > Note that we might lack a proper window increase when > backlog is fully processed, since __release_sock() clears > sk->sk_backlog.len _after_ all skbs have been processed. > > This should not matter in practice. If we had a significant > load through socket backlog, we are in a dangerous > situation. > > Reported-by: Jean-Louis Dupond > Signed-off-by: Eric Dumazet > --- Acked-by: Neal Cardwell Nice. Thanks! neal
Re: [PATCH v2 net-next 2/4] tcp: take care of compressed acks in tcp_add_reno_sack()
On Tue, Nov 27, 2018 at 10:57 AM Eric Dumazet wrote: > > Neal pointed out that non sack flows might suffer from ACK compression > added in the following patch ("tcp: implement coalescing on backlog queue") > > Instead of tweaking tcp_add_backlog() we can take into > account how many ACK were coalesced, this information > will be available in skb_shinfo(skb)->gso_segs > > Signed-off-by: Eric Dumazet > --- ... > @@ -2679,8 +2683,8 @@ static void tcp_process_loss(struct sock *sk, int flag, > bool is_dupack, > /* A Reno DUPACK means new data in F-RTO step 2.b above are > * delivered. Lower inflight to clock out (re)tranmissions. > */ > - if (after(tp->snd_nxt, tp->high_seq) && is_dupack) > - tcp_add_reno_sack(sk); > + if (after(tp->snd_nxt, tp->high_seq)) > + tcp_add_reno_sack(sk, num_dupack); > else if (flag & FLAG_SND_UNA_ADVANCED) > tcp_reset_reno_sack(tp); > } I think this probably should be checking num_dupack, something like: + if (after(tp->snd_nxt, tp->high_seq) && num_dupack) + tcp_add_reno_sack(sk, num_dupack); If we don't check num_dupack, that seems to mean that after FRTO sends the two new data packets (making snd_nxt after high_seq), the patch would have a particular non-SACK FRTO loss recovery always go into the "if" branch where we tcp_add_reno_sack() function, and we would never have a chance to get to the "else" branch where we check if FLAG_SND_UNA_ADVANCED and zero out the reno SACKs. Otherwise the patch looks great to me. Thanks for doing this! neal
Re: [PATCH v2 net-next 1/4] tcp: hint compiler about sack flows
On Tue, Nov 27, 2018 at 10:57 AM Eric Dumazet wrote: > > Tell the compiler that most TCP flows are using SACK these days. > > There is no need to add the unlikely() clause in tcp_is_reno(), > the compiler is able to infer it. > > Signed-off-by: Eric Dumazet > --- Acked-by: Neal Cardwell Nice. Thanks! neal
Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
On Wed, Nov 21, 2018 at 12:52 PM Eric Dumazet wrote: > > In case GRO is not as efficient as it should be or disabled, > we might have a user thread trapped in __release_sock() while > softirq handler flood packets up to the point we have to drop. > > This patch balances work done from user thread and softirq, > to give more chances to __release_sock() to complete its work. > > This also helps if we receive many ACK packets, since GRO > does not aggregate them. Would this coalesce duplicate incoming ACK packets? Is there a risk that this would eliminate incoming dupacks needed for fast recovery in non-SACK connections? Perhaps pure ACKs should only be coalesced if the ACK field is different? neal
Re: [PATCH net-next 3/3] tcp: get rid of tcp_tso_should_defer() dependency on HZ/jiffies
On Sun, Nov 11, 2018 at 9:41 AM Eric Dumazet wrote: > > tcp_tso_should_defer() first heuristic is to not defer > if last send is "old enough". > > Its current implementation uses jiffies and its low granularity. > > TSO autodefer performance should not rely on kernel HZ :/ > > After EDT conversion, we have state variables in nanoseconds that > can allow us to properly implement the heuristic. > > This patch increases TSO chunk sizes on medium rate flows, > especially when receivers do not use GRO or similar aggregation. > > It also reduces bursts for HZ=100 or HZ=250 kernels, making TCP > behavior more uniform. > > Signed-off-by: Eric Dumazet > Acked-by: Soheil Hassas Yeganeh > --- Nice. Thanks! Acked-by: Neal Cardwell neal
Re: [PATCH net-next 2/3] tcp: refine tcp_tso_should_defer() after EDT adoption
On Sun, Nov 11, 2018 at 9:41 AM Eric Dumazet wrote: > > tcp_tso_should_defer() last step tries to check if the probable > next ACK packet is coming in less than half rtt. > > Problem is that the head->tstamp might be in the future, > so we need to use signed arithmetics to avoid overflows. > > Signed-off-by: Eric Dumazet > Acked-by: Soheil Hassas Yeganeh > --- > net/ipv4/tcp_output.c | 7 --- > 1 file changed, 4 insertions(+), 3 deletions(-) Thanks! Acked-by: Neal Cardwell neal
Re: [PATCH net-next 1/3] tcp: do not try to defer skbs with eor mark (MSG_EOR)
On Sun, Nov 11, 2018 at 9:41 AM Eric Dumazet wrote: > > Applications using MSG_EOR are giving a strong hint to TCP stack : > > Subsequent sendmsg() can not append more bytes to skbs having > the EOR mark. > > Do not try to TSO defer such skbs, there is really no hope. > > Signed-off-by: Eric Dumazet > Acked-by: Soheil Hassas Yeganeh > --- > net/ipv4/tcp_output.c | 4 > 1 file changed, 4 insertions(+) Thanks! Acked-by: Neal Cardwell neal
Re: [PATCH net-next] net_sched: sch_fq: add dctcp-like marking
On Sun, Nov 11, 2018 at 12:11 PM Eric Dumazet wrote: > > Similar to 80ba92fa1a92 ("codel: add ce_threshold attribute") > > After EDT adoption, it became easier to implement DCTCP-like CE marking. > > In many cases, queues are not building in the network fabric but on > the hosts themselves. > > If packets leaving fq missed their Earliest Departure Time by XXX usec, > we mark them with ECN CE. This gives a feedback (after one RTT) to > the sender to slow down and find better operating mode. > > Example : > > tc qd replace dev eth0 root fq ce_threshold 2.5ms > > Signed-off-by: Eric Dumazet > --- Very nice! Thanks, Eric. :-) Acked-by: Neal Cardwell neal
[PATCH net-next] tcp_bbr: update comments to reflect pacing_margin_percent
Recently, in commit ab408b6dc744 ("tcp: switch tcp and sch_fq to new earliest departure time model"), the TCP BBR code switched to a new approach of using an explicit bbr_pacing_margin_percent for shaving a pacing rate "haircut", rather than the previous implicit approach. Update an old comment to reflect the new approach. Signed-off-by: Neal Cardwell Signed-off-by: Yuchung Cheng Signed-off-by: Soheil Hassas Yeganeh Signed-off-by: Eric Dumazet --- net/ipv4/tcp_bbr.c | 15 +++ 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c index 9277abdd822a0..0f497fc49c3fe 100644 --- a/net/ipv4/tcp_bbr.c +++ b/net/ipv4/tcp_bbr.c @@ -128,7 +128,12 @@ static const u32 bbr_probe_rtt_mode_ms = 200; /* Skip TSO below the following bandwidth (bits/sec): */ static const int bbr_min_tso_rate = 120; -/* Pace at ~1% below estimated bw, on average, to reduce queue at bottleneck. */ +/* Pace at ~1% below estimated bw, on average, to reduce queue at bottleneck. + * In order to help drive the network toward lower queues and low latency while + * maintaining high utilization, the average pacing rate aims to be slightly + * lower than the estimated bandwidth. This is an important aspect of the + * design. + */ static const int bbr_pacing_margin_percent = 1; /* We use a high_gain value of 2/ln(2) because it's the smallest pacing gain @@ -247,13 +252,7 @@ static void bbr_init_pacing_rate_from_rtt(struct sock *sk) sk->sk_pacing_rate = bbr_bw_to_pacing_rate(sk, bw, bbr_high_gain); } -/* Pace using current bw estimate and a gain factor. In order to help drive the - * network toward lower queues while maintaining high utilization and low - * latency, the average pacing rate aims to be slightly (~1%) lower than the - * estimated bandwidth. This is an important aspect of the design. In this - * implementation this slightly lower pacing rate is achieved implicitly by not - * including link-layer headers in the packet size used for the pacing rate.
- */ +/* Pace using current bw estimate and a gain factor. */ static void bbr_set_pacing_rate(struct sock *sk, u32 bw, int gain) { struct tcp_sock *tp = tcp_sk(sk); -- 2.19.1.930.g4563a0d9d0-goog
[PATCH net-next 2/2] tcp_bbr: centralize code to set gains
Centralize the code that sets gains used for computing cwnd and pacing rate. This simplifies the code and makes it easier to change the state machine or (in the future) dynamically change the gain values and ensure that the correct gain values are always used. Signed-off-by: Neal Cardwell Signed-off-by: Yuchung Cheng Signed-off-by: Soheil Hassas Yeganeh Signed-off-by: Priyaranjan Jha Signed-off-by: Eric Dumazet --- net/ipv4/tcp_bbr.c | 40 ++-- 1 file changed, 30 insertions(+), 10 deletions(-) diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c index 4cc2223d2cd54..9277abdd822a0 100644 --- a/net/ipv4/tcp_bbr.c +++ b/net/ipv4/tcp_bbr.c @@ -521,8 +521,6 @@ static void bbr_advance_cycle_phase(struct sock *sk) bbr->cycle_idx = (bbr->cycle_idx + 1) & (CYCLE_LEN - 1); bbr->cycle_mstamp = tp->delivered_mstamp; - bbr->pacing_gain = bbr->lt_use_bw ? BBR_UNIT : - bbr_pacing_gain[bbr->cycle_idx]; } /* Gain cycling: cycle pacing gain to converge to fair share of available bw. */ @@ -540,8 +538,6 @@ static void bbr_reset_startup_mode(struct sock *sk) struct bbr *bbr = inet_csk_ca(sk); bbr->mode = BBR_STARTUP; - bbr->pacing_gain = bbr_high_gain; - bbr->cwnd_gain = bbr_high_gain; } static void bbr_reset_probe_bw_mode(struct sock *sk) @@ -549,8 +545,6 @@ static void bbr_reset_probe_bw_mode(struct sock *sk) struct bbr *bbr = inet_csk_ca(sk); bbr->mode = BBR_PROBE_BW; - bbr->pacing_gain = BBR_UNIT; - bbr->cwnd_gain = bbr_cwnd_gain; bbr->cycle_idx = CYCLE_LEN - 1 - prandom_u32_max(bbr_cycle_rand); bbr_advance_cycle_phase(sk);/* flip to next phase of gain cycle */ } @@ -768,8 +762,6 @@ static void bbr_check_drain(struct sock *sk, const struct rate_sample *rs) if (bbr->mode == BBR_STARTUP && bbr_full_bw_reached(sk)) { bbr->mode = BBR_DRAIN; /* drain queue we created */ - bbr->pacing_gain = bbr_drain_gain; /* pace slow to drain */ - bbr->cwnd_gain = bbr_high_gain; /* maintain cwnd */ tcp_sk(sk)->snd_ssthresh = bbr_target_cwnd(sk, bbr_max_bw(sk), BBR_UNIT); } /* fall through to check if 
in-flight is already small: */ @@ -831,8 +823,6 @@ static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs) if (bbr_probe_rtt_mode_ms > 0 && filter_expired && !bbr->idle_restart && bbr->mode != BBR_PROBE_RTT) { bbr->mode = BBR_PROBE_RTT; /* dip, drain queue */ - bbr->pacing_gain = BBR_UNIT; - bbr->cwnd_gain = BBR_UNIT; bbr_save_cwnd(sk); /* note cwnd so we can restore it */ bbr->probe_rtt_done_stamp = 0; } @@ -860,6 +850,35 @@ static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs) bbr->idle_restart = 0; } +static void bbr_update_gains(struct sock *sk) +{ + struct bbr *bbr = inet_csk_ca(sk); + + switch (bbr->mode) { + case BBR_STARTUP: + bbr->pacing_gain = bbr_high_gain; + bbr->cwnd_gain = bbr_high_gain; + break; + case BBR_DRAIN: + bbr->pacing_gain = bbr_drain_gain; /* slow, to drain */ + bbr->cwnd_gain = bbr_high_gain; /* keep cwnd */ + break; + case BBR_PROBE_BW: + bbr->pacing_gain = (bbr->lt_use_bw ? + BBR_UNIT : + bbr_pacing_gain[bbr->cycle_idx]); + bbr->cwnd_gain = bbr_cwnd_gain; + break; + case BBR_PROBE_RTT: + bbr->pacing_gain = BBR_UNIT; + bbr->cwnd_gain = BBR_UNIT; + break; + default: + WARN_ONCE(1, "BBR bad mode: %u\n", bbr->mode); + break; + } +} + static void bbr_update_model(struct sock *sk, const struct rate_sample *rs) { bbr_update_bw(sk, rs); @@ -867,6 +886,7 @@ static void bbr_update_model(struct sock *sk, const struct rate_sample *rs) bbr_check_full_bw_reached(sk, rs); bbr_check_drain(sk, rs); bbr_update_min_rtt(sk, rs); + bbr_update_gains(sk); } static void bbr_main(struct sock *sk, const struct rate_sample *rs) -- 2.19.1.331.ge82ca0e54c-goog
[PATCH net-next 0/2] tcp_bbr: TCP BBR changes for EDT pacing model
Two small patches for TCP BBR to follow up with Eric's recent work to change the TCP and fq pacing machinery to an "earliest departure time" (EDT) model: - The first patch adjusts the TCP BBR logic to work with the new "earliest departure time" (EDT) pacing model. - The second patch adjusts the TCP BBR logic to centralize the setting of gain values, to simplify the code and prepare for future changes. Neal Cardwell (2): tcp_bbr: adjust TCP BBR for departure time pacing tcp_bbr: centralize code to set gains net/ipv4/tcp_bbr.c | 77 ++ 1 file changed, 65 insertions(+), 12 deletions(-) -- 2.19.1.331.ge82ca0e54c-goog
[PATCH net-next 1/2] tcp_bbr: adjust TCP BBR for departure time pacing
Adjust TCP BBR for the new departure time pacing model in the recent commit ab408b6dc7449 ("tcp: switch tcp and sch_fq to new earliest departure time model"). With TSQ and pacing at lower layers, there are often several skbs queued in the pacing layer, and thus there is less data "in the network" than "in flight". With departure time pacing at lower layers (e.g. fq or potential future NICs), the data in the pacing layer now has a pre-scheduled ("baked-in") departure time that cannot be changed, even if the congestion control algorithm decides to use a new pacing rate. This means that there can be a non-trivial lag between when BBR makes a pacing rate change and when the inter-skb pacing delays change. After a pacing rate change, the number of packets in the network can gradually evolve to be higher or lower, depending on whether the sending rate is higher or lower than the delivery rate. Thus ignoring this lag can cause significant overshoot, with the flow ending up with too many or too few packets in the network. This commit changes BBR to adapt its pacing rate based on the amount of data in the network that it estimates has already been "baked in" by previous departure time decisions. We estimate the number of our packets that will be in the network at the earliest departure time (EDT) for the next skb scheduled as: in_network_at_edt = inflight_at_edt - (EDT - now) * bw If we're increasing the amount of data in the network ("in_network"), then we want to know if the transmit of the EDT skb will push in_network above the target, so our answer includes bbr_tso_segs_goal() from the skb departing at EDT. If we're decreasing in_network, then we want to know if in_network will sink too low just before the EDT transmit, so our answer does not include the segments from the skb departing at EDT. Why do we treat pacing_gain > 1.0 case and pacing_gain < 1.0 case differently? The in_network curve is a step function: in_network goes up on transmits, and down on ACKs. 
To accurately predict when in_network will go beyond our target value, this will happen on different events, depending on whether we're concerned about in_network potentially going too high or too low: o if pushing in_network up (pacing_gain > 1.0), then in_network goes above target upon a transmit event o if pushing in_network down (pacing_gain < 1.0), then in_network goes below target upon an ACK event This commit changes the BBR state machine to use this estimated "packets in network" value to make its decisions. Signed-off-by: Neal Cardwell Signed-off-by: Yuchung Cheng Signed-off-by: Eric Dumazet --- net/ipv4/tcp_bbr.c | 37 +++-- 1 file changed, 35 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c index b88081285fd17..4cc2223d2cd54 100644 --- a/net/ipv4/tcp_bbr.c +++ b/net/ipv4/tcp_bbr.c @@ -369,6 +369,39 @@ static u32 bbr_target_cwnd(struct sock *sk, u32 bw, int gain) return cwnd; } +/* With pacing at lower layers, there's often less data "in the network" than + * "in flight". With TSQ and departure time pacing at lower layers (e.g. fq), + * we often have several skbs queued in the pacing layer with a pre-scheduled + * earliest departure time (EDT). BBR adapts its pacing rate based on the + * inflight level that it estimates has already been "baked in" by previous + * departure time decisions. We calculate a rough estimate of the number of our + * packets that might be in the network at the earliest departure time for the + * next skb scheduled: + * in_network_at_edt = inflight_at_edt - (EDT - now) * bw + * If we're increasing inflight, then we want to know if the transmit of the + * EDT skb will push inflight above the target, so inflight_at_edt includes + * bbr_tso_segs_goal() from the skb departing at EDT. If decreasing inflight, + * then estimate if inflight will sink too low just before the EDT transmit. 
+ */ +static u32 bbr_packets_in_net_at_edt(struct sock *sk, u32 inflight_now) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct bbr *bbr = inet_csk_ca(sk); + u64 now_ns, edt_ns, interval_us; + u32 interval_delivered, inflight_at_edt; + + now_ns = tp->tcp_clock_cache; + edt_ns = max(tp->tcp_wstamp_ns, now_ns); + interval_us = div_u64(edt_ns - now_ns, NSEC_PER_USEC); + interval_delivered = (u64)bbr_bw(sk) * interval_us >> BW_SCALE; + inflight_at_edt = inflight_now; + if (bbr->pacing_gain > BBR_UNIT) /* increasing inflight */ + inflight_at_edt += bbr_tso_segs_goal(sk); /* include EDT skb */ + if (interval_delivered >= inflight_at_edt) + return 0; + return inflight_at_edt - interval_delivered; +} + /* An optimization in BBR to reduce losses: On the first round of recovery,
Re: Why not use all the syn queues? in the function "tcp_conn_request", I have some questions.
On Sat, Sep 8, 2018 at 11:23 AM Ttttabcd wrote: > > Thank you very much for your previous answer, sorry for the inconvenience. > > But now I want to ask you one more question. > > The question is why we need two variables to control the syn queue? > > The first is the "backlog" parameter of the "listen" system call that > controls the maximum length limit of the syn queue, it also controls the > accept queue. By default, and essentially always in practice (AFAIK), Linux installations enable syncookies. With syncookies, there is essentially no limit on the syn queue, or number of incomplete passive connections (as the man page you quoted notes). So in practice the listen() parameter usually controls only the accept queue. > The second is /proc/sys/net/ipv4/tcp_max_syn_backlog, which also controls the > maximum length limit of the syn queue. > > So simply changing one of them and wanting to increase the syn queue is not > working. > > In our last discussion, I understood tcp_max_syn_backlog will retain the last > quarter to the IP that has been proven to be alive That discussion pertains to a code path that is relevant if syncookies are disabled, which is very uncommon (see above). > But if tcp_max_syn_backlog is very large, the syn queue will be filled as > well. > > So I don't understand why not just use a variable to control the syn queue. > > For example, just use tcp_max_syn_backlog, which is the maximum length limit > for the syn queue, and it can also be retained to prove that the IP remains > the last quarter. > > The backlog parameter of the listen system call only controls the accpet > queue. > > I feel this is more reasonable. If I don't look at the source code, I really > can't guess the backlog parameter actually controls the syn queue. > > I always thought that it only controlled the accept queue before I looked at > the source code, because the man page is written like this. 
Keep in mind that the semantics of the listen() argument and the /proc/sys/net/ipv4/tcp_max_syn_backlog sysctl knob, as described in the man page, are part of the Linux kernel's user-visible API. So, in essence, they cannot be changed. Changing the semantics of system calls and sysctl knobs breaks applications and system configuration scripts. :-) neal
Re: Why not use all the syn queues? in the function "tcp_conn_request", I have some questions.
On Tue, Sep 4, 2018 at 1:48 AM Ttttabcd wrote: > > Hello everyone,recently I am looking at the source code for handling TCP > three-way handshake(Linux Kernel version 4.18.5). > > I found some strange places in the source code for handling syn messages. > > in the function "tcp_conn_request" > > This code will be executed when we don't enable the syn cookies. > > if (!net->ipv4.sysctl_tcp_syncookies && > (net->ipv4.sysctl_max_syn_backlog - > inet_csk_reqsk_queue_len(sk) < > (net->ipv4.sysctl_max_syn_backlog >> 2)) && > !tcp_peer_is_proven(req, dst)) { > /* Without syncookies last quarter of > * backlog is filled with destinations, > * proven to be alive. > * It means that we continue to communicate > * to destinations, already remembered > * to the moment of synflood. > */ > pr_drop_req(req, ntohs(tcp_hdr(skb)->source), > rsk_ops->family); > goto drop_and_release; > } > > But why don't we use all the syn queues? If tcp_peer_is_proven() returns true then we do allow ourselves to use the whole queue. > Why do we need to leave the size of (net->ipv4.sysctl_max_syn_backlog >> 2) > in the queue? > > Even if the system is attacked by a syn flood, there is no need to leave a > part. Why do we need to leave a part? The comment describes the rationale. If syncookies are disabled, then the last quarter of the backlog is reserved for filling with destinations that were proven to be alive, according to tcp_peer_is_proven() (which uses RTTs measured in previous connections). The idea is that if there is a SYN flood, we do not want to use all of our queue budget on attack traffic but instead want to reserve some queue space for SYNs from real remote machines that we have actually contacted in the past. > The value of sysctl_max_syn_backlog is the maximum length of the queue only > if syn cookies are enabled. Even if syncookies are disabled, sysctl_max_syn_backlog is the maximum length of the queue. 
> This is the first strange place, here is another strange place > > __u32 isn = TCP_SKB_CB(skb)->tcp_tw_isn; > > if ((net->ipv4.sysctl_tcp_syncookies == 2 || > inet_csk_reqsk_queue_is_full(sk)) && !isn) { > > if (!want_cookie && !isn) { > > The value of "isn" comes from TCP_SKB_CB(skb)->tcp_tw_isn, then it is judged > twice whether its value is indeed 0. > > But "tcp_tw_isn" is initialized in the function "tcp_v4_fill_cb" > > TCP_SKB_CB(skb)->tcp_tw_isn = 0; > > So it has always been 0, I used printk to test, and the result is always 0. That field is also set in tcp_timewait_state_process(): TCP_SKB_CB(skb)->tcp_tw_isn = isn; So there can be cases where it is not 0. Hope that helps, neal
[PATCH net] tcp_bbr: fix bw probing to raise in-flight data for very small BDPs
For some very small BDPs (with just a few packets) there was a quantization effect where the target number of packets in flight during the super-unity-gain (1.25x) phase of gain cycling was implicitly truncated to a number of packets no larger than the normal unity-gain (1.0x) phase of gain cycling. This meant that in multi-flow scenarios some flows could get stuck with a lower bandwidth, because they did not push enough packets inflight to discover that there was more bandwidth available. This was really only an issue in multi-flow LAN scenarios, where RTTs and BDPs are low enough for this to be an issue. This fix ensures that gain cycling can raise inflight for small BDPs by ensuring that in PROBE_BW mode target inflight values with a super-unity gain are always greater than inflight values with a gain <= 1. Importantly, this applies whether the inflight value is calculated for use as a cwnd value, or as a target inflight value for the end of the super-unity phase in bbr_is_next_cycle_phase() (both need to be bigger to ensure we can probe with more packets in flight reliably). This is a candidate fix for stable releases. Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control") Signed-off-by: Neal Cardwell Acked-by: Yuchung Cheng Acked-by: Soheil Hassas Yeganeh Acked-by: Priyaranjan Jha Reviewed-by: Eric Dumazet --- net/ipv4/tcp_bbr.c | 4 1 file changed, 4 insertions(+) diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c index 3b5f45b9e81eb..13d34427ca3dd 100644 --- a/net/ipv4/tcp_bbr.c +++ b/net/ipv4/tcp_bbr.c @@ -358,6 +358,10 @@ static u32 bbr_target_cwnd(struct sock *sk, u32 bw, int gain) /* Reduce delayed ACKs by rounding up cwnd to the next even number. */ cwnd = (cwnd + 1) & ~1U; + /* Ensure gain cycling gets inflight above BDP even for small BDPs. */ + if (bbr->mode == BBR_PROBE_BW && gain > BBR_UNIT) + cwnd += 2; + return cwnd; } -- 2.18.0.345.g5c9ce644c3-goog
Re: [PATCH net-next] tcp: ack immediately when a cwr packet arrives
On Tue, Jul 24, 2018 at 1:42 PM Lawrence Brakmo wrote: > > Note that without this fix the 99% latencies when doing 10KB RPCs > in a congested network using DCTCP are 40ms vs. 190us with the patch. > Also note that these 40ms high tail latencies started after commit > 3759824da87b30ce7a35b4873b62b0ba38905ef5 in Jul 2015, > which triggered the bugs/features we are fixing/adding. I agree it is a > debatable whether it is a bug fix or a feature improvement and I am > fine either way. Good point. The fact that this greatly mitigates a regression in DCTCP performance resulting from 3759824da87b30ce7a35b4873b62b0ba38905ef5 ("tcp: PRR uses CRB mode by default and SS mode conditionally") IMHO seems to be a good argument for putting this patch ("tcp: ack immediately when a cwr packet arrives") in the "net" branch and stable releases. thanks, neal
Re: [PATCH net-next] tcp: ack immediately when a cwr packet arrives
On Tue, Jul 24, 2018 at 1:07 PM Yuchung Cheng wrote: > > On Mon, Jul 23, 2018 at 7:23 PM, Daniel Borkmann wrote: > > Should this go to net tree instead where all the other fixes went? > I am neutral but this feels more like a feature improvement I agree this feels like a feature improvement rather than a bug fix. neal
Re: [PATCH net-next] tcp: ack immediately when a cwr packet arrives
On Mon, Jul 23, 2018 at 8:49 PM Lawrence Brakmo wrote: > > We observed high 99 and 99.9% latencies when doing RPCs with DCTCP. The > problem is triggered when the last packet of a request arrives CE > marked. The reply will carry the ECE mark causing TCP to shrink its cwnd > to 1 (because there are no packets in flight). When the 1st packet of > the next request arrives, the ACK was sometimes delayed even though it > is CWR marked, adding up to 40ms to the RPC latency. > > This patch ensures that CWR marked data packets arriving will be acked > immediately. ... > Modified based on comments by Neal Cardwell > > Signed-off-by: Lawrence Brakmo > --- > net/ipv4/tcp_input.c | 9 - > 1 file changed, 8 insertions(+), 1 deletion(-) Seems like a nice mechanism to have, IMHO. Acked-by: Neal Cardwell Thanks! neal
Re: [PATCH net-next] tcp: expose both send and receive intervals for rate sample
On Mon, Jul 9, 2018 at 1:58 PM Deepti Raghavan wrote: > > Congestion control algorithms, which access the rate sample > through the tcp_cong_control function, only have access to the maximum > of the send and receive interval, for cases where the acknowledgment > rate may be inaccurate due to ACK compression or decimation. Algorithms > may want to use send rates and receive rates as separate signals. > > Signed-off-by: Deepti Raghavan > --- > include/net/tcp.h | 2 ++ > net/ipv4/tcp_rate.c | 4 > 2 files changed, 6 insertions(+) Thanks for re-sending. It does seem to be showing up in patchwork now: https://patchwork.ozlabs.org/patch/941532/ And I can confirm I'm able to apply it to net-next. Acked-by: Neal Cardwell thanks, neal
Re: [PATCH net-next v3 0/2] tcp: fix high tail latencies in DCTCP
On Sat, Jul 7, 2018 at 7:15 AM David Miller wrote: > > From: Lawrence Brakmo > Date: Tue, 3 Jul 2018 09:26:13 -0700 > > > We have observed high tail latencies when using DCTCP for RPCs as > > compared to using Cubic. For example, in one setup there are 2 hosts > > sending to a 3rd one, with each sender having 3 flows (1 stream, > > 1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following > > table shows the 99% and 99.9% latencies for both Cubic and dctcp: > > > >Cubic 99% Cubic 99.9% dctcp 99%dctcp 99.9% > > 1MB RPCs2.6ms 5.5ms 43ms 208ms > > 10KB RPCs1.1ms 1.3ms 53ms 212ms > ... > > v2: Removed call to tcp_ca_event from tcp_send_ack since I added one in > > tcp_event_ack_sent. Based on Neal Cardwell > > feedback. > > Modified tcp_ecn_check_ce (and renamed it tcp_ecn_check) instead of > > modifying > > tcp_ack_send_check to ensure an ACK when cwr is received. > > v3: Handling cwr in tcp_ecn_accept_cwr instead of in tcp_ecn_check. > > > > [PATCH net-next v3 1/2] tcp: notify when a delayed ack is sent > > [PATCH net-next v3 2/2] tcp: ack immediately when a cwr packet > > Neal and co., what are your thoughts right now about this patch series? > > Thank you. IMHO these patches are a definite improvement over what we have now. That said, in chatting with Yuchung before the July 4th break, I think Yuchung and I agreed that we would ideally like to see something like the following: (1) refactor the DCTCP code to check for pending delayed ACKs directly using existing state (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER), and remove the ca->delayed_ack_reserved DCTCP field and the CA_EVENT_DELAYED_ACK and CA_EVENT_NON_DELAYED_ACK callbacks added for DCTCP (which Larry determined had at least one bug).
(2) fix the bug with the DCTCP call to tcp_send_ack(sk) causing delayed ACKs to be incorrectly dropped/forgotten (not yet addressed by this patch series) (3) then with fixes (1) and (2) in place, re-run tests and see if we still need Larry's heuristic (in patch 2) to fire an ACK immediately if a receiver receives a CWR packet (I suspect this is still very useful, but I think Yuchung is reluctant to add this complexity unless we have verified it's still needed after (1) and (2)) Our team may be able to help out with some proposed patches for (1) and (2). In any case, I would love to have Yuchung and Eric weigh in (perhaps Monday) before we merge this patch series. Thanks, neal
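A minimal sketch of the delayed-ACK check proposed in step (1) above, using a simplified stand-in for the kernel's inet_connection_sock (the flag names mirror the kernel's ICSK_ACK_* bits, but the struct and bit values here are illustrative, not the real kernel definitions):

```c
#include <assert.h>

/* Illustrative stand-ins for the kernel's icsk_ack.pending flag bits. */
#define ICSK_ACK_SCHED 1 /* an ACK is owed to the peer */
#define ICSK_ACK_TIMER 2 /* the delayed-ACK timer is armed */

struct fake_icsk {
	unsigned char icsk_ack_pending;
};

/* Instead of DCTCP tracking its own ca->delayed_ack_reserved bit via
 * CA_EVENT_(NON_)DELAYED_ACK callbacks, it could consult existing state:
 * a delayed ACK is outstanding iff the delayed-ACK timer bit is set,
 * mirroring the (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER) test. */
static int delayed_ack_pending(const struct fake_icsk *icsk)
{
	return (icsk->icsk_ack_pending & ICSK_ACK_TIMER) != 0;
}
```

This is only a model of the suggested refactor, not the patch itself; the point is that the information DCTCP needs already exists in the socket's ACK state.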
Re: [PATCH net-next v2 0/2] tcp: fix high tail latencies in DCTCP
On Tue, Jul 3, 2018 at 11:10 AM Lawrence Brakmo wrote: > > On 7/2/18, 5:52 PM, "netdev-ow...@vger.kernel.org on behalf of Neal Cardwell" > wrote: > > On Mon, Jul 2, 2018 at 5:39 PM Lawrence Brakmo wrote: > > > > We have observed high tail latencies when using DCTCP for RPCs as > > compared to using Cubic. For example, in one setup there are 2 hosts > > sending to a 3rd one, with each sender having 3 flows (1 stream, > > 1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following > > table shows the 99% and 99.9% latencies for both Cubic and dctcp: > > > > Cubic 99%  Cubic 99.9%  dctcp 99%  dctcp 99.9% > > 1MB RPCs  2.6ms  5.5ms  43ms  208ms > > 10KB RPCs  1.1ms  1.3ms  53ms  212ms > > > > Looking at tcpdump traces showed that there are two causes for the > > latency. > > > > 1) RTOs caused by the receiver sending a dup ACK and not ACKing > > the last (and only) packet sent. > > 2) Delaying ACKs when the sender has a cwnd of 1, so everything > > pauses for the duration of the delayed ACK. > > > > The first patch fixes the cause of the dup ACKs, not updating DCTCP > > state when an ACK that was initially delayed has been sent with a > > data packet. > > > > The second patch ensures that an ACK is sent immediately when a > > CWR marked packet arrives. > > > > With the patches the latencies for DCTCP now look like: > > > > dctcp 99%  dctcp 99.9% > > 1MB RPCs  5.8ms  6.9ms > > 10KB RPCs  146us  203us > > > > Note that while the 1MB RPCs tail latencies are higher than Cubic's, > > the 10KB latencies are much smaller than Cubic's. These patches fix > > issues on the receiver, but tcpdump traces indicate there is an > > opportunity to also fix an issue at the sender that adds about 3ms > > to the tail latencies. 
> > > > The following trace shows the issue that triggers an RTO (fixed by these > patches): > > > >Host A sends the last packets of the request > >Host B receives them, and the last packet is marked with congestion > (CE) > >Host B sends ACKs for packets not marked with congestion > >Host B sends data packet with reply and ACK for packet marked with > > congestion (TCP flag ECE) > >Host A receives ACKs with no ECE flag > >Host A receives data packet with ACK for the last packet of request > > and which has TCP ECE bit set > >Host A sends 1st data packet of the next request with TCP flag CWR > >Host B receives the packet (as seen in tcpdump at B), no CE flag > >Host B sends a dup ACK that also has the TCP ECE flag > >Host A RTO timer fires! > >Host A to send the next packet > >Host A receives an ACK for everything it has sent (i.e. Host B > > did receive 1st packet of request) > >Host A sends more packets… > > > > [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent > > [PATCH net-next v2 2/2] tcp: ack immediately when a cwr packet > > > > net/ipv4/tcp_input.c | 16 +++- > > net/ipv4/tcp_output.c | 4 ++-- > > 2 files changed, 13 insertions(+), 7 deletions(-) > > Thanks, Larry. Just for context, can you please let us know whether > your tests included zero, one, or both of Eric's recent commits > (listed below) that tuned the number of ACKs after ECN events? (Or > maybe the tests were literally using a net-next kernel?) Just wanted > to get a better handle on any possible interactions there. > > Thanks! > > neal > > Yes, my test kernel includes both patches listed below. OK, great. > BTW, I will send a new > patch where I move the call to tcp_incr_quickack from tcp_ecn_check_ce to > tcp_ecn_accept_cwr. OK, sounds good to me. thanks, neal
Re: [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent
On Mon, Jul 2, 2018 at 7:49 PM Yuchung Cheng wrote: > > On Mon, Jul 2, 2018 at 2:39 PM, Lawrence Brakmo wrote: > > > > DCTCP depends on the CA_EVENT_NON_DELAYED_ACK and CA_EVENT_DELAYED_ACK > > notifications to keep track of whether it needs to send an ACK for packets that > > were received with a particular ECN state but whose ACK was delayed. > > > > Under some circumstances, for example when a delayed ACK is sent with a > > data packet, DCTCP state was not being updated due to a lack of > > notification that the previously delayed ACK was sent. As a result, it > > would sometimes send a duplicate ACK when a new data packet arrived. > > > > This patch ensures that DCTCP's state is correctly updated so it will > > not send the duplicate ACK. > Sorry to chime-in late here (lame excuse: IETF deadline) > > IIRC this issue would exist prior to the 4.11 kernel. While it'd be good > to fix that, it's not clear which patch introduced the regression > between 4.11 and 4.16? I assume you tested Eric's most recent quickack > fix. I don't think Larry is saying there's a regression between 4.11 and 4.16. His recent "tcp: force cwnd at least 2 in tcp_cwnd_reduction" patch here: https://patchwork.ozlabs.org/patch/935233/ said that 4.11 was bad (high tail latency and lots of RTOs) and 4.16 was still bad but with different netstat counters (no RTOs but still high tail latency): """ On 4.11, pcap traces indicate that in some instances the 1st packet of the RPC is received but no ACK is sent before the packet is retransmitted. On 4.11 netstat shows TCP timeouts, with some of them spurious. On 4.16, we don't see retransmits in netstat but the high tail latencies are still there. 
""" I suspect the RTOs disappeared but latencies remained too high because between 4.11 and 4.16 we introduced: tcp: allow TLP in ECN CWR https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=b4f70c3d4ec32a2ff4c62e1e2da0da5f55fe12bd So the RTOs probably disappeared because this commit turned them into TLPs. But the latencies remained high because the fundamental bug remained throughout 4.11 and 4.16 and today: the DCTCP use of tcp_send_ack() with an old rcv_nxt caused delayed ACKs to be cancelled when they should not have been. > In terms of the fix itself, it seems odd the tcp_send_ack() call in > DCTCP generates NON_DELAYED_ACK event to toggle DCTCP's > delayed_ack_reserved bit. Shouldn't the fix to have DCTCP send the > "prior" ACK w/o cancelling delayed-ACK and mis-recording that it's > cancelled, because that prior-ACK really is a supplementary old ACK. This patch is fixing an issue that's orthogonal to the one you are talking about. Using the taxonomy from our team's internal discussion yesterday, the issue you mention where the DCTCP "prior" ACK is cancelling delayed ACKs is issue (4); the issue that this particular "tcp: notify when a delayed ack is sent" patch from Larry fixes is issue (3). It's a bit tricky because both issues appeared in Larry's trace summary and packetdrill script to reproduce the issue. neal
Re: [PATCH net-next v2 0/2] tcp: fix high tail latencies in DCTCP
On Mon, Jul 2, 2018 at 5:39 PM Lawrence Brakmo wrote: > > We have observed high tail latencies when using DCTCP for RPCs as > compared to using Cubic. For example, in one setup there are 2 hosts > sending to a 3rd one, with each sender having 3 flows (1 stream, > 1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following > table shows the 99% and 99.9% latencies for both Cubic and dctcp: > > Cubic 99%  Cubic 99.9%  dctcp 99%  dctcp 99.9% > 1MB RPCs  2.6ms  5.5ms  43ms  208ms > 10KB RPCs  1.1ms  1.3ms  53ms  212ms > > Looking at tcpdump traces showed that there are two causes for the > latency. > > 1) RTOs caused by the receiver sending a dup ACK and not ACKing > the last (and only) packet sent. > 2) Delaying ACKs when the sender has a cwnd of 1, so everything > pauses for the duration of the delayed ACK. > > The first patch fixes the cause of the dup ACKs, not updating DCTCP > state when an ACK that was initially delayed has been sent with a > data packet. > > The second patch ensures that an ACK is sent immediately when a > CWR marked packet arrives. > > With the patches the latencies for DCTCP now look like: > > dctcp 99%  dctcp 99.9% > 1MB RPCs  5.8ms  6.9ms > 10KB RPCs  146us  203us > > Note that while the 1MB RPCs tail latencies are higher than Cubic's, > the 10KB latencies are much smaller than Cubic's. These patches fix > issues on the receiver, but tcpdump traces indicate there is an > opportunity to also fix an issue at the sender that adds about 3ms > to the tail latencies. 
> > The following trace shows the issue that triggers an RTO (fixed by these > patches): > >Host A sends the last packets of the request >Host B receives them, and the last packet is marked with congestion (CE) >Host B sends ACKs for packets not marked with congestion >Host B sends data packet with reply and ACK for packet marked with > congestion (TCP flag ECE) >Host A receives ACKs with no ECE flag >Host A receives data packet with ACK for the last packet of request > and which has TCP ECE bit set >Host A sends 1st data packet of the next request with TCP flag CWR >Host B receives the packet (as seen in tcpdump at B), no CE flag >Host B sends a dup ACK that also has the TCP ECE flag >Host A RTO timer fires! >Host A to send the next packet >Host A receives an ACK for everything it has sent (i.e. Host B > did receive 1st packet of request) >Host A sends more packets… > > [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent > [PATCH net-next v2 2/2] tcp: ack immediately when a cwr packet > > net/ipv4/tcp_input.c | 16 +++- > net/ipv4/tcp_output.c | 4 ++-- > 2 files changed, 13 insertions(+), 7 deletions(-) Thanks, Larry. Just for context, can you please let us know whether your tests included zero, one, or both of Eric's recent commits (listed below) that tuned the number of ACKs after ECN events? (Or maybe the tests were literally using a net-next kernel?) Just wanted to get a better handle on any possible interactions there. Thanks! neal --- https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=522040ea5fdd1c33bbf75e1d7c7c0422b96a94ef commit 522040ea5fdd1c33bbf75e1d7c7c0422b96a94ef Author: Eric Dumazet Date: Mon May 21 15:08:57 2018 -0700 tcp: do not aggressively quick ack after ECN events ECN signals currently forces TCP to enter quickack mode for up to 16 (TCP_MAX_QUICKACKS) following incoming packets. We believe this is not needed, and only sending one immediate ack for the current packet should be enough. 
This should reduce the extra load noticed in DCTCP environments, after congestion events. This is part 2 of our effort to reduce pure ACK packets. Signed-off-by: Eric Dumazet Acked-by: Soheil Hassas Yeganeh Acked-by: Yuchung Cheng Acked-by: Neal Cardwell Signed-off-by: David S. Miller https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=15ecbe94a45ef88491ca459b26efdd02f91edb6d commit 15ecbe94a45ef88491ca459b26efdd02f91edb6d Author: Eric Dumazet Date: Wed Jun 27 08:47:21 2018 -0700 tcp: add one more quick ack after after ECN events Larry Brakmo proposal ( https://patchwork.ozlabs.org/patch/935233/ tcp: force cwnd at least 2 in tcp_cwnd_reduction) made us rethink about our recent patch removing ~16 quick acks after ECN events. tcp_enter_quickack_mode(sk, 1) makes sure one immediate ack is sent, but in the case the sender cwnd was lowered to 1, we do not want to have a delayed ack for the next packet we will receive. Fixes: 522040ea5fdd ("tcp: do not aggressively quick ack after ECN events") Signed-off-by: Eric Dumazet Reported-by: Neal Cardwell Cc: Lawrence Brakmo Acked-by: Neal Cardwell Signed-off-by: David S. Miller
Re: [PATCH net-next 1/2] tcp: notify when a delayed ack is sent
On Fri, Jun 29, 2018 at 9:48 PM Lawrence Brakmo wrote: > > DCTCP depends on the CA_EVENT_NON_DELAYED_ACK and CA_EVENT_DELAYED_ACK > notifications to keep track of whether it needs to send an ACK for packets that > were received with a particular ECN state but whose ACK was delayed. > > Under some circumstances, for example when a delayed ACK is sent with a > data packet, DCTCP state was not being updated due to a lack of > notification that the previously delayed ACK was sent. As a result, it > would sometimes send a duplicate ACK when a new data packet arrived. > > This patch ensures that DCTCP's state is correctly updated so it will > not send the duplicate ACK. > > Signed-off-by: Lawrence Brakmo > --- > net/ipv4/tcp_output.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index f8f6129160dd..41f6ad7a21e4 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -172,6 +172,8 @@ static inline void tcp_event_ack_sent(struct sock *sk, > unsigned int pkts) > __sock_put(sk); > } > tcp_dec_quickack_mode(sk, pkts); > + if (inet_csk_ack_scheduled(sk)) > + tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK); > inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK); > } Thanks for this fix! Seems like this would work, but if I am reading this correctly then it seems like this would cause a duplicate call to tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK) when we are sending a pure ACK (delayed or non-delayed): (1) once from tcp_send_ack() before we send the ACK: tcp_send_ack(struct sock *sk) ->tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK); (2) then again from tcp_event_ack_sent() after we have sent the ACK: tcp_event_ack_sent() -> if (inet_csk_ack_scheduled(sk)) tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK); What if we remove the original CA_EVENT_NON_DELAYED_ACK call and just replace it with your new one? 
(not compiled, not tested): diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 3889dcd4868d4..bddb49617d9be 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -184,6 +184,8 @@ static inline void tcp_event_ack_sent(struct sock *sk, unsigned int pkts) __sock_put(sk); } tcp_dec_quickack_mode(sk, pkts); + if (inet_csk_ack_scheduled(sk)) + tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK); inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK); } @@ -3836,8 +3838,6 @@ void tcp_send_ack(struct sock *sk) if (sk->sk_state == TCP_CLOSE) return; - tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK); - /* We are not putting this on the write queue, so * tcp_transmit_skb() will set the ownership to this * sock. Aside from lower CPU overhead, one nice benefit of that is that we then only call tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK) in one place, which might be a little easier to reason about. Does that work? neal
Re: [PATCH net-next 2/2] tcp: ack immediately when a cwr packet arrives
On Sat, Jun 30, 2018 at 9:47 PM Lawrence Brakmo wrote: > I see two issues, one is that entering quickack mode as you > mentioned does not ensure that it will still be on when the CWR > arrives. The second issue is that the problem occurs right after the > receiver sends a small reply which results in entering pingpong mode > right before the sender starts the new request by sending just one > packet (i.e. forces delayed ack). > > I compiled and tested your patch. Both 99 and 99.9 percentile > latencies are around 40ms. Looking at the packet traces shows that > some CWR marked packets are not being ACKed immediately (delayed by > 40ms). Thanks, Larry! So your tests provide nice, specific evidence that it is good to force an immediate ACK when a receiver receives a packet with CWR marked. Given that, I am wondering what the simplest way is to achieve that goal. What if, rather than plumbing a new specific signal into __tcp_ack_snd_check(), we use the existing general quick-ack mechanism, where various parts of the TCP stack (like __tcp_ecn_check_ce()) are already using the quick-ack mechanism to "remotely" signal to __tcp_ack_snd_check() that they want an immediate ACK. For example, would it work to do something like: diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index c53ae5fc834a5..8168d1938b376 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -262,6 +262,12 @@ static void __tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb) { struct tcp_sock *tp = tcp_sk(sk); + /* If the sender is telling us it has entered CWR, then its cwnd may be +* very low (even just 1 packet), so we should ACK immediately. +*/ + if (tcp_hdr(skb)->cwr) + tcp_enter_quickack_mode(sk, 2); + switch (TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK) { case INET_ECN_NOT_ECT: /* Ensure that ECN will not continue to mark packets. 
*/ And then since that broadens the mission of this function beyond checking just the ECT/CE bits, I suppose we could rename the __tcp_ecn_check_ce() and tcp_ecn_check_ce() functions to __tcp_ecn_check() and tcp_ecn_check(), or something like that. Would that work for this particular issue? neal
Re: [PATCH net] tcp: prevent bogus FRTO undos with non-SACK flows
On Fri, Jun 29, 2018 at 3:52 PM Ilpo Järvinen wrote: > > +.005 < . 1:1(0) ack 2001 win 257 > > Why did the receiver send a cumulative ACK only for 2001? Sorry, you are right Ilpo. Upon further reflection, the packetdrill scenario I posted is not a realistic one, and I agree we should not worry about it. As I mentioned, I ran your patch through all our team's TCP packetdrill tests, and it passes all of the tests. One of our tests needed updating, because if there is a non-SACK connection with a spurious RTO due to a delayed flight of ACKs then the FRTO undo now happens one ACK later (when we get an ACK that doesn't cover a retransmit). But that seems fine to me. I also cooked the new packetdrill test below to explicitly cover this case you are addressing (please let me know if you have an alternate suggestion). Tested-by: Neal Cardwell Acked-by: Neal Cardwell Thanks! neal --- 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 32792 +0 > S. 0:0(0) ack 1 +.02 < . 1:1(0) ack 1 win 257 +0 accept(3, ..., ...) = 4 // Send 3 packets. First is really lost. And the dupacks // for the data packets that arrived at the receiver are slow in arriving. +0 write(4, ..., 3000) = 3000 +0 > P. 1:3001(3000) ack 1 // RTO and retransmit head. This fills a real hole. +.22 > . 1:1001(1000) ack 1 // Dupacks for packets 2 and 3 arrive. +.02 < . 1:1(0) ack 1 win 257 +0 < . 1:1(0) ack 1 win 257 // The cumulative ACK for all the data arrives. We do not undo, because // this is a non-SACK connection, and retransmitted data was ACKed. // It's good that there's no FRTO undo, since a packet was really lost. // Because this is non-SACK, tcp_try_undo_recovery() holds CA_Loss // until something beyond high_seq is ACKed. +.005 < . 
1:1(0) ack 3001 win 257 +0 %{ assert tcpi_ca_state == TCP_CA_Loss, tcpi_ca_state }% +0 %{ assert tcpi_snd_cwnd == 4, tcpi_snd_cwnd }% +0 %{ assert tcpi_snd_ssthresh == 7, tcpi_snd_ssthresh }%
Re: [PATCH net-next 0/2] tcp: fix high tail latencies in DCTCP
On Fri, Jun 29, 2018 at 9:48 PM Lawrence Brakmo wrote: > > We have observed high tail latencies when using DCTCP for RPCs as > compared to using Cubic. For example, in one setup there are 2 hosts > sending to a 3rd one, with each sender having 3 flows (1 stream, > 1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following > table shows the 99% and 99.9% latencies for both Cubic and dctcp: > > Cubic 99%  Cubic 99.9%  dctcp 99%  dctcp 99.9% > 1MB RPCs  2.6ms  5.5ms  43ms  208ms > 10KB RPCs  1.1ms  1.3ms  53ms  212ms > > Looking at tcpdump traces showed that there are two causes for the > latency. > > 1) RTOs caused by the receiver sending a dup ACK and not ACKing > the last (and only) packet sent. > 2) Delaying ACKs when the sender has a cwnd of 1, so everything > pauses for the duration of the delayed ACK. > > The first patch fixes the cause of the dup ACKs, not updating DCTCP > state when an ACK that was initially delayed has been sent with a > data packet. > > The second patch ensures that an ACK is sent immediately when a > CWR marked packet arrives. > > With the patches the latencies for DCTCP now look like: > > dctcp 99%  dctcp 99.9% > 1MB RPCs  4.8ms  6.5ms > 10KB RPCs  143us  184us > > Note that while the 1MB RPCs tail latencies are higher than Cubic's, > the 10KB latencies are much smaller than Cubic's. These patches fix > issues on the receiver, but tcpdump traces indicate there is an > opportunity to also fix an issue at the sender that adds about 3ms > to the tail latencies. 
> > The following trace shows the issue that triggers an RTO (fixed by these > patches): > >Host A sends the last packets of the request >Host B receives them, and the last packet is marked with congestion (CE) >Host B sends ACKs for packets not marked with congestion >Host B sends data packet with reply and ACK for packet marked with > congestion (TCP flag ECE) >Host A receives ACKs with no ECE flag >Host A receives data packet with ACK for the last packet of request > and which has TCP ECE bit set >Host A sends 1st data packet of the next request with TCP flag CWR >Host B receives the packet (as seen in tcpdump at B), no CE flag >Host B sends a dup ACK that also has the TCP ECE flag >Host A RTO timer fires! >Host A to send the next packet >Host A receives an ACK for everything it has sent (i.e. Host B > did receive 1st packet of request) >Host A sends more packets… > > [PATCH net-next 1/2] tcp: notify when a delayed ack is sent > [PATCH net-next 2/2] tcp: ack immediately when a cwr packet arrives Thanks, Larry! I suspect there is a broader problem with "DCTCP-style precise ECE ACKs" that this patch series does not address. AFAICT the DCTCP "historical" ACKs for ECE precision, generated in dctcp_ce_state_0_to_1() and dctcp_ce_state_1_to_0(), violate the assumptions of the pre-existing delayed ACK state machine. They violate those assumptions by rewinding tp->rcv_nxt backwards and then calling tcp_send_ack(sk). But the existing code path to send an ACK assumes that any ACK transmission always sends an ACK that accounts for all received data, so that after sending an ACK we can cancel the delayed ACK timer. 
So it seems with DCTCP we can have buggy sequences where the DCTCP historical ACK causes us to forget that we need to schedule a delayed ACK (which will force the sender to RTO, as shown in the trace):

tcp_event_data_recv()
  inet_csk_schedule_ack()   // remember that we need an ACK
  tcp_ecn_check_ce()
    tcp_ca_event()          // with CA_EVENT_ECN_IS_CE or CA_EVENT_ECN_NO_CE
      dctcp_cwnd_event()
        dctcp_ce_state_0_to_1() or dctcp_ce_state_1_to_0()
          if (... && ca->delayed_ack_reserved)
            tp->rcv_nxt = ca->prior_rcv_nxt;
            tcp_send_ack(sk);   // send an ACK, but for old data!
              tcp_transmit_skb()
                tcp_event_ack_sent()
                  inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK); // forget that we need a delayed ACK!

AFAICT the first patch in this series, to force an immediate ACK on any packet with a CWR, papers over this issue of the forgotten delayed ACK by forcing an immediate ack in __tcp_ack_snd_check(). This ensures that at least for CWR packets the forgotten delayed ACK does not matter. But for packets without CWR I believe the buggy "forgotten delayed ACK" sequence above is still possible, and could still plague DCTCP with high tail latencies in some cases. I suspect that after your CWR-forces-ACK patch it may not be jumping out in performance test results because (a) if there is no CWR involved then usually the cwnd is high enough that missing delayed ACK does not cause a stall, and (b) if there is a CWR involved then your patch forces an immediate ACK. But AFAICT that forgotten delayed ACK bug is still there. What do folks think? neal
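The forgotten-delayed-ACK sequence in that trace can be modeled in a few lines. This is a toy model only: the transmit path clears the delayed-ACK timer after *any* ACK is sent, including a DCTCP "historical" ACK rewound to an old rcv_nxt, so the receiver forgets it still owes an ACK for newer data. All names and fields here are illustrative, not the kernel's:

```c
#include <assert.h>

struct fake_sock {
	unsigned int rcv_nxt;    /* next byte we expect from the peer */
	unsigned int acked_upto; /* highest sequence we have ACKed so far */
	int delayed_ack_armed;   /* is the delayed-ACK timer pending? */
};

static void fake_send_ack(struct fake_sock *sk)
{
	sk->acked_upto = sk->rcv_nxt;
	sk->delayed_ack_armed = 0; /* models tcp_event_ack_sent() clearing ICSK_TIME_DACK */
}

/* DCTCP-style "historical" ACK: rewind rcv_nxt, send the ACK, restore. */
static void fake_dctcp_prior_ack(struct fake_sock *sk, unsigned int prior_rcv_nxt)
{
	unsigned int saved = sk->rcv_nxt;

	sk->rcv_nxt = prior_rcv_nxt;
	fake_send_ack(sk);
	sk->rcv_nxt = saved;
}
```

After the historical ACK, the peer has only been ACKed up to the old sequence, yet the delayed-ACK timer for the newer data has been silently cancelled, which is exactly the hazard described above.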
Re: [PATCH net-next 2/2] tcp: ack immediately when a cwr packet arrives
On Fri, Jun 29, 2018 at 9:48 PM Lawrence Brakmo wrote: > > We observed high 99 and 99.9% latencies when doing RPCs with DCTCP. The > problem is triggered when the last packet of a request arrives CE > marked. The reply will carry the ECE mark causing TCP to shrink its cwnd > to 1 (because there are no packets in flight). When the 1st packet of > the next request arrives it was sometimes delayed adding up to 40ms to > the latency. > > This patch ensures that CWR marked data packets arriving will be ACKed > immediately. > > Signed-off-by: Lawrence Brakmo > --- Thanks, Larry. Ensuring that CWR-marked data packets arriving will be ACKed immediately sounds like a good goal to me. I am wondering if we can have a simpler fix. The dctcp_ce_state_0_to_1() code is setting the TCP_ECN_DEMAND_CWR bit in ecn_flags, which disables the code in __tcp_ecn_check_ce() that would have otherwise used the tcp_enter_quickack_mode() mechanism to force an ACK: __tcp_ecn_check_ce(): ... case INET_ECN_CE: if (tcp_ca_needs_ecn(sk)) tcp_ca_event(sk, CA_EVENT_ECN_IS_CE); // -> dctcp_ce_state_0_to_1() // -> tp->ecn_flags |= TCP_ECN_DEMAND_CWR; if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) { /* Better not delay acks, sender can have a very low cwnd */ tcp_enter_quickack_mode(sk, 1); tp->ecn_flags |= TCP_ECN_DEMAND_CWR; } tp->ecn_flags |= TCP_ECN_SEEN; break; So it seems like the bug here may be that dctcp_ce_state_0_to_1() is setting the TCP_ECN_DEMAND_CWR bit in ecn_flags, when really it should let its caller, __tcp_ecn_check_ce() set TCP_ECN_DEMAND_CWR, in which case the code would already properly force an immediate ACK. 
So, what if we use this fix instead (not compiled, not tested): diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c index 5f5e5936760e..4fecd2824edb 100644 --- a/net/ipv4/tcp_dctcp.c +++ b/net/ipv4/tcp_dctcp.c @@ -152,8 +152,6 @@ static void dctcp_ce_state_0_to_1(struct sock *sk) ca->prior_rcv_nxt = tp->rcv_nxt; ca->ce_state = 1; - - tp->ecn_flags |= TCP_ECN_DEMAND_CWR; } static void dctcp_ce_state_1_to_0(struct sock *sk) What do you think? neal
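The interaction described above can be sketched with a small model: __tcp_ecn_check_ce() forces an immediate ACK on CE only when TCP_ECN_DEMAND_CWR is not already set, so a congestion-control callback that sets the bit first suppresses the immediate ACK. This is a simplified illustration with made-up flag values, not the kernel code:

```c
#include <assert.h>

#define TCP_ECN_DEMAND_CWR 0x1 /* illustrative value */

struct fake_tp {
	int ecn_flags;
	int quickacks; /* number of forced immediate ACKs granted */
};

/* Model of the CE branch of __tcp_ecn_check_ce(). If ca_sets_demand_cwr is
 * nonzero, the CA callback (dctcp_ce_state_0_to_1() before the proposed fix)
 * sets TCP_ECN_DEMAND_CWR first, defeating the quick-ACK below. */
static void ce_event(struct fake_tp *tp, int ca_sets_demand_cwr)
{
	if (ca_sets_demand_cwr)
		tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
	if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
		tp->quickacks++; /* tcp_enter_quickack_mode(sk, 1) */
		tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
	}
}
```

With the proposed two-line removal, DCTCP leaves the bit to the generic path and the immediate ACK fires as intended.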
Re: [PATCH net] tcp: prevent bogus FRTO undos with non-SACK flows
On Fri, Jun 29, 2018 at 6:07 AM Ilpo Järvinen wrote: > > If SACK is not enabled and the first cumulative ACK after the RTO > retransmission covers more than the retransmitted skb, a spurious > FRTO undo will trigger (assuming FRTO is enabled for that RTO). > The reason is that any non-retransmitted segment acknowledged will > set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is > no indication that it would have been delivered for real (the > scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK > case so the check for that bit won't help like it does with SACK). > Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo > in tcp_process_loss. > > We need to use more strict condition for non-SACK case and check > that none of the cumulatively ACKed segments were retransmitted > to prove that progress is due to original transmissions. Only then > keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in > non-SACK case. > > (FLAG_ORIG_SACK_ACKED is planned to be renamed to FLAG_ORIG_PROGRESS > to better indicate its purpose but to keep this change minimal, it > will be done in another patch). > > Besides burstiness and congestion control violations, this problem > can result in an RTO loop: When the loss recovery is prematurely > undone, only new data will be transmitted (if available) and > the next retransmission can occur only after a new RTO which in case > of multiple losses (that are not for consecutive packets) requires > one RTO per loss to recover. 
> > Signed-off-by: Ilpo Järvinen > --- > net/ipv4/tcp_input.c | 9 + > 1 file changed, 9 insertions(+) > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index 045d930..8e5522c 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -3181,6 +3181,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 > prior_fack, > > if (tcp_is_reno(tp)) { > tcp_remove_reno_sacks(sk, pkts_acked); > + > + /* If any of the cumulatively ACKed segments was > +* retransmitted, non-SACK case cannot confirm that > +* progress was due to original transmission due to > +* lack of TCPCB_SACKED_ACKED bits even if some of > +* the packets may have been never retransmitted. > +*/ > + if (flag & FLAG_RETRANS_DATA_ACKED) > + flag &= ~FLAG_ORIG_SACK_ACKED; > } else { > int delta; Thanks, Ilpo. I ran this through our TCP packetdrill tests and only got one failure, which detected the change you made, so that on the first ACK in a non-SACK F-RTO case we stay in CA_Loss. However, depending on the exact scenario we are concerned about here, this may not be a complete solution. Consider the packetdrill test I cooked below, which is for a scenario where we imagine a non-SACK connection, with data packet #1 and all dupacks lost. With your patch we don't have an F-RTO undo on the first cumulative ACK, which is a definite improvement. However, we end up with an F-RTO undo on the *second* cumulative ACK, because the second ACK didn't cover any retransmitted data. That means that we end up in the second round trip back in the initial slow-start mode with a very high cwnd and infinite ssthresh, even though data was actually lost in the first round trip. I'm not sure how much we care about all these cases. Perhaps we should just disable F-RTO if the connection has no SACK enabled? These non-SACK connections are just such a rare case at this point that I would propose it's not worth spending too much time on this. The following packetdrill script passes when Ilpo's patch is applied. 
This documents the behavior, which I think is questionable: 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 32792 +0 > S. 0:0(0) ack 1 +.02 < . 1:1(0) ack 1 win 257 +0 accept(3, ..., ...) = 4 +0 write(4, ..., 27000) = 27000 +0 > . 1:10001(10000) ack 1 // Suppose 1:1001 is lost and all dupacks are lost. // RTO and retransmit head. This fills a real hole. +.22 > . 1:1001(1000) ack 1 +.005 < . 1:1(0) ack 2001 win 257 +0 > . 10001:13001(3000) ack 1 +0 %{ assert tcpi_ca_state == TCP_CA_Loss, tcpi_ca_state }% +0 %{ assert tcpi_snd_cwnd == 3, tcpi_snd_cwnd }% +0 %{ assert tcpi_snd_ssthresh == 7, tcpi_snd_ssthresh }% +.002 < . 1:1(0) ack 10001 win 257 +0 %{ assert tcpi_ca_state == TCP_CA_Open, tcpi_ca_state }% +0 %{ assert tcpi_snd_cwnd == 18, tcpi_snd_cwnd }% +0 %{ assert tcpi_snd_ssthresh == 2147483647, tcpi_snd_ssthresh }% +0 > . 13001:15001(2000) ack 1 --- neal
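The flag logic in Ilpo's patch can be condensed into a small model: in the non-SACK (Reno) case, a cumulative ACK that covers any retransmitted data cannot prove the original transmissions were delivered, so FLAG_ORIG_SACK_ACKED is cleared and no FRTO undo is allowed on that ACK. The flag values below are illustrative, not the kernel's actual bit assignments:

```c
#include <assert.h>

#define FLAG_ORIG_SACK_ACKED    0x1 /* original transmission was (apparently) delivered */
#define FLAG_RETRANS_DATA_ACKED 0x2 /* the cumulative ACK covered retransmitted data */

/* Model of the added check in tcp_clean_rtx_queue(): without SACK, keep
 * FLAG_ORIG_SACK_ACKED only if no retransmitted segment was cumulatively
 * ACKed, since there is no TCPCB_SACKED_ACKED scoreboard to consult. */
static int clean_rtx_flags(int flag, int is_reno)
{
	if (is_reno && (flag & FLAG_RETRANS_DATA_ACKED))
		flag &= ~FLAG_ORIG_SACK_ACKED;
	return flag;
}
```

As Neal's packetdrill test shows, this blocks the undo only on ACKs that cover a retransmit; a later cumulative ACK covering only original data can still trigger the undo, which is why disabling FRTO entirely for non-SACK flows was floated as an alternative.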
Re: [PATCH net-next v2] tcp: force cwnd at least 2 in tcp_cwnd_reduction
On Thu, Jun 28, 2018 at 4:20 PM Lawrence Brakmo wrote: > > I just looked at 4.18 traces and the behavior is as follows: > >Host A sends the last packets of the request > >Host B receives them, and the last packet is marked with congestion (CE) > >Host B sends ACKs for packets not marked with congestion > >Host B sends data packet with reply and ACK for packet marked with > congestion (TCP flag ECE) > >Host A receives ACKs with no ECE flag > >Host A receives data packet with ACK for the last packet of request and > has TCP ECE bit set > >Host A sends 1st data packet of the next request with TCP flag CWR > >Host B receives the packet (as seen in tcpdump at B), no CE flag > >Host B sends a dup ACK that also has the TCP ECE flag > >Host A RTO timer fires! > >Host A to send the next packet > >Host A receives an ACK for everything it has sent (i.e. Host B did receive > 1st packet of request) > >Host A send more packets… Thanks, Larry! This is very interesting. I don't know the cause, but this reminds me of an issue Steve Ibanez raised on the netdev list last December, where he was seeing cases with DCTCP where a CWR packet would be received and buffered by Host B but not ACKed by Host B. This was the thread "Re: Linux ECN Handling", starting around December 5. I have cc-ed Steve. I wonder if this may somehow be related to the DCTCP logic to rewind tp->rcv_nxt and call tcp_send_ack(), and then restore tp->rcv_nxt, if DCTCP notices that the incoming CE bits have been changed while the receiver thinks it is holding on to a delayed ACK (in dctcp_ce_state_0_to_1() and dctcp_ce_state_1_to_0()). I wonder if the "synthetic" call to tcp_send_ack() somehow has side effects in the delayed ACK state machine that can cause the connection to forget that it still needs to fire a delayed ACK, even though it just sent an ACK just now. neal
Re: [PATCH net] tcp: add one more quick ack after after ECN events
On Wed, Jun 27, 2018 at 11:47 AM Eric Dumazet wrote:
>
> Larry Brakmo proposal ( https://patchwork.ozlabs.org/patch/935233/
> tcp: force cwnd at least 2 in tcp_cwnd_reduction) made us rethink
> about our recent patch removing ~16 quick acks after ECN events.
>
> tcp_enter_quickack_mode(sk, 1) makes sure one immediate ack is sent,
> but in the case the sender cwnd was lowered to 1, we do not want
> to have a delayed ack for the next packet we will receive.
>
> Fixes: 522040ea5fdd ("tcp: do not aggressively quick ack after ECN events")
> Signed-off-by: Eric Dumazet
> Reported-by: Neal Cardwell
> Cc: Lawrence Brakmo
> ---
>  net/ipv4/tcp_input.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Acked-by: Neal Cardwell

Thanks, Eric!

neal
Re: [PATCH net-next v2] tcp: force cwnd at least 2 in tcp_cwnd_reduction
On Tue, Jun 26, 2018 at 10:34 PM Lawrence Brakmo wrote:
> The only issue is if it is safe to always use 2 or if it is better to
> use min(2, snd_ssthresh) (which could still trigger the problem).

Always using 2 SGTM. I don't think we need min(2, snd_ssthresh), as that should be the same as just 2, since:

(a) RFCs mandate ssthresh should not be below 2, e.g. https://tools.ietf.org/html/rfc5681 page 7:

      ssthresh = max (FlightSize / 2, 2*SMSS)            (4)

(b) The main loss-based CCs used in Linux (CUBIC, Reno, DCTCP) respect that constraint, and always have an ssthresh of at least 2.

And if some CC misbehaves and uses a lower ssthresh, then taking min(2, snd_ssthresh) would trigger problems, as you note.

> +       tp->snd_cwnd = max((int)tcp_packets_in_flight(tp) + sndcnt, 2);

AFAICT this does seem like it will make the sender behavior more aggressive in cases with high loss and/or a very low per-flow fair share.

Old:
o send N packets
o receive SACKs for last 3 packets
o fast retransmit packet 1
o using ACKs, slow-start upward

New:
o send N packets
o receive SACKs for last 3 packets
o fast retransmit packets 1 and 2
o using ACKs, slow-start upward

In the extreme case, if the available fair share is less than 2 packets, whereas inflight would have oscillated between 1 packet and 2 packets with the existing code, with this commit the inflight will now hover at 2. It seems like this would have significantly higher losses than we had with the existing code. This may or may not be OK in practice, but IMHO it is worth mentioning and discussing.

neal
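The behavioral difference under discussion reduces to a one-line floor on the result of the PRR computation. Two hypothetical helpers contrasting the old and proposed behavior (the real change is inside tcp_cwnd_reduction(); these just isolate the arithmetic):

```c
#include <assert.h>

/* Old behavior: cwnd = in_flight + sndcnt, which can reach 1. */
static int cwnd_old(int in_flight, int sndcnt)
{
	return in_flight + sndcnt;
}

/* Proposed behavior: never let the resulting cwnd drop below 2,
 * so a second dupack-eliciting packet can always be in flight. */
static int cwnd_new(int in_flight, int sndcnt)
{
	int cwnd = in_flight + sndcnt;

	return cwnd > 2 ? cwnd : 2;
}
```

The "more aggressive" concern above is exactly the first case: where the old code could let cwnd collapse to 1, the new code keeps 2 packets in flight.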
Re: [PATCH net-next] tcp: remove one indentation level in tcp_create_openreq_child
On Tue, Jun 26, 2018 at 11:46 AM Eric Dumazet wrote:
>
> Signed-off-by: Eric Dumazet
> ---
>  net/ipv4/tcp_minisocks.c | 223 ---
>  1 file changed, 113 insertions(+), 110 deletions(-)

Yes, very nice clean-up! Thanks for doing this.

Acked-by: Neal Cardwell

neal
Re: [PATCH net] sctp: not allow to set rto_min with a value below 200 msecs
On Tue, May 29, 2018 at 11:45 AM Marcelo Ricardo Leitner <marcelo.leit...@gmail.com> wrote:
> - patch2 - fix rtx attack vector
>   - Add the floor value to rto_min to HZ/20 (which fits the values
>     that Michael shared on the other email)

I would encourage allowing minimum RTO values down to 5ms, if the ACK policy in the receiver makes this feasible. Our experience is that in datacenter environments it can be advantageous to allow timer-based loss recoveries using timeout values as low as 5ms, e.g.:

https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf
https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00

cheers,
neal
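The tension here is concrete: a floor of HZ/20 jiffies is 50 ms at HZ=1000, an order of magnitude above the 5 ms recoveries Neal describes. A hypothetical clamp helper showing what the proposed floor would do to such a request (HZ and the helper name are assumptions for illustration):

```c
#include <assert.h>

#define HZ 1000                      /* assume 1000 jiffies per second */
#define RTO_MIN_FLOOR (HZ / 20)      /* proposed floor: 50 jiffies = 50 ms */

/* Hypothetical sketch of flooring a requested rto_min (in jiffies;
 * with HZ=1000, jiffies and milliseconds coincide). */
static int clamp_rto_min(int requested)
{
	return requested < RTO_MIN_FLOOR ? RTO_MIN_FLOOR : requested;
}
```

Under this floor, a datacenter operator asking for a 5 ms minimum RTO would silently get 50 ms, which is the behavior the reply argues against.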
Re: [PATCH net-next 1/2] tcp: add max_quickacks param to tcp_incr_quickack and tcp_enter_quickack_mode
On Tue, May 22, 2018 at 8:31 PM kbuild test robot wrote:
> Hi Eric,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on net/master]
> [also build test ERROR on v4.17-rc6 next-20180517]
> [cannot apply to net-next/master]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url: https://github.com/0day-ci/linux/commits/Eric-Dumazet/tcp-add-max_quickacks-param-to-tcp_incr_quickack-and-tcp_enter_quickack_mode/20180523-075103
> config: i386-randconfig-x012-201820 (attached as .config)
> compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386
>
> All errors (new ones prefixed by >>):
>
>    net//ipv4/tcp_input.c: In function 'tcp_data_queue':
> >> net//ipv4/tcp_input.c:4656:2: error: too few arguments to function 'tcp_enter_quickack_mode'
>      tcp_enter_quickack_mode(sk);
>      ^~~~~~~~~~~~~~~~~~~~~~~
>    net//ipv4/tcp_input.c:199:13: note: declared here
>     static void tcp_enter_quickack_mode(struct sock *sk, unsigned int max_quickacks)
>                 ^~~~~~~~~~~~~~~~~~~~~~~
...

For the record, this is an error in the tool, rather than the patch. The tool seems to be using a stale net-next tree for building this patch. The compile error is at line 4656:

> ^1da177e4 Linus Torvalds            2005-04-16  4652   /* Out of window. F.e. zero window probe. */
> ^1da177e4 Linus Torvalds            2005-04-16  4653   if (!before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt + tcp_receive_window(tp)))
> ^1da177e4 Linus Torvalds            2005-04-16  4654           goto out_of_window;
> ^1da177e4 Linus Torvalds            2005-04-16  4655
> 463c84b97 Arnaldo Carvalho de Melo  2005-08-09 @4656   tcp_enter_quickack_mode(sk);
> ^1da177e4 Linus Torvalds            2005-04-16  4657
> ^1da177e4 Linus Torvalds            2005-04-16  4658   if (before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
> ^1da177e4 Linus Torvalds            2005-04-16  4659           /* Partial packet, seq < rcv_next < end_seq */
...
But that line is not in net-next any more, after Eric's recent net-next commit a3893637e1eb0e ("tcp: do not force quickack when receiving out-of-order packets"):

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/net/ipv4/tcp_input.c?id=a3893637e1eb0ef5eb1bbc52b3a8d2dfa317a35d

That commit removed that line:

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 0bf032839548f..f5622b2506651 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4715,8 +4715,6 @@ drop:
 		if (!before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt + tcp_receive_window(tp)))
 			goto out_of_window;
 
-		tcp_enter_quickack_mode(sk);
-
 		if (before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
 			/* Partial packet, seq < rcv_next < end_seq */
 			SOCK_DEBUG(sk, "partial packet: rcv_next %X seq %X - %X\n",

cheers,
neal
Re: [PATCH net-next 2/2] tcp: do not aggressively quick ack after ECN events
On Mon, May 21, 2018 at 6:09 PM Eric Dumazet wrote:
> ECN signals currently forces TCP to enter quickack mode for
> up to 16 (TCP_MAX_QUICKACKS) following incoming packets.
>
> We believe this is not needed, and only sending one immediate ack
> for the current packet should be enough.
>
> This should reduce the extra load noticed in DCTCP environments,
> after congestion events.
>
> This is part 2 of our effort to reduce pure ACK packets.
>
> Signed-off-by: Eric Dumazet
> ---

Acked-by: Neal Cardwell

Thanks!

neal
Re: [PATCH net-next 1/2] tcp: add max_quickacks param to tcp_incr_quickack and tcp_enter_quickack_mode
On Mon, May 21, 2018 at 6:09 PM Eric Dumazet wrote:
> We want to add finer control of the number of ACK packets sent after
> ECN events.
>
> This patch is not changing current behavior, it only enables following
> change.
>
> Signed-off-by: Eric Dumazet
> ---

Acked-by: Neal Cardwell

Thanks!

neal
Re: [PATCH v3 net-next 6/6] tcp: add tcp_comp_sack_nr sysctl
On Thu, May 17, 2018 at 5:47 PM Eric Dumazet wrote:
> This per netns sysctl allows for TCP SACK compression fine-tuning.
>
> This limits number of SACK that can be compressed.
> Using 0 disables SACK compression.
>
> Signed-off-by: Eric Dumazet
> ---

Acked-by: Neal Cardwell

Thanks!

neal
Re: [PATCH v3 net-next 5/6] tcp: add tcp_comp_sack_delay_ns sysctl
On Thu, May 17, 2018 at 5:47 PM Eric Dumazet wrote:
> This per netns sysctl allows for TCP SACK compression fine-tuning.
>
> Its default value is 1,000,000, or 1 ms to meet TSO autosizing period.
>
> Signed-off-by: Eric Dumazet
> ---

Acked-by: Neal Cardwell

Thanks!

neal
Re: [PATCH v3 net-next 3/6] tcp: add SACK compression
On Thu, May 17, 2018 at 5:47 PM Eric Dumazet wrote:
> When TCP receives an out-of-order packet, it immediately sends
> a SACK packet, generating network load but also forcing the
> receiver to send 1-MSS pathological packets, increasing its
> RTX queue length/depth, and thus processing time.
>
> Wifi networks suffer from this aggressive behavior, but generally
> speaking, all these SACK packets add fuel to the fire when networks
> are under congestion.
>
> This patch adds a high resolution timer and tp->compressed_ack counter.
>
> Instead of sending a SACK, we program this timer with a small delay,
> based on RTT and capped to 1 ms :
>
>	delay = min ( 5 % of RTT, 1 ms)
>
> If subsequent SACKs need to be sent while the timer has not yet
> expired, we simply increment tp->compressed_ack.
>
> When timer expires, a SACK is sent with the latest information.
> Whenever an ACK is sent (if data is sent, or if in-order
> data is received) timer is canceled.
>
> Note that tcp_sack_new_ofo_skb() is able to force a SACK to be sent
> if the sack blocks need to be shuffled, even if the timer has not
> expired.
>
> A new SNMP counter is added in the following patch.
>
> Two other patches add sysctls to allow changing the 1,000,000 and 44
> values that this commit hard-coded.
>
> Signed-off-by: Eric Dumazet
> ---

Very nice. I like the constants and the min(rcv_rtt, srtt).

Acked-by: Neal Cardwell

Thanks!

neal
Re: [PATCH net-next 3/4] tcp: add SACK compression
On Thu, May 17, 2018 at 11:40 AM Eric Dumazet wrote:
> On 05/17/2018 08:14 AM, Neal Cardwell wrote:
> > Is there a particular motivation for the cap of 127? IMHO 127 ACKs is quite
> > a few to compress. Experience seems to show that it works well to have one
> > GRO ACK for ~64KBytes that triggers a single TSO skb of ~64KBytes. It might
> > be nice to try to match those dynamics in this SACK compression case, so it
> > might be nice to cap the number of compressed ACKs at something like 44?
> > (0xffff / 1448 - 1). That way for high-speed paths we could try to keep
> > the ACK clock going with ACKs for ~64KBytes that trigger a single TSO skb
> > of ~64KBytes, no matter whether we are sending SACKs or cumulative ACKs.
>
> 127 was chosen because the field is u8, and since skb allocation for the ACK
> can fail, we could have cases where the field goes above 127.
>
> Ultimately, I believe a followup patch would add a sysctl, so that we can fine-tune
> this, and eventually disable ACK compression if this sysctl is set to 0

OK, a sysctl sounds good. I would still vote for a default of 44. :-)

> >> +	if (hrtimer_is_queued(&tp->compressed_ack_timer))
> >> +		return;
> >> +
> >> +	/* compress ack timer : 5 % of srtt, but no more than 2.5 ms */
> >> +
> >> +	delay = min_t(unsigned long, 2500 * NSEC_PER_USEC,
> >> +		      tp->rcv_rtt_est.rtt_us * (NSEC_PER_USEC >> 3)/20);
> >
> > Any particular motivation for the 2.5ms here? It might be nice to match the
> > existing TSO autosizing dynamics and use 1ms here instead of having a
> > separate new constant of 2.5ms. Smaller time scales here should lead to
> > less burstiness and queue pressure from data packets in the network, and we
> > know from experience that the CPU overhead of 1ms chunks is acceptable.
>
> This came from my tests on wifi really :)
>
> I also had the idea to make this threshold adjustable for wifi, like we did for sk_pacing_shift.
> (On wifi, we might want to increase the max delay between ACKs)
>
> So maybe use 1ms delay, when sk_pacing_shift == 10, but increase it if sk_pacing_shift has been lowered.

Sounds good to me. Thanks for implementing this! Overall this patch seems nice to me.

Acked-by: Neal Cardwell

BTW, I guess we should spread the word to maintainers of other major TCP stacks that they need to be prepared for what may be a much higher degree of compression/aggregation in the SACK stream. Linux stacks going back many years should be fine with this, but I'm not sure about the other major OSes (they may only allow sending one MSS per ACK-with-SACKs received).

neal
Re: [PATCH net-next 3/4] tcp: add SACK compression
On Thu, May 17, 2018 at 8:12 AM Eric Dumazet wrote:
> When TCP receives an out-of-order packet, it immediately sends
> a SACK packet, generating network load but also forcing the
> receiver to send 1-MSS pathological packets, increasing its
> RTX queue length/depth, and thus processing time.
>
> Wifi networks suffer from this aggressive behavior, but generally
> speaking, all these SACK packets add fuel to the fire when networks
> are under congestion.
>
> This patch adds a high resolution timer and tp->compressed_ack counter.
>
> Instead of sending a SACK, we program this timer with a small delay,
> based on SRTT and capped to 2.5 ms : delay = min ( 5 % of SRTT, 2.5 ms)
...

Very nice. Thanks for implementing this, Eric! I was wondering if the constants here might be worth some discussion/elaboration.

> @@ -5074,6 +5076,7 @@ static inline void tcp_data_snd_check(struct sock *sk)
>  static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
>  {
>  	struct tcp_sock *tp = tcp_sk(sk);
> +	unsigned long delay;
>
>  	/* More than one full frame received... */
>  	if (((tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss &&
> @@ -5085,15 +5088,31 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
>  	    (tp->rcv_nxt - tp->copied_seq < sk->sk_rcvlowat ||
>  	     __tcp_select_window(sk) >= tp->rcv_wnd)) ||
>  	    /* We ACK each frame or... */
> -	    tcp_in_quickack_mode(sk) ||
> -	    /* We have out of order data. */
> -	    (ofo_possible && !RB_EMPTY_ROOT(&tp->out_of_order_queue))) {
> -		/* Then ack it now */
> +	    tcp_in_quickack_mode(sk)) {
> +send_now:
>  		tcp_send_ack(sk);
> -	} else {
> -		/* Else, send delayed ack. */
> +		return;
> +	}
> +
> +	if (!ofo_possible || RB_EMPTY_ROOT(&tp->out_of_order_queue)) {
>  		tcp_send_delayed_ack(sk);
> +		return;
>  	}
> +
> +	if (!tcp_is_sack(tp) || tp->compressed_ack >= 127)
> +		goto send_now;
> +	tp->compressed_ack++;

Is there a particular motivation for the cap of 127? IMHO 127 ACKs is quite a few to compress. Experience seems to show that it works well to have one GRO ACK for ~64KBytes that triggers a single TSO skb of ~64KBytes. It might be nice to try to match those dynamics in this SACK compression case, so it might be nice to cap the number of compressed ACKs at something like 44? (0xffff / 1448 - 1). That way for high-speed paths we could try to keep the ACK clock going with ACKs for ~64KBytes that trigger a single TSO skb of ~64KBytes, no matter whether we are sending SACKs or cumulative ACKs.

> +
> +	if (hrtimer_is_queued(&tp->compressed_ack_timer))
> +		return;
> +
> +	/* compress ack timer : 5 % of srtt, but no more than 2.5 ms */
> +
> +	delay = min_t(unsigned long, 2500 * NSEC_PER_USEC,
> +		      tp->rcv_rtt_est.rtt_us * (NSEC_PER_USEC >> 3)/20);

Any particular motivation for the 2.5ms here? It might be nice to match the existing TSO autosizing dynamics and use 1ms here instead of having a separate new constant of 2.5ms. Smaller time scales here should lead to less burstiness and queue pressure from data packets in the network, and we know from experience that the CPU overhead of 1ms chunks is acceptable.

thanks,
neal
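The quoted delay computation can be checked numerically. A sketch, under the assumption that rcv_rtt_est.rtt_us is stored left-shifted by 3 (as srtt is in the kernel), so that (NSEC_PER_USEC >> 3) both undoes the shift and converts microseconds to nanoseconds, and /20 takes 5%:

```c
#include <assert.h>

#define NSEC_PER_USEC 1000UL

/* Sketch of the quoted min_t() expression: 5% of the RTT, capped at
 * 2.5 ms, returned in nanoseconds. rtt_us_shifted carries the usec
 * RTT left-shifted by 3, matching tp->rcv_rtt_est.rtt_us. */
static unsigned long compressed_ack_delay_ns(unsigned long rtt_us_shifted)
{
	unsigned long delay = rtt_us_shifted * (NSEC_PER_USEC >> 3) / 20;
	unsigned long cap = 2500 * NSEC_PER_USEC;   /* 2.5 ms */

	return delay < cap ? delay : cap;
}
```

For a 10 ms RTT this yields 500 us (5%); any RTT above 50 ms hits the 2.5 ms cap. The proposed cap of 44 compressed ACKs likewise falls out of 0xffff / 1448 - 1.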
Re: [PATCH net-next 2/4] tcp: do not force quickack when receiving out-of-order packets
On Thu, May 17, 2018 at 8:12 AM Eric Dumazet wrote:
> As explained in commit 9f9843a751d0 ("tcp: properly handle stretch
> acks in slow start"), TCP stacks have to consider how many packets
> are acknowledged in one single ACK, because of GRO, but also
> because of ACK compression or losses.
>
> We plan to add SACK compression in the following patch, we
> must therefore not call tcp_enter_quickack_mode()
>
> Signed-off-by: Eric Dumazet
> ---

Acked-by: Neal Cardwell

Thanks!

neal
Re: [PATCH net-next 4/4] tcp: add TCPAckCompressed SNMP counter
On Thu, May 17, 2018 at 8:12 AM Eric Dumazet wrote:
> This counter tracks number of ACK packets that the host has not sent,
> thanks to ACK compression.
>
> Sample output :
>
> $ nstat -n;sleep 1;nstat|egrep "IpInReceives|IpOutRequests|TcpInSegs|TcpOutSegs|TcpExtTCPAckCompressed"
> IpInReceives                    123250             0.0
> IpOutRequests                   3684               0.0
> TcpInSegs                       123251             0.0
> TcpOutSegs                      3684               0.0
> TcpExtTCPAckCompressed          119252             0.0
>
> Signed-off-by: Eric Dumazet
> ---

Acked-by: Neal Cardwell

Thanks!

neal
Re: [PATCH net-next 1/4] tcp: use __sock_put() instead of sock_put() in tcp_clear_xmit_timers()
On Thu, May 17, 2018 at 8:12 AM Eric Dumazet wrote:
> Socket can not disappear under us.
>
> Signed-off-by: Eric Dumazet
> ---

Acked-by: Neal Cardwell

Thanks!

neal
Re: [PATCH net] tcp: purge write queue in tcp_connect_init()
On Tue, May 15, 2018 at 12:14 AM Eric Dumazet wrote:
> syzkaller found a reliable way to crash the host, hitting a BUG()
> in __tcp_retransmit_skb()
>
> Malicious MSG_FASTOPEN is the root cause. We need to purge write queue
> in tcp_connect_init() at the point we init snd_una/write_seq.
>
> This patch also replaces the BUG() by a less intrusive WARN_ON_ONCE()
>
> kernel BUG at net/ipv4/tcp_output.c:2837!
...
> Fixes: cf60af03ca4e ("net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN)")
> Signed-off-by: Eric Dumazet
> Cc: Yuchung Cheng
> Cc: Neal Cardwell
> Reported-by: syzbot
> ---

Acked-by: Neal Cardwell

Thanks, Eric!

neal
Re: [PATCH net] tcp: restore autocorking
On Wed, May 2, 2018 at 11:25 PM Eric Dumazet wrote:
> When adding rb-tree for TCP retransmit queue, we inadvertently broke
> TCP autocorking.
>
> tcp_should_autocork() should really check if the rtx queue is not empty.
...
> Fixes: 75c119afe14f ("tcp: implement rb-tree based retransmit queue")
> Signed-off-by: Eric Dumazet
> Reported-by: Michael Wenig
> Tested-by: Michael Wenig
> ---

Acked-by: Neal Cardwell

Nice. Thanks, Eric!

neal
[PATCH net] tcp_bbr: fix to zero idle_restart only upon S/ACKed data

Previously the bbr->idle_restart tracking was zeroing out the
bbr->idle_restart bit upon ACKs that did not SACK or ACK anything,
e.g. receiving incoming data or receiver window updates. In such
situations BBR would forget that this was a restart-from-idle
situation, and if the min_rtt had expired it would unnecessarily enter
PROBE_RTT (even though we were actually restarting from idle but had
merely forgotten that fact).

The fix is simple: we need to remember we are restarting from idle
until we receive a S/ACK for some data (a S/ACK for the first flight
of data we send as we are restarting).

This commit is a stable candidate for kernels back as far as 4.9.

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Soheil Hassas Yeganeh
Signed-off-by: Priyaranjan Jha
Signed-off-by: Yousuk Seung
---
 net/ipv4/tcp_bbr.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 158d105e76da1..58e2f479ffb4d 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -806,7 +806,9 @@ static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs)
 			}
 		}
 	}
-	bbr->idle_restart = 0;
+	/* Restart after idle ends only once we process a new S/ACK for data */
+	if (rs->delivered > 0)
+		bbr->idle_restart = 0;
 }
 
 static void bbr_update_model(struct sock *sk, const struct rate_sample *rs)
-- 
2.17.0.441.gb46fe60e1d-goog
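The gating condition in the diff above can be sketched in isolation. A hedged simplification (the struct and helper name are illustrative; in the kernel this is the tail of bbr_update_min_rtt()):

```c
#include <assert.h>

struct bbr_state { int idle_restart; };

/* Sketch of the fix: clear idle_restart only when the incoming ACK
 * actually S/ACKed data (rs->delivered > 0), not on pure window
 * updates or incoming data, which deliver nothing. */
static void maybe_clear_idle_restart(struct bbr_state *bbr, int delivered)
{
	if (delivered > 0)
		bbr->idle_restart = 0;
}
```

With the old unconditional clear, a window update arriving right after idle would drop the flag and could push BBR into an unnecessary PROBE_RTT; with the gate, the flag survives until the first post-idle flight is S/ACKed.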
Previously the bbr->idle_restart tracking was zeroing out the bbr->idle_restart bit upon ACKs that did not SACK or ACK anything, e.g. receiving incoming data or receiver window updates. In such situations BBR would forget that this was a restart-from-idle situation, and if the min_rtt had expired it would unnecessarily enter PROBE_RTT (even though we were actually restarting from idle but had merely forgotten that fact). The fix is simple: we need to remember we are restarting from idle until we receive a S/ACK for some data (a S/ACK for the first flight of data we send as we are restarting). This commit is a stable candidate for kernels back as far as 4.9. Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control") Signed-off-by: Neal Cardwell Signed-off-by: Yuchung Cheng Signed-off-by: Soheil Hassas Yeganeh Signed-off-by: Priyaranjan Jha Signed-off-by: Yousuk Seung --- net/ipv4/tcp_bbr.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c index 158d105e76da1..58e2f479ffb4d 100644 --- a/net/ipv4/tcp_bbr.c +++ b/net/ipv4/tcp_bbr.c @@ -806,7 +806,9 @@ static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs) } } } - bbr->idle_restart = 0; + /* Restart after idle ends only once we process a new S/ACK for data */ + if (rs->delivered > 0) + bbr->idle_restart = 0; } static void bbr_update_model(struct sock *sk, const struct rate_sample *rs) -- 2.17.0.441.gb46fe60e1d-goog