[PATCH net-next] tcp: shrink inet_connection_sock icsk_mtup enabled and probe_size
From: Neal Cardwell

This commit shrinks inet_connection_sock by 4 bytes, by shrinking
icsk_mtup.enabled from 32 bits to 1 bit, and shrinking
icsk_mtup.probe_size from s32 to an unsigned 31 bit field.

This is to save space to compensate for the recent introduction of a
new u32 in inet_connection_sock, icsk_probes_tstamp, in the recent
bug fix commit 9d9b1ee0b2d1 ("tcp: fix TCP_USER_TIMEOUT with zero
window").

This should not change functionality, since icsk_mtup.enabled is only
ever set to 0 or 1, and icsk_mtup.probe_size can only be either 0 or a
positive MTU value returned by tcp_mss_to_mtu().

Signed-off-by: Neal Cardwell
Signed-off-by: Eric Dumazet
---
 include/net/inet_connection_sock.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index c11f80f328f1..10a625760de9 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -120,14 +120,14 @@ struct inet_connection_sock {
 		__u16		  rcv_mss;	/* MSS used for delayed ACK decisions */
 	} icsk_ack;
 	struct {
-		int		  enabled;
-
 		/* Range of MTUs to search */
 		int		  search_high;
 		int		  search_low;
 
 		/* Information on the current probe. */
-		int		  probe_size;
+		u32		  probe_size:31,
+		/* Is the MTUP feature enabled for this connection? */
+				  enabled:1;
 
 		u32		  probe_timestamp;
 	} icsk_mtup;
-- 
2.30.0.365.g02bc693789-goog
[PATCH net] tcp: fix cwnd-limited bug for TSO deferral where we send nothing
From: Neal Cardwell

When cwnd is not a multiple of the TSO skb size of N*MSS, we can get
into persistent scenarios where we have the following sequence:

(1) ACK for full-sized skb of N*MSS arrives
  -> tcp_write_xmit() transmit full-sized skb with N*MSS
  -> move pacing release time forward
  -> exit tcp_write_xmit() because pacing time is in the future

(2) TSQ callback or TCP internal pacing timer fires
  -> try to transmit next skb, but TSO deferral finds remainder of
     available cwnd is not big enough to trigger an immediate send
     now, so we defer sending until the next ACK.

(3) repeat...

So we can get into a case where we never mark ourselves as
cwnd-limited for many seconds at a time, even with bulk/infinite-backlog
senders, because:

o In case (1) above, every time in tcp_write_xmit() we have enough
cwnd to send a full-sized skb, we are not fully using the cwnd
(because cwnd is not a multiple of the TSO skb size). So every time we
send data, we are not cwnd limited, and so in the cwnd-limited
tracking code in tcp_cwnd_validate() we mark ourselves as not
cwnd-limited.

o In case (2) above, every time in tcp_write_xmit() that we try to
transmit the "remainder" of the cwnd but defer, we set the local
variable is_cwnd_limited to true, but we do not send any packets, so
sent_pkts is zero, so we don't call the cwnd-limited logic to update
tp->is_cwnd_limited.

Fixes: ca8a22634381 ("tcp: make cwnd-limited checks measurement-based, and gentler")
Reported-by: Ingemar Johansson
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Acked-by: Soheil Hassas Yeganeh
Signed-off-by: Eric Dumazet
---
 net/ipv4/tcp_output.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index bf48cd73e967..99011768c264 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1880,7 +1880,8 @@ static void tcp_cwnd_validate(struct sock *sk, bool is_cwnd_limited)
 	 * window, and remember whether we were cwnd-limited then.
 	 */
 	if (!before(tp->snd_una, tp->max_packets_seq) ||
-	    tp->packets_out > tp->max_packets_out) {
+	    tp->packets_out > tp->max_packets_out ||
+	    is_cwnd_limited) {
 		tp->max_packets_out = tp->packets_out;
 		tp->max_packets_seq = tp->snd_nxt;
 		tp->is_cwnd_limited = is_cwnd_limited;
@@ -2702,6 +2703,10 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 	else
 		tcp_chrono_stop(sk, TCP_CHRONO_RWND_LIMITED);
 
+	is_cwnd_limited |= (tcp_packets_in_flight(tp) >= tp->snd_cwnd);
+	if (likely(sent_pkts || is_cwnd_limited))
+		tcp_cwnd_validate(sk, is_cwnd_limited);
+
 	if (likely(sent_pkts)) {
 		if (tcp_in_cwnd_reduction(sk))
 			tp->prr_out += sent_pkts;
@@ -2709,8 +2714,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		/* Send one loss probe per tail loss episode. */
 		if (push_one != 2)
 			tcp_schedule_loss_probe(sk, false);
-		is_cwnd_limited |= (tcp_packets_in_flight(tp) >= tp->snd_cwnd);
-		tcp_cwnd_validate(sk, is_cwnd_limited);
 		return false;
 	}
 	return !tp->packets_out && !tcp_write_queue_empty(sk);
-- 
2.29.2.576.ga3fc446d84-goog
[PATCH net] tcp: fix to update snd_wl1 in bulk receiver fast path
From: Neal Cardwell

In the header prediction fast path for a bulk data receiver, if no
data is newly acknowledged then we do not call tcp_ack() and do not
call tcp_ack_update_window(). This means that a bulk receiver that
receives large amounts of data can have the incoming sequence numbers
wrap, so that the check in tcp_may_update_window fails:

   after(ack_seq, tp->snd_wl1)

If the incoming receive windows are zero in this state, and then the
connection that was a bulk data receiver later wants to send data,
that connection can find itself persistently rejecting the window
updates in incoming ACKs. This means the connection can persistently
fail to discover that the receive window has opened, which in turn
means that the connection is unable to send anything, and the
connection's sending process can get permanently "stuck".

The fix is to update snd_wl1 in the header prediction fast path for a
bulk data receiver, so that it keeps up and does not see wrapping
problems.

This fix is based on a very nice and thorough analysis and diagnosis
by Apollon Oikonomopoulos (see link below).

This is a stable candidate but there is no Fixes tag here since the
bug predates current git history. Just for fun: looks like the bug
dates back to when header prediction was added in Linux v2.1.8 in Nov
1996. In that version tcp_rcv_established() was added, and the code
only updates snd_wl1 in tcp_ack(), and in the new "Bulk data transfer:
receiver" code path it does not call tcp_ack(). This fix seems to
apply cleanly at least as far back as v3.2.

Signed-off-by: Neal Cardwell
Reported-by: Apollon Oikonomopoulos
Tested-by: Apollon Oikonomopoulos
Link: https://www.spinics.net/lists/netdev/msg692430.html
Acked-by: Soheil Hassas Yeganeh
Acked-by: Yuchung Cheng
Signed-off-by: Eric Dumazet
---
 net/ipv4/tcp_input.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 67f10d3ec240..fc445833b5e5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5827,6 +5827,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 				tcp_data_snd_check(sk);
 				if (!inet_csk_ack_scheduled(sk))
 					goto no_ack;
+			} else {
+				tcp_update_wl(tp, TCP_SKB_CB(skb)->seq);
 			}
 
 			__tcp_ack_snd_check(sk, 0);
-- 
2.29.0.rc1.297.gfa9743e501-goog
[PATCH bpf-next v3 5/5] tcp: simplify tcp_set_congestion_control() load=false case
From: Neal Cardwell

Simplify tcp_set_congestion_control() by removing the initialization
code path for the !load case.

There are only two call sites for tcp_set_congestion_control(). The
EBPF call site is the only one that passes load=false; it also passes
cap_net_admin=true. Because of that, the exact same behavior can be
achieved by removing the special if (!load) branch of the logic. Both
before and after this commit, the EBPF case will call
bpf_try_module_get(); if that succeeds it then calls
tcp_reinit_congestion_control(), and if that fails it returns -EBUSY.

Note that this returns the logic to a structure very similar to the
structure of the logic before commit 91b5b21c7c16 ("bpf: Add support
for changing congestion control"), except that the CAP_NET_ADMIN
status is passed in as a function argument.

This clean-up was suggested by Martin KaFai Lau.

Signed-off-by: Neal Cardwell
Suggested-by: Martin KaFai Lau
Cc: Lawrence Brakmo
Cc: Eric Dumazet
Cc: Yuchung Cheng
Cc: Kevin Yang
---
 net/ipv4/tcp_cong.c | 15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index a9b0fb52a1ec..db47ac24d057 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -362,21 +362,14 @@ int tcp_set_congestion_control(struct sock *sk, const char *name, bool load,
 		goto out;
 	}
 
-	if (!ca) {
+	if (!ca)
 		err = -ENOENT;
-	} else if (!load) {
-		if (bpf_try_module_get(ca, ca->owner)) {
-			tcp_reinit_congestion_control(sk, ca);
-		} else {
-			err = -EBUSY;
-		}
-	} else if (!((ca->flags & TCP_CONG_NON_RESTRICTED) || cap_net_admin)) {
+	else if (!((ca->flags & TCP_CONG_NON_RESTRICTED) || cap_net_admin))
 		err = -EPERM;
-	} else if (!bpf_try_module_get(ca, ca->owner)) {
+	else if (!bpf_try_module_get(ca, ca->owner))
 		err = -EBUSY;
-	} else {
+	else
 		tcp_reinit_congestion_control(sk, ca);
-	}
 out:
 	rcu_read_unlock();
 	return err;
-- 
2.28.0.618.gf4bc123cb7-goog
[PATCH bpf-next v3 0/5] tcp: increase flexibility of EBPF congestion control initialization
From: Neal Cardwell

This patch series reorganizes TCP congestion control initialization
so that if EBPF code called by tcp_init_transfer() sets the congestion
control algorithm by calling setsockopt(TCP_CONGESTION) then the TCP
stack initializes the congestion control module immediately, instead
of having tcp_init_transfer() later initialize the congestion control
module.

This increases flexibility for the EBPF code that runs at connection
establishment time, and simplifies the code.

This has the following benefits:

(1) This allows CC module customizations made by the EBPF called in
    tcp_init_transfer() to persist, and not be wiped out by a later
    call to tcp_init_congestion_control() in tcp_init_transfer().

(2) Does not flip the order of EBPF and CC init, to avoid causing
    bugs for existing code upstream that depends on the current order.

(3) Does not cause 2 initializations for CC in the case where the
    EBPF called in tcp_init_transfer() wants to set the CC to a new
    CC algorithm.

(4) Allows follow-on simplifications to the code in net/core/filter.c
    and net/ipv4/tcp_cong.c, which currently both have some complexity
    to special-case CC initialization to avoid double CC
    initialization if EBPF sets the CC.

changes in v2:
o rebase onto bpf-next
o add another follow-on simplification suggested by Martin KaFai Lau:
  "tcp: simplify tcp_set_congestion_control() load=false case"

changes in v3:
o no change in commits
o resent patch series from @gmail.com, since mail from
  ncardw...@google.com stopped being accepted at netdev@vger.kernel.org
  mid-way through processing the v2 patch series (between patches 2
  and 3), confusing patchwork about which patches belonged to the v2
  patch series

Neal Cardwell (5):
  tcp: only init congestion control if not initialized already
  tcp: simplify EBPF TCP_CONGESTION to always init CC
  tcp: simplify tcp_set_congestion_control(): always reinitialize
  tcp: simplify _bpf_setsockopt(): remove flags argument
  tcp: simplify tcp_set_congestion_control() load=false case

 include/net/inet_connection_sock.h |  3 ++-
 include/net/tcp.h                  |  2 +-
 net/core/filter.c                  | 18 ++++--------------
 net/ipv4/tcp.c                     |  3 ++-
 net/ipv4/tcp_cong.c                | 27 +++++++--------------------
 net/ipv4/tcp_input.c               |  4 +++-
 6 files changed, 19 insertions(+), 38 deletions(-)

-- 
2.28.0.618.gf4bc123cb7-goog
[PATCH bpf-next v3 4/5] tcp: simplify _bpf_setsockopt(): remove flags argument
From: Neal Cardwell

Now that the previous patches have removed the code that uses the
flags argument to _bpf_setsockopt(), we can remove that argument.

Signed-off-by: Neal Cardwell
Acked-by: Yuchung Cheng
Acked-by: Kevin Yang
Signed-off-by: Eric Dumazet
Cc: Lawrence Brakmo
---
 net/core/filter.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index e89d6d7da03c..d266c6941967 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4314,7 +4314,7 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
 };
 
 static int _bpf_setsockopt(struct sock *sk, int level, int optname,
-			   char *optval, int optlen, u32 flags)
+			   char *optval, int optlen)
 {
 	char devname[IFNAMSIZ];
 	int val, valbool;
@@ -4611,9 +4611,7 @@ static int _bpf_getsockopt(struct sock *sk, int level, int optname,
 BPF_CALL_5(bpf_sock_addr_setsockopt, struct bpf_sock_addr_kern *, ctx,
 	   int, level, int, optname, char *, optval, int, optlen)
 {
-	u32 flags = 0;
-	return _bpf_setsockopt(ctx->sk, level, optname, optval, optlen,
-			       flags);
+	return _bpf_setsockopt(ctx->sk, level, optname, optval, optlen);
 }
 
 static const struct bpf_func_proto bpf_sock_addr_setsockopt_proto = {
@@ -4647,9 +4645,7 @@ static const struct bpf_func_proto bpf_sock_addr_getsockopt_proto = {
 BPF_CALL_5(bpf_sock_ops_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 	   int, level, int, optname, char *, optval, int, optlen)
 {
-	u32 flags = 0;
-	return _bpf_setsockopt(bpf_sock->sk, level, optname, optval, optlen,
-			       flags);
+	return _bpf_setsockopt(bpf_sock->sk, level, optname, optval, optlen);
 }
 
 static const struct bpf_func_proto bpf_sock_ops_setsockopt_proto = {
-- 
2.28.0.618.gf4bc123cb7-goog
[PATCH bpf-next v3 2/5] tcp: simplify EBPF TCP_CONGESTION to always init CC
From: Neal Cardwell

Now that the previous patch ensures we don't initialize the congestion
control twice, when EBPF sets the congestion control algorithm at
connection establishment we can simplify the code by simply
initializing the congestion control module at that time.

Signed-off-by: Neal Cardwell
Acked-by: Yuchung Cheng
Acked-by: Kevin Yang
Signed-off-by: Eric Dumazet
Cc: Lawrence Brakmo
---
 net/core/filter.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 47eef9a0be6a..067f6759a68f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4313,8 +4313,6 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
 	.arg1_type      = ARG_PTR_TO_CTX,
 };
 
-#define SOCKOPT_CC_REINIT (1 << 0)
-
 static int _bpf_setsockopt(struct sock *sk, int level, int optname,
 			   char *optval, int optlen, u32 flags)
 {
@@ -4449,13 +4447,12 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
 		    sk->sk_prot->setsockopt == tcp_setsockopt) {
 			if (optname == TCP_CONGESTION) {
 				char name[TCP_CA_NAME_MAX];
-				bool reinit = flags & SOCKOPT_CC_REINIT;
 
 				strncpy(name, optval, min_t(long, optlen,
 							    TCP_CA_NAME_MAX-1));
 				name[TCP_CA_NAME_MAX-1] = 0;
 				ret = tcp_set_congestion_control(sk, name, false,
-								 reinit, true);
+								 true, true);
 			} else {
 				struct inet_connection_sock *icsk = inet_csk(sk);
 				struct tcp_sock *tp = tcp_sk(sk);
@@ -4652,8 +4649,6 @@ BPF_CALL_5(bpf_sock_ops_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 	   int, level, int, optname, char *, optval, int, optlen)
 {
 	u32 flags = 0;
-	if (bpf_sock->op > BPF_SOCK_OPS_NEEDS_ECN)
-		flags |= SOCKOPT_CC_REINIT;
 	return _bpf_setsockopt(bpf_sock->sk, level, optname, optval, optlen,
 			       flags);
 }
-- 
2.28.0.618.gf4bc123cb7-goog
[PATCH bpf-next v3 3/5] tcp: simplify tcp_set_congestion_control(): always reinitialize
From: Neal Cardwell

Now that the previous patches ensure that all call sites for
tcp_set_congestion_control() want to initialize congestion control, we
can simplify tcp_set_congestion_control() by removing the reinit
argument and the code to support it.

Signed-off-by: Neal Cardwell
Acked-by: Yuchung Cheng
Acked-by: Kevin Yang
Signed-off-by: Eric Dumazet
Cc: Lawrence Brakmo
---
 include/net/tcp.h   |  2 +-
 net/core/filter.c   |  3 +--
 net/ipv4/tcp.c      |  2 +-
 net/ipv4/tcp_cong.c | 11 ++---------
 4 files changed, 5 insertions(+), 13 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index e85d564446c6..f857146c17a5 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1104,7 +1104,7 @@ void tcp_get_available_congestion_control(char *buf, size_t len);
 void tcp_get_allowed_congestion_control(char *buf, size_t len);
 int tcp_set_allowed_congestion_control(char *allowed);
 int tcp_set_congestion_control(struct sock *sk, const char *name, bool load,
-			       bool reinit, bool cap_net_admin);
+			       bool cap_net_admin);
 u32 tcp_slow_start(struct tcp_sock *tp, u32 acked);
 void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w, u32 acked);
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 067f6759a68f..e89d6d7da03c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4451,8 +4451,7 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
 				strncpy(name, optval, min_t(long, optlen,
 							    TCP_CA_NAME_MAX-1));
 				name[TCP_CA_NAME_MAX-1] = 0;
-				ret = tcp_set_congestion_control(sk, name, false,
-								 true, true);
+				ret = tcp_set_congestion_control(sk, name, false, true);
 			} else {
 				struct inet_connection_sock *icsk = inet_csk(sk);
 				struct tcp_sock *tp = tcp_sk(sk);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 7360d3db2b61..e58ab9db73ff 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3050,7 +3050,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level, int optname,
 		name[val] = 0;
 
 		lock_sock(sk);
-		err = tcp_set_congestion_control(sk, name, true, true,
+		err = tcp_set_congestion_control(sk, name, true,
 						 ns_capable(sock_net(sk)->user_ns,
 							    CAP_NET_ADMIN));
 		release_sock(sk);
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index d18d7a1ce4ce..a9b0fb52a1ec 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -341,7 +341,7 @@ int tcp_set_allowed_congestion_control(char *val)
  * already initialized.
  */
 int tcp_set_congestion_control(struct sock *sk, const char *name, bool load,
-			       bool reinit, bool cap_net_admin)
+			       bool cap_net_admin)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	const struct tcp_congestion_ops *ca;
@@ -365,15 +365,8 @@ int tcp_set_congestion_control(struct sock *sk, const char *name, bool load,
 	if (!ca) {
 		err = -ENOENT;
 	} else if (!load) {
-		const struct tcp_congestion_ops *old_ca = icsk->icsk_ca_ops;
-
 		if (bpf_try_module_get(ca, ca->owner)) {
-			if (reinit) {
-				tcp_reinit_congestion_control(sk, ca);
-			} else {
-				icsk->icsk_ca_ops = ca;
-				bpf_module_put(old_ca, old_ca->owner);
-			}
+			tcp_reinit_congestion_control(sk, ca);
 		} else {
 			err = -EBUSY;
 		}
-- 
2.28.0.618.gf4bc123cb7-goog
[PATCH bpf-next v3 1/5] tcp: only init congestion control if not initialized already
From: Neal Cardwell

Change tcp_init_transfer() to only initialize congestion control if it
has not been initialized already.

With this new approach, we can arrange things so that if the EBPF code
sets the congestion control by calling setsockopt(TCP_CONGESTION) then
tcp_init_transfer() will not re-initialize the CC module.

This is an approach that has the following beneficial properties:

(1) This allows CC module customizations made by the EBPF called in
    tcp_init_transfer() to persist, and not be wiped out by a later
    call to tcp_init_congestion_control() in tcp_init_transfer().

(2) Does not flip the order of EBPF and CC init, to avoid causing
    bugs for existing code upstream that depends on the current order.

(3) Does not cause 2 initializations for CC in the case where the
    EBPF called in tcp_init_transfer() wants to set the CC to a new
    CC algorithm.

(4) Allows follow-on simplifications to the code in net/core/filter.c
    and net/ipv4/tcp_cong.c, which currently both have some complexity
    to special-case CC initialization to avoid double CC
    initialization if EBPF sets the CC.

Signed-off-by: Neal Cardwell
Acked-by: Yuchung Cheng
Acked-by: Kevin Yang
Signed-off-by: Eric Dumazet
Cc: Lawrence Brakmo
---
 include/net/inet_connection_sock.h | 3 ++-
 net/ipv4/tcp.c                     | 1 +
 net/ipv4/tcp_cong.c                | 3 ++-
 net/ipv4/tcp_input.c               | 4 +++-
 4 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index c738abeb3265..dc763ca9413c 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -96,7 +96,8 @@ struct inet_connection_sock {
 	void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq);
 	struct hlist_node	  icsk_listen_portaddr_node;
 	unsigned int		  (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
-	__u8			  icsk_ca_state:6,
+	__u8			  icsk_ca_state:5,
+				  icsk_ca_initialized:1,
 				  icsk_ca_setsockopt:1,
 				  icsk_ca_dst_locked:1;
 	__u8			  icsk_retransmits;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 57a568875539..7360d3db2b61 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2698,6 +2698,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 	if (icsk->icsk_ca_ops->release)
 		icsk->icsk_ca_ops->release(sk);
 	memset(icsk->icsk_ca_priv, 0, sizeof(icsk->icsk_ca_priv));
+	icsk->icsk_ca_initialized = 0;
 	tcp_set_ca_state(sk, TCP_CA_Open);
 	tp->is_sack_reneg = 0;
 	tcp_clear_retrans(tp);
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 62878cf26d9c..d18d7a1ce4ce 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -176,7 +176,7 @@ void tcp_assign_congestion_control(struct sock *sk)
 
 void tcp_init_congestion_control(struct sock *sk)
 {
-	const struct inet_connection_sock *icsk = inet_csk(sk);
+	struct inet_connection_sock *icsk = inet_csk(sk);
 
 	tcp_sk(sk)->prior_ssthresh = 0;
 	if (icsk->icsk_ca_ops->init)
@@ -185,6 +185,7 @@ void tcp_init_congestion_control(struct sock *sk)
 		INET_ECN_xmit(sk);
 	else
 		INET_ECN_dontxmit(sk);
+	icsk->icsk_ca_initialized = 1;
 }
 
 static void tcp_reinit_congestion_control(struct sock *sk,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4337841faeff..0e5ac0d33fd3 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5894,8 +5894,10 @@ void tcp_init_transfer(struct sock *sk, int bpf_op, struct sk_buff *skb)
 	tp->snd_cwnd = tcp_init_cwnd(tp, __sk_dst_get(sk));
 	tp->snd_cwnd_stamp = tcp_jiffies32;
 
+	icsk->icsk_ca_initialized = 0;
 	bpf_skops_established(sk, bpf_op, skb);
-	tcp_init_congestion_control(sk);
+	if (!icsk->icsk_ca_initialized)
+		tcp_init_congestion_control(sk);
 	tcp_init_buffer_space(sk);
 }
 
-- 
2.28.0.618.gf4bc123cb7-goog
tcp_set_congestion_control(sk, name, true, ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)); release_sock(sk); diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c index d18d7a1ce4ce..a9b0fb52a1ec 100644 --- a/net/ipv4/tcp_cong.c +++ b/net/ipv4/tcp_cong.c @@ -341,7 +341,7 @@ int tcp_set_allowed_congestion_control(char *val) * already initialized. */ int tcp_set_congestion_control(struct sock *sk, const char *name, bool load, - bool reinit, bool cap_net_admin) + bool cap_net_admin) { struct inet_connection_sock *icsk = inet_csk(sk); const struct tcp_congestion_ops *ca; @@ -365,15 +365,8 @@ int tcp_set_congestion_control(struct sock *sk, const char *name, bool load, if (!ca) { err = -ENOENT; } else if (!load) { - const struct tcp_congestion_ops *old_ca = icsk->icsk_ca_ops; - if (bpf_try_module_get(ca, ca->owner)) { - if (reinit) { - tcp_reinit_congestion_control(sk, ca); - } else { - icsk->icsk_ca_ops = ca; - bpf_module_put(old_ca, old_ca->owner); - } + tcp_reinit_congestion_control(sk, ca); } else { err = -EBUSY; } -- 2.28.0.526.ge36021eeef-goog
Re: [net-next] tcp: add TCP_INFO status for failed client TFO
On Mon, Oct 21, 2019 at 5:11 PM Jason Baron wrote: > > > > On 10/21/19 4:36 PM, Eric Dumazet wrote: > > On Mon, Oct 21, 2019 at 12:53 PM Christoph Paasch wrote: > >> > > > >> Actually, longterm I hope we would be able to get rid of the > >> blackhole-detection and fallback heuristics. In a far distant future where > >> these middleboxes have been weeded out ;-) > >> > >> So, do we really want to eternalize this as part of the API in tcp_info ? > >> > > > > A getsockopt() option won't be available for netlink diag (ss command). > > > > So it all depends on plans to use this FASTOPEN information ? > > > > The original proposal I had 4 states of interest: > > First, we are interested in knowing when a socket has TFO set but > actually requires a retransmit of a non-TFO syn to become established. > In this case, we'd be better off not using TFO. > > A second case is when the server immediately rejects the DATA and just > acks the syn (but not the data). Again in that case, we don't want to be > sending syn+data. > > The third case was whether or not we sent a cookie. Perhaps, the server > doesn't have TFO enabled in which case, it really doesn't make make > sense to enable TFO in the first place. Or if one also controls the > server its helpful in understanding if the server is mis-configured. So > that was the motivation I had for the original four states that I > proposed (final state was a catch-all state). > > Yuchung suggested dividing the 3rd case into 2 for - no cookie sent > because of blackhole or no cookie sent because its not in cache. And > then dropping the second state because we already have the > TCPI_OPT_SYN_DATA bit. However, the TCPI_OPT_SYN_DATA may not be set > because we may fallback in tcp_send_syn_data() due to send failure. So > I'm inclined to say that the second state is valuable. And since > blackhole is already a global thing via MIB, I didn't see a strong need > for it. 
> But it sounded like Yuchung had an interest in it, and I'd
> obviously like a set of states that is generally useful.

I have not kept up with all the proposals in this thread, but would it work to include all of the cases proposed in this thread? It seems like there are enough bits available in holes in tcp_sock and tcp_info to have up to 7 bits of information saved in tcp_sock and exported to tcp_info. That seems like more than enough?

The pahole tool shows the following small holes in tcp_sock:

        u8                         compressed_ack;        /* 1530    1 */
        /* XXX 1 byte hole, try to pack */
        u32                        chrono_start;          /* 1532    4 */
  ...
        u8                         bpf_sock_ops_cb_flags; /* 2076    1 */
        /* XXX 3 bytes hole, try to pack */
        u32                        rcv_ooopack;           /* 2080    4 */

...and the following hole in tcp_info:

        __u8                       tcpi_delivery_rate_app_limited:1; /*  7: 7  1 */
        /* XXX 7 bits hole, try to pack */
        __u32                      tcpi_rto;              /*    8    4 */

cheers,
neal
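As a sketch of the bit-hole mechanics discussed above: adding a small bitfield after a 1-bit member in a byte-sized hole leaves the structure size unchanged. The field names here are illustrative stand-ins, not the actual tcp_info layout:

```c
#include <assert.h>
#include <stdint.h>

/* A 1-bit field in a byte leaves 7 bits of slack before the next
 * naturally aligned member. */
struct info_old {
	uint8_t  app_limited:1;          /* 7-bit hole follows */
	uint32_t rto;
};

/* A hypothetical new 2-bit field lands in that hole for free. */
struct info_new {
	uint8_t  app_limited:1;
	uint8_t  fastopen_client_fail:2; /* illustrative name */
	uint32_t rto;
};
```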
Re: Crash when receiving FIN-ACK in TCP_FIN_WAIT1 state
On Mon, Oct 21, 2019 at 8:04 PM Subash Abhinov Kasiviswanathan wrote: > > > Interesting! As tcp_input.c summarizes, "packets_out is > > SND.NXT-SND.UNA counted in packets". In the normal operation of a > > socket, tp->packets_out should not be 0 if any of those other fields > > are non-zero. > > > > The tcp_write_queue_purge() function sets packets_out to 0: > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/tree/net/ipv4/tcp.c?h=v4.19#n2526 > > > > So the execution of tcp_write_queue_purge() before this point is one > > way for the socket to end up in this weird state. > > > > In one of the instances, the values are tp->snd_nxt = 1016118098, > tp->snd_una = 1016047820 > > tp->mss_cache = 1378 > > I assume the number of outstanding segments should be > (tp->snd_nxt - tp->snd_una)/tp->mss_cache = 51 That would be a good expectation if all the packets were full-sized. > tp->packets_out = 0 and tp->sacked_out = 158 in this case. OK, thanks. It could be that sacked_out is reasonable and some of the packets were not full-sized. But, as discussed above, typically the packets_out should not be 0 if sacked_out is non-zero (with at least the exception of the tcp_write_queue_purge() case). > >> > Yes, one guess would be that somehow the skbs in the retransmit queue > >> > have been freed, but tp->sacked_out is still non-zero and > >> > tp->highest_sack is still a dangling pointer into one of those freed > >> > skbs. The tcp_write_queue_purge() function is one function that frees > >> > the skbs in the retransmit queue and leaves tp->sacked_out as non-zero > >> > and tp->highest_sack as a dangling pointer to a freed skb, AFAICT, so > >> > that's why I'm wondering about that function. I can't think of a > >> > specific sequence of events that would involve tcp_write_queue_purge() > >> > and then a socket that's still in FIN-WAIT1. Maybe I'm not being > >> > creative enough, or maybe that guess is on the wrong track.
Would you > >> > be able to set a new bit in the tcp_sock in tcp_write_queue_purge() > >> > and log it in your instrumentation point, to see if > >> > tcp_write_queue_purge() was called for these connections that cause > >> > this crash? > > I've queued up a build which logs calls to tcp_write_queue_purge and > clears tp->highest_sack and tp->sacked_out. I will let you know how > it fares by end of week. OK, thanks. That should be a useful data point. cheers, neal
Re: Crash when receiving FIN-ACK in TCP_FIN_WAIT1 state
On Sun, Oct 20, 2019 at 10:45 PM Subash Abhinov Kasiviswanathan wrote: > > > FIN-WAIT1 just means the local application has called close() or > > shutdown() to shut down the sending direction of the socket, and the > > local TCP stack has sent a FIN, and is waiting to receive a FIN and an > > ACK from the other side (in either order, or simultaneously). The > > ASCII art state transition diagram on page 22 of RFC 793 (e.g. > > https://tools.ietf.org/html/rfc793#section-3.2 ) is one source for > > this, though the W. Richard Stevens books have a much more readable > > diagram. > > > > There may still be unacked and SACKed data in the retransmit queue at > > this point. > > > > Thanks for the clarification. > > > Thanks, that is a useful data point. Do you know what particular value > > tp->sacked_out has? Would you be able to capture/log the value of > > tp->packets_out, tp->lost_out, and tp->retrans_out as well? > > > > tp->sacked_out varies per crash instance - 55, 180 etc. > However the other values are always the same - tp->packets_out is 0, > tp->lost_out is 1 and tp->retrans_out is 1. Interesting! As tcp_input.c summarizes, "packets_out is SND.NXT-SND.UNA counted in packets". In the normal operation of a socket, tp->packets_out should not be 0 if any of those other fields are non-zero. The tcp_write_queue_purge() function sets packets_out to 0: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/tree/net/ipv4/tcp.c?h=v4.19#n2526 So the execution of tcp_write_queue_purge() before this point is one way for the socket to end up in this weird state. > > Yes, one guess would be that somehow the skbs in the retransmit queue > > have been freed, but tp->sacked_out is still non-zero and > > tp->highest_sack is still a dangling pointer into one of those freed > > skbs.
The tcp_write_queue_purge() function is one function that frees > > the skbs in the retransmit queue and leaves tp->sacked_out as non-zero > > and tp->highest_sack as a dangling pointer to a freed skb, AFAICT, so > > that's why I'm wondering about that function. I can't think of a > > specific sequence of events that would involve tcp_write_queue_purge() > > and then a socket that's still in FIN-WAIT1. Maybe I'm not being > > creative enough, or maybe that guess is on the wrong track. Would you > > be able to set a new bit in the tcp_sock in tcp_write_queue_purge() > > and log it in your instrumentation point, to see if > > tcp_write_queue_purge() was called for these connections that cause > > this crash? > > Sure, I can try this out. Great! Thanks! neal
Re: Crash when receiving FIN-ACK in TCP_FIN_WAIT1 state
On Sun, Oct 20, 2019 at 7:15 PM Subash Abhinov Kasiviswanathan wrote: > > > Hmm. Random related thought while searching for a possible cause: I > > wonder if tcp_write_queue_purge() should clear tp->highest_sack (and > > possibly tp->sacked_out)? The tcp_write_queue_purge() code is careful > > to call tcp_clear_all_retrans_hints(tcp_sk(sk)) and I would imagine > > that similar considerations would imply that we should clear at least > > tp->highest_sack? > > > > neal > > Hi Neal > > If the socket is in FIN-WAIT1, does that mean that all the segments > corresponding to SACK blocks are sent and ACKed already? FIN-WAIT1 just means the local application has called close() or shutdown() to shut down the sending direction of the socket, and the local TCP stack has sent a FIN, and is waiting to receive a FIN and an ACK from the other side (in either order, or simultaneously). The ASCII art state transition diagram on page 22 of RFC 793 (e.g. https://tools.ietf.org/html/rfc793#section-3.2 ) is one source for this, though the W. Richard Stevens books have a much more readable diagram. There may still be unacked and SACKed data in the retransmit queue at this point. > tp->sacked_out is non zero in all these crashes Thanks, that is a useful data point. Do you know what particular value tp->sacked_out has? Would you be able to capture/log the value of tp->packets_out, tp->lost_out, and tp->retrans_out as well? > (is the SACK information possibly invalid or stale here?). Yes, one guess would be that somehow the skbs in the retransmit queue have been freed, but tp->sacked_out is still non-zero and tp->highest_sack is still a dangling pointer into one of those freed skbs. The tcp_write_queue_purge() function is one function that frees the skbs in the retransmit queue and leaves tp->sacked_out as non-zero and tp->highest_sack as a dangling pointer to a freed skb, AFAICT, so that's why I'm wondering about that function.
I can't think of a specific sequence of events that would involve tcp_write_queue_purge() and then a socket that's still in FIN-WAIT1. Maybe I'm not being creative enough, or maybe that guess is on the wrong track. Would you be able to set a new bit in the tcp_sock in tcp_write_queue_purge() and log it in your instrumentation point, to see if tcp_write_queue_purge() was called for these connections that cause this crash? thanks, neal
Re: Crash when receiving FIN-ACK in TCP_FIN_WAIT1 state
On Sun, Oct 20, 2019 at 4:25 PM Subash Abhinov Kasiviswanathan wrote: > > We are seeing a crash in the TCP ACK codepath often in our regression > racks with an ARM64 device with 4.19 based kernel. > > It appears that the tp->highest_sack is invalid when being accessed when a > FIN-ACK is received. In all the instances of the crash, the tcp socket > is in TCP_FIN_WAIT1 state. Hmm. Random related thought while searching for a possible cause: I wonder if tcp_write_queue_purge() should clear tp->highest_sack (and possibly tp->sacked_out)? The tcp_write_queue_purge() code is careful to call tcp_clear_all_retrans_hints(tcp_sk(sk)) and I would imagine that similar considerations would imply that we should clear at least tp->highest_sack? neal
Re: [net-next] tcp: add TCP_INFO status for failed client TFO
On Fri, Oct 18, 2019 at 3:03 PM Jason Baron wrote: > > The TCPI_OPT_SYN_DATA bit as part of tcpi_options currently reports whether > or not data-in-SYN was ack'd on both the client and server side. We'd like > to gather more information on the client-side in the failure case in order > to indicate the reason for the failure. This can be useful for not only > debugging TFO, but also for creating TFO socket policies. For example, if > a middle box removes the TFO option or drops a data-in-SYN, we can > detect this case, and turn off TFO for these connections saving the > extra retransmits. > > The newly added tcpi_fastopen_client_fail status is 2 bits and has 4 > states: > > 1) TFO_STATUS_UNSPEC > > catch-all. > > 2) TFO_NO_COOKIE_SENT > > If TFO_CLIENT_NO_COOKIE mode is off, this state indicates that no cookie > was sent because we don't have one yet, it's not in cache or black-holing > may be enabled (already indicated by the global > LINUX_MIB_TCPFASTOPENBLACKHOLE). > > 3) TFO_NO_SYN_DATA > > Data was sent with SYN, we received a SYN/ACK but it did not cover the data > portion. Cookie is not accepted by server because the cookie may be invalid > or the server may be overloaded. > > > 4) TFO_NO_SYN_DATA_TIMEOUT > > Data was sent with SYN, we received a SYN/ACK which did not cover the data > after at least 1 additional SYN was sent (without data). It may be the case > that a middle-box is dropping data-in-SYN packets. Thus, it would be more > efficient to not use TFO on this connection to avoid extra retransmits > during connection establishment. > > These new fields certainly do not cover all the cases where TFO may fail, but > other failures, such as SYN/ACK + data being dropped, will result in the > connection not becoming established. And a connection blackhole after > session establishment shows up as a stalled connection. > > Signed-off-by: Jason Baron > Cc: Eric Dumazet > Cc: Neal Cardwell > Cc: Christoph Paasch > --- Thanks for adding this!
It would be good to reset tp->fastopen_client_fail to 0 in tcp_disconnect(). > +/* why fastopen failed from client perspective */ > +enum tcp_fastopen_client_fail { > + TFO_STATUS_UNSPEC, /* catch-all */ > + TFO_NO_COOKIE_SENT, /* if not in TFO_CLIENT_NO_COOKIE mode */ > + TFO_NO_SYN_DATA, /* SYN-ACK did not ack SYN data */ I found the "TFO_NO_SYN_DATA" name a little unintuitive; it sounded to me like this means the client didn't send a SYN+DATA. What about "TFO_DATA_NOT_ACKED", or something like that? If you don't mind, it would be great to cc: Yuchung on the next rev. thanks, neal
Re: [PATCH v2] tcp: Add TCP_INFO counter for packets received out-of-order
On Tue, Sep 17, 2019 at 1:22 PM Eric Dumazet wrote: > > On Tue, Sep 17, 2019 at 10:13 AM Jason Baron wrote: > > > > > > Hi, > > > > I was interested in adding a field to tcp_info around the TFO state of a > > socket. So for the server side it would indicate if TFO was used to > > create the socket and on the client side it would report whether TFO > > worked and if not that it failed with maybe some additional states > > around why it failed. I'm thinking it would be maybe 3 bits. BTW, one aspect of that "did TFO work" info is available already in tcp_info in the tcpi_options field. Kernel side is: if (tp->syn_data_acked) info->tcpi_options |= TCPI_OPT_SYN_DATA; We use this bit in packetdrill tests on client and server side to check that the TFO data-in-SYN succeeded: +0 %{ assert (tcpi_options & TCPI_OPT_SYN_DATA) != 0, tcpi_options }% These TFO bits were added much later than the other bits, so IMHO it would be OK to add more bits somewhere unused in tcp_info to indicate reasons for TFO failure. Especially if, as you suggest, "0" as a code point could indicate that the code point is undefined, and all meaningful code points were non-zero. neal > > My question is whether it's reasonable to use the unused bits of > > __u8 tcpi_delivery_rate_app_limited:1;. Or is this not good because > > the size hasn't changed? What if I avoided using 0 for the new field to > > avoid the possibility of not knowing if 0 because it's the old kernel or > > 0 because that's now its a TFO state? IE the new field could always be > > > 0 for the new kernel. > > > > I guess that storing the 'why it has failed' would need more bits. > > I suggest maybe using an event for this, instead of TCP_INFO ? > > As of using the bits, maybe the monitoring application does not really care > if running on an old kernel where the bits would be zero.
> > Commit eb8329e0a04db0061f714f033b4454326ba147f4 reserved a single > bit and did not bother about making sure the monitoring would detect if this > runs on an old kernel.
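For reference, the tcpi_options check described in the thread above looks like this from user space, once getsockopt(TCP_INFO) has filled in struct tcp_info (that call is omitted here; the TCPI_OPT_SYN_DATA value is copied from the uapi header):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TCPI_OPT_SYN_DATA 32 /* value from include/uapi/linux/tcp.h */

/* Client side: was the data we carried in our SYN acked by the peer,
 * given the tcpi_options field returned by getsockopt(TCP_INFO)? */
static bool tfo_syn_data_acked(uint32_t tcpi_options)
{
	return (tcpi_options & TCPI_OPT_SYN_DATA) != 0;
}
```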
Re: [PATCH v5 2/2] tcp: Add snd_wnd to TCP_INFO
On Fri, Sep 13, 2019 at 7:23 PM Thomas Higdon wrote: > > Neal Cardwell mentioned that snd_wnd would be useful for diagnosing TCP > performance problems -- > > (1) Usually when we're diagnosing TCP performance problems, we do so > > from the sender, since the sender makes most of the > > performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc). > > From the sender-side the thing that would be most useful is to see > > tp->snd_wnd, the receive window that the receiver has advertised to > > the sender. > > This serves the purpose of adding an additional __u32 to avoid the > would-be hole caused by the addition of the tcpi_rcvi_ooopack field. > > Signed-off-by: Thomas Higdon > --- > changes since v4: > - clarify comment Acked-by: Neal Cardwell Thanks! neal
Re: [PATCH v5 1/2] tcp: Add TCP_INFO counter for packets received out-of-order
On Fri, Sep 13, 2019 at 7:23 PM Thomas Higdon wrote: > > For receive-heavy cases on the server-side, we want to track the > connection quality for individual client IPs. This counter, similar to > the existing system-wide TCPOFOQueue counter in /proc/net/netstat, > tracks out-of-order packet reception. By providing this counter in > TCP_INFO, it will allow understanding to what degree receive-heavy > sockets are experiencing out-of-order delivery and packet drops > indicating congestion. > > Please note that this is similar to the counter in NetBSD TCP_INFO, and > has the same name. > > Also note that we avoid increasing the size of the tcp_sock struct by > taking advantage of a hole. > > Signed-off-by: Thomas Higdon > --- > changes since v4: > - optimize placement of rcv_ooopack to avoid increasing tcp_sock struct >size Acked-by: Neal Cardwell Thanks, Thomas, for adding this! After this is merged, would you mind sending a patch to add support to the "ss" command line tool to print these 2 new fields? My favorite recent example of such a patch to ss is Eric's change: https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/commit/misc/ss.c?id=5eead6270a19f00464052d4084f32182cfe027ff thanks, neal
Re: [PATCH v4 2/2] tcp: Add snd_wnd to TCP_INFO
On Fri, Sep 13, 2019 at 5:29 PM Yuchung Cheng wrote: > > What if the comment is shortened up to fit in 80 columns and the units > > (bytes) are added, something like: > > > > __u32 tcpi_snd_wnd;/* peer's advertised recv window > > (bytes) */ > just a thought: will tcpi_peer_rcv_wnd be more self-explanatory? Good suggestion. I'm on the fence about that one. By itself, I agree tcpi_peer_rcv_wnd sounds much more clear. But tcpi_snd_wnd has the virtue of matching both the kernel code (tp->snd_wnd) and RFC 793 (SND.WND). So they both have pros and cons. Maybe someone else feels more strongly one way or the other. neal
Re: [PATCH v4 2/2] tcp: Add snd_wnd to TCP_INFO
On Fri, Sep 13, 2019 at 3:36 PM Thomas Higdon wrote: > > Neal Cardwell mentioned that snd_wnd would be useful for diagnosing TCP > performance problems -- > > (1) Usually when we're diagnosing TCP performance problems, we do so > > from the sender, since the sender makes most of the > > performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc). > > From the sender-side the thing that would be most useful is to see > > tp->snd_wnd, the receive window that the receiver has advertised to > > the sender. > > This serves the purpose of adding an additional __u32 to avoid the > would-be hole caused by the addition of the tcpi_rcvi_ooopack field. > > Signed-off-by: Thomas Higdon > --- > changes from v3: > - changed from rcv_wnd to snd_wnd > > include/uapi/linux/tcp.h | 1 + > net/ipv4/tcp.c | 1 + > 2 files changed, 2 insertions(+) > > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h > index 20237987ccc8..240654f22d98 100644 > --- a/include/uapi/linux/tcp.h > +++ b/include/uapi/linux/tcp.h > @@ -272,6 +272,7 @@ struct tcp_info { > __u32 tcpi_reord_seen; /* reordering events seen */ > > __u32 tcpi_rcv_ooopack;/* Out-of-order packets received */ > + __u32 tcpi_snd_wnd;/* Remote peer's advertised recv window > size */ > }; Thanks for adding this! My run of ./scripts/checkpatch.pl is showing a warning on this line: WARNING: line over 80 characters #19: FILE: include/uapi/linux/tcp.h:273: + __u32 tcpi_snd_wnd;/* Remote peer's advertised recv window size */ What if the comment is shortened up to fit in 80 columns and the units (bytes) are added, something like: __u32 tcpi_snd_wnd;/* peer's advertised recv window (bytes) */ neal
Re: [PATCH v4 1/2] tcp: Add TCP_INFO counter for packets received out-of-order
On Fri, Sep 13, 2019 at 3:37 PM Thomas Higdon wrote: > > For receive-heavy cases on the server-side, we want to track the > connection quality for individual client IPs. This counter, similar to > the existing system-wide TCPOFOQueue counter in /proc/net/netstat, > tracks out-of-order packet reception. By providing this counter in > TCP_INFO, it will allow understanding to what degree receive-heavy > sockets are experiencing out-of-order delivery and packet drops > indicating congestion. > > Please note that this is similar to the counter in NetBSD TCP_INFO, and > has the same name. > > Signed-off-by: Thomas Higdon > --- > > no changes from v3 > > include/linux/tcp.h | 2 ++ > include/uapi/linux/tcp.h | 2 ++ > net/ipv4/tcp.c | 2 ++ > net/ipv4/tcp_input.c | 1 + > 4 files changed, 7 insertions(+) > > diff --git a/include/linux/tcp.h b/include/linux/tcp.h > index f3a85a7fb4b1..a01dc78218f1 100644 > --- a/include/linux/tcp.h > +++ b/include/linux/tcp.h > @@ -393,6 +393,8 @@ struct tcp_sock { > */ > struct request_sock *fastopen_rsk; > u32 *saved_syn; > + > + u32 rcv_ooopack; /* Received out-of-order packets, for tcpinfo */ Thanks for adding this. A thought: putting the new rcv_ooopack field here makes struct tcp_sock bigger, and increases the odds of taking a cache miss (according to "pahole" this field is the only one in a new cache line). I'd suggest putting the new rcv_ooopack field immediately before rcv_rtt_last_tsecr. That would use up a 4-byte hole, and would place it in a cache line already used on TCP receivers (for rcv_rtt logic). This would make it less likely this new field causes more cache misses or uses more space. Details: looking at the output of "pahole" for tcp_sock in various cases: net-next before this patch: - ... 
        u8                         bpf_sock_ops_cb_flags; /* 2076    1 */
        /* XXX 3 bytes hole, try to pack */
        u32                        rcv_rtt_last_tsecr;    /* 2080    4 */
        /* XXX 4 bytes hole, try to pack */
        struct {
                u32                rtt_us;                /* 2088    4 */
                u32                seq;                   /* 2092    4 */
                u64                time;                  /* 2096    8 */
        } rcv_rtt_est;                                    /* 2088   16 */
  ...
        /* size: 2176, cachelines: 34, members: 134 */
        /* sum members: 2164, holes: 4, sum holes: 12 */
        /* paddings: 3, sum paddings: 12 */

net-next with this patch:
-------------------------
  ...
        u32 *                      saved_syn;             /* 2168    8 */
        /* --- cacheline 34 boundary (2176 bytes) --- */
        u32                        rcv_ooopack;           /* 2176    4 */
  ...
        /* size: 2184, cachelines: 35, members: 135 */
        /* sum members: 2168, holes: 4, sum holes: 12 */
        /* padding: 4 */
        /* paddings: 3, sum paddings: 12 */
        /* last cacheline: 8 bytes */

net-next with this field in the suggested spot:
-----------------------------------------------
  ...
        /* XXX 3 bytes hole, try to pack */
        u32                        rcv_ooopack;           /* 2080    4 */
        u32                        rcv_rtt_last_tsecr;    /* 2084    4 */
        struct {
                u32                rtt_us;                /* 2088    4 */
                u32                seq;                   /* 2092    4 */
                u64                time;                  /* 2096    8 */
        } rcv_rtt_est;                                    /* 2088   16 */
  ...
        /* size: 2176, cachelines: 34, members: 135 */
        /* sum members: 2168, holes: 3, sum holes: 8 */
        /* paddings: 3, sum paddings: 12 */

neal
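The layout argument above can be reproduced in miniature: a u32 followed by an 8-byte-aligned member leaves a 4-byte hole, and placing another u32 in front of it consumes the hole without growing the structure. This assumes a typical LP64 ABI; the field names merely echo the tcp_sock members under discussion:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct with_hole {
	uint32_t rcv_rtt_last_tsecr;
	/* 4-byte hole: the next member is 8-byte aligned */
	uint64_t rcv_rtt_est_time;
};

struct hole_filled {
	uint32_t rcv_ooopack;        /* new field lands in the hole */
	uint32_t rcv_rtt_last_tsecr;
	uint64_t rcv_rtt_est_time;
};
```

Both structs are 16 bytes: the new u32 is effectively free, which is the whole point of packing it before rcv_rtt_last_tsecr instead of at the end of tcp_sock.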
Re: [PATCH v3 2/2] tcp: Add rcv_wnd to TCP_INFO
On Fri, Sep 13, 2019 at 10:29 AM Thomas Higdon wrote: > > On Thu, Sep 12, 2019 at 10:14:33AM +0100, Dave Taht wrote: > > On Thu, Sep 12, 2019 at 1:59 AM Neal Cardwell wrote: > > > > > > On Wed, Sep 11, 2019 at 6:32 PM Thomas Higdon wrote: > > > > > > > > Neal Cardwell mentioned that rcv_wnd would be useful for helping > > > > diagnose whether a flow is receive-window-limited at a given instant. > > > > > > > > This serves the purpose of adding an additional __u32 to avoid the > > > > would-be hole caused by the addition of the tcpi_rcvi_ooopack field. > > > > > > > > Signed-off-by: Thomas Higdon > > > > --- > > > > > > Thanks, Thomas. > > > > > > I know that when I mentioned this before I mentioned the idea of both > > > tp->snd_wnd (send-side receive window) and tp->rcv_wnd (receive-side > > > receive window) in tcp_info, and did not express a preference between > > > the two. Now that we are faced with a decision between the two, > > > personally I think it would be a little more useful to start with > > > tp->snd_wnd. :-) > > > > > > Two main reasons: > > > > > > (1) Usually when we're diagnosing TCP performance problems, we do so > > > from the sender, since the sender makes most of the > > > performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc). > > > From the sender-side the thing that would be most useful is to see > > > tp->snd_wnd, the receive window that the receiver has advertised to > > > the sender. > > > > I am under the impression, that particularly in the mobile space, that > > network behavior > > is often governed by rcv_wnd. At least, there's been so many papers on > > this that I'd > > tended to assume so. > > > > Given a desire to do both vars, is there a *third* u32 we could add to > > fill in the next hole? :) > > ecn marks? > > Neal makes some good points -- there is a fair amount of existing > information for deriving receive window. It seems like snd_wnd would be > more valuable at this moment. 
For the purpose of pairing up these __u32s > to get something we can commit, I propose that we go with > the rcv_ooopack/snd_wnd pair for now, and when something comes up later, > one might consider pairing up rcv_wnd. FWIW that sounds like a great plan to me. Thanks, Thomas! neal
Re: [PATCH v3 2/2] tcp: Add rcv_wnd to TCP_INFO
On Wed, Sep 11, 2019 at 6:32 PM Thomas Higdon wrote: > > Neal Cardwell mentioned that rcv_wnd would be useful for helping > diagnose whether a flow is receive-window-limited at a given instant. > > This serves the purpose of adding an additional __u32 to avoid the > would-be hole caused by the addition of the tcpi_rcv_ooopack field. > > Signed-off-by: Thomas Higdon > --- Thanks, Thomas. I know that when I mentioned this before I mentioned the idea of both tp->snd_wnd (send-side receive window) and tp->rcv_wnd (receive-side receive window) in tcp_info, and did not express a preference between the two. Now that we are faced with a decision between the two, personally I think it would be a little more useful to start with tp->snd_wnd. :-) Two main reasons: (1) Usually when we're diagnosing TCP performance problems, we do so from the sender, since the sender makes most of the performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc). From the sender-side the thing that would be most useful is to see tp->snd_wnd, the receive window that the receiver has advertised to the sender. (2) From the receiver side, "ss" can already show a fair amount of info about receive-side buffer/window limits, like: info->tcpi_rcv_ssthresh, info->tcpi_rcv_space, skmeminfo[SK_MEMINFO_RMEM_ALLOC], skmeminfo[SK_MEMINFO_RCVBUF]. Often the rwin can be approximated by combining those. Hopefully Eric, Yuchung, and Soheil can weigh in on the question of snd_wnd vs rcv_wnd. Or we can perhaps think of another field, and add the tcpi_rcv_ooopack, snd_wnd, rcv_wnd, and that final field, all together. thanks, neal
Re: [PATCH net-next] tcp: force a PSH flag on TSO packets
On Tue, Sep 10, 2019 at 5:49 PM Eric Dumazet wrote: > > When tcp sends a TSO packet, adding a PSH flag on it > reduces the sojourn time of GRO packet in GRO receivers. > > This is particularly the case under pressure, since RX queues > receive packets for many concurrent flows. > > A sender can give a hint to GRO engines when it is > appropriate to flush a super-packet, especially when pacing > is in the picture, since next packet is probably delayed by > one ms. > > Having less packets in GRO engine reduces chance > of LRU eviction or inflated RTT, and reduces GRO cost. > > We found recently that we must not set the PSH flag on > individual full-size MSS segments [1] : > > Under pressure (CWR state), we better let the packet sit > for a small delay (depending on NAPI logic) so that the > ACK packet is delayed, and thus next packet we send is > also delayed a bit. Eventually the bottleneck queue can > be drained. DCTCP flows with CWND=1 have demonstrated > the issue. > > This patch allows to slowdown the aggregate traffic without > involving high resolution timers on senders and/or > receivers. > > It has been used at Google for about four years, > and has been discussed at various networking conferences. > > [1] segments smaller than MSS already have PSH flag set > by tcp_sendmsg() / tcp_mark_push(), unless MSG_MORE > has been requested by the user. 
> > Signed-off-by: Eric Dumazet > Cc: Soheil Hassas Yeganeh > Cc: Neal Cardwell > Cc: Yuchung Cheng > Cc: Daniel Borkmann > Cc: Tariq Toukan > --- > net/ipv4/tcp_output.c | 15 +-- > 1 file changed, 13 insertions(+), 2 deletions(-) > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index > 42abc9bd687a5fea627cbc7cfa750d022f393d84..fec6d67bfd146dc78f0f25173fd71b8b8cc752fe > 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -1050,11 +1050,22 @@ static int __tcp_transmit_skb(struct sock *sk, struct > sk_buff *skb, > tcb = TCP_SKB_CB(skb); > memset(&opts, 0, sizeof(opts)); > > - if (unlikely(tcb->tcp_flags & TCPHDR_SYN)) > + if (unlikely(tcb->tcp_flags & TCPHDR_SYN)) { > tcp_options_size = tcp_syn_options(sk, skb, &opts, &md5); > - else > + } else { > tcp_options_size = tcp_established_options(sk, skb, &opts, >&md5); > + /* Force a PSH flag on all (GSO) packets to expedite GRO flush > +* at receiver : This slightly improve GRO performance. > +* Note that we do not force the PSH flag for non GSO packets, > +* because they might be sent under high congestion events, > +* and in this case it is better to delay the delivery of > 1-MSS > +* packets and thus the corresponding ACK packet that would > +* release the following packet. > +*/ > + if (tcp_skb_pcount(skb) > 1) > + tcb->tcp_flags |= TCPHDR_PSH; > + } > tcp_header_size = tcp_options_size + sizeof(struct tcphdr); > > /* if no packet is in qdisc/device queue, then allow XPS to select Acked-by: Neal Cardwell Thanks, Eric! neal
Re: [PATCH v2] tcp: Add TCP_INFO counter for packets received out-of-order
On Tue, Sep 10, 2019 at 4:39 PM Eric Dumazet wrote: > > On Tue, Sep 10, 2019 at 10:11 PM Thomas Higdon wrote: > > > > > ... > > Because an additional 32-bit member in struct tcp_info would cause > > a hole on 64-bit systems, we reserve a struct member '_reserved'. > ... > > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h > > index b3564f85a762..990a5bae3ac1 100644 > > --- a/include/uapi/linux/tcp.h > > +++ b/include/uapi/linux/tcp.h > > @@ -270,6 +270,9 @@ struct tcp_info { > > __u64 tcpi_bytes_retrans; /* RFC4898 tcpEStatsPerfOctetsRetrans > > */ > > __u32 tcpi_dsack_dups; /* RFC4898 tcpEStatsStackDSACKDups */ > > __u32 tcpi_reord_seen; /* reordering events seen */ > > + > > + __u32 _reserved; /* Reserved for future 32-bit member. > > */ > > + __u32 tcpi_rcv_ooopack;/* Out-of-order packets received */ > > }; > > > > Unfortunately we won't be able to use this hole, because the way the > TCP_INFO works, > > The kernel will report the same size after the reserved field is > renamed to something else. > > User space code is able to detect which fields are there or not based > on what the kernel > returns for the size of the structure. If we are looking for something else useful to use for a __u32 to pair up with this new field, I would suggest we export tp->snd_wnd (send-side receive window) and tp->rcv_wnd (receive-side receive window) in tcp_info. We could export one of them now, and the other the next time we need to add a field and need some useful "padding". These fields could help folks diagnose whether a flow is receive-window-limited at a given instant, using "ss", etc. I think there was some interest in this internally in our team a while ago. neal
[PATCH net] tcp: fix tcp_ecn_withdraw_cwr() to clear TCP_ECN_QUEUE_CWR
Fix tcp_ecn_withdraw_cwr() to clear the correct bit: TCP_ECN_QUEUE_CWR. Rationale: basically, TCP_ECN_DEMAND_CWR is a bit that is purely about the behavior of data receivers, and deciding whether to reflect incoming IP ECN CE marks as outgoing TCP th->ece marks. The TCP_ECN_QUEUE_CWR bit is purely about the behavior of data senders, and deciding whether to send CWR. The tcp_ecn_withdraw_cwr() function is only called from tcp_undo_cwnd_reduction() by data senders during an undo, so it should zero the sender-side state, TCP_ECN_QUEUE_CWR. It does not make sense to stop the reflection of incoming CE bits on incoming data packets just because outgoing packets were spuriously retransmitted. The bug has been reproduced with packetdrill; it manifests in a scenario with RFC3168 ECN, with an incoming data packet with the CE bit set and carrying a TCP timestamp value that causes cwnd undo. Before this fix, the IP CE bit was ignored and not reflected in the TCP ECE header bit, and the sender sent a TCP CWR ('W') bit on the next outgoing data packet, even though the cwnd reduction had been undone. After this fix, the sender properly reflects the CE bit and does not set the W bit. Note: the bug actually predates 2005 git history; this Fixes footer is chosen to be the oldest SHA1 I have tested (from Sep 2007) for which the patch applies cleanly (since before this commit the code was in a .h file). 
Fixes: bdf1ee5d3bd3 ("[TCP]: Move code from tcp_ecn.h to tcp*.c and tcp.h & remove it") Signed-off-by: Neal Cardwell Acked-by: Yuchung Cheng Acked-by: Soheil Hassas Yeganeh Cc: Eric Dumazet --- net/ipv4/tcp_input.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index c21e8a22fb3b..8a1cd93dbb09 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -266,7 +266,7 @@ static void tcp_ecn_accept_cwr(struct sock *sk, const struct sk_buff *skb) static void tcp_ecn_withdraw_cwr(struct tcp_sock *tp) { - tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR; + tp->ecn_flags &= ~TCP_ECN_QUEUE_CWR; } static void __tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb) -- 2.23.0.162.g0b9fbb3734-goog
Re: [PATCH net] tcp: remove empty skb from write queue in error cases
On Mon, Aug 26, 2019 at 12:19 PM Eric Dumazet wrote: > > Vladimir Rutsky reported stuck TCP sessions after memory pressure > events. An edge-triggered epoll() user would never receive an EPOLLOUT > notification allowing them to retry a sendmsg(). > > Jason tested the case of sk_stream_alloc_skb() returning NULL, > but there are other paths that could lead both sendmsg() and sendpage() > to return -1 (EAGAIN), with an empty skb queued on the write queue. > > This patch makes sure we remove this empty skb so that > Jason's code can detect that the queue is empty, and > call sk->sk_write_space(sk) accordingly. > > Fixes: ce5ec440994b ("tcp: ensure epoll edge trigger wakeup when write queue > is empty") > Signed-off-by: Eric Dumazet > Cc: Jason Baron > Reported-by: Vladimir Rutsky > Cc: Soheil Hassas Yeganeh > Cc: Neal Cardwell > --- Acked-by: Neal Cardwell Nice detective work. :-) Thanks, Eric! neal
Re: [PATCH net] tcp: make sure EPOLLOUT wont be missed
On Sat, Aug 17, 2019 at 12:26 AM Eric Dumazet wrote: > > As Jason Baron explained in commit 790ba4566c1a ("tcp: set SOCK_NOSPACE > under memory pressure"), it is crucial we properly set SOCK_NOSPACE > when needed. > > However, Jason patch had a bug, because the 'nonblocking' status > as far as sk_stream_wait_memory() is concerned is governed > by MSG_DONTWAIT flag passed at sendmsg() time : > > long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT); > > So it is very possible that tcp sendmsg() calls sk_stream_wait_memory(), > and that sk_stream_wait_memory() returns -EAGAIN with SOCK_NOSPACE > cleared, if sk->sk_sndtimeo has been set to a small (but not zero) > value. > > This patch removes the 'noblock' variable since we must always > set SOCK_NOSPACE if -EAGAIN is returned. > > It also renames the do_nonblock label since we might reach this > code path even if we were in blocking mode. > > Fixes: 790ba4566c1a ("tcp: set SOCK_NOSPACE under memory pressure") > Signed-off-by: Eric Dumazet > Cc: Jason Baron > Reported-by: Vladimir Rutsky > --- > net/core/stream.c | 16 +--- > 1 file changed, 9 insertions(+), 7 deletions(-) Acked-by: Neal Cardwell Thanks, Eric! neal
Re: [PATCH v2 1/2] tcp: add new tcp_mtu_probe_floor sysctl
On Wed, Aug 7, 2019 at 7:55 PM Josh Hunt wrote: > > The current implementation of TCP MTU probing can considerably > underestimate the MTU on lossy connections allowing the MSS to get down to > 48. We have found that in almost all of these cases on our networks these > paths can handle much larger MTUs meaning the connections are being > artificially limited. Even though TCP MTU probing can raise the MSS back up > we have seen this not to be the case causing connections to be "stuck" with > an MSS of 48 when heavy loss is present. > > Prior to pushing out this change we could not keep TCP MTU probing enabled > b/c of the above reasons. Now with a reasonable floor set we've had it > enabled for the past 6 months. > > The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives > administrators the ability to control the floor of MSS probing. > > Signed-off-by: Josh Hunt > --- Acked-by: Neal Cardwell Thanks, Josh. I agree with Eric that it would be great if you are able to share the value that you have found to work well. neal
Re: [PATCH v2 2/2] tcp: Update TCP_BASE_MSS comment
On Thu, Aug 8, 2019 at 2:13 AM Eric Dumazet wrote: > On 8/8/19 1:52 AM, Josh Hunt wrote: > > TCP_BASE_MSS is used as the default initial MSS value when MTU probing is > > enabled. Update the comment to reflect this. > > > > Suggested-by: Neal Cardwell > > Signed-off-by: Josh Hunt > > --- > > Signed-off-by: Eric Dumazet Acked-by: Neal Cardwell Thanks, Josh! neal
Re: [PATCH net 2/4] tcp: tcp_fragment() should apply sane memory limits
On Fri, Aug 2, 2019 at 3:03 PM Bernd wrote: > > Hello, > > While analyzing an aborted upload packet capture I came across an odd > trace where a sender was not responding to a duplicate SACK but > sending further segments until it stalled. > > It took me some time to remember this fix; the problems actually > started when the security fix was applied. > > I see a high counter for TCPWqueueTooBig - and I don’t think that’s an > actual attack. > > Is there a probability of triggering the limit with connections with > big windows and large send buffers and dropped segments? If so what > would be the plan? It does not look like it is configurable. The trace > seems to have 100 (filled) inflight segments. > > Gruss > Bernd > -- > http://bernd.eckenfels.net What's the exact kernel version you are using? Eric submitted a patch recently that may address your issue: tcp: be more careful in tcp_fragment() https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=b617158dc096709d8600c53b6052144d12b89fab Would you be able to test your workload with that commit cherry-picked, and see if the issue still occurs? That commit was targeted to many stable releases, so you may be able to pick up that fix from a stable branch. cheers, neal
Re: [PATCH] tcp: add new tcp_mtu_probe_floor sysctl
On Sun, Jul 28, 2019 at 5:14 PM Josh Hunt wrote: > > On 7/28/19 6:54 AM, Eric Dumazet wrote: > > On Sun, Jul 28, 2019 at 1:21 AM Josh Hunt wrote: > >> > >> On 7/27/19 12:05 AM, Eric Dumazet wrote: > >>> On Sat, Jul 27, 2019 at 4:23 AM Josh Hunt wrote: > > The current implementation of TCP MTU probing can considerably > underestimate the MTU on lossy connections allowing the MSS to get down > to > 48. We have found that in almost all of these cases on our networks these > paths can handle much larger MTUs meaning the connections are being > artificially limited. Even though TCP MTU probing can raise the MSS back > up > we have seen this not to be the case causing connections to be "stuck" > with > an MSS of 48 when heavy loss is present. > > Prior to pushing out this change we could not keep TCP MTU probing > enabled > b/c of the above reasons. Now with a reasonble floor set we've had it > enabled for the past 6 months. > >>> > >>> And what reasonable value have you used ??? > >> > >> Reasonable for some may not be reasonable for others hence the new > >> sysctl :) We're currently running with a fairly high value based off of > >> the v6 min MTU minus headers and options, etc. We went conservative with > >> our setting initially as it seemed a reasonable first step when > >> re-enabling TCP MTU probing since with no configurable floor we saw a # > >> of cases where connections were using severely reduced mss b/c of loss > >> and not b/c of actual path restriction. I plan to reevaluate the setting > >> at some point, but since the probing method is still the same it means > >> the same clients who got stuck with mss of 48 before will land at > >> whatever floor we set. Looking forward we are interested in trying to > >> improve TCP MTU probing so it does not penalize clients like this. > >> > >> A suggestion for a more reasonable floor default would be 512, which is > >> the same as the min_pmtu. 
Given both mechanisms are trying to achieve > >> the same goal it seems like they should have a similar min/floor. > >> > >>> > > The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives > administrators the ability to control the floor of MSS probing. > > Signed-off-by: Josh Hunt > --- > Documentation/networking/ip-sysctl.txt | 6 ++ > include/net/netns/ipv4.h | 1 + > net/ipv4/sysctl_net_ipv4.c | 9 + > net/ipv4/tcp_ipv4.c| 1 + > net/ipv4/tcp_timer.c | 2 +- > 5 files changed, 18 insertions(+), 1 deletion(-) > > diff --git a/Documentation/networking/ip-sysctl.txt > b/Documentation/networking/ip-sysctl.txt > index df33674799b5..49e95f438ed7 100644 > --- a/Documentation/networking/ip-sysctl.txt > +++ b/Documentation/networking/ip-sysctl.txt > @@ -256,6 +256,12 @@ tcp_base_mss - INTEGER > Path MTU discovery (MTU probing). If MTU probing is enabled, > this is the initial MSS used by the connection. > > +tcp_mtu_probe_floor - INTEGER > + If MTU probing is enabled this caps the minimum MSS used for > search_low > + for the connection. > + > + Default : 48 > + > tcp_min_snd_mss - INTEGER > TCP SYN and SYNACK messages usually advertise an ADVMSS option, > as described in RFC 1122 and RFC 6691. 
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h > index bc24a8ec1ce5..c0c0791b1912 100644 > --- a/include/net/netns/ipv4.h > +++ b/include/net/netns/ipv4.h > @@ -116,6 +116,7 @@ struct netns_ipv4 { > int sysctl_tcp_l3mdev_accept; > #endif > int sysctl_tcp_mtu_probing; > + int sysctl_tcp_mtu_probe_floor; > int sysctl_tcp_base_mss; > int sysctl_tcp_min_snd_mss; > int sysctl_tcp_probe_threshold; > diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c > index 0b980e841927..59ded25acd04 100644 > --- a/net/ipv4/sysctl_net_ipv4.c > +++ b/net/ipv4/sysctl_net_ipv4.c > @@ -820,6 +820,15 @@ static struct ctl_table ipv4_net_table[] = { > .extra2 = &tcp_min_snd_mss_max, > }, > { > + .procname = "tcp_mtu_probe_floor", > + .data = > &init_net.ipv4.sysctl_tcp_mtu_probe_floor, > + .maxlen = sizeof(int), > + .mode = 0644, > + .proc_handler = proc_dointvec_minmax, > + .extra1 = &tcp_min_snd_mss_min, > + .extra2 = &tcp_min_snd_mss_max, > + }, > + { > .procname = "tcp_probe_t
Re: [PATCH net] tcp: fix tcp_set_congestion_control() use from bpf hook
On Thu, Jul 18, 2019 at 10:28 PM Eric Dumazet wrote: > > Neal reported incorrect use of ns_capable() from bpf hook. > > bpf_setsockopt(...TCP_CONGESTION...) > -> tcp_set_congestion_control() >-> ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN) > -> ns_capable_common() > -> current_cred() > -> rcu_dereference_protected(current->cred, 1) > > Accessing 'current' in bpf context makes no sense, since packets > are processed from softirq context. > > As Neal stated : The capability check in tcp_set_congestion_control() > was written assuming a system call context, and then was reused from > a BPF call site. > > The fix is to add a new parameter to tcp_set_congestion_control(), > so that the ns_capable() call is only performed under the right > context. > > Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control") > Signed-off-by: Eric Dumazet > Cc: Lawrence Brakmo > Reported-by: Neal Cardwell > --- Acked-by: Neal Cardwell Thanks, Eric! neal
Re: Kernel BUG: epoll_wait() (and epoll_pwait) stall for 206 ms per call on sockets with a small-ish snd/rcv buffer.
On Sat, Jul 6, 2019 at 2:19 PM Carlo Wood wrote: > > While investigating this further, I read on > http://www.masterraghu.com/subjects/np/introduction/unix_network_programming_v1.3/ch07lev1sec5.html > under "SO_RCVBUF and SO_SNDBUF Socket Options": > > When setting the size of the TCP socket receive buffer, the > ordering of the function calls is important. This is because of > TCP's window scale option (Section 2.6), which is exchanged with > the peer on the SYN segments when the connection is established. > For a client, this means the SO_RCVBUF socket option must be set > before calling connect. For a server, this means the socket option > must be set for the listening socket before calling listen. Setting > this option for the connected socket will have no effect whatsoever > on the possible window scale option because accept does not return > with the connected socket until TCP's three-way handshake is > complete. That is why this option must be set for the listening > socket. (The sizes of the socket buffers are always inherited from > the listening socket by the newly created connected socket: pp. > 462–463 of TCPv2.) > > As mentioned in a previous post, I had already discovered about > needing to set the socket buffers before connect, but I didn't know > about setting them before the call to listen() in order to get the > buffer sizes inherited by the accepted sockets. > > After fixing this in my test program, all problems disappeared when > keeping the send and receive buffers the same on both sides. > > However, when only setting the send and receive buffers on the client > socket (not on the (accepted or) listen socket), epoll_wait() still > stalls 43ms. When the SO_SNDBUF is smaller than 33182 bytes. > > Here is the latest version of my test program: > > https://github.com/CarloWood/ai-evio-testsuite/blob/master/src/epoll_bug.c > > I have to retract most of my "bug" report, it might even not really be > a bug then... 
but nevertheless, what remains strange is the fact > that setting the socket buffer sizes on the accepted sockets can lead > to so much crippling effect, while the quoted website states: > > Setting this option for the connected socket will have no effect > whatsoever on the possible window scale option because accept does > not return with the connected socket until TCP's three-way > handshake is complete. > > And when only setting the socket buffer sizes for the client socket > (that I use to send back received data; so this is the sending > side now) then why does epoll_wait() stall 43 ms per call when the > receiving side is using the default (much larger) socket buffer sizes? > > That 43 ms is STILL crippling-- slowing down the transmission of the > data to a trickling speed compared to what it should be. Based on the magic numbers you cite, including the fact that this test program seems to send traffic over a loopback device (presumably MTU=65536), epoll_wait() stalling 43 ms (slightly longer than the typical Linux delayed ACK timer), and the problem only happening if SO_SNDBUF is smaller than 33182 bytes (half the MTU)... a guess would be that when you artificially make the SO_SNDBUF that low, then you are asking the kernel to only allow your sending sockets to buffer less than a single MTU of data, which means that the packets the sender sends are going to be less than the MTU, which means that the receiver may tend to eventually (after the initial quick ACKs expire) delay its ACKs in hopes of eventually receiving a full MSS of data (see __tcp_ack_snd_check()). Since the SO_SNDBUF is so small in this case, the sender then would not be able to write() or transmit anything else until the receiver sends a delayed ACK for the data ~40ms later, allowing the sending socket to free the previously sent data and trigger the sender's next EPOLLOUT and write(). 
You could try grabbing a packet capture of the traffic over your loopback device during the test to see if it matches that theory: tcpdump -i lo -s 100 -w /tmp/test.pcap cheers, neal
Re: tp->copied_seq used before assignment in tcp_check_urg
On Tue, Jun 11, 2019 at 2:46 AM Zhongjie Wang wrote: > > Hi Neal, > > Thanks for your valuable feedback! Yes, I think you are right. > It seems not a problem if tp->urg_data and tp->urg_seq are used together. > From our test results, we can only see there are some paths requiring > specific initial sequence number to reach. > But as you said, it would not cause a difference in the code logic. > We haven't observed any abnormal states. Great. Thanks for confirming! cheers, neal
Re: tp->copied_seq used before assignment in tcp_check_urg
On Mon, Jun 10, 2019 at 7:48 PM Zhongjie Wang wrote: > > Hi Neal, > > Thanks for your reply. Sorry, I made a mistake in my previous email. > After I double checked the source code, I think it should be tp->urg_seq, > which is used before assignment, instead of tp->copied_seq. > Still in the same if-statement: > > 5189 if (tp->urg_seq == tp->copied_seq && tp->urg_data && > 5190 !sock_flag(sk, SOCK_URGINLINE) && tp->copied_seq != tp->rcv_nxt) > { > 5191 struct sk_buff *skb = skb_peek(&sk->sk_receive_queue); > 5192 tp->copied_seq++; > 5193 if (skb && !before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq)) { > 5194 __skb_unlink(skb, &sk->sk_receive_queue); > 5195 __kfree_skb(skb); // wzj(a) > 5196 } > 5197 } > 5198 > 5199 tp->urg_data = TCP_URG_NOTYET; > 5200 tp->urg_seq = ptr; > > It compares tp->copied_seq with tp->urg_seq. > And I found only 1 assignment of tp->urg_seq in the code base, > which is after the if-statement in the same tcp_check_urg() function. > > So it seems tp->urg_seq is not assigned to any sequence number before > its first use. > Is that correct? I agree, it does seem that tp->urg_seq is not assigned to any sequence number before its first use. AFAICT from a quick read of the code, this does not matter. It seems the idea is for tp->urg_data and tp->urg_seq to be set and used together, so that tp->urg_seq is never relied upon to be set to something meaningful unless tp->urg_data has also been verified to be set to something (something non-zero). I suppose it might be more clear to structure the code to check urg_data first: if (tp->urg_data && tp->urg_seq == tp->copied_seq && ...but in practice AFAICT it does not make a difference, since no matter which order the expressions use, both conditions must be true for the code to have any side effects. > P.S. In our symbolic execution tool, we found an execution path that > requires the client initial sequence number (ISN) to be 0xFF FF FF FF. 
> And when it traverse that path, the tp->copied_seq is equal to (client > ISN + 1), and compared with 0 in this if-statatement. > Therefore the client ISN has to be exactly 0xFF FF FF FF to hit this > execution path. > > To trigger this, we first sent a SYN packet, and then an ACK packet > with urgent pointer. Does your test show any invalid behavior by the TCP endpoint? For example, does the state in tcp_sock become incorrect, or is some system call return value or outgoing packet incorrect? AFAICT from the scenario you describe it seems that the "if" condition would fail when the receiver processes the ACK packet with urgent pointer, because tp->urg_data was not yet set at this point. Thus it would seem that in this case it does not matter that tp->urg_seq is not assigned to any sequence number before being first used. cheers, neal
Re: tp->copied_seq used before assignment in tcp_check_urg
On Sun, Jun 9, 2019 at 11:12 PM Zhongjie Wang wrote: ... > It compares tp->copied_seq with tcp->rcv_nxt. > However, tp->copied_seq is only assigned to an appropriate sequence number > when > it copies data to user space. So here tp->copied_seq could be equal to 0, > which is its initial value, if no data are copied yet. I don't believe that's the case. As far as I can see, the tp->copied_seq field is initialized to tp->rcv_nxt in the various places where TCP connections are initialized: tp->copied_seq = tp->rcv_nxt; > In this case, the condition becomes 0 != tp->rcv_nxt, > and it renders this comparison ineffective. > > For example, if we send a SYN packet with initial sequence number 0xFF FF FF > FF, > and after receiving SYN/ACK response, then send a ACK packet with sequence > number 0, it will bypass this if-then block. > > We are not sure how this would affect the TCP logic. Could you please confirm > that tp->copied_seq should be assigned to a sequence number before its use? Yes, the tp->copied_seq ought to be assigned to a sequence number before its use, and AFAICT it is. Can you identify a specific sequence of code execution (and ideally construct a packetdrill script) where tp->copied_seq is somehow read before it is initialized? cheers, neal
Re: [PATCH net] dctcp: more accurate tracking of packets delivery
On Thu, Apr 11, 2019 at 8:55 AM Eric Dumazet wrote: > > After commit e21db6f69a95 ("tcp: track total bytes delivered with ECN CE > marks") > core TCP stack does a very good job tracking ECN signals. > > The "sender's best estimate of CE information" Yuchung mentioned in his > patch is indeed the best we can do. > > DCTCP can use tp->delivered_ce and tp->delivered to not duplicate the logic, > and use the existing best estimate. > > This solves some problems, since current DCTCP logic does not deal with losses > and/or GRO or ack aggregation very well. > > This also removes a dubious use of inet_csk(sk)->icsk_ack.rcv_mss > (this should have been tp->mss_cache), and a 64 bit divide. > > Finally, we can see that the DCTCP logic, calling dctcp_update_alpha() for > every ACK could be done differently, calling it only once per RTT. > > Signed-off-by: Eric Dumazet > Cc: Yuchung Cheng > Cc: Neal Cardwell > Cc: Soheil Hassas Yeganeh > Cc: Florian Westphal > Cc: Daniel Borkmann > Cc: Lawrence Brakmo > Cc: Abdul Kabbani > --- Thanks, Eric! There is a slight change in semantics, switching from bytes to packets. But my sense is that the new semantics of packet-based counting are a better approach, and a closer fit to the design in the DCTCP paper from SIGCOMM 2010, which talks about measuring "the fraction of packets that were marked in the last window of data", and using that to estimate "the probability that the queue size is greater than K". Acked-by: Neal Cardwell thanks, neal
Re: [PATCH v2 bpf-next 5/7] bpf: sysctl for probe_on_drop
On Wed, Apr 3, 2019 at 8:13 PM brakmo wrote: > > When a packet is dropped when calling queue_xmit in __tcp_transmit_skb > and packets_out is 0, it is beneficial to set a small probe timer. > Otherwise, the throughput for the flow can suffer because it may need to > depend on the probe timer to start sending again. The default value for > the probe timer is at least 200ms, this patch sets it to 20ms when a > packet is dropped and there are no other packets in flight. > > This patch introduces a new sysctl, sysctl_tcp_probe_on_drop_ms, that is > used to specify the duration of the probe timer for the case described > earlier. The allowed values are between 0 and TCP_RTO_MIN. A value of 0 > disables setting the probe timer with a small value. > > Signed-off-by: Lawrence Brakmo > --- > include/net/netns/ipv4.h | 1 + > net/ipv4/sysctl_net_ipv4.c | 10 ++ > net/ipv4/tcp_ipv4.c| 1 + > net/ipv4/tcp_output.c | 18 +++--- > 4 files changed, 27 insertions(+), 3 deletions(-) > > diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h > index 7698460a3dd1..26c52d294dea 100644 > --- a/include/net/netns/ipv4.h > +++ b/include/net/netns/ipv4.h > @@ -166,6 +166,7 @@ struct netns_ipv4 { > int sysctl_tcp_wmem[3]; > int sysctl_tcp_rmem[3]; > int sysctl_tcp_comp_sack_nr; > + int sysctl_tcp_probe_on_drop_ms; > unsigned long sysctl_tcp_comp_sack_delay_ns; > struct inet_timewait_death_row tcp_death_row; > int sysctl_max_syn_backlog; > diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c > index 2316c08e9591..b1e4420b6d6c 100644 > --- a/net/ipv4/sysctl_net_ipv4.c > +++ b/net/ipv4/sysctl_net_ipv4.c > @@ -49,6 +49,7 @@ static int ip_ping_group_range_min[] = { 0, 0 }; > static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX }; > static int comp_sack_nr_max = 255; > static u32 u32_max_div_HZ = UINT_MAX / HZ; > +static int probe_on_drop_max = TCP_RTO_MIN; > > /* obsolete */ > static int sysctl_tcp_low_latency __read_mostly; > @@ -1228,6 +1229,15 @@ static struct 
ctl_table ipv4_net_table[] = { > .extra1 = &zero, > .extra2 = &comp_sack_nr_max, > }, > + { > + .procname = "tcp_probe_on_drop_ms", > + .data = &init_net.ipv4.sysctl_tcp_probe_on_drop_ms, > + .maxlen = sizeof(int), > + .mode = 0644, > + .proc_handler = proc_dointvec_minmax, The tcp_reset_xmit_timer () function expects a time in jiffies, so this would probably need to be proc_dointvec_ms_jiffies? > + .extra1 = &zero, > + .extra2 = &probe_on_drop_max, > + }, > { > .procname = "udp_rmem_min", > .data = &init_net.ipv4.sysctl_udp_rmem_min, > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c > index 3979939804b7..3df6735f0f07 100644 > --- a/net/ipv4/tcp_ipv4.c > +++ b/net/ipv4/tcp_ipv4.c > @@ -2686,6 +2686,7 @@ static int __net_init tcp_sk_init(struct net *net) > spin_lock_init(&net->ipv4.tcp_fastopen_ctx_lock); > net->ipv4.sysctl_tcp_fastopen_blackhole_timeout = 60 * 60; > atomic_set(&net->ipv4.tfo_active_disable_times, 0); > + net->ipv4.sysctl_tcp_probe_on_drop_ms = 20; The tcp_reset_xmit_timer () function expects a time in jiffies, so this would probably need to be msecs_to_jiffies(20)? > > /* Reno is always built in */ > if (!net_eq(net, &init_net) && > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index e265d1aeeb66..f101e4d7c1ae 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -1154,9 +1154,21 @@ static int __tcp_transmit_skb(struct sock *sk, struct > sk_buff *skb, > > err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl); > > - if (unlikely(err > 0)) { > - tcp_enter_cwr(sk); > - err = net_xmit_eval(err); > + if (unlikely(err)) { > + if (unlikely(err > 0)) { > + tcp_enter_cwr(sk); > + err = net_xmit_eval(err); > + } > + /* Packet was dropped. If there are no packets out, > +* we may need to depend on probe timer to start sending > +* again. Hence, use a smaller value. 
> +*/ > + if (!tp->packets_out && !inet_csk(sk)->icsk_pending && > + sock_net(sk)->ipv4.sysctl_tcp_probe_on_drop_ms > 0) { > + tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0, > + > sock_net(sk)->ipv4.sysctl_tcp_probe_on_drop_ms, > +TCP_RTO_MAX, NULL); > + } My reading of the code is that any timer that was set here would typically be quickly overridden by the tcp_check_probe_timer() that
Re: [PATCH net] tcp: repaired skbs must init their tso_segs
On Sat, Feb 23, 2019 at 6:51 PM Eric Dumazet wrote: > > syzbot reported a WARN_ON(!tcp_skb_pcount(skb)) > in tcp_send_loss_probe() [1] > > This was caused by TCP_REPAIR-sent skbs that inadvertently > were missing a call to tcp_init_tso_segs() > Acked-by: Neal Cardwell Thanks, Eric! neal
Re: [PATCH net 2/2] tcp: tcp_v4_err() should be more careful
On Fri, Feb 15, 2019 at 4:36 PM Eric Dumazet wrote: > > ICMP handlers are not very often stressed, we should > make them more resilient to bugs that might surface in > the future. > > If there is no packet in retransmit queue, we should > avoid a NULL deref. > > Signed-off-by: Eric Dumazet > Reported-by: soukjin bae > --- > net/ipv4/tcp_ipv4.c | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) Acked-by: Neal Cardwell Thanks! neal
Re: [PATCH net 1/2] tcp: clear icsk_backoff in tcp_write_queue_purge()
On Fri, Feb 15, 2019 at 4:36 PM Eric Dumazet wrote: > > soukjin bae reported a crash in tcp_v4_err() handling > ICMP_DEST_UNREACH after tcp_write_queue_head(sk) > returned a NULL pointer. > > Current logic should have prevented this : > > if (seq != tp->snd_una || !icsk->icsk_retransmits || > !icsk->icsk_backoff || fastopen) > break; > > Problem is the write queue might have been purged > and icsk_backoff has not been cleared. > > Signed-off-by: Eric Dumazet > Reported-by: soukjin bae > --- > net/ipv4/tcp.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) Acked-by: Neal Cardwell Thanks! neal
Re: [PATCH net] tcp: clear tp->retrans_stamp in tcp_finish_connect()
On Wed, Dec 19, 2018 at 4:50 PM Eric Dumazet wrote: > > tcp_connect() does : > > tp->retrans_stamp = tcp_time_stamp(tp); > > When 3WHS completes, we need to clear retrans_stamp otherwise > various things can break. > > This bug became more visible after commit b701a99e431d ("tcp: Add > tcp_clamp_rto_to_user_timeout() helper to improve accuracy"), but > predates git history. > > Signed-off-by: Eric Dumazet > Acked-by: Soheil Hassas Yeganeh > --- Acked-by: Neal Cardwell Thanks! neal
Re: [PATCH v3 net-next 4/4] tcp: implement coalescing on backlog queue
On Tue, Nov 27, 2018 at 5:42 PM Eric Dumazet wrote: > > In case GRO is not as efficient as it should be or disabled, > we might have a user thread trapped in __release_sock() while > the softirq handler floods packets up to the point we have to drop. > > This patch balances work done from user thread and softirq, > to give more chances to __release_sock() to complete its work > before new packets are added to the backlog. > > This also helps if we receive many ACK packets, since GRO > does not aggregate them. > > This patch brings ~60% throughput increase on a receiver > without GRO, but the spectacular gain is really the > 1000x release_sock() latency reduction I have measured. > > Signed-off-by: Eric Dumazet > Cc: Neal Cardwell > Cc: Yuchung Cheng > --- Acked-by: Neal Cardwell Thanks! neal
Re: [PATCH v3 net-next 2/4] tcp: take care of compressed acks in tcp_add_reno_sack()
On Tue, Nov 27, 2018 at 5:42 PM Eric Dumazet wrote: > > Neal pointed out that non sack flows might suffer from ACK compression > added in the following patch ("tcp: implement coalescing on backlog queue") > > Instead of tweaking tcp_add_backlog() we can take into > account how many ACK were coalesced, this information > will be available in skb_shinfo(skb)->gso_segs > > Signed-off-by: Eric Dumazet > --- Acked-by: Neal Cardwell Thanks! neal
Re: [PATCH v2 net-next 4/4] tcp: implement coalescing on backlog queue
On Tue, Nov 27, 2018 at 10:57 AM Eric Dumazet wrote: > > In case GRO is not as efficient as it should be or disabled, > we might have a user thread trapped in __release_sock() while > softirq handler flood packets up to the point we have to drop. > > This patch balances work done from user thread and softirq, > to give more chances to __release_sock() to complete its work > before new packets are added to the backlog. > > This also helps if we receive many ACK packets, since GRO > does not aggregate them. > > This patch brings ~60% throughput increase on a receiver > without GRO, but the spectacular gain is really on > 1000x release_sock() latency reduction I have measured. > > Signed-off-by: Eric Dumazet > Cc: Neal Cardwell > Cc: Yuchung Cheng > --- ... > + if (TCP_SKB_CB(tail)->end_seq != TCP_SKB_CB(skb)->seq || > + TCP_SKB_CB(tail)->ip_dsfield != TCP_SKB_CB(skb)->ip_dsfield || > +#ifdef CONFIG_TLS_DEVICE > + tail->decrypted != skb->decrypted || > +#endif > + thtail->doff != th->doff || > + memcmp(thtail + 1, th + 1, hdrlen - sizeof(*th))) > + goto no_coalesce; > + > + __skb_pull(skb, hdrlen); > + if (skb_try_coalesce(tail, skb, &fragstolen, &delta)) { > + thtail->window = th->window; > + > + TCP_SKB_CB(tail)->end_seq = TCP_SKB_CB(skb)->end_seq; > + > + if (after(TCP_SKB_CB(skb)->ack_seq, > TCP_SKB_CB(tail)->ack_seq)) > + TCP_SKB_CB(tail)->ack_seq = TCP_SKB_CB(skb)->ack_seq; > + > + TCP_SKB_CB(tail)->tcp_flags |= TCP_SKB_CB(skb)->tcp_flags; I wonder if technically perhaps the logic should skip coalescing if the tail or skb has the TCP_FLAG_URG bit set? It seems if skbs are coalesced, and some have urgent data and some do not, then the TCP_FLAG_URG bit will be accumulated into the tail header, but there will be no way to ensure the correct urgent offsets for the one or more skbs with urgent data are passed along.
Thinking out loud, I guess if this is ECN/DCTCP and some ACKs have TCP_FLAG_ECE and some don't, this will effectively have all ACKed bytes be treated as ECN-marked. Probably OK, since if this coalescing path is being hit the sender may be overloaded and slowing down might be a good thing. Otherwise, looks great to me. Thanks for doing this! neal
Re: [PATCH v2 net-next 3/4] tcp: make tcp_space() aware of socket backlog
On Tue, Nov 27, 2018 at 10:57 AM Eric Dumazet wrote: > > Jean-Louis Dupond reported poor iscsi TCP receive performance > that we tracked to backlog drops. > > Apparently we fail to send window updates reflecting the > fact that we are under stress. > > Note that we might lack a proper window increase when > backlog is fully processed, since __release_sock() clears > sk->sk_backlog.len _after_ all skbs have been processed. > > This should not matter in practice. If we had a significant > load through socket backlog, we are in a dangerous > situation. > > Reported-by: Jean-Louis Dupond > Signed-off-by: Eric Dumazet > --- Acked-by: Neal Cardwell Nice. Thanks! neal
Re: [PATCH v2 net-next 2/4] tcp: take care of compressed acks in tcp_add_reno_sack()
On Tue, Nov 27, 2018 at 10:57 AM Eric Dumazet wrote: > > Neal pointed out that non sack flows might suffer from ACK compression > added in the following patch ("tcp: implement coalescing on backlog queue") > > Instead of tweaking tcp_add_backlog() we can take into > account how many ACK were coalesced, this information > will be available in skb_shinfo(skb)->gso_segs > > Signed-off-by: Eric Dumazet > --- ... > @@ -2679,8 +2683,8 @@ static void tcp_process_loss(struct sock *sk, int flag, > bool is_dupack, > /* A Reno DUPACK means new data in F-RTO step 2.b above are > * delivered. Lower inflight to clock out (re)tranmissions. > */ > - if (after(tp->snd_nxt, tp->high_seq) && is_dupack) > - tcp_add_reno_sack(sk); > + if (after(tp->snd_nxt, tp->high_seq)) > + tcp_add_reno_sack(sk, num_dupack); > else if (flag & FLAG_SND_UNA_ADVANCED) > tcp_reset_reno_sack(tp); > } I think this probably should be checking num_dupack, something like: + if (after(tp->snd_nxt, tp->high_seq) && num_dupack) + tcp_add_reno_sack(sk, num_dupack); If we don't check num_dupack, that seems to mean that after FRTO sends the two new data packets (making snd_nxt after high_seq), the patch would have a particular non-SACK FRTO loss recovery always go into the "if" branch where we tcp_add_reno_sack() function, and we would never have a chance to get to the "else" branch where we check if FLAG_SND_UNA_ADVANCED and zero out the reno SACKs. Otherwise the patch looks great to me. Thanks for doing this! neal
Re: [PATCH v2 net-next 1/4] tcp: hint compiler about sack flows
On Tue, Nov 27, 2018 at 10:57 AM Eric Dumazet wrote: > > Tell the compiler that most TCP flows are using SACK these days. > > There is no need to add the unlikely() clause in tcp_is_reno(), > the compiler is able to infer it. > > Signed-off-by: Eric Dumazet > --- Acked-by: Neal Cardwell Nice. Thanks! neal
Re: [PATCH net-next 2/3] tcp: implement coalescing on backlog queue
On Wed, Nov 21, 2018 at 12:52 PM Eric Dumazet wrote: > > In case GRO is not as efficient as it should be or disabled, > we might have a user thread trapped in __release_sock() while > softirq handler flood packets up to the point we have to drop. > > This patch balances work done from user thread and softirq, > to give more chances to __release_sock() to complete its work. > > This also helps if we receive many ACK packets, since GRO > does not aggregate them. Would this coalesce duplicate incoming ACK packets? Is there a risk that this would eliminate incoming dupacks needed for fast recovery in non-SACK connections? Perhaps pure ACKs should only be coalesced if the ACK field is different? neal
Re: [PATCH net-next 3/3] tcp: get rid of tcp_tso_should_defer() dependency on HZ/jiffies
On Sun, Nov 11, 2018 at 9:41 AM Eric Dumazet wrote: > > tcp_tso_should_defer() first heuristic is to not defer > if last send is "old enough". > > Its current implementation uses jiffies and its low granularity. > > TSO autodefer performance should not rely on kernel HZ :/ > > After EDT conversion, we have state variables in nanoseconds that > can allow us to properly implement the heuristic. > > This patch increases TSO chunk sizes on medium rate flows, > especially when receivers do not use GRO or similar aggregation. > > It also reduces bursts for HZ=100 or HZ=250 kernels, making TCP > behavior more uniform. > > Signed-off-by: Eric Dumazet > Acked-by: Soheil Hassas Yeganeh > --- Nice. Thanks! Acked-by: Neal Cardwell neal
Re: [PATCH net-next 2/3] tcp: refine tcp_tso_should_defer() after EDT adoption
On Sun, Nov 11, 2018 at 9:41 AM Eric Dumazet wrote: > > tcp_tso_should_defer() last step tries to check if the probable > next ACK packet is coming in less than half rtt. > > Problem is that the head->tstamp might be in the future, > so we need to use signed arithmetics to avoid overflows. > > Signed-off-by: Eric Dumazet > Acked-by: Soheil Hassas Yeganeh > --- > net/ipv4/tcp_output.c | 7 --- > 1 file changed, 4 insertions(+), 3 deletions(-) Thanks! Acked-by: Neal Cardwell neal
Re: [PATCH net-next 1/3] tcp: do not try to defer skbs with eor mark (MSG_EOR)
On Sun, Nov 11, 2018 at 9:41 AM Eric Dumazet wrote: > > Applications using MSG_EOR are giving a strong hint to TCP stack : > > Subsequent sendmsg() can not append more bytes to skbs having > the EOR mark. > > Do not try to TSO defer such skbs, there is really no hope. > > Signed-off-by: Eric Dumazet > Acked-by: Soheil Hassas Yeganeh > --- > net/ipv4/tcp_output.c | 4 > 1 file changed, 4 insertions(+) Thanks! Acked-by: Neal Cardwell neal
Re: [PATCH net-next] net_sched: sch_fq: add dctcp-like marking
On Sun, Nov 11, 2018 at 12:11 PM Eric Dumazet wrote: > > Similar to 80ba92fa1a92 ("codel: add ce_threshold attribute") > > After EDT adoption, it became easier to implement DCTCP-like CE marking. > > In many cases, queues are not building in the network fabric but on > the hosts themselves. > > If packets leaving fq missed their Earliest Departure Time by XXX usec, > we mark them with ECN CE. This gives a feedback (after one RTT) to > the sender to slow down and find better operating mode. > > Example : > > tc qd replace dev eth0 root fq ce_threshold 2.5ms > > Signed-off-by: Eric Dumazet > --- Very nice! Thanks, Eric. :-) Acked-by: Neal Cardwell neal
[PATCH net-next] tcp_bbr: update comments to reflect pacing_margin_percent
Recently, in commit ab408b6dc744 ("tcp: switch tcp and sch_fq to new earliest departure time model"), the TCP BBR code switched to a new approach of using an explicit bbr_pacing_margin_percent for shaving a pacing rate "haircut", rather than the previous implicit approach. Update an old comment to reflect the new approach. Signed-off-by: Neal Cardwell Signed-off-by: Yuchung Cheng Signed-off-by: Soheil Hassas Yeganeh Signed-off-by: Eric Dumazet --- net/ipv4/tcp_bbr.c | 15 +++ 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c index 9277abdd822a0..0f497fc49c3fe 100644 --- a/net/ipv4/tcp_bbr.c +++ b/net/ipv4/tcp_bbr.c @@ -128,7 +128,12 @@ static const u32 bbr_probe_rtt_mode_ms = 200; /* Skip TSO below the following bandwidth (bits/sec): */ static const int bbr_min_tso_rate = 120; -/* Pace at ~1% below estimated bw, on average, to reduce queue at bottleneck. */ +/* Pace at ~1% below estimated bw, on average, to reduce queue at bottleneck. + * In order to help drive the network toward lower queues and low latency while + * maintaining high utilization, the average pacing rate aims to be slightly + * lower than the estimated bandwidth. This is an important aspect of the + * design. + */ static const int bbr_pacing_margin_percent = 1; /* We use a high_gain value of 2/ln(2) because it's the smallest pacing gain @@ -247,13 +252,7 @@ static void bbr_init_pacing_rate_from_rtt(struct sock *sk) sk->sk_pacing_rate = bbr_bw_to_pacing_rate(sk, bw, bbr_high_gain); } -/* Pace using current bw estimate and a gain factor. In order to help drive the - * network toward lower queues while maintaining high utilization and low - * latency, the average pacing rate aims to be slightly (~1%) lower than the - * estimated bandwidth. This is an important aspect of the design. In this - * implementation this slightly lower pacing rate is achieved implicitly by not - * including link-layer headers in the packet size used for the pacing rate.
- */ +/* Pace using current bw estimate and a gain factor. */ static void bbr_set_pacing_rate(struct sock *sk, u32 bw, int gain) { struct tcp_sock *tp = tcp_sk(sk); -- 2.19.1.930.g4563a0d9d0-goog
[PATCH net-next 2/2] tcp_bbr: centralize code to set gains
Centralize the code that sets gains used for computing cwnd and pacing rate. This simplifies the code and makes it easier to change the state machine or (in the future) dynamically change the gain values and ensure that the correct gain values are always used. Signed-off-by: Neal Cardwell Signed-off-by: Yuchung Cheng Signed-off-by: Soheil Hassas Yeganeh Signed-off-by: Priyaranjan Jha Signed-off-by: Eric Dumazet --- net/ipv4/tcp_bbr.c | 40 ++-- 1 file changed, 30 insertions(+), 10 deletions(-) diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c index 4cc2223d2cd54..9277abdd822a0 100644 --- a/net/ipv4/tcp_bbr.c +++ b/net/ipv4/tcp_bbr.c @@ -521,8 +521,6 @@ static void bbr_advance_cycle_phase(struct sock *sk) bbr->cycle_idx = (bbr->cycle_idx + 1) & (CYCLE_LEN - 1); bbr->cycle_mstamp = tp->delivered_mstamp; - bbr->pacing_gain = bbr->lt_use_bw ? BBR_UNIT : - bbr_pacing_gain[bbr->cycle_idx]; } /* Gain cycling: cycle pacing gain to converge to fair share of available bw. */ @@ -540,8 +538,6 @@ static void bbr_reset_startup_mode(struct sock *sk) struct bbr *bbr = inet_csk_ca(sk); bbr->mode = BBR_STARTUP; - bbr->pacing_gain = bbr_high_gain; - bbr->cwnd_gain = bbr_high_gain; } static void bbr_reset_probe_bw_mode(struct sock *sk) @@ -549,8 +545,6 @@ static void bbr_reset_probe_bw_mode(struct sock *sk) struct bbr *bbr = inet_csk_ca(sk); bbr->mode = BBR_PROBE_BW; - bbr->pacing_gain = BBR_UNIT; - bbr->cwnd_gain = bbr_cwnd_gain; bbr->cycle_idx = CYCLE_LEN - 1 - prandom_u32_max(bbr_cycle_rand); bbr_advance_cycle_phase(sk);/* flip to next phase of gain cycle */ } @@ -768,8 +762,6 @@ static void bbr_check_drain(struct sock *sk, const struct rate_sample *rs) if (bbr->mode == BBR_STARTUP && bbr_full_bw_reached(sk)) { bbr->mode = BBR_DRAIN; /* drain queue we created */ - bbr->pacing_gain = bbr_drain_gain; /* pace slow to drain */ - bbr->cwnd_gain = bbr_high_gain; /* maintain cwnd */ tcp_sk(sk)->snd_ssthresh = bbr_target_cwnd(sk, bbr_max_bw(sk), BBR_UNIT); } /* fall through to check if 
in-flight is already small: */ @@ -831,8 +823,6 @@ static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs) if (bbr_probe_rtt_mode_ms > 0 && filter_expired && !bbr->idle_restart && bbr->mode != BBR_PROBE_RTT) { bbr->mode = BBR_PROBE_RTT; /* dip, drain queue */ - bbr->pacing_gain = BBR_UNIT; - bbr->cwnd_gain = BBR_UNIT; bbr_save_cwnd(sk); /* note cwnd so we can restore it */ bbr->probe_rtt_done_stamp = 0; } @@ -860,6 +850,35 @@ static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs) bbr->idle_restart = 0; } +static void bbr_update_gains(struct sock *sk) +{ + struct bbr *bbr = inet_csk_ca(sk); + + switch (bbr->mode) { + case BBR_STARTUP: + bbr->pacing_gain = bbr_high_gain; + bbr->cwnd_gain = bbr_high_gain; + break; + case BBR_DRAIN: + bbr->pacing_gain = bbr_drain_gain; /* slow, to drain */ + bbr->cwnd_gain = bbr_high_gain; /* keep cwnd */ + break; + case BBR_PROBE_BW: + bbr->pacing_gain = (bbr->lt_use_bw ? + BBR_UNIT : + bbr_pacing_gain[bbr->cycle_idx]); + bbr->cwnd_gain = bbr_cwnd_gain; + break; + case BBR_PROBE_RTT: + bbr->pacing_gain = BBR_UNIT; + bbr->cwnd_gain = BBR_UNIT; + break; + default: + WARN_ONCE(1, "BBR bad mode: %u\n", bbr->mode); + break; + } +} + static void bbr_update_model(struct sock *sk, const struct rate_sample *rs) { bbr_update_bw(sk, rs); @@ -867,6 +886,7 @@ static void bbr_update_model(struct sock *sk, const struct rate_sample *rs) bbr_check_full_bw_reached(sk, rs); bbr_check_drain(sk, rs); bbr_update_min_rtt(sk, rs); + bbr_update_gains(sk); } static void bbr_main(struct sock *sk, const struct rate_sample *rs) -- 2.19.1.331.ge82ca0e54c-goog
[PATCH net-next 0/2] tcp_bbr: TCP BBR changes for EDT pacing model
Two small patches for TCP BBR to follow up with Eric's recent work to change the TCP and fq pacing machinery to an "earliest departure time" (EDT) model: - The first patch adjusts the TCP BBR logic to work with the new "earliest departure time" (EDT) pacing model. - The second patch adjusts the TCP BBR logic to centralize the setting of gain values, to simplify the code and prepare for future changes. Neal Cardwell (2): tcp_bbr: adjust TCP BBR for departure time pacing tcp_bbr: centralize code to set gains net/ipv4/tcp_bbr.c | 77 ++ 1 file changed, 65 insertions(+), 12 deletions(-) -- 2.19.1.331.ge82ca0e54c-goog
[PATCH net-next 1/2] tcp_bbr: adjust TCP BBR for departure time pacing
Adjust TCP BBR for the new departure time pacing model in the recent commit ab408b6dc7449 ("tcp: switch tcp and sch_fq to new earliest departure time model"). With TSQ and pacing at lower layers, there are often several skbs queued in the pacing layer, and thus there is less data "in the network" than "in flight". With departure time pacing at lower layers (e.g. fq or potential future NICs), the data in the pacing layer now has a pre-scheduled ("baked-in") departure time that cannot be changed, even if the congestion control algorithm decides to use a new pacing rate. This means that there can be a non-trivial lag between when BBR makes a pacing rate change and when the inter-skb pacing delays change. After a pacing rate change, the number of packets in the network can gradually evolve to be higher or lower, depending on whether the sending rate is higher or lower than the delivery rate. Thus ignoring this lag can cause significant overshoot, with the flow ending up with too many or too few packets in the network. This commit changes BBR to adapt its pacing rate based on the amount of data in the network that it estimates has already been "baked in" by previous departure time decisions. We estimate the number of our packets that will be in the network at the earliest departure time (EDT) for the next skb scheduled as: in_network_at_edt = inflight_at_edt - (EDT - now) * bw If we're increasing the amount of data in the network ("in_network"), then we want to know if the transmit of the EDT skb will push in_network above the target, so our answer includes bbr_tso_segs_goal() from the skb departing at EDT. If we're decreasing in_network, then we want to know if in_network will sink too low just before the EDT transmit, so our answer does not include the segments from the skb departing at EDT. Why do we treat pacing_gain > 1.0 case and pacing_gain < 1.0 case differently? The in_network curve is a step function: in_network goes up on transmits, and down on ACKs. 
To accurately predict when in_network will go beyond our target value, this will happen on different events, depending on whether we're concerned about in_network potentially going too high or too low: o if pushing in_network up (pacing_gain > 1.0), then in_network goes above target upon a transmit event o if pushing in_network down (pacing_gain < 1.0), then in_network goes below target upon an ACK event This commit changes the BBR state machine to use this estimated "packets in network" value to make its decisions. Signed-off-by: Neal Cardwell Signed-off-by: Yuchung Cheng Signed-off-by: Eric Dumazet --- net/ipv4/tcp_bbr.c | 37 +++-- 1 file changed, 35 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c index b88081285fd17..4cc2223d2cd54 100644 --- a/net/ipv4/tcp_bbr.c +++ b/net/ipv4/tcp_bbr.c @@ -369,6 +369,39 @@ static u32 bbr_target_cwnd(struct sock *sk, u32 bw, int gain) return cwnd; } +/* With pacing at lower layers, there's often less data "in the network" than + * "in flight". With TSQ and departure time pacing at lower layers (e.g. fq), + * we often have several skbs queued in the pacing layer with a pre-scheduled + * earliest departure time (EDT). BBR adapts its pacing rate based on the + * inflight level that it estimates has already been "baked in" by previous + * departure time decisions. We calculate a rough estimate of the number of our + * packets that might be in the network at the earliest departure time for the + * next skb scheduled: + * in_network_at_edt = inflight_at_edt - (EDT - now) * bw + * If we're increasing inflight, then we want to know if the transmit of the + * EDT skb will push inflight above the target, so inflight_at_edt includes + * bbr_tso_segs_goal() from the skb departing at EDT. If decreasing inflight, + * then estimate if inflight will sink too low just before the EDT transmit. 
+ */ +static u32 bbr_packets_in_net_at_edt(struct sock *sk, u32 inflight_now) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct bbr *bbr = inet_csk_ca(sk); + u64 now_ns, edt_ns, interval_us; + u32 interval_delivered, inflight_at_edt; + + now_ns = tp->tcp_clock_cache; + edt_ns = max(tp->tcp_wstamp_ns, now_ns); + interval_us = div_u64(edt_ns - now_ns, NSEC_PER_USEC); + interval_delivered = (u64)bbr_bw(sk) * interval_us >> BW_SCALE; + inflight_at_edt = inflight_now; + if (bbr->pacing_gain > BBR_UNIT) /* increasing inflight */ + inflight_at_edt += bbr_tso_segs_goal(sk); /* include EDT skb */ + if (interval_delivered >= inflight_at_edt) + return 0; + return inflight_at_edt - interval_delivered; +} + /* An optimization in BBR to reduce losses: On the first round of recovery,
Re: Why not use all the syn queues? in the function "tcp_conn_request", I have some questions.
On Sat, Sep 8, 2018 at 11:23 AM Ttttabcd wrote: > > Thank you very much for your previous answer, sorry for the inconvenience. > > But now I want to ask you one more question. > > The question is why we need two variables to control the syn queue? > > The first is the "backlog" parameter of the "listen" system call that > controls the maximum length limit of the syn queue, it also controls the > accept queue. By default, and essentially always in practice (AFAIK), Linux installations enable syncookies. With syncookies, there is essentially no limit on the syn queue, or number of incomplete passive connections (as the man page you quoted notes). So in practice the listen() parameter usually controls only the accept queue. > The second is /proc/sys/net/ipv4/tcp_max_syn_backlog, which also controls the > maximum length limit of the syn queue. > > So simply changing one of them and wanting to increase the syn queue is not > working. > > In our last discussion, I understood tcp_max_syn_backlog will retain the last > quarter to the IP that has been proven to be alive That discussion pertains to a code path that is relevant if syncookies are disabled, which is very uncommon (see above). > But if tcp_max_syn_backlog is very large, the syn queue will be filled as > well. > > So I don't understand why not just use a variable to control the syn queue. > > For example, just use tcp_max_syn_backlog, which is the maximum length limit > for the syn queue, and it can also be retained to prove that the IP remains > the last quarter. > > The backlog parameter of the listen system call only controls the accpet > queue. > > I feel this is more reasonable. If I don't look at the source code, I really > can't guess the backlog parameter actually controls the syn queue. > > I always thought that it only controlled the accept queue before I looked at > the source code, because the man page is written like this. 
Keep in mind that the semantics of the listen() argument and the /proc/sys/net/ipv4/tcp_max_syn_backlog sysctl knob, as described in the man page, are part of the Linux kernel's user-visible API. So, in essence, they cannot be changed. Changing the semantics of system calls and sysctl knobs breaks applications and system configuration scripts. :-) neal
Re: Why not use all the syn queues? in the function "tcp_conn_request", I have some questions.
On Tue, Sep 4, 2018 at 1:48 AM Ttttabcd wrote: > > Hello everyone,recently I am looking at the source code for handling TCP > three-way handshake(Linux Kernel version 4.18.5). > > I found some strange places in the source code for handling syn messages. > > in the function "tcp_conn_request" > > This code will be executed when we don't enable the syn cookies. > > if (!net->ipv4.sysctl_tcp_syncookies && > (net->ipv4.sysctl_max_syn_backlog - > inet_csk_reqsk_queue_len(sk) < > (net->ipv4.sysctl_max_syn_backlog >> 2)) && > !tcp_peer_is_proven(req, dst)) { > /* Without syncookies last quarter of > * backlog is filled with destinations, > * proven to be alive. > * It means that we continue to communicate > * to destinations, already remembered > * to the moment of synflood. > */ > pr_drop_req(req, ntohs(tcp_hdr(skb)->source), > rsk_ops->family); > goto drop_and_release; > } > > But why don't we use all the syn queues? If tcp_peer_is_proven() returns true then we do allow ourselves to use the whole queue. > Why do we need to leave the size of (net->ipv4.sysctl_max_syn_backlog >> 2) > in the queue? > > Even if the system is attacked by a syn flood, there is no need to leave a > part. Why do we need to leave a part? The comment describes the rationale. If syncookies are disabled, then the last quarter of the backlog is reserved for filling with destinations that were proven to be alive, according to tcp_peer_is_proven() (which uses RTTs measured in previous connections). The idea is that if there is a SYN flood, we do not want to use all of our queue budget on attack traffic but instead want to reserve some queue space for SYNs from real remote machines that we have actually contacted in the past. > The value of sysctl_max_syn_backlog is the maximum length of the queue only > if syn cookies are enabled. Even if syncookies are disabled, sysctl_max_syn_backlog is the maximum length of the queue. 
> This is the first strange place, here is another strange place > > __u32 isn = TCP_SKB_CB(skb)->tcp_tw_isn; > > if ((net->ipv4.sysctl_tcp_syncookies == 2 || > inet_csk_reqsk_queue_is_full(sk)) && !isn) { > > if (!want_cookie && !isn) { > > The value of "isn" comes from TCP_SKB_CB(skb)->tcp_tw_isn, then it is judged > twice whether its value is indeed 0. > > But "tcp_tw_isn" is initialized in the function "tcp_v4_fill_cb" > > TCP_SKB_CB(skb)->tcp_tw_isn = 0; > > So it has always been 0, I used printk to test, and the result is always 0. That field is also set in tcp_timewait_state_process(): TCP_SKB_CB(skb)->tcp_tw_isn = isn; So there can be cases where it is not 0. Hope that helps, neal
[PATCH net] tcp_bbr: fix bw probing to raise in-flight data for very small BDPs
For some very small BDPs (with just a few packets) there was a quantization effect where the target number of packets in flight during the super-unity-gain (1.25x) phase of gain cycling was implicitly truncated to a number of packets no larger than the normal unity-gain (1.0x) phase of gain cycling. This meant that in multi-flow scenarios some flows could get stuck with a lower bandwidth, because they did not push enough packets inflight to discover that there was more bandwidth available. This was really only an issue in multi-flow LAN scenarios, where RTTs and BDPs are low enough for this to be an issue. This fix ensures that gain cycling can raise inflight for small BDPs by ensuring that in PROBE_BW mode target inflight values with a super-unity gain are always greater than inflight values with a gain <= 1. Importantly, this applies whether the inflight value is calculated for use as a cwnd value, or as a target inflight value for the end of the super-unity phase in bbr_is_next_cycle_phase() (both need to be bigger to ensure we can probe with more packets in flight reliably). This is a candidate fix for stable releases. Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control") Signed-off-by: Neal Cardwell Acked-by: Yuchung Cheng Acked-by: Soheil Hassas Yeganeh Acked-by: Priyaranjan Jha Reviewed-by: Eric Dumazet --- net/ipv4/tcp_bbr.c | 4 1 file changed, 4 insertions(+) diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c index 3b5f45b9e81eb..13d34427ca3dd 100644 --- a/net/ipv4/tcp_bbr.c +++ b/net/ipv4/tcp_bbr.c @@ -358,6 +358,10 @@ static u32 bbr_target_cwnd(struct sock *sk, u32 bw, int gain) /* Reduce delayed ACKs by rounding up cwnd to the next even number. */ cwnd = (cwnd + 1) & ~1U; + /* Ensure gain cycling gets inflight above BDP even for small BDPs. */ + if (bbr->mode == BBR_PROBE_BW && gain > BBR_UNIT) + cwnd += 2; + return cwnd; } -- 2.18.0.345.g5c9ce644c3-goog
Re: [PATCH net-next] tcp: ack immediately when a cwr packet arrives
On Tue, Jul 24, 2018 at 1:42 PM Lawrence Brakmo wrote: > > Note that without this fix the 99% latencies when doing 10KB RPCs > in a congested network using DCTCP are 40ms vs. 190us with the patch. > Also note that these 40ms high tail latencies started after commit > 3759824da87b30ce7a35b4873b62b0ba38905ef5 in Jul 2015, > which triggered the bugs/features we are fixing/adding. I agree it is a > debatable whether it is a bug fix or a feature improvement and I am > fine either way. Good point. The fact that this greatly mitigates a regression in DCTCP performance resulting from 3759824da87b30ce7a35b4873b62b0ba38905ef5 ("tcp: PRR uses CRB mode by default and SS mode conditionally") IMHO seems to be a good argument for putting this patch ("tcp: ack immediately when a cwr packet arrives") in the "net" branch and stable releases. thanks, neal
Re: [PATCH net-next] tcp: ack immediately when a cwr packet arrives
On Tue, Jul 24, 2018 at 1:07 PM Yuchung Cheng wrote: > > On Mon, Jul 23, 2018 at 7:23 PM, Daniel Borkmann wrote: > > Should this go to net tree instead where all the other fixes went? > I am neutral but this feels more like a feature improvement I agree this feels like a feature improvement rather than a bug fix. neal
Re: [PATCH net-next] tcp: ack immediately when a cwr packet arrives
On Mon, Jul 23, 2018 at 8:49 PM Lawrence Brakmo wrote: > > We observed high 99 and 99.9% latencies when doing RPCs with DCTCP. The > problem is triggered when the last packet of a request arrives CE > marked. The reply will carry the ECE mark causing TCP to shrink its cwnd > to 1 (because there are no packets in flight). When the 1st packet of > the next request arrives, the ACK was sometimes delayed even though it > is CWR marked, adding up to 40ms to the RPC latency. > > This patch ensures that CWR marked data packets arriving will be acked > immediately. ... > Modified based on comments by Neal Cardwell > > Signed-off-by: Lawrence Brakmo > --- > net/ipv4/tcp_input.c | 9 - > 1 file changed, 8 insertions(+), 1 deletion(-) Seems like a nice mechanism to have, IMHO. Acked-by: Neal Cardwell Thanks! neal
Re: [PATCH net-next] tcp: expose both send and receive intervals for rate sample
On Mon, Jul 9, 2018 at 1:58 PM Deepti Raghavan wrote: > > Congestion control algorithms, which access the rate sample > through the tcp_cong_control function, only have access to the maximum > of the send and receive interval, for cases where the acknowledgment > rate may be inaccurate due to ACK compression or decimation. Algorithms > may want to use send rates and receive rates as separate signals. > > Signed-off-by: Deepti Raghavan > --- > include/net/tcp.h | 2 ++ > net/ipv4/tcp_rate.c | 4 > 2 files changed, 6 insertions(+) Thanks for re-sending. It does seem to be showing up in patchwork now: https://patchwork.ozlabs.org/patch/941532/ And I can confirm I'm able to apply it to net-next. Acked-by: Neal Cardwell thanks, neal
Re: [PATCH net-next v3 0/2] tcp: fix high tail latencies in DCTCP
On Sat, Jul 7, 2018 at 7:15 AM David Miller wrote: > > From: Lawrence Brakmo > Date: Tue, 3 Jul 2018 09:26:13 -0700 > > > We have observed high tail latencies when using DCTCP for RPCs as > > compared to using Cubic. For example, in one setup there are 2 hosts > > sending to a 3rd one, with each sender having 3 flows (1 stream, > > 1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following > > table shows the 99% and 99.9% latencies for both Cubic and dctcp: > > > >Cubic 99% Cubic 99.9% dctcp 99%dctcp 99.9% > > 1MB RPCs2.6ms 5.5ms 43ms 208ms > > 10KB RPCs1.1ms 1.3ms 53ms 212ms > ... > > v2: Removed call to tcp_ca_event from tcp_send_ack since I added one in > > tcp_event_ack_sent. Based on Neal Cardwell > > feedback. > > Modified tcp_ecn_check_ce (and renamed it tcp_ecn_check) instead of > > modifying > > tcp_ack_send_check to ensure an ACK when cwr is received. > > v3: Handling cwr in tcp_ecn_accept_cwr instead of in tcp_ecn_check. > > > > [PATCH net-next v3 1/2] tcp: notify when a delayed ack is sent > > [PATCH net-next v3 2/2] tcp: ack immediately when a cwr packet > > Neal and co., what are your thoughts right now about this patch series? > > Thank you. IMHO these patches are a definite improvement over what we have now. That said, in chatting with Yuchung before the July 4th break, I think Yuchung and I agreed that we would ideally like to see something like the following: (1) refactor the DCTCP code to check for pending delayed ACKs directly using existing state (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER), and remove the ca->delayed_ack_reserved DCTCP field and the CA_EVENT_DELAYED_ACK and CA_EVENT_NON_DELAYED_ACK callbacks added for DCTCP (which Larry determined had at least one bug).
(2) fix the bug with the DCTCP call to tcp_send_ack(sk) causing delayed ACKs to be incorrectly dropped/forgotten (not yet addressed by this patch series) (3) then with fixes (1) and (2) in place, re-run tests and see if we still need Larry's heuristic (in patch 2) to fire an ACK immediately if a receiver receives a CWR packet (I suspect this is still very useful, but I think Yuchung is reluctant to add this complexity unless we have verified it's still needed after (1) and (2)) Our team may be able to help out with some proposed patches for (1) and (2). In any case, I would love to have Yuchung and Eric weigh in (perhaps Monday) before we merge this patch series. Thanks, neal
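A minimal sketch of the delayed-ACK check proposed in step (1) above, using a simplified stand-in for the kernel's inet_connection_sock (the flag names mirror the kernel's ICSK_ACK_* bits, but the struct and bit values here are illustrative, not the real kernel definitions):

```c
#include <assert.h>

/* Illustrative stand-ins for the kernel's icsk_ack.pending flag bits. */
#define ICSK_ACK_SCHED 1 /* an ACK is owed to the peer */
#define ICSK_ACK_TIMER 2 /* the delayed-ACK timer is armed */

struct fake_icsk {
	unsigned char icsk_ack_pending;
};

/* Instead of DCTCP tracking its own ca->delayed_ack_reserved bit via
 * CA_EVENT_(NON_)DELAYED_ACK callbacks, it could consult existing state:
 * a delayed ACK is outstanding iff the delayed-ACK timer bit is set,
 * mirroring the (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER) test. */
static int delayed_ack_pending(const struct fake_icsk *icsk)
{
	return (icsk->icsk_ack_pending & ICSK_ACK_TIMER) != 0;
}
```

This is only a model of the suggested refactor, not the patch itself; the point is that the information DCTCP needs already exists in the socket's ACK state.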
Re: [PATCH net-next v2 0/2] tcp: fix high tail latencies in DCTCP
On Tue, Jul 3, 2018 at 11:10 AM Lawrence Brakmo wrote: > > On 7/2/18, 5:52 PM, "netdev-ow...@vger.kernel.org on behalf of Neal Cardwell" > wrote: > > On Mon, Jul 2, 2018 at 5:39 PM Lawrence Brakmo wrote: > > > > We have observed high tail latencies when using DCTCP for RPCs as > > compared to using Cubic. For example, in one setup there are 2 hosts > > sending to a 3rd one, with each sender having 3 flows (1 stream, > > 1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following > > table shows the 99% and 99.9% latencies for both Cubic and dctcp: > > > > Cubic 99%  Cubic 99.9%  dctcp 99%  dctcp 99.9% > > 1MB RPCs  2.6ms  5.5ms  43ms  208ms > > 10KB RPCs  1.1ms  1.3ms  53ms  212ms > > > > Looking at tcpdump traces showed that there are two causes for the > > latency. > > > > 1) RTOs caused by the receiver sending a dup ACK and not ACKing > > the last (and only) packet sent. > > 2) Delaying ACKs when the sender has a cwnd of 1, so everything > > pauses for the duration of the delayed ACK. > > > > The first patch fixes the cause of the dup ACKs, not updating DCTCP > > state when an ACK that was initially delayed has been sent with a > > data packet. > > > > The second patch ensures that an ACK is sent immediately when a > > CWR marked packet arrives. > > > > With the patches the latencies for DCTCP now look like: > > > > dctcp 99%  dctcp 99.9% > > 1MB RPCs  5.8ms  6.9ms > > 10KB RPCs  146us  203us > > > > Note that while the 1MB RPCs tail latencies are higher than Cubic's, > > the 10KB latencies are much smaller than Cubic's. These patches fix > > issues on the receiver, but tcpdump traces indicate there is an > > opportunity to also fix an issue at the sender that adds about 3ms > > to the tail latencies. 
> > > > The following trace shows the issue that triggers an RTO (fixed by these > patches): > > > >Host A sends the last packets of the request > >Host B receives them, and the last packet is marked with congestion > (CE) > >Host B sends ACKs for packets not marked with congestion > >Host B sends data packet with reply and ACK for packet marked with > > congestion (TCP flag ECE) > >Host A receives ACKs with no ECE flag > >Host A receives data packet with ACK for the last packet of request > > and which has TCP ECE bit set > >Host A sends 1st data packet of the next request with TCP flag CWR > >Host B receives the packet (as seen in tcpdump at B), no CE flag > >Host B sends a dup ACK that also has the TCP ECE flag > >Host A RTO timer fires! > >Host A to send the next packet > >Host A receives an ACK for everything it has sent (i.e. Host B > > did receive 1st packet of request) > >Host A sends more packets… > > > > [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent > > [PATCH net-next v2 2/2] tcp: ack immediately when a cwr packet > > > > net/ipv4/tcp_input.c | 16 +++- > > net/ipv4/tcp_output.c | 4 ++-- > > 2 files changed, 13 insertions(+), 7 deletions(-) > > Thanks, Larry. Just for context, can you please let us know whether > your tests included zero, one, or both of Eric's recent commits > (listed below) that tuned the number of ACKs after ECN events? (Or > maybe the tests were literally using a net-next kernel?) Just wanted > to get a better handle on any possible interactions there. > > Thanks! > > neal > > Yes, my test kernel includes both patches listed below. OK, great. > BTW, I will send a new > patch where I move the call to tcp_incr_quickack from tcp_ecn_check_ce to > tcp_ecn_accept_cwr. OK, sounds good to me. thanks, neal
Re: [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent
On Mon, Jul 2, 2018 at 7:49 PM Yuchung Cheng wrote: > > On Mon, Jul 2, 2018 at 2:39 PM, Lawrence Brakmo wrote: > > > > DCTCP depends on the CA_EVENT_NON_DELAYED_ACK and CA_EVENT_DELAYED_ACK > > notifications to keep track of whether it needs to send an ACK for packets that > > were received with a particular ECN state but whose ACK was delayed. > > > > Under some circumstances, for example when a delayed ACK is sent with a > > data packet, DCTCP state was not being updated due to a lack of > > notification that the previously delayed ACK was sent. As a result, it > > would sometimes send a duplicate ACK when a new data packet arrived. > > > > This patch ensures that DCTCP's state is correctly updated so it will > > not send the duplicate ACK. > Sorry to chime-in late here (lame excuse: IETF deadline) > > IIRC this issue would exist prior to the 4.11 kernel. While it'd be good > to fix that, it's not clear which patch introduced the regression > between 4.11 and 4.16? I assume you tested Eric's most recent quickack > fix. I don't think Larry is saying there's a regression between 4.11 and 4.16. His recent "tcp: force cwnd at least 2 in tcp_cwnd_reduction" patch here: https://patchwork.ozlabs.org/patch/935233/ said that 4.11 was bad (high tail latency and lots of RTOs) and 4.16 was still bad but with different netstat counters (no RTOs but still high tail latency): """ On 4.11, pcap traces indicate that in some instances the 1st packet of the RPC is received but no ACK is sent before the packet is retransmitted. On 4.11 netstat shows TCP timeouts, with some of them spurious. On 4.16, we don't see retransmits in netstat but the high tail latencies are still there. 
""" I suspect the RTOs disappeared but latencies remained too high because between 4.11 and 4.16 we introduced: tcp: allow TLP in ECN CWR https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=b4f70c3d4ec32a2ff4c62e1e2da0da5f55fe12bd So the RTOs probably disappeared because this commit turned them into TLPs. But the latencies remained high because the fundamental bug remained throughout 4.11 and 4.16 and today: the DCTCP use of tcp_send_ack() with an old rcv_nxt caused delayed ACKs to be cancelled when they should not have been. > In terms of the fix itself, it seems odd the tcp_send_ack() call in > DCTCP generates NON_DELAYED_ACK event to toggle DCTCP's > delayed_ack_reserved bit. Shouldn't the fix to have DCTCP send the > "prior" ACK w/o cancelling delayed-ACK and mis-recording that it's > cancelled, because that prior-ACK really is a supplementary old ACK. This patch is fixing an issue that's orthogonal to the one you are talking about. Using the taxonomy from our team's internal discussion yesterday, the issue you mention where the DCTCP "prior" ACK is cancelling delayed ACKs is issue (4); the issue that this particular "tcp: notify when a delayed ack is sent" patch from Larry fixes is issue (3). It's a bit tricky because both issues appeared in Larry's trace summary and packetdrill script to reproduce the issue. neal
Re: [PATCH net-next v2 0/2] tcp: fix high tail latencies in DCTCP
On Mon, Jul 2, 2018 at 5:39 PM Lawrence Brakmo wrote: > > We have observed high tail latencies when using DCTCP for RPCs as > compared to using Cubic. For example, in one setup there are 2 hosts > sending to a 3rd one, with each sender having 3 flows (1 stream, > 1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following > table shows the 99% and 99.9% latencies for both Cubic and dctcp: > > Cubic 99%  Cubic 99.9%  dctcp 99%  dctcp 99.9% > 1MB RPCs  2.6ms  5.5ms  43ms  208ms > 10KB RPCs  1.1ms  1.3ms  53ms  212ms > > Looking at tcpdump traces showed that there are two causes for the > latency. > > 1) RTOs caused by the receiver sending a dup ACK and not ACKing > the last (and only) packet sent. > 2) Delaying ACKs when the sender has a cwnd of 1, so everything > pauses for the duration of the delayed ACK. > > The first patch fixes the cause of the dup ACKs, not updating DCTCP > state when an ACK that was initially delayed has been sent with a > data packet. > > The second patch ensures that an ACK is sent immediately when a > CWR marked packet arrives. > > With the patches the latencies for DCTCP now look like: > > dctcp 99%  dctcp 99.9% > 1MB RPCs  5.8ms  6.9ms > 10KB RPCs  146us  203us > > Note that while the 1MB RPCs tail latencies are higher than Cubic's, > the 10KB latencies are much smaller than Cubic's. These patches fix > issues on the receiver, but tcpdump traces indicate there is an > opportunity to also fix an issue at the sender that adds about 3ms > to the tail latencies. 
> > The following trace shows the issue that triggers an RTO (fixed by these > patches): > >Host A sends the last packets of the request >Host B receives them, and the last packet is marked with congestion (CE) >Host B sends ACKs for packets not marked with congestion >Host B sends data packet with reply and ACK for packet marked with > congestion (TCP flag ECE) >Host A receives ACKs with no ECE flag >Host A receives data packet with ACK for the last packet of request > and which has TCP ECE bit set >Host A sends 1st data packet of the next request with TCP flag CWR >Host B receives the packet (as seen in tcpdump at B), no CE flag >Host B sends a dup ACK that also has the TCP ECE flag >Host A RTO timer fires! >Host A to send the next packet >Host A receives an ACK for everything it has sent (i.e. Host B > did receive 1st packet of request) >Host A sends more packets… > > [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent > [PATCH net-next v2 2/2] tcp: ack immediately when a cwr packet > > net/ipv4/tcp_input.c | 16 +++- > net/ipv4/tcp_output.c | 4 ++-- > 2 files changed, 13 insertions(+), 7 deletions(-) Thanks, Larry. Just for context, can you please let us know whether your tests included zero, one, or both of Eric's recent commits (listed below) that tuned the number of ACKs after ECN events? (Or maybe the tests were literally using a net-next kernel?) Just wanted to get a better handle on any possible interactions there. Thanks! neal --- https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=522040ea5fdd1c33bbf75e1d7c7c0422b96a94ef commit 522040ea5fdd1c33bbf75e1d7c7c0422b96a94ef Author: Eric Dumazet Date: Mon May 21 15:08:57 2018 -0700 tcp: do not aggressively quick ack after ECN events ECN signals currently forces TCP to enter quickack mode for up to 16 (TCP_MAX_QUICKACKS) following incoming packets. We believe this is not needed, and only sending one immediate ack for the current packet should be enough. 
This should reduce the extra load noticed in DCTCP environments, after congestion events. This is part 2 of our effort to reduce pure ACK packets. Signed-off-by: Eric Dumazet Acked-by: Soheil Hassas Yeganeh Acked-by: Yuchung Cheng Acked-by: Neal Cardwell Signed-off-by: David S. Miller https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=15ecbe94a45ef88491ca459b26efdd02f91edb6d commit 15ecbe94a45ef88491ca459b26efdd02f91edb6d Author: Eric Dumazet Date: Wed Jun 27 08:47:21 2018 -0700 tcp: add one more quick ack after after ECN events Larry Brakmo proposal ( https://patchwork.ozlabs.org/patch/935233/ tcp: force cwnd at least 2 in tcp_cwnd_reduction) made us rethink about our recent patch removing ~16 quick acks after ECN events. tcp_enter_quickack_mode(sk, 1) makes sure one immediate ack is sent, but in the case the sender cwnd was lowered to 1, we do not want to have a delayed ack for the next packet we will receive. Fixes: 522040ea5fdd ("tcp: do not aggressively quick ack after ECN events") Signed-off-by: Eric Dumazet Reported-by: Neal Cardwell Cc: Lawrence Brakmo Acked-by: Neal Cardwell Signed-off-by: David S. Miller
Re: [PATCH net-next 1/2] tcp: notify when a delayed ack is sent
On Fri, Jun 29, 2018 at 9:48 PM Lawrence Brakmo wrote: > > DCTCP depends on the CA_EVENT_NON_DELAYED_ACK and CA_EVENT_DELAYED_ACK > notifications to keep track of whether it needs to send an ACK for packets that > were received with a particular ECN state but whose ACK was delayed. > > Under some circumstances, for example when a delayed ACK is sent with a > data packet, DCTCP state was not being updated due to a lack of > notification that the previously delayed ACK was sent. As a result, it > would sometimes send a duplicate ACK when a new data packet arrived. > > This patch ensures that DCTCP's state is correctly updated so it will > not send the duplicate ACK. > > Signed-off-by: Lawrence Brakmo > --- > net/ipv4/tcp_output.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index f8f6129160dd..41f6ad7a21e4 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -172,6 +172,8 @@ static inline void tcp_event_ack_sent(struct sock *sk, > unsigned int pkts) > __sock_put(sk); > } > tcp_dec_quickack_mode(sk, pkts); > + if (inet_csk_ack_scheduled(sk)) > + tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK); > inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK); > } Thanks for this fix! Seems like this would work, but if I am reading this correctly then it seems like this would cause a duplicate call to tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK) when we are sending a pure ACK (delayed or non-delayed): (1) once from tcp_send_ack() before we send the ACK: tcp_send_ack(struct sock *sk) ->tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK); (2) then again from tcp_event_ack_sent() after we have sent the ACK: tcp_event_ack_sent() -> if (inet_csk_ack_scheduled(sk)) tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK); What if we remove the original CA_EVENT_NON_DELAYED_ACK call and just replace it with your new one? 
(not compiled, not tested): diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 3889dcd4868d4..bddb49617d9be 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -184,6 +184,8 @@ static inline void tcp_event_ack_sent(struct sock *sk, unsigned int pkts) __sock_put(sk); } tcp_dec_quickack_mode(sk, pkts); + if (inet_csk_ack_scheduled(sk)) + tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK); inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK); } @@ -3836,8 +3838,6 @@ void tcp_send_ack(struct sock *sk) if (sk->sk_state == TCP_CLOSE) return; - tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK); - /* We are not putting this on the write queue, so * tcp_transmit_skb() will set the ownership to this * sock. Aside from lower CPU overhead, one nice benefit of that is that we then only call tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK) in one place, which might be a little easier to reason about. Does that work? neal
Re: [PATCH net-next 2/2] tcp: ack immediately when a cwr packet arrives
On Sat, Jun 30, 2018 at 9:47 PM Lawrence Brakmo wrote: > I see two issues, one is that entering quickack mode as you > mentioned does not ensure that it will still be on when the CWR > arrives. The second issue is that the problem occurs right after the > receiver sends a small reply which results in entering pingpong mode > right before the sender starts the new request by sending just one > packet (i.e. forces delayed ack). > > I compiled and tested your patch. Both 99 and 99.9 percentile > latencies are around 40ms. Looking at the packet traces shows that > some CWR marked packets are not being ACKed immediately (delayed by > 40ms). Thanks, Larry! So your tests provide nice, specific evidence that it is good to force an immediate ACK when a receiver receives a packet with CWR marked. Given that, I am wondering what the simplest way is to achieve that goal. What if, rather than plumbing a new specific signal into __tcp_ack_snd_check(), we use the existing general quick-ack mechanism, where various parts of the TCP stack (like __tcp_ecn_check_ce()) are already using the quick-ack mechanism to "remotely" signal to __tcp_ack_snd_check() that they want an immediate ACK. For example, would it work to do something like: diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index c53ae5fc834a5..8168d1938b376 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -262,6 +262,12 @@ static void __tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb) { struct tcp_sock *tp = tcp_sk(sk); + /* If the sender is telling us it has entered CWR, then its cwnd may be +* very low (even just 1 packet), so we should ACK immediately. +*/ + if (tcp_hdr(skb)->cwr) + tcp_enter_quickack_mode(sk, 2); + switch (TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK) { case INET_ECN_NOT_ECT: /* Ensure that ECN will not continue to mark packets. 
*/ And then since that broadens the mission of this function beyond checking just the ECT/CE bits, I suppose we could rename the __tcp_ecn_check_ce() and tcp_ecn_check_ce() functions to __tcp_ecn_check() and tcp_ecn_check(), or something like that. Would that work for this particular issue? neal
Re: [PATCH net] tcp: prevent bogus FRTO undos with non-SACK flows
On Fri, Jun 29, 2018 at 3:52 PM Ilpo Järvinen wrote: > > +.005 < . 1:1(0) ack 2001 win 257 > > Why did the receiver send a cumulative ACK only for 2001? Sorry, you are right Ilpo. Upon further reflection, the packetdrill scenario I posted is not a realistic one, and I agree we should not worry about it. As I mentioned, I ran your patch through all our team's TCP packetdrill tests, and it passes all of the tests. One of our tests needed updating, because if there is a non-SACK connection with a spurious RTO due to a delayed flight of ACKs then the FRTO undo now happens one ACK later (when we get an ACK that doesn't cover a retransmit). But that seems fine to me. I also cooked the new packetdrill test below to explicitly cover this case you are addressing (please let me know if you have an alternate suggestion). Tested-by: Neal Cardwell Acked-by: Neal Cardwell Thanks! neal --- 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 32792 +0 > S. 0:0(0) ack 1 +.02 < . 1:1(0) ack 1 win 257 +0 accept(3, ..., ...) = 4 // Send 3 packets. First is really lost. And the dupacks // for the data packets that arrived at the receiver are slow in arriving. +0 write(4, ..., 3000) = 3000 +0 > P. 1:3001(3000) ack 1 // RTO and retransmit head. This fills a real hole. +.22 > . 1:1001(1000) ack 1 // Dupacks for packets 2 and 3 arrive. +.02 < . 1:1(0) ack 1 win 257 +0 < . 1:1(0) ack 1 win 257 // The cumulative ACK for all the data arrives. We do not undo, because // this is a non-SACK connection, and retransmitted data was ACKed. // It's good that there's no FRTO undo, since a packet was really lost. // Because this is non-SACK, tcp_try_undo_recovery() holds CA_Loss // until something beyond high_seq is ACKed. +.005 < . 
1:1(0) ack 3001 win 257 +0 %{ assert tcpi_ca_state == TCP_CA_Loss, tcpi_ca_state }% +0 %{ assert tcpi_snd_cwnd == 4, tcpi_snd_cwnd }% +0 %{ assert tcpi_snd_ssthresh == 7, tcpi_snd_ssthresh }%
Re: [PATCH net-next 0/2] tcp: fix high tail latencies in DCTCP
On Fri, Jun 29, 2018 at 9:48 PM Lawrence Brakmo wrote: > > We have observed high tail latencies when using DCTCP for RPCs as > compared to using Cubic. For example, in one setup there are 2 hosts > sending to a 3rd one, with each sender having 3 flows (1 stream, > 1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following > table shows the 99% and 99.9% latencies for both Cubic and dctcp: > > Cubic 99%  Cubic 99.9%  dctcp 99%  dctcp 99.9% > 1MB RPCs  2.6ms  5.5ms  43ms  208ms > 10KB RPCs  1.1ms  1.3ms  53ms  212ms > > Looking at tcpdump traces showed that there are two causes for the > latency. > > 1) RTOs caused by the receiver sending a dup ACK and not ACKing > the last (and only) packet sent. > 2) Delaying ACKs when the sender has a cwnd of 1, so everything > pauses for the duration of the delayed ACK. > > The first patch fixes the cause of the dup ACKs, not updating DCTCP > state when an ACK that was initially delayed has been sent with a > data packet. > > The second patch ensures that an ACK is sent immediately when a > CWR marked packet arrives. > > With the patches the latencies for DCTCP now look like: > > dctcp 99%  dctcp 99.9% > 1MB RPCs  4.8ms  6.5ms > 10KB RPCs  143us  184us > > Note that while the 1MB RPCs tail latencies are higher than Cubic's, > the 10KB latencies are much smaller than Cubic's. These patches fix > issues on the receiver, but tcpdump traces indicate there is an > opportunity to also fix an issue at the sender that adds about 3ms > to the tail latencies. 
> > The following trace shows the issue that triggers an RTO (fixed by these > patches): > >Host A sends the last packets of the request >Host B receives them, and the last packet is marked with congestion (CE) >Host B sends ACKs for packets not marked with congestion >Host B sends data packet with reply and ACK for packet marked with > congestion (TCP flag ECE) >Host A receives ACKs with no ECE flag >Host A receives data packet with ACK for the last packet of request > and which has TCP ECE bit set >Host A sends 1st data packet of the next request with TCP flag CWR >Host B receives the packet (as seen in tcpdump at B), no CE flag >Host B sends a dup ACK that also has the TCP ECE flag >Host A RTO timer fires! >Host A to send the next packet >Host A receives an ACK for everything it has sent (i.e. Host B > did receive 1st packet of request) >Host A sends more packets… > > [PATCH net-next 1/2] tcp: notify when a delayed ack is sent > [PATCH net-next 2/2] tcp: ack immediately when a cwr packet arrives Thanks, Larry! I suspect there is a broader problem with "DCTCP-style precise ECE ACKs" that this patch series does not address. AFAICT the DCTCP "historical" ACKs for ECE precision, generated in dctcp_ce_state_0_to_1() and dctcp_ce_state_1_to_0(), violate the assumptions of the pre-existing delayed ACK state machine. They violate those assumptions by rewinding tp->rcv_nxt backwards and then calling tcp_send_ack(sk). But the existing code path to send an ACK assumes that any ACK transmission always sends an ACK that accounts for all received data, so that after sending an ACK we can cancel the delayed ACK timer. 
So it seems with DCTCP we can have buggy sequences where the DCTCP historical ACK causes us to forget that we need to schedule a delayed ACK (which will force the sender to RTO, as shown in the trace):

tcp_event_data_recv()
  inet_csk_schedule_ack()   // remember that we need an ACK
  tcp_ecn_check_ce()
    tcp_ca_event()          // with CA_EVENT_ECN_IS_CE or CA_EVENT_ECN_NO_CE
      dctcp_cwnd_event()
        dctcp_ce_state_0_to_1() or dctcp_ce_state_1_to_0()
          if (... && ca->delayed_ack_reserved)
            tp->rcv_nxt = ca->prior_rcv_nxt;
            tcp_send_ack(sk);   // send an ACK, but for old data!
              tcp_transmit_skb()
                tcp_event_ack_sent()
                  inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK); // forget that we need a delayed ACK!

AFAICT the first patch in this series, to force an immediate ACK on any packet with a CWR, papers over this issue of the forgotten delayed ACK by forcing an immediate ack in __tcp_ack_snd_check(). This ensures that at least for CWR packets the forgotten delayed ACK does not matter. But for packets without CWR I believe the buggy "forgotten delayed ACK" sequence above is still possible, and could still plague DCTCP with high tail latencies in some cases. I suspect that after your CWR-forces-ACK patch it may not be jumping out in performance test results because (a) if there is no CWR involved then usually the cwnd is high enough that missing delayed ACK does not cause a stall, and (b) if there is a CWR involved then your patch forces an immediate ACK. But AFAICT that forgotten delayed ACK bug is still there. What do folks think? neal
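The forgotten-delayed-ACK sequence in that trace can be modeled in a few lines. This is a toy model only: the transmit path clears the delayed-ACK timer after *any* ACK is sent, including a DCTCP "historical" ACK rewound to an old rcv_nxt, so the receiver forgets it still owes an ACK for newer data. All names and fields here are illustrative, not the kernel's:

```c
#include <assert.h>

struct fake_sock {
	unsigned int rcv_nxt;    /* next byte we expect from the peer */
	unsigned int acked_upto; /* highest sequence we have ACKed so far */
	int delayed_ack_armed;   /* is the delayed-ACK timer pending? */
};

static void fake_send_ack(struct fake_sock *sk)
{
	sk->acked_upto = sk->rcv_nxt;
	sk->delayed_ack_armed = 0; /* models tcp_event_ack_sent() clearing ICSK_TIME_DACK */
}

/* DCTCP-style "historical" ACK: rewind rcv_nxt, send the ACK, restore. */
static void fake_dctcp_prior_ack(struct fake_sock *sk, unsigned int prior_rcv_nxt)
{
	unsigned int saved = sk->rcv_nxt;

	sk->rcv_nxt = prior_rcv_nxt;
	fake_send_ack(sk);
	sk->rcv_nxt = saved;
}
```

After the historical ACK, the peer has only been ACKed up to the old sequence, yet the delayed-ACK timer for the newer data has been silently cancelled, which is exactly the hazard described above.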
Re: [PATCH net-next 2/2] tcp: ack immediately when a cwr packet arrives
On Fri, Jun 29, 2018 at 9:48 PM Lawrence Brakmo wrote: > > We observed high 99 and 99.9% latencies when doing RPCs with DCTCP. The > problem is triggered when the last packet of a request arrives CE > marked. The reply will carry the ECE mark causing TCP to shrink its cwnd > to 1 (because there are no packets in flight). When the 1st packet of > the next request arrives it was sometimes delayed adding up to 40ms to > the latency. > > This patch ensures that CWR marked data packets arriving will be ACKed > immediately. > > Signed-off-by: Lawrence Brakmo > --- Thanks, Larry. Ensuring that CWR-marked data packets arriving will be ACKed immediately sounds like a good goal to me. I am wondering if we can have a simpler fix. The dctcp_ce_state_0_to_1() code is setting the TCP_ECN_DEMAND_CWR bit in ecn_flags, which disables the code in __tcp_ecn_check_ce() that would have otherwise used the tcp_enter_quickack_mode() mechanism to force an ACK: __tcp_ecn_check_ce(): ... case INET_ECN_CE: if (tcp_ca_needs_ecn(sk)) tcp_ca_event(sk, CA_EVENT_ECN_IS_CE); // -> dctcp_ce_state_0_to_1() // -> tp->ecn_flags |= TCP_ECN_DEMAND_CWR; if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) { /* Better not delay acks, sender can have a very low cwnd */ tcp_enter_quickack_mode(sk, 1); tp->ecn_flags |= TCP_ECN_DEMAND_CWR; } tp->ecn_flags |= TCP_ECN_SEEN; break; So it seems like the bug here may be that dctcp_ce_state_0_to_1() is setting the TCP_ECN_DEMAND_CWR bit in ecn_flags, when really it should let its caller, __tcp_ecn_check_ce() set TCP_ECN_DEMAND_CWR, in which case the code would already properly force an immediate ACK. 
So, what if we use this fix instead (not compiled, not tested): diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c index 5f5e5936760e..4fecd2824edb 100644 --- a/net/ipv4/tcp_dctcp.c +++ b/net/ipv4/tcp_dctcp.c @@ -152,8 +152,6 @@ static void dctcp_ce_state_0_to_1(struct sock *sk) ca->prior_rcv_nxt = tp->rcv_nxt; ca->ce_state = 1; - - tp->ecn_flags |= TCP_ECN_DEMAND_CWR; } static void dctcp_ce_state_1_to_0(struct sock *sk) What do you think? neal
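The interaction described above can be sketched with a small model: __tcp_ecn_check_ce() forces an immediate ACK on CE only when TCP_ECN_DEMAND_CWR is not already set, so a congestion-control callback that sets the bit first suppresses the immediate ACK. This is a simplified illustration with made-up flag values, not the kernel code:

```c
#include <assert.h>

#define TCP_ECN_DEMAND_CWR 0x1 /* illustrative value */

struct fake_tp {
	int ecn_flags;
	int quickacks; /* number of forced immediate ACKs granted */
};

/* Model of the CE branch of __tcp_ecn_check_ce(). If ca_sets_demand_cwr is
 * nonzero, the CA callback (dctcp_ce_state_0_to_1() before the proposed fix)
 * sets TCP_ECN_DEMAND_CWR first, defeating the quick-ACK below. */
static void ce_event(struct fake_tp *tp, int ca_sets_demand_cwr)
{
	if (ca_sets_demand_cwr)
		tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
	if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
		tp->quickacks++; /* tcp_enter_quickack_mode(sk, 1) */
		tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
	}
}
```

With the proposed two-line removal, DCTCP leaves the bit to the generic path and the immediate ACK fires as intended.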
Re: [PATCH net] tcp: prevent bogus FRTO undos with non-SACK flows
On Fri, Jun 29, 2018 at 6:07 AM Ilpo Järvinen wrote: > > If SACK is not enabled and the first cumulative ACK after the RTO > retransmission covers more than the retransmitted skb, a spurious > FRTO undo will trigger (assuming FRTO is enabled for that RTO). > The reason is that any non-retransmitted segment acknowledged will > set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is > no indication that it would have been delivered for real (the > scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK > case so the check for that bit won't help like it does with SACK). > Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo > in tcp_process_loss. > > We need to use more strict condition for non-SACK case and check > that none of the cumulatively ACKed segments were retransmitted > to prove that progress is due to original transmissions. Only then > keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in > non-SACK case. > > (FLAG_ORIG_SACK_ACKED is planned to be renamed to FLAG_ORIG_PROGRESS > to better indicate its purpose but to keep this change minimal, it > will be done in another patch). > > Besides burstiness and congestion control violations, this problem > can result in an RTO loop: When the loss recovery is prematurely > undone, only new data will be transmitted (if available) and > the next retransmission can occur only after a new RTO which in case > of multiple losses (that are not for consecutive packets) requires > one RTO per loss to recover. 
> > Signed-off-by: Ilpo Järvinen > --- > net/ipv4/tcp_input.c | 9 + > 1 file changed, 9 insertions(+) > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index 045d930..8e5522c 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -3181,6 +3181,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 > prior_fack, > > if (tcp_is_reno(tp)) { > tcp_remove_reno_sacks(sk, pkts_acked); > + > + /* If any of the cumulatively ACKed segments was > +* retransmitted, non-SACK case cannot confirm that > +* progress was due to original transmission due to > +* lack of TCPCB_SACKED_ACKED bits even if some of > +* the packets may have been never retransmitted. > +*/ > + if (flag & FLAG_RETRANS_DATA_ACKED) > + flag &= ~FLAG_ORIG_SACK_ACKED; > } else { > int delta; Thanks, Ilpo. I ran this through our TCP packetdrill tests and only got one failure, which detected the change you made, so that on the first ACK in a non-SACK F-RTO case we stay in CA_Loss. However, depending on the exact scenario we are concerned about here, this may not be a complete solution. Consider the packetdrill test I cooked below, which is for a scenario where we imagine a non-SACK connection, with data packet #1 and all dupacks lost. With your patch we don't have an F-RTO undo on the first cumulative ACK, which is a definite improvement. However, we end up with an F-RTO undo on the *second* cumulative ACK, because the second ACK didn't cover any retransmitted data. That means that we end up in the second round trip back in the initial slow-start mode with a very high cwnd and infinite ssthresh, even though data was actually lost in the first round trip. I'm not sure how much we care about all these cases. Perhaps we should just disable F-RTO if the connection has no SACK enabled? These non-SACK connections are just such a rare case at this point that I would propose it's not worth spending too much time on this. The following packetdrill script passes when Ilpo's patch is applied. 
This documents the behavior, which I think is questionable: 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 32792 +0 > S. 0:0(0) ack 1 +.02 < . 1:1(0) ack 1 win 257 +0 accept(3, ..., ...) = 4 +0 write(4, ..., 27000) = 27000 +0 > . 1:10001(10000) ack 1 // Suppose 1:1001 is lost and all dupacks are lost. // RTO and retransmit head. This fills a real hole. +.22 > . 1:1001(1000) ack 1 +.005 < . 1:1(0) ack 2001 win 257 +0 > . 10001:13001(3000) ack 1 +0 %{ assert tcpi_ca_state == TCP_CA_Loss, tcpi_ca_state }% +0 %{ assert tcpi_snd_cwnd == 3, tcpi_snd_cwnd }% +0 %{ assert tcpi_snd_ssthresh == 7, tcpi_snd_ssthresh }% +.002 < . 1:1(0) ack 10001 win 257 +0 %{ assert tcpi_ca_state == TCP_CA_Open, tcpi_ca_state }% +0 %{ assert tcpi_snd_cwnd == 18, tcpi_snd_cwnd }% +0 %{ assert tcpi_snd_ssthresh == 2147483647, tcpi_snd_ssthresh }% +0 > . 13001:15001(2000) ack 1 --- neal
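The flag logic in Ilpo's patch can be condensed into a small model: in the non-SACK (Reno) case, a cumulative ACK that covers any retransmitted data cannot prove the original transmissions were delivered, so FLAG_ORIG_SACK_ACKED is cleared and no FRTO undo is allowed on that ACK. The flag values below are illustrative, not the kernel's actual bit assignments:

```c
#include <assert.h>

#define FLAG_ORIG_SACK_ACKED    0x1 /* original transmission was (apparently) delivered */
#define FLAG_RETRANS_DATA_ACKED 0x2 /* the cumulative ACK covered retransmitted data */

/* Model of the added check in tcp_clean_rtx_queue(): without SACK, keep
 * FLAG_ORIG_SACK_ACKED only if no retransmitted segment was cumulatively
 * ACKed, since there is no TCPCB_SACKED_ACKED scoreboard to consult. */
static int clean_rtx_flags(int flag, int is_reno)
{
	if (is_reno && (flag & FLAG_RETRANS_DATA_ACKED))
		flag &= ~FLAG_ORIG_SACK_ACKED;
	return flag;
}
```

As Neal's packetdrill test shows, this blocks the undo only on ACKs that cover a retransmit; a later cumulative ACK covering only original data can still trigger the undo, which is why disabling FRTO entirely for non-SACK flows was floated as an alternative.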
Re: [PATCH net-next v2] tcp: force cwnd at least 2 in tcp_cwnd_reduction
On Thu, Jun 28, 2018 at 4:20 PM Lawrence Brakmo wrote: > > I just looked at 4.18 traces and the behavior is as follows: > >Host A sends the last packets of the request > >Host B receives them, and the last packet is marked with congestion (CE) > >Host B sends ACKs for packets not marked with congestion > >Host B sends data packet with reply and ACK for packet marked with > congestion (TCP flag ECE) > >Host A receives ACKs with no ECE flag > >Host A receives data packet with ACK for the last packet of request and > has TCP ECE bit set > >Host A sends 1st data packet of the next request with TCP flag CWR > >Host B receives the packet (as seen in tcpdump at B), no CE flag > >Host B sends a dup ACK that also has the TCP ECE flag > >Host A RTO timer fires! > >Host A to send the next packet > >Host A receives an ACK for everything it has sent (i.e. Host B did receive > 1st packet of request) > >Host A send more packets… Thanks, Larry! This is very interesting. I don't know the cause, but this reminds me of an issue Steve Ibanez raised on the netdev list last December, where he was seeing cases with DCTCP where a CWR packet would be received and buffered by Host B but not ACKed by Host B. This was the thread "Re: Linux ECN Handling", starting around December 5. I have cc-ed Steve. I wonder if this may somehow be related to the DCTCP logic to rewind tp->rcv_nxt and call tcp_send_ack(), and then restore tp->rcv_nxt, if DCTCP notices that the incoming CE bits have been changed while the receiver thinks it is holding on to a delayed ACK (in dctcp_ce_state_0_to_1() and dctcp_ce_state_1_to_0()). I wonder if the "synthetic" call to tcp_send_ack() somehow has side effects in the delayed ACK state machine that can cause the connection to forget that it still needs to fire a delayed ACK, even though it just sent an ACK just now. neal
Re: [PATCH net] tcp: add one more quick ack after after ECN events
On Wed, Jun 27, 2018 at 11:47 AM Eric Dumazet wrote:
>
> Larry Brakmo proposal ( https://patchwork.ozlabs.org/patch/935233/
> tcp: force cwnd at least 2 in tcp_cwnd_reduction) made us rethink
> about our recent patch removing ~16 quick acks after ECN events.
>
> tcp_enter_quickack_mode(sk, 1) makes sure one immediate ack is sent,
> but in the case the sender cwnd was lowered to 1, we do not want
> to have a delayed ack for the next packet we will receive.
>
> Fixes: 522040ea5fdd ("tcp: do not aggressively quick ack after ECN events")
> Signed-off-by: Eric Dumazet
> Reported-by: Neal Cardwell
> Cc: Lawrence Brakmo
> ---
>  net/ipv4/tcp_input.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Acked-by: Neal Cardwell

Thanks, Eric!

neal
Re: [PATCH net-next v2] tcp: force cwnd at least 2 in tcp_cwnd_reduction
On Tue, Jun 26, 2018 at 10:34 PM Lawrence Brakmo wrote:
> The only issue is if it is safe to always use 2 or if it is better to
> use min(2, snd_ssthresh) (which could still trigger the problem).

Always using 2 SGTM. I don't think we need min(2, snd_ssthresh), as that should be the same as just 2, since:

(a) RFCs mandate ssthresh should not be below 2, e.g. https://tools.ietf.org/html/rfc5681 page 7:

      ssthresh = max (FlightSize / 2, 2*SMSS)            (4)

(b) The main loss-based CCs used in Linux (CUBIC, Reno, DCTCP) respect that constraint, and always have an ssthresh of at least 2.

And if some CC misbehaves and uses a lower ssthresh, then taking min(2, snd_ssthresh) would trigger problems, as you note.

> +       tp->snd_cwnd = max((int)tcp_packets_in_flight(tp) + sndcnt, 2);

AFAICT this does seem like it will make the sender behavior more aggressive in cases with high loss and/or a very low per-flow fair share.

Old:
o send N packets
o receive SACKs for last 3 packets
o fast retransmit packet 1
o using ACKs, slow-start upward

New:
o send N packets
o receive SACKs for last 3 packets
o fast retransmit packets 1 and 2
o using ACKs, slow-start upward

In the extreme case, if the available fair share is less than 2 packets, whereas inflight would have oscillated between 1 packet and 2 packets with the existing code, with this commit the inflight will now hover at 2. It seems like this would have significantly higher losses than we had with the existing code. This may or may not be OK in practice, but IMHO it is worth mentioning and discussing.

neal
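The behavioral difference under discussion reduces to a one-line floor on the result of the PRR computation. Two hypothetical helpers contrasting the old and proposed behavior (the real change is inside tcp_cwnd_reduction(); these just isolate the arithmetic):

```c
#include <assert.h>

/* Old behavior: cwnd = in_flight + sndcnt, which can reach 1. */
static int cwnd_old(int in_flight, int sndcnt)
{
	return in_flight + sndcnt;
}

/* Proposed behavior: never let the resulting cwnd drop below 2,
 * so a second dupack-eliciting packet can always be in flight. */
static int cwnd_new(int in_flight, int sndcnt)
{
	int cwnd = in_flight + sndcnt;

	return cwnd > 2 ? cwnd : 2;
}
```

The "more aggressive" concern above is exactly the first case: where the old code could let cwnd collapse to 1, the new code keeps 2 packets in flight.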
Re: [PATCH net-next] tcp: remove one indentation level in tcp_create_openreq_child
On Tue, Jun 26, 2018 at 11:46 AM Eric Dumazet wrote:
>
> Signed-off-by: Eric Dumazet
> ---
>  net/ipv4/tcp_minisocks.c | 223 ---
>  1 file changed, 113 insertions(+), 110 deletions(-)

Yes, very nice clean-up! Thanks for doing this.

Acked-by: Neal Cardwell

neal
Re: [PATCH net] sctp: not allow to set rto_min with a value below 200 msecs
On Tue, May 29, 2018 at 11:45 AM Marcelo Ricardo Leitner <marcelo.leit...@gmail.com> wrote:
> - patch2 - fix rtx attack vector
>   - Add the floor value to rto_min to HZ/20 (which fits the values
>     that Michael shared on the other email)

I would encourage allowing minimum RTO values down to 5ms, if the ACK policy in the receiver makes this feasible. Our experience is that in datacenter environments it can be advantageous to allow timer-based loss recoveries using timeout values as low as 5ms, e.g.:

https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf
https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00

cheers,
neal
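The tension here is concrete: a floor of HZ/20 jiffies is 50 ms at HZ=1000, an order of magnitude above the 5 ms recoveries Neal describes. A hypothetical clamp helper showing what the proposed floor would do to such a request (HZ and the helper name are assumptions for illustration):

```c
#include <assert.h>

#define HZ 1000                      /* assume 1000 jiffies per second */
#define RTO_MIN_FLOOR (HZ / 20)      /* proposed floor: 50 jiffies = 50 ms */

/* Hypothetical sketch of flooring a requested rto_min (in jiffies;
 * with HZ=1000, jiffies and milliseconds coincide). */
static int clamp_rto_min(int requested)
{
	return requested < RTO_MIN_FLOOR ? RTO_MIN_FLOOR : requested;
}
```

Under this floor, a datacenter operator asking for a 5 ms minimum RTO would silently get 50 ms, which is the behavior the reply argues against.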
Re: [PATCH net-next 1/2] tcp: add max_quickacks param to tcp_incr_quickack and tcp_enter_quickack_mode
On Tue, May 22, 2018 at 8:31 PM kbuild test robot wrote:
> Hi Eric,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on net/master]
> [also build test ERROR on v4.17-rc6 next-20180517]
> [cannot apply to net-next/master]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url: https://github.com/0day-ci/linux/commits/Eric-Dumazet/tcp-add-max_quickacks-param-to-tcp_incr_quickack-and-tcp_enter_quickack_mode/20180523-075103
> config: i386-randconfig-x012-201820 (attached as .config)
> compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386
>
> All errors (new ones prefixed by >>):
>
>    net//ipv4/tcp_input.c: In function 'tcp_data_queue':
> >> net//ipv4/tcp_input.c:4656:2: error: too few arguments to function 'tcp_enter_quickack_mode'
>      tcp_enter_quickack_mode(sk);
>      ^~~~~~~~~~~~~~~~~~~~~~~
>    net//ipv4/tcp_input.c:199:13: note: declared here
>     static void tcp_enter_quickack_mode(struct sock *sk, unsigned int max_quickacks)
>                 ^~~~~~~~~~~~~~~~~~~~~~~
...

For the record, this is an error in the tool, rather than the patch. The tool seems to be using a stale net-next tree for building this patch. The compile error is at line 4656:

> ^1da177e4 Linus Torvalds            2005-04-16  4652   /* Out of window. F.e. zero window probe. */
> ^1da177e4 Linus Torvalds            2005-04-16  4653   if (!before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt + tcp_receive_window(tp)))
> ^1da177e4 Linus Torvalds            2005-04-16  4654           goto out_of_window;
> ^1da177e4 Linus Torvalds            2005-04-16  4655
> 463c84b97 Arnaldo Carvalho de Melo  2005-08-09 @4656   tcp_enter_quickack_mode(sk);
> ^1da177e4 Linus Torvalds            2005-04-16  4657
> ^1da177e4 Linus Torvalds            2005-04-16  4658   if (before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
> ^1da177e4 Linus Torvalds            2005-04-16  4659           /* Partial packet, seq < rcv_next < end_seq */
...
But that line is not in net-next any more, after Eric's recent net-next commit a3893637e1eb0e ("tcp: do not force quickack when receiving out-of-order packets"):

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/net/ipv4/tcp_input.c?id=a3893637e1eb0ef5eb1bbc52b3a8d2dfa317a35d

That commit removed that line:

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 0bf032839548f..f5622b2506651 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4715,8 +4715,6 @@ drop:
 		if (!before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt + tcp_receive_window(tp)))
 			goto out_of_window;
 
-		tcp_enter_quickack_mode(sk);
-
 		if (before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
 			/* Partial packet, seq < rcv_next < end_seq */
 			SOCK_DEBUG(sk, "partial packet: rcv_next %X seq %X - %X\n",

cheers,
neal
Re: [PATCH net-next 2/2] tcp: do not aggressively quick ack after ECN events
On Mon, May 21, 2018 at 6:09 PM Eric Dumazet wrote:
> ECN signals currently forces TCP to enter quickack mode for
> up to 16 (TCP_MAX_QUICKACKS) following incoming packets.
>
> We believe this is not needed, and only sending one immediate ack
> for the current packet should be enough.
>
> This should reduce the extra load noticed in DCTCP environments,
> after congestion events.
>
> This is part 2 of our effort to reduce pure ACK packets.
>
> Signed-off-by: Eric Dumazet
> ---

Acked-by: Neal Cardwell

Thanks!

neal
Re: [PATCH net-next 1/2] tcp: add max_quickacks param to tcp_incr_quickack and tcp_enter_quickack_mode
On Mon, May 21, 2018 at 6:09 PM Eric Dumazet wrote:
> We want to add finer control of the number of ACK packets sent after
> ECN events.
>
> This patch is not changing current behavior, it only enables following
> change.
>
> Signed-off-by: Eric Dumazet
> ---

Acked-by: Neal Cardwell

Thanks!

neal
Re: [PATCH v3 net-next 6/6] tcp: add tcp_comp_sack_nr sysctl
On Thu, May 17, 2018 at 5:47 PM Eric Dumazet wrote:
> This per netns sysctl allows for TCP SACK compression fine-tuning.
>
> This limits number of SACK that can be compressed.
> Using 0 disables SACK compression.
>
> Signed-off-by: Eric Dumazet
> ---

Acked-by: Neal Cardwell

Thanks!

neal
Re: [PATCH v3 net-next 5/6] tcp: add tcp_comp_sack_delay_ns sysctl
On Thu, May 17, 2018 at 5:47 PM Eric Dumazet wrote:
> This per netns sysctl allows for TCP SACK compression fine-tuning.
>
> Its default value is 1,000,000, or 1 ms to meet TSO autosizing period.
>
> Signed-off-by: Eric Dumazet
> ---

Acked-by: Neal Cardwell

Thanks!

neal
Re: [PATCH v3 net-next 3/6] tcp: add SACK compression
On Thu, May 17, 2018 at 5:47 PM Eric Dumazet wrote:
> When TCP receives an out-of-order packet, it immediately sends
> a SACK packet, generating network load but also forcing the
> receiver to send 1-MSS pathological packets, increasing its
> RTX queue length/depth, and thus processing time.
>
> Wifi networks suffer from this aggressive behavior, but generally
> speaking, all these SACK packets add fuel to the fire when networks
> are under congestion.
>
> This patch adds a high resolution timer and tp->compressed_ack counter.
>
> Instead of sending a SACK, we program this timer with a small delay,
> based on RTT and capped to 1 ms :
>
>	delay = min ( 5 % of RTT, 1 ms)
>
> If subsequent SACKs need to be sent while the timer has not yet
> expired, we simply increment tp->compressed_ack.
>
> When timer expires, a SACK is sent with the latest information.
> Whenever an ACK is sent (if data is sent, or if in-order
> data is received) timer is canceled.
>
> Note that tcp_sack_new_ofo_skb() is able to force a SACK to be sent
> if the sack blocks need to be shuffled, even if the timer has not
> expired.
>
> A new SNMP counter is added in the following patch.
>
> Two other patches add sysctls to allow changing the 1,000,000 and 44
> values that this commit hard-coded.
>
> Signed-off-by: Eric Dumazet
> ---

Very nice. I like the constants and the min(rcv_rtt, srtt).

Acked-by: Neal Cardwell

Thanks!

neal
Re: [PATCH net-next 3/4] tcp: add SACK compression
On Thu, May 17, 2018 at 11:40 AM Eric Dumazet wrote:
> On 05/17/2018 08:14 AM, Neal Cardwell wrote:
> > Is there a particular motivation for the cap of 127? IMHO 127 ACKs is quite
> > a few to compress. Experience seems to show that it works well to have one
> > GRO ACK for ~64KBytes that triggers a single TSO skb of ~64KBytes. It might
> > be nice to try to match those dynamics in this SACK compression case, so it
> > might be nice to cap the number of compressed ACKs at something like 44?
> > (0xffff / 1448 - 1). That way for high-speed paths we could try to keep
> > the ACK clock going with ACKs for ~64KBytes that trigger a single TSO skb
> > of ~64KBytes, no matter whether we are sending SACKs or cumulative ACKs.
>
> 127 was chosen because the field is u8, and since skb allocation for the ACK
> can fail, we could have cases where the field goes above 127.
>
> Ultimately, I believe a followup patch would add a sysctl, so that we can fine-tune
> this, and eventually disable ACK compression if this sysctl is set to 0

OK, a sysctl sounds good. I would still vote for a default of 44. :-)

> >> +	if (hrtimer_is_queued(&tp->compressed_ack_timer))
> >> +		return;
> >> +
> >> +	/* compress ack timer : 5 % of srtt, but no more than 2.5 ms */
> >> +
> >> +	delay = min_t(unsigned long, 2500 * NSEC_PER_USEC,
> >> +		      tp->rcv_rtt_est.rtt_us * (NSEC_PER_USEC >> 3)/20);
> >
> > Any particular motivation for the 2.5ms here? It might be nice to match the
> > existing TSO autosizing dynamics and use 1ms here instead of having a
> > separate new constant of 2.5ms. Smaller time scales here should lead to
> > less burstiness and queue pressure from data packets in the network, and we
> > know from experience that the CPU overhead of 1ms chunks is acceptable.
>
> This came from my tests on wifi really :)
>
> I also had the idea to make this threshold adjustable for wifi, like we did for sk_pacing_shift.
> (On wifi, we might want to increase the max delay between ACKs)
>
> So maybe use 1ms delay, when sk_pacing_shift == 10, but increase it if sk_pacing_shift has been lowered.

Sounds good to me. Thanks for implementing this! Overall this patch seems nice to me.

Acked-by: Neal Cardwell

BTW, I guess we should spread the word to maintainers of other major TCP stacks that they need to be prepared for what may be a much higher degree of compression/aggregation in the SACK stream. Linux stacks going back many years should be fine with this, but I'm not sure about the other major OSes (they may only allow sending one MSS per ACK-with-SACKs received).

neal
Re: [PATCH net-next 3/4] tcp: add SACK compression
On Thu, May 17, 2018 at 8:12 AM Eric Dumazet wrote:
> When TCP receives an out-of-order packet, it immediately sends
> a SACK packet, generating network load but also forcing the
> receiver to send 1-MSS pathological packets, increasing its
> RTX queue length/depth, and thus processing time.
>
> Wifi networks suffer from this aggressive behavior, but generally
> speaking, all these SACK packets add fuel to the fire when networks
> are under congestion.
>
> This patch adds a high resolution timer and tp->compressed_ack counter.
>
> Instead of sending a SACK, we program this timer with a small delay,
> based on SRTT and capped to 2.5 ms : delay = min ( 5 % of SRTT, 2.5 ms)
...

Very nice. Thanks for implementing this, Eric! I was wondering if the constants here might be worth some discussion/elaboration.

> @@ -5074,6 +5076,7 @@ static inline void tcp_data_snd_check(struct sock *sk)
>  static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
>  {
>  	struct tcp_sock *tp = tcp_sk(sk);
> +	unsigned long delay;
>
>  	/* More than one full frame received... */
>  	if (((tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss &&
> @@ -5085,15 +5088,31 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
>  	    (tp->rcv_nxt - tp->copied_seq < sk->sk_rcvlowat ||
>  	     __tcp_select_window(sk) >= tp->rcv_wnd)) ||
>  	    /* We ACK each frame or... */
> -	    tcp_in_quickack_mode(sk) ||
> -	    /* We have out of order data. */
> -	    (ofo_possible && !RB_EMPTY_ROOT(&tp->out_of_order_queue))) {
> -		/* Then ack it now */
> +	    tcp_in_quickack_mode(sk)) {
> +send_now:
>  		tcp_send_ack(sk);
> -	} else {
> -		/* Else, send delayed ack. */
> +		return;
> +	}
> +
> +	if (!ofo_possible || RB_EMPTY_ROOT(&tp->out_of_order_queue)) {
>  		tcp_send_delayed_ack(sk);
> +		return;
>  	}
> +
> +	if (!tcp_is_sack(tp) || tp->compressed_ack >= 127)
> +		goto send_now;
> +	tp->compressed_ack++;

Is there a particular motivation for the cap of 127? IMHO 127 ACKs is quite a few to compress. Experience seems to show that it works well to have one GRO ACK for ~64KBytes that triggers a single TSO skb of ~64KBytes. It might be nice to try to match those dynamics in this SACK compression case, so it might be nice to cap the number of compressed ACKs at something like 44? (0xffff / 1448 - 1). That way for high-speed paths we could try to keep the ACK clock going with ACKs for ~64KBytes that trigger a single TSO skb of ~64KBytes, no matter whether we are sending SACKs or cumulative ACKs.

> +
> +	if (hrtimer_is_queued(&tp->compressed_ack_timer))
> +		return;
> +
> +	/* compress ack timer : 5 % of srtt, but no more than 2.5 ms */
> +
> +	delay = min_t(unsigned long, 2500 * NSEC_PER_USEC,
> +		      tp->rcv_rtt_est.rtt_us * (NSEC_PER_USEC >> 3)/20);

Any particular motivation for the 2.5ms here? It might be nice to match the existing TSO autosizing dynamics and use 1ms here instead of having a separate new constant of 2.5ms. Smaller time scales here should lead to less burstiness and queue pressure from data packets in the network, and we know from experience that the CPU overhead of 1ms chunks is acceptable.

thanks,
neal
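The quoted delay computation can be checked numerically. A sketch, under the assumption that rcv_rtt_est.rtt_us is stored left-shifted by 3 (as srtt is in the kernel), so that (NSEC_PER_USEC >> 3) both undoes the shift and converts microseconds to nanoseconds, and /20 takes 5%:

```c
#include <assert.h>

#define NSEC_PER_USEC 1000UL

/* Sketch of the quoted min_t() expression: 5% of the RTT, capped at
 * 2.5 ms, returned in nanoseconds. rtt_us_shifted carries the usec
 * RTT left-shifted by 3, matching tp->rcv_rtt_est.rtt_us. */
static unsigned long compressed_ack_delay_ns(unsigned long rtt_us_shifted)
{
	unsigned long delay = rtt_us_shifted * (NSEC_PER_USEC >> 3) / 20;
	unsigned long cap = 2500 * NSEC_PER_USEC;   /* 2.5 ms */

	return delay < cap ? delay : cap;
}
```

For a 10 ms RTT this yields 500 us (5%); any RTT above 50 ms hits the 2.5 ms cap. The proposed cap of 44 compressed ACKs likewise falls out of 0xffff / 1448 - 1.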
Re: [PATCH net-next 2/4] tcp: do not force quickack when receiving out-of-order packets
On Thu, May 17, 2018 at 8:12 AM Eric Dumazet wrote:
> As explained in commit 9f9843a751d0 ("tcp: properly handle stretch
> acks in slow start"), TCP stacks have to consider how many packets
> are acknowledged in one single ACK, because of GRO, but also
> because of ACK compression or losses.
>
> We plan to add SACK compression in the following patch, we
> must therefore not call tcp_enter_quickack_mode()
>
> Signed-off-by: Eric Dumazet
> ---

Acked-by: Neal Cardwell

Thanks!

neal
Re: [PATCH net-next 4/4] tcp: add TCPAckCompressed SNMP counter
On Thu, May 17, 2018 at 8:12 AM Eric Dumazet wrote:
> This counter tracks number of ACK packets that the host has not sent,
> thanks to ACK compression.
>
> Sample output :
>
> $ nstat -n;sleep 1;nstat|egrep "IpInReceives|IpOutRequests|TcpInSegs|TcpOutSegs|TcpExtTCPAckCompressed"
> IpInReceives                    123250             0.0
> IpOutRequests                   3684               0.0
> TcpInSegs                       123251             0.0
> TcpOutSegs                      3684               0.0
> TcpExtTCPAckCompressed          119252             0.0
>
> Signed-off-by: Eric Dumazet
> ---

Acked-by: Neal Cardwell

Thanks!

neal
Re: [PATCH net-next 1/4] tcp: use __sock_put() instead of sock_put() in tcp_clear_xmit_timers()
On Thu, May 17, 2018 at 8:12 AM Eric Dumazet wrote:
> Socket can not disappear under us.
>
> Signed-off-by: Eric Dumazet
> ---

Acked-by: Neal Cardwell

Thanks!

neal
Re: [PATCH net] tcp: purge write queue in tcp_connect_init()
On Tue, May 15, 2018 at 12:14 AM Eric Dumazet wrote:
> syzkaller found a reliable way to crash the host, hitting a BUG()
> in __tcp_retransmit_skb()
>
> Malicious MSG_FASTOPEN is the root cause. We need to purge write queue
> in tcp_connect_init() at the point we init snd_una/write_seq.
>
> This patch also replaces the BUG() by a less intrusive WARN_ON_ONCE()
>
> kernel BUG at net/ipv4/tcp_output.c:2837!
...
> Fixes: cf60af03ca4e ("net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN)")
> Signed-off-by: Eric Dumazet
> Cc: Yuchung Cheng
> Cc: Neal Cardwell
> Reported-by: syzbot
> ---

Acked-by: Neal Cardwell

Thanks, Eric!

neal
Re: [PATCH net] tcp: restore autocorking
On Wed, May 2, 2018 at 11:25 PM Eric Dumazet wrote:
> When adding rb-tree for TCP retransmit queue, we inadvertently broke
> TCP autocorking.
>
> tcp_should_autocork() should really check if the rtx queue is not empty.
...
> Fixes: 75c119afe14f ("tcp: implement rb-tree based retransmit queue")
> Signed-off-by: Eric Dumazet
> Reported-by: Michael Wenig
> Tested-by: Michael Wenig
> ---

Acked-by: Neal Cardwell

Nice. Thanks, Eric!

neal
[PATCH net] tcp_bbr: fix to zero idle_restart only upon S/ACKed data

Previously the bbr->idle_restart tracking was zeroing out the
bbr->idle_restart bit upon ACKs that did not SACK or ACK anything,
e.g. receiving incoming data or receiver window updates. In such
situations BBR would forget that this was a restart-from-idle
situation, and if the min_rtt had expired it would unnecessarily enter
PROBE_RTT (even though we were actually restarting from idle but had
merely forgotten that fact).

The fix is simple: we need to remember we are restarting from idle
until we receive a S/ACK for some data (a S/ACK for the first flight
of data we send as we are restarting).

This commit is a stable candidate for kernels back as far as 4.9.

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Soheil Hassas Yeganeh
Signed-off-by: Priyaranjan Jha
Signed-off-by: Yousuk Seung
---
 net/ipv4/tcp_bbr.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 158d105e76da1..58e2f479ffb4d 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -806,7 +806,9 @@ static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs)
 			}
 		}
 	}
-	bbr->idle_restart = 0;
+	/* Restart after idle ends only once we process a new S/ACK for data */
+	if (rs->delivered > 0)
+		bbr->idle_restart = 0;
 }
 
 static void bbr_update_model(struct sock *sk, const struct rate_sample *rs)
-- 
2.17.0.441.gb46fe60e1d-goog
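The gating condition in the diff above can be sketched in isolation. A hedged simplification (the struct and helper name are illustrative; in the kernel this is the tail of bbr_update_min_rtt()):

```c
#include <assert.h>

struct bbr_state { int idle_restart; };

/* Sketch of the fix: clear idle_restart only when the incoming ACK
 * actually S/ACKed data (rs->delivered > 0), not on pure window
 * updates or incoming data, which deliver nothing. */
static void maybe_clear_idle_restart(struct bbr_state *bbr, int delivered)
{
	if (delivered > 0)
		bbr->idle_restart = 0;
}
```

With the old unconditional clear, a window update arriving right after idle would drop the flag and could push BBR into an unnecessary PROBE_RTT; with the gate, the flag survives until the first post-idle flight is S/ACKed.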
Previously the bbr->idle_restart tracking was zeroing out the bbr->idle_restart bit upon ACKs that did not SACK or ACK anything, e.g. receiving incoming data or receiver window updates. In such situations BBR would forget that this was a restart-from-idle situation, and if the min_rtt had expired it would unnecessarily enter PROBE_RTT (even though we were actually restarting from idle but had merely forgotten that fact). The fix is simple: we need to remember we are restarting from idle until we receive a S/ACK for some data (a S/ACK for the first flight of data we send as we are restarting). This commit is a stable candidate for kernels back as far as 4.9. Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control") Signed-off-by: Neal Cardwell Signed-off-by: Yuchung Cheng Signed-off-by: Soheil Hassas Yeganeh Signed-off-by: Priyaranjan Jha Signed-off-by: Yousuk Seung --- net/ipv4/tcp_bbr.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c index 158d105e76da1..58e2f479ffb4d 100644 --- a/net/ipv4/tcp_bbr.c +++ b/net/ipv4/tcp_bbr.c @@ -806,7 +806,9 @@ static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs) } } } - bbr->idle_restart = 0; + /* Restart after idle ends only once we process a new S/ACK for data */ + if (rs->delivered > 0) + bbr->idle_restart = 0; } static void bbr_update_model(struct sock *sk, const struct rate_sample *rs) -- 2.17.0.441.gb46fe60e1d-goog