Re: [PATCH v3 net-next 16/16] tcp_bbr: add BBR congestion control
> It generates some slightly smaller code.
>
>	if (bbr->lt_rtt_cnt < bbr_lt_intvl_min_rtts)
> -     3e7:	0f b6 c0             	movzbl %al,%eax
> -     3ea:	83 f8 03             	cmp    $0x3,%eax
> -     3ed:	0f 86 d4 00 00 00    	jbe    4c7
> +     3e7:	3c 03                	cmp    $0x3,%al
> +     3e9:	0f 86 d1 00 00 00    	jbe    4c0
>
> And different code for abs
>
>		/* Is new bw close to the lt_bw from the previous interval? */
>		diff = abs(bw - bbr->lt_bw);
> -     47a:	44 89 e2             	mov    %r12d,%edx
> -     47d:	29 c2                	sub    %eax,%edx
> -     47f:	89 d1                	mov    %edx,%ecx
> -     481:	89 d3                	mov    %edx,%ebx
> +     475:	44 89 e3             	mov    %r12d,%ebx
> +     478:	29 c3                	sub    %eax,%ebx
> +     47a:	89 da                	mov    %ebx,%edx
> +     47c:	c1 fa 1f             	sar    $0x1f,%edx
> +     47f:	31 d3                	xor    %edx,%ebx
> +     481:	29 d3                	sub    %edx,%ebx
>
> The biggest change is getting rid of the always true conditional.
> ...
>
> Overall, it really makes little difference. Actual file sizes come out the
> same. The idea is more to document what is variable and what is immutable
> in the algorithm.

SGTM, thanks Stephen!
Re: [PATCH v3 net-next 16/16] tcp_bbr: add BBR congestion control
On Mon, 19 Sep 2016 14:10:39 -0700 Eric Dumazet wrote:

> On Mon, Sep 19, 2016 at 1:57 PM, Stephen Hemminger wrote:
> >
> > Looks good, but could I suggest a simple optimization.
> > All these parameters are immutable in the version of BBR you are
> > submitting. Why not make the values const? And eliminate the always
> > true long-term bw estimate variable?
>
> We could do that.
>
> We used to have variables (aka module params) while BBR was cooking in
> our kernels ;)
>
> Are you sure generated code is indeed 'optimized' ?

It generates some slightly smaller code.

	if (bbr->lt_rtt_cnt < bbr_lt_intvl_min_rtts)
-     3e7:	0f b6 c0             	movzbl %al,%eax
-     3ea:	83 f8 03             	cmp    $0x3,%eax
-     3ed:	0f 86 d4 00 00 00    	jbe    4c7
+     3e7:	3c 03                	cmp    $0x3,%al
+     3e9:	0f 86 d1 00 00 00    	jbe    4c0

And different code for abs

		/* Is new bw close to the lt_bw from the previous interval? */
		diff = abs(bw - bbr->lt_bw);
-     47a:	44 89 e2             	mov    %r12d,%edx
-     47d:	29 c2                	sub    %eax,%edx
-     47f:	89 d1                	mov    %edx,%ecx
-     481:	89 d3                	mov    %edx,%ebx
+     475:	44 89 e3             	mov    %r12d,%ebx
+     478:	29 c3                	sub    %eax,%ebx
+     47a:	89 da                	mov    %ebx,%edx
+     47c:	c1 fa 1f             	sar    $0x1f,%edx
+     47f:	31 d3                	xor    %edx,%ebx
+     481:	29 d3                	sub    %edx,%ebx

The biggest change is getting rid of the always true conditional.

-	u32 diff;
-
-	if (bbr->lt_bw &&	    /* do we have bw from a previous interval? */
-	    bbr_lt_bw_estimator) {  /* using long-term bw estimator enabled? */
-		/* Is new bw close to the lt_bw from the previous interval? */
-		diff = abs(bw - bbr->lt_bw);
-     485:	c1 f9 1f             	sar    $0x1f,%ecx
-		if ((diff * BBR_UNIT <= bbr_lt_conv_thresh * bbr->lt_bw) ||
-     488:	c1 e2 05             	shl    $0x5,%edx
-	u32 diff;
-
-	if (bbr->lt_bw &&	    /* do we have bw from a previous interval? */
-	    bbr_lt_bw_estimator) {  /* using long-term bw estimator enabled? */
-		/* Is new bw close to the lt_bw from the previous interval? */
-		diff = abs(bw - bbr->lt_bw);
-     48b:	31 cb                	xor    %ecx,%ebx
-     48d:	29 cb                	sub    %ecx,%ebx
-		if ((diff * BBR_UNIT <= bbr_lt_conv_thresh * bbr->lt_bw) ||
-     48f:	89 d9                	mov    %ebx,%ecx
-     491:	c1 e1 08             	shl    $0x8,%ecx
-     494:	39 d1                	cmp    %edx,%ecx
-     496:	0f 87 6e 01 00 00    	ja     60a
+     485:	89 d9                	mov    %ebx,%ecx
+     487:	c1 e2 05             	shl    $0x5,%edx
+     48a:	c1 e1 08             	shl    $0x8,%ecx
+     48d:	39 d1                	cmp    %edx,%ecx
+     48f:	0f 87 6e 01 00 00    	ja     603

Overall, it really makes little difference. Actual file sizes come out the
same. The idea is more to document what is variable and what is immutable
in the algorithm.
Re: [PATCH v3 net-next 16/16] tcp_bbr: add BBR congestion control
On Mon, Sep 19, 2016 at 2:17 PM, Rick Jones wrote:
>
> Are there better than epsilon odds of someone perhaps wanting to poke those
> values as it gets exposure beyond Google?

This does not matter. A change would require patching net/ipv4/tcp_bbr.c,
and the 'const' attribute being there or not does not prevent the change.

I was simply asking Stephen if the compiler would actually emit very
different code, worth doing a last minute change.

The main BBR costs are divides and some multiplies, and they are using
per-socket fields.
Re: [PATCH v3 net-next 16/16] tcp_bbr: add BBR congestion control
On 09/19/2016 02:10 PM, Eric Dumazet wrote:
> On Mon, Sep 19, 2016 at 1:57 PM, Stephen Hemminger wrote:
> >
> > Looks good, but could I suggest a simple optimization.
> > All these parameters are immutable in the version of BBR you are
> > submitting. Why not make the values const? And eliminate the always
> > true long-term bw estimate variable?
>
> We could do that.
>
> We used to have variables (aka module params) while BBR was cooking in
> our kernels ;)

Are there better than epsilon odds of someone perhaps wanting to poke those
values as it gets exposure beyond Google?

happy benchmarking,

rick jones
Re: [PATCH v3 net-next 16/16] tcp_bbr: add BBR congestion control
On Mon, Sep 19, 2016 at 1:57 PM, Stephen Hemminger wrote:
>
> Looks good, but could I suggest a simple optimization.
> All these parameters are immutable in the version of BBR you are submitting.
> Why not make the values const? And eliminate the always true long-term bw
> estimate variable?

We could do that.

We used to have variables (aka module params) while BBR was cooking in
our kernels ;)

Are you sure generated code is indeed 'optimized' ?
Re: [PATCH v3 net-next 16/16] tcp_bbr: add BBR congestion control
On Sun, 18 Sep 2016 18:03:53 -0400 Neal Cardwell wrote:

> +static int bbr_bw_rtts = CYCLE_LEN + 2; /* win len of bw filter (in rounds) */
> +static u32 bbr_min_rtt_win_sec = 10;	 /* min RTT filter window (in sec) */
> +static u32 bbr_probe_rtt_mode_ms = 200;	 /* min ms at cwnd=4 in BBR_PROBE_RTT */
> +static int bbr_min_tso_rate = 120;	 /* skip TSO below here (bits/sec) */
> +
> +/* We use a high_gain value chosen to allow a smoothly increasing pacing rate
> + * that will double each RTT and send the same number of packets per RTT that
> + * an un-paced, slow-starting Reno or CUBIC flow would.
> + */
> +static int bbr_high_gain  = BBR_UNIT * 2885 / 1000 + 1;	/* 2/ln(2) */
> +static int bbr_drain_gain = BBR_UNIT * 1000 / 2885;	/* 1/high_gain */
> +static int bbr_cwnd_gain  = BBR_UNIT * 2;	/* gain for steady-state cwnd */
> +/* The pacing_gain values for the PROBE_BW gain cycle: */
> +static int bbr_pacing_gain[] = { BBR_UNIT * 5 / 4, BBR_UNIT * 3 / 4,
> +				 BBR_UNIT, BBR_UNIT, BBR_UNIT,
> +				 BBR_UNIT, BBR_UNIT, BBR_UNIT };
> +static u32 bbr_cycle_rand = 7;	/* randomize gain cycling phase over N phases */
> +
> +/* Try to keep at least this many packets in flight, if things go smoothly. For
> + * smooth functioning, a sliding window protocol ACKing every other packet
> + * needs at least 4 packets in flight.
> + */
> +static u32 bbr_cwnd_min_target = 4;
> +
> +/* To estimate if BBR_STARTUP mode (i.e. high_gain) has filled pipe. */
> +static u32 bbr_full_bw_thresh = BBR_UNIT * 5 / 4;  /* bw up 1.25x per round? */
> +static u32 bbr_full_bw_cnt = 3;	 /* N rounds w/o bw growth -> pipe full */
> +
> +/* "long-term" ("LT") bandwidth estimator parameters: */
> +static bool bbr_lt_bw_estimator = true;  /* use the long-term bw estimate? */
> +static u32 bbr_lt_intvl_min_rtts = 4;	 /* min rounds in sampling interval */
> +static u32 bbr_lt_loss_thresh = 50;	 /* lost/delivered > 20% -> "lossy" */
> +static u32 bbr_lt_conv_thresh = BBR_UNIT / 8;  /* bw diff <= 12.5% -> "close" */
> +static u32 bbr_lt_bw_max_rtts = 48;	 /* max # of round trips using lt_bw */
> +

Looks good, but could I suggest a simple optimization.
All these parameters are immutable in the version of BBR you are submitting.
Why not make the values const? And eliminate the always true long-term bw
estimate variable?

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 76baa8a..9c270a2 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -90,40 +90,41 @@ struct bbr {
 
 #define CYCLE_LEN	8	/* number of phases in a pacing gain cycle */
 
-static int bbr_bw_rtts = CYCLE_LEN + 2; /* win len of bw filter (in rounds) */
-static u32 bbr_min_rtt_win_sec = 10;	/* min RTT filter window (in sec) */
-static u32 bbr_probe_rtt_mode_ms = 200;	/* min ms at cwnd=4 in BBR_PROBE_RTT */
-static int bbr_min_tso_rate = 120;	/* skip TSO below here (bits/sec) */
+static const int bbr_bw_rtts = CYCLE_LEN + 2; /* win len of bw filter (in rounds) */
+static const u32 bbr_min_rtt_win_sec = 10;    /* min RTT filter window (in sec) */
+static const u32 bbr_probe_rtt_mode_ms = 200; /* min ms at cwnd=4 in BBR_PROBE_RTT */
+static const int bbr_min_tso_rate = 120;      /* skip TSO below here (bits/sec) */
 
 /* We use a high_gain value chosen to allow a smoothly increasing pacing rate
  * that will double each RTT and send the same number of packets per RTT that
  * an un-paced, slow-starting Reno or CUBIC flow would.
  */
-static int bbr_high_gain  = BBR_UNIT * 2885 / 1000 + 1;	/* 2/ln(2) */
-static int bbr_drain_gain = BBR_UNIT * 1000 / 2885;	/* 1/high_gain */
-static int bbr_cwnd_gain  = BBR_UNIT * 2;	/* gain for steady-state cwnd */
+static const int bbr_high_gain  = BBR_UNIT * 2885 / 1000 + 1;	/* 2/ln(2) */
+static const int bbr_drain_gain = BBR_UNIT * 1000 / 2885;	/* 1/high_gain */
+static const int bbr_cwnd_gain  = BBR_UNIT * 2;	/* gain for steady-state cwnd */
 /* The pacing_gain values for the PROBE_BW gain cycle: */
-static int bbr_pacing_gain[] = { BBR_UNIT * 5 / 4, BBR_UNIT * 3 / 4,
-				 BBR_UNIT, BBR_UNIT, BBR_UNIT,
-				 BBR_UNIT, BBR_UNIT, BBR_UNIT };
-static u32 bbr_cycle_rand = 7;	/* randomize gain cycling phase over N phases */
+static const int bbr_pacing_gain[] = {
+	BBR_UNIT * 5 / 4, BBR_UNIT * 3 / 4,
+	BBR_UNIT, BBR_UNIT, BBR_UNIT,
+	BBR_UNIT, BBR_UNIT, BBR_UNIT
+};
+static const u32 bbr_cycle_rand = 7; /* randomize gain cycling phase over N phases */
 
 /* Try to keep at least this many packets in flight, if things go smoothly. For
  * smooth functioning, a sliding window protocol ACKing every other packet
  * needs at least 4 packets in flight.
  */
-static u32 bbr_cwnd_min_target = 4;
+static const u32 bbr_cwnd_min_target = 4;
 
 /* To estimate if BBR_STARTUP mode (i.e. high_
Re: [PATCH v3 net-next 16/16] tcp_bbr: add BBR congestion control
On Sun, Sep 18, 2016 at 9:18 PM, Kenneth Klette Jonassen wrote:
> > +static u64 bbr_rate_kbps(struct sock *sk, u64 rate)
> > +{
> > +	return bbr_rate_bytes_per_sec(sk, rate, BBR_UNIT) * 8 / 1000;
>
> Consider div_u64() here to keep all builds happy. :-) This adds __udivdi3()
> with the nxp powerpc toolchains I'm using.
>
> (Or have the one caller use bbr_rate_bytes_per_sec() instead.)

Thanks, Kenneth. We will fix this in the next revision (we like your
suggested approach of just using bbr_rate_bytes_per_sec()).

cheers,
neal
[PATCH v3 net-next 16/16] tcp_bbr: add BBR congestion control
This commit implements a new TCP congestion control algorithm: BBR
(Bottleneck Bandwidth and RTT). A detailed description of BBR will be
published in ACM Queue, Vol. 14 No. 5, September-October 2016, as
"BBR: Congestion-Based Congestion Control".

BBR has significantly increased throughput and reduced latency for
connections on Google's internal backbone networks and google.com and
YouTube Web servers.

BBR requires only changes on the sender side, not in the network or the
receiver side. Thus it can be incrementally deployed on today's Internet,
or in datacenters.

The Internet has predominantly used loss-based congestion control
(largely Reno or CUBIC) since the 1980s, relying on packet loss as the
signal to slow down. While this worked well for many years, loss-based
congestion control is unfortunately out-dated in today's networks. On
today's Internet, loss-based congestion control causes the infamous
bufferbloat problem, often causing seconds of needless queuing delay,
since it fills the bloated buffers in many last-mile links. On today's
high-speed long-haul links using commodity switches with shallow
buffers, loss-based congestion control has abysmal throughput because it
over-reacts to losses caused by transient traffic bursts.

In 1981 Kleinrock and Gale showed that the optimal operating point for a
network maximizes delivered bandwidth while minimizing delay and loss,
not only for single connections but for the network as a whole. Finding
that optimal operating point has been elusive, since any single network
measurement is ambiguous: network measurements are the result of both
bandwidth and propagation delay, and those two cannot be measured
simultaneously.

While it is impossible to disambiguate any single bandwidth or RTT
measurement, a connection's behavior over time tells a clearer story.
BBR uses a measurement strategy designed to resolve this ambiguity.
It combines these measurements with a robust servo loop using recent
control systems advances to implement a distributed congestion control
algorithm that reacts to actual congestion, not packet loss or transient
queue delay, and is designed to converge with high probability to a
point near the optimal operating point.

In a nutshell, BBR creates an explicit model of the network pipe by
sequentially probing the bottleneck bandwidth and RTT. On the arrival of
each ACK, BBR derives the current delivery rate of the last round trip,
and feeds it through a windowed max-filter to estimate the bottleneck
bandwidth. Conversely it uses a windowed min-filter to estimate the
round trip propagation delay. The max-filtered bandwidth and
min-filtered RTT estimates form BBR's model of the network pipe.

Using its model, BBR sets control parameters to govern sending behavior.
The primary control is the pacing rate: BBR applies a gain multiplier to
transmit faster or slower than the observed bottleneck bandwidth. The
conventional congestion window (cwnd) is now the secondary control; the
cwnd is set to a small multiple of the estimated BDP (bandwidth-delay
product) in order to allow full utilization and bandwidth probing while
bounding the potential amount of queue at the bottleneck.

When a BBR connection starts, it enters STARTUP mode and applies a high
gain to perform an exponential search to quickly probe the bottleneck
bandwidth (doubling its sending rate each round trip, like slow start).
However, instead of continuing until it fills up the buffer (i.e. a
loss), or until delay or ACK spacing reaches some threshold (like
Hystart), it uses its model of the pipe to estimate when that pipe is
full: it estimates the pipe is full when it notices the estimated
bandwidth has stopped growing. At that point it exits STARTUP and enters
DRAIN mode, where it reduces its pacing rate to drain the queue it
estimates it has created. Then BBR enters steady state.
In steady state, PROBE_BW mode cycles between first pacing faster to
probe for more bandwidth, then pacing slower to drain any queue that was
created if no more bandwidth was available, and then cruising at the
estimated bandwidth to utilize the pipe without creating excess queue.
Occasionally, on an as-needed basis, it sends significantly slower to
probe for RTT (PROBE_RTT mode).

Our long-term goal is to improve the congestion control algorithms used
on the Internet. We are hopeful that BBR can help advance the efforts
toward this goal, and motivate the community to do further research.

Test results, performance evaluations, feedback, and BBR-related
discussions are very welcome in the public e-mail list for BBR:

  https://groups.google.com/forum/#!forum/bbr-dev

Signed-off-by: Van Jacobson
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Nandita Dukkipati
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
---
 include/uapi/linux/inet_diag.h |  13 +
 net/ipv4/Kconfig               |  18 +
 net/ipv4/Makefile              |   1 +
 net/ipv4/tcp_bbr.c             | 875