Re: [PATCH v3 net-next 16/16] tcp_bbr: add BBR congestion control

2016-09-19 Thread Eric Dumazet
>
> It generates some slightly smaller code.
> if (bbr->lt_rtt_cnt < bbr_lt_intvl_min_rtts)
> - 3e7:  0f b6 c0movzbl %al,%eax
> - 3ea:  83 f8 03cmp$0x3,%eax
> - 3ed:  0f 86 d4 00 00 00   jbe4c7 
> + 3e7:  3c 03   cmp$0x3,%al
> + 3e9:  0f 86 d1 00 00 00   jbe4c0 
>
> And different code for abs
> /* Is new bw close to the lt_bw from the previous interval? */
> diff = abs(bw - bbr->lt_bw);
> - 47a:  44 89 e2mov%r12d,%edx
> - 47d:  29 c2   sub%eax,%edx
> - 47f:  89 d1   mov%edx,%ecx
> - 481:  89 d3   mov%edx,%ebx
> + 475:  44 89 e3mov%r12d,%ebx
> + 478:  29 c3   sub%eax,%ebx
> + 47a:  89 da   mov%ebx,%edx
> + 47c:  c1 fa 1fsar$0x1f,%edx
> + 47f:  31 d3   xor%edx,%ebx
> + 481:  29 d3   sub%edx,%ebx
>
> The biggest change is getting rid of the always true conditional.
>
...
>
> Overall, it really makes little difference. Actual file sizes come out the 
> same.
> The idea is more to document what is variable
> and what is immutable in the algorithm.

SGTM, thanks Stephen !


Re: [PATCH v3 net-next 16/16] tcp_bbr: add BBR congestion control

2016-09-19 Thread Stephen Hemminger
On Mon, 19 Sep 2016 14:10:39 -0700
Eric Dumazet  wrote:

> On Mon, Sep 19, 2016 at 1:57 PM, Stephen Hemminger
>  wrote:
> 
> > Looks good, but could I suggest a simple optimization.
> > All these parameters are immutable in the version of BBR you are submitting.
> > Why not make the values const? And eliminate the always true long-term bw 
> > estimate
> > variable?
> >  
> 
> We could do that.
> 
> We used to have variables (aka module params) while BBR was cooking in
> our kernels ;)
> 
> Are you sure generated code is indeed 'optimized' ?


It generates some slightly smaller code.
if (bbr->lt_rtt_cnt < bbr_lt_intvl_min_rtts)
- 3e7:  0f b6 c0movzbl %al,%eax
- 3ea:  83 f8 03cmp$0x3,%eax
- 3ed:  0f 86 d4 00 00 00   jbe4c7 
+ 3e7:  3c 03   cmp$0x3,%al
+ 3e9:  0f 86 d1 00 00 00   jbe4c0 

And different code for abs
/* Is new bw close to the lt_bw from the previous interval? */
diff = abs(bw - bbr->lt_bw);
- 47a:  44 89 e2mov%r12d,%edx
- 47d:  29 c2   sub%eax,%edx
- 47f:  89 d1   mov%edx,%ecx
- 481:  89 d3   mov%edx,%ebx
+ 475:  44 89 e3mov%r12d,%ebx
+ 478:  29 c3   sub%eax,%ebx
+ 47a:  89 da   mov%ebx,%edx
+ 47c:  c1 fa 1fsar$0x1f,%edx
+ 47f:  31 d3   xor%edx,%ebx
+ 481:  29 d3   sub%edx,%ebx

The biggest change is getting rid of the always true conditional.

-   u32 diff;
-
-   if (bbr->lt_bw &&  /* do we have bw from a previous interval? */
-   bbr_lt_bw_estimator) {  /* using long-term bw estimator enabled? */
-   /* Is new bw close to the lt_bw from the previous interval? */
-   diff = abs(bw - bbr->lt_bw);
- 485:  c1 f9 1fsar$0x1f,%ecx
-   if ((diff * BBR_UNIT <= bbr_lt_conv_thresh * bbr->lt_bw) ||
- 488:  c1 e2 05shl$0x5,%edx
-   u32 diff;
-
-   if (bbr->lt_bw &&  /* do we have bw from a previous interval? */
-   bbr_lt_bw_estimator) {  /* using long-term bw estimator enabled? */
-   /* Is new bw close to the lt_bw from the previous interval? */
-   diff = abs(bw - bbr->lt_bw);
- 48b:  31 cb   xor%ecx,%ebx
- 48d:  29 cb   sub%ecx,%ebx
-   if ((diff * BBR_UNIT <= bbr_lt_conv_thresh * bbr->lt_bw) ||
- 48f:  89 d9   mov%ebx,%ecx
- 491:  c1 e1 08shl$0x8,%ecx
- 494:  39 d1   cmp%edx,%ecx
- 496:  0f 87 6e 01 00 00   ja 60a 
+ 485:  89 d9   mov%ebx,%ecx
+ 487:  c1 e2 05shl$0x5,%edx
+ 48a:  c1 e1 08shl$0x8,%ecx
+ 48d:  39 d1   cmp%edx,%ecx
+ 48f:  0f 87 6e 01 00 00   ja 603 


Overall, it really makes little difference. Actual file sizes come out the same.
The idea is more to document what is variable
and what is immutable in the algorithm.



Re: [PATCH v3 net-next 16/16] tcp_bbr: add BBR congestion control

2016-09-19 Thread Eric Dumazet
On Mon, Sep 19, 2016 at 2:17 PM, Rick Jones  wrote:

>
> Are there better than epsilon odds of someone perhaps wanting to poke those
> values as it gets exposure beyond Google?
>

This does not matter.

A change would require patching net/ipv4/tcp_bbr.c , and the 'const'
attribute being there or not does not prevent the change.

I was simply asking to Stephen if the compiler would actually emit a
very different code, worth doing a last minute change.

The main BBR costs are divides and some multiplies , and they are
using per socket fields.


Re: [PATCH v3 net-next 16/16] tcp_bbr: add BBR congestion control

2016-09-19 Thread Rick Jones

On 09/19/2016 02:10 PM, Eric Dumazet wrote:

On Mon, Sep 19, 2016 at 1:57 PM, Stephen Hemminger
 wrote:


Looks good, but could I suggest a simple optimization.
All these parameters are immutable in the version of BBR you are submitting.
Why not make the values const? And eliminate the always true long-term bw 
estimate
variable?



We could do that.

We used to have variables (aka module params) while BBR was cooking in
our kernels ;)


Are there better than epsilon odds of someone perhaps wanting to poke 
those values as it gets exposure beyond Google?


happy benchmarking,

rick jones


Re: [PATCH v3 net-next 16/16] tcp_bbr: add BBR congestion control

2016-09-19 Thread Eric Dumazet
On Mon, Sep 19, 2016 at 1:57 PM, Stephen Hemminger
 wrote:

> Looks good, but could I suggest a simple optimization.
> All these parameters are immutable in the version of BBR you are submitting.
> Why not make the values const? And eliminate the always true long-term bw 
> estimate
> variable?
>

We could do that.

We used to have variables (aka module params) while BBR was cooking in
our kernels ;)

Are you sure generated code is indeed 'optimized' ?


Re: [PATCH v3 net-next 16/16] tcp_bbr: add BBR congestion control

2016-09-19 Thread Stephen Hemminger
On Sun, 18 Sep 2016 18:03:53 -0400
Neal Cardwell  wrote:

> +static int bbr_bw_rtts   = CYCLE_LEN + 2; /* win len of bw filter (in 
> rounds) */
> +static u32 bbr_min_rtt_win_sec = 10;  /* min RTT filter window (in sec) */
> +static u32 bbr_probe_rtt_mode_ms = 200;   /* min ms at cwnd=4 in 
> BBR_PROBE_RTT */
> +static int bbr_min_tso_rate  = 120;  /* skip TSO below here (bits/sec) */
> +
> +/* We use a high_gain value chosen to allow a smoothly increasing pacing rate
> + * that will double each RTT and send the same number of packets per RTT that
> + * an un-paced, slow-starting Reno or CUBIC flow would.
> + */
> +static int bbr_high_gain  = BBR_UNIT * 2885 / 1000 + 1;  /* 2/ln(2) */
> +static int bbr_drain_gain = BBR_UNIT * 1000 / 2885;  /* 1/high_gain */
> +static int bbr_cwnd_gain  = BBR_UNIT * 2;/* gain for steady-state cwnd */
> +/* The pacing_gain values for the PROBE_BW gain cycle: */
> +static int bbr_pacing_gain[] = { BBR_UNIT * 5 / 4, BBR_UNIT * 3 / 4,
> +  BBR_UNIT, BBR_UNIT, BBR_UNIT,
> +  BBR_UNIT, BBR_UNIT, BBR_UNIT };
> +static u32 bbr_cycle_rand = 7;  /* randomize gain cycling phase over N 
> phases */
> +
> +/* Try to keep at least this many packets in flight, if things go smoothly. 
> For
> + * smooth functioning, a sliding window protocol ACKing every other packet
> + * needs at least 4 packets in flight.
> + */
> +static u32 bbr_cwnd_min_target   = 4;
> +
> +/* To estimate if BBR_STARTUP mode (i.e. high_gain) has filled pipe. */
> +static u32 bbr_full_bw_thresh = BBR_UNIT * 5 / 4;  /* bw up 1.25x per round? 
> */
> +static u32 bbr_full_bw_cnt= 3;/* N rounds w/o bw growth -> pipe full 
> */
> +
> +/* "long-term" ("LT") bandwidth estimator parameters: */
> +static bool bbr_lt_bw_estimator = true;  /* use the long-term bw 
> estimate? */
> +static u32 bbr_lt_intvl_min_rtts = 4;/* min rounds in sampling 
> interval */
> +static u32 bbr_lt_loss_thresh = 50;  /*  lost/delivered > 20% -> "lossy" */
> +static u32 bbr_lt_conv_thresh = BBR_UNIT / 8;  /* bw diff <= 12.5% -> 
> "close" */
> +static u32 bbr_lt_bw_max_rtts= 48;   /* max # of round trips using 
> lt_bw */
> +

Looks good, but could I suggest a simple optimization.
All these parameters are immutable in the version of BBR you are submitting.
Why not make the values const? And eliminate the always true long-term bw 
estimate
variable?

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 76baa8a..9c270a2 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -90,40 +90,41 @@ struct bbr {
 
 #define CYCLE_LEN  8   /* number of phases in a pacing gain cycle */
 
-static int bbr_bw_rtts = CYCLE_LEN + 2; /* win len of bw filter (in rounds) */
-static u32 bbr_min_rtt_win_sec = 10;/* min RTT filter window (in sec) */
-static u32 bbr_probe_rtt_mode_ms = 200; /* min ms at cwnd=4 in 
BBR_PROBE_RTT */
-static int bbr_min_tso_rate= 120;  /* skip TSO below here (bits/sec) */
+static const int bbr_bw_rtts = CYCLE_LEN + 2;  /* win len of bw filter (in 
rounds) */
+static const u32 bbr_min_rtt_win_sec = 10; /* min RTT filter window (in 
sec) */
+static const u32 bbr_probe_rtt_mode_ms = 200;  /* min ms at cwnd=4 in 
BBR_PROBE_RTT */
+static const int bbr_min_tso_rate  = 120;  /* skip TSO below here 
(bits/sec) */
 
 /* We use a high_gain value chosen to allow a smoothly increasing pacing rate
  * that will double each RTT and send the same number of packets per RTT that
  * an un-paced, slow-starting Reno or CUBIC flow would.
  */
-static int bbr_high_gain  = BBR_UNIT * 2885 / 1000 + 1;/* 2/ln(2) */
-static int bbr_drain_gain = BBR_UNIT * 1000 / 2885;/* 1/high_gain */
-static int bbr_cwnd_gain  = BBR_UNIT * 2;  /* gain for steady-state cwnd */
+static const int bbr_high_gain  = BBR_UNIT * 2885 / 1000 + 1;  /* 2/ln(2) */
+static const int bbr_drain_gain = BBR_UNIT * 1000 / 2885;  /* 1/high_gain 
*/
+static const int bbr_cwnd_gain  = BBR_UNIT * 2;/* gain for 
steady-state cwnd */
 /* The pacing_gain values for the PROBE_BW gain cycle: */
-static int bbr_pacing_gain[] = { BBR_UNIT * 5 / 4, BBR_UNIT * 3 / 4,
-BBR_UNIT, BBR_UNIT, BBR_UNIT,
-BBR_UNIT, BBR_UNIT, BBR_UNIT };
-static u32 bbr_cycle_rand = 7;  /* randomize gain cycling phase over N phases 
*/
+static const int bbr_pacing_gain[] = {
+  BBR_UNIT * 5 / 4, BBR_UNIT * 3 / 4,
+  BBR_UNIT, BBR_UNIT, BBR_UNIT,
+  BBR_UNIT, BBR_UNIT, BBR_UNIT
+};
+static const u32 bbr_cycle_rand = 7;  /* randomize gain cycling phase over N 
phases */
 
 /* Try to keep at least this many packets in flight, if things go smoothly. For
  * smooth functioning, a sliding window protocol ACKing every other packet
  * needs at least 4 packets in flight.
  */
-static u32 bbr_cwnd_min_target = 4;
+static const u32 bbr_cwnd_min_target   = 4;
 
 /* To estimate if BBR_STARTUP mode (i.e. high_

Re: [PATCH v3 net-next 16/16] tcp_bbr: add BBR congestion control

2016-09-18 Thread Neal Cardwell
On Sun, Sep 18, 2016 at 9:18 PM, Kenneth Klette Jonassen
 wrote:
>> +static u64 bbr_rate_kbps(struct sock *sk, u64 rate)
>> +{
>> +   return bbr_rate_bytes_per_sec(sk, rate, BBR_UNIT) * 8 / 1000;
>
>
> Consider div_u64() here to keep all builds happy. :-) This adds __udivdi3()
> with the nxp powerpc toolchains I'm using.
>
> (Or have the one caller use bbr_rate_bytes_per_sec() instead.)

Thanks, Kenneth. We will fix this in the next revision (we like your
suggested approach of just using bbr_rate_bytes_per_sec()).

cheers,
neal


[PATCH v3 net-next 16/16] tcp_bbr: add BBR congestion control

2016-09-18 Thread Neal Cardwell
This commit implements a new TCP congestion control algorithm: BBR
(Bottleneck Bandwidth and RTT). A detailed description of BBR will be
published in ACM Queue, Vol. 14 No. 5, September-October 2016, as
"BBR: Congestion-Based Congestion Control".

BBR has significantly increased throughput and reduced latency for
connections on Google's internal backbone networks and google.com and
YouTube Web servers.

BBR requires only changes on the sender side, not in the network or
the receiver side. Thus it can be incrementally deployed on today's
Internet, or in datacenters.

The Internet has predominantly used loss-based congestion control
(largely Reno or CUBIC) since the 1980s, relying on packet loss as the
signal to slow down. While this worked well for many years, loss-based
congestion control is unfortunately out-dated in today's networks. On
today's Internet, loss-based congestion control causes the infamous
bufferbloat problem, often causing seconds of needless queuing delay,
since it fills the bloated buffers in many last-mile links. On today's
high-speed long-haul links using commodity switches with shallow
buffers, loss-based congestion control has abysmal throughput because
it over-reacts to losses caused by transient traffic bursts.

In 1981 Kleinrock and Gale showed that the optimal operating point for
a network maximizes delivered bandwidth while minimizing delay and
loss, not only for single connections but for the network as a
whole. Finding that optimal operating point has been elusive, since
any single network measurement is ambiguous: network measurements are
the result of both bandwidth and propagation delay, and those two
cannot be measured simultaneously.

While it is impossible to disambiguate any single bandwidth or RTT
measurement, a connection's behavior over time tells a clearer
story. BBR uses a measurement strategy designed to resolve this
ambiguity. It combines these measurements with a robust servo loop
using recent control systems advances to implement a distributed
congestion control algorithm that reacts to actual congestion, not
packet loss or transient queue delay, and is designed to converge with
high probability to a point near the optimal operating point.

In a nutshell, BBR creates an explicit model of the network pipe by
sequentially probing the bottleneck bandwidth and RTT. On the arrival
of each ACK, BBR derives the current delivery rate of the last round
trip, and feeds it through a windowed max-filter to estimate the
bottleneck bandwidth. Conversely it uses a windowed min-filter to
estimate the round trip propagation delay. The max-filtered bandwidth
and min-filtered RTT estimates form BBR's model of the network pipe.

Using its model, BBR sets control parameters to govern sending
behavior. The primary control is the pacing rate: BBR applies a gain
multiplier to transmit faster or slower than the observed bottleneck
bandwidth. The conventional congestion window (cwnd) is now the
secondary control; the cwnd is set to a small multiple of the
estimated BDP (bandwidth-delay product) in order to allow full
utilization and bandwidth probing while bounding the potential amount
of queue at the bottleneck.

When a BBR connection starts, it enters STARTUP mode and applies a
high gain to perform an exponential search to quickly probe the
bottleneck bandwidth (doubling its sending rate each round trip, like
slow start). However, instead of continuing until it fills up the
buffer (i.e. a loss), or until delay or ACK spacing reaches some
threshold (like Hystart), it uses its model of the pipe to estimate
when that pipe is full: it estimates the pipe is full when it notices
the estimated bandwidth has stopped growing. At that point it exits
STARTUP and enters DRAIN mode, where it reduces its pacing rate to
drain the queue it estimates it has created.

Then BBR enters steady state. In steady state, PROBE_BW mode cycles
between first pacing faster to probe for more bandwidth, then pacing
slower to drain any queue that created if no more bandwidth was
available, and then cruising at the estimated bandwidth to utilize the
pipe without creating excess queue. Occasionally, on an as-needed
basis, it sends significantly slower to probe for RTT (PROBE_RTT
mode).

Our long-term goal is to improve the congestion control algorithms
used on the Internet. We are hopeful that BBR can help advance the
efforts toward this goal, and motivate the community to do further
research.

Test results, performance evaluations, feedback, and BBR-related
discussions are very welcome in the public e-mail list for BBR:

  https://groups.google.com/forum/#!forum/bbr-dev

Signed-off-by: Van Jacobson 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Nandita Dukkipati 
Signed-off-by: Eric Dumazet 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/uapi/linux/inet_diag.h |  13 +
 net/ipv4/Kconfig   |  18 +
 net/ipv4/Makefile  |   1 +
 net/ipv4/tcp_bbr.c | 875