Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-29 Thread Dave Taht
On Thu, May 29, 2014 at 4:40 PM, Michael Richardson  wrote:
>
> David P. Reed  wrote:
> > ECN-style signaling has the right properties ... just like TTL it can
> > provide
>
> How would you send these signals?
>
> > A Bloom style filter can remember flow statistics for both of these local
> > policies. A great use for the memory no longer misapplied to
> > buffering
>
> Well.
>
> On the higher speed dataflow equipment, the buffer is general purpose memory,
> so reuse like this is particularly possible.
>
> On routers built around general purpose architectures, the limiting factor
> in performance is often memory throughput; adding memory rarely increases
> total throughput.   Packet I/O is generally quite sequential and so makes
> good use of wide memory data paths and multiple accesses per address cycle.
> Updating of tables such as Bloom filter or any other hash has a big impact
> due to the RMW and random access nature.

In hardware, using a parallel memory layout makes sense.

I had always envisioned the per-flow fq_codel table to be on a lookaside cache,
much like how MAC and route lookups happen today in hw. In a general purpose
architecture with fat amounts of cache (like Ivy Bridge) you can set aside
some main cache if you like.

It needn't be big (64k for 1024 flows, but you can shrink the structure some
if you want) - and it needn't be fast, just fast enough to be accessed on
a per-packet basis.
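
To make that arithmetic concrete, here is a rough sketch of what a 64-byte
per-flow entry might hold (the field names and widths are my own illustration,
not the actual fq_codel layout):

#include <stdint.h>
#include <stdio.h>

/* Illustrative per-flow entry, padded to 64 bytes: 1024 of them is 64KB. */
struct flow_entry {
	uint32_t head, tail;	/* first/last queued packet for this flow */
	int32_t  deficit;	/* DRR deficit, in bytes */
	uint32_t backlog;	/* bytes currently queued */
	uint32_t count;		/* codel: drops in the current drop state */
	uint32_t lastcount;	/* codel: count at the previous cycle */
	uint64_t drop_next;	/* codel: time of the next scheduled drop, ns */
	uint64_t first_above;	/* codel: when sojourn first exceeded target */
	uint8_t  dropping;	/* codel: currently in drop state? */
	uint8_t  pad[23];	/* round the entry up to 64 bytes */
};

int main(void)
{
	printf("entry: %zu bytes, 1024 flows: %zu KB\n",
	       sizeof(struct flow_entry),
	       sizeof(struct flow_entry) * 1024 / 1024);
	return 0;
}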

There are other ways to do it of course. You could set it up as 1024 banks of
eight 32-bit registers, for example, in the ASIC or FPGA, and eliminate
the concept of using RAM for it entirely.

This is not a lot of gates (though quite a lot when you consider that the
invsqrt dependency in codel alone is 3k gates or so - or "free" in an FPGA
with DSP multipliers).
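
For anyone who hasn't read the codel internals: the invsqrt shows up because
the control law schedules the next drop at now + interval/sqrt(count), and
implementations keep a running estimate of 1/sqrt(count) that they refine with
a single Newton-Raphson step per drop rather than computing a square root. A
floating-point sketch of that recurrence (the kernel does the equivalent in
fixed point):

#include <stdint.h>
#include <stdio.h>

/* One Newton-Raphson step toward 1/sqrt(count): x' = x*(3 - count*x*x)/2.
 * This multiply chain is the dependency that costs gates (or DSP slices). */
static double newton_invsqrt(double est, uint32_t count)
{
	return est * (3.0 - (double)count * est * est) / 2.0;
}

int main(void)
{
	double interval_ms = 100.0;	/* codel's default interval */
	double est = 1.0;		/* running estimate of 1/sqrt(count) */

	for (uint32_t count = 1; count <= 8; count++) {
		est = newton_invsqrt(est, count);
		printf("count=%u -> next drop in ~%.0f ms\n",
		       count, interval_ms * est);
	}
	return 0;
}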

I've never thought that pure "drop head" was possible in high speed hardware
- the various operations need to be pipelined, the timestamp needs to go
at the head of the packet for codel to operate on it, etc, etc...

> All I'm saying is that quantity of memory is seldom the problem, but access
> to it, is.

Concur. I keep hoping my Parallella arrives. You can write your own ethernet
device with that...

> I do like the entire idea; it seems that it has to be implemented at the
> places where the flows converge, which is often in the DSL line card, or
> CMTS...

The elephant in the room on those devices is the per-user rate shaper.
In software this accounts for 95% of the cpu time and scheduling headaches,
and that's without dealing with ipv6 pools.
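
For reference, the per-packet work that shaper implies looks roughly like the
token bucket below (a sketch; the single-rate model, names and units are my
assumptions, and real CMTS/BRAS shapers are usually dual-rate and per-class).
The point is that it needs a clock read plus a read-modify-write on
per-subscriber state for every packet of every subscriber:

#include <stdbool.h>
#include <stdint.h>

/* Minimal single-rate token bucket: the per-packet check a per-user
 * shaper performs. Illustrative only. */
struct shaper {
	uint64_t rate_bps;	/* subscriber's contracted rate */
	uint64_t burst_bytes;	/* bucket depth */
	uint64_t tokens;	/* current credit, in bytes */
	uint64_t last_ns;	/* last refill time */
};

static bool shaper_conforms(struct shaper *s, uint32_t pkt_len, uint64_t now_ns)
{
	/* refill credit for the time elapsed since the last packet */
	s->tokens += (now_ns - s->last_ns) * s->rate_bps / 8 / 1000000000ULL;
	if (s->tokens > s->burst_bytes)
		s->tokens = s->burst_bytes;
	s->last_ns = now_ns;

	if (s->tokens < pkt_len)
		return false;	/* over rate: queue or drop until credit accrues */
	s->tokens -= pkt_len;
	return true;		/* conforms: forward now */
}

int main(void)
{
	struct shaper s = { .rate_bps = 20000000, .burst_bytes = 30000,
			    .tokens = 30000, .last_ns = 0 };
	/* a 1500-byte packet arriving 1 ms after the bucket was last full */
	return shaper_conforms(&s, 1500, 1000000) ? 0 : 1;
}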

> --
> ]   Never tell me the odds! | ipv6 mesh networks [
> ]   Michael Richardson, Sandelman Software Works| network architect  [
> ] m...@sandelman.ca  http://www.sandelman.ca/|   ruby on rails[
>
> ___
> Cerowrt-devel mailing list
> Cerowrt-devel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cerowrt-devel



-- 
Dave Täht

NSFW: 
https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article
___
Cerowrt-devel mailing list
Cerowrt-devel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cerowrt-devel


Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-29 Thread David P. Reed
Good points...

On May 29, 2014, Michael Richardson  wrote:
>
>David P. Reed  wrote:
>> ECN-style signaling has the right properties ... just like TTL it can
>> provide
>
>How would you send these signals?
>
>> A Bloom style filter can remember flow statistics for both of these local
>> policies. A great use for the memory no longer misapplied to
>> buffering
>
>Well.
>
>On the higher speed dataflow equipment, the buffer is general purpose memory,
>so reuse like this is particularly possible.
>
>On routers built around general purpose architectures, the limiting factor
>in performance is often memory throughput; adding memory rarely increases
>total throughput.   Packet I/O is generally quite sequential and so makes
>good use of wide memory data paths and multiple accesses per address cycle.
>Updating of tables such as Bloom filter or any other hash has a big impact
>due to the RMW and random access nature.
>
>All I'm saying is that quantity of memory is seldom the problem, but access
>to it, is.
>
>I do like the entire idea; it seems that it has to be implemented at the
>places where the flows converge, which is often in the DSL line card, or
>CMTS...
>
>--
>]   Never tell me the odds! | ipv6 mesh networks [
>]   Michael Richardson, Sandelman Software Works| network architect  [
>] m...@sandelman.ca  http://www.sandelman.ca/|   ruby on rails[

-- Sent from my Android device with K-@ Mail. Please excuse my brevity.
___
Cerowrt-devel mailing list
Cerowrt-devel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cerowrt-devel


Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-29 Thread Michael Richardson

David P. Reed  wrote:
> ECN-style signaling has the right properties ... just like TTL it can
> provide

How would you send these signals?

> A Bloom style filter can remember flow statistics for both of these local
> policies. A great use for the memory no longer misapplied to
> buffering

Well.

On the higher speed dataflow equipment, the buffer is general purpose memory,
so reuse like this is particularly possible.

On routers built around general purpose architectures, the limiting factor
in performance is often memory throughput; adding memory rarely increases
total throughput.   Packet I/O is generally quite sequential and so makes
good use of wide memory data paths and multiple accesses per address cycle.
Updating of tables such as Bloom filter or any other hash has a big impact
due to the RMW and random access nature.

All I'm saying is that quantity of memory is seldom the problem, but access
to it, is.

I do like the entire idea; it seems that it has to be implemented at the
places where the flows converge, which is often in the DSL line card, or
CMTS...

--
]   Never tell me the odds! | ipv6 mesh networks [
]   Michael Richardson, Sandelman Software Works| network architect  [
] m...@sandelman.ca  http://www.sandelman.ca/|   ruby on rails[

___
Cerowrt-devel mailing list
Cerowrt-devel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cerowrt-devel


Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-29 Thread David Lang
The problem is that without co-existing well with existing stacks (and
especially misbehaving stacks), you are not talking about something that will
ever be usable in real life.


Unless I am mixing things up, RED and its variants are a perfect example of
this. If everyone on the network is using them, they work pretty well, but when
someone isn't (or decides to "cheat"), it becomes very unfair in favor of the
non-complying system.


David Lang

On Thu, 29 May 2014, dpr...@reed.com wrote:


Note: this is all about "how to achieve and sustain the ballistic phase that is 
optimal for Internet transport" in an end-to-end based control system like TCP.

I think those who have followed this know that, but I want to make it clear that I'm 
proposing a significant improvement that requires changes at the OS stacks and changes in 
the switches' approach to congestion signaling.  There are ways to phase it in gradually. 
 In "meshes", etc. it could probably be developed and deployed more quickly - 
but my thoughts on co-existence with the current TCP stacks and current IP routers are 
far less precisely worked out.

I am way too busy with my day job to do what needs to be done ... but my sense 
is that the folks who reduce this to practice will make a HUGE difference to 
Internet performance.  Bigger than getting bloat fixed, and to me that is a 
major, major potential triumph.



On Thursday, May 29, 2014 8:11am, "David P. Reed"  said:


ECN-style signaling has the right properties ... just like TTL it can provide 
valid and current sampling of the packet's environment as it travels. The 
idea is to sample what is happening at a bottleneck for the packet's flow.  
The bottleneck is the link with the most likelihood of a collision from flows 
sharing that link.

A control-theoretic estimator of recent collision likelihood is easy to do at 
each queue.  All active flows would receive that signal, with the busiest ones 
getting it most quickly. Also it is reasonable to count all potentially 
colliding flows at all outbound queues, and report that.

The estimator can then provide the signal that each flow responds to.

The problem of "defectors" is best dealt with by punishment... An aggressive 
packet drop policy that makes causing congestion reduce the causer's throughput and 
increase latency is the best kind of answer. Since the router can remember recent flow 
behavior, it can penalize recent flows.

A Bloom style filter can remember flow statistics for both of these local 
policies. A great use for the memory no longer misapplied to buffering

Simple?


On May 28, 2014, David Lang  wrote:
On Wed, 28 May 2014, dpr...@reed.com wrote:

I did not mean that "pacing".  Sorry I used a generic term.  I meant what my 
longer description described - a specific mechanism for reducing bunching that 
is essentially "cooperative" among all active flows through a bottlenecked 
link.  That's part of a "closed loop" control system driving each TCP endpoint 
into a cooperative mode.
how do you think we can get feedback from the bottleneck node to all the 
different senders?


what happens to the ones who try to play nice if one doesn't, including what 
happens if one isn't just ignorant of the new cooperative mode, but actively 
tries to cheat? (as I understand it, this is the fatal flaw in many of the past 
buffering improvement proposals)


While the in-house router is the first bottleneck that user's traffic hits, the 
bigger problems happen when the bottleneck is in the peering between ISPs, many 
hops away from any sender, with many different senders competing for the 
available bandwidth.


This is where the new buffering approaches win. If the traffic is below the 
congestion level, they add very close to zero overhead, but when congestion 
happens, they manage the resulting buffers in a way that works better for 
people (allowing short, fast connections to be fast with only a small impact on 
very long connections)


David Lang

The thing you call "pacing" is something quite different.  It is disconnected 
from the TCP control loops involved, which basically means it is flying blind. 
Introducing that kind of "pacing" almost certainly reduces throughput, because 
it *delays* packets.


The thing I called "pacing" is in no version of Linux that I know of.  Give it 
a different name: "anti-bunching cooperation" or "timing phase management for 
congestion reduction". Rather than *delaying* packets, it tries to get packets 
to avoid bunching only when reducing window size, and doing so by tightening 
the control loop so that the sender transmits as *soon* as it can, not by 
delaying sending after the sender dallies around not sending when it can.








On Tuesday, May 27, 2014 11:23am, "Jim Gettys"  said:







On Sun, May 25, 2014 at 4:00 PM, dpr...@reed.com wrote:

Not that it is directly relevant, but there is no essential reason to require 
50 ms. of buffering.  T

Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-29 Thread dpreed

Note: this is all about "how to achieve and sustain the ballistic phase that is 
optimal for Internet transport" in an end-to-end based control system like TCP.
 
I think those who have followed this know that, but I want to make it clear 
that I'm proposing a significant improvement that requires changes at the OS 
stacks and changes in the switches' approach to congestion signaling.  There 
are ways to phase it in gradually.  In "meshes", etc. it could probably be 
developed and deployed more quickly - but my thoughts on co-existence with the 
current TCP stacks and current IP routers are far less precisely worked out.
 
I am way too busy with my day job to do what needs to be done ... but my sense 
is that the folks who reduce this to practice will make a HUGE difference to 
Internet performance.  Bigger than getting bloat fixed, and to me that is a 
major, major potential triumph.
 


On Thursday, May 29, 2014 8:11am, "David P. Reed"  said:


ECN-style signaling has the right properties ... just like TTL it can provide 
valid and current sampling of the packet's environment as it travels. The 
idea is to sample what is happening at a bottleneck for the packet's flow.  
The bottleneck is the link with the most likelihood of a collision from flows 
sharing that link.

 A control-theoretic estimator of recent collision likelihood is easy to do 
at each queue.  All active flows would receive that signal, with the busiest 
ones getting it most quickly. Also it is reasonable to count all potentially 
colliding flows at all outbound queues, and report that.

 The estimator can then provide the signal that each flow responds to.

 The problem of "defectors" is best dealt with by punishment... An aggressive 
packet drop policy that makes causing congestion reduce the causer's throughput 
and increase latency is the best kind of answer. Since the router can remember 
recent flow behavior, it can penalize recent flows.

 A Bloom style filter can remember flow statistics for both of these local 
policies. A great use for the memory no longer misapplied to buffering

 Simple?


On May 28, 2014, David Lang  wrote:
On Wed, 28 May 2014, dpr...@reed.com wrote:

I did not mean that "pacing".  Sorry I used a generic term.  I meant what my 
longer description described - a specific mechanism for reducing bunching that 
is essentially "cooperative" among all active flows through a bottlenecked 
link.  That's part of a "closed loop" control system driving each TCP endpoint 
into a cooperative mode.
how do you think we can get feedback from the bottleneck node to all the 
different senders?

what happens to the ones who try to play nice if one doesn't, including what 
happens if one isn't just ignorant of the new cooperative mode, but actively 
tries to cheat? (as I understand it, this is the fatal flaw in many of the past 
buffering improvement proposals)

While the in-house router is the first bottleneck that user's traffic hits, the 
bigger problems happen when the bottleneck is in the peering between ISPs, many 
hops away from any sender, with many different senders competing for the 
available bandwidth.

This is where the new buffering approaches win. If the traffic is below the 
congestion level, they add very close to zero overhead, but when congestion 
happens, they manage the resulting buffers in a way that works better for 
people (allowing short, fast connections to be fast with only a small impact on 
very long connections)

David Lang

The thing you call "pacing" is something quite different.  It is disconnected 
from the TCP control loops involved, which basically means it is flying blind. 
Introducing that kind of "pacing" almost certainly reduces throughput, because 
it *delays* packets.

The thing I called "pacing" is in no version of Linux that I know of.  Give it 
a different name: "anti-bunching cooperation" or "timing phase management for 
congestion reduction". Rather than *delaying* packets, it tries to get packets 
to avoid bunching only when reducing window size, and doing so by tightening 
the control loop so that the sender transmits as *soon* as it can, not by 
delaying sending after the sender dallies around not sending when it can.







On Tuesday, May 27, 2014 11:23am, "Jim Gettys"  said:







On Sun, May 25, 2014 at 4:00 PM, dpr...@reed.com wrote:

Not that it is directly relevant, but there is no essential reason to require 
50 ms. of buffering.  That might be true of some particular QOS-related router 
algorithm.  50 ms. is about all one can tolerate in any router between source 
and destination for today's networks - an upper-bound rather than a minimum.

The optimum buffer state for throughput is 1-2 packets worth - in other words, 
if we have an MTU of 1500, 1500 - 3000 bytes. Only the bottleneck buffer (the 
input queue to the lowest speed link along the path) should have this much 
actually buffered. Buffering more than this increases end-

Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-29 Thread David P. Reed
ECN-style signaling has the right properties ... just like TTL it can provide 
valid and current sampling of the packet's environment as it travels. The 
idea is to sample what is happening at a bottleneck for the packet's flow.  
The bottleneck is the link with the most likelihood of a collision from flows 
sharing that link.

A control-theoretic estimator of recent collision likelihood is easy to do at 
each queue.  All active flows would receive that signal, with the busiest ones 
getting it most quickly. Also it is reasonable to count all potentially 
colliding flows at all outbound queues, and report that.

The estimator can then provide the signal that each flow responds to.
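
One concrete way such an estimator could be realized (a sketch under my own
assumptions - the EWMA form, the 0.05 gain, and the "arrival finds another
flow's packet queued" trigger are illustrative, not anything specified here)
is an exponentially weighted moving average kept per outbound queue:

#include <stdio.h>

struct queue_estimator {
	double busy_ewma;	/* recent probability of contention, 0..1 */
	double alpha;		/* smoothing gain of the control loop */
};

/* called on every packet arrival at this outbound queue */
static void estimator_update(struct queue_estimator *q, int queue_occupied)
{
	q->busy_ewma += q->alpha * ((queue_occupied ? 1.0 : 0.0) - q->busy_ewma);
}

int main(void)
{
	struct queue_estimator q = { .busy_ewma = 0.0, .alpha = 0.05 };

	/* a period where 3 of every 4 arrivals find the queue occupied */
	for (int i = 0; i < 400; i++)
		estimator_update(&q, (i % 4) != 0);
	printf("recent collision likelihood ~ %.2f\n", q.busy_ewma);
	return 0;
}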

The problem of "defectors" is best dealt with by punishment... An aggressive 
packet drop policy that makes causing congestion reduce the causer's throughput 
and increase latency is the best kind of answer. Since the router can remember 
recent flow behavior, it can penalize recent flows.

A Bloom style filter can remember flow statistics for both of these local 
policies. A great use for the memory no longer misapplied to buffering

Simple?
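
To make "Bloom style filter" concrete (my reading, with illustrative hash,
table size, and scoring - nothing here was specified in the thread): a few KB
of small counters indexed by several hashes of the flow's 5-tuple can keep
approximate per-flow congestion history with no per-flow table at all:

#include <stdint.h>
#include <stdio.h>

#define SLOTS  4096
#define HASHES 3

static uint8_t counts[SLOTS];

static uint32_t hash(uint32_t flow_id, uint32_t seed)
{
	uint32_t h = flow_id ^ (seed * 0x9e3779b9u);
	h ^= h >> 16; h *= 0x85ebca6bu; h ^= h >> 13;
	return h % SLOTS;
}

/* charge one congestion event (drop or mark) to this flow */
static void note_congestion(uint32_t flow_id)
{
	for (uint32_t i = 0; i < HASHES; i++) {
		uint32_t slot = hash(flow_id, i);
		if (counts[slot] < 255)		/* saturate, don't wrap */
			counts[slot]++;
	}
}

/* estimate of recent events for this flow: the minimum over its slots
 * never undercounts, and only overcounts when all its slots collide */
static uint8_t recent_congestion(uint32_t flow_id)
{
	uint8_t min = 255;
	for (uint32_t i = 0; i < HASHES; i++) {
		uint8_t c = counts[hash(flow_id, i)];
		if (c < min)
			min = c;
	}
	return min;
}

int main(void)
{
	for (int i = 0; i < 20; i++)
		note_congestion(0xdeadbeef);	/* a flow causing congestion */
	note_congestion(0x1234);		/* a well-behaved flow */
	printf("aggressive flow score: %u\n", recent_congestion(0xdeadbeef));
	printf("polite flow score:     %u\n", recent_congestion(0x1234));
	return 0;
}

A drop or mark policy could then bias against flows with high scores; one way
to keep the history "recent" is to periodically halve all the counters.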

On May 28, 2014, David Lang  wrote:
>On Wed, 28 May 2014, dpr...@reed.com wrote:
>
>> I did not mean that "pacing".  Sorry I used a generic term.  I meant
>what my 
>> longer description described - a specific mechanism for reducing
>bunching that 
>> is essentially "cooperative" among all active flows through a
>bottlenecked 
>> link.  That's part of a "closed loop" control system driving each TCP
>endpoint 
>> into a cooperative mode.
>
>how do you think we can get feedback from the bottleneck node to all
>the 
>different senders?
>
>what happens to the ones who try to play nice if one doesn't,
>including what 
>happens if one isn't just ignorant of the new cooperative mode, but
>actively 
>tries to cheat? (as I understand it, this is the fatal flaw in many of
>the past 
>buffering improvement proposals)
>
>While the in-house router is the first bottleneck that user's traffic
>hits, the 
>bigger problems happen when the bottleneck is in the peering between
>ISPs, many 
>hops away from any sender, with many different senders competing for
>the 
>available bandwidth.
>
>This is where the new buffering approaches win. If the traffic is below
>the 
>congestion level, they add very close to zero overhead, but when
>congestion 
>happens, they manage the resulting buffers in a way that works better
>for 
>people (allowing short, fast connections to be fast with only a small
>impact on 
>very long connections)
>
>David Lang
>
>> The thing you call "pacing" is something quite different.  It is
>disconnected 
>> from the TCP control loops involved, which basically means it is
>flying blind. 
>> Introducing that kind of "pacing" almost certainly reduces
>throughput, because 
>> it *delays* packets.
>> 
>> The thing I called "pacing" is in no version of Linux that I know of.
> Give it 
>> a different name: "anti-bunching cooperation" or "timing phase
>management for 
>> congestion reduction". Rather than *delaying* packets, it tries to
>get packets 
>> to avoid bunching only when reducing window size, and doing so by
>tightening 
>> the control loop so that the sender transmits as *soon* as it can,
>not by 
>> delaying sending after the sender dallies around not sending when it
>can.
>> 
>> 
>> 
>> 
>> 
>>
>>
>> On Tuesday, May 27, 2014 11:23am, "Jim Gettys" 
>said:
>>
>>
>>
>>
>>
>>
>>
>> On Sun, May 25, 2014 at 4:00 PM, dpr...@reed.com wrote:
>>
>> Not that it is directly relevant, but there is no essential reason to
>require 50 ms. of buffering.  That might be true of some particular
>QOS-related router algorithm.  50 ms. is about all one can tolerate in
>any router between source and destination for today's networks - an
>upper-bound rather than a minimum.
>> 
>> The optimum buffer state for throughput is 1-2 packets worth - in
>other words, if we have an MTU of 1500, 1500 - 3000 bytes. Only the
>bottleneck buffer (the input queue to the lowest speed link along the
>path) should have this much actually buffered. Buffering more than this
>increases end-to-end latency beyond its optimal state.  Increased
>end-to-end latency reduces the effectiveness of control loops, creating
>more congestion.
>> 
>> The rationale for having 50 ms. of buffering is probably to avoid
>disruption of bursty mixed flows where the bursts might persist for 50
>ms. and then die. One reason for this is that source nodes run
>operating systems that tend to release packets in bursts. That's a
>whole other discussion - in an ideal world, source nodes would avoid
>bursty packet releases by letting the control by the receiver window be
>"tight" timing-wise.  That is, to transmit a packet immediately at the
>instant an ACK arrives increasing the window.  This would pace the flow
>- current OS's tend (due to scheduling mismatches) to send bursts of
>packets, "catching up" on sending that 

Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-28 Thread David Lang

On Wed, 28 May 2014, dpr...@reed.com wrote:

I did not mean that "pacing".  Sorry I used a generic term.  I meant what my 
longer description described - a specific mechanism for reducing bunching that 
is essentially "cooperative" among all active flows through a bottlenecked 
link.  That's part of a "closed loop" control system driving each TCP endpoint 
into a cooperative mode.


how do you think we can get feedback from the bottleneck node to all the 
different senders?


what happens to the ones who try to play nice if one doesn't, including what 
happens if one isn't just ignorant of the new cooperative mode, but actively 
tries to cheat? (as I understand it, this is the fatal flaw in many of the past 
buffering improvement proposals)


While the in-house router is the first bottleneck that user's traffic hits, the 
bigger problems happen when the bottleneck is in the peering between ISPs, many 
hops away from any sender, with many different senders competing for the 
available bandwidth.


This is where the new buffering approaches win. If the traffic is below the 
congestion level, they add very close to zero overhead, but when congestion 
happens, they manage the resulting buffers in a way that works better for 
people (allowing short, fast connections to be fast with only a small impact on 
very long connections)


David Lang

The thing you call "pacing" is something quite different.  It is disconnected 
from the TCP control loops involved, which basically means it is flying blind. 
Introducing that kind of "pacing" almost certainly reduces throughput, because 
it *delays* packets.


The thing I called "pacing" is in no version of Linux that I know of.  Give it 
a different name: "anti-bunching cooperation" or "timing phase management for 
congestion reduction". Rather than *delaying* packets, it tries to get packets 
to avoid bunching only when reducing window size, and doing so by tightening 
the control loop so that the sender transmits as *soon* as it can, not by 
delaying sending after the sender dallies around not sending when it can.








On Tuesday, May 27, 2014 11:23am, "Jim Gettys"  said:







On Sun, May 25, 2014 at 4:00 PM, dpr...@reed.com wrote:

Not that it is directly relevant, but there is no essential reason to require 
50 ms. of buffering.  That might be true of some particular QOS-related router 
algorithm.  50 ms. is about all one can tolerate in any router between source 
and destination for today's networks - an upper-bound rather than a minimum.

The optimum buffer state for throughput is 1-2 packets worth - in other words, 
if we have an MTU of 1500, 1500 - 3000 bytes. Only the bottleneck buffer (the 
input queue to the lowest speed link along the path) should have this much 
actually buffered. Buffering more than this increases end-to-end latency beyond 
its optimal state.  Increased end-to-end latency reduces the effectiveness of 
control loops, creating more congestion.

The rationale for having 50 ms. of buffering is probably to avoid disruption of bursty mixed flows 
where the bursts might persist for 50 ms. and then die. One reason for this is that source nodes 
run operating systems that tend to release packets in bursts. That's a whole other discussion - in 
an ideal world, source nodes would avoid bursty packet releases by letting the control by the 
receiver window be "tight" timing-wise.  That is, to transmit a packet immediately at the 
instant an ACK arrives increasing the window.  This would pace the flow - current OS's tend (due to 
scheduling mismatches) to send bursts of packets, "catching up" on sending that could 
have been spaced out and done earlier if the feedback from the receiver's window advancing were 
heeded.


That is, endpoint network stacks (TCP implementations) can worsen congestion by 
"dallying".  The ideal end-to-end flows occupying a congested router would have their 
packets paced so that the packets end up being sent in the least bursty manner that an application 
can support.  The effect of this pacing is to move the "backlog" for each flow quickly 
into the source node for that flow, which then provides back pressure on the application driving 
the flow, which ultimately is necessary to stanch congestion.  The ideal congestion control 
mechanism slows the sender part of the application to a pace that can go through the network 
without contributing to buffering.

Pacing is in Linux 3.12(?).  How long it will take to see widespread 
deployment is another question, and as for other operating systems, who knows.
See: https://lwn.net/Articles/564978/

Current network stacks (including Linux's) don't achieve that goal - their 
pushback on application sources is minimal - instead they accumulate buffering 
internal to the network implementation.
This is much, much less true than it once was.  There have been substantial 
changes in the Linux TCP stack in the 

Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-28 Thread David Lang

Ok, I am not understanding your proposal then.

I thought you were claiming that since the optimum buffer length is 1-2 packets, 
the endpoints should be adjusting their sending speeds to try and make that 
happen on all switches and routers in the path.


The endpoints do know what their latency budget is and can have up to that much 
data in flight. They don't know if that data is sitting in router buffers, or is 
in transit on a high-speed-high-latency link (high speed satellite links can 
have a LOT of data that's left the transmitter, but not arrived at the receiver 
yet; this will look exactly like data sitting in a buffer to the endpoints)


the endpoints don't know the state of all the intermediate connections, so 
unless they get feedback (ECN or dropped packets) they have to assume that there 
is no congestion.


David Lang

On Wed, 28 May 2014, dpr...@reed.com wrote:


Interesting conversation.   A particular switch has no idea of the "latency 
budget" of a particular flow - so it cannot have its *own* latency budget.   The 
switch designer has no choice but to assume that his latency budget is near zero.

The number of packets that should be sustained in flight to maintain maximum 
throughput between the source (entry) switch and destination (exit) switch of 
the flow need be no higher than

the flow's share of bandwidth of the bottleneck

multiplied by

the end-to-end delay (including packet forwarding, but not queueing).

All buffering needed for isochrony ("jitter buffer") and "alternative path 
selection" can be moved to either before the entry switch or after the exit switch.

If you have multiple simultaneous paths, the number of packets in flight involves replacing 
"bandwidth of the bottleneck" with "aggregate bandwidth across the minimum cut-set 
of the chosen paths used for the flow".

Of course, these are dynamic - "the flow's share" and "paths used for the flow" 
change over short time scales.  That's why you have a control loop that needs to measure them.

The whole point of minimizing buffering is to make the measurements more timely 
and the control inputs more timely.  This is not about convergence to an 
asymptote.

A network where every internal buffer is driven hard toward zero makes it 
possible to handle multiple paths, alternate paths, etc. more *easily*.   
That's partly because you allow endpoints to see what is happening to their 
flows more quickly so they can compensate.

And of course for shared wireless resources, things change more quickly because 
of new factors - more sharing, more competition for collision-free slots, 
varying transmission rates, etc.

The last thing you want is long-term standing waves caused by large buffers and 
very loose control.



On Tuesday, May 27, 2014 11:21pm, "David Lang"  said:




On Tue, 27 May 2014, Dave Taht wrote:

> On Tue, May 27, 2014 at 4:27 PM, David Lang  wrote:
>> On Tue, 27 May 2014, Dave Taht wrote:
>>
>>> There is a phrase in this thread that is begging to bother me.
>>>
>>> "Throughput". Everyone assumes that throughput is a big goal - and
it
>>> certainly is - and latency is also a big goal - and it certainly is
-
>>> but by specifying what you want from "throughput" as a compromise
with
>>> latency is not the right thing...
>>>
>>> If what you want is actually "high speed in-order packet delivery" -
>>> say, for example a movie,
>>> a video conference, or youtube - excessive
>>> latency with high throughput, really, really makes in-order packet
>>> delivery at high speed tough.
>>
>>
>> the key word here is "excessive", that's why I said that for max
throughput
>> you want to buffer as much as your latency budget will allow you to.
>
> Again I'm trying to make a distinction between "throughput", and "packets
> delivered-in-order-to-the-user." (for-which-we-need-a-new-word-I think)
>
> The buffering should not be in-the-network, it can be in the application.
>
> Take our hypothetical video stream for example. I am 20ms RTT from netflix.
> If I artificially inflate that by adding 50ms of in-network buffering,
> that means a loss can
> take 120ms to recover from.
>
> If instead, I keep a 3*RTT buffer in my application, and expect that I have
5ms
> worth of network-buffering, instead, I recover from a loss in 40ms.
>
> (please note, it's late, I might not have got the math entirely right)

but you aren't going to be tuning the retry wait time per connection. what is
the retry time that is set in your stack? It's something huge to survive
international connections with satellite paths (so several seconds worth). If
your server-to-eyeball buffering is shorter than this, you will get a window
where you aren't fully utilizing the connection.

so yes, I do think that if your purpose is to get the maximum possible in-order
packets delivered, you end up making different decisions than if you are just
trying to stream an HD video, or do other normal things.

The problem is thinking that this absolu

Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-28 Thread dpreed

Interesting conversation.   A particular switch has no idea of the "latency 
budget" of a particular flow - so it cannot have its *own* latency budget.   
The switch designer has no choice but to assume that his latency budget is near 
zero.
 
The number of packets that should be sustained in flight to maintain maximum 
throughput between the source (entry) switch and destination (exit) switch of 
the flow need be no higher than
 
the flow's share of bandwidth of the bottleneck
 
multiplied by
 
the end-to-end delay (including packet forwarding, but not queueing).
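 
A quick worked example of that product, with made-up numbers:

#include <stdio.h>

int main(void)
{
	double share_bps = 20e6;	/* flow's share of the bottleneck */
	double delay_s   = 0.030;	/* 30 ms end-to-end, no queueing */
	double mtu_bytes = 1500.0;

	double bytes_in_flight = share_bps / 8.0 * delay_s;
	printf("window: %.0f bytes = %.1f MTU-sized packets\n",
	       bytes_in_flight, bytes_in_flight / mtu_bytes);	/* 75000 = 50 */
	return 0;
}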
 
All buffering needed for isochrony ("jitter buffer") and "alternative path 
selection" can be moved to either before the entry switch or after the exit 
switch.
 
If you have multiple simultaneous paths, the number of packets in flight 
involves replacing "bandwidth of the bottleneck" with "aggregate bandwidth 
across the minimum cut-set of the chosen paths used for the flow".
 
Of course, these are dynamic - "the flow's share" and "paths used for the flow" 
change over short time scales.  That's why you have a control loop that needs 
to measure them.
 
The whole point of minimizing buffering is to make the measurements more timely 
and the control inputs more timely.  This is not about convergence to an 
asymptote.
 
A network where every internal buffer is driven hard toward zero makes it 
possible to handle multiple paths, alternate paths, etc. more *easily*.   
That's partly because you allow endpoints to see what is happening to their 
flows more quickly so they can compensate.
 
And of course for shared wireless resources, things change more quickly because 
of new factors - more sharing, more competition for collision-free slots, 
varying transmission rates, etc.
 
The last thing you want is long-term standing waves caused by large buffers and 
very loose control.
 


On Tuesday, May 27, 2014 11:21pm, "David Lang"  said:



> On Tue, 27 May 2014, Dave Taht wrote:
> 
> > On Tue, May 27, 2014 at 4:27 PM, David Lang  wrote:
> >> On Tue, 27 May 2014, Dave Taht wrote:
> >>
> >>> There is a phrase in this thread that is begging to bother me.
> >>>
> >>> "Throughput". Everyone assumes that throughput is a big goal - and
> it
> >>> certainly is - and latency is also a big goal - and it certainly is
> -
> >>> but by specifying what you want from "throughput" as a compromise
> with
> >>> latency is not the right thing...
> >>>
> >>> If what you want is actually "high speed in-order packet delivery" -
> >>> say, for example a movie,
> >>> a video conference, or youtube - excessive
> >>> latency with high throughput, really, really makes in-order packet
> >>> delivery at high speed tough.
> >>
> >>
> >> the key word here is "excessive", that's why I said that for max
> throughput
> >> you want to buffer as much as your latency budget will allow you to.
> >
> > Again I'm trying to make a distinction between "throughput", and "packets
> > delivered-in-order-to-the-user." (for-which-we-need-a-new-word-I think)
> >
> > The buffering should not be in-the-network, it can be in the application.
> >
> > Take our hypothetical video stream for example. I am 20ms RTT from netflix.
> > If I artificially inflate that by adding 50ms of in-network buffering,
> > that means a loss can
> > take 120ms to recover from.
> >
> > If instead, I keep a 3*RTT buffer in my application, and expect that I have
> 5ms
> > worth of network-buffering, instead, I recover from a loss in 40ms.
> >
> > (please note, it's late, I might not have got the math entirely right)
> 
> but you aren't going to be tuning the retry wait time per connection. what is
> the retry time that is set in your stack? It's something huge to survive
> international connections with satellite paths (so several seconds worth). If
> your server-to-eyeball buffering is shorter than this, you will get a window
> where you aren't fully utilizing the connection.
> 
> so yes, I do think that if your purpose is to get the maximum possible 
> in-order
> packets delivered, you end up making different decisions than if you are just
> trying to stream an HD video, or do other normal things.
> 
> The problem is thinking that this absolute throughput is representative of
> normal use.
> 
> > As physical RTTs grow shorter, the advantages of smaller buffers grow
> larger.
> >
> > You don't need 50ms queueing delay on a 100us path.
> >
> > Many applications buffer for seconds due to needing to be at least
> > 2*(actual buffering+RTT) on the path.
> 
> For something like streaming video, there's nothing wrong with the application
> buffering aggressively (assuming you have the space to do so on the client 
> side),
> the more you have gotten transmitted to the client, the longer it can survive 
> a
> disruption of its network.
> 
> There's nothing wrong with having an hour of buffered data between the server
> and the viewer's eyes. Now, this buffering should not be in the network 
> devices, it
> sh

Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-28 Thread dpreed

Same concern I mentioned with Jim's message.   I was not clear what I meant by 
"pacing" in the context of optimization of latency while preserving throughput. 
 It is NOT just a matter of spreading packets out in time that I was talking 
about.   It is a matter of doing so without reducing throughput.  That means 
transmitting as *early* as possible while avoiding congestion.  Building a 
"backlog" and then artificially spreading it out by "add-on pacing" will 
definitely reduce throughput below the flow's fair share of the bottleneck 
resource.
 
It is pretty clear to me that you can't get to a minimal latency, optimal 
throughput control algorithm by a series of "add ons" in LART.  It requires 
rethinking of the control discipline, and changes to get more information about 
congestion earlier, without ever allowing a buffer queue to build up in 
intermediate nodes - since that destroys latency by definition.
 
As long as you require buffers to grow at bottleneck links in order to get 
measurements of congestion, you probably are stuck with long-time-constant 
control loops, and as long as you encourage buffering at OS send stacks you are 
even worse off at the application layer.
 
The problem is in the assumption that buffer queueing is the only possible 
answer.  The "pacing" being included in Linux is just another way to build 
bigger buffers (on the sending host), by taking control away from the TCP 
control loop.
 
 


On Tuesday, May 27, 2014 1:31pm, "Dave Taht"  said:



> This has been a good thread, and I'm sorry it was mostly on
> cerowrt-devel rather than the main list...
> 
> It is not clear from observing google's deployment that pacing of the
> IW is not in use. I see
> clear 1ms boundaries for individual flows on much lower than iw10
> boundaries. (e.g. I see 1-4
> packets at a time arrive at 1ms intervals - but this could be an
> artifact of the capture, intermediate
> devices, etc)
> 
> sch_fq comes with explicit support for spreading out the initial
> window, (by default it allows a full iw10 burst however) and tcp small
> queues and pacing-aware tcps and the tso fixes and stuff we don't know
> about all are collaborating to reduce the web burst size...
> 
> sch_fq_codel used as the host/router qdisc basically does spread out
> any flow if there is a bottleneck on the link. The pacing stuff
> spreads flow delivery out across an estimate of srtt by clock tick...
> 
> It makes tremendous sense to pace out a flow if you are hitting the
> wire at 10gbit and know you are stepping down to 100mbit or less on
> the end device - that 100x difference in rate is meaningful... and at
> the same time to get full throughput out of 10gbit some level of tso
> offloads is needed... and the initial guess
> at the right pace is hard to get right before a couple RTTs go by.
> 
> I look forward to learning what's up.
> 
> On Tue, May 27, 2014 at 8:23 AM, Jim Gettys  wrote:
> >
> >
> >
> > On Sun, May 25, 2014 at 4:00 PM,  wrote:
> >>
> >> Not that it is directly relevant, but there is no essential reason to
> >> require 50 ms. of buffering.  That might be true of some particular
> >> QOS-related router algorithm.  50 ms. is about all one can tolerate in
> any
> >> router between source and destination for today's networks - an
> upper-bound
> >> rather than a minimum.
> >>
> >>
> >>
> >> The optimum buffer state for throughput is 1-2 packets worth - in other
> >> words, if we have an MTU of 1500, 1500 - 3000 bytes. Only the bottleneck
> >> buffer (the input queue to the lowest speed link along the path) should
> have
> >> this much actually buffered. Buffering more than this increases
> end-to-end
> >> latency beyond its optimal state.  Increased end-to-end latency reduces
> the
> >> effectiveness of control loops, creating more congestion.
> 
> This misses an important facet of modern MACs (wifi, wireless, cable, and 
> gpon),
> which can aggregate 32k or more in packets.
> 
> So the ideal size in those cases is much larger than an MTU, and has additional
> factors governing the ideal - such as the probability of a packet loss 
> inducing
> a retransmit
> 
> Ethernet, sure.
> 
> >>
> >>
> >>
> >> The rationale for having 50 ms. of buffering is probably to avoid
> >> disruption of bursty mixed flows where the bursts might persist for 50
> ms.
> >> and then die. One reason for this is that source nodes run operating
> systems
> >> that tend to release packets in bursts. That's a whole other discussion -
> in
> >> an ideal world, source nodes would avoid bursty packet releases by
> letting
> >> the control by the receiver window be "tight" timing-wise.  That is, to
> >> transmit a packet immediately at the instant an ACK arrives increasing
> the
> >> window.  This would pace the flow - current OS's tend (due to scheduling
> >> mismatches) to send bursts of packets, "catching up" on sending that
> could
> >> have been spaced out and done earlier if the feedback from the
> receiver's
> >> window advan

Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-28 Thread dpreed

I did not mean that "pacing".  Sorry I used a generic term.  I meant what my 
longer description described - a specific mechanism for reducing bunching that 
is essentially "cooperative" among all active flows through a bottlenecked 
link.  That's part of a "closed loop" control system driving each TCP endpoint 
into a cooperative mode.
 
The thing you call "pacing" is something quite different.  It is disconnected 
from the TCP control loops involved, which basically means it is flying blind.  
Introducing that kind of "pacing" almost certainly reduces throughput, because 
it *delays* packets.
 
The thing I called "pacing" is in no version of Linux that I know of.  Give it 
a different name: "anti-bunching cooperation" or "timing phase management for 
congestion reduction". Rather than *delaying* packets, it tries to get packets 
to avoid bunching only when reducing window size, and doing so by tightening 
the control loop so that the sender transmits as *soon* as it can, not by 
delaying sending after the sender dallies around not sending when it can.
 
 
 
 
 


On Tuesday, May 27, 2014 11:23am, "Jim Gettys"  said:







On Sun, May 25, 2014 at 4:00 PM, dpr...@reed.com wrote:

Not that it is directly relevant, but there is no essential reason to require 
50 ms. of buffering.  That might be true of some particular QOS-related router 
algorithm.  50 ms. is about all one can tolerate in any router between source 
and destination for today's networks - an upper-bound rather than a minimum.
 
The optimum buffer state for throughput is 1-2 packets worth - in other words, 
if we have an MTU of 1500, 1500 - 3000 bytes. Only the bottleneck buffer (the 
input queue to the lowest speed link along the path) should have this much 
actually buffered. Buffering more than this increases end-to-end latency beyond 
its optimal state.  Increased end-to-end latency reduces the effectiveness of 
control loops, creating more congestion.
 
The rationale for having 50 ms. of buffering is probably to avoid disruption of 
bursty mixed flows where the bursts might persist for 50 ms. and then die. One 
reason for this is that source nodes run operating systems that tend to release 
packets in bursts. That's a whole other discussion - in an ideal world, source 
nodes would avoid bursty packet releases by letting the control by the receiver 
window be "tight" timing-wise.  That is, to transmit a packet immediately at 
the instant an ACK arrives increasing the window.  This would pace the flow - 
current OS's tend (due to scheduling mismatches) to send bursts of packets, 
"catching up" on sending that could have been spaced out and done earlier if 
the feedback from the receiver's window advancing were heeded.

 
That is, endpoint network stacks (TCP implementations) can worsen congestion by 
"dallying".  The ideal end-to-end flows occupying a congested router would have 
their packets paced so that the packets end up being sent in the least bursty 
manner that an application can support.  The effect of this pacing is to move 
the "backlog" for each flow quickly into the source node for that flow, which 
then provides back pressure on the application driving the flow, which 
ultimately is necessary to stanch congestion.  The ideal congestion control 
mechanism slows the sender part of the application to a pace that can go 
through the network without contributing to buffering.

Pacing is in Linux 3.12(?).  How long it will take to see widespread 
deployment is another question, and as for other operating systems, who knows.
See: https://lwn.net/Articles/564978/
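
The application-visible knob that goes with this is the SO_MAX_PACING_RATE
socket option, added alongside the fq work. A minimal sketch (the 2 MB/s cap
is arbitrary, and the fq qdisc still has to be on the egress interface for
the packets to actually get spread out):

#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_MAX_PACING_RATE
#define SO_MAX_PACING_RATE 47	/* asm-generic value on Linux */
#endif

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	unsigned int rate = 2 * 1000 * 1000;	/* cap, in bytes per second */

	if (fd < 0 || setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
				 &rate, sizeof(rate)) < 0)
		perror("SO_MAX_PACING_RATE");
	else
		printf("socket pacing capped at %u bytes/s\n", rate);
	return 0;
}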
 
Current network stacks (including Linux's) don't achieve that goal - their 
pushback on application sources is minimal - instead they accumulate buffering 
internal to the network implementation.
This is much, much less true than it once was.  There have been substantial 
changes in the Linux TCP stack in the last year or two, to avoid generating 
packets before necessary.  Again, how long it will take for people to deploy 
this on Linux (and implement on other OS's) is a question.

This contributes to end-to-end latency as well.  But if you think about it, 
this is almost as bad as switch-level bufferbloat in terms of degrading user 
experience.  The reason I say "almost" is that there are tools, rarely used in 
practice, that allow an application to specify that buffering should not build 
up in the network stack (in the kernel or wherever it is).  But the default is 
not to use those APIs, and to buffer way too much.
 
Remember, the network send stack can act similarly to a congested switch (it is 
a switch among all the user applications running on that node).  IF there is a 
heavy file transfer, the file transfer's buffering acts to increase latency for 
all other networked communications on that machine.
 
Traditionally this problem has been thought of only as a within-node fairness 
issue, b

Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-27 Thread David Lang

On Tue, 27 May 2014, Dave Taht wrote:


On Tue, May 27, 2014 at 4:27 PM, David Lang  wrote:

On Tue, 27 May 2014, Dave Taht wrote:


There is a phrase in this thread that is begging to bother me.

"Throughput". Everyone assumes that throughput is a big goal - and it
certainly is - and latency is also a big goal - and it certainly is -
but by specifying what you want from "throughput" as a compromise with
latency is not the right thing...

If what you want is actually "high speed in-order packet delivery" -
say, for example a movie,
a video conference, or youtube - excessive
latency with high throughput, really, really makes in-order packet
delivery at high speed tough.



the key word here is "excessive", that's why I said that for max throughput
you want to buffer as much as your latency budget will allow you to.


Again I'm trying to make a distinction between "throughput", and "packets
delivered-in-order-to-the-user." (for-which-we-need-a-new-word-I think)

The buffering should not be in-the-network, it can be in the application.

Take our hypothetical video stream for example. I am 20ms RTT from netflix.
If I artificially inflate that by adding 50ms of in-network buffering,
that means a loss can
take 120ms to recover from.

If instead, I keep a 3*RTT buffer in my application, and expect that I have 5ms
worth of network-buffering, instead, I recover from a loss in 40ms.

(please note, it's late, I might not have got the math entirely right)


but you aren't going to be tuning the retry wait time per connection. what is 
the retry time that is set in your stack? It's something huge to survive 
international connections with satellite paths (so several seconds worth). If 
your server-to-eyeball buffering is shorter than this, you will get a window 
where you aren't fully utilizing the connection.
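
As a reference point for "the retry time that is set in your stack": the
standard RTO calculation (RFC 6298) smooths the measured RTT and its variance
and floors the result - classically at one second, though implementations
differ. A sketch with made-up satellite-ish samples:

#include <stdio.h>

int main(void)
{
	double samples_ms[] = { 550, 600, 580, 620, 900 };
	double srtt = 0.0, rttvar = 0.0, rto;
	int first = 1;

	for (int i = 0; i < 5; i++) {
		double r = samples_ms[i];
		if (first) {			/* RFC 6298 initial values */
			srtt = r;
			rttvar = r / 2;
			first = 0;
		} else {
			rttvar = 0.75 * rttvar + 0.25 * (srtt > r ? srtt - r : r - srtt);
			srtt   = 0.875 * srtt + 0.125 * r;
		}
		rto = srtt + 4 * rttvar;
		if (rto < 1000)			/* the RFC's classic 1 s floor */
			rto = 1000;
		printf("RTT sample %.0f ms -> RTO %.0f ms\n", r, rto);
	}
	return 0;
}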


so yes, I do think that if your purpose is to get the maximum possible in-order 
packets delivered, you end up making different decisions than if you are just 
trying to stream an HD video, or do other normal things.


The problem is thinking that this absolute throughput is representative of 
normal use.



As physical RTTs grow shorter, the advantages of smaller buffers grow larger.

You don't need 50ms queueing delay on a 100us path.

Many applications buffer for seconds due to needing to be at least
2*(actual buffering+RTT) on the path.


For something like streaming video, there's nothing wrong with the application 
buffering aggressively (assuming you have the space to do so on the client side), 
the more you have gotten transmitted to the client, the longer it can survive a 
disruption of its network.


There's nothing wrong with having an hour of buffered data between the server 
and the viewer's eyes. Now, this buffering should not be in the network devices, it should be in the 
client app, but this isn't because there's something wrong with buffering, it's 
just because the client device has so much more available space to hold stuff.


David Lang




You eventually lose a packet, and you have to wait a really long time
until a replacement arrives. Stuart and I showed that at last ietf.
And you get the classic "buffering" song playing



Yep, and if you buffer too much, your "lost packet" is actually still in
flight and eating bandwidth.

David Lang



low latency makes recovery from a loss in an in-order stream much, much
faster.

Honestly, for most applications on the web, what you want is high
speed in-order packet delivery, not
"bulk throughput". There is a whole class of apps (bittorrent, file
transfer) that don't need that, and we
have protocols for those



On Tue, May 27, 2014 at 2:19 PM, David Lang  wrote:


the problem is that paths change, they mix traffic from streams, and in
other ways the utilization of the links can change radically in a short
amount of time.

If you try to limit things to exactly the ballistic throughput, you are
not
going to be able to exactly maintain this state, you are either going to
overshoot (too much traffic, requiring dropping packets to maintain your
minimal buffer), or you are going to undershoot (too little traffic and
your
connection is idle)

Since you can't predict all the competing traffic throughout the
Internet,
if you want to maximize throughput, you want to buffer as much as you can
tolerate for latency reasons. For most apps, this is more than enough to
cause problems for other connections.

David Lang


 On Mon, 26 May 2014, David P. Reed wrote:


Codel and PIE are excellent first steps... but I don't think they are
the
best eventual approach.  I want to see them deployed ASAP in CMTSes and
server load balancing networks... it would be a disaster to not deploy
the
far better option we have today immediately at the point of most
leverage.
The best is the enemy of the good.

But, the community needs to learn once and for all that throughput and
latency do not trade off. We can in principle get far better latency
while
m

Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-27 Thread Dave Taht
On Tue, May 27, 2014 at 4:27 PM, David Lang  wrote:
> On Tue, 27 May 2014, Dave Taht wrote:
>
>> There is a phrase in this thread that is begging to bother me.
>>
>> "Throughput". Everyone assumes that throughput is a big goal - and it
>> certainly is - and latency is also a big goal - and it certainly is -
>> but by specifying what you want from "throughput" as a compromise with
>> latency is not the right thing...
>>
>> If what you want is actually "high speed in-order packet delivery" -
>> say, for example a movie,
>> a video conference, or youtube - excessive
>> latency with high throughput, really, really makes in-order packet
>> delivery at high speed tough.
>
>
> the key word here is "excessive", that's why I said that for max throughput
> you want to buffer as much as your latency budget will allow you to.

Again I'm trying to make a distinction between "throughput", and "packets
delivered-in-order-to-the-user." (for-which-we-need-a-new-word-I think)

The buffering should not be in-the-network, it can be in the application.

Take our hypothetical video stream for example. I am 20ms RTT from netflix.
If I artificially inflate that by adding 50ms of in-network buffering,
that means a loss can
take 120ms to recover from.

If instead, I keep a 3*RTT buffer in my application, and expect that I have 5ms
worth of network-buffering, instead, I recover from a loss in 40ms.

(please note, it's late, I might not have got the math entirely right)
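
A back-of-the-envelope version of that arithmetic (my simplification: assume
detecting the hole and fetching the retransmit each cost roughly one
queue-inflated RTT, which lands in the same ballpark as the 120 ms and 40 ms
figures above):

#include <stdio.h>

/* recovery time ~ 2 x (base RTT + standing queueing delay) */
static double recovery_ms(double rtt_ms, double queue_ms)
{
	return 2.0 * (rtt_ms + queue_ms);
}

int main(void)
{
	printf("20 ms RTT + 50 ms in-network queue: ~%.0f ms to recover\n",
	       recovery_ms(20, 50));
	printf("20 ms RTT +  5 ms in-network queue: ~%.0f ms to recover\n",
	       recovery_ms(20, 5));
	return 0;
}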

As physical RTTs grow shorter, the advantages of smaller buffers grow larger.

You don't need 50ms queueing delay on a 100us path.

Many applications buffer for seconds due to needing to be at least
2*(actual buffering+RTT) on the path.

>
>> You eventually lose a packet, and you have to wait a really long time
>> until a replacement arrives. Stuart and I showed that at last ietf.
>> And you get the classic "buffering" song playing
>
>
> Yep, and if you buffer too much, your "lost packet" is actually still in
> flight and eating bandwidth.
>
> David Lang
>
>
>> low latency makes recovery from a loss in an in-order stream much, much
>> faster.
>>
>> Honestly, for most applications on the web, what you want is high
>> speed in-order packet delivery, not
>> "bulk throughput". There is a whole class of apps (bittorrent, file
>> transfer) that don't need that, and we
>> have protocols for those
>>
>>
>>
>> On Tue, May 27, 2014 at 2:19 PM, David Lang  wrote:
>>>
>>> the problem is that paths change, they mix traffic from streams, and in
>>> other ways the utilization of the links can change radically in a short
>>> amount of time.
>>>
>>> If you try to limit things to exactly the ballistic throughput, you are
>>> not
>>> going to be able to exactly maintain this state, you are either going to
>>> overshoot (too much traffic, requiring dropping packets to maintain your
>>> minimal buffer), or you are going to undershoot (too little traffic and
>>> your
>>> connection is idle)
>>>
>>> Since you can't predict all the competing traffic throughout the
>>> Internet,
>>> if you want to maximize throughput, you want to buffer as much as you can
>>> tolerate for latency reasons. For most apps, this is more than enough to
>>> cause problems for other connections.
>>>
>>> David Lang
>>>
>>>
>>>  On Mon, 26 May 2014, David P. Reed wrote:
>>>
 Codel and PIE are excellent first steps... but I don't think they are
 the
 best eventual approach.  I want to see them deployed ASAP in CMTSes and
 server load balancing networks... it would be a disaster to not deploy
 the
 far better option we have today immediately at the point of most
 leverage.
 The best is the enemy of the good.

 But, the community needs to learn once and for all that throughput and
 latency do not trade off. We can in principle get far better latency
 while
 maintaining high throughput and we need to start thinking about
 that.
 That means that the framing of the issue as AQM is counterproductive.

 On May 26, 2014, Mikael Abrahamsson  wrote:
>
>
> On Mon, 26 May 2014, dpr...@reed.com wrote:
>
>> I would look to queue minimization rather than "queue management" (which
>> implied queues are often long) as a goal, and think harder about the
>> end-to-end problem of minimizing total end-to-end queueing delay while
>> maximizing throughput.
>
>
>
> As far as I can tell, this is exactly what CODEL and PIE try to do.
> They try to find a decent tradeoff between having queues to make sure the
> pipe is filled, and not making these queues big enough to seriously affect
> interactive performance.
>
> The latter part looks like what LEDBAT does?
> 
>
> Or are you thinking about something else?



 -- Sent from my Android d

Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-27 Thread David Lang

On Tue, 27 May 2014, Dave Taht wrote:


There is a phrase in this thread that is begging to bother me.

"Throughput". Everyone assumes that throughput is a big goal - and it
certainly is - and latency is also a big goal - and it certainly is -
but by specifying what you want from "throughput" as a compromise with
latency is not the right thing...

If what you want is actually "high speed in-order packet delivery" -
say, for example a movie, youtube, or a video conference - excessive
latency with high throughput, really, really makes in-order packet
delivery at high speed tough.


the key word here is "excessive"; that's why I said that for max throughput you
want to buffer as much as your latency budget will allow.
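
To make the latency-budget idea concrete, a quick sketch (python; the rates
and the 50 ms budget are illustrative, not a recommendation):

def buffer_bytes(link_rate_bps, latency_budget_s):
    # "buffer as much as your latency budget allows" = link rate * budget
    return link_rate_bps / 8 * latency_budget_s

for rate_mbit in (10, 100, 1000):
    b = buffer_bytes(rate_mbit * 1e6, 50e-3)
    print(f"{rate_mbit:5d} Mbit/s, 50 ms budget -> {b / 1e6:.2f} MB of buffer")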



You eventually lose a packet, and you have to wait a really long time
until a replacement arrives. Stuart and I showed that at last ietf.
And you get the classic "buffering" song playing


Yep, and if you buffer too much, your "lost packet" is actually still in flight 
and eating bandwidth.


David Lang


low latency makes recovery from a loss in an in-order stream much, much faster.

Honestly, for most applications on the web, what you want is high
speed in-order packet delivery, not
"bulk throughput". There is a whole class of apps (bittorrent, file
transfer) that don't need that, and we
have protocols for those



On Tue, May 27, 2014 at 2:19 PM, David Lang  wrote:

the problem is that paths change, they mix traffic from streams, and in
other ways the utilization of the links can change radically in a short
amount of time.

If you try to limit things to exactly the ballistic throughput, you are not
going to be able to exactly maintain this state, you are either going to
overshoot (too much traffic, requiring dropping packets to maintain your
minimal buffer), or you are going to undershoot (too little traffic and your
connection is idle)

Since you can't predict all the competing traffic throughout the Internet,
if you want to maximize throughput, you want to buffer as much as you can
tolerate for latency reasons. For most apps, this is more than enough to
cause problems for other connections.

David Lang


 On Mon, 26 May 2014, David P. Reed wrote:


Codel and PIE are excellent first steps... but I don't think they are the
best eventual approach.  I want to see them deployed ASAP in CMTS' s and
server load balancing networks... it would be a disaster to not deploy the
far better option we have today immediately at the point of most leverage.
The best is the enemy of the good.

But, the community needs to learn once and for all that throughput and
latency do not trade off. We can in principle get far better latency while
maintaining high throughput and we need to start thinking about that.
That means that the framing of the issue as AQM is counterproductive.

On May 26, 2014, Mikael Abrahamsson  wrote:


On Mon, 26 May 2014, dpr...@reed.com wrote:


I would look to queue minimization rather than "queue management" (which
implied queues are often long) as a goal, and think harder about the
end-to-end problem of minimizing total end-to-end queueing delay while
maximizing throughput.


As far as I can tell, this is exactly what CODEL and PIE tries to do. They
try to find a decent tradeoff between having queues to make sure the pipe
is filled, and not making these queues big enough to seriously affect
interactive performance.

The latter part looks like what LEDBAT does?

Or are you thinking about something else?



-- Sent from my Android device with K-@ Mail. Please excuse my brevity.



___
Cerowrt-devel mailing list
Cerowrt-devel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cerowrt-devel

___
Cerowrt-devel mailing list
Cerowrt-devel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cerowrt-devel







___
Cerowrt-devel mailing list
Cerowrt-devel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cerowrt-devel


Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-27 Thread Dave Taht
There is a phrase in this thread that is begging to bother me.

"Throughput". Everyone assumes that throughput is a big goal - and it
certainly is - and latency is also a big goal - and it certainly is -
but by specifying what you want from "throughput" as a compromise with
latency is not the right thing...

If what you want is actually "high speed in-order packet delivery" -
say, for example a movie, youtube, or a video conference - excessive
latency with high throughput, really, really makes in-order packet
delivery at high speed tough.

You eventually lose a packet, and you have to wait a really long time
until a replacement arrives. Stuart and I showed that at last ietf.
And you get the classic "buffering" song playing

low latency makes recovery from a loss in an in-order stream much, much faster.

Honestly, for most applications on the web, what you want is high
speed in-order packet delivery, not
"bulk throughput". There is a whole class of apps (bittorrent, file
transfer) that don't need that, and we
have protocols for those



On Tue, May 27, 2014 at 2:19 PM, David Lang  wrote:
> the problem is that paths change, they mix traffic from streams, and in
> other ways the utilization of the links can change radically in a short
> amount of time.
>
> If you try to limit things to exactly the ballistic throughput, you are not
> going to be able to exactly maintain this state, you are either going to
> overshoot (too much traffic, requiring dropping packets to maintain your
> minimal buffer), or you are going to undershoot (too little traffic and your
> connection is idle)
>
> Since you can't predict all the competing traffic throughout the Internet,
> if you want to maximize throughput, you want to buffer as much as you can
> tolerate for latency reasons. For most apps, this is more than enough to
> cause problems for other connections.
>
> David Lang
>
>
>  On Mon, 26 May 2014, David P. Reed wrote:
>
>> Codel and PIE are excellent first steps... but I don't think they are the
>> best eventual approach.  I want to see them deployed ASAP in CMTS' s and
>> server load balancing networks... it would be a disaster to not deploy the
>> far better option we have today immediately at the point of most leverage.
>> The best is the enemy of the good.
>>
>> But, the community needs to learn once and for all that throughput and
>> latency do not trade off. We can in principle get far better latency while
>> maintaining high throughput and we need to start thinking about that.
>> That means that the framing of the issue as AQM is counterproductive.
>>
>> On May 26, 2014, Mikael Abrahamsson  wrote:
>>>
>>> On Mon, 26 May 2014, dpr...@reed.com wrote:
>>>
>>>> I would look to queue minimization rather than "queue management" (which
>>>> implied queues are often long) as a goal, and think harder about the
>>>> end-to-end problem of minimizing total end-to-end queueing delay while
>>>> maximizing throughput.
>>>
>>> As far as I can tell, this is exactly what CODEL and PIE tries to do. They
>>> try to find a decent tradeoff between having queues to make sure the pipe
>>> is filled, and not making these queues big enough to seriously affect
>>> interactive performance.
>>>
>>> The latter part looks like what LEDBAT does?
>>>
>>> Or are you thinking about something else?
>>
>>
>> -- Sent from my Android device with K-@ Mail. Please excuse my brevity.
>
>
> ___
> Cerowrt-devel mailing list
> Cerowrt-devel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cerowrt-devel
>
> ___
> Cerowrt-devel mailing list
> Cerowrt-devel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cerowrt-devel
>



-- 
Dave Täht

NSFW: 
https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article
___
Cerowrt-devel mailing list
Cerowrt-devel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cerowrt-devel


Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-27 Thread David Lang
the problem is that paths change, they mix traffic from streams, and in other 
ways the utilization of the links can change radically in a short amount of 
time.


If you try to limit things to exactly the ballistic throughput, you are not 
going to be able to exactly maintain this state, you are either going to 
overshoot (too much traffic, requiring dropping packets to maintain your minimal 
buffer), or you are going to undershoot (too little traffic and your connection 
is idle)


Since you can't predict all the competing traffic throughout the Internet, if 
you want to maximize throughput, you want to buffer as much as you can tolerate 
for latency reasons. For most apps, this is more than enough to cause problems 
for other connections.


David Lang

 On Mon, 26 May 2014, David P. Reed wrote:

Codel and PIE are excellent first steps... but I don't think they are the best 
eventual approach.  I want to see them deployed ASAP in CMTS' s and server 
load balancing networks... it would be a disaster to not deploy the far better 
option we have today immediately at the point of most leverage. The best is 
the enemy of the good.


But, the community needs to learn once and for all that throughput and latency 
do not trade off. We can in principle get far better latency while maintaining 
high throughput and we need to start thinking about that.  That means that 
the framing of the issue as AQM is counterproductive.


On May 26, 2014, Mikael Abrahamsson  wrote:

On Mon, 26 May 2014, dpr...@reed.com wrote:


I would look to queue minimization rather than "queue management" (which
implied queues are often long) as a goal, and think harder about the
end-to-end problem of minimizing total end-to-end queueing delay while
maximizing throughput.


As far as I can tell, this is exactly what CODEL and PIE tries to do. They
try to find a decent tradeoff between having queues to make sure the pipe
is filled, and not making these queues big enough to seriously affect
interactive performance.

The latter part looks like what LEDBAT does?

Or are you thinking about something else?


-- Sent from my Android device with K-@ Mail. Please excuse my brevity.___
Cerowrt-devel mailing list
Cerowrt-devel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cerowrt-devel
___
Cerowrt-devel mailing list
Cerowrt-devel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cerowrt-devel


Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-27 Thread Dave Taht
This has been a good thread, and I'm sorry it was mostly on
cerowrt-devel rather than the main list...

It is not clear from observing google's deployment that pacing of the
IW is not in use. I see
clear 1ms boundaries for individual flows on much lower than iw10
boundaries. (e.g. I see 1-4
packets at a time arrive at 1ms intervals - but this could be an
artifact of the capture, intermediate
devices, etc)

sch_fq comes with explicit support for spreading out the initial
window, (by default it allows a full iw10 burst however) and tcp small
queues and pacing-aware tcps and the tso fixes and stuff we don't know
about all are collaborating to reduce the web burst size...

sch_fq_codel used as the host/router qdisc basically does spread out
any flow if there is a bottleneck on the link. The pacing stuff
spreads flow delivery out across an estimate of srtt by clock tick...

It makes tremendous sense to pace out a flow if you are hitting the
wire at 10gbit and know you are stepping down to 100mbit or less on
the end device - that 100x difference in rate is meaningful... and at
the same time to get full throughput out of 10gbit some level of tso
offloads is needed... and the initial guess
at the right pace is hard to get right before a couple RTTs go by.
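
A minimal sketch of the pacing arithmetic (python; the function and the
cwnd/srtt numbers are made up, and the real sch_fq / TCP pacing logic is
considerably richer):

def pacing(cwnd_packets, mss_bytes, srtt_s):
    # Spread one window's worth of data across the srtt estimate
    rate_bps = cwnd_packets * mss_bytes * 8 / srtt_s
    gap_us = mss_bytes * 8 / rate_bps * 1e6   # inter-packet gap at that rate
    return rate_bps, gap_us

rate, gap = pacing(cwnd_packets=10, mss_bytes=1448, srtt_s=0.020)
print(f"~{rate / 1e6:.1f} Mbit/s pacing rate, one packet every ~{gap:.0f} us")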

I look forward to learning what's up.

On Tue, May 27, 2014 at 8:23 AM, Jim Gettys  wrote:
>
>
>
> On Sun, May 25, 2014 at 4:00 PM,  wrote:
>>
>> Not that it is directly relevant, but there is no essential reason to
>> require 50 ms. of buffering.  That might be true of some particular
>> QOS-related router algorithm.  50 ms. is about all one can tolerate in any
>> router between source and destination for today's networks - an upper-bound
>> rather than a minimum.
>>
>>
>>
>> The optimum buffer state for throughput is 1-2 packets worth - in other
>> words, if we have an MTU of 1500, 1500 - 3000 bytes. Only the bottleneck
>> buffer (the input queue to the lowest speed link along the path) should have
>> this much actually buffered. Buffering more than this increases end-to-end
>> latency beyond its optimal state.  Increased end-to-end latency reduces the
>> effectiveness of control loops, creating more congestion.

This misses an important facet of modern macs (wifi, wireless, cable, and gpon),
which can aggregate 32k or more of packets.

So the ideal size in those cases is much larger than a MTU, and has additional
factors governing the ideal - such as the probability of a packet loss inducing
a retransmit

Ethernet, sure.
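
To put rough numbers on the aggregation point (python; sizes are
illustrative):

mtu = 1500
aggregate = 32 * 1024           # the ~32k aggregate mentioned above
print(aggregate // mtu)         # ~21 full-size packets per transmit opportunity
print(2 * aggregate)            # "1-2 buffers" worth here is ~64 KB, not ~3 KB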

>>
>>
>>
>> The rationale for having 50 ms. of buffering is probably to avoid
>> disruption of bursty mixed flows where the bursts might persist for 50 ms.
>> and then die. One reason for this is that source nodes run operating systems
>> that tend to release packets in bursts. That's a whole other discussion - in
>> an ideal world, source nodes would avoid bursty packet releases by letting
>> the control by the receiver window be "tight" timing-wise.  That is, to
>> transmit a packet immediately at the instant an ACK arrives increasing the
>> window.  This would pace the flow - current OS's tend (due to scheduling
>> mismatches) to send bursts of packets, "catching up" on sending that could
>> have been spaced out and done earlier if the feedback from the receiver's
>> window advancing were heeded.

This loop has got ever tighter since linux 3.3, to where it's really as tight
as a modern cpu scheduler can get it. (or so I keep thinking -
but successive improvements in linux tcp keep proving me wrong. :)

I am really in awe of linux tcp these days. Recently I was benchmarking
windows and macos. Windows only got 60% of the throughput linux tcp
did at gigE speeds, and osx had a lot of issues at 10mbit and below
(stretch acks and holding the window too high for the path).

I keep hoping better ethernet hardware will arrive that can mix flows
even more.

>>
>>
>>
>> That is, endpoint network stacks (TCP implementations) can worsen
>> congestion by "dallying".  The ideal end-to-end flows occupying a congested
>> router would have their packets paced so that the packets end up being sent
>> in the least bursty manner that an application can support.  The effect of
>> this pacing is to move the "backlog" for each flow quickly into the source
>> node for that flow, which then provides back pressure on the application
>> driving the flow, which ultimately is necessary to stanch congestion.  The
>> ideal congestion control mechanism slows the sender part of the application
>> to a pace that can go through the network without contributing to buffering.
>
>
> Pacing is in Linux 3.12(?).  How long it will take to see widespread
> deployment is another question, and as for other operating systems, who
> knows.
>
> See: https://lwn.net/Articles/564978/

Steinar drove some of this with persistence and results...

http://www.linux-support.com/cms/steinar-h-gunderson-paced-tcp-and-the-fq-scheduler/

>>
>>
>>
>> Current network stacks (including Linux

Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-27 Thread Jim Gettys
On Sun, May 25, 2014 at 4:00 PM,  wrote:

> Not that it is directly relevant, but there is no essential reason to
> require 50 ms. of buffering.  That might be true of some particular
> QOS-related router algorithm.  50 ms. is about all one can tolerate in any
> router between source and destination for today's networks - an upper-bound
> rather than a minimum.
>
>
>
> The optimum buffer state for throughput is 1-2 packets worth - in other
> words, if we have an MTU of 1500, 1500 - 3000 bytes. Only the bottleneck
> buffer (the input queue to the lowest speed link along the path) should
> have this much actually buffered. Buffering more than this increases
> end-to-end latency beyond its optimal state.  Increased end-to-end latency
> reduces the effectiveness of control loops, creating more congestion.
>
>
>
> The rationale for having 50 ms. of buffering is probably to avoid
> disruption of bursty mixed flows where the bursts might persist for 50 ms.
> and then die. One reason for this is that source nodes run operating
> systems that tend to release packets in bursts. That's a whole other
> discussion - in an ideal world, source nodes would avoid bursty packet
> releases by letting the control by the receiver window be "tight"
> timing-wise.  That is, to transmit a packet immediately at the instant an
> ACK arrives increasing the window.  This would pace the flow - current OS's
> tend (due to scheduling mismatches) to send bursts of packets, "catching
> up" on sending that could have been spaced out and done earlier if the
> feedback from the receiver's window advancing were heeded.
>
>
> That is, endpoint network stacks (TCP implementations) can worsen
> congestion by "dallying".  The ideal end-to-end flows occupying a congested
> router would have their packets paced so that the packets end up being sent
> in the least bursty manner that an application can support.  The effect of
> this pacing is to move the "backlog" for each flow quickly into the source
> node for that flow, which then provides back pressure on the application
> driving the flow, which ultimately is necessary to stanch congestion.  The
> ideal congestion control mechanism slows the sender part of the application
> to a pace that can go through the network without contributing to buffering.
>

Pacing is in Linux 3.12(?).  How long it will take to see widespread
deployment is another question, and as for other operating systems, who
knows.

See: https://lwn.net/Articles/564978/

>
>
> Current network stacks (including Linux's) don't achieve that goal - their
> pushback on application sources is minimal - instead they accumulate
> buffering internal to the network implementation.
>

This is much, much less true than it once was.  There have been
substantial changes in the Linux TCP stack in the last year or two, to
avoid generating packets before necessary.  Again, how long it will take
for people to deploy this on Linux (and implement on other OS's) is a
question.

> This contributes to end-to-end latency as well.  But if you think about
> it, this is almost as bad as switch-level bufferbloat in terms of degrading
> user experience.  The reason I say "almost" is that there are tools, rarely
> used in practice, that allow an application to specify that buffering
> should not build up in the network stack (in the kernel or wherever it is).
>  But the default is not to use those APIs, and to buffer way too much.
>
>
>
> Remember, the network send stack can act similarly to a congested switch
> (it is a switch among all the user applications running on that node).  IF
> there is a heavy file transfer, the file transfer's buffering acts to
> increase latency for all other networked communications on that machine.
>
>
>
> Traditionally this problem has been thought of only as a within-node
> fairness issue, but in fact it has a big effect on the switches in between
> source and destination due to the lack of dispersed pacing of the packets
> at the source - in other words, the current design does nothing to stem the
> "burst groups" from a single source mentioned above.
>
>
>
> So we do need the source nodes to implement less "bursty" sending stacks.
>  This is especially true for multiplexed source nodes, such as web servers
> implementing thousands of flows.
>
>
>
> A combination of codel-style switch-level buffer management and the stack
> at the sender being implemented to spread packets in a particular TCP flow
> out over time would improve things a lot.  To achieve best throughput, the
> optimal way to spread packets out on an end-to-end basis is to update the
> receive window (sending ACK) at the receive end as quickly as possible, and
> to respond to the updated receive window as quickly as possible when it
> increases.
>
>
>
> Just like the "bufferbloat" issue, the problem is caused by applications
> like streaming video, file transfers and big web pages that the application
> programmer sees as not having a latency req

Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-26 Thread David P. Reed
Codel and PIE are excellent first steps... but I don't think they are the best 
eventual approach.  I want to see them deployed ASAP in CMTS' s and server load 
balancing networks... it would be a disaster to not deploy the far better 
option we have today immediately at the point of most leverage. The best is the 
enemy of the good.

But, the community needs to learn once and for all that throughput and latency 
do not trade off. We can in principle get far better latency while maintaining 
high throughput and we need to start thinking about that.  That means that 
the framing of the issue as AQM is counterproductive. 

On May 26, 2014, Mikael Abrahamsson  wrote:
>On Mon, 26 May 2014, dpr...@reed.com wrote:
>
>> I would look to queue minimization rather than "queue management" (which
>> implied queues are often long) as a goal, and think harder about the
>> end-to-end problem of minimizing total end-to-end queueing delay while
>> maximizing throughput.
>
>As far as I can tell, this is exactly what CODEL and PIE tries to do.
>They 
>try to find a decent tradeoff between having queues to make sure the
>pipe 
>is filled, and not making these queues big enough to seriously affect 
>interactive performance.
>
>The latter part looks like what LEDBAT does?
>
>
>Or are you thinking about something else?

-- Sent from my Android device with K-@ Mail. Please excuse my brevity.___
Cerowrt-devel mailing list
Cerowrt-devel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cerowrt-devel


Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-26 Thread Mikael Abrahamsson

On Mon, 26 May 2014, dpr...@reed.com wrote:

I would look to queue minimization rather than "queue management" (which 
implied queues are often long) as a goal, and think harder about the 
end-to-end problem of minimizing total end-to-end queueing delay while 
maximizing throughput.


As far as I can tell, this is exactly what CODEL and PIE tries to do. They 
try to find a decent tradeoff between having queues to make sure the pipe 
is filled, and not making these queues big enough to seriously affect 
interactive performance.


The latter part looks like what LEDBAT does?


Or are you thinking about something else?
___
Cerowrt-devel mailing list
Cerowrt-devel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cerowrt-devel


Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-26 Thread dpreed

 
On Monday, May 26, 2014 9:02am, "Mikael Abrahamsson"  said:



> So, I'd agree that a lot of the time you need very little buffers, but
> stating you need a buffer of 2 packets deep regardless of speed, well, I
> don't see how that would work.
>
 
My main point is that looking to increased buffering to achieve throughput 
while maintaining latency is not that helpful, and often causes more harm than 
good. There are alternatives to buffering that can be managed more dynamically 
(managing bunching and something I didn't mention - spreading flows or packets 
within flows across multiple routes when a bottleneck appears - are some of 
them).
 
I would look to queue minimization rather than "queue management" (which 
implied queues are often long) as a goal, and think harder about the end-to-end 
problem of minimizing total end-to-end queueing delay while maximizing 
throughput.
 
It's clearly a totally false tradeoff between throughput and latency - in the 
IP framework.  There is no such tradeoff for the operating point.  There may be 
such a tradeoff for certain specific implementations of TCP, but that's not 
fixed in stone.
 ___
Cerowrt-devel mailing list
Cerowrt-devel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cerowrt-devel


Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-26 Thread Mikael Abrahamsson

On Mon, 26 May 2014, dpr...@reed.com wrote:

Len Kleinrock and his student proved that the "optimal" state for 
throughput in the internet is the 1-2 buffer case.  It's easy to think 
this through...


Yes, but how do we achieve it?

If you signal congestion with very small buffer depth used, TCP will back 
off and you will drain the buffer, meaning the link will be underutilized. 
This is great from an interactive point of view, but not so much for 
keeping the link used actually at capacity without incurring excessive 
buffering latency?


So you would like to see ECN drop=1 markings on all packets as soon as 
they're the 3rd (or deeper) packet in the buffer? Or if the packet doesn't 
have ECN markings, you'd just drop it?


I doubt this will create a beneficial system for the end user, sounds like 
it focuses too much on interactivity and too little on throughput.


I just don't buy your statement that adding buffers won't increase 
throughput. If you're optimizing for throughput, then you let a single 
session use 1 second of buffering, meaning you know for a fact that the 
link is always going to be used at 100%. This totally kills interactivity, 
but it's still more throughput efficient than having 2 packet deep 
buffers, where you're very likely to drain these two packets and then have 
no packets in the buffer, meaning the link will be underutilized.
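
A toy model to put numbers on that tradeoff - a single Reno-style flow whose
only losses are buffer overflows; purely illustrative, not a claim about real
mixed traffic (python):

def sawtooth_utilization(buffer_pkts, bdp_pkts):
    peak = bdp_pkts + buffer_pkts     # cwnd at which the buffer overflows
    trough = peak / 2.0               # cwnd right after the window halves
    if trough >= bdp_pkts:
        return 1.0                    # buffer >= BDP: the pipe never drains
    # Below the BDP the link runs at cwnd/BDP of capacity; above it, at 100%.
    under = (bdp_pkts - trough) * (trough + bdp_pkts) / 2.0
    full = (peak - bdp_pkts) * bdp_pkts
    return (under + full) / (bdp_pkts * (peak - trough))

for buf in (2, 10, 50, 100):
    print(buf, round(sawtooth_utilization(buf, bdp_pkts=100), 2))
# -> roughly 0.76, 0.82, 0.96 and 1.0 in this simplified model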


So, I'd agree that a lot of the time you need very little buffers, but 
stating you need a buffer of 2 packets deep regardless of speed, well, I 
don't see how that would work.


--
Mikael Abrahamsson    email: swm...@swm.pp.se
___
Cerowrt-devel mailing list
Cerowrt-devel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cerowrt-devel


Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-25 Thread dpreed

Len Kleinrock and his student proved that the "optimal" state for throughput in 
the internet is the 1-2 buffer case.  It's easy to think this through...
 
A simple intuition is that each node that qualifies as a "bottleneck" (meaning 
that the input rate exceeds the service rate of the outbound queue) will work 
optimally if it is in a double buffering state - that is a complete packet 
comes in for the outbound link during the time that the current packet goes out.
 
That's topology independent.   It's equivalent to saying that the number of 
packets in flight along a path in an optimal state between two endpoints is 
equal to the path's share of the bottleneck link's capacity times the physical 
minimum RTT for the MTU packet - the amount of "pipelining" that can be 
achieved along that path.
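
A worked example of that pipelining bound (python; the shares and RTTs are
invented for illustration):

def packets_in_flight(share_bps, min_rtt_s, mtu_bytes=1500):
    # optimal in-flight data = path's share of the bottleneck rate * min RTT,
    # expressed in MTU-sized packets
    return share_bps * min_rtt_s / (mtu_bytes * 8)

print(packets_in_flight(10e6, 20e-3))    # 10 Mbit/s share, 20 ms RTT -> ~16.7 packets
print(packets_in_flight(10e6, 100e-6))   # same share, 100 us RTT -> ~0.08 packets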
 
Having more buffering can only make the throughput lower or at best the same. 
In other words, you might get the same throughput with more buffering, but more 
likely the extra buffering will make things worse.  (the rare special case is 
the "hot rod" scenario of maximum end-to-end throughput with no competing 
flows.)
 
The real congestion management issue, which I described, is the unnecessary 
"bunching" of packets in a flow.  The bunching can be ameliorated at the source 
endpoint (or controlled by the receive endpoint transmitting an ack only when 
it receives a packet in the optimal state, but immediately responding to it to 
increase the responsiveness of the control loop: analogous to "impedance 
matching" in complex networks of transmission lines - bunching analogously 
corresponds to standing waves that reduce power transfer when impedance is not 
matched approximately.  The maximum power transfer does not happen if some 
intermediate point includes a bad impedance match, storing energy that 
interferes with future energy transfer).
 
Bunching has many causes, but it's better to remove it at the entry to the 
network than to allow it to clog up latency of competing flows.
 
I'm deliberately not using queueing theory descriptions, because the queueing 
theory and control theory associated with networks that can drop packets and 
have finite buffering with end-to-end feedback congestion control is quite 
complex, especially for non-Poisson traffic - far beyond what is taught in 
elementary queueing theory.
 
But if you want, I can dig that up for you.
 
The problem of understanding the network congestion phenomenon as a whole is 
that one can not carry over intuitions from a single, multi hop, linear network 
of nodes to the global network congestion control problem.
 
[The reason a CSMA/CD (wired) or CSMA/CA (wireless) Ethernet has "collision-driven" 
exponential-random back off is the same rationale - it's important to de-bunch 
the various flows that are competing for the Ethernet segment.  The right level 
of randomness creates local de-bunching (or pacing) almost as effectively as a 
perfect, zero-latency admission control that knows the rates of all incoming 
queues. That is, when a packet ends, all senders with a packet ready to 
transmit do so.  They all collide, and back off for different times - 
de-bunching the packet arrivals that next time around. This may not achieve 
maximal throughput, however, because there's unnecessary blocking of newly 
arrived packets on the "backed-off" NICs - but fixing that is a different 
story, especially when the Ethernet is an internal path in the Internet as a 
whole - there you need some kind of buffer limiting on each NIC, and ideally to 
treat each "flow" as distinct "back-off" entity.]
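
A toy sketch of the de-bunching effect of randomized exponential backoff
(python; the slot time and attempt count are illustrative, and this is not a
faithful CSMA model):

import random

def backoff_us(attempt, slot_us=51.2, max_exp=10):
    k = min(attempt, max_exp)
    return random.randint(0, 2 ** k - 1) * slot_us

random.seed(1)
# Five stations all have a frame ready at t=0, collide, and back off randomly.
starts = sorted(backoff_us(attempt=4) for _ in range(5))
print(starts)   # the initially bunched senders are now spread out in time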
 
The same basic idea - using collision-based back-off to force competing flows 
to de-bunch themselves - and keeping the end-to-end feedback loops very, very 
short by avoiding any more than the optimal buffering, leads to a network that 
can operate at near-optimal throughput *and* near-optimal latency.
 
This is what I've been calling in my own terminology, a "ballistic state" of 
the network - analogous to, but not the same as, a gaseous rather than a liquid 
or solid phase of matter. The least congested state that has the most fluidity, 
and therefore the highest throughput of individual molecules (rather than a 
liquid which transmits pressure very well, but does not transmit individual 
tagged molecules very fast at all).
 
That's what Kleinrock and his student showed.  Counterintuitive though it may 
seem. (It doesn't seem counterintuitive to me at all, but many, many network 
engineers are taught and continue to think that you need lots of buffering to 
maximize throughput).
 
I conjecture that it's an achievable operating mode of the Internet based 
solely on end-to-end congestion-control algorithms, probably not very different 
from TCP, running over a network where each switch signals congestion to all 
flows passing through it.  It's probably the most desirable operating mode, 
because it maximizes throughput while minimizing latency simultaneously.  
There's no inh

Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-25 Thread Mikael Abrahamsson

On Sun, 25 May 2014, dpr...@reed.com wrote:

The optimum buffer state for throughput is 1-2 packets worth - in other 
words, if we have an MTU of 1500, 1500 - 3000 bytes. Only the bottleneck


No, the optimal state for throughput is to have huge buffers and have them 
filled. The optimal state for interactivity is to have very small buffers. 
FQ_CODEL tries to strike a balance between the two at 10ms of buffer. PIE 
does the same around 20ms. In order for PIE to work properly I'd say you 
need 50ms of buffering as a minimum, otherwise you're going to get 100% 
tail drop and multiple sequential drops occasionally (which might be 
desirable to keep interactivity good).


My comment about 50ms is that you seldom need a lot more.

--
Mikael Abrahamsson    email: swm...@swm.pp.se
___
Cerowrt-devel mailing list
Cerowrt-devel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cerowrt-devel



Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-25 Thread dpreed

Not that it is directly relevant, but there is no essential reason to require 
50 ms. of buffering.  That might be true of some particular QOS-related router 
algorithm.  50 ms. is about all one can tolerate in any router between source 
and destination for today's networks - an upper-bound rather than a minimum.
 
The optimum buffer state for throughput is 1-2 packets worth - in other words, 
if we have an MTU of 1500, 1500 - 3000 bytes. Only the bottleneck buffer (the 
input queue to the lowest speed link along the path) should have this much 
actually buffered. Buffering more than this increases end-to-end latency beyond 
its optimal state.  Increased end-to-end latency reduces the effectiveness of 
control loops, creating more congestion.
 
The rationale for having 50 ms. of buffering is probably to avoid disruption of 
bursty mixed flows where the bursts might persist for 50 ms. and then die. One 
reason for this is that source nodes run operating systems that tend to release 
packets in bursts. That's a whole other discussion - in an ideal world, source 
nodes would avoid bursty packet releases by letting the control by the receiver 
window be "tight" timing-wise.  That is, to transmit a packet immediately at 
the instant an ACK arrives increasing the window.  This would pace the flow - 
current OS's tend (due to scheduling mismatches) to send bursts of packets, 
"catching up" on sending that could have been spaced out and done earlier if 
the feedback from the receiver's window advancing were heeded.
 
That is, endpoint network stacks (TCP implementations) can worsen congestion by 
"dallying".  The ideal end-to-end flows occupying a congested router would have 
their packets paced so that the packets end up being sent in the least bursty 
manner that an application can support.  The effect of this pacing is to move 
the "backlog" for each flow quickly into the source node for that flow, which 
then provides back pressure on the application driving the flow, which 
ultimately is necessary to stanch congestion.  The ideal congestion control 
mechanism slows the sender part of the application to a pace that can go 
through the network without contributing to buffering.
 
Current network stacks (including Linux's) don't achieve that goal - their 
pushback on application sources is minimal - instead they accumulate buffering 
internal to the network implementation.  This contributes to end-to-end latency 
as well.  But if you think about it, this is almost as bad as switch-level 
bufferbloat in terms of degrading user experience.  The reason I say "almost" 
is that there are tools, rarely used in practice, that allow an application to 
specify that buffering should not build up in the network stack (in the kernel 
or wherever it is).  But the default is not to use those APIs, and to buffer 
way too much.
 
Remember, the network send stack can act similarly to a congested switch (it is 
a switch among all the user applications running on that node).  IF there is a 
heavy file transfer, the file transfer's buffering acts to increase latency for 
all other networked communications on that machine.
 
Traditionally this problem has been thought of only as a within-node fairness 
issue, but in fact it has a big effect on the switches in between source and 
destination due to the lack of dispersed pacing of the packets at the source - 
in other words, the current design does nothing to stem the "burst groups" from 
a single source mentioned above.
 
So we do need the source nodes to implement less "bursty" sending stacks.  This 
is especially true for multiplexed source nodes, such as web servers 
implementing thousands of flows.
 
A combination of codel-style switch-level buffer management and the stack at 
the sender being implemented to spread packets in a particular TCP flow out 
over time would improve things a lot.  To achieve best throughput, the optimal 
way to spread packets out on an end-to-end basis is to update the receive 
window (sending ACK) at the receive end as quickly as possible, and to respond 
to the updated receive window as quickly as possible when it increases.
 
Just like the "bufferbloat" issue, the problem is caused by applications like 
streaming video, file transfers and big web pages that the application 
programmer sees as not having a latency requirement within the flow, so the 
application programmer does not have an incentive to control pacing.  Thus the 
operating system has got to push back on the applications' flow somehow, so 
that the flow ends up paced once it enters the Internet itself.  So there's no 
real problem caused by large buffering in the network stack at the endpoint, as 
long as the stack's delivery to the Internet is paced by some mechanism, e.g. 
tight management of receive window control on an end-to-end basis.
 
I don't think this can be fixed by cerowrt, so this is out of place here.  It's 
partially ameliorated by cerowrt, if it aggressively drops packe

Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-25 Thread Dave Taht
On Sun, May 25, 2014 at 11:39 AM, Sebastian Moeller  wrote:
> Hi Dane,
>
>
> On May 25, 2014, at 08:17 , Dane Medic  wrote:
>
>> Is it true that devices with less than 64 MB can't handle QOS? -> 
>> https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html
>
> I think this means that the commotion developers think that 64MB are 
> required.

A dev thinks so. It aint true.

(I have a deployed mesh network of mostly nano and picostations and
they never crash due to being out of memory. Sadly, they do crash for
other reasons)

As you get below 64MB some compromises are needed, and as you try to
push more bits compromises are needed, and as you add more interfaces
and queues compromises are needed. If you are using up all your memory
running some application or another, instead of having it free for
packets, you are generally in more trouble than you want to be.

So the starting factor is how much free ram you have in normal
operation with your applications running. Start with that as a
baseline. (it isn't strictly true either, you can
discard (not swap out) a great deal of program text and still have
your applications run well)

You typically want something like sqm on your gateway ethernet
interface. If you have seriously limited free ram, just run a single
fq_codel instance as in simplest.qos.

As for the mesh backbone, well, there remains so much buffering
underneath in the ath9k wifi drivers that fq_codel only takes the edge
off a little.

We just disabled 802.11e completely (getting rid of 3 out of 4 queues
per interface), and my results have always been better for that, and I
hope to catagorize them this summer -
and it also saves on memory usage.

> But it does not sound like they have first hand experience so this is either 
> hearsay, or commotion's mesh networking is memory intensive. On the openwrt 
> side there seems to be no documentation of minimum RAM requirements. Doing a quick 
> back-of-the-envelop calculation here:

> openWRT qos has 4 tiers which run fq_codel in both directions so we have 8 
> fq_codel instances, with each fq_codel having a limit of 10240 packets, so 
> worst case we expect:
>
> 4 * 2 * 10240 = 81920 packets
>
> at 1500bytes this equals
>
> 4 * 2 * 10240 * 1500 / (1024 * 1024) = 117.1875 MB

Try to watch out for this sort of equivalence. Acks in the other
direction are 66 bytes. Arguably we should have specified fq_codel's
outside limit in bytes, not packets, and made it autotune
to some ratio around free memory.  And NONE of this memory is
pre-allocated... more on that in a paragraph
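
A quick sketch of how much the 66-byte acks change Sebastian's worst case
(python; illustrative arithmetic only, using 4 tiers and the 800-packet limit):

tiers, limit = 4, 800
mtu, ack = 1500, 66

both_directions_mtu = tiers * 2 * limit * mtu / 2**20
# total if one direction is full-size packets and the other is mostly acks
one_direction_acks = tiers * limit * (mtu + ack) / 2**20
print(f"all 1500-byte packets:     {both_directions_mtu:.1f} MB")   # ~9.2 MB
print(f"one direction mostly acks: {one_direction_acks:.1f} MB")    # ~4.8 MB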

> this indeed is a bit heavy on a 32MB router, but honestly 64MB will not 
> really help you. Then again current openwrt has a limit of 800 instead of 
> 10240 so we end up at a worst case of:

It's the extra SSIDs that hurt most these days, followed by the queues
"needed" for 802.11e, but wait a paragraph...

> 4 * 2 * 800 * 1500 / (1024 * 1024) = 9.1552734375 MB
>
> which should still be possible with 32MB. (Note that typically fq_codel does 
> not fill its queues up to limit, but it still would be bad if a router can 
> easily be DOSed into OOM and rebooting…)

The principal reason for the limit is to avoid a DOS on memory-limited
routers. Otherwise I'd be perfectly happy if we could run it
at the defaults. And the limit should go up some
as we get closer to pushing gigE speeds (presently the router can only
forward at about 330mbit). I am really hating seeing people cut/paste
the limit into newer code without realizing that it was just there to
keep a 32MB box from crashing under a dos...

(and I note that if you run out of memory to service packets, the
odds are very good your box won't crash anyway - but this exercises
code paths that are rarely touched.)

>
> (For current cerowrt with simple.qos the worst case is:
> (1001 * 4 + 1000 * 13 + 800 * 12) * 1500 / (1024 * 1024) = 38.0573272705 MB
>
> yet this still works quite well on a 64MB device (only 4 of these queues are 
> connected to the WAN interface though)

3 up, 3 down, actually in simple.qos, 1 up, 1 down in simple.

And each wifi SSID consumes 4 queues (although right now, 3 are unused)

I want to make really clear: the FIXED overhead of fq_codel is
something like 100+ 64 bytes*flows (usually 1024) - so each fq_codel
instance eats 64K (not M!) of data to run.


The limit just keeps packet data (don't quote me, something like 200
bytes + packet size) under control, and usually only builds up on the
bottleneck device, not anywhere else.
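
Putting those two figures side by side (python; rough numbers pulled from the
description above, illustrative only):

flows, per_flow = 1024, 64        # per-flow bookkeeping
limit, per_pkt, mtu = 800, 200, 1500

print(flows * per_flow // 1024, "KB of fixed state per fq_codel instance")
print(round(limit * (mtu + per_pkt) / 2**20, 2),
      "MB of worst-case packet data, and only at the bottleneck")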

So you might have a bottleneck on your wifi, someone dos-ing you
maybe, but the ethernet interface will be running at only a few
packets outstanding at anytime.



> One of the bigger issues with devices with small RAM is that often they have 
> relatively weak CPUs and I seem to recall that cerowrt tops out around 60 to 
> 70 Mbit/sec (total for ingress and egress) due to its shaping performance.

Yes. Doing considerably


>
>
> So unless you want to run commotion you might wan

Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-25 Thread Sebastian Moeller
Hi Dane,


On May 25, 2014, at 08:17 , Dane Medic  wrote:

> Is it true that devices with less than 64 MB can't handle QOS? -> 
> https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html

I think this means that the commotion developers think that 64MB are 
required. But it does not sound like they have first hand experience so this is 
either hearsay, or commotion's mesh networking is memory intensive. On the 
openwrt side there seems to be no documentation of minimum RAM requirements. Doing a 
quick back-of-the-envelop calculation here:
openWRT qos has 4 tiers which run fq_codel in both directions so we have 8 
fq_codel instances, with each fq_codel having a limit of 10240 packets, so 
worst case we expect:

4 * 2 * 10240 = 81920 packets

at 1500bytes this equals

4 * 2 * 10240 * 1500 / (1024 * 1024) = 117.1875 MB

this indeed is a bit heavy on a 32MB router, but honestly 64MB will not really 
help you. Then again current openwrt has a limit of 800 instead of 10240 so we 
end up at a worst case of:

4 * 2 * 800 * 1500 / (1024 * 1024) = 9.1552734375 MB

which should still be possible with 32MB. (Note that typically fq_codel does 
not fill its queues up to limit, but it still would be bad if a router can 
easily be DOSed into OOM and rebooting…)


(For current cerowrt with simple.qos the worst case is:
(1001 * 4 + 1000 * 13 + 800 * 12) * 1500 / (1024 * 1024) = 38.0573272705 MB

yet this still works quite well on a 64MB device (only 4 of these queues are 
connected to the WAN interface though)

One of the bigger issues with devices with small RAM is that often they have 
relatively weak CPUs and I seem to recall that cerowrt tops out around 60 to 70 
Mbit/sec (total for ingress and egress) due to its shaping performance.


So unless you want to run commotion you might want to ask on the openwrt list…

Best Regards
Sebastian

 

> ___
> Cerowrt-devel mailing list
> Cerowrt-devel@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/cerowrt-devel

___
Cerowrt-devel mailing list
Cerowrt-devel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cerowrt-devel


Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-25 Thread Mikael Abrahamsson

On Sun, 25 May 2014, Dane Medic wrote:


Is it true that devices with less than 64 MB can't handle QOS? ->
https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html


At gig speeds you need around 50ms worth of buffering. 1 gigabit/s = 
125 megabyte/s meaning for 50ms you need 6.25 megabyte of buffer.


I also don't see why performance and memory size would be relevant, I'd 
say forwarding performance has more to do with CPU speed than anything 
else.


--
Mikael Abrahamsson    email: swm...@swm.pp.se
___
Cerowrt-devel mailing list
Cerowrt-devel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cerowrt-devel


Re: [Cerowrt-devel] Ubiquiti QOS

2014-05-25 Thread Valdis . Kletnieks
On Sun, 25 May 2014 08:17:47 +0200, Dane Medic said:

> Is it true that devices with less than 64 MB can't handle QOS? ->
> https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html

I'm not going to give one post on a list very much credence, especially when
it doesn't contain a single actual fact or definitive claim.  An explanation
of exactly which data structure won't fit in 32M would be ideal.  Even some
numbers on RAM usage from /proc/slabinfo and a "who ate all the frobozz slabs?"
would be better than what's in the post.



___
Cerowrt-devel mailing list
Cerowrt-devel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cerowrt-devel


[Cerowrt-devel] Ubiquiti QOS

2014-05-24 Thread Dane Medic
Is it true that devices with less than 64 MB can't handle QOS? ->
https://lists.chambana.net/pipermail/commotion-dev/2014-May/001816.html
___
Cerowrt-devel mailing list
Cerowrt-devel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cerowrt-devel