Since that study and test design were highly influential on the AQM requirements draft, I am going to publish here what my comments were at the time, with a couple of updates here and there....
I am not sure if any of the ns2 code from the last round made it out to the public?

* Executive summary

The Cablelabs AQM paper is the best simulation study to date of how the new bufferbloat-fighting AQMs interact with the common Internet traffic types found in the home. However, it is only a study of half the edge network. It assumes throughout that there are no excessive latencies to be had from the CMTS side. Field measurements show latencies in excess of 1.8 seconds from the CMTS side, 300ms on Verizon gpon, and buffer sizes in the range of 64k to 512k on DSL in general, with some RED, SFQ, and SQF actually deployed there.

So... while the focus has been on what is perceived as the larger problem, the cable modems themselves, downstream behavior was not studied, and the entire simulation was set to "reasonable" values for ns2 modelers - values not seen in the real world. In the real world (RW), flows are almost always bidirectional. What happens on the downstream side affects the upstream side and vice versa, as per Van Jacobson's "fountain" analogy. Correctly compensating for bidirectional TCP dynamics is incredibly important.

The second largest problem with the original cablelabs study is that it only analyzed traffic at one specific (although common) setting for cable operators: 20Mbits down and 5Mbits up. A common lower setting should be analyzed, as well as more premier services. Some tweaking of the codel-derived technologies (flows and quantum) and of pie (alpha and beta) is indicated at both lower and higher bandwidths for optimum results.

Additionally, the effects of classification, notably of background traffic, have not been explored. There are numerous other difficulties in the simulations and models that need to be understood in order to make good decisions moving forward. This document goes into more detail on those later.

All the AQMs tested performed vastly better than standard FIFO drop tail, as well as buffercontrol.
They all require minimal configuration to work. With some configuration they can be made to work better.

* Recommendations I'd made at the time

** Study be repeated using at least two more bandwidth settings
** More exact emulation of current CMTS behavior, based on real world measurements
** Addition of more traffic types, notably VPN and videoconferencing
** Improvements to the VOIP and web models
** Continued attempts at getting real world and simulated benchmarks to "line up"

My approach has been to follow the simulation work, try to devise real world benchmarks that are similar, and feed the results back into the ongoing simulation process. There are multiple limitations in this method, too, notably getting repeatable results and doing large scale tests on customer equipment, both of which are subject to heisenbugs.

* Issues in the cablelabs study

** Downstream behavior

Tests with actual cablemodems in actual configurations show a significant amount of buffering on the downstream. At 20Mbits, DS buffering well in excess of 1 second has been observed. The effect of excessive buffering on this side has not been explored in these tests. Certain behaviors - TCP's "burstiness" as it opens its window to account for what it thinks is a long path - reflect interestingly on congestion avoidance on the downstream, and the effects on the upstream side of the pair are interesting too.

I note that my own RW statistics were often very skewed by some very bad ack behavior on TSO offloads that had been a bug in Linux for years and was recently fixed.

** Web model

*** The web model does not emulate DNS lookups

Caching DNS forwarders are typically located on a gateway box (not sure about cablemodems ??), and the ISP locates a full DNS server nearby (within 10ms RTT). DNS traffic is particularly sensitive to delay, loss, and head of line blocking, and slowed DNS traffic stalls subsequent tcp connections, on sharded web traffic in particular.
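To illustrate why the missing DNS and https steps matter, here is a rough, back-of-the-envelope round-trip counter (entirely my own sketch, not anything from the study; the function name and the classic two-round-trip TLS handshake figure are illustrative assumptions):

```python
# Illustrative RTT accounting for fetching one small object on a fresh
# connection from a sharded site. Not from the study; the constants are
# assumptions (classic TLS full handshake = 2 extra round trips).

def fetch_rtts(dns_cached: bool, https: bool) -> int:
    """Rough count of round trips before the first response byte."""
    rtts = 0
    if not dns_cached:
        rtts += 1      # DNS lookup to a nearby resolver
    rtts += 1          # TCP three-way handshake
    if https:
        rtts += 2      # classic TLS handshake
    rtts += 1          # the HTTP GET itself
    return rtts

# Plain http, cold DNS cache: 3 RTTs before any body arrives.
print(fetch_rtts(dns_cached=False, https=False))  # 3
# https with a cold DNS cache: 5 RTTs.
print(fetch_rtts(dns_cached=False, https=True))   # 5
```

At a 10ms RTT to the resolver and a longer path to the server, those extra round trips dominate small-object fetch time, which is exactly the behavior a web model without DNS or https cannot show.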
*** The web model does no caching

A fairly large percentage (though not high enough) of websites make use of various forms of caching, ranging from marking whole objects as cacheable for a certain amount of time, to using the etags method to provide a checksum-like value for a conditional get request. The former method eliminates an RTT entirely; the latter works well inside an http 1.1 pipeline.

*** The web model does not use https

Establishing a secure http connection requires additional round trips.

*** The web model doesn't emulate tons of tabs

Web users, already highly interactive, now tend to have tons of tabs open, each on an individual web site, many of which are doing some sort of polling or interaction in the background against the remote web server. These benchmarks do not emulate this highly common behavior.

** TCP Cubic in ns2 is not the same as modern TCP cubic in Linux

Showing that these new AQMs work correctly with modern TCPs is essential.

** VOIP model

The VOIP model measures *one-way* egress delays only, and does not track the burstiness of packet loss or the burstiness of jitter. It uses an archaic codec emulation, as well. Something like opus should be looked at.

** The gaming model

Treats gaming as equivalent to VOIP, with the same problems as the VOIP model. Gaming traffic is bidirectional, even more so than voip. Additionally, some of the traces I have access to use a 15ms, rather than 20ms, period to emit packets. (I have collected a vast amount of gaming traffic data that I have not yet had time to analyze.)

** Videoconferencing was not studied.

** Bittorrent model

The torrent model is flawed in multiple ways. Notably, it only tests phase III of the typical torrent cycle, not the download or down/up phases...

**** The download saturation problem

Torrent can and WILL saturate a downlink and not respond to congestion indicators until over 100ms of delay is observed. Most clients do have a ratelimit set for download.
It is often turned off after-hours.

**** The upload saturation problem shown in the study

Bittorrent clients have evolved to where, out of the box, there is a very low rate limit set, typically in the range of 50-150KBytes/sec. This makes bittorrent uploads a non-problem for most people. Still, benchmarking each of these phases would be worthwhile. Torrent can be fixed.

*** LEDBAT != Bittorrent

LEDBAT as defined and uTP as deployed remain significantly different. No implementation of bittorrent I've looked at (utorrent and transmission) behaves anything like the LEDBAT modules I have.

*** TCP-LEDBAT kernel module is buggy

In my RW tests this Linux congestion control module never gets out of SSTHRESH and into congestion avoidance. The behavior LOOKS like bittorrent (look! it's scavenging), when in reality it's merely stuck at a low rate. However, it is possible this module works correctly under older linuxes under ns2.

The RW behavior of bittorrent under RED and SFQ was explored in [[YIXI2012]]; under those two AQM systems, formerly scavenging flows are reprioritized to have roughly the same weight as non-scavenging flows until 100ms of delay is incurred.

The simulated LEDBAT results do show that extraordinarily high numbers of persistent high rate flows interact badly with fq_codel at this 20/4 bandwidth setting. Using another TCP that is not buggy will probably be even worse. However, bittorrent is a very special case. Dozens of full rate flows are extremely rare in the real world. (RW benchmarks show that even with fairly large numbers of flows (50+), fq_codel still does quite well.)

** IP address hashing

ns2 does not have support for a full 5 tuple including the protocol (e.g. TCP, UDP, ARP, etc). This makes hashing multiple protocols together problematic, and I'm unsure if this was compensated for correctly in all the models.
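To make the 5-tuple point concrete, here is a small sketch (mine, not from the study or any fq_codel source; the sha1 hash, queue count, and function name are illustrative choices) showing what happens when the protocol field is missing from the flow hash:

```python
# Sketch of flow-to-queue hashing with and without the protocol in the
# tuple. The hash function and queue count are illustrative; real
# fq_codel uses a jhash over the 5-tuple, not sha1.
import hashlib

NQUEUES = 1024  # fq_codel's default number of flow queues

def queue_for(src, dst, sport, dport, proto=None):
    """Hash a flow tuple to a queue index; proto=None emulates a
    protocol-less classifier such as ns2's."""
    key = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:4], "big") % NQUEUES

# A TCP flow and a UDP flow between the same addresses and ports:
tcp_q = queue_for("10.0.0.1", "10.0.0.2", 5000, 80, "tcp")
udp_q = queue_for("10.0.0.1", "10.0.0.2", 5000, 80, "udp")
print(tcp_q, udp_q)  # with the protocol hashed in, these usually differ

# Without the protocol, the two flows ALWAYS share one queue, so a
# bulk UDP flow and a TCP flow get lumped together:
assert queue_for("10.0.0.1", "10.0.0.2", 5000, 80) == \
       queue_for("10.0.0.1", "10.0.0.2", 5000, 80)
```

The failure mode is the second case: any model that drops the protocol from the tuple systematically merges flows that a real fq_codel would keep apart.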
** Fq_codel and PIE configuration with 10ms target

While I buy into the idea (kathie doesn't) that the delay target variable needs to be greater than the cable MAC media acquisition time (in this model), it does not need to be set to 10ms. 6.X should be sufficient to avoid MAC acquisition artifacts. There is a quadratic response time to delay in TCP, so doubling the default delay target from 5ms to 10ms results in much fuller queues. (Lest I be mis-interpreted, the quadratic behavior is end to end, and I honestly don't know what differences we'd see between a 6.5ms target and a 10ms target.)

PIE has similar constraints (but apparently ran fine on the cable MAC), so a target delay of 6.X or 7 being tried for both would be interesting. I note that (2014) pie has now grown a target of 20ms in the Linux implementation, and a closer tie with the htb rate shaper in the as-yet-unpublished cablelabs model.

At very, very low bandwidths (<4mbit) we have found it desirable to increase the fq_codel target to account for a single MTU at that bandwidth.

The fq_codel interval estimation window in the cablelabs testing was set to 150ms, rather than the default 100ms. I note that most experimental variants of codel fiddle with the "drop resumption" portion of the algorithm, which is very sensitive to the interval. It's an area of research... The estimation window in pie has grown to 10k bytes.

* Rate limiters have a cost

At higher rates, more fq_codel queues are of use, and the RW tests point to the rate limiter being the principal CPU hog and source of problems; the drop/mark/scheduler algorithms hardly enter into it. The PIE, codel, and fq_codel algorithms barely show up on a trace....

** Back to back packet drop

Gaming and VOIP traffic tolerate single, random packet drops with aplomb. It's bursty packet loss and sudden delays that cause audible artifacts and gaming misbehavior.
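The low-bandwidth target adjustment mentioned in the fq_codel configuration section above amounts to simple arithmetic: below roughly 4mbit, one full-size packet takes longer to serialize than codel's 5ms default target, so the target must grow to at least one MTU-time. A sketch (my own; the function name and constants are illustrative):

```python
# Sketch of the "raise the target to one MTU at low bandwidths" rule
# described above. Constants are illustrative assumptions, not values
# from the cablelabs study.

MTU_BYTES = 1500
DEFAULT_TARGET_MS = 5.0  # codel's stock target

def fq_codel_target_ms(link_mbit: float) -> float:
    """Target delay: the default, or one MTU's serialization time,
    whichever is larger."""
    mtu_time_ms = (MTU_BYTES * 8) / (link_mbit * 1e6) * 1e3
    return max(DEFAULT_TARGET_MS, mtu_time_ms)

print(fq_codel_target_ms(20.0))  # 5.0  -- default is fine at 20mbit
print(fq_codel_target_ms(4.0))   # 5.0  -- one MTU = 3ms, still under target
print(fq_codel_target_ms(1.0))   # 12.0 -- one MTU takes 12ms at 1mbit
```

Below the crossover point (a 1500-byte MTU at 2.4mbit takes exactly 5ms) the default target is physically unattainable for a queue holding even one packet, which is why the adjustment is needed.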
You can hear 3 packets lost in a row (which is no different from a sudden delay of 60ms on a stream). So in addition to the packet loss figure, good measurements would be a bursty packet loss graph and a bursty delay graph. Hopefully in the aqm cases these follow a random distribution, but... (Similarly, many tcps respond to bursty packet loss in a drastic fashion, but do not react much to bursty delay.)

Another problem with VOIP is "creeping delay", where a voip queue builds and builds and then delivers or drops a full boatload of packets to catch up. I have experienced this on multiple wifi based voip sessions where I ended up with seconds of delay on the line over time...

** ns2 issues

Many of the problems in this test series are due to using an obsolete and undermaintained network model system (ns2). Alternatives exist that are better (ns3, mininet, etc) in multiple respects. While ns2 can certainly be improved, the cost/benefit ratio of using ns3 seems better. (There is no harm in using other technologies as a cross-check, too.)

*** ns2 doesn't support many modern networking features, like ECN.
*** ns2 doesn't have TOS values (so it can't do diffserv).
*** ns2 doesn't have port numbers. fq_codel uses a 5 tuple of source and destination ports, protocol, and source and destination ips.

* Statistical notes

Analyzing network queue and network traffic behavior does not lend itself to many means of statistical analysis. In particular, throwing out the upper or lower percentiles of most results is a bad practice - with real time systems, it's the outliers that are interesting and important.

This paper uses CDF plots throughout, which is a fine way to measure the full range of results; however, the use of a log scale is difficult on the unpracticed eye. In its summary form, the paper uses a method of averaging together the results of each separate subtest, and the weighting is unclear. This in particular gives the bittorrent result a lot more weight than one would expect.
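The bursty-loss graph suggested in the back to back packet drop section above is cheap to compute from a per-packet trace: just collect the lengths of consecutive-loss runs. A minimal sketch (my own; the function name is made up):

```python
# Sketch of the "bursty packet loss" measurement proposed above:
# reduce a per-packet loss trace to the lengths of consecutive-loss
# runs. Three isolated drops are inaudible on VOIP; one run of three
# is not.

def loss_bursts(trace):
    """Given a per-packet trace (truthy = lost), return the length of
    each run of consecutive losses."""
    bursts, run = [], 0
    for lost in trace:
        if lost:
            run += 1
        elif run:
            bursts.append(run)
            run = 0
    if run:                 # trace may end mid-burst
        bursts.append(run)
    return bursts

# Two isolated drops plus one three-packet burst:
trace = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]
print(loss_bursts(trace))  # [1, 1, 3]
```

A histogram or CDF of these run lengths (and the analogous runs of above-threshold delay) would show whether an AQM's drops stay randomly scattered or clump into the audible bursts described above.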
NOTE: RW = real world

* Overall recommendations...

** Continue developing better models!

** Find a cable vendor willing to do the exploratory work

It took me 24 hours to port the code from ns2 to Linux. Analysis of several other operating systems indicates that a week, in conjunction with a local expert, would be enough to make this code work on most other network OSes.

** Better models of media acquisition (cable, wifi, lte), aggregation, or scheduling characteristics (wifi, lte, token ring) dissimilar to ethernet.

* Nits

** fq_pie

Given the probabilistic dropping technique in PIE, it too can benefit from flow queuing, fair queueing, or weighted fair queuing techniques inserted before it in the chain, with only one PIE queue needed to do so. (This is unlike fq_codel, which, due to the isochronous nature of the timestamping, presently has to have one codel queue per flow queue in order to work.) The prospect of fq + pie seems quite promising.

* Packet Trains

* Next steps in the RW cerowrt tests

** pie and fq_pie improvements

Cerowrt gained pie support back in august or so and has tracked revisions 1-4. Codel has had some tiny tweaks as well.

** Several new codel and fq_codel models

Based on the cablelabs testing, it is evident that the number of queues could be made dependent on the available bandwidth, and this can be made automatic. There are also a few tweaks that can be made to pie and codel (ecn handling, notably). It also seems possible to improve fq_codel's behavior by adopting slightly different strategies for nearly empty queues under load. Like (original) PIE, favoring dropping big packets more, particularly when a queue is at 1 packet and the system is experiencing large delays, seems a plausible way to handle the bittorrent problem while not affecting most other traffic types.

** classification

Under test are several shapers that use limited amounts of classification. In the RW, at the campground test site, over 52% of all packets are marked CS1 on egress.
Applications are obviously trying to deprioritize themselves; it seems logical to try to support that.

* Side notes

It should be clear that these AQMs and packet schedulers can apply not only to edge networks, but anywhere there is a fast to slow transition on a network. This includes within a machine itself! It may well be easier to apply these technologies to load balancers, high end Linux based routers, and vm root hardware, piecemeal, far faster than they can be deployed en masse across the customer edge network. Linux based servers can also benefit, today. Deploying these AQM technologies at any scale will help gain needed operational experience with them before they have to be burned into hard-to-update edge customer network hardware.

--
Dave Täht

Fixing bufferbloat with cerowrt: http://www.teklibre.com/cerowrt/subscribe.html

_______________________________________________
aqm mailing list
aqm@ietf.org
https://www.ietf.org/mailman/listinfo/aqm