On Mon, 13 Apr 2015, Bob Briscoe wrote:
David,
Returning from a fortnight offlist...
I think your conception of how ECN works is incorrect. You describe ECN as if
the AQM marks one packet when it drops another packet. You say that the
ECN-mark speeds up the retransmission of the dropped packet. On the contrary,
the idea of classic ECN [RFC3168] is that the ECN marks replace the drops. In
all known testing (except pathological cases), classic ECN effectively
eliminates drops for all ECN-capable packets.
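The marks-replace-drops behaviour of classic ECN can be sketched as a toy AQM decision function (this is an illustration of the RFC 3168 rule, not any real qdisc's code; the names and probability model are mine):

```python
import random

ECT = "ECT"          # ECN-capable transport (ECT(0)/ECT(1))
NOT_ECT = "Not-ECT"
CE = "CE"            # Congestion Experienced

def aqm_action(ecn_field, mark_prob):
    """RFC 3168-style AQM decision (illustrative sketch).

    When the AQM's marking/dropping decision fires, an ECN-capable
    packet is CE-marked and forwarded; a non-ECN packet is dropped.
    The mark replaces the drop -- it does not accompany one.
    """
    if random.random() >= mark_prob:
        return "forward", ecn_field   # decision did not fire
    if ecn_field == ECT:
        return "forward", CE          # mark instead of dropping
    return "drop", None
```

So for the same congestion level, the ECT packet is delivered (carrying the signal) where the Not-ECT packet is lost.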
That's what I thought, and if that were the case, then marking packets as ECN-capable would give them an advantage over non-ECN packets (by not getting dropped, and so getting a higher share of bandwidth).
That's what the gaming-ECN thread was about, and if I understood the responses, I was being told that marking packets as ECN-capable, but not slowing down (actually responding to ECN), would not let an application gain any advantage, because the packets would just end up getting dropped anyway, since marking and dropping happen at the same level, even on ECN-capable flows.
If the packets are just marked, but not dropped, then the ECN-capable flows will occupy a disproportionate share of the available buffer space, since they just get marked instead of dropped.
David Lang
Nonetheless, I do agree with your sentiment that the perfect is the enemy of
the good. We can remove most of the really bloat-induced latency without ECN.
So the message must be clear: Deploy AQM now. No need to wait for ECN. But
implementations SHOULD allow ECN packets to be classified into a separately
configurable instance of the AQM algo. {Note 1}
Similarly, this WG made sure we did not deprecate RED in the AQM
recommendations. Because, in existing equipment, even a poorly tuned RED is
usually much better than a bloated buffer with no AQM.
Just as the WG mustn't confuse messages, you mustn't get confused by a
discussion about the potential for more ambitious reductions in latency. Koen
started the thread with reference to our presentation in the ICCRG in the
IRTF, where 'R' in both cases stands for Research.
And I believe it is valid for the ECN benefits draft (in the IETF AQM WG) to
point to the potential of ECN, by using insights from research in progress.
There are a number of different contributions to unnecessary latency, not
just the one popularised as bufferbloat:
[In the following, I will use the term 'short message(s)' as shorthand for
either a short interactive flow or a long flow consisting of short
interactive messages (like in a game), or video frames or voice datagrams or
anything where the perceived latency depends on the latency of each
'message', especially a string of messages with serial dependency as is
typical on the Web.]
#1 The well-known bufferbloat problem, where a long-running flow filling a
bloated buffer delays short messages.
- AQM and/or flow queuing can remove this delay, without needing ECN.
#2 A loss causes head-of-line blocking for short message(s) while waiting for
the retransmission.
- AQM without ECN cannot remove this delay.
- Flow queuing cannot remove this delay, if the losses are self-induced.
- FEC can remove this delay without needing ECN, but the increased
redundancy is equivalent to poorer utilisation (although only the short flows
need the redundancy).
- ECN can remove this delay.
#3 A loss near the end of a short flow can lead to multi-RTT delay.
- AQM without ECN cannot remove this delay.
- Techniques like tail-loss probe and RTO-restart can mitigate this delay,
without ECN, but not remove it.
- ECN can remove this delay.
#4 The Reno/Cubic sawtooth causes variation in delay between 1 and 2 base
RTTs. This can affect short messages.
- AQM without ECN cannot remove this delay unless configured to sacrifice
utilisation.
- Flow queuing removes this delay variation if caused by a separate
long-running flow.{Note 2}
- A change to the TCP algo (e.g. DCTCP) can remove this delay variation.
Smaller sawteeth imply a much higher signalling rate, which in turn requires
ECN, otherwise drop probability would be excessive. (This was the main point
Koen was making.)
- Therefore, ECN can remove this delay.
#5 Slow-starts{Note 3} cause spikes in delay.
- AQM without ECN cannot remove this delay, and typically AQM is designed
to allow such bursts of delay in the hope they will disappear of their own
accord.
- Flow queuing can remove the effect of these delay bursts on other flows,
but only if it gives all flows a separate queue from the start.{Note 2}
- Delay-based softening of slow-start, such as Hybrid Slow-Start in Linux,
can mitigate these variations, but with increased risk of coming out of SS
early, causing significantly longer completion time.
- ECN with AQM based on the instantaneous queue limits this delay, without
the risk of longer completion time.
#6 Slow-starts{Note 3} can cause runs of losses, which in turn cause delays.
- AQM without ECN cannot remove these delays.
- Flow queuing cannot remove these losses, if self-induced.
- Delay-based SS like HSS can mitigate these losses, with increased risk of
longer completion time.
- ECN can remove these losses, and the consequent delays.
Summary:
* AQM alone solves the main problem
* Flow queuing solves or mitigates most of the remaining secondary problems.
* ECN has the potential to solve all the remaining secondary problems
(pending further research to prove some of them).
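The smaller-sawtooth point in #4 can be made concrete with DCTCP's published control law (RFC 8257): the sender keeps a moving average alpha of the fraction of CE-marked packets and reduces its window in proportion to it, rather than halving. A minimal per-RTT sketch (variable names and the float window are mine, for illustration):

```python
def dctcp_update(alpha, cwnd, acked, marked, g=1/16):
    """One per-RTT DCTCP update (sketch of the RFC 8257 response).

    alpha : moving average of the fraction of CE-marked packets
    F     : fraction marked in the window just acknowledged
    The window shrinks in proportion to alpha, not by half, which
    is what permits the shallow, high-frequency sawtooth -- and
    that high marking rate is only tolerable with ECN, not drop.
    """
    F = marked / acked if acked else 0.0
    alpha = (1 - g) * alpha + g * F
    cwnd = max(1.0, cwnd * (1 - alpha / 2))
    return alpha, cwnd
```

With no marks the window is untouched; with every packet marked the response converges to the classic halving.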
Whether flow queuing is applicable depends on the scale. The work I'm doing
with Koen is to reduce the cost of the queuing mechanisms on our BNGs
(broadband network gateways). We're trying to reduce the cost of per-customer
queuing at scale, so per-flow queuing is simply out of the question. Whereas
ECN requires no more processing than drop.
ECN has potential cheating problems, but we have per-customer queues anyway.
Using flow as the unit of allocation also has its own problems, with no
proposed solutions.
Bob
{Note 1}: Your general point is that the perfect can be the enemy of the
good. Here's the presentation I gave in TSVAREA straight after VJ presented
CoDel in 2012 entitled "DCTCP & CoDel; the Best is the Friend of the Good."
<http://www.bobbriscoe.net/presents/1207ietf/1207-tsvarea-dctcp.pdf>
{Note 2}: A lone flow can cause this delay variation to itself, but that's
irrelevant because, if the delay were not in the network it would be at the
sender.
{Note 3}: Delay and loss spikes can equivalently be caused when Cubic's
window rises to seek out newly available capacity after another flow finishes
or the link rate varies.
At 05:16 30/03/2015, David Lang wrote:
On Sat, 28 Mar 2015, Scheffenegger, Richard wrote:
David,
Perhaps you would care to provide some text to address the misconception
that you pointed out? (Waiting for a 100% fix makes a 90% fix appear much
less appealing, while the current state of the art is at 0%.)
Ok, you put me on the spot :-) Here goes.
If you think that aqm-recommendations is not strongly enough worded, I
think this particular discussion (to AQM or not) really belongs there. The
other document (ecn-benefits) has a different target in arguing for going
for those last 10%...
So here is my "elevator pitch" on the problem. Feel free to take anything I
say here for any purpose, and I'm sure I'll get corrected for anything I am
wrong on.
Problem statement: Transmit buffers are needed to keep the network layer
fully utilized, but excessive buffers result in poor latency for all
traffic. This latency is frequently bad enough to cause some types of
traffic to fail entirely.
<link to more background goes here, including how separate benchmarks for
throughput and latency have misled people, "packet loss considered evil",
cheaper memory encouraging larger buffers, etc. Include tests like
netperf-wrapper and ping latency while under load, etc. Include examples
where buffers have resulted in latencies so long that packets are
retransmitted before the first copy gets to the destination>
Traditionally, transmit buffers have been sized to hold a fixed number of
packets. Due to the variation in packet sizes, it is impossible to tune
this value to keep the link fully utilized when small packets dominate
the traffic without making the queue large enough to cause latency
problems when large packets dominate the traffic.
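The arithmetic behind that tension is simple; a quick worked example (the 1000-packet limit, packet sizes, and 100 Mb/s rate are illustrative numbers of my choosing):

```python
def queue_delay_ms(packets, pkt_bytes, link_mbps):
    """Worst-case drain time of a full packet-count-limited queue."""
    bits = packets * pkt_bytes * 8
    return bits / (link_mbps * 1e6) * 1e3

# A 1000-packet queue on a 100 Mb/s link:
small = queue_delay_ms(1000, 64, 100)     # full of 64-byte packets
large = queue_delay_ms(1000, 1500, 100)   # full of 1500-byte packets
```

The same packet-count limit yields roughly 5 ms of buffering with minimum-size packets but about 120 ms with full-size packets: no single count is right for both.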
Shifting to byte queue limits, where queues are allowed to hold a variable
number of packets depending on how large they are, makes it possible to
manually tune the transmit buffer size to get good latency under all
traffic conditions at a given speed. However, this step forward revealed
two additional problems.
1. Whenever the data rate changes, this value needs to be manually changed
(multi-link paths lose a link, noise degrades the maximum throughput on a
link, etc.)
2. High-volume flows (i.e. bulk downloads) can starve other flows (DNS
lookups, VoIP, gaming, etc.). This happens because space in the queue is
allocated on a first-come-first-served basis, so the high-volume traffic
fills the queue (at which point it starts to be dropped), but all other
traffic that tries to arrive is also dropped. It turns out that these light
flows tend to have a larger effect on the user experience than heavier
flows, because things tend to be serialized behind the lighter flows (a DNS
lookup before doing a large download, retrieving a small HTML page to find
what additional resources need to be fetched to display a page), or the
user experience is directly affected by light flows (gaming lag, VoIP
drops, etc.)
Active Queue Management addresses these problems by adapting the amount of
data that is buffered to match the data transmission capacity, and prevents
high volume flows from starving low-volume flows without the need to
implement QoS classifications.
<insert link about how you can't trust QoS tags that are made by other
organizations, ways that it can be abused, etc>
This is possible because AQM algorithms don't have to drop the new packet
that arrives; the algorithm can decide to drop a packet from one of the
heavy flows rather than from one of the lightweight flows.
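That drop-from-the-heavy-flow idea can be sketched with per-flow queues that, on overflow, shed a packet from the flow with the largest backlog (this is a toy illustration in the spirit of fq_codel's overflow behaviour, not its actual implementation; the class and its parameters are mine):

```python
from collections import defaultdict

class FlowQueues:
    """Per-flow queues with drop-from-longest-flow on overflow (sketch).

    Packets are kept in a separate queue per flow; when the shared
    limit is exceeded, the heaviest flow loses a packet, so light
    flows (DNS, VoIP, gaming) are not pushed out by bulk transfers.
    """
    def __init__(self, limit_pkts=1024):
        self.queues = defaultdict(list)
        self.limit = limit_pkts
        self.count = 0

    def enqueue(self, flow_id, pkt):
        self.queues[flow_id].append(pkt)
        self.count += 1
        if self.count > self.limit:
            # Penalise the heaviest flow, not the newest arrival.
            fat = max(self.queues, key=lambda f: len(self.queues[f]))
            self.queues[fat].pop(0)
            self.count -= 1
```

A bulk flow that fills the shared buffer is the one that pays for the overflow, while a lone DNS packet arriving afterwards still gets queued.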
<insert references to currently favored AQM options here, PIE, fq_codel,
cake, ???. Also links to failed approaches>
Turning on AQM on every bottleneck link makes the Internet usable for
everyone, no matter what sort of application they are using.
<insert link on how to deal with equipment you can't configure, by
throttling bandwidth before the bottleneck and/or doing ingress shaping of
traffic>
While AQM makes the network usable, there is still additional room for
improvement. While dropping packets does result in the TCP senders slowing
down, and eventually stabilizing at around the right speed to keep the link
fully utilized, the only way that senders have been able to detect problems
is to discover that they have not received an ack for the traffic within
the allowed time. This causes a 'bubble' in the flow, as the dropped packet
must be retransmitted (and sometimes a significant amount of data after the
dropped packet that did make it to the destination, but could not be acked
because of the missing packet).
This "bubble" in the data flow can be greatly compressed by configuring the
AQM algorithm to send an ECN signal to the sender when it drops a packet in
a flow. The sender can then adapt faster, slowing down its new data and
re-sending the dropped packet without having to wait for the timeout. This
has two major effects: by allowing the sender to retransmit the packet
sooner, the delay on the dropped data is not as long; and because the
replacement data can arrive before the timeout of the following packets,
they may not need to be re-sent. By configuring the AQM algorithm to send
the ECN notification to the sender only when the packet is being dropped,
the effect of a failure of the ECN signal to get through to the sender (the
notification packet runs into congestion and gets dropped, some network
device blocks it, etc.) is that the ECN-enabled case devolves to match the
non-ECN case, in that the sender will still detect the dropped packet via
the timeout waiting for the ack, as if ECN was not enabled.
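The two sender-side reactions being contrasted here can be sketched side by side (a simplified illustration of the RFC 3168/5681 responses; the function, its arguments, and the integer windows are mine):

```python
def on_feedback(cwnd, ssthresh, event):
    """Simplified TCP sender reactions (sketch).

    'ece'     : the receiver echoed a congestion mark -- the window is
                reduced, but nothing was lost, so nothing is retransmitted.
    'timeout' : no ack arrived in time -- retransmit and collapse the
                window; this fallback works with or without ECN.
    """
    if event == "ece":
        ssthresh = max(2, cwnd // 2)
        return ssthresh, ssthresh, None
    if event == "timeout":
        ssthresh = max(2, cwnd // 2)
        return 1, ssthresh, "retransmit"
    return cwnd, ssthresh, None
```

The timeout branch is the path every flow retains, which is why a lost or blocked ECN signal degrades gracefully to ordinary loss recovery.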
<insert link to possible problems that can happen here, including the
potential for an app to 'game' things if packets are marked at a different
level than when they are dropped.>
So: a very strong recommendation to enable Active Queue Management. While
the different algorithms have different advantages and levels of testing,
even the 'worst' of the set results in a night-and-day improvement in
usability compared to unmanaged buffers.
Enabling ECN at the same point as dropping packets as part of enabling any
AQM algorithm results in a noticeable improvement over the base algorithm
without ECN. When compared to the baseline, the improvement added by ECN is
tiny compared to the improvement from enabling AQM.
Is it fair to say that plain aqm vs aqm+ecn variation is on the same
order of difference as the differences between the different AQM
algorithms?
Future research items (which others here may already have done, and would
not be part of my 'elevator pitch')
I believe that currently ECN triggers the exact same slowdown that a missed
packet does, and it may be appropriate to have the sender do a less drastic
slowdown.
It would be very interesting to provide some way for the application sending
the traffic to detect dropped packets and ECN responses. For example, a
streaming media source (especially an interactive one like video
conferencing) could adjust the bitrate that it is sending.
David Lang
_______________________________________________
aqm mailing list
aqm@ietf.org
https://www.ietf.org/mailman/listinfo/aqm
________________________________________________________________
Bob Briscoe, BT