Re: [Bloat] The Dark Problem with AQM in the Internet?

Jonathan Morton Thu, 28 Aug 2014 07:09:14 -0700

On 28 Aug, 2014, at 4:19 pm, Jerry Jongerius wrote:

> AQM is a great solution for bufferbloat.  End of story.  But if you want to 
> track down which device in the network intentionally dropped a packet (when 
> many devices in the network path will be running AQM), how are you going to 
> do that?  Or how do you propose to do that?


We don't plan to do that.  Not from the outside.  Frankly, we can't reliably 
tell which routers drop packets today, when AQM is not at all widely deployed, 
so that's no great loss.

But if ECN finally gets deployed, AQM can set the Congestion Experienced flag 
instead of dropping packets, most of the time.  You still don't get to see 
which router did it, but the packet still gets through and the TCP session 
knows what to do about it.
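
To make the mark-instead-of-drop idea concrete, here is a minimal sketch of the decision an ECN-aware AQM makes once it has decided to signal congestion.  The codepoint values are the ones defined in RFC 3168; the function name aqm_signal and its return convention are made up for illustration and do not come from any real router implementation.

```python
from typing import Tuple

# ECN codepoints live in the two low-order bits of the IP TOS /
# Traffic Class byte (RFC 3168).
NOT_ECT = 0b00   # sender did not negotiate ECN
ECT_1 = 0b01     # ECN-capable transport, codepoint 1
ECT_0 = 0b10     # ECN-capable transport, codepoint 0
CE = 0b11        # Congestion Experienced
ECN_MASK = 0b11


def aqm_signal(tos: int) -> Tuple[str, int]:
    """Hypothetical helper: signal congestion on one packet.

    Returns ("mark", new_tos) if the packet is ECN-capable, so it can be
    forwarded with CE set, or ("drop", tos) if dropping is the only signal
    the sender will understand.
    """
    if tos & ECN_MASK == NOT_ECT:
        return ("drop", tos)
    # Setting both ECN bits yields CE and leaves the DSCP bits untouched.
    return ("mark", tos | CE)
```

Either way the packet carries no record of which hop marked it, which is why the marking router stays invisible from the outside.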

> The graph presented is caused by the interaction of a single dropped packet, 
> bufferbloat, and the Westwood+ congestion control algorithm – and not power 
> boost.

This surprises me somewhat - Westwood+ is supposed to be deliberately tolerant 
of single packet losses, since it was designed explicitly to get around the 
problem of slight random loss on wireless networks.

I'd be surprised if, in fact, *only* one packet was lost.  The more usual case 
is of "burst loss", where several packets are lost in quick succession, and not 
necessarily consecutive packets.  This tends to happen repeatedly on dumb 
drop-tail queues, unless the buffer is so large that it accommodates the entire 
receive window (which, for modern OSes, is quite impressive in a dark sort of 
way).  Burst loss is characteristic of congestion, whereas random loss tends to 
lose isolated packets, so it would be much less surprising for Westwood+ to 
react to it.

The packets were lost in the first place because the queue became chock-full, 
probably at just about the exact moment when the PowerBoost allowance ran out 
and the bandwidth came down (which tends to cause the buffer to fill rapidly), 
so you get the worst-case scenario: the buffer at its fullest, and the 
bandwidth draining it at its minimum.  This maximises the time before your TCP 
gets to even notice the lost packet's nonexistence, during which the sender 
keeps the buffer full because it still thinks everything's fine.
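
As a rough illustration of why a full buffer at the lower rate is the worst case, the time a packet waits behind a full FIFO is simply the queued bytes divided by the drain rate.  The buffer size below is an assumption picked for the example, not a measurement of this ISP's equipment.

```python
def queue_delay_s(queued_bytes: float, drain_rate_bps: float) -> float:
    """Seconds needed to drain queued_bytes at drain_rate_bps (bits/s)."""
    return queued_bytes * 8 / drain_rate_bps


if __name__ == "__main__":
    buffer_bytes = 1_000_000  # assumed 1 MB bottleneck buffer
    for rate in (100e6, 45e6):  # assumed boosted rate, then 45 Mbit/s
        ms = queue_delay_s(buffer_bytes, rate) * 1000
        print(f"{rate / 1e6:.0f} Mbit/s: {ms:.0f} ms of queueing delay")
```

The same buffer holds more than twice as much delay once the rate drops, and every loss signal has to wait that long before the sender hears about it.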

What is probably happening is that the bottleneck queue, being so large, delays 
the retransmission of the lost packet until the Retransmit Timer expires.  This 
will cause Reno-family TCPs to revert to slow-start, assuming (rightly in this 
case) that the characteristics of the channel have changed.  You can see that 
it takes most of the first second for the sender to ramp up to full speed, and 
nearly as long to ramp back up to the reduced speed, both of which are 
characteristic of slow-start at WAN latencies.  NB: during slow-start, the 
buffer remains empty as long as the incoming data rate is less than the output 
capacity, so latency is at a minimum.
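
A back-of-envelope sketch of that ramp-up time: in slow-start the congestion window roughly doubles every round trip until it covers the path's bandwidth-delay product.  The RTT, MSS and initial window below are assumptions for illustration; none of them are taken from the trace.

```python
def slow_start_rounds(rate_bps: float, rtt_s: float,
                      mss: int = 1448, initial_window: int = 10) -> int:
    """Round trips of idealised slow-start needed to fill the pipe.

    Assumes the window doubles each RTT with no losses or delayed-ACK
    effects, starting from initial_window segments of mss bytes.
    """
    bdp_segments = rate_bps * rtt_s / 8 / mss
    cwnd = initial_window
    rounds = 0
    while cwnd < bdp_segments:
        cwnd *= 2
        rounds += 1
    return rounds


if __name__ == "__main__":
    rtt = 0.1  # assumed 100 ms WAN round trip
    n = slow_start_rounds(45e6, rtt)
    print(f"{n} round trips, about {n * rtt:.1f} s")
```

With those assumed numbers it comes out at six round trips, about 0.6 s, which is the same order as the ramps visible in the graph.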

Do you have TCP SACK and timestamps turned on?  Those usually allow minor 
losses like that to be handled more gracefully - the sending TCP gets a better 
idea of the RTT (allowing it to set the Retransmit Timer more intelligently), 
and would be able to see that progress is still being made with the backlog of 
buffered packets, even though the cumulative TCP ACK is not advancing.  In the event 
of burst loss, it would also be able to retransmit the correct set of packets 
straight away.
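
If the receiving machine happens to run Linux (an assumption; other systems expose these settings differently), both options can be checked through the usual sysctl files.  A small sketch, with the helper name read_tcp_options invented for the example:

```python
from pathlib import Path


def read_tcp_options(base="/proc/sys/net/ipv4",
                     names=("tcp_sack", "tcp_timestamps")):
    """Return {name: int or None} for each sysctl file under base.

    None means the file was missing or unreadable, for example on a
    non-Linux system.
    """
    result = {}
    for name in names:
        try:
            result[name] = int((Path(base) / name).read_text().split()[0])
        except (OSError, ValueError, IndexError):
            result[name] = None
    return result


if __name__ == "__main__":
    for name, value in read_tcp_options().items():
        print(f"{name} = {value}")
```

A value of 0 means the option is switched off; a non-zero value means it is enabled.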

What AQM would do for you here - if your ISP implemented it properly - is to 
eliminate the negative effects of filling that massive buffer at your ISP.  It 
would allow the sending TCP to detect and recover from any packet loss more 
quickly, and with ECN turned on you probably wouldn't even get any packet loss.

What's also interesting is that, after recovering from the change in bandwidth, 
you get smaller bursts of about 15-40KB arriving at roughly half-second 
intervals, mixed in with the relatively steady 1-, 2- and 3-packet stream.  
That is characteristic of low-level packet loss with a low-latency recovery.

This either implies that your ISP has stuck you on a much shorter buffer for 
the lower-bandwidth (non-PowerBoost) regime, *or* that the sender is enforcing 
a smaller congestion window on you after having suffered a slow-start recovery.  
The latter restricts your bandwidth to match the delay-bandwidth product, but 
happily the "delay" in that equation is at a minimum if it keeps your buffer 
empty.
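
To put numbers on the delay-bandwidth product: a sender limited to a fixed congestion window can deliver at most one window per round trip, so throughput is the window divided by the RTT.  The window size and RTT values here are assumptions chosen for the example.

```python
def window_limited_bps(cwnd_bytes: float, rtt_s: float) -> float:
    """Upper bound on throughput (bits/s) with one cwnd in flight per RTT."""
    return cwnd_bytes * 8 / rtt_s


if __name__ == "__main__":
    cwnd = 112_500  # assumed congestion window in bytes
    for rtt_ms in (20, 100, 200):  # empty buffer vs. increasingly bloated
        mbps = window_limited_bps(cwnd, rtt_ms / 1000) / 1e6
        print(f"RTT {rtt_ms:3d} ms: at most {mbps:.1f} Mbit/s")
```

The same window that sustains a high rate at a short RTT manages only a fraction of it once queueing inflates the round trip, so an empty buffer works in your favour.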

And frankly, you're still getting 45Mbps under those conditions.  Many people 
would kill for that sort of performance - although they'd probably then want to 
kill everyone in the Comcast call centre later on.

 - Jonathan Morton

_______________________________________________
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat
