Re: [Bloat] The Dark Problem with AQM in the Internet?
Hi Jerry, isn't this the problem statement of ConEx? Again, you at the end host would gain little insight with ConEx, but every intermediate network operator can observe the red/black marked packets, compare the ratios, and know to what extent he is contributing (by looking at ingress vs. egress into his network)...

Best regards, Richard

- Original Message - From: Jerry Jongerius To: 'Rich Brown' Cc: bloat@lists.bufferbloat.net Sent: Thursday, August 28, 2014 6:20 PM Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?

It adds accountability. Everyone in the path right now denies that they could possibly be the one dropping the packet. If I want (or need!) to address the problem, I can't now. I would have to make a change and just hope that it fixed the problem. With accountability, I can address the problem. I then have a choice. If the problem is the ISP, I can switch ISPs. If the problem is the mid-level peer or the hosting provider, I can test out new hosting providers.

- Jerry

From: Rich Brown [mailto:richb.hano...@gmail.com] Sent: Thursday, August 28, 2014 10:39 AM To: Jerry Jongerius Cc: Greg White; Sebastian Moeller; bloat@lists.bufferbloat.net Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?

Hi Jerry,

AQM is a great solution for bufferbloat. End of story. But if you want to track down which device in the network intentionally dropped a packet (when many devices in the network path will be running AQM), how are you going to do that? Or how do you propose to do that?

Yes, but... I want to understand why you are looking to know which device dropped the packet. What would you do with the information?

The great beauty of fq_codel is that it discards packets that have dwelt too long in a queue by actually *measuring* how long they've been in the queue. If the drops happen in your local gateway/home router, then it's interesting to you as the operator of that device.
If the drops happen elsewhere (perhaps some enlightened ISP has installed fq_codel, PIE, or some other zoomy queue discipline) then they're doing the right thing as well - they're managing their traffic as well as they can. But once the data leaves your gateway router, you can't make any further predictions. The SQM/AQM efforts of CeroWrt/fq_codel are designed to give near optimal performance of the *local* gateway, to make it adapt to the remainder of the (black box) network.

It might make sense to instrument the CeroWrt/OpenWrt code to track the number of fq_codel drops to come up with a sense of what's 'normal'. And if you need to know exactly what's happening, then tcpdump/wireshark are your friends.

Maybe I'm missing the point of your note, but I'm not sure there's anything you can do beyond your gateway. In the broader network, operators are continually watching their traffic and drop rates, and adjusting/reconfiguring their networks to adapt. But in general, it's impossible for you to have any sway/influence on their operations, so I'm not sure what you would do if you could know that the third router in traceroute was dropping...

Best regards, Rich

___ Bloat mailing list Bloat@lists.bufferbloat.net https://lists.bufferbloat.net/listinfo/bloat
Re: [Bloat] The Dark Problem with AQM in the Internet?
On 29 Aug, 2014, at 5:37 pm, Jerry Jongerius wrote:

did you check to see if packets were re-sent even if they weren't lost? one of the side effects of excessive buffering is that it's possible for a packet to be held in the buffer long enough that the sender thinks that it's been lost and retransmits it, so the packet is effectively 'lost' even if it actually arrives at its destination.

Yes. A duplicate packet for the missing packet is not seen. The receiver 'misses' a packet; starts sending out tons of dup acks (for all packets in flight and queued up due to bufferbloat), and then way later, the packet does come in (after the RTT caused by bufferbloat; indicating it is the 'resent' packet).

I think I've cracked this one - the cause, if not the solution.

Let's assume, for the moment, that Jerry is correct and PowerBoost plays no part in this. That implies that the flow is not using the full bandwidth after the loss, *and* that the additive increase of cwnd isn't sufficient to recover to that point within the test period. There *is* a sequence of events that can lead to that happening:

1) Packet is lost, at the tail end of the bottleneck queue.

2) Eventually, the receiver sees the loss and starts sending duplicate acks (each triggering the CA_EVENT_SLOW_ACK path in the sender). The sender (running Westwood+) assumes that each of these represents a received, full-size packet, for bandwidth estimation purposes.

3) The receiver doesn't send, or the sender doesn't receive, a duplicate ack for every packet actually received. Maybe some firewall sees a large number of identical packets arriving - without SACK or timestamps, they *would* be identical - and filters some of them. The bandwidth estimate therefore becomes significantly lower than the true value, and additionally the RTO fires and causes the sender to reset cwnd to 1 (CA_EVENT_LOSS).

4) The retransmitted packet finally reaches the receiver, and the ack it sends includes all the data received in the meantime (about 3.5MB). This is not sufficient to immediately reset the bandwidth estimate to the true value, because the BWE is sampled at RTT intervals, and also includes low-pass filtering.

5) This ends the recovery phase (CA_EVENT_CWR_COMPLETE), and the sender resets the slow-start threshold to correspond to the estimated delay-bandwidth product (MinRTT * BWE) at that moment.

6) This estimated DBP is lower than the true value, so the subsequent slow-start phase ends with the cwnd inadequately sized. Additive increase would eventually correct that - but the key word is *eventually*.

- Jonathan Morton
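The mechanism in steps 2-6 can be made concrete with a toy model. This is not the actual Linux tcp_westwood.c code; the filter constant, MSS, minimum RTT, and the fraction of lost duplicate ACKs are all illustrative assumptions:

```python
# Toy model of Westwood+-style bandwidth estimation (illustrative only).
# BWE is a low-pass-filtered sample of "bytes acked per RTT", so if a
# fraction of the duplicate ACKs never arrives, the estimate sags below
# the true bottleneck rate - and ssthresh = BWE * MinRTT / MSS sags with it.

MSS = 1448          # bytes per segment (assumed)
MIN_RTT = 0.05      # 50 ms minimum RTT (assumed)

def lowpass(old, sample, alpha=0.9):
    # EWMA filter, in the spirit of the Westwood+ papers
    return alpha * old + (1 - alpha) * sample

true_bw = 12.5e6                    # 100 Mbit/s in bytes/s
bwe = true_bw                       # estimate was correct before the loss
for _ in range(10):                 # ~10 RTT-spaced samples during recovery
    sample = true_bw * 0.5          # half the dup-acks filtered out (assumed)
    bwe = lowpass(bwe, sample)

ssthresh = bwe * MIN_RTT / MSS      # in segments; the true BDP is ~432 segments
print(round(bwe), round(ssthresh))
```

With these (assumed) numbers, ssthresh comes out around 290 segments against a true BDP of roughly 430, so the post-recovery slow-start ends well short of filling the pipe, matching step 6.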
Re: [Bloat] The Dark Problem with AQM in the Internet?
On Sat, 30 Aug 2014 09:05:58 +0300 Jonathan Morton chromati...@gmail.com wrote:

[...]

Bandwidth estimates by ACK RTT are fraught with problems. The returning ACK can be delayed for any number of reasons, such as other traffic or aggregation. This kind of delay-based congestion control suffers badly from any latency induced in the network. So instead of causing bloat, it gets hit by bloat.
Re: [Bloat] The Dark Problem with AQM in the Internet?
On 30 Aug, 2014, at 9:28 am, Stephen Hemminger wrote:

[...]

Bandwidth estimates by ACK RTT are fraught with problems. The returning ACK can be delayed for any number of reasons, such as other traffic or aggregation. This kind of delay-based congestion control suffers badly from any latency induced in the network. So instead of causing bloat, it gets hit by bloat.

In this case, the TCP is actually tracking RTT surprisingly well, but the bandwidth estimate goes wrong because the duplicate ACKs go missing.

Note that if the MinRTT was estimated too high (which is the only direction it could go), this would result in the slow-start threshold being *higher* than required, and the symptoms observed would not occur, since the cwnd would grow to the required value after recovery. This is the opposite effect from what happens to TCP Vegas in a bloated environment. Vegas stops increasing cwnd when the estimated RTT is noticeably higher than MinRTT, but if the true MinRTT changes (or it has to compete with a non-Vegas TCP flow), it has trouble tracking that fact.

There is another possibility: that the assumption of non-queue RTT being constant against varying bandwidth is incorrect. If that is the case, then the observed behaviour can be explained without recourse to lost duplicate ACKs - so Westwood+ is correctly tracking both MinRTT and BWE - but (MinRTT * BWE) turns out to be a poor estimate of the true BDP. I think this still fails to explain why the cwnd is reset (which should occur only on RTO), but everything else potentially fits.

I think we can distinguish the two theories by running tests against a server that supports SACK and timestamps, and where ideally we can capture packet traces at both ends.

- Jonathan Morton
Re: [Bloat] The Dark Problem with AQM in the Internet?
Okay, that is interesting. Could I convince you to try to enable SACK on the server and test whether you still see the catastrophic results? And/or try another TCP variant instead of Westwood+, like the default CUBIC.

Would love to, but cannot. I have read-only access to settings on that server.
Re: [Bloat] The Dark Problem with AQM in the Internet?
Hi Jerry,

On Aug 29, 2014, at 13:33 , Jerry Jongerius jer...@duckware.com wrote:

Okay, that is interesting. Could I convince you to try to enable SACK on the server and test whether you still see the catastrophic results? And/or try another TCP variant instead of Westwood+, like the default CUBIC.

Would love to, but cannot. I have read-only access to settings on that server.

Ah, too bad; it would have been nice to be able to pinpoint this closer (is this effect a quirk/bug in Westwood+, or caused by the "archaic" lack of SACK?). But this list contains vast knowledge about networking, so I hope that someone has an idea how to get closer to the root cause even without root access on the server. Oh, maybe you can ask the hosting company/owner of the server to switch the TCP for you?

Best Regards
Sebastian
Re: [Bloat] The Dark Problem with AQM in the Internet?
A ‘boost’ has never been seen. Bandwidth graphs where there is no packet loss look like:

From: Jonathan Morton [mailto:chromati...@gmail.com] Sent: Thursday, August 28, 2014 2:15 PM To: Jerry Jongerius Cc: bloat Subject: RE: [Bloat] The Dark Problem with AQM in the Internet?

If it is genuinely a single packet, then I have an alternate theory. I note from http://www.dslreports.com/faq/14520 that PowerBoost works on the first 20MB of a download. At 100Mbps or so, that's about 2 seconds. So that's quite convincing evidence that your packet loss is happening at the moment PowerBoost switches off. It might be that the switching process takes long enough to drop one packet. Or it might be that Comcast deliberately drops one packet in order to signal the change in bandwidth to the sender. Clever, if mildly distasteful.

- Jonathan Morton

attachment: image001.jpg
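The "about 2 seconds" figure above checks out arithmetically (20 MB allowance, ~100 Mbit/s during the boost, both as quoted in the thread):

```python
# How long a 20 MB PowerBoost allowance lasts at ~100 Mbit/s
boost_bytes = 20 * 1024 * 1024          # 20 MB allowance (figure from the FAQ)
rate_bps = 100e6                        # ~100 Mbit/s line rate during boost
boost_seconds = boost_bytes * 8 / rate_bps
print(round(boost_seconds, 2))          # ~1.7 s, i.e. "about 2 seconds"
```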
Re: [Bloat] The Dark Problem with AQM in the Internet?
did you check to see if packets were re-sent even if they weren't lost? one of the side effects of excessive buffering is that it's possible for a packet to be held in the buffer long enough that the sender thinks that it's been lost and retransmits it, so the packet is effectively 'lost' even if it actually arrives at its destination.

Yes. A duplicate packet for the missing packet is not seen. The receiver 'misses' a packet; starts sending out tons of dup acks (for all packets in flight and queued up due to bufferbloat), and then way later, the packet does come in (after the RTT caused by bufferbloat; indicating it is the 'resent' packet).
Re: [Bloat] The Dark Problem with AQM in the Internet?
A ‘boost’ has never been seen. Bandwidth graphs where there is no packet loss look like:

That's very odd, if true. Westwood+ should still be increasing the congestion window additively after recovering, so even if it got the bandwidth or latency estimates wrong, it should still recover full performance. Not necessarily very quickly, but it should still be visible on a timescale of several seconds.

More likely is that you're conflating cause and effect. The packet is only lost when the boost ends, so if for some reason the boost never ends, the packet is never lost.

- Jonathan Morton
Re: [Bloat] The Dark Problem with AQM in the Internet?
The additive increase is there in the raw data.

From: Jonathan Morton [mailto:chromati...@gmail.com] Sent: Friday, August 29, 2014 12:31 PM To: Jerry Jongerius Cc: bloat Subject: RE: [Bloat] The Dark Problem with AQM in the Internet?

[...]
Re: [Bloat] The Dark Problem with AQM in the Internet?
Mr. White,

AQM is a great solution for bufferbloat. End of story. But if you want to track down which device in the network intentionally dropped a packet (when many devices in the network path will be running AQM), how are you going to do that? Or how do you propose to do that?

The graph presented is caused by the interaction of a single dropped packet, bufferbloat, and the Westwood+ congestion control algorithm - and not power boost.

- Jerry

-Original Message- From: Greg White [mailto:g.wh...@cablelabs.com] Sent: Monday, August 25, 2014 1:14 PM To: Sebastian Moeller; Jerry Jongerius Cc: bloat@lists.bufferbloat.net Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?

As far as I know there are no deployments of AQM in DOCSIS networks yet. So, the effect you are seeing is unlikely to be due to AQM. As Sebastian indicated, it looks like an interaction between power boost, a drop-tail buffer, and the TCP congestion window getting reset to slow-start.

I ran a quick simulation of a simple network with power boost and a basic (bloated) drop-tail buffer (no AQM) this morning in an attempt to understand what is going on here. You didn't give me a lot to go on in the text of your blog post, but nonetheless after playing around with parameters a bit, I was able to get a result that was close to what you are seeing (attached). Let me know if you disagree.

I'm a bit concerned with the tone of your article, making AQM out to be the bad guy here (weapon against end users, etc.). The folks on this list and those who participate in the IETF AQM WG are working on AQM and packet scheduling algorithms in an attempt to fix the Internet. At this point AQM/PS is the best known solution; let's not create negative perceptions unnecessarily.

-Greg

On 8/23/14, 2:01 PM, Sebastian Moeller moell...@gmx.de wrote:

Hi Jerry,

On Aug 23, 2014, at 20:16 , Jerry Jongerius jer...@duckware.com wrote:

Request for comments on: http://www.duckware.com/darkaqm

The bottom line: How do you know which AQM device in a network intentionally drops a packet, without cooperation from AQM? Or is this in AQM somewhere and I just missed it?

I am sure you will get more expert responses later, but let me try to comment.

Paragraph 1: I think you hit the nail on the head with your observation: "The average user can not figure out what AQM device intentionally dropped packets". Only, I might add, this does not depend on AQM; the user can not figure out where packets were dropped in the case that not all involved network hops are under said user's control ;) So move on, nothing to see here ;)

Paragraph 2: There is no guarantee that any network equipment responds to ICMP requests at all (for example my DSLAM does not). What about pinging a host further away and looking at that host's RTT development over time? (Minor clarification: it's the load-dependent increase of ping RTT to the CMTS that would be diagnostic of a queue, not the RTT per se). No increase of ICMP RTT could also mean there is no AQM involved ;) I used to think along similar lines, but reading https://www.nanog.org/meetings/nanog47/presentations/Sunday/RAS_Traceroute_N47_Sun.pdf made me realize that my assumptions about ping and traceroute were not really backed up by reality. Notably, traceroute will not necessarily show the real data's path and latencies or drop probability.

Paragraph 3: What is the advertised bandwidth of your link? To my naive eye this looks a bit like power boosting (the cable company allowing you higher than advertised bandwidth for a short time that is later reduced to the advertised speed). Your plot needs a better legend, BTW; what is the blue line showing? When you say that neither ping nor traceroute showed anything, I assumed that you measured concurrently with your download. It would be really great if you could use netperf-wrapper to get comparable data (see the link on http://www.bufferbloat.net/projects/cerowrt/wiki/Quick_Test_for_Bufferbloat ). There the latency is not only assessed by ICMP echo requests but also by UDP packets, and it is very unlikely that your ISP can special-case these in any tricky way, short of giving priority to sparse flows (which is pretty much what you would like your ISP to do in the first place ;) )

Here is where I reveal that I am just a layman, but you complain about the loss of one packet; how do you assume a TCP flow settles on its transfer speed? Exactly: it keeps increasing until it loses a packet, then reduces its speed to 50% or so and slowly ramps up again until the next packet loss. So
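The sawtooth Sebastian describes (grow until a loss, halve, grow again) is classic AIMD behaviour and is easy to simulate; the 100-segment pipe size and round count here are arbitrary assumptions for illustration:

```python
# Minimal AIMD (additive-increase, multiplicative-decrease) sawtooth:
# cwnd grows by 1 segment per RTT until it overflows the pipe (a loss),
# then is halved, Reno-style - the probing behaviour described above.
def aimd(pipe_segments, rtts):
    cwnd, trace = 1, []
    for _ in range(rtts):
        if cwnd > pipe_segments:   # queue overflows -> packet loss detected
            cwnd = cwnd // 2       # multiplicative decrease
        else:
            cwnd += 1              # additive increase, 1 MSS per RTT
        trace.append(cwnd)
    return trace

trace = aimd(pipe_segments=100, rtts=400)
print(max(trace), min(trace[150:]))  # peaks just above the pipe, halves to 50
```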
Re: [Bloat] The Dark Problem with AQM in the Internet?
On 28 Aug, 2014, at 4:19 pm, Jerry Jongerius wrote:

AQM is a great solution for bufferbloat. End of story. But if you want to track down which device in the network intentionally dropped a packet (when many devices in the network path will be running AQM), how are you going to do that? Or how do you propose to do that?

We don't plan to do that. Not from the outside. Frankly, we can't reliably tell which routers drop packets today, when AQM is not at all widely deployed, so that's no great loss. But if ECN finally gets deployed, AQM can set the Congestion Experienced flag instead of dropping packets, most of the time. You still don't get to see which router did it, but the packet still gets through and the TCP session knows what to do about it.

The graph presented is caused by the interaction of a single dropped packet, bufferbloat, and the Westwood+ congestion control algorithm – and not power boost.

This surprises me somewhat - Westwood+ is supposed to be deliberately tolerant of single packet losses, since it was designed explicitly to get around the problem of slight random loss on wireless networks.

I'd be surprised if, in fact, *only* one packet was lost. The more usual case is of burst loss, where several packets are lost in quick succession, and not necessarily consecutive packets. This tends to happen repeatedly on dumb drop-tail queues, unless the buffer is so large that it accommodates the entire receive window (which, for modern OSes, is quite impressive in a dark sort of way). Burst loss is characteristic of congestion, whereas random loss tends to lose isolated packets, so it would be much less surprising for Westwood+ to react to it.

The packets were lost in the first place because the queue became chock-full, probably at just about the exact moment when the PowerBoost allowance ran out and the bandwidth came down (which tends to cause the buffer to fill rapidly), so you get the worst-case scenario: the buffer at its fullest, and the bandwidth draining it at its minimum. This maximises the time before your TCP gets to even notice the lost packet's nonexistence, during which the sender keeps the buffer full because it still thinks everything's fine.

What is probably happening is that the bottleneck queue, being so large, delays the retransmission of the lost packet until the Retransmit Timer expires. This will cause Reno-family TCPs to revert to slow-start, assuming (rightly in this case) that the characteristics of the channel have changed. You can see that it takes most of the first second for the sender to ramp up to full speed, and nearly as long to ramp back up to the reduced speed, both of which are characteristic of slow-start at WAN latencies. NB: during slow-start, the buffer remains empty as long as the incoming data rate is less than the output capacity, so latency is at a minimum.

Do you have TCP SACK and timestamps turned on? Those usually allow minor losses like that to be handled more gracefully - the sending TCP gets a better idea of the RTT (allowing it to set the Retransmit Timer more intelligently), and would be able to see that progress is still being made with the backlog of buffered packets, even though the core TCP ACK is not advancing. In the event of burst loss, it would also be able to retransmit the correct set of packets straight away.

What AQM would do for you here - if your ISP implemented it properly - is to eliminate the negative effects of filling that massive buffer at your ISP. It would allow the sending TCP to detect and recover from any packet loss more quickly, and with ECN turned on you probably wouldn't even get any packet loss.

What's also interesting is that, after recovering from the change in bandwidth, you get smaller bursts of about 15-40KB arriving at roughly half-second intervals, mixed in with the relatively steady 1-, 2- and 3-packet stream. That is characteristic of low-level packet loss with a low-latency recovery. This either implies that your ISP has stuck you on a much shorter buffer for the lower-bandwidth (non-PowerBoost) regime, *or* that the sender is enforcing a smaller congestion window on you after having suffered a slow-start recovery. The latter restricts your bandwidth to match the delay-bandwidth product, but happily the delay in that equation is at a minimum if it keeps your buffer empty.

And frankly, you're still getting 45Mbps under those conditions. Many people would kill for that sort of performance - although they'd probably then want to kill everyone in the Comcast call centre later on.

- Jonathan Morton
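Jonathan's point that a too-small congestion window caps throughput at cwnd/RTT can be checked against the numbers in the thread; the 50 ms minimum RTT and 1448-byte MSS here are assumptions, not measured values:

```python
# Throughput with a fixed congestion window: rate = cwnd * MSS * 8 / RTT.
MSS = 1448       # bytes per segment (assumed)
MIN_RTT = 0.05   # 50 ms minimum RTT (assumed)

def rate_mbps(cwnd_segments, rtt_s=MIN_RTT):
    # Mbit/s achievable with cwnd_segments in flight at the given RTT
    return cwnd_segments * MSS * 8 / rtt_s / 1e6

# cwnd needed to sustain the observed ~45 Mbit/s post-recovery rate:
cwnd_needed = 45e6 * MIN_RTT / 8 / MSS
print(round(cwnd_needed), round(rate_mbps(round(cwnd_needed)), 1))
```

Under these assumptions roughly 190-200 segments in flight suffice for 45 Mbit/s, which is why a post-recovery cwnd well below the true BDP still yields respectable (if reduced) throughput.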
Re: [Bloat] The Dark Problem with AQM in the Internet?
Hi Jerry,

AQM is a great solution for bufferbloat. End of story. But if you want to track down which device in the network intentionally dropped a packet (when many devices in the network path will be running AQM), how are you going to do that? Or how do you propose to do that?

Yes, but... I want to understand why you are looking to know which device dropped the packet. What would you do with the information?

The great beauty of fq_codel is that it discards packets that have dwelt too long in a queue by actually *measuring* how long they've been in the queue. If the drops happen in your local gateway/home router, then it's interesting to you as the operator of that device. If the drops happen elsewhere (perhaps some enlightened ISP has installed fq_codel, PIE, or some other zoomy queue discipline) then they're doing the right thing as well - they're managing their traffic as well as they can. But once the data leaves your gateway router, you can't make any further predictions. The SQM/AQM efforts of CeroWrt/fq_codel are designed to give near optimal performance of the *local* gateway, to make it adapt to the remainder of the (black box) network.

It might make sense to instrument the CeroWrt/OpenWrt code to track the number of fq_codel drops to come up with a sense of what's 'normal'. And if you need to know exactly what's happening, then tcpdump/wireshark are your friends.

Maybe I'm missing the point of your note, but I'm not sure there's anything you can do beyond your gateway. In the broader network, operators are continually watching their traffic and drop rates, and adjusting/reconfiguring their networks to adapt. But in general, it's impossible for you to have any sway/influence on their operations, so I'm not sure what you would do if you could know that the third router in traceroute was dropping...

Best regards, Rich
Re: [Bloat] The Dark Problem with AQM in the Internet?
On Aug 28, 2014, at 9:20 AM, Jerry Jongerius jer...@duckware.com wrote: It adds accountability. Everyone in the path right now denies that they could possibly be the one dropping the packet. If I want (or need!) to address the problem, I can't now. I would have to make a change and just hope that it fixed the problem. With accountability, I can address the problem. I then have a choice. If the problem is the ISP, I can switch ISPs. If the problem is the mid-level peer or the hosting provider, I can test out new hosting providers.

May I ask what may be a dumb question? All communication has some probability of error. That's the reason we have CRCs on link-layer frames: to detect and discard errored packets. The probability of such an error varies by media type; it's relatively uncommon (O(10^-11)) on fiber, a little more common (perhaps O(10^-9)) on wired Ethernet, likely on WiFi (O(10^-7) or so, which is why WiFi incorporates local retransmission), and very likely (O(10^-4)) on satellite links, which is why they use forward error correction. Errors are not usually single-bit errors. They are far more commonly block errors, especially if trellis coding is in use, as once there is an error the entire link goes screwy until it works out where the data is going. Such block errors might consume entire messages, or sets of messages, including not only the messages but the gaps between them. When a message is lost due to an error, how do you determine whose fault it is?

___ Bloat mailing list Bloat@lists.bufferbloat.net https://lists.bufferbloat.net/listinfo/bloat
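The per-media error rates quoted above can be turned into rough per-packet loss probabilities. A minimal sketch, assuming the figures are per-bit error rates and that bit errors are independent (the message itself notes real errors are bursty, so these are order-of-magnitude illustrations only):

```python
# Rough per-packet loss probability from a bit error rate (BER),
# assuming independent bit errors. Real errors are bursty, so treat
# these as order-of-magnitude illustrations, not measurements.

def packet_error_prob(ber: float, packet_bytes: int = 1500) -> float:
    """Probability that at least one of the packet's bits is corrupted."""
    bits = packet_bytes * 8
    return 1.0 - (1.0 - ber) ** bits

for media, ber in [("fiber", 1e-11), ("wired Ethernet", 1e-9),
                   ("WiFi", 1e-7), ("satellite", 1e-4)]:
    p = packet_error_prob(ber)
    print(f"{media:15s} BER={ber:.0e}  P(1500-byte packet corrupted) ~ {p:.2e}")
```

At satellite-like error rates a full-size packet is more likely corrupted than not, which is why forward error correction is mandatory there.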
Re: [Bloat] The Dark Problem with AQM in the Internet?
And again, AQM is not causing the problem that you observed. As Jonathan indicated, it would almost certainly make your performance better. I can't speak for Comcast, but AFAIK they are on a path to deploy AQM. If their customers start raising FUD that could change. TCP requires congestion signals. In the vast majority of cases today (and for the foreseeable future) those signals are dropped packets. Going on a witch hunt to find the evildoer that dropped your packet is counterproductive. I think you should instead be asking "why didn't you drop my packet earlier, before the buffer got so bloated and power boost cut the BDP by 60%?" -Greg From: Jerry Jongerius jer...@duckware.com Date: Thursday, August 28, 2014 at 10:20 AM To: 'Rich Brown' richb.hano...@gmail.com Cc: bloat@lists.bufferbloat.net Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?
Re: [Bloat] The Dark Problem with AQM in the Internet?
Regarding AQM in North American HFC deployments: I also can't speak for individual service providers, but Greg was being modest, and the following may be of interest. The most recent DOCSIS 3.1 spec calls for AQM in the CMTS. Specifically, it calls for a variant of PIE designed with the DOCSIS MAC layer in mind. The DOCSIS 3.0 spec is also being amended to require AQM. Both specs also recommend AQM in the cable modems, which can be turned on in the HFC network. See http://tools.ietf.org/html/draft-white-aqm-docsis-pie-00 for more details. bvs Bill Ver Steeg Distinguished Engineer Cisco Systems From: bloat-boun...@lists.bufferbloat.net On Behalf Of Greg White Sent: Thursday, August 28, 2014 12:36 PM To: Jerry Jongerius; 'Rich Brown' Cc: bloat@lists.bufferbloat.net Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?
Re: [Bloat] The Dark Problem with AQM in the Internet?
On Thu, Aug 28, 2014 at 10:20 AM, Jerry Jongerius jer...@duckware.com wrote: Jonathan, Yes, WireShark shows that *only* one packet gets lost. Regardless of RWIN size. The RWIN size can be below the BDP (no measurable queuing within the CMTS). Or, the RWIN size can be very large, causing significant queuing within the CMTS. With a larger RWIN value, the single dropped packet typically happens sooner in the download, rather than later. The fact there is no burst loss is a significant clue. The graph is fully explained by the Westwood+ algorithm that the server is using. If you input the data observed into the Westwood+ bandwidth estimator, you end up with the rate seen in the graph after the packet loss event. The reason the rate gets limited (no ramp up) is due to Westwood+ behavior on an RTO. And the reason there is an RTO is due to the bufferbloat, and the timing of the lost packet in relation to when the bufferbloat starts. When there is no RTO, I see the expected drop (to the Westwood+ bandwidth estimate) and ramp back up. On an RTO, Westwood+ sets both ssthresh and cwnd to its bandwidth estimate. On the same network, what does cubic do? The PC does SACK, the server does not, so not used. Timestamps off. Timestamps are *critical* for good TCP performance above 5-10 Mbit/s on most congestion control algorithms. I note that the netperf-wrapper test has the ability to test multiple variants of TCP, if enabled on the server (basically you need to modprobe the needed algorithms, enable them in /proc/sys/net/ipv4/tcp_allowed_congestion_control, and select them in the test tool (iperf and netperf have support)). Everyone here has installed netperf-wrapper already, yes? Very fast to generate a good test and a variety of plots like those shown here: http://burntchrome.blogspot.com/2014_05_01_archive.html (in reading that over, does anyone have any news on CMTS AQM or packet scheduling systems? It's the bulk of the problem there...)
netperf-wrapper is easy to bring up on linux, on osx it needs macports, and the only way I've come up with to test windows behavior is using windows as a netperf client rather than server. I haven't looked into westwood+'s behavior much of late; I will try to add it and a few other TCPs to some future tests. I do have some old plots showing it misbehaving relative to other TCPs, but that was before many fixes landed in the kernel. Note: I keep hoping to find a correctly working ledbat module; the one I have doesn't look correct (and needs to be updated to linux 3.15's change to us-based timestamping.) - Jerry -Original Message- From: Jonathan Morton [mailto:chromati...@gmail.com] Sent: Thursday, August 28, 2014 10:08 AM To: Jerry Jongerius Cc: 'Greg White'; 'Sebastian Moeller'; bloat@lists.bufferbloat.net Subject: Re: [Bloat] The Dark Problem with AQM in the Internet? On 28 Aug, 2014, at 4:19 pm, Jerry Jongerius wrote: AQM is a great solution for bufferbloat. End of story. But if you want to track down which device in the network intentionally dropped a packet (when many devices in the network path will be running AQM), how are you going to do that? Or how do you propose to do that? We don't plan to do that. Not from the outside. Frankly, we can't reliably tell which routers drop packets today, when AQM is not at all widely deployed, so that's no great loss. But if ECN finally gets deployed, AQM can set the Congestion Experienced flag instead of dropping packets, most of the time. You still don't get to see which router did it, but the packet still gets through and the TCP session knows what to do about it. The graph presented is caused by the interaction of a single dropped packet, bufferbloat, and the Westwood+ congestion control algorithm - and not power boost. This surprises me somewhat - Westwood+ is supposed to be deliberately tolerant of single packet losses, since it was designed explicitly to get around the problem of slight random loss on wireless networks.
I'd be surprised if, in fact, *only* one packet was lost. The more usual case is of burst loss, where several packets are lost in quick succession, and not necessarily consecutive packets. This tends to happen repeatedly on dumb drop-tail queues, unless the buffer is so large that it accommodates the entire receive window (which, for modern OSes, is quite impressive in a dark sort of way). Burst loss is characteristic of congestion, whereas random loss tends to lose isolated packets, so it would be much less surprising for Westwood+ to react to it. The packets were lost in the first place because the queue became chock-full, probably at just about the exact moment when the PowerBoost allowance ran out and the bandwidth came down (which tends to cause the buffer to fill rapidly), so you get the worst-case scenario: the buffer at its fullest, and the bandwidth draining it at its minimum. This maximises the time before your TCP gets to even notice the lost packet's nonexistence, during which the sender keeps the buffer full because it still thinks everything's fine.
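Jerry's description of the Westwood+ reaction ("on an RTO, Westwood+ sets both ssthresh and cwnd to its bandwidth estimate") can be sketched numerically. This is a toy model of the idea, not the server's actual implementation; the EWMA gain, MSS, and ACK cadence below are illustrative assumptions:

```python
# Toy Westwood+ bandwidth estimator: an EWMA over per-ACK delivery-rate
# samples. On an RTO, per Jerry's description, the window is set from
# the estimate (BWE * RTT_min, expressed in segments).
# All constants here are illustrative assumptions.

MSS = 1448  # bytes per segment, assumed

class WestwoodPlus:
    def __init__(self, rtt_min: float):
        self.bwe = 0.0          # bandwidth estimate, bytes/sec
        self.rtt_min = rtt_min  # minimum observed RTT, seconds

    def on_ack(self, acked_bytes: int, interval: float) -> None:
        sample = acked_bytes / interval
        # low-pass filter the noisy per-ACK samples (7/8 gain assumed)
        self.bwe = 0.875 * self.bwe + 0.125 * sample

    def window_on_rto(self) -> int:
        # post-RTO cwnd/ssthresh: the estimated BDP in segments
        return max(2, int(self.bwe * self.rtt_min / MSS))

w = WestwoodPlus(rtt_min=0.05)      # 50 ms path RTT, assumed
for _ in range(100):                # ACK stream worth ~100 Mbit/s
    w.on_ack(acked_bytes=12500, interval=0.001)   # 12.5 kB per ms
print("BWE (Mbit/s):", round(w.bwe * 8 / 1e6, 1))
print("post-RTO window (segments):", w.window_on_rto())
```

The point of the filter is exactly what Jerry observes: the estimate reflects the *achieved* rate before the loss, so after an RTO the sender restarts pinned near that estimate rather than probing upward again.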
Re: [Bloat] The Dark Problem with AQM in the Internet?
On 08/28/2014 06:35 PM, Fred Baker (fred) wrote: When a message is lost due to an error, how do you determine whose fault it is? Links need to be engineered for the optimum combination of power, bandwidth, overhead, and residual error that meets requirements. I agree with your implied point that a single error is unlikely to be indicative of a real problem, but a link not meeting requirements is someone's fault. So, like Jerry, I'd be interested in the ability for endpoints to collect statistics on per-hop loss probabilities, so that admins can hold their providers accountable. Jan
Re: [Bloat] The Dark Problem with AQM in the Internet?
If it is genuinely a single packet, then I have an alternate theory. I note from http://www.dslreports.com/faq/14520 that PowerBoost works on the first 20MB of a download. At 100Mbps or so, that's about 2 seconds. So that's quite convincing evidence that your packet loss is happening at the moment PowerBoost switches off. It might be that the switching process takes long enough to drop one packet. Or it might be that Comcast deliberately drops one packet in order to signal the change in bandwidth to the sender. Clever, if mildly distasteful. - Jonathan Morton
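The back-of-envelope timing above checks out; a quick sketch of the arithmetic (figures taken from the dslreports FAQ cited in the message):

```python
# How long a 20 MB PowerBoost allowance lasts at ~100 Mbit/s.
allowance_bytes = 20 * 10**6   # 20 MB boost allowance
rate_bps = 100 * 10**6         # ~100 Mbit/s boosted line rate
seconds = allowance_bytes * 8 / rate_bps
print(f"PowerBoost window: {seconds:.1f} s")   # 1.6 s, i.e. "about 2 seconds"
```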
Re: [Bloat] The Dark Problem with AQM in the Internet?
On 2014-08-28T20:00:54+0200, Jan Ceuleers jan.ceule...@gmail.com wrote: So like Jerry I'd be interested in an ability for endpoints to be able to collect statistics on per-hop loss probabilities so that admins can hold their providers accountable. Here is some relevant work: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2417573 Measurement and Analysis of Internet Interconnection and Congestion -- Kenyon Ralph
Re: [Bloat] The Dark Problem with AQM in the Internet?
Hi Jerry, On Aug 28, 2014, at 19:20, Jerry Jongerius jer...@duckware.com wrote: The PC does SACK, the server does not, so not used. Timestamps off. Okay, that is interesting. Could I convince you to try to enable SACK on the server and test whether you still see the catastrophic results? And/or try another TCP variant instead of Westwood+, like the default cubic. Best Regards Sebastian -Original Message- From: Jonathan Morton [mailto:chromati...@gmail.com] Sent: Thursday, August 28, 2014 10:08 AM To: Jerry Jongerius Cc: 'Greg White'; 'Sebastian Moeller'; bloat@lists.bufferbloat.net Subject: Re: [Bloat] The Dark Problem with AQM in the Internet? What is probably happening is that the bottleneck queue, being so large, delays the retransmission of the lost packet until the Retransmit Timer expires. This will cause Reno-family TCPs to revert to slow-start, assuming (rightly in this case) that the characteristics of the channel have changed. You can see that it takes most of the first second for the sender to ramp up to full speed, and nearly as long to ramp back up to the reduced speed, both of which are characteristic of slow-start at WAN latencies. NB: during slow-start, the buffer remains empty as long as the incoming data rate is less than the output capacity, so latency is at a minimum. Do you have TCP SACK and timestamps turned on? Those usually allow minor losses like that to be handled more gracefully - the sending TCP gets a better idea of the RTT (allowing it to set the Retransmit Timer more intelligently), and would be able to see that progress is still being made with the backlog of buffered data.
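The "more intelligent" retransmit timer mentioned above is conventionally computed per RFC 6298: a smoothed RTT plus four times its variance, with a 1-second floor. A minimal sketch showing how a bloating queue inflates the RTO (the sample values are illustrative):

```python
# RFC 6298 RTO estimation: SRTT and RTTVAR are updated per RTT sample,
# and RTO = SRTT + 4*RTTVAR, clamped to a 1 s minimum.

def make_estimator():
    srtt = rttvar = None
    def update(r: float) -> float:
        nonlocal srtt, rttvar
        if srtt is None:                       # first measurement
            srtt, rttvar = r, r / 2
        else:
            rttvar = 0.75 * rttvar + 0.25 * abs(srtt - r)
            srtt = 0.875 * srtt + 0.125 * r
        return max(1.0, srtt + 4 * rttvar)     # clamp to 1 s floor
    return update

rto = make_estimator()
print("idle link:    RTO =", rto(0.05))        # 50 ms baseline RTT
for r in [0.05, 0.2, 0.5, 1.0]:                # queue filling up
    last = rto(r)
print("bloated link: RTO =", round(last, 2))
```

With the queue bloated, the RTO grows well past the clamp, which is precisely the long dead time before the sender notices the missing packet.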
Re: [Bloat] The Dark Problem with AQM in the Internet?
On Thu, 28 Aug 2014, Dave Taht wrote: On Thu, Aug 28, 2014 at 11:00 AM, Jan Ceuleers jan.ceule...@gmail.com wrote: On 08/28/2014 06:35 PM, Fred Baker (fred) wrote: When a message is lost due to an error, how do you determine whose fault it is? Links need to be engineered for the optimum combination of power, bandwidth, overhead and residual error that meets requirements. I agree with your implied point that a single error is unlikely to be indicative of a real problem, but a link not meeting requirements is someone's fault. So like Jerry I'd be interested in an ability for endpoints to be able to collect statistics on per-hop loss probabilities so that admins can hold their providers accountable. I will argue that a provider demonstrating 3% packet loss and low latency is better than a provider showing .03% packet loss and exorbitant latency. So I'd rather be measuring latency AND loss. Yep, the drive to never lose a packet is what caused buffer sizes to grow to such silly extremes. David Lang One very cool thing that went by at sigcomm last week was the concept of active networking revived in the form of Tiny Packet Programs; see: http://arxiv.org/pdf/1405.7143v3.pdf Which has a core concept of a protocol and virtual machine that can actively gather data from the path itself about buffering, loss, etc. No implementation was presented, but I could see a way to easily do it in linux via iptables. Regrettably, elsewhere in the real world, we have to infer these statistics via various means. -- Dave Täht
Re: [Bloat] The Dark Problem with AQM in the Internet?
On Thu, 28 Aug 2014, Jerry Jongerius wrote: Yes, WireShark shows that *only* one packet gets lost. Regardless of RWIN size. The RWIN size can be below the BDP (no measurable queuing within the CMTS). Or, the RWIN size can be very large, causing significant queuing within the CMTS. With a larger RWIN value, the single dropped packet typically happens sooner in the download, rather than later. The fact there is no burst loss is a significant clue. Did you check to see whether packets were re-sent even if they weren't lost? One of the side effects of excessive buffering is that it's possible for a packet to be held in the buffer long enough that the sender thinks it has been lost and retransmits it, so the packet is effectively 'lost' even if it actually arrives at its destination. David Lang
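David's question (were packets re-sent even though they weren't lost?) can be checked offline against a capture. A sketch of the idea, operating on a list of (timestamp, sequence-number) tuples for one direction of one flow; the pcap field extraction is left out, and the trace below is made up for illustration:

```python
# Flag retransmissions in a one-direction TCP stream: any segment whose
# sequence number was already seen is a retransmit. If a receiver-side
# capture also shows the ORIGINAL copy arriving, the retransmit was
# spurious (the packet was delayed in a queue, not lost).

def find_retransmits(segments):
    """segments: iterable of (timestamp, seq) for one flow, one direction."""
    first_seen = {}
    retx = []
    for ts, seq in segments:
        if seq in first_seen:
            retx.append((ts, seq, ts - first_seen[seq]))  # (when, seq, gap)
        else:
            first_seen[seq] = ts
    return retx

# Hypothetical trace: seq 2448 is re-sent 1.24 s after its first copy.
trace = [(0.00, 1000), (0.01, 2448), (0.02, 3896), (1.25, 2448)]
for ts, seq, gap in find_retransmits(trace):
    print(f"seq {seq} retransmitted at t={ts} ({gap:.2f}s after first copy)")
```

A large gap between copies, comparable to the RTO rather than the RTT, is the signature of the bloated-buffer scenario David describes.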
Re: [Bloat] The Dark Problem with AQM in the Internet?
As far as I know there are no deployments of AQM in DOCSIS networks yet. So, the effect you are seeing is unlikely to be due to AQM. As Sebastian indicated, it looks like an interaction between power boost, a drop tail buffer and the tcp congestion window getting reset to slow-start. I ran a quick simulation of a simple network with power boost and basic (bloated) drop tail buffer (no AQM) this morning in an attempt to understand what is going on here. You didn't give me a lot to go on in the text of your blog post, but nonetheless after playing around with parameters a bit, I was able to get a result that was close to what you are seeing (attached). Let me know if you disagree. I'm a bit concerned with the tone of your article, making AQM out to be the bad guy here (weapon against end users, etc.). The folks on this list and those who participate in the IETF AQM WG are working on AQM and packet scheduling algorithms in an attempt to fix the Internet. At this point AQM/PS is the best known solution, let's not create negative perceptions unnecessarily. -Greg On 8/23/14, 2:01 PM, Sebastian Moeller moell...@gmx.de wrote: Hi Jerry, On Aug 23, 2014, at 20:16 , Jerry Jongerius jer...@duckware.com wrote: Request for comments on: www.duckware.com/darkaqm The bottom line: How do you know which AQM device in a network intentionally drops a packet, without cooperation from AQM? Or is this in AQM somewhere and I just missed it? I am sure you will get more expert responses later, but let me try to comment. 
Paragraph 1: I think you hit the nail on the head with your observation that the average user cannot figure out which AQM device intentionally dropped packets. Only, I might add, this does not depend on AQM; the user cannot figure out where packets were dropped in any case where not all involved network hops are under said user's control ;) So move on, nothing to see here ;) Paragraph 2: There is no guarantee that any network equipment responds to ICMP requests at all (for example my DSLAM does not). What about pinging a host further away and looking at that host's RTT development over time? (Minor clarification: it's the load-dependent increase of ping RTT to the CMTS that would be diagnostic of a queue, not the RTT per se). No increase of ICMP RTT could also mean there is no AQM involved ;) I used to think along similar lines, but reading https://www.nanog.org/meetings/nanog47/presentations/Sunday/RAS_Traceroute_N47_Sun.pdf made me realize that my assumptions about ping and traceroute were not really backed up by reality. Notably, traceroute will not necessarily show the real data's path, latencies, or drop probability. Paragraph 3: What is the advertised bandwidth of your link? To my naive eye this looks a bit like power boosting (the cable company allowing you higher-than-advertised bandwidth for a short time that is later reduced to the advertised speed). Your plot needs a better legend, BTW; what is the blue line showing? When you say that neither ping nor traceroute showed anything, I assumed that you measured concurrently with your download.
It would be really great if you could use netperf-wrapper to get comparable data (see the link on http://www.bufferbloat.net/projects/cerowrt/wiki/Quick_Test_for_Bufferbloat). There the latency is assessed not only by ICMP echo requests but also by UDP packets, and it is very unlikely that your ISP can special-case these in any tricky way, short of giving priority to sparse flows (which is pretty much what you would like your ISP to do in the first place ;) ).

Here is where I reveal that I am just a layman, but you complain about the loss of one packet; how do you think a TCP flow settles on its transfer speed? Exactly: it keeps increasing until it loses a packet, then reduces its speed to 50% or so and slowly ramps up again until the next packet loss. So unless your test data is not TCP, I see no way to avoid packet loss (and no reason why it is harmful). Now, if my power boost intuition should prove right, I can explain the massive drop quite well: TCP had ramped up to above the long-term stable rate and suffered several packet losses in a short time, basically resetting it to 0 or so; therefore the new ramping to 40 Mbps looks pretty similar to the initial ramping to 110 Mbps...

Paragraph 4: I guess ECN, explicit congestion notification, is the best you can expect: routers will initially set a mark on a packet to notify the TCP endpoints that they need to throttle their speed unless they want to risk packet loss. But not all routers are configured to use it (plus you need to configure your endpoints correctly, see: http://www.bufferbloat.net/projects/cerowrt/wiki/Enable_ECN). But this will not tell you where along the path congestion occurred, only that it occurred (and if push comes to shove your packets still get dropped). Also, I believe, a congested router is going to drop packets to be able to "survive" the current load.
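[Editor's note: the "grow until loss, halve, grow again" sawtooth described above can be sketched with a minimal Reno-style AIMD loop. The capacity and window values are arbitrary for illustration and not a real TCP stack.]

```python
def aimd_step(cwnd, capacity):
    """One RTT of Reno-style AIMD: additive increase by one segment;
    on overflow of the bottleneck (a packet drop), multiplicative
    decrease to half the window."""
    if cwnd + 1 > capacity:   # buffer overflows -> packet dropped
        return (cwnd + 1) // 2
    return cwnd + 1

cwnd, trace = 1, []
for _ in range(40):
    cwnd = aimd_step(cwnd, capacity=20)
    trace.append(cwnd)

# trace oscillates between roughly capacity/2 and capacity: packet loss
# is the signal TCP uses to find its rate, not a malfunction.
print(trace)
```

This is why a long TCP transfer without any loss at all would be surprising: in steady state the drops (or ECN marks) are the feedback loop.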
Re: [Bloat] The Dark Problem with AQM in the Internet?
Note that I worked with Folkert Van Heusden to get some options added to his httping program to allow ping-style testing against any HTTP server out there using HTTP/TCP. See: http://www.vanheusden.com/httping/

I find it slightly ironic that people are now concerned about ICMP ping no longer returning queuing information, given that when I started working on bufferbloat a number of people claimed that ICMP ping could not be relied upon to report reliable information, as it may be prioritized differently by routers. This urban legend may or may not be true; I never observed it in my explorations.

In any case, you all may find it useful, and my thanks to Folkert for a very useful tool.

- Jim

___ Bloat mailing list Bloat@lists.bufferbloat.net https://lists.bufferbloat.net/listinfo/bloat
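[Editor's note: the idea behind httping, timing something the server must answer over TCP rather than relying on ICMP, can be sketched by timing the TCP handshake. This is a simplified illustration, not httping's actual implementation, which times full HTTP exchanges; host and port are placeholders.]

```python
import socket
import time

def tcp_handshake_ms(host, port, timeout=2.0):
    """Approximate RTT by timing a TCP three-way handshake; unlike an
    ICMP echo, this cannot be deprioritized as 'just a ping'."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection established; close immediately
    return (time.monotonic() - start) * 1000.0
```

Running such a probe repeatedly while the link is idle and again during a saturating download gives the load-dependent RTT increase discussed earlier in the thread; a jump of hundreds of milliseconds under load points at a filled buffer somewhere on the path.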
Re: [Bloat] The Dark Problem with AQM in the Internet?
Just a cautionary tale: there was a fairly well publicized DoS attack that involved TCP SYN packets with a zero TTL (if I recall correctly), so be careful running that tool. Be particularly careful if you run it in bulk, as you may end up on a blacklist on a firewall somewhere...

Bill Ver Steeg
Distinguished Engineer, Cisco Systems

-----Original Message-----
From: bloat-boun...@lists.bufferbloat.net [mailto:bloat-boun...@lists.bufferbloat.net] On Behalf Of Sebastian Moeller
Sent: Monday, August 25, 2014 3:13 PM
To: Jim Gettys
Cc: bloat@lists.bufferbloat.net
Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?

Hi Jim,

On Aug 25, 2014, at 20:09, Jim Gettys j...@freedesktop.org wrote:

Note that I worked with Folkert Van Heusden to get some options added to his httping program to allow ping-style testing against any HTTP server out there using HTTP/TCP. See: http://www.vanheusden.com/httping/

That is quite cool!

Just to add what I learned: some routers seem to have rate limiting for ICMP processing and process these on a slow path (see https://www.nanog.org/meetings/nanog47/presentations/Sunday/RAS_Traceroute_N47_Sun.pdf). Mind you, this applies if the router processes the ICMP packet itself, not if it simply passes it along. So as long as the host responding to the pings is not a router with interesting limitations, this should not affect the suitability of ICMP to detect and measure bufferbloat (heck, this is what netperf-wrapper's RRUL test automates).
But since Jerry wants to pinpoint the exact location of his assumed single packet drop, he wants to use ping/traceroute to actually probe routers on the way, so all these urban legends about ICMP processing on routers will actually affect him. But then, what do I know...

Best Regards
Sebastian
Re: [Bloat] The Dark Problem with AQM in the Internet?
Oops - never mind. I thought the tool was doing traceroute-like things with varying TTLs in order to get per-hop data. Go back to whatever you were doing...

Bill Ver Steeg
Distinguished Engineer, Cisco Systems