Re: [Bloat] The Dark Problem with AQM in the Internet?
Hi Jerry, isn't this the problem statement of ConEx? Again, you at the end host would gain little insight with ConEx, but every intermediate network operator can observe the red/black marked packets, compare the ratios, and know to what extent (by looking at ingress vs. egress into his network) he is contributing...

Best regards, Richard

----- Original Message -----
From: Jerry Jongerius
To: 'Rich Brown'
Cc: bloat@lists.bufferbloat.net
Sent: Thursday, August 28, 2014 6:20 PM
Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?

It adds accountability. Everyone in the path right now denies that they could possibly be the one dropping the packet. If I want (or need!) to address the problem, I can't now. I would have to make a change and just hope that it fixed the problem. With accountability, I can address the problem. I then have a choice. If the problem is the ISP, I can switch ISPs. If the problem is the mid-level peer or the hosting provider, I can test out new hosting providers.

- Jerry

From: Rich Brown [mailto:richb.hano...@gmail.com]
Sent: Thursday, August 28, 2014 10:39 AM
To: Jerry Jongerius
Cc: Greg White; Sebastian Moeller; bloat@lists.bufferbloat.net
Subject: Re: [Bloat] The Dark Problem with AQM in the Internet?

Hi Jerry,

AQM is a great solution for bufferbloat. End of story. But if you want to track down which device in the network intentionally dropped a packet (when many devices in the network path will be running AQM), how are you going to do that? Or how do you propose to do that?

Yes, but... I want to understand why you are looking to know which device dropped the packet. What would you do with the information? The great beauty of fq_codel is that it discards packets that have dwelt too long in a queue by actually *measuring* how long they've been in the queue. If the drops happen in your local gateway/home router, then it's interesting to you as the operator of that device.
If the drops happen elsewhere (perhaps some enlightened ISP has installed fq_codel, PIE, or some other zoomy queue discipline) then they're doing the right thing as well - they're managing their traffic as well as they can. But once the data leaves your gateway router, you can't make any further predictions. The SQM/AQM efforts of CeroWrt/fq_codel are designed to give near optimal performance of the *local* gateway, to make it adapt to the remainder of the (black box) network.

It might make sense to instrument the CeroWrt/OpenWrt code to track the number of fq_codel drops to come up with a sense of what's 'normal'. And if you need to know exactly what's happening, then tcpdump/wireshark are your friends.

Maybe I'm missing the point of your note, but I'm not sure there's anything you can do beyond your gateway. In the broader network, operators are continually watching their traffic and drop rates, and adjusting/reconfiguring their networks to adapt. But in general, it's impossible for you to have any sway/influence on their operations, so I'm not sure what you would do if you could know that the third router in the traceroute was dropping...

Best regards, Rich

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat
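The ConEx-style accounting Richard describes can be illustrated with a toy calculation (the byte counts below are invented for illustration; real ConEx signals congestion via IP-header codepoints, not these variables):

```python
# Toy sketch of the ingress/egress accounting described above: an operator
# counts congestion-marked ("red") vs. total bytes entering and leaving its
# network; the rise in the marking ratio approximates the congestion its
# own network contributes to the end-to-end path.
def marking_ratio(marked_bytes, total_bytes):
    return marked_bytes / total_bytes

ingress_ratio = marking_ratio(marked_bytes=2_000, total_bytes=1_000_000)
egress_ratio = marking_ratio(marked_bytes=7_000, total_bytes=1_000_000)

# Fraction of traffic that picked up congestion marks inside this network:
own_contribution = egress_ratio - ingress_ratio
print(f"contributed congestion: {own_contribution:.3%}")
```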
Re: [Bloat] Comcast upped service levels - WNDR3800 can't cope...
On Sun, Aug 31, 2014 at 3:18 AM, Jonathan Morton chromati...@gmail.com wrote:

On 31 Aug, 2014, at 1:30 am, Dave Taht wrote:

Could I get you to also try HFSC?

Once I got a kernel running that included it, and figured out how to make it do what I wanted... ...it seems to be indistinguishable from HTB and FQ in terms of CPU load.

If you are feeling really inspired, try cbq. :) One thing I sort of like about cbq is that it (I think) (unlike htb presently) operates off an estimated size for the next packet (which isn't dynamic, sadly), where the others buffer up an extra packet until they can be delivered.

In my quest for absolutely minimal latency I'd love to be rid of that last extra non-in-the-fq_codel-qdisc packet... either with a peek operation or with a running estimate. I think this would (along with killing the maxpacket check in codel) allow for a faster system with less tuning (no tweaks below 2.5mbit in particular) across the entire operational range of ethernet. There would also need to be some support for what I call GRO slicing, where a large receive is split back into packets if a drop decision could be made.

It would be cool to be able to program the ethernet hardware itself to return completion interrupts at a given transmit rate (so you could program the hardware to be any bandwidth, not just 10/100/1000). Some hardware, so far as I know, supports this with a pacing feature. This doesn't help on inbound rate limiting, unfortunately, just egress.

Actually, I think most of the CPU load is due to overheads in the userspace-kernel interface and the device driver, rather than the qdiscs themselves.

You will see it bound by the softirq thread, but what, exactly, inside that is kind of unknown. (I presently lack time to build up profilable kernels on these low end arches.)

Something about TBF causes more overhead - it goes through periods of lower CPU use similar to the other shapers, but then spends periods at considerably higher CPU load, all without changing the overall throughput. The flip side of this is that TBF might be producing a smoother stream of packets. The receiving computer (which is fast enough to notice such things) reports that a substantially larger number of recv() calls are required to take in the data from TBF than from anything else - averaging about 4.4KB rather than 9KB or so. But at these data rates, it probably matters little.

Well, htb has various tuning options (see quantum and burst) that alter its behavior along the lines of what you're seeing from tbf.

FWIW, apparently Apple's variant of the GEM chipset doesn't support jumbo frames. This does, however, mean that I'm definitely working with an MTU of 1500, similar to what would be sent over the Internet. These tests were all run using nttcp. I wanted to finally try out RRUL, but the wrappers fail to install via pip on my Gentoo boxes. I'll need to investigate further before I can make pretty graphs like everyone else.

- Jonathan Morton

--
Dave Täht

NSFW: https://w2.eff.org/Censorship/Internet_censorship_bills/russell_0296_indecent.article
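For concreteness, the htb quantum and burst knobs Dave mentions are set per class; a hypothetical example follows (device name, rates, and values are placeholders, not recommendations):

```shell
# Sketch of htb tuning: smaller burst values smooth the packet stream
# (more tbf-like behaviour); larger burst and quantum values reduce CPU
# cost at the price of burstier output.
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb \
    rate 100mbit ceil 100mbit burst 15k quantum 1514
tc qdisc add dev eth0 parent 1:10 fq_codel
```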
[Bloat] What is wrong with Microsoft's receive window auto-tuning?
I am noticing (via Wireshark traces) that at times Microsoft's (Windows 7) receive window auto-tuning goes horribly wrong, causing significant bufferbloat. And at other times, the tuning appears to work just fine. For example, the BDP suggests a receive window of 750k, and most often Windows tunes around 1MB -- but at other times, it will tune to 3.8MB (way more than it should). Is anyone aware of any research either pointing out how their tuning algorithm works, or of known bugs/problems with the algorithm?
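For reference, the 750k window mentioned above falls out of a plain bandwidth-delay-product calculation; the link rate and RTT here are assumptions picked to match that figure, not values stated in the report:

```python
# Bandwidth-delay product: the receive window needed to keep a path full.
# Assumed path: 100 Mbit/s bottleneck, 60 ms round-trip time.
link_bps = 100_000_000   # bits per second
rtt_s = 0.060            # seconds

bdp_bytes = link_bps / 8 * rtt_s
print(f"BDP = {bdp_bytes / 1000:.0f} kB")

# A 3.8 MB autotuned window would then be ~5x larger than the path needs;
# the excess sits queued in the network - classic bufferbloat.
oversize_factor = 3_800_000 / bdp_bytes
```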
Re: [Bloat] Comcast upped service levels - WNDR3800 can't cope...
On 1 Sep, 2014, at 8:01 pm, Dave Taht wrote:

On Sun, Aug 31, 2014 at 3:18 AM, Jonathan Morton chromati...@gmail.com wrote:

On 31 Aug, 2014, at 1:30 am, Dave Taht wrote:

Could I get you to also try HFSC?

Once I got a kernel running that included it, and figured out how to make it do what I wanted... ...it seems to be indistinguishable from HTB and FQ in terms of CPU load.

If you are feeling really inspired, try cbq. :) One thing I sort of like about cbq is that it (I think) (unlike htb presently) operates off an estimated size for the next packet (which isn't dynamic, sadly), where the others buffer up an extra packet until they can be delivered.

It's also hilariously opaque to configure, which is probably why nobody uses it - the RED problem again - and the top link when I Googled for best practice on it gushes enthusiastically about Linux 2.2! The idea of manually specifying an average packet size in particular feels intuitively wrong to me. Still, I might be able to try it later on.

Most class-based shapers are probably more complex to set up for simple needs than they need to be. I have to issue three separate 'tc' invocations for a minimal configuration of each of them, repeating several items of data between them. They scale up reasonably well to complex situations, but such uses are relatively rare.

In my quest for absolutely minimal latency I'd love to be rid of that last extra non-in-the-fq_codel-qdisc packet... either with a peek operation or with a running estimate.

I suspect that something like fq_codel which included its own shaper (with the knobs set sensibly by default) would gain more traction via ease of use - and might even answer your wish.

It would be cool to be able to program the ethernet hardware itself to return completion interrupts at a given transmit rate (so you could program the hardware to be any bandwidth, not just 10/100/1000). Some hardware, so far as I know, supports this with a pacing feature.
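The "three separate 'tc' invocations" Jonathan describes look roughly like this for a minimal htb + fq_codel setup (device name and rate are placeholders; note how the device and class handles must be repeated across the calls):

```shell
# Minimal class-based shaping: root qdisc, one class, one leaf qdisc.
tc qdisc add dev eth0 root handle 1: htb default 1
tc class add dev eth0 parent 1: classid 1:1 htb rate 20mbit ceil 20mbit
tc qdisc add dev eth0 parent 1:1 fq_codel
```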
Is there a summary of hardware features like this anywhere? It'd be nice to see what us GEM and RTL proles are missing out on. :-)

Actually, I think most of the CPU load is due to overheads in the userspace-kernel interface and the device driver, rather than the qdiscs themselves.

You will see it bound by the softirq thread, but what, exactly, inside that is kind of unknown. (I presently lack time to build up profilable kernels on these low end arches.)

When I eventually got RRUL running (on one of the AMD boxes, so the PowerBook only has to run the server end of netperf), the bandwidth maxed out at about 300Mbps each way, and the softirq was bouncing around 60% CPU. I'm pretty sure most of that is shoving stuff across the PCI bus (even though it's internal to the northbridge), or at least waiting for it to go there. I'm happy to assume that the rest was mostly kernel-userspace interface overhead to the netserver instances.

But this doesn't really answer the question of why the WNDR has so much lower a ceiling with shaping than without. The G4 is powerful enough that the overhead of shaping simply disappears next to the overhead of shoving data around. Even when I turn up the shaping knob to a value quite close to the hardware's unshaped capabilities (e.g. 400Mbps one-way), most of the shapers stick to the requested limit like glue, and even the worst offender is within 10%. I estimate that it's using only about 500 clocks per packet *unless* it saturates the PCI bus.

It's possible, however, that we're not really looking at a CPU limitation, but a timer problem. The PowerBook is a proper desktop computer with hardware to match (modulo its age). If all the shapers now depend on the high-resolution timer, how high-resolution is the WNDR's timer?

- Jonathan Morton
Re: [Bloat] Comcast upped service levels - WNDR3800 can't cope...
On Mon, Sep 1, 2014 at 11:06 AM, Jonathan Morton chromati...@gmail.com wrote:

On 1 Sep, 2014, at 8:01 pm, Dave Taht wrote:

On Sun, Aug 31, 2014 at 3:18 AM, Jonathan Morton chromati...@gmail.com wrote:

On 31 Aug, 2014, at 1:30 am, Dave Taht wrote:

Could I get you to also try HFSC?

Once I got a kernel running that included it, and figured out how to make it do what I wanted... ...it seems to be indistinguishable from HTB and FQ in terms of CPU load.

If you are feeling really inspired, try cbq. :) One thing I sort of like about cbq is that it (I think) (unlike htb presently) operates off an estimated size for the next packet (which isn't dynamic, sadly), where the others buffer up an extra packet until they can be delivered.

It's also hilariously opaque to configure, which is probably why nobody uses it - the RED problem again - and the top link when I Googled for best practice on it gushes enthusiastically about Linux 2.2! The idea of manually specifying an average packet size in particular feels intuitively wrong to me. Still, I might be able to try it later on.

I felt an EWMA of egress packet sizes would be a better estimator, yes.

Most class-based shapers are probably more complex to set up for simple needs than they need to be. I have to issue three separate 'tc' invocations for a minimal configuration of each of them, repeating several items of data between them. They scale up reasonably well to complex situations, but such uses are relatively rare.

In my quest for absolutely minimal latency I'd love to be rid of that last extra non-in-the-fq_codel-qdisc packet... either with a peek operation or with a running estimate.

I suspect that something like fq_codel which included its own shaper (with the knobs set sensibly by default) would gain more traction via ease of use - and might even answer your wish.

I agree that a simpler-to-use qdisc would be good.
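The EWMA estimator Dave suggests as a replacement for cbq's static average-packet-size parameter is simple to sketch (the gain of 1/16 is an arbitrary choice here, not anything from the thread):

```python
# Exponentially weighted moving average of egress packet sizes, as a
# running replacement for a hand-configured "average packet size".
# A small alpha weights each new packet lightly, so the estimate tracks
# the traffic mix slowly and is not thrown off by a single outlier.
def make_ewma(alpha=1/16, initial=1500):
    est = initial
    def update(pkt_len):
        nonlocal est
        est += alpha * (pkt_len - est)
        return est
    return update

update = make_ewma()
for length in [1514, 1514, 64, 1514]:   # a mostly-MTU-sized stream
    est = update(length)
# est stays near the MTU despite the occasional small ACK-sized packet
print(f"estimated next packet size: {est:.0f} bytes")
```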
I'd like something that preserves multiple (3-4) service classes (as pfifo_fast and sch_fq do) using DRR, deals with diffserv, and could be invoked with a command line like:

tc qdisc add dev eth0 cake bandwidth 50mbit diffservmap std

I had started at that (basically pouring cerowrt's simple.qos code into C with a simple lookup table for diffserv) many moons ago, but the contents of the yurtlab and that code were stolen - and I was (and remain) completely stuck on how to do soft rate limiting saner, particularly in asymmetric scenarios. (cake stood for Common Applications Kept Enhanced.)

fq_codel is not a drop-in replacement for pfifo_fast due to its classless nature. sch_fq comes closer, but it's more server oriented. QFQ with 4 weighted bands + fq_codel can be made to do the levels-of-service stuff fairly straightforwardly at line rate, but the tc filter code tends to get rather long to handle all the diffserv classes...

So... we keep polishing the sqm system, and I keep tracking progress in how diffserv classification will be done in the future (in ietf groups like rmcat and dart), and figuring out how to deal better with aggregating macs in general is what keeps me awake nights, more than finishing cake... We'll get there, eventually.

It would be cool to be able to program the ethernet hardware itself to return completion interrupts at a given transmit rate (so you could program the hardware to be any bandwidth, not just 10/100/1000). Some hardware, so far as I know, supports this with a pacing feature.

Is there a summary of hardware features like this anywhere? It'd be nice to see what us GEM and RTL proles are missing out on. :-)

I'd like one. There are certain 3rd party firmwares like octeon's where it seems possible to add more features to the firmware co-processor, in particular.

Actually, I think most of the CPU load is due to overheads in the userspace-kernel interface and the device driver, rather than the qdiscs themselves.
You will see it bound by the softirq thread, but what, exactly, inside that is kind of unknown. (I presently lack time to build up profilable kernels on these low end arches.)

When I eventually got RRUL running (on one of the AMD boxes, so the PowerBook only has to run the server end of netperf), the bandwidth maxed out at about 300Mbps each way, and the softirq was bouncing around 60% CPU. I'm pretty sure most of that is shoving stuff across the PCI bus (even though it's internal to the northbridge), or at least waiting for it to go there. I'm happy to assume that the rest was mostly kernel-userspace interface overhead to the netserver instances.

perf and the older oprofile are our friends here.

But this doesn't really answer the question of why the WNDR has so much lower a ceiling with shaping than without. The G4 is powerful enough that the overhead of shaping simply disappears next to the overhead of shoving data around. Even when I turn up the shaping knob to a value quite close to the hardware's
Re: [Bloat] Comcast upped service levels - WNDR3800 can't cope...
But this doesn't really answer the question of why the WNDR has so much lower a ceiling with shaping than without. The G4 is powerful enough that the overhead of shaping simply disappears next to the overhead of shoving data around. Even when I turn up the shaping knob to a value quite close to the hardware's unshaped capabilities (e.g. 400Mbps one-way), most of the shapers stick to the requested limit like glue, and even the worst offender is within 10%. I estimate that it's using only about 500 clocks per packet *unless* it saturates the PCI bus. It's possible, however, that we're not really looking at a CPU limitation, but a timer problem. The PowerBook is a proper desktop computer with hardware to match (modulo its age). If all the shapers now depend on the high-resolution timer, how high-resolution is the WNDR's timer?

Both good questions worth further exploration. Doing some napkin math and some spec reading, I think that the memory bus is a likely factor. The G4 had a fairly impressive memory bus for the day (64-bit?). The WNDR3800 appears to use its RAM in an x16 configuration (based on the numbers on the memory parts). It may have *just* enough bandwidth to push concurrent 3x3 802.11n through the software bridge interface, which short-circuits a lot of processing (IIRC).

The typical way I've seen a home router being benchmarked for the marketing numbers is to flow TCP data to/from a wifi client to a wired client. A single socket is used, for a uni-directional stream of data. So long as they can hit peak rates (peak MCS), it will get marked as good for up to 900Mbps!! or whatever they want to say.

The small cache of the AR7161 vs. the G4 is another issue (32KB vs. 2MB): the various buffers for fq_codel and htb may stay in L2 on the G4, but there simply isn't room in the AR7161 for that, which puts further pressure on the bus.

-Aaron
Re: [Bloat] Comcast upped service levels - WNDR3800 can't cope...
On 1 Sep, 2014, at 11:25 pm, Aaron Wood wrote:

But this doesn't really answer the question of why the WNDR has so much lower a ceiling with shaping than without. The G4 is powerful enough that the overhead of shaping simply disappears next to the overhead of shoving data around. Even when I turn up the shaping knob to a value quite close to the hardware's unshaped capabilities (e.g. 400Mbps one-way), most of the shapers stick to the requested limit like glue, and even the worst offender is within 10%. I estimate that it's using only about 500 clocks per packet *unless* it saturates the PCI bus. It's possible, however, that we're not really looking at a CPU limitation, but a timer problem. The PowerBook is a proper desktop computer with hardware to match (modulo its age). If all the shapers now depend on the high-resolution timer, how high-resolution is the WNDR's timer?

Both good questions worth further exploration. Doing some napkin math and some spec reading, I think that the memory bus is a likely factor. The G4 had a fairly impressive memory bus for the day (64-bit?). The WNDR3800 appears to use its RAM in an x16 configuration (based on the numbers on the memory parts). It may have *just* enough bandwidth to push concurrent 3x3 802.11n through the software bridge interface, which short-circuits a lot of processing (IIRC). The typical way I've seen a home router being benchmarked for the marketing numbers is to flow TCP data to/from a wifi client to a wired client. A single socket is used, for a uni-directional stream of data. So long as they can hit peak rates (peak MCS), it will get marked as good for up to 900Mbps!! or whatever they want to say. The small cache of the AR7161 vs. the G4 is another issue (32KB vs. 2MB): the various buffers for fq_codel and htb may stay in L2 on the G4, but there simply isn't room in the AR7161 for that, which puts further pressure on the bus.

I don't think that's it.
First a nitpick: the PowerBook version of the late-model G4 (7447A) doesn't have the external L3 cache interface, so it only has the 256KB or 512KB internal L2 cache (I forget which). The desktop version (7457A) used external cache. The G4 was considered to be *crippled* by its FSB by the end of its run, since it never adopted high-performance signalling techniques, nor moved the memory controller on-die; it was quoted that the G5 (970) could move data using *single-byte* operations faster than the *peak* throughput of the G4's FSB. The only reason the G5 never made it into a PowerBook was because it wasn't battery-friendly in the slightest. But that makes little difference to your argument - compared to a cheap CPE-class embedded SoC, the PowerBook is eminently desktop-class hardware, even if it is already a decade old. More compelling is that even at 16-bit width, the WNDR's RAM should have more bandwidth than my PowerBook's PCI bus. Standard PCI is 33MHz x 32-bit, and I can push a steady 30MB/sec in both directions simultaneously, which corresponds in total to about half the PCI bus's theoretical capacity. (The GEM reports 66MHz capability, but it shares the bus with an IDE controller which doesn't, so I assume it is stuck at 33MHz.) A 16-bit RAM should be able to match PCI if it runs at 66MHz, which is the lower limit of JEDEC standards for SDRAM. The AR7161 datasheet says it has a DDR-capable SDRAM interface, which implies at least 200MHz unless the integrator was colossally stingy. Further, a little digging suggests that the memory bus should be 32-bit wide (hence two 16-bit RAM chips), and that the WNDR runs it at 340MHz, half the CPU core speed. For an embedded SoC, that's really not too bad - it should be able to sustain 1GB/sec, in one direction at a time. So that takes care of the argument for simply moving the payload around. In any case, the WNDR demonstrably *can* cope with the available bandwidth if the shaping is turned off. 
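Making the napkin math in this exchange explicit (the 340 MHz / 32-bit DDR figures are from the paragraph above; the rest is standard bus arithmetic, and the ~1 GB/s sustained figure quoted there allows for refresh and turnaround overhead below the computed peak):

```python
# Standard PCI: 33 MHz x 32-bit = 132 MB/s theoretical, shared between
# both directions. Observed: a steady 30 MB/s each way simultaneously.
pci_capacity = 33_000_000 * 4            # bytes/s
observed = 2 * 30_000_000                # bytes/s, both directions summed
print(f"PCI utilisation: {observed / pci_capacity:.0%}")  # about half

# WNDR3800 RAM: 32-bit bus, DDR at 340 MHz -> two transfers per clock.
ram_peak = 340_000_000 * 4 * 2           # bytes/s
print(f"RAM peak: {ram_peak / 1e9:.1f} GB/s")
```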
For the purposes of shaping, the CPU shouldn't need to touch the majority of the payload - only the headers, which are relatively small. The bulk of the payload should DMA from one NIC to RAM, then DMA back out of RAM to the other NIC. It has to do that anyway to route them, and without shaping there'd be more of them to handle. The difference might be in the data structures used by the shaper itself, but I think those are also reasonably compact. It doesn't even have to touch userspace, since it's not acting as the endpoint as my PowerBook was during my tests. And while the MIPS 24K core is old, it's also been die-shrunk over the intervening years, so it runs a lot faster than it originally did. I very much doubt that it's as refined as my G4, but it could probably hold its own relative to a comparable ARM SoC such as the Raspberry Pi. (Unfortunately, the latter doesn't have the I/O capacity to do high-speed networking - USB only.) Atheros publicity