On Fri, Jan 22, 2021 at 11:43 AM Stuart Cheshire <chesh...@apple.com> wrote:
>
> On 20 Jan 2021, at 07:55, Dave Taht <dave.t...@gmail.com> wrote:
>
> > This review, highly recommending this router on the high end
> >
> > https://www.increasebroadbandspeed.co.uk/best-router-2020
> >
> > also states that the sqm implementation has been dumbed down significantly
> > and can only shape 800Mbit inbound. Long ago we did a backport of cake to
> > the other ubnt routers mentioned in the review, has anyone tackled this one?
It's nice to see the "godfather" of our effort back again here. I periodically
re-read http://www.stuartcheshire.org/rants/latency.html

At the risk of over-lecturing for a wider audience:

> According to the UniFi Dream Machine Pro data sheet, it has a 1.7 GHz
> quad-core ARM Cortex-A57 processor and achieves the following throughput
> numbers (downlink direction):
>
> 8.0 Gb/s with Deep Packet Inspection

I'm always very dubious of these kinds of numbers against anything but
single large, bulk flows. Also, if the fast path is not entirely offloaded,
performance goes to hell.

> 3.5 Gb/s with DPI + Intrusion Detection
> 0.8 Gb/s with IPsec VPN

Especially here, also. I should also note that the rapidly deploying
wireguard vpn outperforms ipsec in just about every way... in software.

> <https://dl.ubnt.com/ds/udm-pro>
>
> Is implementing CoDel queueing really 10x more burden than running
> “Ubiquiti’s proprietary Deep Packet Inspection (DPI) engine”? Is CoDel 4x
> more burden than Ubiquiti’s IDS (Intrusion Detection System) and IPS
> (Intrusion Prevention System)?

These questions -- given that the actual fq-codel overhead is nearly
immeasurable, and the code complexity much less than these -- are the
makings of a very good rant targeted at a hw offload maker. :)

Hashing is generally "free" in hw, and selecting a different queue can be
done with a single indirection. Cake has a lot of ideas that would benefit
from actual hw offloads; a 4- or 8-way associative cache is a common IP hw
block...

> Is CoDel really the same per-packet cost as doing full IPsec VPN decryption
> on every packet?

No.

> I realize the IPsec VPN decryption probably has some assist from
> crypto-specific ARM instructions or hardware, but even so, crypto
> operations are generally considered relatively expensive. If this device
> can do 800 Mb/s throughput doing IPsec VPN decryption for every packet, it
> feels like it ought to be able to do a lot better than that just doing
> CoDel queueing calculations for every packet.

Yep. The only even semi-costly codel function is an invsqrt, which can be
implemented in 3k gates or so in hw. In software the newton approximation is
nearly immeasurable, and accurate enough. (We went to great lengths to make
it more accurate in cake, to no observable effect.)

Codel is not O(1). A nice thing about fq is that you can be codeling in
parallel; or, if you are acting on a single queue at a time, you can
short-circuit the overload section of codel to give up and deliver a packet
if you cannot meet the deadline. Or, using a very small fifo queue (say 3k
bytes at a gbit), the odds are extremely good (millions to one? ... a lot. I
worked it out once with various assumptions...) that no matter how many
packets you need to drop at once, you can still run at line rate at a
reasonable clock. BQL manages this short fifo in linux, but there it tends
to be much larger, inflated by tso offloads.

You really don't need to drop or mark a lot of packets to achieve good
congestion control at high rates. But you know that. :)

Most "hw" offloads are actually offloads to a specialized cpu, so whether
something is O(1) or not isn't much of a problem there.

> Is this just a software polish issue, that could be remedied by doing some
> performance optimization on the CoDel code?

Don't know how to make it faster. The linux version is about as optimized as
we know how to make it. A P4 implementation exists.

As everyone points out later in this thread, it's the software *shaper* (on
inbound especially) that is the real burden. The token bucket has been
offloaded to hw; the QCA offloaded version has both the tb and fq_codel in
there. Also, hw shaping outbound is vastly cheaper with a programmable
completion interrupt: tell 1Gbit hardware to interrupt at half the rate --
bang, it's 500Mbit.
(This is implemented in several intel ethernet cards.)

Inbound shaping in sw is another one of the "it's the latency, stupid"
things. It's not so much the clock rate, but how fast the cpu can reschedule
the thread -- a number that doesn't scale much with clock, but with cache
and pipeline depth. One reason why I adore the mill cpu design is that it
can context switch in 5 clocks, where x86 takes 1000...

> It’s also possible that the information in the review might simply be wrong
> -- it’s hard to measure throughput numbers in excess of 1 Gb/s unless you
> have both a client and a server connected faster than that in order to run
> the test. In other words, gigabit Ethernet is out, so both client and server
> would have to be connected via the 10 Gb/s SFP+ ports (of which the UDM-PRO
> has just two -- one in the upstream direction, and one in the downstream
> direction). Speaking for myself personally, I don’t have any devices with 10
> Gb/s capability, and my Internet connection isn’t above 1 Gb/s either, so as
> long as it can get reasonably close to 1 Gb/s that’s more than I need (or
> could use) right now.

As most 1Gbit ISP links are still quite overbuffered (over 120ms was what
I'd measured with comcast, 60ms on sonic fiber, both a few years back), vs a
total induced latency of *0-5ms* with sqm at 800mbit, it generally seems to
me that inbound shaping to something close to a gbit is a win for
videoconferencing, gaming, vr, jacktrip and other latency-sensitive traffic.
On a 35Mbit upload, fq_codel or cake are *loafing*.

If we were to get around to doing a backport of cake to this device, I'd
probably go with htb+fq_codel on the download and cake on the upload, where
the ack-filtering and per-host/per-flow fq of cake would be ideal. (This,
btw, is what I do presently.) Ack-filtering at these asymmetries is a pretty
big win for retaining a high download speed with competing upload traffic.
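To make the invsqrt point above concrete, here is a toy floating-point
sketch (my own, not the kernel's 16-bit fixed-point code) of the one Newton
step codel runs per drop to track 1/sqrt(count), which feeds its control law
next_drop = now + interval/sqrt(count). All names are mine:

```python
INTERVAL_US = 100_000  # codel's default 100 ms interval, in microseconds

def newton_invsqrt_step(y, count):
    """One Newton iteration refining y ~= 1/sqrt(count)."""
    return y * (3.0 - count * y * y) / 2.0

def control_law(count, y):
    """Microseconds until the next scheduled drop: interval * invsqrt(count)."""
    return INTERVAL_US * y

# count increments by one per drop, so a single cheap Newton step per update
# stays close to the true 1/sqrt(count) after the first few drops:
y = 1.0  # exact for count == 1
errs = []
for count in range(2, 200):
    y = newton_invsqrt_step(y, count)
    errs.append(abs(y - count ** -0.5) * count ** 0.5)  # relative error
assert max(errs[10:]) < 0.01  # settles to well under 1% error
```

One multiply-heavy line per drop event is the whole "semi-costly" part;
everything else in codel is compares and timestamp arithmetic.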
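For the "very small fifo" argument, the back-of-the-envelope arithmetic (my
numbers, using the 3k-bytes-at-a-gbit figure above) shows how little slack
is actually needed to drain a worst-case burst at line rate:

```python
LINE_RATE_BPS = 1_000_000_000  # 1 Gbit/s line rate
FIFO_BYTES = 3_000             # "say 3k bytes at a gbit"

# Time to drain a full fifo at line rate, in microseconds.
drain_us = FIFO_BYTES * 8 / LINE_RATE_BPS * 1e6
print(f"{drain_us:.0f} us")  # 24 us of headroom for drop/mark decisions
```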
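A rough sketch of that htb+fq_codel-down / cake-up shaping setup, in tc
terms. Interface names and rates here are hypothetical placeholders, and
this assumes kernel support for ifb, htb, fq_codel and sch_cake; tune the
rates to a bit below your actual link speeds:

```shell
WAN=eth0  # hypothetical WAN interface

# Egress: cake shaped to the upload rate, with ack-filter for the
# asymmetric link.
tc qdisc add dev $WAN root cake bandwidth 35mbit ack-filter

# Ingress: redirect inbound traffic through an ifb device, then shape
# with htb + fq_codel just below the download rate.
ip link add ifb0 type ifb
ip link set ifb0 up
tc qdisc add dev $WAN handle ffff: ingress
tc filter add dev $WAN parent ffff: matchall \
    action mirred egress redirect dev ifb0
tc qdisc add dev ifb0 root handle 1: htb default 10
tc class add dev ifb0 parent 1: classid 1:10 htb rate 950mbit
tc qdisc add dev ifb0 parent 1:10 fq_codel
```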
https://blog.cerowrt.org/post/ack_filtering/

You cannot do anything even close to a steady gbit down with competing
uplink traffic on the cable modems I've tested to date.

> Stuart Cheshire

--
"For a successful technology, reality must take precedence over public
relations, for Mother Nature cannot be fooled" - Richard Feynman

d...@taht.net <Dave Täht> CTO, TekLibre, LLC Tel: 1-831-435-0729

_______________________________________________
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat