On Sat, Sep 1, 2012 at 5:53 AM, Eric Dumazet <[email protected]> wrote:
> On Fri, 2012-08-31 at 09:59 -0700, Dave Taht wrote:
>
>> I realize that 10GigE and datacenter host based work is sexy and fun,
>> but getting stuff that runs well in today's 1-20Mbit environments is
>> my own priority, going up to 100Mbit, with something that can be
>> embedded in a SoC. The latest generation of SoCs all do QoS in
>> hardware... badly.
>
> Maybe 'datacenter' word was badly chosen and you obviously jumped on it,
> because it meant different things for you.
I am hypersensitive about optimizing for sub-ms problems when there are
huge multi-second problems like in cable, wifi, and cellular. Recent paper:

http://conferences.sigcomm.org/sigcomm/2012/paper/cellnet/p1.pdf

Sorry. If the srtt idea can scale UP as well as down sanely, cool. I am
concerned about how different TCPs might react to it, and I have a long
comment about placing this at this layer at the bottom of this email.

> Point was that when your machine has flows with quite different RTT, 1
> ms on your local LAN, and 100 ms on different continent, current control
> law might clamp long distance communications, or have slow response time
> for the LAN traffic.

With fq_codel that is far less likely, and if long-distance and local
streams do collide in a single queue there, what happens if you fiddle
with srtt?

> The shortest path you have, the sooner you should drop packets because
> losses have much less impact on latencies.

Sure.

> Yuchung idea sounds very good and my intuition is it will give
> tremendous results for standard linux qdisc setups ( a single qdisc per
> device)

I tend to agree.

> To get similar effects, you could use two (or more) fq codels per
> ethernet device.

Ugh.

> One fq_codel with interval = 1 or 5 ms for LAN communications
> One fq_codel with interval = 100 ms for other communications

and one mfq_codel with a calculated maxpacket, weird interval, etc. for wifi.

> tc filters to select the right qdisc by destination addresses

Meh. A simple default might be "am I going out the default route for this?"

> Then we are a bit far from codel spirit (no knob qdisc)
>
> I am pretty sure you noticed that if your ethernet adapter is only used
> for LAN communications, you have to setup codel interval to a much
> smaller value than the 100 ms default to get reasonably fast answer to
> congestion.

At 100Mbit (as I've noted elsewhere), BQL chooses defaults about double
the optimum (6-7k), and gso is currently left on. With those adjusted
(BQL lowered, gso off), I tend to run a pretty congested network and
rarely notice. That does not mean reaction time isn't an issue; it is
merely masked so well that I don't care.

> Just make this automatic, because people dont want to think about it.

Like you, I want one qdisc to rule them all, with sane defaults.

I do feel it is very necessary to add one pfifo_fast-like behavior to
fq_codel: deprioritizing background traffic, in its own set of fq'd
flows. A simple way to do that is to have a bkweight of, say, 20, and
only check "q->slow_flows" on that interval of packet deliveries (a
rough sketch of what I mean is below). This is the only way I can think
of to survive bittorrent-like flows, and to capture the intent of
traffic marked background.

However, I did want to talk to the using-codel-to-solve-everything
issue for fixing host bufferbloat...

Fixing host bufferbloat by adding local tcp awareness is a neat idea,
don't let me stop you! But... codel will push stuff down to, but not
below, 5ms of latency (or target). In fq_codel you will typically end
up with 1 packet outstanding in each active queue under heavy load. At
10Mbit it's pretty easy to have it strain mightily and fail to get to
5ms, particularly on torrent-like workloads.

The "right" amount of host latency to aim for is... 0, or as close to
it as you can get. Fiddling with codel target and interval on the host
to get less host latency is well and good, but you can't get to 0 that
way... The best queue on a host is no extra queue.
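To make the bkweight idea above concrete, here is a minimal userspace
sketch in plain C of the scheduling behavior I have in mind, assuming a
bkweight of 20 and a separate list of background flows; the names
bkweight and slow_flows are hypothetical, the work-conserving fallback
is my own assumption, and this is not how fq_codel is implemented today:

    /*
     * Toy illustration (userspace, not kernel code) of the "bkweight"
     * idea: background flows live in their own list and are only
     * offered service every BKWEIGHT-th dequeue opportunity, unless
     * nothing else has anything to send.
     */
    #include <stdio.h>

    #define BKWEIGHT 20

    struct flow {
        const char *name;
        int backlog;        /* packets waiting in this flow's queue */
    };

    /* pick the next flow with something to send, round-robin */
    static struct flow *pick(struct flow *flows, int n, int *cursor)
    {
        for (int i = 0; i < n; i++) {
            struct flow *f = &flows[(*cursor + i) % n];
            if (f->backlog > 0) {
                *cursor = (*cursor + i + 1) % n;
                return f;
            }
        }
        return NULL;
    }

    int main(void)
    {
        struct flow normal[] = { { "web", 50 }, { "ssh", 50 } };
        struct flow slow[]   = { { "torrent", 200 } };
        int ncur = 0, scur = 0;
        unsigned long deliveries = 0;

        /* dequeue loop: slow_flows only get a look-in every BKWEIGHT
         * packet deliveries, normal flows are served in between */
        for (int tick = 0; tick < 120; tick++) {
            struct flow *f = NULL;

            if (deliveries % BKWEIGHT == BKWEIGHT - 1)
                f = pick(slow, 1, &scur);
            if (!f)
                f = pick(normal, 2, &ncur);
            if (!f)
                f = pick(slow, 1, &scur);   /* work-conserving fallback */
            if (!f)
                break;

            f->backlog--;
            deliveries++;
        }

        printf("after %lu deliveries: web=%d ssh=%d torrent=%d left\n",
               deliveries, normal[0].backlog, normal[1].backlog,
               slow[0].backlog);
        return 0;
    }

The point is just that background flows stay FQ'd among themselves, but
only see one dequeue opportunity in every bkweight deliveries while
foreground traffic is present.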
I spent some time evaluating linux fq_codel vs the ns2 nfq_codel
version I just got working. With 150 bidirectional competing streams at
100Mbit, it kept roughly 20% fewer packets in queue (110 vs 140). Next
up on my list is longer RTTs and wifi, but everything else was pretty
equivalent.

The effects of fiddling with /proc/sys/net/ipv4/tcp_limit_output_bytes
were even more remarkable. At 6000, I would get down to a nice steady
71-81 packets in queue on that 150 stream workload. So I started
thinking through and playing with how TSQ works.

At one hop, 100Mbit, with a BQL of 3000 and tcp_limit_output_bytes of
6000, all offloads off, nfq_codel on both ends:

  1 stream: 92.85Mbit throughput, qdisc backlog of 0.
  2 netperf streams, bidirectional: 91.47 each, darn close to
  theoretical, less than one packet in the backlog.
  4 streams: backlog a little over 3 (and sums to 91.94 in each direction).
  8 streams: backlog of 8, optimal throughput.

Repeating the 8 stream test with tcp_limit_output_bytes of 1500, I get
around 3 packets outstanding and optimal throughput (1 stream: 42Mbit,
obviously starved; 150 streams: 82...). With 8 streams and the limit
set to 127k, I get 50 packets outstanding in the queue and the same
throughput (150 streams: ~100).

So I might argue that a more "right" number for tcp_limit_output_bytes
is not 128k per TCP socket, but (BQL_limit*2/active_sockets), in
conjunction with fq_codel.

I realize that raises interesting questions as to when to use TSO/GSO
and how to schedule tcp packet releases, and it pushes the window
reduction issue all the way up into the tcp stack rather than
responding to indications from the qdisc... but it does get you closer
to a 0 backlog in the qdisc. And *usually* the bottleneck link is not
on the host but on something in between, and that's where your
signalling comes from, anyway.
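To put rough numbers on the (BQL_limit*2/active_sockets) suggestion,
here is a back-of-envelope sketch in C; the clamp to one MTU-sized
packet is my own assumption, not something measured above:

    /*
     * Sketch of the suggested per-socket limit:
     *   limit = BQL_limit * 2 / active_sockets
     * presumably clamped so a socket can always send at least one
     * full-sized packet (the clamp is an assumption on my part).
     */
    #include <stdio.h>

    #define MTU 1500  /* assumed wire MTU, offloads off */

    static unsigned int tcp_limit(unsigned int bql_limit,
                                  unsigned int active_sockets)
    {
        unsigned int limit = bql_limit * 2 / active_sockets;

        return limit < MTU ? MTU : limit;
    }

    int main(void)
    {
        /* BQL of 3000 bytes, as in the 100Mbit tests above */
        unsigned int socks[] = { 1, 2, 4, 8, 150 };

        for (int i = 0; i < 5; i++)
            printf("%3u active sockets -> per-socket limit %u bytes\n",
                   socks[i], tcp_limit(3000, socks[i]));
        return 0;
    }

With the BQL of 3000 used above, that yields 6000 bytes for a single
stream and 1500 bytes at 4 or more streams, which at least lines up
with the 6000 and 1500 byte settings that behaved well in these tests.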
--
Dave Täht
http://www.bufferbloat.net/projects/cerowrt/wiki - "3.3.8-17 is out with fq_codel!"
_______________________________________________
Codel mailing list
[email protected]
https://lists.bufferbloat.net/listinfo/codel