> 2) Users are pissed off, because they clicked on a web page, and got nothing
> back. They retry on their screen, or they try another site. Meanwhile, the
> underlying TCP connection remains there, pumping the network full of more
> packets on that old path, which is still backed up with packets that haven't
> been delivered that are sitting in queues.
Agree. I've experienced that as utilization of a network segment or supporting
network systems (e.g. DNS) increases, you may see a very small amount of delay
creep in - but not much, as things are stable until they are *quite suddenly*
not so. At that stability inflection point you immediately & dramatically fall
off a cliff, which is then exacerbated by what you note here - user and
machine-based retries/retransmissions that drive a huge increase in traffic.
The solution has typically been throwing massive new capacity at it until the
storm recedes.

> I should say that most operators, and especially ATT in this case, do not
> measure end-to-end latency. Instead they use Little's Lemma to query routers
> for their current throughput in bits per second, and calculate latency as if
> Little's Lemma applied.

IMO network operators' views/practices vary widely & have been evolving quite
a bit in recent years. Yes, it used to be all about capacity utilization
metrics, but I think that is changing. In my day job, we run E2E latency tests
(among others) to CPE, and the distribution is a lot more instructive than the
mean/median for continuously improving the network experience.

> And management responds, Hooray! Because utilization of 100% of their
> hardware is their investors' metric of maximizing profits. The hardware they
> are operating is fully utilized. No waste! And users are happy because no
> packets have been dropped!

Well, I hope it wasn't the case that 100% utilization meant they were 'green'
on their network KPIs. ;-) Ha. But I think you are correct that a network
engineering team would have been measured by how well they kept ahead of
utilization/demand, & network capacity decisions were driven in large part by
utilization trends. In a lot of networks I suspect an informal rule of thumb
arose that things got a little funny once p98 utilization got to around 94-95%
of link capacity - so work backward from there to figure out when you need to
trigger automatic capacity augments to avoid that (a rough sketch of that sort
of trigger is below).

While I do not think managing via utilization in that way is incorrect, ISTM
it's mostly used because the measure is an indirect proxy for end user QoE. I
think latency/delay is coming to be seen as at least as important a proxy, if
not a more direct one, for end user QoE. This is all still evolving and I have
to say is a super interesting & fun thing to work on. :-)

Jason
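
PS - to make the Little's Lemma point in the quoted text concrete, here's a toy
sketch (Python, made-up numbers, not any operator's actual tooling) of backing
a delay figure out of a router's reported queue occupancy and throughput via
W = L / lambda. Note it gives a single-hop average, not the end-to-end latency
a user actually sees.

    # Toy sketch (made-up numbers): the per-hop delay a router's counters
    # imply via Little's Law, W = L / lambda.

    def littles_law_delay_ms(queue_bits: float, throughput_bps: float) -> float:
        """Average time in the queue implied by Little's Law, in milliseconds."""
        if throughput_bps <= 0:
            raise ValueError("throughput must be positive")
        return (queue_bits / throughput_bps) * 1000.0

    # Example: a 1 Gbit/s link with 5 Mbit sitting in its queue.
    queue_bits = 5_000_000          # instantaneous queue occupancy, bits
    throughput_bps = 1_000_000_000  # reported throughput, bits per second
    print(f"implied delay: {littles_law_delay_ms(queue_bits, throughput_bps):.1f} ms")
    # -> 5.0 ms, an average for this one hop only - nothing about the rest of
    #    the path, and nothing about the tail the user actually feels.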
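
And a similarly rough sketch of the two things mentioned above - reading the
latency distribution rather than just the mean/median, and triggering a
capacity augment once p98 utilization crosses roughly 94-95% of link capacity.
The samples and thresholds are purely illustrative, not from any production
network.

    # Illustrative sketch: percentile view of E2E latency samples, plus a
    # p98-utilization trigger for capacity augments.
    from statistics import mean, quantiles

    def percentile(samples, p):
        """p-th percentile (1-99) with inclusive interpolation."""
        return quantiles(samples, n=100, method="inclusive")[p - 1]

    # Hypothetical E2E latency samples to a CPE, in milliseconds.
    latency_ms = [9, 10, 10, 11, 11, 12, 12, 13, 14, 15, 18, 22, 35, 80, 240]
    print(f"mean {mean(latency_ms):.1f} ms, p50 {percentile(latency_ms, 50):.1f} ms, "
          f"p98 {percentile(latency_ms, 98):.1f} ms")
    # The mean looks fine; the tail is where the bad experience lives.

    # Hypothetical 5-minute utilization samples on one link, fraction of capacity.
    utilization = [0.71, 0.74, 0.80, 0.83, 0.85, 0.88, 0.90, 0.93, 0.95, 0.96]
    P98_TRIGGER = 0.94  # informal rule of thumb from the thread, not a standard
    if percentile(utilization, 98) >= P98_TRIGGER:
        print("p98 utilization over threshold - time to plan a capacity augment")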