> 2) Users are pissed off, because they clicked on a web page, and got nothing 
> back. They retry on their screen, or they try another site. Meanwhile, the 
> underlying TCP connection remains there, pumping the network full of more 
> packets on that old path, which is still backed up with packets that haven't 
> been delivered that are sitting in queues.



Agree. I've experienced that as utilization of a network segment or of
supporting network systems (e.g. DNS) increases, you may see a very small
amount of delay creep in, but not much - things are stable until they are
*quite suddenly* not so. At that stability inflection point you immediately &
dramatically fall off a cliff, which is then exacerbated by what you note
here - user and machine-based retries/retransmissions that drive a huge
increase in traffic. The solution has typically been throwing massive new
capacity at it until the storm recedes.
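
For anyone who hasn't internalized why that cliff is so sharp, the classic
M/M/1 queueing result is a decent mental model: mean delay grows like
1/(1 - utilization), so it looks flat for ages and then blows up as you
approach 100%. A toy sketch (my numbers, nothing from the thread):

# Toy illustration: M/M/1 mean delay W = S / (1 - rho), where S is the
# per-packet service time and rho is link utilization.
service_time_ms = 1.0  # hypothetical per-packet service/transmission time
for rho in (0.50, 0.80, 0.90, 0.95, 0.98, 0.995):
    delay_ms = service_time_ms / (1.0 - rho)
    print(f"utilization {rho:6.1%} -> mean delay ~{delay_ms:7.1f} ms")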



> I should say that most operators, and especially ATT in this case, do not 
> measure end-to-end latency. Instead they use Little's Lemma to query routers 
> for their current throughput in bits per second, and calculate latency as if 
> Little's Lemma applied.



IMO network operators' views/practices vary widely & have been evolving quite
a bit in recent years. Yes, it used to be all about capacity utilization
metrics, but I think that is changing. In my day job, we run E2E latency tests
(among others) to CPE, and the distribution is a lot more instructive than the
mean/median for continuously improving the network experience.
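
To make that concrete, here is a tiny made-up example of why the tail matters
more than the average (numbers invented, percentile math is the crude
sort-and-index kind):

import statistics

# Hypothetical E2E RTT samples to CPE, in ms (made-up numbers).
samples_ms = [12, 13, 12, 14, 13, 12, 15, 13, 250, 12, 13, 400]

samples_sorted = sorted(samples_ms)
p99 = samples_sorted[min(len(samples_sorted) - 1, int(0.99 * len(samples_sorted)))]

print("mean  :", round(statistics.mean(samples_ms), 1), "ms")   # ~64.9
print("median:", statistics.median(samples_ms), "ms")           # 13.0
print("p99   :", p99, "ms")                                     # 400 - what users feel

# For contrast, the Little's-Law style estimate in the quote above is
# roughly: latency_s ~ queued_bits / throughput_bps, pulled from router
# counters rather than measured end to end.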



> And management responds, Hooray! Because utilization of 100% of their 
> hardware is their investors' metric of maximizing profits. The hardware they 
> are operating is fully utilized. No waste! And users are happy because no 
> packets have been dropped!



Well, I hope it wasn't the case that 100% utilization meant they were 'green'
on their network KPIs. ;-) Ha. But I think you are correct that a network
engineering team would have been measured by how well they kept ahead of
utilization/demand, with network capacity decisions driven in large part by
utilization trends. In a lot of networks I suspect an informal rule of thumb
arose that things got a little funny once p98 utilization reached around
94-95% of link capacity - so back up from there to figure out when you need to
trigger automatic capacity augments to avoid that. While I do not think
managing via utilization in that way is incorrect, ISTM it is mostly used
because the measure is an indirect proxy for end user QoE. I think
latency/delay is increasingly seen as at least as important, and perhaps a
more direct proxy for end user QoE. This is all still evolving, and I have to
say it is a super interesting & fun thing to work on. :-)
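
As a strawman of how that rule of thumb gets operationalized (the function
name and the 85% trigger are mine, purely illustrative):

# Strawman capacity-augment trigger.  If things get funny once p98
# utilization hits ~94-95% of link capacity, back off from there and kick
# off the augment at a lower threshold so the new capacity lands in time.
def needs_augment(util_samples, trigger=0.85):
    """util_samples: link utilization fractions (0.0-1.0), e.g. 5-minute
    averages over the planning window; trigger is a made-up threshold."""
    s = sorted(util_samples)
    p98 = s[min(len(s) - 1, int(0.98 * len(s)))]
    return p98 >= trigger

# e.g. needs_augment(last_month_5min_utils) being True means start the augment.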



Jason

_______________________________________________
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat
