--- Begin Message ---
> 2) Users are pissed off, because they clicked on a web page, and got nothing
> back. They retry on their screen, or they try another site. Meanwhile, the
> underlying TCP connection remains there, pumping the network full of more
> packets on that old path, which is still backed up with packets that haven't
> been delivered that are sitting in queues.
Agree. I’ve experienced this: as utilization of a network segment or of supporting
systems (e.g. DNS) increases, a very small amount of delay may creep in, but not
much; things are stable until they are *quite suddenly* not. At that inflection
point you immediately and dramatically fall off a cliff, which is then exacerbated
by what you note here: user- and machine-driven retries/retransmissions that drive
a huge increase in traffic. The solution has typically been to throw massive new
capacity at the problem until the storm recedes.
> I should say that most operators, and especially ATT in this case, do not
> measure end-to-end latency. Instead they use Little's Lemma to query routers
> for their current throughput in bits per second, and calculate latency as if
> Little's Lemma applied.
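The quoted point can be illustrated with a toy sketch. Little's Law says W = L / λ (mean delay = mean occupancy / throughput), so a latency figure inferred from throughput counters is only as good as the queue-occupancy assumption behind it. The queue depths below are hypothetical numbers chosen purely for illustration:

```python
# Little's Law: W = L / lambda
# (mean queueing delay = mean queue occupancy / throughput).
# A router's counters report throughput; if the operator plugs in an
# assumed nominal queue depth, the inferred delay looks flat even while
# the real queue is bloating under congestion.

def littles_delay(queue_bits: float, throughput_bps: float) -> float:
    """Mean queueing delay in seconds per Little's Law, W = L / lambda."""
    return queue_bits / throughput_bps

line_rate = 1e9  # a 1 Gbit/s link running "full"

# Assumed shallow queue (1 Mbit) -> inferred delay looks fine: 1 ms.
assumed = littles_delay(1e6, line_rate)

# Actual bloated queue (500 Mbit) during congestion -> 500 ms.
actual = littles_delay(500e6, line_rate)
```

Same throughput in both cases, wildly different delay: the throughput counter alone cannot distinguish them, which is why end-to-end measurement matters.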
IMO, network operators’ views and practices vary widely and have been evolving
quite a bit in recent years. Yes, it used to be all about capacity-utilization
metrics, but I think that is changing. In my day job we run E2E latency tests
(among others) to CPE, and the full distribution is a lot more instructive than
the mean/median for continuously improving the network experience.
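To make the distribution-vs-mean point concrete, here is a minimal sketch with made-up RTT samples (the values and the nearest-rank percentile helper are illustrative, not from any real probe data):

```python
import statistics

# Hypothetical E2E RTT samples (ms) from probes to CPE: mostly fast,
# with a small tail of bloated samples that the mean alone obscures.
rtts = [12, 13, 12, 14, 13, 12, 15, 13, 250, 480]

def percentile(samples, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n)."""
    s = sorted(samples)
    rank = -(-len(s) * p // 100)  # ceiling division via negation
    return s[max(0, rank - 1)]

mean = statistics.mean(rtts)   # ~83 ms: looks merely "mediocre"
p50 = percentile(rtts, 50)     # 13 ms: the typical user is fine
p99 = percentile(rtts, 99)     # 480 ms: the tail is awful
```

The mean (~83 ms) hides the fact that most users see 13 ms while a few see nearly half a second, which is exactly the kind of tail behavior that drives the "fell off a cliff" experience.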
> And management responds, Hooray! Because utilization of 100% of their
> hardware is their investors' metric of maximizing profits. The hardware they
> are operating is fully utilized. No waste! And users are happy because no
> packets have been dropped!
Well, I hope it wasn’t the case that 100% utilization meant they were ‘green’ on
their network KPIs. ;-) Ha. But I think you are correct that a network engineering
team would have been measured by how well they kept ahead of utilization/demand,
with network capacity decisions driven in large part by utilization trends. In a
lot of networks I suspect an informal rule of thumb arose that things got a
little funny once p98 utilization reached around 94-95% of link capacity, so you
work backward from there to figure out when to trigger automatic capacity
augments and avoid that point. While I do not think managing via utilization in
that way is incorrect, ISTM it is mostly used because the measure is an indirect
proxy for end-user QoE. Latency/delay is increasingly seen as at least as
important, if not a more direct proxy for end-user QoE. This is all still
evolving, and I have to say it is a super interesting and fun thing to work on.
:-)
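For what it’s worth, that rule of thumb is simple enough to sketch. The 94% threshold, the sample values, and the function names here are all assumptions for illustration, not anyone’s actual augment policy:

```python
def percentile(samples, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n)."""
    s = sorted(samples)
    rank = -(-len(s) * p // 100)  # ceiling division via negation
    return s[max(0, rank - 1)]

def needs_augment(utilization_samples, threshold=0.94):
    """Flag a capacity augment when p98 utilization over the window
    exceeds the (assumed) ~94% rule-of-thumb threshold."""
    return percentile(utilization_samples, 98) > threshold

# A window of 100 utilization samples: mostly ~70% busy, but the
# busy-hour peaks push the 98th percentile over the line.
window = [0.70] * 90 + [0.96] * 10
trigger = needs_augment(window)  # True: time to order capacity
```

The point of using p98 rather than the mean is the same as with latency: the busy-hour tail, not the average, is what predicts when users start falling off the cliff.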
Jason
--- End Message ---
_______________________________________________
Cerowrt-devel mailing list
Cerowrt-devel@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/cerowrt-devel