On 6 Nov 2020, at 12:18, Jesper Dangaard Brouer wrote:

On Fri, 06 Nov 2020 10:18:10 +0100
"Thomas Rosenstein" <thomas.rosenst...@creamfinance.com> wrote:

I just tested 5.9.4; it seems to also fix it partly. I have long
stretches where it looks good, and then some increases again. (Stock
3.10 has them too, but not so high, rather 1-3 ms.)


That you have long stretches where latency looks good is interesting
information.   My theory is that your system has a periodic userspace
process that does a kernel syscall that takes too long, blocking the
network card from processing packets. (Note it can also be a kernel
thread.)
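
One way to look for such a stall from userspace is a tiny busy-loop that reports gaps in its own timestamps; if a syscall or kernel thread blocks the CPU for a few milliseconds, the gap shows up at roughly the period of the ping spikes. A minimal, illustrative sketch (not the IRQ-to-softirq latency tool mentioned further down; the 1 ms threshold is an arbitrary choice, it burns a full core while running, and pinning it with taskset to the CPU that services the NIC IRQ gives the clearest signal):

#!/usr/bin/env python3
"""Minimal stall detector (illustrative sketch).

Busy-samples a monotonic clock and prints any gap larger than THRESHOLD_MS.
If a periodic process or kernel thread stalls this CPU for a few
milliseconds, the gap should show up here; run it pinned (e.g. with
taskset) to the CPU that services the NIC IRQ for the clearest signal.
"""
import time

THRESHOLD_MS = 1.0  # assumption: gaps above 1 ms are worth reporting


def main():
    last = time.monotonic()
    while True:
        now = time.monotonic()
        gap_ms = (now - last) * 1000.0
        if gap_ms > THRESHOLD_MS:
            print(f"{time.strftime('%H:%M:%S')} stall of {gap_ms:.2f} ms")
        last = now


if __name__ == "__main__":
    main()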

The weird part is: at first I only updated router-02 and pinged router-04 (out of the traffic flow), and there I noticed these long stretches of OK pings.

When I then also updated router-03 and router-04, the old behaviour kind of came back, which confused me.

Could this be related to netlink? I have gobgpd running on these routers, which injects routes via netlink. But the churn rate during the tests is minimal, maybe 30-40 routes per second.

Otherwise we got: salt-minion, collectd, node_exporter, sshd
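
To double-check that churn estimate and see whether bursts of netlink updates line up with the latency spikes, a rough counter over ip monitor route might help; a sketch (it only reports when updates actually arrive, so quiet seconds stay silent):

#!/usr/bin/env python3
"""Rough netlink route-churn counter (illustrative sketch).

Reads "ip monitor route" line by line and prints how many route updates
arrived in each one-second window, to confirm the ~30-40 routes/s estimate
and to see whether bursts of churn line up with the latency spikes.
Note: it only reports when a line arrives, so quiet periods print nothing.
"""
import subprocess
import time


def main():
    proc = subprocess.Popen(
        ["ip", "monitor", "route"],
        stdout=subprocess.PIPE,
        text=True,
    )
    window_start = time.monotonic()
    count = 0
    for line in proc.stdout:
        count += 1
        now = time.monotonic()
        if now - window_start >= 1.0:
            print(f"{time.strftime('%H:%M:%S')} {count} route updates/s")
            window_start = now
            count = 0


if __name__ == "__main__":
    main()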


Another theory is that the NIC hardware does strange things, but it is
not very likely, e.g. delaying the packets before generating the IRQ,
which would hide it from my IRQ-to-softirq latency tool.

A question: What traffic control qdisc are you using on your system?

Kernel 4+ uses pfifo, but there are no dropped packets.
I have also tested with fq_codel: same behaviour, and also no weirdness in the packet queue itself.

Kernel 3.10 uses mq, and noqueue for the VLAN interfaces.


Here's the mail archive link for the question on lartc:

https://www.spinics.net/lists/lartc/msg23774.html


Have you looked at the obvious case of whether any of your qdiscs
report a large backlog during the incidents?

As said above, nothing in the qdiscs and nothing reported.
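
To be sure a short-lived buildup isn't slipping between manual checks, the qdisc stats could be polled at a higher rate during an incident. A rough sketch (the interface name eth0 and the 50 ms interval are placeholders, and the byte parsing is approximate):

#!/usr/bin/env python3
"""Poll qdisc backlog at a high rate (illustrative sketch).

Runs "tc -s qdisc show dev IFACE" in a loop and prints any sample with a
non-zero backlog, so a short-lived queue buildup during an incident is
not missed between manual checks.
"""
import re
import subprocess
import time

IFACE = "eth0"      # placeholder: replace with the actual 10G interface
INTERVAL_S = 0.05   # placeholder: 20 samples per second

UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def backlog_bytes(tc_output: str) -> int:
    # tc prints e.g. "backlog 0b 0p" or "backlog 12Kb 8p"; sum the byte counts.
    total = 0
    for value, unit in re.findall(r"backlog (\d+)([KMG]?)b", tc_output):
        total += int(value) * UNITS[unit]
    return total


def main():
    while True:
        out = subprocess.run(
            ["tc", "-s", "qdisc", "show", "dev", IFACE],
            capture_output=True, text=True,
        ).stdout
        if backlog_bytes(out) > 0:
            print(time.strftime("%H:%M:%S"))
            print(out)
        time.sleep(INTERVAL_S)


if __name__ == "__main__":
    main()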



For example, the latency increases look like this:

64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms

and then again:

64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
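
For longer runs, a small parser like the sketch below could flag the spikes automatically instead of eyeballing the output (the 1 ms threshold is an arbitrary choice); it can be fed from e.g. ping -i 0.2 x.x.x.x:

#!/usr/bin/env python3
"""Count ping samples above a threshold (illustrative sketch).

Reads ping output from stdin, prints every sample above THRESHOLD_MS as
it arrives, and prints a summary when the input ends or on Ctrl-C.
"""
import re
import sys

THRESHOLD_MS = 1.0  # assumption: spikes of interest are above 1 ms


def main():
    total = spikes = 0
    worst = 0.0
    try:
        for line in sys.stdin:
            match = re.search(r"time=([\d.]+) ms", line)
            if not match:
                continue
            rtt = float(match.group(1))
            total += 1
            worst = max(worst, rtt)
            if rtt > THRESHOLD_MS:
                spikes += 1
                print(line.rstrip(), flush=True)
    except KeyboardInterrupt:
        pass
    print(f"{spikes}/{total} samples above {THRESHOLD_MS} ms, worst {worst:.3f} ms")


if __name__ == "__main__":
    main()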

Very low ping times like these tell me that you are measuring very close
to the target machine, which is good. Here on the bufferbloat list, we
are always suspicious of network equipment being used in these kinds of
setups, as experience tells us that this can be the cause of
bufferbloat latency.
Yes, I'm just testing across two machines connected directly to the same switch;
basically that's the best-case scenario apart from a direct connection.

I do also use a VLAN on this interface, so the pings go through the VLAN stack!


You mention some fs.com switches (in your description below the signature);
can you tell us more?

It's an fs.com N5850-48S6Q:

48 ports of 10 Gbit + 6 ports of 40 Gbit.

Only 6 ports are in use at 10G and 2 at 1G, with basically no traffic.



[...]
I have a feeling that maybe not all config options were correctly moved
to the newer kernel.

Or there's a big bug somewhere ... (though it would seem rather weird
for me to be the first one to discover it).

I really appreciate that you report this. This is a periodic issue,
which often results in people not reporting it.

Even if we find this to be caused by some process running on your
system, or by a bad config, it is really important that we find the
root cause.

I'll rebuild the 5.9 kernel on one of the 3.10 kernel machines and see
if it makes a difference ...
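
Since the suspicion above is that not all config options were carried over, diffing the old and new kernel configs first might narrow it down; a rough sketch (the /boot/config-* paths shown in the comment are assumptions):

#!/usr/bin/env python3
"""Compare two kernel config files (illustrative sketch).

Prints options that are set in the old config but missing or different in
the new one, to check whether anything relevant (e.g. preemption, NAPI or
timer settings) got lost when the config was carried forward.
"""
import sys


def load_config(path):
    opts = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("CONFIG_") and "=" in line:
                key, value = line.split("=", 1)
                opts[key] = value
            elif line.startswith("# CONFIG_") and line.endswith(" is not set"):
                opts[line.split()[1]] = "n"
    return opts


def main():
    # assumption: paths passed on the command line, e.g.
    #   ./config_diff.py /boot/config-3.10.0 /boot/config-5.9.4
    old, new = load_config(sys.argv[1]), load_config(sys.argv[2])
    for key in sorted(old):
        if old[key] != new.get(key, "<missing>"):
            print(f"{key}: {old[key]} -> {new.get(key, '<missing>')}")


if __name__ == "__main__":
    main()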

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

On Wed, 04 Nov 2020 16:23:12 +0100
Thomas Rosenstein via Bloat <bloat@lists.bufferbloat.net> wrote:

General Info:

Routers are connected to each other with 10G Mellanox Connect-X
cards via 10G SFP+ DAC cables through a 10G switch from fs.com.
Latency generally is around 0.18 ms between all routers (4).
Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3.
2 of the 4 routers are connected upstream with a 1G connection (separate
port, same network card).
All routers have the full internet routing tables, i.e. 80k entries for
IPv6 and 830k entries for IPv4
Conntrack is disabled (-j NOTRACK)
Kernel 5.4.60 (custom)
2x Xeon X5670 @ 2.93 GHz
96 GB RAM