Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-06 Thread Jesper Dangaard Brouer
On Fri, 06 Nov 2020 18:04:49 +0100
"Thomas Rosenstein"  wrote:

> On 6 Nov 2020, at 15:13, Jesper Dangaard Brouer wrote:
> 
> > On Fri, 6 Nov 2020 13:53:58 +0100
> > Jesper Dangaard Brouer  wrote:
> >  
> >> [...]  
> 
>  Could this be related to netlink? I have gobgpd running on these
>  routers, which injects routes via netlink.
>  But the churn rate during the tests is very minimal, maybe 30 - 40
>  routes every second.  
> >>
> >> Yes, this could be related.  The internal data structure for FIB
> >> lookups is a fib_trie, a compressed patricia tree related to the
> >> radix tree idea.  Thus, I can imagine that the kernel has to
> >> rebuild/rebalance the tree with all these updates.  
> >
> > Reading the kernel code: the IPv4 fib_trie code is very well tuned,
> > fully RCU-ified, meaning the read side is lock-free.  The resize()
> > function in net/ipv4/fib_trie.c has a max_work limiter to avoid it
> > using too much time, and the update path also looks lock-free.
> >
> > The IPv6 update path looks more scary, as it seems to take a "bh"
> > spinlock that can block softirq from running code in
> > net/ipv6/ip6_fib.c (spin_lock_bh(&...->fib6_table->tb6_lock)).  
> 
> I'm using ping on IPv4, but I'll try to see if IPv6 makes any 
> difference!

I think you misunderstand me.  I'm not asking you to use ping6.  The
gobgpd daemon will update both IPv4 and IPv6 routes, right?  Updating
IPv6 routes is more problematic than updating IPv4 routes.  An IPv6
route-table update can potentially stall softirq from running, which is
what the latency tool was measuring... and it did show some outliers.


> > Have you tried to use 'perf record' to observe what is happening on 
> > the system while these latency incidents happen?  (Let me know if you 
> > want some cmdline hints.)  
> 
> Haven't tried this yet. If you have some hints on what events to 
> monitor, I'll take them!

Okay. To record everything on the system (-a) and save call-graphs (-g),
run for 5 seconds (by profiling the sleep command):

 # perf record -g -a  sleep 5

To view the result, simply use 'perf report'. You likely want to use
the --no-children option, as you are profiling the kernel (and not a
userspace program whose 'children' you want grouped).  I also include
the CPU column via '--sort cpu,comm,dso,symbol', and you can
select/zoom in on a specific CPU via '-C <zero-indexed-cpu-num>'.

 # perf report --sort cpu,comm,dso,symbol --no-children

When we ask you to provide the output, you can use the --stdio option
and share the text output via a pastebin link, as it is very long.
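For example, to dump the full report into a text file for sharing (the
filename here is just an example):

 # perf report --sort cpu,comm,dso,symbol --no-children --stdio > perf-report.txt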

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



Re: [Bloat] Comparing bufferbloat tests (was: We built a new bufferbloat test and keen for feedback)

2020-11-06 Thread Sebastian Moeller
Hi Toke,

> Sent: Friday, 06 November 2020 at 17:17
> From: "Toke Høiland-Jørgensen via Bloat" 
> To: "Stephen Hemminger" , "Toke Høiland-Jørgensen 
> via Bloat" 
> Subject: Re: [Bloat] Comparing bufferbloat tests (was: We built a new 
> bufferbloat test and keen for feedback)
>
> Stephen Hemminger  writes:
> 
> > PS: Why do US providers have such asymmetric bandwidth? Getting 
> > something symmetric requires going to a $$$ business rate.
> 
> For Cable, the DOCSIS standard is asymmetric by design, but not *that*
> asymmetric. 

   Unfortunately it is that bad: DOCSIS 3.0 allows downstream from 108 MHz to 
1002 MHz and upstream from 30 MHz to 85 MHz, so (1002-108)/(85-30) ≈ 16:1 in 
raw spectrum, and not all cable companies even have matching upstream filters 
for 85 MHz. Then again, the "one ACK per two full segments" rule puts a lower 
bound on what an ISP can get away with, if the customer is expected to at 
least see the advertised downstream rate in speedtests; I can never remember 
whether that works out to essentially 20:1 or 40:1, but GRO/GSO and friends, 
as well as ACK filtering, have since reduced the ACK traffic somewhat...
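   (Rough back-of-the-envelope, assuming ~1514-byte full-size frames and
~64-byte pure ACKs on the wire: one ACK per segment gives 1514/64 ≈ 24,
i.e. roughly 20:1, while one ACK per two segments gives 3028/64 ≈ 47,
i.e. roughly 40:1, which is presumably where those two figures come from.)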


> I *think* the rest is because providers have to assign
> channels independently for upstream and downstream, and if they just
> assign them all to downstream they can advertise a bigger number...

   They wish; once they deploy upstream amplifiers, these have a fixed 
frequency split and need to be replaced if the split is changed, which gets 
expensive quickly... or so I have heard.

Best Regards
Sebastian

> 
> -Toke


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-06 Thread Thomas Rosenstein via Bloat



On 6 Nov 2020, at 15:13, Jesper Dangaard Brouer wrote:


On Fri, 6 Nov 2020 13:53:58 +0100
Jesper Dangaard Brouer  wrote:


[...]


Could this be related to netlink? I have gobgpd running on these
routers, which injects routes via netlink.
But the churn rate during the tests is very minimal, maybe 30 - 40
routes every second.


Yes, this could be related.  The internal data structure for FIB
lookups is a fib_trie, a compressed patricia tree related to the
radix tree idea.  Thus, I can imagine that the kernel has to
rebuild/rebalance the tree with all these updates.


Reading the kernel code: the IPv4 fib_trie code is very well tuned,
fully RCU-ified, meaning the read side is lock-free.  The resize()
function in net/ipv4/fib_trie.c has a max_work limiter to avoid it
using too much time, and the update path also looks lock-free.

The IPv6 update path looks more scary, as it seems to take a "bh"
spinlock that can block softirq from running code in
net/ipv6/ip6_fib.c (spin_lock_bh(&...->fib6_table->tb6_lock)).


I'm using ping on IPv4, but I'll try to see if IPv6 makes any 
difference!




Have you tried to use 'perf record' to observe what is happening on 
the system while these latency incidents happen?  (Let me know if you 
want some cmdline hints.)


Haven't tried this yet. If you have some hints on what events to monitor, 
I'll take them!




--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



Re: [Bloat] Comparing bufferbloat tests (was: We built a new bufferbloat test and keen for feedback)

2020-11-06 Thread Toke Høiland-Jørgensen via Bloat
Stephen Hemminger  writes:

> PS: Why do US providers have such asymmetric bandwidth? Getting 
> something symmetric requires going to a $$$ business rate.

For Cable, the DOCSIS standard is asymmetric by design, but not *that*
asymmetric. I *think* the rest is because providers have to assign
channels independently for upstream and downstream, and if they just
assign them all to downstream they can advertise a bigger number...

-Toke


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-06 Thread Jesper Dangaard Brouer
On Fri, 6 Nov 2020 13:53:58 +0100
Jesper Dangaard Brouer  wrote:

> [...]
> > >
> > > Could this be related to netlink? I have gobgpd running on these 
> > > routers, which injects routes via netlink.
> > > But the churn rate during the tests is very minimal, maybe 30 - 40 
> > > routes every second.  
> 
> Yes, this could be related.  The internal data structure for FIB
> lookups is a fib_trie, a compressed patricia tree related to the
> radix tree idea.  Thus, I can imagine that the kernel has to
> rebuild/rebalance the tree with all these updates.

Reading the kernel code: the IPv4 fib_trie code is very well tuned,
fully RCU-ified, meaning the read side is lock-free.  The resize()
function in net/ipv4/fib_trie.c has a max_work limiter to avoid it
using too much time, and the update path also looks lock-free.

The IPv6 update path looks more scary, as it seems to take a "bh"
spinlock that can block softirq from running code in
net/ipv6/ip6_fib.c (spin_lock_bh(&...->fib6_table->tb6_lock)).
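If you want to see whether the IPv4 trie actually gets resized a lot
during the gobgpd churn, one thing you could look at (just a suggestion;
how much detail it shows depends on CONFIG_IP_FIB_TRIE_STATS) is the
trie statistics the kernel exposes:

 # cat /proc/net/fib_triestat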

Have you tried to use 'perf record' to observe what is happening on the 
system while these latency incidents happen?  (Let me know if you want 
some cmdline hints.)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-06 Thread Jesper Dangaard Brouer
On Fri, 06 Nov 2020 12:45:31 +0100
Toke Høiland-Jørgensen  wrote:

> "Thomas Rosenstein"  writes:
> 
> > On 6 Nov 2020, at 12:18, Jesper Dangaard Brouer wrote:
> >  
> >> On Fri, 06 Nov 2020 10:18:10 +0100
> >> "Thomas Rosenstein"  wrote:
> >>  
> > I just tested 5.9.4 seems to also fix it partly, I have long
> > stretches where it looks good, and then some increases again. (3.10
> > Stock has them too, but not so high, rather 1-3 ms)
> >  
> >>
> >> That you have long stretches where latency looks good is interesting
> >> information.   My theory is that your system has a periodic userspace
> >> process that does a kernel syscall that takes too long, blocking the
> >> network card from processing packets. (Note it can also be a kernel
> >> thread).  
> >
[...]
> >
> > Could this be related to netlink? I have gobgpd running on these 
> > routers, which injects routes via netlink.
> > But the churn rate during the tests is very minimal, maybe 30 - 40 
> > routes every second.

Yes, this could be related.  The internal data structure for FIB
lookups is a fib_trie, a compressed patricia tree related to the
radix tree idea.  Thus, I can imagine that the kernel has to
rebuild/rebalance the tree with all these updates.
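To sanity-check the actual churn rate, you could also log the netlink
route updates for a while and count them, e.g. (the log path is just an
example):

 # ip -timestamp monitor route > /tmp/route-churn.log

and then count the updates per second in the log.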

> >
> > Otherwise we got: salt-minion, collectd, node_exporter, sshd  
> 
> collectd may be polling the interface stats; try turning that off?

It should be fairly easy for you to test the theory of whether any of
these services (except sshd) is causing this, by turning them off
individually.
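For example (assuming the services are managed by systemd, and using the
names from your list), something like:

 # systemctl stop collectd
 # systemctl stop node_exporter
 # systemctl stop salt-minion

one at a time, while watching the ping latency.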


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



Re: [Bloat] We built a new bufferbloat test and keen for feedback

2020-11-06 Thread Sam

On 11/4/20 3:30 PM, Sam Westwood wrote:

Hi everyone,

My name is Sam and I'm the co-founder and COO of Waveform.com. At 
Waveform we provide equipment to help improve cell phone service, and 
being in the industry we've always been interested in all aspects of 
network connectivity. Bufferbloat for us has always been interesting, 
and while there are a few tests out there we never found one that was 
fantastic. So we thought we'd try and build one!


My colleague Arshan has built the test, which we based upon the 
Cloudflare Speedtest template that was discussed earlier in the summer 
in a previous thread.


We measure bufferbloat under two conditions: when downlink is saturated 
and when uplink is saturated. The test involves three stages: Unloaded, 
Downlink Saturated, and Uplink Saturated. In the first stage we simply 
measure latency to a file hosted on a CDN. This is usually around 5ms 
and could vary a bit based on the user's location. We use the webTiming 
API to find the time-to-first-byte, and consider that as the latency. In 
the second stage we run a download, while simultaneously measuring 
latency. In the third stage we do the same but for upload. Both download 
and upload usually take around 5 seconds. We show the median, first 
quartile and the third quartile on distribution charts corresponding to 
each stage to provide a visual representation of the latency variations. 
For download and upload we have used Cloudflare's speedtest backend.
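(If anyone wants to approximate the "unloaded" stage from the command
line, something like the following gives a comparable time-to-first-byte
number; the URL is a placeholder, and this is only an approximation of
what the browser timing API reports.)

 $ curl -o /dev/null -s -w 'TTFB: %{time_starttransfer}s\n' https://example.com/testfile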


You can find the test here: https://www.waveform.com/apps/dev-arshan 



We built it testing on Chrome, but it works on Firefox and mobile too. 
On mobile the results may be a little different, as the APIs aren't 
available, so instead we implemented a more manual method, which can 
be a little noisier.


This is a really early alpha, and so we are keen to get any and all 
feedback you have :-). Things that we would particularly like feedback on:


  * How does the bufferbloat measure compare to other tests you may have
run on the same connection (e.g. dslreports, fast.com)
  * How the throughput results (download/upload/latency) look compared
to other tools
  * Any feedback on the user interface of the test itself? We know that
before releasing more widely we will put more effort into explaining
bufferbloat than we have so far.
  * Anything else you would like to give feedback on?

We have added a feature to share results via a URL, so please feel free 
to share these if you have specific feedback.


Thanks!
Sam and Arshan

*
Sam Westwood
Co-Founder & COO | RSRF & Waveform
E s...@waveform.com 
D   (949) 207-3175
T   1-800-761-3041 Ext. 100
W www.rsrf.com  & www.waveform.com 







Looks pretty identical to what fast.com gave me. I'm on 50/50 fiber and 
Firefox 82.

https://www.waveform.com/apps/dev-arshan?test-id=58dfa326-23d4-44a3-9059-b6011b104ccb

--Sam


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-06 Thread Thomas Rosenstein via Bloat



On 6 Nov 2020, at 12:45, Toke Høiland-Jørgensen wrote:


"Thomas Rosenstein"  writes:


On 6 Nov 2020, at 12:18, Jesper Dangaard Brouer wrote:


On Fri, 06 Nov 2020 10:18:10 +0100
"Thomas Rosenstein"  wrote:


I just tested 5.9.4 seems to also fix it partly, I have long
stretches where it looks good, and then some increases again. (3.10
Stock has them too, but not so high, rather 1-3 ms)



That you have long stretches where latency looks good is interesting
information.   My theory is that your system has a periodic userspace
process that does a kernel syscall that takes too long, blocking the
network card from processing packets. (Note it can also be a kernel
thread).


The weird part is, I first only updated router-02 and pinged to
router-04 (out of traffic flow), there I noticed these long stretches of
ok ping.

When I updated also router-03 and router-04, the old behaviour kind of
was back, this confused me.

Could this be related to netlink? I have gobgpd running on these
routers, which injects routes via netlink.
But the churn rate during the tests is very minimal, maybe 30 - 40
routes every second.

Otherwise we got: salt-minion, collectd, node_exporter, sshd


collectd may be polling the interface stats; try turning that off?


I can, but shouldn't that also influence iperf3 performance then?





Another theory is that the NIC HW does strange things, but it is not very
likely.  E.g. delaying the packets before generating the IRQ interrupt,
which hides it from my IRQ-to-softirq latency tool.

A question: What traffic control qdisc are you using on your system?


kernel 4+ uses pfifo, but there's no dropped packets
I have also tested with fq_codel, same behaviour and also no weirdness
in the packets queue itself

kernel 3.10 uses mq, and for the vlan interfaces noqueue


Do you mean that you only have a single pfifo qdisc on kernel 4+? Why is
it not using mq?


oh, actually, I just noticed that's a remnant of the previous tests, I had

net.core.default_qdisc = fq_codel

in the sysctl.conf... so disregard my previous wrong info
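The currently effective default can be double-checked with:

 # sysctl net.core.default_qdisc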



so all kernels by default look like that, mq + pfifo_fast:

qdisc noqueue 0: dev lo root refcnt 2
qdisc mq 0: dev eth0 root
qdisc pfifo_fast 0: dev eth0 parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :5 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :6 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :7 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :8 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1

qdisc mq 0: dev eth1 root
qdisc pfifo_fast 0: dev eth1 parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :5 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :6 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :7 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :8 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1

qdisc mq 0: dev eth2 root
qdisc pfifo_fast 0: dev eth2 parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :5 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :6 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :7 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :8 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1

qdisc mq 0: dev eth3 root
qdisc pfifo_fast 0: dev eth3 parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 parent :5 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 parent :6 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 parent :7 

Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-06 Thread Toke Høiland-Jørgensen via Bloat
"Thomas Rosenstein"  writes:

> On 6 Nov 2020, at 12:18, Jesper Dangaard Brouer wrote:
>
>> On Fri, 06 Nov 2020 10:18:10 +0100
>> "Thomas Rosenstein"  wrote:
>>
> I just tested 5.9.4 seems to also fix it partly, I have long
> stretches where it looks good, and then some increases again. (3.10
> Stock has them too, but not so high, rather 1-3 ms)
>
>>
>> That you have long stretches where latency looks good is interesting
>> information.   My theory is that your system has a periodic userspace
>> process that does a kernel syscall that takes too long, blocking the
>> network card from processing packets. (Note it can also be a kernel
>> thread).
>
> The weird part is, I first only updated router-02 and pinged to 
> router-04 (out of traffic flow), there I noticed these long stretches of 
> ok ping.
>
> When I updated also router-03 and router-04, the old behaviour kind of 
> was back, this confused me.
>
> Could this be related to netlink? I have gobgpd running on these 
> routers, which injects routes via netlink.
> But the churn rate during the tests is very minimal, maybe 30 - 40 
> routes every second.
>
> Otherwise we got: salt-minion, collectd, node_exporter, sshd

collectd may be polling the interface stats; try turning that off?

>>
>> Another theory is that the NIC HW does strange things, but it is not very
>> likely.  E.g. delaying the packets before generating the IRQ interrupt,
>> which hides it from my IRQ-to-softirq latency tool.
>>
>> A question: What traffic control qdisc are you using on your system?
>
> kernel 4+ uses pfifo, but there's no dropped packets
> I have also tested with fq_codel, same behaviour and also no weirdness 
> in the packets queue itself
>
> kernel 3.10 uses mq, and for the vlan interfaces noqueue

Do you mean that you only have a single pfifo qdisc on kernel 4+? Why is
it not using mq?

Was there anything in the ethtool stats?
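(For example, something along these lines, with eth0 as a placeholder
for your actual interfaces, run before and after an incident:

 # ethtool -S eth0 | grep -Ei 'drop|err|fifo|miss'

or just use the ethtool_stats.pl script mentioned earlier in the thread.)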

-Toke


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-06 Thread Thomas Rosenstein via Bloat



On 6 Nov 2020, at 12:18, Jesper Dangaard Brouer wrote:


On Fri, 06 Nov 2020 10:18:10 +0100
"Thomas Rosenstein"  wrote:


I just tested 5.9.4 seems to also fix it partly, I have long
stretches where it looks good, and then some increases again. (3.10
Stock has them too, but not so high, rather 1-3 ms)



That you have long stretches where latency looks good is interesting
information.   My theory is that your system has a periodic userspace
process that does a kernel syscall that takes too long, blocking the
network card from processing packets. (Note it can also be a kernel
thread).


The weird part is, I first only updated router-02 and pinged to 
router-04 (out of traffic flow), there I noticed these long stretches of 
ok ping.


When I updated also router-03 and router-04, the old behaviour kind of 
was back, this confused me.


Could this be related to netlink? I have gobgpd running on these 
routers, which injects routes via netlink.
But the churn rate during the tests is very minimal, maybe 30 - 40 
routes every second.


Otherwise we got: salt-minion, collectd, node_exporter, sshd



Another theory is that the NIC HW does strange things, but it is not very
likely.  E.g. delaying the packets before generating the IRQ interrupt,
which hides it from my IRQ-to-softirq latency tool.

A question: What traffic control qdisc are you using on your system?


kernel 4+ uses pfifo, but there's no dropped packets
I have also tested with fq_codel, same behaviour and also no weirdness 
in the packets queue itself


kernel 3.10 uses mq, and for the vlan interfaces noqueue


Here's the mail archive link for the question on lartc :

https://www.spinics.net/lists/lartc/msg23774.html



Have you looked at the obvious case of whether any of your qdiscs report a
large backlog (during the incidents)?


as said above, nothing in the qdiscs and nothing reported





for example:

64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms

and then again:

64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms


These very low ping times tell me that you are measuring very close to
the target machine, which is good.  Here on the bufferbloat list, we are
always suspicious of network equipment being used in these kinds of
setups, as experience tells us that this can be the cause of
bufferbloat latency.

yes, I'm just testing across two machines connected directly to the same
switch, basically that's the best case scenario apart from direct connection.

I do also use a VLAN on this interface, so the pings go through the vlan 
stack!




You mention some fs.com switches (your desc below signature), can you
tell us more?


It's a fs.com N5850-48S6Q

48 Port 10 Gbit + 6 port 40 Gbit

there are only 6 ports with 10 G in use, and 2 with 1 G, basically no 
traffic





[...]
I have a feeling that maybe not all config options were correctly moved
to the newer kernel.

Or there's a big bug somewhere ... (which would seem rather weird for me
to be the first one to discover this)


I really appreciate that you report this.  This is a periodic issue,
which often results in people not reporting it.

Even if we find this to be caused by some process running on your
system, or a bad config, it is really important that we find the
root-cause.


I'll rebuild the 5.9 kernel on one of the 3.10 kernel and see if it
makes a difference ...


--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

On Wed, 04 Nov 2020 16:23:12 +0100
Thomas Rosenstein via Bloat  wrote:


General Info:

Routers are connected between each other with 10G Mellanox Connect-X
cards via 10G SFP+ DAC cables via a 10G Switch from fs.com
Latency generally is around 0.18 ms between all routers (4).
Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3.
2 of the 4 routers are connected upstream with a 1G connection (separate
port, same network card)
All routers have the full internet routing tables, i.e. 80k entries for
IPv6 and 830k entries for IPv4
Conntrack is disabled (-j NOTRACK)
Kernel 5.4.60 (custom)
2x Xeon X5670 @ 2.93 GHz
96 GB RAM



Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-06 Thread Jesper Dangaard Brouer
On Fri, 06 Nov 2020 10:18:10 +0100
"Thomas Rosenstein"  wrote:

> >> I just tested 5.9.4 seems to also fix it partly, I have long
> >> stretches where it looks good, and then some increases again. (3.10
> >> Stock has them too, but not so high, rather 1-3 ms)
> >>

That you have long stretches where latency looks good is interesting
information.   My theory is that your system has a periodic userspace
process that does a kernel syscall that takes too long, blocking the
network card from processing packets. (Note it can also be a kernel
thread).

Another theory is that the NIC HW does strange things, but it is not very
likely.  E.g. delaying the packets before generating the IRQ interrupt,
which hides it from my IRQ-to-softirq latency tool.

A question: What traffic control qdisc are you using on your system?

Have you looked at the obvious case of whether any of your qdiscs report
a large backlog (during the incidents)?
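A quick way to check, with eth0 as a placeholder for your actual
interfaces, is:

 # tc -s qdisc show dev eth0

and look at the 'backlog' and 'dropped' counters while an incident is
ongoing.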


> >> for example:
> >>
> >> 64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
> >> 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
> >> 64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
> >> 64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
> >> 64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms
> >>
> >> and then again:
> >>
> >> 64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
> >> 64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
> >> 64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
> >> 64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
> >> 64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms

These very low ping times tell me that you are measuring very close to
the target machine, which is good.  Here on the bufferbloat list, we are
always suspicious of network equipment being used in these kinds of
setups, as experience tells us that this can be the cause of
bufferbloat latency.

You mention some fs.com switches (your desc below signature), can you
tell us more?


[...]
> I have a feeling that maybe not all config options were correctly moved 
> to the newer kernel.
>
> Or there's a big bug somewhere ... (which would seem rather weird for me 
> to be the first one to discover this)

I really appreciate that you report this.  This is a periodic issue,
which often results in people not reporting it.

Even if we find this to be caused by some process running on your
system, or a bad config, it is really important that we find the
root-cause.
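One way to compare the old and new kernel configs (a sketch; the two
paths are placeholders for wherever your config files live) is the
kernel tree's diffconfig helper:

 $ scripts/diffconfig /path/to/config-3.10 /path/to/config-5.9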

> I'll rebuild the 5.9 kernel on one of the 3.10 kernel and see if it 
> makes a difference ...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

On Wed, 04 Nov 2020 16:23:12 +0100
Thomas Rosenstein via Bloat  wrote:

> General Info:
> 
> Routers are connected between each other with 10G Mellanox Connect-X 
> cards via 10G SFP+ DAC cables via a 10G Switch from fs.com
> Latency generally is around 0.18 ms between all routers (4).
> Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3.
> 2 of the 4 routers are connected upstream with a 1G connection (separate 
> port, same network card)
> All routers have the full internet routing tables, i.e. 80k entries for 
> IPv6 and 830k entries for IPv4
> Conntrack is disabled (-j NOTRACK)
> Kernel 5.4.60 (custom)
> 2x Xeon X5670 @ 2.93 GHz
> 96 GB RAM




Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-06 Thread Jesper Dangaard Brouer
On Fri, 06 Nov 2020 09:48:46 +0100
"Thomas Rosenstein"  wrote:

> On 5 Nov 2020, at 14:33, Jesper Dangaard Brouer wrote:
> 
> > On Thu, 05 Nov 2020 13:22:10 +0100
> > Thomas Rosenstein via Bloat  wrote:
> >  
> >> On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote:
> >>  
> >>> "Thomas Rosenstein"  writes:
> >>>  
> > If so, this sounds more like a driver issue, or maybe something to
> > do with scheduling. Does it only happen with ICMP? You could try 
> > this
> > tool for a userspace UDP measurement:  
> 
>  It happens with all packets, therefore the transfer to backblaze 
>  with
>  40 threads goes down to ~8MB/s instead of >60MB/s  
> >>>
> >>> Huh, right, definitely sounds like a kernel bug; or maybe the new
> >>> kernel is getting the hardware into a state where it bugs out when
> >>> there are lots of flows or something.
> >>>
> >>> You could try looking at the ethtool stats (ethtool -S) while
> >>> running the test and see if any error counters go up. Here's a
> >>> handy script to monitor changes in the counters:
> >>>
> >>> https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> >>>  
>  I'll try what that reports!
>   
> > Also, what happens if you ping a host on the internet (*through*
> > the router instead of *to* it)?  
> 
>  Same issue, but twice pronounced, as it seems all interfaces are
>  affected.
>  So, ping on one interface and the second has the issue.
>  Also all traffic across the host has the issue, but on both sides,
>  so ping to the internet increased by 2x  
> >>>
> >>> Right, so even an unloaded interface suffers? But this is the same
> >>> NIC, right? So it could still be a hardware issue...
> >>>  
>  Yep default that CentOS ships, I just tested 4.12.5 there the
>  issue also does not happen. So I guess I can bisect it
>  then...(really don't want to )  
> >>>
> >>> Well that at least narrows it down :)  
> >>
> >> I just tested 5.9.4 seems to also fix it partly, I have long
> >> stretches where it looks good, and then some increases again. (3.10
> >> Stock has them too, but not so high, rather 1-3 ms)
> >>
> >> for example:
> >>
> >> 64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
> >> 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
> >> 64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
> >> 64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
> >> 64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms
> >>
> >> and then again:
> >>
> >> 64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
> >> 64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
> >> 64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
> >> 64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
> >> 64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
> >> 64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
> >> 64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
> >> 64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
> >> 64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
> >> 64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
> >> 64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
> >> 64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
> >> 64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
> >> 64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
> >> 64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
> >> 64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
> >> 64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
> >> 64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
> >> 64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
> >> 64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms
> >>
> >>
> >> For me it looks now that there was some fix between 5.4.60 and 5.9.4
> >> ... anyone can pinpoint it?  
> 
> So, new day, same issue!
> 
> I upgraded now all routers to 5.9.4, and the issue is back ...
> 
> here, when I stop it, it goes immediately down to 0.xx ms
> 
> 64 bytes from x.x.x.x: icmp_seq=48 ttl=64 time=1.67 ms
> 64 bytes from x.x.x.x: icmp_seq=49 ttl=64 time=12.6 ms
> 64 bytes from x.x.x.x: icmp_seq=50 ttl=64 time=13.8 ms
> 64 bytes from x.x.x.x: icmp_seq=51 ttl=64 time=5.59 ms
> 64 bytes from x.x.x.x: icmp_seq=52 ttl=64 time=5.86 ms
> 64 bytes from x.x.x.x: icmp_seq=53 ttl=64 time=9.26 ms
> 64 bytes from x.x.x.x: icmp_seq=54 ttl=64 time=8.28 ms
> 64 bytes from x.x.x.x: icmp_seq=55 ttl=64 time=12.4 ms
> 64 bytes from x.x.x.x: icmp_seq=56 ttl=64 time=0.551 ms
> 64 bytes from x.x.x.x: icmp_seq=57 ttl=64 time=4.37 ms
> 64 bytes from x.x.x.x: icmp_seq=58 ttl=64 time=12.1 ms
> 64 bytes from x.x.x.x: icmp_seq=59 ttl=64 time=5.93 ms
> 64 bytes from x.x.x.x: icmp_seq=60 ttl=64 time=6.58 ms
> 64 bytes from x.x.x.x: icmp_seq=61 ttl=64 time=9.19 ms
> 64 bytes from x.x.x.x: icmp_seq=62 ttl=64 time=0.124 ms
> 64 bytes from x.x.x.x: icmp_seq=63 ttl=64 time=7.08 ms
> 64 bytes from x.x.x.x: 

Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-06 Thread Thomas Rosenstein via Bloat



On 5 Nov 2020, at 14:33, Jesper Dangaard Brouer wrote:


On Thu, 05 Nov 2020 13:22:10 +0100
Thomas Rosenstein via Bloat  wrote:


On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote:


"Thomas Rosenstein"  writes:


If so, this sounds more like a driver issue, or maybe something to
do with scheduling. Does it only happen with ICMP? You could try this
tool for a userspace UDP measurement:


It happens with all packets, therefore the transfer to backblaze with
40 threads goes down to ~8MB/s instead of >60MB/s


Huh, right, definitely sounds like a kernel bug; or maybe the new
kernel is getting the hardware into a state where it bugs out when
there are lots of flows or something.

You could try looking at the ethtool stats (ethtool -S) while
running the test and see if any error counters go up. Here's a
handy script to monitor changes in the counters:

https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl


I'll try what that reports!


Also, what happens if you ping a host on the internet (*through*
the router instead of *to* it)?


Same issue, but twice pronounced, as it seems all interfaces are
affected.
So, ping on one interface and the second has the issue.
Also all traffic across the host has the issue, but on both sides,
so ping to the internet increased by 2x


Right, so even an unloaded interface suffers? But this is the same
NIC, right? So it could still be a hardware issue...


Yep default that CentOS ships, I just tested 4.12.5 there the
issue also does not happen. So I guess I can bisect it
then...(really don't want to )


Well that at least narrows it down :)


I just tested 5.9.4 seems to also fix it partly, I have long
stretches where it looks good, and then some increases again. (3.10
Stock has them too, but not so high, rather 1-3 ms)

for example:

64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms

and then again:

64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms


For me it looks now that there was some fix between 5.4.60 and 5.9.4
... anyone can pinpoint it?


I have now retried with 3.10, here's the results:

I'm pinging between the hosts with the traffic flow, upload is running 
55 MB/s (compared to ~20 MB/s on 5.9.4, and 8MB/s on 5.4.60):


64 bytes from x.x.x.x: icmp_seq=159 ttl=64 time=0.050 ms
64 bytes from x.x.x.x: icmp_seq=160 ttl=64 time=0.056 ms
64 bytes from x.x.x.x: icmp_seq=161 ttl=64 time=0.061 ms
64 bytes from x.x.x.x: icmp_seq=162 ttl=64 time=0.072 ms
64 bytes from x.x.x.x: icmp_seq=163 ttl=64 time=0.052 ms
64 bytes from x.x.x.x: icmp_seq=164 ttl=64 time=0.053 ms
64 bytes from x.x.x.x: icmp_seq=165 ttl=64 time=0.068 ms
64 bytes from x.x.x.x: icmp_seq=166 ttl=64 time=0.050 ms
64 bytes from x.x.x.x: icmp_seq=167 ttl=64 time=0.057 ms
64 bytes from x.x.x.x: icmp_seq=168 ttl=64 time=0.051 ms
64 bytes from x.x.x.x: icmp_seq=169 ttl=64 time=0.045 ms
64 bytes from x.x.x.x: icmp_seq=170 ttl=64 time=0.138 ms
64 bytes from x.x.x.x: icmp_seq=171 ttl=64 time=0.052 ms
64 bytes from x.x.x.x: icmp_seq=172 ttl=64 time=0.049 ms
64 bytes from x.x.x.x: icmp_seq=173 ttl=64 time=0.094 ms
64 bytes from x.x.x.x: icmp_seq=174 ttl=64 time=0.050 ms
64 bytes from x.x.x.x: icmp_seq=175 ttl=64 time=0.810 ms
64 bytes from x.x.x.x: icmp_seq=176 ttl=64 time=0.077 ms
64 bytes from x.x.x.x: icmp_seq=177 ttl=64 time=0.055 ms
64 bytes from x.x.x.x: icmp_seq=178 ttl=64 time=0.049 ms
64 bytes from x.x.x.x: icmp_seq=179 ttl=64 time=0.050 ms
64 bytes from x.x.x.x: icmp_seq=180 ttl=64 time=0.073 ms
64 bytes from x.x.x.x: icmp_seq=181 ttl=64 time=0.065 ms
64 bytes from x.x.x.x: icmp_seq=182 ttl=64 time=0.123 ms
64 bytes from x.x.x.x: icmp_seq=183 ttl=64 time=0.045 ms
64 bytes from x.x.x.x: icmp_seq=184 

Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-06 Thread Thomas Rosenstein via Bloat



On 5 Nov 2020, at 14:33, Jesper Dangaard Brouer wrote:


On Thu, 05 Nov 2020 13:22:10 +0100
Thomas Rosenstein via Bloat  wrote:


On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote:


"Thomas Rosenstein"  writes:


If so, this sounds more like a driver issue, or maybe something to
do with scheduling. Does it only happen with ICMP? You could try this
tool for a userspace UDP measurement:


It happens with all packets, therefore the transfer to backblaze with
40 threads goes down to ~8MB/s instead of >60MB/s


Huh, right, definitely sounds like a kernel bug; or maybe the new
kernel is getting the hardware into a state where it bugs out when
there are lots of flows or something.

You could try looking at the ethtool stats (ethtool -S) while
running the test and see if any error counters go up. Here's a
handy script to monitor changes in the counters:

https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl


I'll try what that reports!


Also, what happens if you ping a host on the internet (*through*
the router instead of *to* it)?


Same issue, but twice pronounced, as it seems all interfaces are
affected.
So, ping on one interface and the second has the issue.
Also all traffic across the host has the issue, but on both sides,
so ping to the internet increased by 2x


Right, so even an unloaded interface suffers? But this is the same
NIC, right? So it could still be a hardware issue...


Yep default that CentOS ships, I just tested 4.12.5 there the
issue also does not happen. So I guess I can bisect it
then...(really don't want to )


Well that at least narrows it down :)


I just tested 5.9.4 seems to also fix it partly, I have long
stretches where it looks good, and then some increases again. (3.10
Stock has them too, but not so high, rather 1-3 ms)

for example:

64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms

and then again:

64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms


For me it looks now that there was some fix between 5.4.60 and 5.9.4
... anyone can pinpoint it?


So, new day, same issue!

I upgraded now all routers to 5.9.4, and the issue is back ...

here, when I stop it, it goes immediately down to 0.xx ms

64 bytes from x.x.x.x: icmp_seq=48 ttl=64 time=1.67 ms
64 bytes from x.x.x.x: icmp_seq=49 ttl=64 time=12.6 ms
64 bytes from x.x.x.x: icmp_seq=50 ttl=64 time=13.8 ms
64 bytes from x.x.x.x: icmp_seq=51 ttl=64 time=5.59 ms
64 bytes from x.x.x.x: icmp_seq=52 ttl=64 time=5.86 ms
64 bytes from x.x.x.x: icmp_seq=53 ttl=64 time=9.26 ms
64 bytes from x.x.x.x: icmp_seq=54 ttl=64 time=8.28 ms
64 bytes from x.x.x.x: icmp_seq=55 ttl=64 time=12.4 ms
64 bytes from x.x.x.x: icmp_seq=56 ttl=64 time=0.551 ms
64 bytes from x.x.x.x: icmp_seq=57 ttl=64 time=4.37 ms
64 bytes from x.x.x.x: icmp_seq=58 ttl=64 time=12.1 ms
64 bytes from x.x.x.x: icmp_seq=59 ttl=64 time=5.93 ms
64 bytes from x.x.x.x: icmp_seq=60 ttl=64 time=6.58 ms
64 bytes from x.x.x.x: icmp_seq=61 ttl=64 time=9.19 ms
64 bytes from x.x.x.x: icmp_seq=62 ttl=64 time=0.124 ms
64 bytes from x.x.x.x: icmp_seq=63 ttl=64 time=7.08 ms
64 bytes from x.x.x.x: icmp_seq=64 ttl=64 time=9.69 ms
64 bytes from x.x.x.x: icmp_seq=65 ttl=64 time=7.52 ms
64 bytes from x.x.x.x: icmp_seq=66 ttl=64 time=14.9 ms
64 bytes from x.x.x.x: icmp_seq=67 ttl=64 time=12.6 ms
64 bytes from x.x.x.x: icmp_seq=68 ttl=64 time=2.34 ms
64 bytes from x.x.x.x: icmp_seq=69 ttl=64 time=8.97 ms
64 bytes from x.x.x.x: icmp_seq=70 ttl=64 time=0.203 ms
64 bytes from x.x.x.x: icmp_seq=71 ttl=64 time=9.10 ms
64 bytes from x.x.x.x: icmp_seq=72 ttl=64 time=3.16 ms
64 bytes from x.x.x.x: icmp_seq=73 ttl=64 time=1.88 ms
64 bytes from x.x.x.x: icmp_seq=74 ttl=64 time=11.5 ms
64 bytes from