On 6 Nov 2020, at 21:19, Jesper Dangaard Brouer wrote:

On Fri, 06 Nov 2020 18:04:49 +0100
"Thomas Rosenstein" <thomas.rosenst...@creamfinance.com> wrote:

On 6 Nov 2020, at 15:13, Jesper Dangaard Brouer wrote:


I'm using ping on IPv4, but I'll try to see if IPv6 makes any
difference!

I think you misunderstand me.  I'm not asking you to use ping6. The
gobgpd daemon updates will update both IPv4 and IPv6 routes, right?
Updating IPv6 routes is more problematic than updating IPv4 routes.
The IPv6 route table updates can potentially stall softirq from
running, which is what the latency tool was measuring... and it did
show some outliers.

Yes, I did; I assumed the latency would be introduced into the traffic path by the lock.
Nonetheless, I tested it and there is no difference :)



Have you tried to use 'perf record' to observe what is happening on
the system while these latency incidents happen? (let me know if you
want some cmdline hints)

Haven't tried this yet. If you have some hints on what events to monitor,
I'll take them!

Okay, to record everything (-a) on the system and save the call-graph (-g),
run for 5 seconds (by profiling the sleep command):

 # perf record -g -a  sleep 5
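If the latency incidents are short, it can also help to raise the sampling
frequency with perf record's -F option (the 999 Hz value below is just a
common choice for illustration, not something specific to this thread):

 # perf record -g -a -F 999 sleep 5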

To view the result, simply use 'perf report', but you likely want to use
the option --no-children as you are profiling the kernel (and not a
userspace program whose 'children' you want grouped).  I also include
the CPU column via '--sort cpu,comm,dso,symbol' and you can
select/zoom in on a specific CPU via '-C zero-indexed-cpu-num'.

 # perf report --sort cpu,comm,dso,symbol --no-children
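For example, to zoom in on a single CPU (the CPU number 2 here is only an
illustration):

 # perf report --sort cpu,comm,dso,symbol --no-children -C 2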

When we ask you to provide the output, you can use the --stdio option
and share the text output via a pastebin link, as it is very long.
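
Putting that together, a command along these lines will produce a text
report you can paste (the output filename is just an example):

 # perf report --sort cpu,comm,dso,symbol --no-children --stdio > perf-report.txt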

Here is the output from kernel 3.10_1127 (I updated to the very latest in that branch): https://pastebin.com/5mxirXPw
Here is the output from kernel 5.9.4: https://pastebin.com/KDZ2Ei2F

I have noticed that the delays are directly related to the traffic flows, see below.

These tests are WITHOUT gobgpd running, so there are no updates to the route table, but the route tables are fully populated. Also, it's ONLY outgoing traffic; the return packets are coming in on another router.

I have then cleared the routing tables, and the issue persists; the table has only 78 entries.

40 threads -> sometimes higher rtt times: https://pastebin.com/Y9nd0h4h
60 threads -> always high rtt times: https://pastebin.com/JFvhtLrH

So it definitely gets worse the more connections there are.

I have also tried to reproduce the issue with the same kernel on a virtual Hyper-V machine; there I don't see any adverse effects. But it's not 100% the same setup, since MASQ happens on it .. I will restructure a bit to get a similar representation.

I also suspected that -j NOTRACK might be an issue; I removed that too, no change. (It's asymmetric routing anyway.)
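
For context, a NOTRACK rule typically sits in the raw table; the rule I removed was along these lines (match criteria omitted, this is only an illustration, not the exact rule from my config):

 # iptables -t raw -A PREROUTING -j NOTRACK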

Additionally, I have quit all applications except sshd; no change!




--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
