"Thomas Rosenstein" <thomas.rosenst...@creamfinance.com> writes:
> On 5 Nov 2020, at 13:38, Toke Høiland-Jørgensen wrote:
>
>> "Thomas Rosenstein" <thomas.rosenst...@creamfinance.com> writes:
>>
>>> On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote:
>>>
>>>> "Thomas Rosenstein" <thomas.rosenst...@creamfinance.com> writes:
>>>>
>>>>>> If so, this sounds more like a driver issue, or maybe something to do
>>>>>> with scheduling. Does it only happen with ICMP? You could try this
>>>>>> tool for a userspace UDP measurement:
>>>>>
>>>>> It happens with all packets, therefore the transfer to backblaze with
>>>>> 40 threads goes down to ~8MB/s instead of >60MB/s
>>>>
>>>> Huh, right, definitely sounds like a kernel bug; or maybe the new
>>>> kernel is getting the hardware into a state where it bugs out when
>>>> there are lots of flows or something.
>>>>
>>>> You could try looking at the ethtool stats (ethtool -S) while running
>>>> the test and see if any error counters go up. Here's a handy script to
>>>> monitor changes in the counters:
>>>>
>>>> https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
>>>>
>>>>> I'll try what that reports!
>>>>>
>>>>>> Also, what happens if you ping a host on the internet (*through* the
>>>>>> router instead of *to* it)?
>>>>>
>>>>> Same issue, but twice pronounced, as it seems all interfaces are
>>>>> affected.
>>>>> So, ping on one interface and the second has the issue.
>>>>> Also all traffic across the host has the issue, but on both sides, so
>>>>> ping to the internet increased by 2x
>>>>
>>>> Right, so even an unloaded interface suffers? But this is the same
>>>> NIC, right? So it could still be a hardware issue...
>>>>
>>>>> Yep default that CentOS ships, I just tested 4.12.5 there the issue
>>>>> also does not happen. So I guess I can bisect it then...(really don't
>>>>> want to 😃)
>>>>
>>>> Well that at least narrows it down :)
>>>
>>> I just tested 5.9.4 seems to also fix it partly, I have long stretches
>>> where it looks good, and then some increases again. (3.10 Stock has
>>> them too, but not so high, rather 1-3 ms)
>>>
>>> for example:
>>>
>>> 64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
>>> 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
>>> 64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
>>> 64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
>>> 64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms
>>>
>>> and then again:
>>>
>>> 64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
>>> 64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
>>> 64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
>>> 64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
>>> 64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
>>> 64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
>>> 64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
>>> 64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
>>> 64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
>>> 64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
>>> 64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
>>> 64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
>>> 64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
>>> 64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
>>> 64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
>>> 64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
>>> 64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
>>> 64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
>>> 64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
>>> 64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms
>>>
>>> For me it looks now that there was some fix between 5.4.60 and 5.9.4
>>> ... anyone can pinpoint it?
>>
>> $ git log --no-merges --oneline v5.4.60..v5.9.4|wc -l
>> 72932
>>
>> Only 73k commits; should be easy, right? :)
>>
>> (In other words no, I have no idea; I'd suggest either (a) asking on
>> netdev, (b) bisecting or (c) using 5.9+ and just making peace with not
>> knowing).
>
> Guess I'll go the easy route and let it be ...
>
> I'll update all routers to the 5.9.4 and see if it fixes the traffic
> flow - will report back once more after that.

Sounds like a plan :)

>>>>>> How did you configure the new kernel? Did you start from scratch, or
>>>>>> is it based on the old centos config?
>>>>>
>>>>> first oldconfig and from there then added additional options for IB,
>>>>> NVMe, etc (which I don't really need on the routers)
>>>>
>>>> OK, so you're probably building with roughly the same options in terms
>>>> of scheduling granularity etc. That's good. Did you enable spectre
>>>> mitigations etc on the new kernel? What's the output of
>>>> `tail /sys/devices/system/cpu/vulnerabilities/*` ?
>>>
>>> mitigations are off
>>
>> Right, I just figured maybe you were hitting some threshold that
>> involved a lot of indirect calls which slowed things down due to
>> mitigations. Guess not, then...
>
> Thanks for the support :)

You're welcome!

-Toke

_______________________________________________
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat
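The linked ethtool_stats.pl is the tool Toke recommends for watching the
counters; for readers who prefer Python, below is a minimal sketch of the
same idea: sample `ethtool -S <iface>` once per second and print only the
counters whose values changed between samples. The default interface name,
the sampling interval, and the `name: value` parsing are illustrative
assumptions, not taken from that script.

#!/usr/bin/env python3
# Illustrative sketch (not the ethtool_stats.pl script linked above):
# periodically sample "ethtool -S <iface>" and print counter deltas.
import subprocess
import sys
import time

def read_counters(iface):
    # Run "ethtool -S <iface>" and parse lines of the form "name: value".
    out = subprocess.run(["ethtool", "-S", iface],
                         capture_output=True, text=True, check=True).stdout
    counters = {}
    for line in out.splitlines():
        name, sep, value = line.rpartition(":")
        if not sep:
            continue
        try:
            counters[name.strip()] = int(value)
        except ValueError:
            pass  # skip header lines and non-numeric values
    return counters

def main():
    iface = sys.argv[1] if len(sys.argv) > 1 else "eth0"  # illustrative default
    interval = 1.0  # seconds between samples
    prev = read_counters(iface)
    while True:
        time.sleep(interval)
        cur = read_counters(iface)
        for name, value in sorted(cur.items()):
            delta = value - prev.get(name, 0)
            if delta:
                print(f"{name}: +{delta} (total {value})")
        print("---")
        prev = cur

if __name__ == "__main__":
    main()

Run it as e.g. `python3 ethtool_delta.py <iface>` while the transfer test is
going, and watch whether any error or drop counters climb together with the
latency spikes.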