Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 16 Nov 2020, at 13:34, Jesper Dangaard Brouer wrote:

> On Wed, 04 Nov 2020 16:23:12 +0100 Thomas Rosenstein via Bloat wrote:
> [...]
>> I have multiple routers which connect to multiple upstream providers. I
>> have noticed a high latency shift in icmp (and generally all connections)
>> if I run b2 upload-file --threads 40 (and I can reproduce this).
>>
>> What options do I have to analyze why this happens?
>>
>> General Info:
>>
>> Routers are connected to each other with 10G Mellanox Connect-X cards
>> via 10G SFP+ DAC cables via a 10G switch from fs.com.
>> Latency is generally around 0.18 ms between all routers (4).
>> Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3.
>> 2 of the 4 routers are connected upstream with a 1G connection
>> (separate port, same network card).
>> All routers have the full internet routing tables, i.e. 80k entries
>> for IPv6 and 830k entries for IPv4.
>> Conntrack is disabled (-j NOTRACK).
>> Kernel 5.4.60 (custom)
>> 2x Xeon X5670 @ 2.93 GHz
>
> I think I have spotted your problem... This CPU[1], the Xeon X5670, is
> more than 10 years old! It basically corresponds to the machines I used
> for my presentation at LinuxCon 2009, see slides[2]. Only with large
> frames and with massive scaling across all CPUs was I able to get close
> to 10 Gbit/s through these machines. And on top I had to buy low-latency
> RAM memory-blocks to make it happen.
>
> As you can see in my slides[2], memory bandwidth and PCIe speeds were at
> the limit of making it possible at the hardware level. I had to run DDR3
> memory at 1333 MHz and tune the QuickPath Interconnect (QPI) to 6.4 GT/s
> (default 4.8 GT/s). This generation of motherboards had both PCIe gen-1
> and gen-2 slots, and only the PCIe gen-2 slots had barely enough
> bandwidth. Maybe you physically placed the NIC in a PCIe gen-1 slot?
>
> On top of this, you also have a NUMA system (2x Xeon X5670), which can
> result in A LOT of "funny" issues that are really hard to troubleshoot...

Yes, I'm aware of the limits of what to expect, but as we agree, 60 TCP streams at not even 200 Mbit/s shouldn't overload the PCIe bus or the CPUs. Also, don't forget: there were no issues with kernel 3.10. The PCIe slot is a Gen2, x8, so more than enough bandwidth there, luckily ;) But yes, they are quite old...

[1] https://ark.intel.com/content/www/us/en/ark/products/47920/intel-xeon-processor-x5670-12m-cache-2-93-ghz-6-40-gt-s-intel-qpi.html
[2] https://people.netfilter.org/hawk/presentations/LinuxCon2009/LinuxCon2009_JesperDangaardBrouer_final.pdf

-- 
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat
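As a sanity check on the "Gen2, x8" claim, the slot's usable bandwidth can be estimated with a bit of shell arithmetic (a rough sketch that only accounts for the 8b/10b line coding, not TLP/protocol overhead):

```shell
# PCIe gen-2 signals 5 GT/s per lane with 8b/10b encoding (80% efficient),
# so usable bandwidth in Gbit/s ~= GT/s * lanes * 8/10.
gts_per_lane=5   # PCIe gen-2 line rate in gigatransfers/sec
lanes=8          # x8 slot
usable_gbit=$(( gts_per_lane * lanes * 8 / 10 ))
echo "PCIe gen-2 x8: ~${usable_gbit} Gbit/s usable"   # ~32 Gbit/s
```

Roughly 32 Gbit/s of headroom, so a gen-2 x8 slot is indeed not the bottleneck for a 10G NIC; a gen-1 x8 slot would give only half that.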
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On Wed, 04 Nov 2020 16:23:12 +0100 Thomas Rosenstein via Bloat wrote:
[...]
> I have multiple routers which connect to multiple upstream providers. I
> have noticed a high latency shift in icmp (and generally all connections)
> if I run b2 upload-file --threads 40 (and I can reproduce this).
>
> What options do I have to analyze why this happens?
>
> General Info:
>
> Routers are connected to each other with 10G Mellanox Connect-X cards
> via 10G SFP+ DAC cables via a 10G switch from fs.com.
> Latency is generally around 0.18 ms between all routers (4).
> Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3.
> 2 of the 4 routers are connected upstream with a 1G connection
> (separate port, same network card).
> All routers have the full internet routing tables, i.e. 80k entries for
> IPv6 and 830k entries for IPv4.
> Conntrack is disabled (-j NOTRACK).
> Kernel 5.4.60 (custom)
> 2x Xeon X5670 @ 2.93 GHz

I think I have spotted your problem... This CPU[1], the Xeon X5670, is more than 10 years old! It basically corresponds to the machines I used for my presentation at LinuxCon 2009, see slides[2]. Only with large frames and with massive scaling across all CPUs was I able to get close to 10 Gbit/s through these machines. And on top I had to buy low-latency RAM memory-blocks to make it happen.

As you can see in my slides[2], memory bandwidth and PCIe speeds were at the limit of making it possible at the hardware level. I had to run DDR3 memory at 1333 MHz and tune the QuickPath Interconnect (QPI) to 6.4 GT/s (default 4.8 GT/s). This generation of motherboards had both PCIe gen-1 and gen-2 slots, and only the PCIe gen-2 slots had barely enough bandwidth. Maybe you physically placed the NIC in a PCIe gen-1 slot?

On top of this, you also have a NUMA system (2x Xeon X5670), which can result in A LOT of "funny" issues that are really hard to troubleshoot...

[1] https://ark.intel.com/content/www/us/en/ark/products/47920/intel-xeon-processor-x5670-12m-cache-2-93-ghz-6-40-gt-s-intel-qpi.html
[2] https://people.netfilter.org/hawk/presentations/LinuxCon2009/LinuxCon2009_JesperDangaardBrouer_final.pdf

-- 
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 16 Nov 2020, at 12:56, Jesper Dangaard Brouer wrote:

> On Fri, 13 Nov 2020 07:31:26 +0100 "Thomas Rosenstein" wrote:
>> On 12 Nov 2020, at 16:42, Jesper Dangaard Brouer wrote:
>>> Notice "Adaptive" setting is on. My long-shot theory(2) is that this
>>> adaptive algorithm in the driver code can guess wrong (due to not
>>> taking TSO into account) and cause issues.
>>>
>>> Try to turn this adaptive algorithm off:
>>>
>>> ethtool -C eth4 adaptive-rx off adaptive-tx off
>>>
>>> [...]
>>>> rx-usecs: 32
>>>
>>> When you turn off "adaptive-rx" you will get 31250 interrupts/sec
>>> (calc 1/(32/10^6) = 31250).
>>>
>>>> rx-frames: 64
>>> [...]
>>>> tx-usecs-irq: 0
>>>> tx-frames-irq: 0
>>> [...]
>>
>> I have now updated the settings to:
>>
>> ethtool -c eth4
>> Coalesce parameters for eth4:
>> Adaptive RX: off TX: off
>> stats-block-usecs: 0
>> sample-interval: 0
>> pkt-rate-low: 0
>> pkt-rate-high: 0
>>
>> rx-usecs: 0
>>
>>> Please put a value in rx-usecs, like 20 or 10.
>>> The value 0 is often used to signal the driver to do adaptive.
>>
>> Ok, put it now to 10.
>
> Setting it to 10 is a little aggressive, as you ask it to generate
> 100,000 interrupts per sec. (Watch with 'vmstat 1' to see it.)
> 1/(10/10^6) = 100,000 interrupts/sec
>
>> Goes a bit quicker (transfer up to 26 MB/s), but discards and pci
>> stalls are still there.
>
> Why are you measuring in (26) MBytes/sec? (equal 208 Mbit/s)

yep, 208 Mbit/s

> If you still have ethtool PHY-discards, then you still have a problem.
>
>> Ping times are noticeably improved:
>
> Okay, so this means these changes did have a positive effect. So, this
> can be related to the OS not getting activated fast enough by NIC
> interrupts.

64 bytes from x.x.x.x: icmp_seq=39 ttl=64 time=0.172 ms
64 bytes from x.x.x.x: icmp_seq=40 ttl=64 time=0.414 ms
64 bytes from x.x.x.x: icmp_seq=41 ttl=64 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=42 ttl=64 time=1.41 ms
64 bytes from x.x.x.x: icmp_seq=43 ttl=64 time=0.172 ms
64 bytes from x.x.x.x: icmp_seq=44 ttl=64 time=0.228 ms
64 bytes from x.x.x.x: icmp_seq=46 ttl=64 time=0.120 ms
64 bytes from x.x.x.x: icmp_seq=47 ttl=64 time=1.47 ms
64 bytes from x.x.x.x: icmp_seq=48 ttl=64 time=0.162 ms
64 bytes from x.x.x.x: icmp_seq=49 ttl=64 time=0.160 ms
64 bytes from x.x.x.x: icmp_seq=50 ttl=64 time=0.158 ms
64 bytes from x.x.x.x: icmp_seq=51 ttl=64 time=0.113 ms

> Can you try to test if disabling TSO, GRO and GSO makes a difference?
>
> ethtool -K eth4 gso off gro off tso off

I had a call yesterday with Mellanox and we added the following boot options:

intel_idle.max_cstate=0 processor.max_cstate=1 idle=poll

This completely solved the problem, but now we run with a heater and energy consumer, nearly 2x the watts at the outlet. I had no discards, super pings during transfer (< 0.100 ms), no outliers, and good transfer rates > 50 MB/s.

So it seems to be related to C-state management in newer kernel versions being too aggressive. I would like to tune here a bit; maybe we can get some input on which knobs to turn?

I will read here: https://www.kernel.org/doc/html/latest/admin-guide/pm/cpuidle.html#idle-states-representation and related docs, I think there will be a few helpful hints.

-- 
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
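A less drastic knob than booting with idle=poll is the PM QoS latency target from the cpuidle docs linked above. A sketch, not from the thread (it assumes a Linux host exposing /dev/cpu_dma_latency, and the request only holds while the file descriptor stays open):

```shell
# Cap the allowed C-state exit latency at 10 us while fd 3 stays open.
# Closing the fd (or the shell exiting) restores the default policy.
exec 3> /dev/cpu_dma_latency
printf '10' >&3           # value in microseconds (ASCII decimal is accepted)
sleep 60                  # ... run the transfer / ping test here ...
exec 3>&-
```

Individual states can also be disabled per CPU via /sys/devices/system/cpu/cpu*/cpuidle/state*/disable, which lets you find exactly which C-state causes the latency spikes without paying the full idle=poll power cost.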
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On Fri, 13 Nov 2020 07:31:26 +0100 "Thomas Rosenstein" wrote:

> On 12 Nov 2020, at 16:42, Jesper Dangaard Brouer wrote:
>
>>> Notice "Adaptive" setting is on. My long-shot theory(2) is that this
>>> adaptive algorithm in the driver code can guess wrong (due to not
>>> taking TSO into account) and cause issues.
>>>
>>> Try to turn this adaptive algorithm off:
>>>
>>> ethtool -C eth4 adaptive-rx off adaptive-tx off
>>>
>> [...]
>> rx-usecs: 32
>>>
>>> When you turn off "adaptive-rx" you will get 31250 interrupts/sec
>>> (calc 1/(32/10^6) = 31250).
>>
>> rx-frames: 64
>> [...]
>> tx-usecs-irq: 0
>> tx-frames-irq: 0
>>> [...]
>>
>> I have now updated the settings to:
>>
>> ethtool -c eth4
>> Coalesce parameters for eth4:
>> Adaptive RX: off TX: off
>> stats-block-usecs: 0
>> sample-interval: 0
>> pkt-rate-low: 0
>> pkt-rate-high: 0
>>
>> rx-usecs: 0
>
> Please put a value in rx-usecs, like 20 or 10.
> The value 0 is often used to signal the driver to do adaptive.
>
> Ok, put it now to 10.

Setting it to 10 is a little aggressive, as you ask it to generate 100,000 interrupts per sec. (Watch with 'vmstat 1' to see it.)
1/(10/10^6) = 100,000 interrupts/sec

> Goes a bit quicker (transfer up to 26 MB/s), but discards and pci stalls
> are still there.

Why are you measuring in (26) MBytes/sec? (equal 208 Mbit/s)

If you still have ethtool PHY-discards, then you still have a problem.

> Ping times are noticeably improved:

Okay, so this means these changes did have a positive effect. So, this can be related to the OS not getting activated fast enough by NIC interrupts.

> 64 bytes from x.x.x.x: icmp_seq=39 ttl=64 time=0.172 ms
> 64 bytes from x.x.x.x: icmp_seq=40 ttl=64 time=0.414 ms
> 64 bytes from x.x.x.x: icmp_seq=41 ttl=64 time=0.183 ms
> 64 bytes from x.x.x.x: icmp_seq=42 ttl=64 time=1.41 ms
> 64 bytes from x.x.x.x: icmp_seq=43 ttl=64 time=0.172 ms
> 64 bytes from x.x.x.x: icmp_seq=44 ttl=64 time=0.228 ms
> 64 bytes from x.x.x.x: icmp_seq=46 ttl=64 time=0.120 ms
> 64 bytes from x.x.x.x: icmp_seq=47 ttl=64 time=1.47 ms
> 64 bytes from x.x.x.x: icmp_seq=48 ttl=64 time=0.162 ms
> 64 bytes from x.x.x.x: icmp_seq=49 ttl=64 time=0.160 ms
> 64 bytes from x.x.x.x: icmp_seq=50 ttl=64 time=0.158 ms
> 64 bytes from x.x.x.x: icmp_seq=51 ttl=64 time=0.113 ms

Can you try to test if disabling TSO, GRO and GSO makes a difference?

ethtool -K eth4 gso off gro off tso off

-- 
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 12 Nov 2020, at 16:42, Jesper Dangaard Brouer wrote:

> On Thu, 12 Nov 2020 14:42:59 +0100 "Thomas Rosenstein" wrote:
>
>>> Notice "Adaptive" setting is on. My long-shot theory(2) is that this
>>> adaptive algorithm in the driver code can guess wrong (due to not
>>> taking TSO into account) and cause issues.
>>>
>>> Try to turn this adaptive algorithm off:
>>>
>>> ethtool -C eth4 adaptive-rx off adaptive-tx off
>>>
>>> [...]
>>>> rx-usecs: 32
>>>
>>> When you turn off "adaptive-rx" you will get 31250 interrupts/sec
>>> (calc 1/(32/10^6) = 31250).
>>>
>>>> rx-frames: 64
>>> [...]
>>>> tx-usecs-irq: 0
>>>> tx-frames-irq: 0
>>> [...]
>>
>> I have now updated the settings to:
>>
>> ethtool -c eth4
>> Coalesce parameters for eth4:
>> Adaptive RX: off TX: off
>> stats-block-usecs: 0
>> sample-interval: 0
>> pkt-rate-low: 0
>> pkt-rate-high: 0
>>
>> rx-usecs: 0
>
> Please put a value in rx-usecs, like 20 or 10.
> The value 0 is often used to signal the driver to do adaptive.

Ok, put it now to 10.

Goes a bit quicker (transfer up to 26 MB/s), but discards and pci stalls are still there.

Ping times are noticeably improved:

64 bytes from x.x.x.x: icmp_seq=39 ttl=64 time=0.172 ms
64 bytes from x.x.x.x: icmp_seq=40 ttl=64 time=0.414 ms
64 bytes from x.x.x.x: icmp_seq=41 ttl=64 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=42 ttl=64 time=1.41 ms
64 bytes from x.x.x.x: icmp_seq=43 ttl=64 time=0.172 ms
64 bytes from x.x.x.x: icmp_seq=44 ttl=64 time=0.228 ms
64 bytes from x.x.x.x: icmp_seq=46 ttl=64 time=0.120 ms
64 bytes from x.x.x.x: icmp_seq=47 ttl=64 time=1.47 ms
64 bytes from x.x.x.x: icmp_seq=48 ttl=64 time=0.162 ms
64 bytes from x.x.x.x: icmp_seq=49 ttl=64 time=0.160 ms
64 bytes from x.x.x.x: icmp_seq=50 ttl=64 time=0.158 ms
64 bytes from x.x.x.x: icmp_seq=51 ttl=64 time=0.113 ms

> rx-frames: 32
> rx-usecs-irq: 0
> rx-frames-irq: 0

-- 
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On Thu, 12 Nov 2020 14:42:59 +0100 "Thomas Rosenstein" wrote:

> > Notice "Adaptive" setting is on. My long-shot theory(2) is that this
> > adaptive algorithm in the driver code can guess wrong (due to not
> > taking TSO into account) and cause issues.
> >
> > Try to turn this adaptive algorithm off:
> >
> > ethtool -C eth4 adaptive-rx off adaptive-tx off
> >
> > [...]
> >>
> >> rx-usecs: 32
> >
> > When you turn off "adaptive-rx" you will get 31250 interrupts/sec
> > (calc 1/(32/10^6) = 31250).
> >
> >> rx-frames: 64
> > [...]
> >> tx-usecs-irq: 0
> >> tx-frames-irq: 0
> >>
> > [...]
>
> I have now updated the settings to:
>
> ethtool -c eth4
> Coalesce parameters for eth4:
> Adaptive RX: off TX: off
> stats-block-usecs: 0
> sample-interval: 0
> pkt-rate-low: 0
> pkt-rate-high: 0
>
> rx-usecs: 0

Please put a value in rx-usecs, like 20 or 10.
The value 0 is often used to signal the driver to do adaptive.

> rx-frames: 32
> rx-usecs-irq: 0
> rx-frames-irq: 0

-- 
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
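The coalescing arithmetic above generalizes: with adaptive-rx off and the NIC under load, the IRQ rate is roughly one interrupt per rx-usecs. A quick shell sketch for the values discussed in this thread:

```shell
# With adaptive-rx off, IRQ rate under load ~= 1 second / rx-usecs
irq_rate() { echo $(( 1000000 / $1 )); }   # argument in microseconds

irq_rate 32   # 31250  (the rx-usecs=32 case above)
irq_rate 20   # 50000
irq_rate 10   # 100000 (rather aggressive)
```

This is why rx-usecs=10 was called aggressive: it asks for 100,000 interrupts per second, observable with 'vmstat 1'.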
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 12 Nov 2020, at 14:31, Jesper Dangaard Brouer wrote:

> On Thu, 12 Nov 2020 12:26:20 +0100 "Thomas Rosenstein" wrote:
>
>>> Long-shot theory(2): The NIC driver IRQ interrupt coalesce
>>> adaptive-IRQ (adj via ethtool -C) is wrong. The reason it is wrong is
>>> because of TSO frames, due to the packet-to-byte ratio not being taken
>>> into account.
>>
>> here are the coalesce parameters: (they are the same for 3.10 and 5.4.60)
>>
>> ethtool -c eth4
>> Coalesce parameters for eth4:
>> Adaptive RX: on TX: on
>
> Notice "Adaptive" setting is on. My long-shot theory(2) is that this
> adaptive algorithm in the driver code can guess wrong (due to not
> taking TSO into account) and cause issues.
>
> Try to turn this adaptive algorithm off:
>
> ethtool -C eth4 adaptive-rx off adaptive-tx off
>
>> stats-block-usecs: 0
>> sample-interval: 0
>> pkt-rate-low: 0
>> pkt-rate-high: 0
>>
>> rx-usecs: 32
>
> When you turn off "adaptive-rx" you will get 31250 interrupts/sec
> (calc 1/(32/10^6) = 31250).
>
>> rx-frames: 64
>
> You likely want to reduce this.
>
>> rx-usecs-irq: 0
>> rx-frames-irq: 0
>>
>> tx-usecs: 8
>> tx-frames: 128
>
> These 128 frames also seem too large.
>
>> tx-usecs-irq: 0
>> tx-frames-irq: 0
>>
> [...]

I have now updated the settings to:

ethtool -c eth4
Coalesce parameters for eth4:
Adaptive RX: off TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 0
rx-frames: 32
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 16
tx-frames: 32
tx-usecs-irq: 0
tx-frames-irq: 0

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

but it makes absolutely no difference, and discards keep coming as before.

>>> You can test theory(2) in two ways: via disabling TSO, and via
>>> adjusting ethtool --show-coalesce|--coalesce (-c|-C). The --coalesce
>>> change can change part of theory(1), the NIC wake-up/IRQ interval.

here are the current nic settings:

ethtool -k eth4
Features for eth4:
rx-checksumming: on
tx-checksumming: on
	tx-checksum-ipv4: off [fixed]
	tx-checksum-ip-generic: on
	tx-checksum-ipv6: off [fixed]
	tx-checksum-fcoe-crc: off [fixed]
	tx-checksum-sctp: off [fixed]
scatter-gather: on
	tx-scatter-gather: on
	tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
	tx-tcp-segmentation: off
	tx-tcp-ecn-segmentation: off [fixed]
	tx-tcp-mangleid-segmentation: off
	tx-tcp6-segmentation: off
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: on
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off
rx-all: off
tx-vlan-stag-hw-insert: on
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: on [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]

> What are the NIC ring size settings?
>
> ethtool -g eth4

ethtool -g eth4
Ring parameters for eth4:
Pre-set maximums:
RX:		8192
RX Mini:	0
RX Jumbo:	0
TX:		8192
Current hardware settings:
RX:		1024
RX Mini:	0
RX Jumbo:	0
TX:		1024

> -- 
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Principal Kernel Engineer at Red Hat
> LinkedIn: http://www.linkedin.com/in/brouer
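Given that the rings sit at 1024 of a possible 8192 while the PHY reports discards, one experiment (a sketch; larger rings absorb bursts at the cost of memory and worst-case latency, and changing them briefly resets the queue) is to grow them:

```shell
# Raise RX/TX ring sizes from 1024 toward the hardware max of 8192
ethtool -G eth4 rx 4096 tx 4096
ethtool -g eth4   # confirm the new "Current hardware settings"
```

This does not fix a CPU that wakes up too late, but it buys the host more time per wake-up before the NIC has to drop.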
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On Thu, 12 Nov 2020 12:26:20 +0100 "Thomas Rosenstein" wrote:

> > Long-shot theory(2): The NIC driver IRQ interrupt coalesce
> > adaptive-IRQ (adj via ethtool -C) is wrong. The reason it is wrong is
> > because of TSO frames, due to the packet-to-byte ratio not being taken
> > into account.
>
> here are the coalesce parameters: (they are the same for 3.10 and 5.4.60)
>
> ethtool -c eth4
> Coalesce parameters for eth4:
> Adaptive RX: on TX: on

Notice "Adaptive" setting is on. My long-shot theory(2) is that this adaptive algorithm in the driver code can guess wrong (due to not taking TSO into account) and cause issues.

Try to turn this adaptive algorithm off:

ethtool -C eth4 adaptive-rx off adaptive-tx off

> stats-block-usecs: 0
> sample-interval: 0
> pkt-rate-low: 0
> pkt-rate-high: 0
>
> rx-usecs: 32

When you turn off "adaptive-rx" you will get 31250 interrupts/sec (calc 1/(32/10^6) = 31250).

> rx-frames: 64

You likely want to reduce this.

> rx-usecs-irq: 0
> rx-frames-irq: 0
>
> tx-usecs: 8
> tx-frames: 128

These 128 frames also seem too large.

> tx-usecs-irq: 0
> tx-frames-irq: 0
> [...]
>
> > You can test theory(2) in two ways: via disabling TSO, and via
> > adjusting ethtool --show-coalesce|--coalesce (-c|-C). The --coalesce
> > change can change part of theory(1), the NIC wake-up/IRQ interval.

What are the NIC ring size settings?

ethtool -g eth4

-- 
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 12 Nov 2020, at 11:05, Jesper Dangaard Brouer wrote:

> On Mon, 09 Nov 2020 15:33:46 +0100 "Thomas Rosenstein" wrote:
>> On 9 Nov 2020, at 12:40, Jesper Dangaard Brouer wrote:
>>> On Mon, 09 Nov 2020 11:09:33 +0100 "Thomas Rosenstein" wrote:
>>>> On 9 Nov 2020, at 9:24, Jesper Dangaard Brouer wrote:
>>>>> On Sat, 07 Nov 2020 14:00:04 +0100 Thomas Rosenstein via Bloat wrote:
>>>>>> Here's an extract from the ethtool https://pastebin.com/cabpWGFz
>>>>>> just in case there's something hidden.
>>>>>
>>>>> Yes, there is something hiding in the data from ethtool_stats.pl[1]:
>>>>> (10G Mellanox Connect-X cards via 10G SFP+ DAC)
>>>>>
>>>>> stat:1 ( 1) <= outbound_pci_stalled_wr_events /sec
>>>>> stat:339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec
>>>>>
>>>>> I've not seen this counter 'rx_buffer_passed_thres_phy' before;
>>>>> looking in the kernel driver code, it is related to
>>>>> "rx_buffer_almost_full". The numbers per second are excessive (but
>>>>> it may be related to a driver bug, as it ends up reading "high" ->
>>>>> rx_buffer_almost_full_high in the extended counters).
>>
>> I have now tested with a new kernel 5.9.4, built from 3.10 with make
>> oldconfig, and I noticed an interesting effect.
>>
>> The first ca. 2 minutes the router behaves completely normally, as with
>> 3.10; after that the ping times go crazy.
>>
>> I have recorded this with ethtool, and also the ping times.
>>
>> Ethtool: (13 MB)
>> https://drive.google.com/file/d/1Ojp64UUw0zKwrgF_CisZb3BCdidAJYZo/view?usp=sharing
>>
>> The transfer first was doing around 50 - 70 MB/s, then once the ping
>> times got worse it dropped to ~12 MB/s.
>> ca. line 74324 the transfer speed drops to 12 MB/s

Let's add them here so we can see and talk about them:

74324 Ethtool(eth4) stat: 144584002 (144,584,002) <= rx_buffer_passed_thres_phy /sec
74325 Ethtool(eth4) stat:  17243186 ( 17,243,186) <= rx_bytes /sec
74326 Ethtool(eth4) stat:  19032079 ( 19,032,079) <= rx_bytes_phy /sec

Significant difference between RX-bytes PHY-level and OS-level:
17243186*8/10^6 = 138 Mbit/s

74327 Ethtool(eth4) stat:       321 (       321) <= rx_cache_busy /sec
74328 Ethtool(eth4) stat:       321 (       321) <= rx_cache_full /sec
74329 Ethtool(eth4) stat:      7323 (     7,323) <= rx_cache_reuse /sec
74330 Ethtool(eth4) stat:     15188 (    15,188) <= rx_csum_complete /sec
74331 Ethtool(eth4) stat:         1 (         1) <= rx_csum_none /sec
74332 Ethtool(eth4) stat:        96 (        96) <= rx_csum_unnecessary /sec
74333 Ethtool(eth4) stat:      1242 (     1,242) <= rx_discards_phy /sec

Packets are getting rx_discards.

74334 Ethtool(eth4) stat:     15285 (    15,285) <= rx_packets /sec
74335 Ethtool(eth4) stat:     16655 (    16,655) <= rx_packets_phy /sec

(16655-15285 = 1370) Again a difference between RX-packets PHY-level and OS-level.

74336 Ethtool(eth4) stat:  19132497 ( 19,132,497) <= rx_prio0_bytes /sec
74337 Ethtool(eth4) stat:      1264 (     1,264) <= rx_prio0_discards /sec
74338 Ethtool(eth4) stat:     15474 (    15,474) <= rx_prio0_packets /sec

I have also updated OFED on the routers now and checked with 3.10 and 5.4.60 (new firmware for the adapters). That changed absolutely nothing! I noticed that only under kernel 5.4.60 the discards happen; on 3.10 they don't. Based on that I would blame the kernel instead of the hardware. Do you agree on this?

I wouldn't exclude any settings as being problematic for now; maybe some defaults changed.

Seems you are right about rx_buffer_passed_thres_phy: if you check just those lines, they appear more often once the speed dropped. Not sure if that's the cause or an effect of the underlying problem!

Thanks for the link to the Mellanox doc:
https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters

The counter 'rx_buffer_passed_thres_phy':
Description: 'The number of events where the port receive buffer was over 85% full. Supported from kernel 4.14.'

The counter 'tx_pause_ctrl_phy':
Description: 'The number of link layer pause packets transmitted on a physical port. If this counter is increasing, it implies that the NIC is congested and cannot absorb the traffic coming from the network.'

So, the NIC is clearly telling us that something is wrong. You are not near link speed (12 MB/s => 96 Mbit/s), and still the port receive buffer is over 85% full most of the time. There are indications that the NIC is congested and cannot absorb the traffic coming from the network.

I have opened a support case with Mellanox; maybe they have an idea directly. Their first response, based on outbound_pci_stalled_wr, was that the PCI Express credits for the card run out. Which means (if I interpret it correctly) that the buffer is not emptied quickly enough anymore.

Long-shot theory(1): Could this be caused by CPUs going into too deep sleep states? And the NIC not sending the wake-up signal early enough,
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On Mon, 09 Nov 2020 15:33:46 +0100 "Thomas Rosenstein" wrote:

> On 9 Nov 2020, at 12:40, Jesper Dangaard Brouer wrote:
>
>> On Mon, 09 Nov 2020 11:09:33 +0100 "Thomas Rosenstein" wrote:
>>
>>> On 9 Nov 2020, at 9:24, Jesper Dangaard Brouer wrote:
>>>
>>>> On Sat, 07 Nov 2020 14:00:04 +0100 Thomas Rosenstein via Bloat wrote:
>>>>
>>>>> Here's an extract from the ethtool https://pastebin.com/cabpWGFz
>>>>> just in case there's something hidden.
>>>>
>>>> Yes, there is something hiding in the data from ethtool_stats.pl[1]:
>>>> (10G Mellanox Connect-X cards via 10G SFP+ DAC)
>>>>
>>>> stat:1 ( 1) <= outbound_pci_stalled_wr_events /sec
>>>> stat:339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec
>>>>
>>>> I've not seen this counter 'rx_buffer_passed_thres_phy' before;
>>>> looking in the kernel driver code, it is related to
>>>> "rx_buffer_almost_full". The numbers per second are excessive (but
>>>> it may be related to a driver bug, as it ends up reading "high" ->
>>>> rx_buffer_almost_full_high in the extended counters).
>
> I have now tested with a new kernel 5.9.4, built from 3.10 with make
> oldconfig, and I noticed an interesting effect.
>
> The first ca. 2 minutes the router behaves completely normally, as with
> 3.10; after that the ping times go crazy.
>
> I have recorded this with ethtool, and also the ping times.
>
> Ethtool: (13 MB)
> https://drive.google.com/file/d/1Ojp64UUw0zKwrgF_CisZb3BCdidAJYZo/view?usp=sharing
>
> The transfer first was doing around 50 - 70 MB/s, then once the ping
> times got worse it dropped to ~12 MB/s.
> ca. line 74324 the transfer speed drops to 12 MB/s

Let's add them here so we can see and talk about them:

74324 Ethtool(eth4) stat: 144584002 (144,584,002) <= rx_buffer_passed_thres_phy /sec
74325 Ethtool(eth4) stat:  17243186 ( 17,243,186) <= rx_bytes /sec
74326 Ethtool(eth4) stat:  19032079 ( 19,032,079) <= rx_bytes_phy /sec

Significant difference between RX-bytes PHY-level and OS-level:
17243186*8/10^6 = 138 Mbit/s

74327 Ethtool(eth4) stat:       321 (       321) <= rx_cache_busy /sec
74328 Ethtool(eth4) stat:       321 (       321) <= rx_cache_full /sec
74329 Ethtool(eth4) stat:      7323 (     7,323) <= rx_cache_reuse /sec
74330 Ethtool(eth4) stat:     15188 (    15,188) <= rx_csum_complete /sec
74331 Ethtool(eth4) stat:         1 (         1) <= rx_csum_none /sec
74332 Ethtool(eth4) stat:        96 (        96) <= rx_csum_unnecessary /sec
74333 Ethtool(eth4) stat:      1242 (     1,242) <= rx_discards_phy /sec

Packets are getting rx_discards.

74334 Ethtool(eth4) stat:     15285 (    15,285) <= rx_packets /sec
74335 Ethtool(eth4) stat:     16655 (    16,655) <= rx_packets_phy /sec

(16655-15285 = 1370) Again a difference between RX-packets PHY-level and OS-level.

74336 Ethtool(eth4) stat:  19132497 ( 19,132,497) <= rx_prio0_bytes /sec
74337 Ethtool(eth4) stat:      1264 (     1,264) <= rx_prio0_discards /sec
74338 Ethtool(eth4) stat:     15474 (    15,474) <= rx_prio0_packets /sec

> Seems you are right about rx_buffer_passed_thres_phy: if you check just
> those lines, they appear more often once the speed dropped.
> Not sure if that's the cause or an effect of the underlying problem!

Thanks for the link to the Mellanox doc:
https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters

The counter 'rx_buffer_passed_thres_phy':
Description: 'The number of events where the port receive buffer was over 85% full. Supported from kernel 4.14.'

The counter 'tx_pause_ctrl_phy':
Description: 'The number of link layer pause packets transmitted on a physical port. If this counter is increasing, it implies that the NIC is congested and cannot absorb the traffic coming from the network.'

So, the NIC is clearly telling us that something is wrong. You are not near link speed (12 MB/s => 96 Mbit/s), and still the port receive buffer is over 85% full most of the time. There are indications that the NIC is congested and cannot absorb the traffic coming from the network.

Long-shot theory(1): Could this be caused by CPUs going into too deep sleep states? And the NIC not sending the wake-up signal early enough, e.g. because it holds on to packets for generating a bulk at IRQ raise.

Long-shot theory(2): The NIC driver IRQ interrupt coalesce adaptive-IRQ (adj via ethtool -C) is wrong. The reason it is wrong is because of TSO frames, due to the packet-to-byte ratio not being taken into account.

You can test theory(2) in two ways: via disabling TSO, and via adjusting ethtool --show-coalesce|--coalesce (-c|-C). The --coalesce change can change part of theory(1), the NIC wake-up/IRQ interval.

For CPU sleep modes see the software package 'tuned'.

> Pings:
> https://drive.google.com/file/d/16phOxM5IFU
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 9 Nov 2020, at 12:40, Jesper Dangaard Brouer wrote:

> On Mon, 09 Nov 2020 11:09:33 +0100 "Thomas Rosenstein" wrote:
>
> Could you also provide ethtool_stats for the TX interface? Notice that
> the tool[1] ethtool_stats.pl supports monitoring several interfaces at
> the same time, e.g. run:
>
> ethtool_stats.pl --sec 3 --dev eth4 --dev ethTX
>
> And provide output as pastebin.

I have now also repeated the same test with 3.10; here are the ethtool outputs:
https://drive.google.com/file/d/1c98MVV0JYl6Su6xZTpqwS7m-6OlbmAFp/view?usp=sharing
and the ping times:
https://drive.google.com/file/d/1xhbGJHb5jUbPsee4frbx-c-uqh-7orXY/view?usp=sharing

Sadly the parameters we were looking at are not supported below 4.14, but I immediately saw one thing very different:

ethtool --statistics eth4 | grep discards
rx_discards_phy: 0
tx_discards_phy: 0

If we check the ethtool output from 5.9.4 we have:

rx_discards_phy: 151793

And also the outbound_pci_stalled_wr_events get more frequent the lower the total bandwidth / the higher the ping is.

Logically there must be something blocking the buffers: either they are not getting freed, or not rotated correctly, or processing is too slow. I would exclude the processing, simply based on 0% CPU load, and also that it doesn't happen on 3.10.

Suspicious is also that the issue only appears after a certain time of activity (maybe total traffic?!)
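To watch whether rx_discards_phy keeps climbing during a test run, a small parser helps. A sketch: `get_stat` just pulls one named counter out of `ethtool -S`-style text, demonstrated here on a canned sample rather than live hardware:

```shell
# Print the value of one counter from `ethtool -S`-style output on stdin.
get_stat() { awk -v k="$1:" '$1 == k { print $2 }'; }

sample='     rx_discards_phy: 151793
     tx_discards_phy: 0'
printf '%s\n' "$sample" | get_stat rx_discards_phy   # prints 151793

# Live use against the real NIC (interface name assumed):
#   while sleep 1; do ethtool -S eth4 | get_stat rx_discards_phy; done
```

A counter that only grows while the latency is bad, and freezes when pings recover, ties the discards directly to the stall.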
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 9 Nov 2020, at 12:40, Jesper Dangaard Brouer wrote:

> On Mon, 09 Nov 2020 11:09:33 +0100 "Thomas Rosenstein" wrote:
>
>> On 9 Nov 2020, at 9:24, Jesper Dangaard Brouer wrote:
>>
>>> On Sat, 07 Nov 2020 14:00:04 +0100 Thomas Rosenstein via Bloat wrote:
>>>
>>>> Here's an extract from the ethtool https://pastebin.com/cabpWGFz just
>>>> in case there's something hidden.
>>>
>>> Yes, there is something hiding in the data from ethtool_stats.pl[1]:
>>> (10G Mellanox Connect-X cards via 10G SFP+ DAC)
>>>
>>> stat:1 ( 1) <= outbound_pci_stalled_wr_events /sec
>>> stat:339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec
>>>
>>> I've not seen this counter 'rx_buffer_passed_thres_phy' before;
>>> looking in the kernel driver code, it is related to
>>> "rx_buffer_almost_full". The numbers per second are excessive (but it
>>> may be related to a driver bug, as it ends up reading "high" ->
>>> rx_buffer_almost_full_high in the extended counters).

I have now tested with a new kernel 5.9.4, built from 3.10 with make oldconfig, and I noticed an interesting effect.

The first ca. 2 minutes the router behaves completely normally, as with 3.10; after that the ping times go crazy.

I have recorded this with ethtool, and also the ping times.

Ethtool: (13 MB)
https://drive.google.com/file/d/1Ojp64UUw0zKwrgF_CisZb3BCdidAJYZo/view?usp=sharing

The transfer first was doing around 50 - 70 MB/s, then once the ping times got worse it dropped to ~12 MB/s.
ca. line 74324 the transfer speed drops to 12 MB/s

Seems you are right about rx_buffer_passed_thres_phy: if you check just those lines, they appear more often once the speed dropped. Not sure if that's the cause or an effect of the underlying problem!

Pings:
https://drive.google.com/file/d/16phOxM5IFU6RAl4Ua4pRqMNuLYBc4RK7/view?usp=sharing

Pause frames were activated again after the restart.

(Here a link for reference for the ethtool variables: https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters)
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 9 Nov 2020, at 12:40, Jesper Dangaard Brouer wrote: On Mon, 09 Nov 2020 11:09:33 +0100 "Thomas Rosenstein" wrote: On 9 Nov 2020, at 9:24, Jesper Dangaard Brouer wrote: On Sat, 07 Nov 2020 14:00:04 +0100 Thomas Rosenstein via Bloat wrote: Here's an extract from the ethtool https://pastebin.com/cabpWGFz just in case there's something hidden. Yes, there is something hiding in the data from ethtool_stats.pl[1]: (10G Mellanox Connect-X cards via 10G SPF+ DAC) stat:1 ( 1) <= outbound_pci_stalled_wr_events /sec stat:339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec I've not seen this counter 'rx_buffer_passed_thres_phy' before; looking in the kernel driver code, it is related to "rx_buffer_almost_full". The per-second numbers are excessive (but it may be related to a driver bug, as it ends up reading "high" -> rx_buffer_almost_full_high in the extended counters). Notice this indication is a strong red flag that something is wrong. Okay, but as this is a router you also need to transmit this (asymmetric) traffic out another interface, right? The asymmetric traffic comes back on another router: this is router-02; traffic from the internet comes back on router-01. I also added the interface names. See the updated diagram: https://drive.google.com/file/d/15oAsxiNfsbjB9a855Q_dh6YvFZBDdY5I/view?usp=sharing Could you also provide ethtool_stats for the TX interface? Notice that the tool[1] ethtool_stats.pl supports monitoring several interfaces at the same time, e.g. run: ethtool_stats.pl --sec 3 --dev eth4 --dev ethTX And provide the output as a pastebin. I have disabled pause control, like Toke said, via: ethtool -A eth4 autoneg off rx off tx off ethtool -A eth5 autoneg off rx off tx off Afterwards an ethtool output, first "without" traffic for a few seconds, then with the problematic flow. 
Since the output is > 512KB I had to upload it to gdrive: https://drive.google.com/file/d/1EVKt1LseaBuD40QE-SqFvqYSeWUEcGA_/view?usp=sharing [1] https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl Strange size distribution: stat: 19922 ( 19,922) <= rx_1519_to_2047_bytes_phy /sec stat:14 ( 14) <= rx_65_to_127_bytes_phy /sec I assume it's because of the VLAN tagging, and therefore 1522 bytes per packet with an MTU of 1500? -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
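The VLAN guess above checks out arithmetically. As a sketch (the assumption here is that the PHY byte counters include the 4-byte FCS, which is common for NIC hardware counters but not confirmed for this driver):

```python
# Sanity-check of the frame-size guess above. Whether the PHY counters
# include the 4-byte FCS is an assumption, not verified against mlx5.
MTU = 1500       # IP payload per packet
ETH_HDR = 14     # dst MAC + src MAC + EtherType
VLAN_TAG = 4     # 802.1Q tag
FCS = 4          # frame check sequence

frame_on_wire = MTU + ETH_HDR + VLAN_TAG + FCS
print(frame_on_wire)                    # 1522
print(1519 <= frame_on_wire <= 2047)    # lands in the rx_1519_to_2047 bucket
```

So full-MTU packets on a tagged VLAN would indeed all land in the rx_1519_to_2047_bytes_phy bucket rather than rx_1024_to_1518.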
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
>> Also, it's TX, and we are only doing RX, as I said already somewhere, >> it's async routing, so the TX data comes via another router back. > > Okay, but as this is a router you also need to transmit this > (asymmetric) traffic out another interface, right? > > Could you also provide ethtool_stats for the TX interface? > > Notice that the tool[1] ethtool_stats.pl supports monitoring several > interfaces at the same time, e.g. run: > > ethtool_stats.pl --sec 3 --dev eth4 --dev ethTX > > And provide the output as a pastebin. Also, from your drawing, everything goes through the same switch, right? And since pause frames are involved... Maybe it's the switch being overwhelmed? In general, I find that pause frames give more problems than they solve, so I always turn them off... -Toke
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On Mon, 09 Nov 2020 11:09:33 +0100 "Thomas Rosenstein" wrote: > On 9 Nov 2020, at 9:24, Jesper Dangaard Brouer wrote: > > > On Sat, 07 Nov 2020 14:00:04 +0100 > > Thomas Rosenstein via Bloat wrote: > > > >> Here's an extract from the ethtool https://pastebin.com/cabpWGFz just > >> in > >> case there's something hidden. > > > > Yes, there is something hiding in the data from ethtool_stats.pl[1]: > > (10G Mellanox Connect-X cards via 10G SPF+ DAC) > > > > stat:1 ( 1) <= outbound_pci_stalled_wr_events /sec > > stat:339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec > > > > I've not seen this counter 'rx_buffer_passed_thres_phy' before; looking > > in the kernel driver code, it is related to "rx_buffer_almost_full". > > The per-second numbers are excessive (but it may be related to a driver bug, > > as it ends up reading "high" -> rx_buffer_almost_full_high in the > > extended counters). Notice this indication is a strong red flag that something is wrong. > > stat: 29583661 ( 29,583,661) <= rx_bytes /sec > > stat: 30343677 ( 30,343,677) <= rx_bytes_phy /sec > > > > You are receiving 236 Mbit/s on a 10Gbit/s link. There is a > > difference between what the OS sees (rx_bytes) and what the NIC > > hardware sees (rx_bytes_phy) (diff approx 6Mbit/s). > > > > stat:19552 ( 19,552) <= rx_packets /sec > > stat:19950 ( 19,950) <= rx_packets_phy /sec > > Could these packets be from VLAN interfaces that are not used in the OS? > > > > > The RX packet counters above also indicate the HW is seeing more packets than > > the OS is receiving. > > > > The next counters are likely your problem: > > > > stat: 718 (718) <= tx_global_pause /sec > > stat: 954035 (954,035) <= tx_global_pause_duration /sec > > stat: 714 (714) <= tx_pause_ctrl_phy /sec > > As far as I can see that's only the TX, and we are only doing RX on this > interface - so maybe that's irrelevant? > > > > > It looks like you have enabled Ethernet Flow-Control, and something is > > causing pause frames to be generated. 
It seems strange that this > > happens on a 10Gbit/s link with only 236 Mbit/s. > > > > The TX byte counters are also very strange: > > > > stat:26063 ( 26,063) <= tx_bytes /sec > > stat:71950 ( 71,950) <= tx_bytes_phy /sec > > Also, it's TX, and we are only doing RX, as I said already somewhere, > it's async routing, so the TX data comes via another router back. Okay, but as this is a router you also need to transmit this (asymmetric) traffic out another interface, right? Could you also provide ethtool_stats for the TX interface? Notice that the tool[1] ethtool_stats.pl supports monitoring several interfaces at the same time, e.g. run: ethtool_stats.pl --sec 3 --dev eth4 --dev ethTX And provide the output as a pastebin. > > [1] > > https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl > > > > Strange size distribution: > > stat: 19922 ( 19,922) <= rx_1519_to_2047_bytes_phy /sec > > stat:14 ( 14) <= rx_65_to_127_bytes_phy /sec > -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 9 Nov 2020, at 9:24, Jesper Dangaard Brouer wrote: On Sat, 07 Nov 2020 14:00:04 +0100 Thomas Rosenstein via Bloat wrote: Here's an extract from the ethtool https://pastebin.com/cabpWGFz just in case there's something hidden. Yes, there is something hiding in the data from ethtool_stats.pl[1]: (10G Mellanox Connect-X cards via 10G SPF+ DAC) stat:1 ( 1) <= outbound_pci_stalled_wr_events /sec stat:339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec I've not seen this counter 'rx_buffer_passed_thres_phy' before; looking in the kernel driver code, it is related to "rx_buffer_almost_full". The per-second numbers are excessive (but it may be related to a driver bug, as it ends up reading "high" -> rx_buffer_almost_full_high in the extended counters). stat: 29583661 ( 29,583,661) <= rx_bytes /sec stat: 30343677 ( 30,343,677) <= rx_bytes_phy /sec You are receiving 236 Mbit/s on a 10Gbit/s link. There is a difference between what the OS sees (rx_bytes) and what the NIC hardware sees (rx_bytes_phy) (diff approx 6Mbit/s). stat:19552 ( 19,552) <= rx_packets /sec stat:19950 ( 19,950) <= rx_packets_phy /sec Could these packets be from VLAN interfaces that are not used in the OS? The RX packet counters above also indicate the HW is seeing more packets than the OS is receiving. The next counters are likely your problem: stat: 718 (718) <= tx_global_pause /sec stat: 954035 (954,035) <= tx_global_pause_duration /sec stat: 714 (714) <= tx_pause_ctrl_phy /sec As far as I can see that's only the TX, and we are only doing RX on this interface - so maybe that's irrelevant? It looks like you have enabled Ethernet Flow-Control, and something is causing pause frames to be generated. It seems strange that this happens on a 10Gbit/s link with only 236 Mbit/s. 
The TX byte counters are also very strange: stat:26063 ( 26,063) <= tx_bytes /sec stat:71950 ( 71,950) <= tx_bytes_phy /sec Also, it's TX, and we are only doing RX, as I said already somewhere, it's async routing, so the TX data comes via another router back. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer [1] https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl Strange size distribution: stat: 19922 ( 19,922) <= rx_1519_to_2047_bytes_phy /sec stat:14 ( 14) <= rx_65_to_127_bytes_phy /sec
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On Sat, 07 Nov 2020 14:00:04 +0100 Thomas Rosenstein via Bloat wrote: > Here's an extract from the ethtool https://pastebin.com/cabpWGFz just in > case there's something hidden. Yes, there is something hiding in the data from ethtool_stats.pl[1]: (10G Mellanox Connect-X cards via 10G SPF+ DAC) stat:1 ( 1) <= outbound_pci_stalled_wr_events /sec stat:339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec I've not seen this counter 'rx_buffer_passed_thres_phy' before; looking in the kernel driver code, it is related to "rx_buffer_almost_full". The per-second numbers are excessive (but it may be related to a driver bug, as it ends up reading "high" -> rx_buffer_almost_full_high in the extended counters). stat: 29583661 ( 29,583,661) <= rx_bytes /sec stat: 30343677 ( 30,343,677) <= rx_bytes_phy /sec You are receiving 236 Mbit/s on a 10Gbit/s link. There is a difference between what the OS sees (rx_bytes) and what the NIC hardware sees (rx_bytes_phy) (diff approx 6Mbit/s). stat:19552 ( 19,552) <= rx_packets /sec stat:19950 ( 19,950) <= rx_packets_phy /sec The RX packet counters above also indicate the HW is seeing more packets than the OS is receiving. The next counters are likely your problem: stat: 718 (718) <= tx_global_pause /sec stat: 954035 (954,035) <= tx_global_pause_duration /sec stat: 714 (714) <= tx_pause_ctrl_phy /sec It looks like you have enabled Ethernet Flow-Control, and something is causing pause frames to be generated. It seems strange that this happens on a 10Gbit/s link with only 236 Mbit/s. 
The TX byte counters are also very strange: stat:26063 ( 26,063) <= tx_bytes /sec stat:71950 ( 71,950) <= tx_bytes_phy /sec -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer [1] https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl Strange size distribution: stat: 19922 ( 19,922) <= rx_1519_to_2047_bytes_phy /sec stat:14 ( 14) <= rx_65_to_127_bytes_phy /sec
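A quick cross-check of the counter readings quoted above. This is only a sketch: the pause-duration part assumes tx_global_pause_duration is expressed in IEEE 802.3 pause quanta (512 bit-times), which is not verified against the mlx5 driver, so treat that number as indicative only.

```python
# Cross-check the ethtool counter readings quoted in the message above.
rx_bytes_per_sec = 29_583_661       # what the OS sees (rx_bytes)
rx_bytes_phy_per_sec = 30_343_677   # what the NIC PHY sees (rx_bytes_phy)

os_mbit = rx_bytes_per_sec * 8 / 1e6
gap_mbit = (rx_bytes_phy_per_sec - rx_bytes_per_sec) * 8 / 1e6
print(round(os_mbit))    # ~237, matching the "236 Mbit/s" estimate
print(round(gap_mbit))   # ~6 Mbit/s PHY-vs-OS gap, as noted

# ASSUMPTION: tx_global_pause_duration counts IEEE pause quanta
# (512 bit-times). At 10 Gbit/s one quantum is 51.2 ns.
quantum_sec = 512 / 10e9
paused = 954_035 * quantum_sec            # seconds paused per second
print(f"{paused * 100:.1f}% of the time paused")
```

Even under that unit assumption, roughly 5% of every second spent paused on a link carrying only ~2% of its capacity supports the "something is generating pause frames" diagnosis.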
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
Hi Thomas, Probably not very helpful, but iputils' ping has the -D option that is described as: -D Print timestamp (unix time + microseconds as in gettimeofday) before each line. This could potentially be translated into kernel time to pinpoint events? Not that I would know how to translate the times here, or whether perf and friends actually store timestamps. Best Regards Sebastian > On Nov 7, 2020, at 18:01, Thomas Rosenstein via Bloat > wrote: > > > > On 7 Nov 2020, at 17:46, Jesper Dangaard Brouer wrote: > >> >> Did the latency issue happen during this perf recording? >> (it is obviously important that you record during the issue) > > yes it was recorded during the transfer, while it exhibited the issue > > I let the transfer run some time, then did the perf extract, stopped the > transfer afterwards. > >> >> Finding a single latency event/issue in perf.data is like looking >> for a needle in a haystack. It does give hints and e.g. tell me details >> like what parts of the kernel are used/activated. >> >> I can see that you have likely configured XPS (Transmit Packet >> Steering). Is this on purpose? > > Not on purpose, I assume it comes from the Mellanox Card. > >> >> -- >> Best regards, >> Jesper Dangaard Brouer >> MSc.CS, Principal Kernel Engineer at Red Hat >> LinkedIn: http://www.linkedin.com/in/brouer
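To make Sebastian's suggestion concrete, a ping -D line can be parsed and converted to a wall-clock instant that you can line up against perf or trace timestamps. A minimal sketch (the timestamp value in the sample line is made up for illustration; only the line format follows the iputils -D documentation quoted above):

```python
# Parse a ping -D line: "[unixtime.microseconds] 64 bytes from ... time=X ms"
import re
from datetime import datetime, timezone

line = "[1604759020.123456] 64 bytes from 10.10.0.69: icmp_seq=5 ttl=63 time=52.7 ms"
m = re.match(r"\[(\d+\.\d+)\].*time=([\d.]+) ms", line)
ts, rtt = float(m.group(1)), float(m.group(2))

# Convert the epoch timestamp to a readable UTC instant, so a latency
# spike can be correlated with other time-stamped system events.
when = datetime.fromtimestamp(ts, tz=timezone.utc)
print(when.isoformat(), rtt)
```

Matching these instants against kernel trace clocks (which often use CLOCK_MONOTONIC rather than wall-clock time) would still need an extra offset conversion, as Sebastian notes.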
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 7 Nov 2020, at 17:46, Jesper Dangaard Brouer wrote: Did the latency issue happen during this perf recording? (it is obviously important that you record during the issue) yes it was recorded during the transfer, while it exhibited the issue I let the transfer run some time, then did the perf extract, stopped the transfer afterwards. Finding a single latency event/issue in perf.data is like looking for a needle in a haystack. It does give hints and e.g. tell me details like what parts of the kernel are used/activated. I can see that you have likely configured XPS (Transmit Packet Steering). Is this on purpose? Not on purpose, I assume it comes from the Mellanox Card. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On Sat, 07 Nov 2020 13:37:01 +0100 "Thomas Rosenstein" wrote: > On 6 Nov 2020, at 21:19, Jesper Dangaard Brouer wrote: > > > On Fri, 06 Nov 2020 18:04:49 +0100 > > "Thomas Rosenstein" wrote: > > > >> On 6 Nov 2020, at 15:13, Jesper Dangaard Brouer wrote: > >> > >> [...] > > > >>> Have you tried to use 'perf record' to observe what is happening on > >>> the system while these latency incidents happen? (let me know if > >>> you want some cmdline hints) > >> > >> Haven't tried this yet. If you have some hints what events to monitor > >> I'll take them! > > > > Okay, to record everything (-a) on the system and save the call-graph (-g), > > run for 5 seconds (via profiling the sleep function): > > > > # perf record -g -a sleep 5 > > > > To view the result, simply use 'perf report', but likely you > > want to use the option --no-children as you are profiling the kernel (and > > not a userspace program you want to have grouped 'children' by). I > > also include the CPU column via '--sort cpu,comm,dso,symbol' and you > > can select/zoom-in-on a specific CPU via '-C zero-indexed-cpu-num'. > > > > # perf report --sort cpu,comm,dso,symbol --no-children > > > > When we ask you to provide the output, you can use the --stdio option, > > and provide the txt-info via a pastebin link as it is very long. > > Here is the output from kernel 3.10_1127 (I updated to the really newest > in that branch): https://pastebin.com/5mxirXPw > Here is the output from kernel 5.9.4: https://pastebin.com/KDZ2Ei2F Did the latency issue happen during this perf recording? (it is obviously important that you record during the issue) Finding a single latency event/issue in perf.data is like looking for a needle in a haystack. It does give hints and e.g. tell me details like what parts of the kernel are used/activated. I can see that you have likely configured XPS (Transmit Packet Steering). Is this on purpose? 
-- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 7 Nov 2020, at 13:37, Thomas Rosenstein wrote: I have also tried to reproduce the issue with the kernel on a virtual hyper-v machine; there I don't have any adverse effects. But it's not 100% the same, since MASQ happens on it ... will restructure a bit to get a similar representation. I reproduced something similar now; for the VM it only happens at 120 threads, but every 5 or so seconds there is one packet that's delayed heavily:

PING 10.10.0.69 (10.10.0.69) 56(84) bytes of data.
64 bytes from 10.10.0.69: icmp_seq=1 ttl=63 time=0.273 ms
64 bytes from 10.10.0.69: icmp_seq=2 ttl=63 time=0.260 ms
64 bytes from 10.10.0.69: icmp_seq=3 ttl=63 time=0.621 ms
64 bytes from 10.10.0.69: icmp_seq=4 ttl=63 time=0.218 ms
64 bytes from 10.10.0.69: icmp_seq=5 ttl=63 time=52.7 ms
64 bytes from 10.10.0.69: icmp_seq=6 ttl=63 time=0.261 ms
64 bytes from 10.10.0.69: icmp_seq=7 ttl=63 time=0.381 ms
64 bytes from 10.10.0.69: icmp_seq=8 ttl=63 time=0.230 ms
64 bytes from 10.10.0.69: icmp_seq=9 ttl=63 time=0.378 ms
64 bytes from 10.10.0.69: icmp_seq=10 ttl=63 time=0.268 ms
64 bytes from 10.10.0.69: icmp_seq=11 ttl=63 time=0.313 ms
64 bytes from 10.10.0.69: icmp_seq=12 ttl=63 time=0.395 ms
64 bytes from 10.10.0.69: icmp_seq=13 ttl=63 time=0.269 ms
64 bytes from 10.10.0.69: icmp_seq=14 ttl=63 time=44.1 ms
64 bytes from 10.10.0.69: icmp_seq=15 ttl=63 time=0.279 ms
64 bytes from 10.10.0.69: icmp_seq=16 ttl=63 time=0.406 ms
64 bytes from 10.10.0.69: icmp_seq=17 ttl=63 time=0.245 ms
64 bytes from 10.10.0.69: icmp_seq=18 ttl=63 time=0.202 ms
64 bytes from 10.10.0.69: icmp_seq=19 ttl=63 time=0.229 ms
64 bytes from 10.10.0.69: icmp_seq=20 ttl=63 time=0.357 ms
64 bytes from 10.10.0.69: icmp_seq=21 ttl=63 time=0.462 ms
64 bytes from 10.10.0.69: icmp_seq=22 ttl=63 time=0.300 ms
64 bytes from 10.10.0.69: icmp_seq=23 ttl=63 time=60.3 ms
64 bytes from 10.10.0.69: icmp_seq=24 ttl=63 time=0.200 ms
64 bytes from 10.10.0.69: icmp_seq=25 ttl=63 time=15.8 ms
64 bytes from 10.10.0.69: icmp_seq=26 ttl=63 time=0.296 ms
64 bytes from 10.10.0.69: icmp_seq=27 ttl=63 time=0.273 ms
64 bytes from 10.10.0.69: icmp_seq=28 ttl=63 time=0.591 ms
64 bytes from 10.10.0.69: icmp_seq=29 ttl=63 time=0.583 ms

during that time the interface is loaded with 40 Mbits. This does not appear with the 3.10 kernel:

PING 10.10.0.69 (10.10.0.69) 56(84) bytes of data.
64 bytes from 10.10.0.69: icmp_seq=1 ttl=63 time=0.322 ms
64 bytes from 10.10.0.69: icmp_seq=2 ttl=63 time=0.242 ms
64 bytes from 10.10.0.69: icmp_seq=3 ttl=63 time=0.461 ms
64 bytes from 10.10.0.69: icmp_seq=4 ttl=63 time=0.244 ms
64 bytes from 10.10.0.69: icmp_seq=5 ttl=63 time=0.429 ms
64 bytes from 10.10.0.69: icmp_seq=6 ttl=63 time=0.268 ms
64 bytes from 10.10.0.69: icmp_seq=7 ttl=63 time=0.540 ms
64 bytes from 10.10.0.69: icmp_seq=8 ttl=63 time=0.465 ms
64 bytes from 10.10.0.69: icmp_seq=9 ttl=63 time=0.237 ms
64 bytes from 10.10.0.69: icmp_seq=10 ttl=63 time=0.216 ms
64 bytes from 10.10.0.69: icmp_seq=11 ttl=63 time=0.248 ms
64 bytes from 10.10.0.69: icmp_seq=12 ttl=63 time=0.247 ms
64 bytes from 10.10.0.69: icmp_seq=13 ttl=63 time=0.282 ms
64 bytes from 10.10.0.69: icmp_seq=14 ttl=63 time=0.256 ms
64 bytes from 10.10.0.69: icmp_seq=15 ttl=63 time=0.268 ms
64 bytes from 10.10.0.69: icmp_seq=16 ttl=63 time=0.272 ms
64 bytes from 10.10.0.69: icmp_seq=17 ttl=63 time=0.236 ms
64 bytes from 10.10.0.69: icmp_seq=18 ttl=63 time=0.287 ms
64 bytes from 10.10.0.69: icmp_seq=19 ttl=63 time=0.288 ms
64 bytes from 10.10.0.69: icmp_seq=20 ttl=63 time=0.297 ms
64 bytes from 10.10.0.69: icmp_seq=21 ttl=63 time=0.319 ms
64 bytes from 10.10.0.69: icmp_seq=22 ttl=63 time=0.294 ms
64 bytes from 10.10.0.69: icmp_seq=23 ttl=63 time=0.313 ms
64 bytes from 10.10.0.69: icmp_seq=24 ttl=63 time=0.208 ms
64 bytes from 10.10.0.69: icmp_seq=25 ttl=63 time=0.222 ms
64 bytes from 10.10.0.69: icmp_seq=26 ttl=63 time=0.279 ms
64 bytes from 10.10.0.69: icmp_seq=27 ttl=63 time=0.257 ms
64 bytes from 10.10.0.69: icmp_seq=28 ttl=63 time=0.266 ms
64 bytes from 10.10.0.69: icmp_seq=29 ttl=63 time=0.398 ms
64 bytes from 10.10.0.69: icmp_seq=30 ttl=63 time=0.281 ms
64 bytes from 10.10.0.69: icmp_seq=31 ttl=63 time=0.263 ms
64 bytes from 10.10.0.69: icmp_seq=32 ttl=63 time=0.276 ms

might be a hint that it's actually reproducible.
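Eyeballing long ping dumps like the ones above gets tedious; the spikes can be extracted programmatically. A small sketch (the 5 ms threshold is arbitrary, and the sample input is an abridged excerpt of the first dump):

```python
# Pull RTTs out of ping output and flag outliers above a threshold.
import re

ping_output = """\
64 bytes from 10.10.0.69: icmp_seq=4 ttl=63 time=0.218 ms
64 bytes from 10.10.0.69: icmp_seq=5 ttl=63 time=52.7 ms
64 bytes from 10.10.0.69: icmp_seq=6 ttl=63 time=0.261 ms
64 bytes from 10.10.0.69: icmp_seq=14 ttl=63 time=44.1 ms
"""

rtts = {int(m.group(1)): float(m.group(2))
        for m in re.finditer(r"icmp_seq=(\d+) ttl=\d+ time=([\d.]+) ms",
                             ping_output)}
outliers = {seq: rtt for seq, rtt in rtts.items() if rtt > 5.0}
print(outliers)  # {5: 52.7, 14: 44.1}
```

Run against the full 5.9 dump this flags sequence numbers 5, 14, 23 and 25; against the 3.10 dump it flags nothing, matching the observation that the spikes only appear on the newer kernel.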
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 7 Nov 2020, at 13:40, Jan Ceuleers wrote: On 07/11/2020 13:37, Thomas Rosenstein via Bloat wrote: Nonetheless, I tested it and no difference :) The relevant test would be to run your experiment while IPv6 is completely disabled in the kernel. Tested with ipv6 fully off (via ipv6.disable=1), no difference! Here's an extract from the ethtool https://pastebin.com/cabpWGFz just in case there's something hidden.
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 7 Nov 2020, at 13:40, Jan Ceuleers wrote: > On 07/11/2020 13:37, Thomas Rosenstein via Bloat wrote: >> Nonetheless, I tested it and no difference :) > > The relevant test would be to run your experiment while IPv6 is > completely disabled in the kernel. sure, I'll try!
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 07/11/2020 13:37, Thomas Rosenstein via Bloat wrote: > Nonetheless, I tested it and no difference :) The relevant test would be to run your experiment while IPv6 is completely disabled in the kernel.
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 6 Nov 2020, at 21:19, Jesper Dangaard Brouer wrote: On Fri, 06 Nov 2020 18:04:49 +0100 "Thomas Rosenstein" wrote: On 6 Nov 2020, at 15:13, Jesper Dangaard Brouer wrote: I'm using ping on IPv4, but I'll try to see if IPv6 makes any difference! I think you misunderstand me. I'm not asking you to use ping6. The gobgpd daemon updates will update both IPv4 and IPv6 routes, right? Updating IPv6 routes is more problematic than IPv4 routes. The IPv6 route table update can potentially stall softirq from running, which is what the latency tool was measuring... and it did show some outliers. Yes I did; I assumed the latency would be introduced in the traffic path by the lock. Nonetheless, I tested it and no difference :) Have you tried to use 'perf record' to observe what is happening on the system while these latency incidents happen? (let me know if you want some cmdline hints) Haven't tried this yet. If you have some hints what events to monitor I'll take them! Okay, to record everything (-a) on the system and save the call-graph (-g), run for 5 seconds (via profiling the sleep function): # perf record -g -a sleep 5 To view the result, simply use 'perf report', but likely you want to use the option --no-children as you are profiling the kernel (and not a userspace program you want to have grouped 'children' by). I also include the CPU column via '--sort cpu,comm,dso,symbol' and you can select/zoom-in-on a specific CPU via '-C zero-indexed-cpu-num'. # perf report --sort cpu,comm,dso,symbol --no-children When we ask you to provide the output, you can use the --stdio option, and provide the txt-info via a pastebin link as it is very long. Here is the output from kernel 3.10_1127 (I updated to the really newest in that branch): https://pastebin.com/5mxirXPw Here is the output from kernel 5.9.4: https://pastebin.com/KDZ2Ei2F I have noticed that the delays are directly related to the traffic flows, see below. 
These tests are WITHOUT gobgpd running, so no updates to the route table, but the route tables are fully populated. Also, it's ONLY outgoing traffic; the return packets are coming in on another router. I have then cleared the routing tables, and the issue persists; the table has only 78 entries. 40 threads -> sometimes higher rtt times: https://pastebin.com/Y9nd0h4h 60 threads -> always high rtt times: https://pastebin.com/JFvhtLrH So it definitely gets worse the more connections there are. I have also tried to reproduce the issue with the kernel on a virtual hyper-v machine; there I don't have any adverse effects. But it's not 100% the same, since MASQ happens on it ... will restructure a bit to get a similar representation. I also suspected that -j NOTRACK would be an issue, removed that too, no change. (it's asymmetric routing anyway) Additionally I have quit all applications except for sshd, no change! -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On Fri, 06 Nov 2020 18:04:49 +0100 "Thomas Rosenstein" wrote: > On 6 Nov 2020, at 15:13, Jesper Dangaard Brouer wrote: > > > On Fri, 6 Nov 2020 13:53:58 +0100 > > Jesper Dangaard Brouer wrote: > > > >> [...] > > Could this be related to netlink? I have gobgpd running on these > routers, which injects routes via netlink. > But the churn rate during the tests is very minimal, maybe 30 - 40 > routes every second. > >> > >> Yes, this could be related. The internal data-structure for FIB > >> lookups is a fib_trie, which is a compressed patricia tree, related to > >> the radix tree idea. Thus, I can imagine that the kernel has to > >> rebuild/rebalance the tree with all these updates. > > > > Reading the kernel code: the IPv4 fib_trie code is very well tuned, > > fully RCU-ified, meaning the read side is lock-free. The resize() > > function code in net/ipv4/fib_trie.c has a max_work limiter to avoid it > > using too much time. And the update looks lock-free. > > > > The IPv6 update looks more scary, as it seems to take a "bh" spinlock > > that can block softirq from running code in net/ipv6/ip6_fib.c > > (spin_lock_bh(&f6i->fib6_table->tb6_lock)). > > I'm using ping on IPv4, but I'll try to see if IPv6 makes any > difference! I think you misunderstand me. I'm not asking you to use ping6. The gobgpd daemon updates will update both IPv4 and IPv6 routes, right? Updating IPv6 routes is more problematic than IPv4 routes. The IPv6 route table update can potentially stall softirq from running, which is what the latency tool was measuring... and it did show some outliers. > > Have you tried to use 'perf record' to observe what is happening on > > the system while these latency incidents happen? (let me know if you > > want some cmdline hints) > > Haven't tried this yet. If you have some hints what events to monitor > I'll take them! Okay, to record everything (-a) on the system and save the call-graph (-g), run for 5 seconds (via profiling the sleep function): 
# perf record -g -a sleep 5 To view the result, simply use 'perf report', but likely you want to use the option --no-children as you are profiling the kernel (and not a userspace program you want to have grouped 'children' by). I also include the CPU column via '--sort cpu,comm,dso,symbol' and you can select/zoom-in-on a specific CPU via '-C zero-indexed-cpu-num'. # perf report --sort cpu,comm,dso,symbol --no-children When we ask you to provide the output, you can use the --stdio option, and provide the txt-info via a pastebin link as it is very long. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 6 Nov 2020, at 15:13, Jesper Dangaard Brouer wrote: On Fri, 6 Nov 2020 13:53:58 +0100 Jesper Dangaard Brouer wrote: [...] Could this be related to netlink? I have gobgpd running on these routers, which injects routes via netlink. But the churn rate during the tests is very minimal, maybe 30 - 40 routes every second. Yes, this could be related. The internal data-structure for FIB lookups is a fib_trie, which is a compressed patricia tree, related to the radix tree idea. Thus, I can imagine that the kernel has to rebuild/rebalance the tree with all these updates. Reading the kernel code: the IPv4 fib_trie code is very well tuned, fully RCU-ified, meaning the read side is lock-free. The resize() function code in net/ipv4/fib_trie.c has a max_work limiter to avoid it using too much time. And the update looks lock-free. The IPv6 update looks more scary, as it seems to take a "bh" spinlock that can block softirq from running code in net/ipv6/ip6_fib.c (spin_lock_bh(&f6i->fib6_table->tb6_lock)). I'm using ping on IPv4, but I'll try to see if IPv6 makes any difference! Have you tried to use 'perf record' to observe what is happening on the system while these latency incidents happen? (let me know if you want some cmdline hints) Haven't tried this yet. If you have some hints what events to monitor I'll take them! -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On Fri, 6 Nov 2020 13:53:58 +0100 Jesper Dangaard Brouer wrote: > [...] > > > > > > Could this be related to netlink? I have gobgpd running on these > > > routers, which injects routes via netlink. > > > But the churn rate during the tests is very minimal, maybe 30 - 40 > > > routes every second. > > Yes, this could be related. The internal data-structure for FIB > lookups is a fib_trie, which is a compressed patricia tree, related to > the radix tree idea. Thus, I can imagine that the kernel has to > rebuild/rebalance the tree with all these updates. Reading the kernel code: the IPv4 fib_trie code is very well tuned, fully RCU-ified, meaning the read side is lock-free. The resize() function code in net/ipv4/fib_trie.c has a max_work limiter to avoid it using too much time. And the update looks lock-free. The IPv6 update looks more scary, as it seems to take a "bh" spinlock that can block softirq from running code in net/ipv6/ip6_fib.c (spin_lock_bh(&f6i->fib6_table->tb6_lock)). Have you tried to use 'perf record' to observe what is happening on the system while these latency incidents happen? (let me know if you want some cmdline hints) -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
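For intuition about what the FIB lookup discussed above actually does: the job is a longest-prefix match over the route table. This toy linear-scan sketch only illustrates the lookup semantics (the interface names and prefixes are made up); the kernel's fib_trie does the same thing with a compressed patricia trie and RCU-protected lock-free reads, which is why route churn matters for the update side, not the lookup side.

```python
# Toy longest-prefix-match routing lookup (illustration only; nothing
# like the kernel's compressed-trie fib_trie implementation).
import ipaddress

routes = {
    ipaddress.ip_network("0.0.0.0/0"): "upstream-1",   # default route
    ipaddress.ip_network("10.10.0.0/16"): "eth4",
    ipaddress.ip_network("10.10.0.64/26"): "eth5",     # more specific
}

def lookup(dst: str) -> str:
    addr = ipaddress.ip_address(dst)
    matches = [net for net in routes if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)  # longest prefix wins
    return routes[best]

print(lookup("10.10.0.69"))   # eth5 (matched by the more-specific /26)
print(lookup("8.8.8.8"))      # upstream-1 (only the default matches)
```

With a full BGP table (~830k IPv4 prefixes, as mentioned earlier in the thread) a linear scan like this would be hopeless per-packet, which is exactly why the trie structure, and the cost of rebalancing it under churn, comes up here.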
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On Fri, 06 Nov 2020 12:45:31 +0100 Toke Høiland-Jørgensen wrote: > "Thomas Rosenstein" writes: > > > On 6 Nov 2020, at 12:18, Jesper Dangaard Brouer wrote: > > > >> On Fri, 06 Nov 2020 10:18:10 +0100 > >> "Thomas Rosenstein" wrote: > >> > > I just tested 5.9.4 seems to also fix it partly, I have long > > stretches where it looks good, and then some increases again. (3.10 > > Stock has them too, but not so high, rather 1-3 ms) > > > >> > >> That you have long stretches where latency looks good is interesting > >> information. My theory is that your system has a periodic userspace > >> process that does a kernel syscall that takes too long, blocking the > >> network card from processing packets. (Note it can also be a kernel > >> thread). > > [...] > > > > Could this be related to netlink? I have gobgpd running on these > > routers, which injects routes via netlink. > > But the churn rate during the tests is very minimal, maybe 30 - 40 > > routes every second. Yes, this could be related. The internal data-structure for FIB lookups is a fib_trie, which is a compressed patricia tree, related to the radix tree idea. Thus, I can imagine that the kernel has to rebuild/rebalance the tree with all these updates. > > > > Otherwise we got: salt-minion, collectd, node_exporter, sshd > > collectd may be polling the interface stats; try turning that off? It should be fairly easy for you to test whether any of these services (except sshd) is causing this, by turning them off individually. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 6 Nov 2020, at 12:45, Toke Høiland-Jørgensen wrote: "Thomas Rosenstein" writes: On 6 Nov 2020, at 12:18, Jesper Dangaard Brouer wrote: On Fri, 06 Nov 2020 10:18:10 +0100 "Thomas Rosenstein" wrote:

I just tested 5.9.4; it seems to also fix it partly. I have long stretches where it looks good, and then some increases again. (3.10 Stock has them too, but not so high, rather 1-3 ms)

That you have long stretches where latency looks good is interesting information. My theory is that your system has a periodic userspace process that does a kernel syscall that takes too long, blocking the network card from processing packets. (Note it can also be a kernel thread).

The weird part is, I first only updated router-02 and pinged to router-04 (out of the traffic flow); there I noticed these long stretches of ok ping. When I also updated router-03 and router-04, the old behaviour kind of came back, which confused me.

Could this be related to netlink? I have gobgpd running on these routers, which injects routes via netlink. But the churn rate during the tests is very minimal, maybe 30 - 40 routes every second.

Otherwise we got: salt-minion, collectd, node_exporter, sshd

collectd may be polling the interface stats; try turning that off?

I can, but shouldn't that also influence iperf3 performance then?

Another theory is that the NIC HW does strange things, but it is not very likely. E.g. delaying the packets before generating the IRQ interrupt, which hides it from my IRQ-to-softirq latency tool.

A question: What traffic control qdisc are you using on your system?

Kernel 4+ uses pfifo, but there are no dropped packets. I have also tested with fq_codel, same behaviour, and also no weirdness in the packet queue itself. Kernel 3.10 uses mq, and for the vlan interfaces noqueue.

Do you mean that you only have a single pfifo qdisc on kernel 4+? Why is it not using mq?

Oh, actually, I just noticed that's a remnant of the previous tests; I had net.core.default_qdisc = fq_codel in the sysctl.conf...
so disregard my previous wrong info; all kernels by default look like that, mq + pfifo_fast:

qdisc noqueue 0: dev lo root refcnt 2
qdisc mq 0: dev eth0 root
qdisc pfifo_fast 0: dev eth0 parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :5 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :6 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :7 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :8 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc mq 0: dev eth1 root
qdisc pfifo_fast 0: dev eth1 parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :5 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :6 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :7 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :8 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc mq 0: dev eth2 root
qdisc pfifo_fast 0: dev eth2 parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :5 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :6 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :7 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :8 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc mq 0: dev eth3 root
qdisc pfifo_fast 0: dev eth3 parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 parent :5 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 parent :6 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 parent :7 band
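A leftover net.core.default_qdisc override like the one mentioned above is easy to miss. A quick way to hunt for it is to grep the sysctl config locations; the demo below runs on a scratch file so it works anywhere, while the real check would target /etc/sysctl.conf and /etc/sysctl.d/:

```shell
# Demo: find a leftover default_qdisc override in a sysctl-style file.
tmp=$(mktemp)
printf 'net.core.default_qdisc = fq_codel\n' > "$tmp"   # simulated remnant
grep -n 'default_qdisc' "$tmp"
# -> 1:net.core.default_qdisc = fq_codel
rm -f "$tmp"
```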
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
"Thomas Rosenstein" writes:
> On 6 Nov 2020, at 12:18, Jesper Dangaard Brouer wrote:
>
>> On Fri, 06 Nov 2020 10:18:10 +0100
>> "Thomas Rosenstein" wrote:
>>
> I just tested 5.9.4 seems to also fix it partly, I have long
> stretches where it looks good, and then some increases again. (3.10
> Stock has them too, but not so high, rather 1-3 ms)
>
>> That you have long stretches where latency looks good is interesting
>> information. My theory is that your system have a periodic userspace
>> process that does a kernel syscall that takes too long, blocking
>> network card from processing packets. (Note it can also be a kernel
>> thread).
>
> The weird part is, I first only updated router-02 and pinged to
> router-04 (out of traffic flow), there I noticed these long stretches of
> ok ping.
>
> When I updated also router-03 and router-04, the old behaviour kind of
> was back, this confused me.
>
> Could this be related to netlink? I have gobgpd running on these
> routers, which injects routes via netlink.
> But the churn rate during the tests is very minimal, maybe 30 - 40
> routes every second.
>
> Otherwise we got: salt-minion, collectd, node_exporter, sshd

collectd may be polling the interface stats; try turning that off?

>> Another theory is the NIC HW does strange things, but it is not very
>> likely. E.g. delaying the packets before generating the IRQ interrupt,
>> which hide it from my IRQ-to-softirq latency tool.
>>
>> A question: What traffic control qdisc are you using on your system?
>
> kernel 4+ uses pfifo, but there's no dropped packets
> I have also tested with fq_codel, same behaviour and also no weirdness
> in the packets queue itself
>
> kernel 3.10 uses mq, and for the vlan interfaces noqueue

Do you mean that you only have a single pfifo qdisc on kernel 4+? Why is it not using mq?

Was there anything in the ethtool stats?

-Toke
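To answer the ethtool-stats question above, Jesper's ethtool_stats.pl script (linked elsewhere in the thread) watches counter deltas. As a minimal stand-in sketch, an awk diff of two saved snapshots works too; the demo below uses canned data so it runs anywhere, while on a real box you would snapshot `ethtool -S eth0` before and after an incident:

```shell
# Print counters whose value changed between two "name: value" snapshots,
# e.g. saved output of 'ethtool -S eth0'. Demo data is canned.
diff_counters() {
  awk -F': *' 'NR==FNR { before[$1]=$2; next }
               ($1 in before) && before[$1] != $2 {
                 printf "%s: %s -> %s\n", $1, before[$1], $2
               }' "$1" "$2"
}

printf 'rx_packets: 100\nrx_errors: 0\n' > /tmp/ethtool_before.txt
printf 'rx_packets: 250\nrx_errors: 0\n' > /tmp/ethtool_after.txt
diff_counters /tmp/ethtool_before.txt /tmp/ethtool_after.txt
# -> rx_packets: 100 -> 250
```

An error counter (rx_errors, rx_missed, etc.) climbing only during the latency incidents would point back at the NIC or driver.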
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 6 Nov 2020, at 12:18, Jesper Dangaard Brouer wrote: On Fri, 06 Nov 2020 10:18:10 +0100 "Thomas Rosenstein" wrote:

I just tested 5.9.4; it seems to also fix it partly. I have long stretches where it looks good, and then some increases again. (3.10 Stock has them too, but not so high, rather 1-3 ms)

That you have long stretches where latency looks good is interesting information. My theory is that your system has a periodic userspace process that does a kernel syscall that takes too long, blocking the network card from processing packets. (Note it can also be a kernel thread).

The weird part is, I first only updated router-02 and pinged to router-04 (out of the traffic flow); there I noticed these long stretches of ok ping. When I also updated router-03 and router-04, the old behaviour kind of came back, which confused me.

Could this be related to netlink? I have gobgpd running on these routers, which injects routes via netlink. But the churn rate during the tests is very minimal, maybe 30 - 40 routes every second.

Otherwise we got: salt-minion, collectd, node_exporter, sshd

Another theory is that the NIC HW does strange things, but it is not very likely. E.g. delaying the packets before generating the IRQ interrupt, which hides it from my IRQ-to-softirq latency tool.

A question: What traffic control qdisc are you using on your system?

Kernel 4+ uses pfifo, but there are no dropped packets. I have also tested with fq_codel, same behaviour and also no weirdness in the packet queue itself. Kernel 3.10 uses mq, and for the vlan interfaces noqueue.

Here's the mail archive link for the question on lartc: https://www.spinics.net/lists/lartc/msg23774.html

Have you looked at the obvious case: do any of your qdiscs report a large backlog?
(during the incidents)

As said above, nothing in the qdiscs or reported. For example:

64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms

and then again:

64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms

These very low ping times tell me that you are measuring very close to the target machine, which is good. Here on the bufferbloat list, we are always suspicious of network equipment being used in this kind of setup, as experience tells us it can be the cause of bufferbloat latency.

Yes, I'm just testing across two machines connected directly to the same switch; that's basically the best-case scenario apart from a direct connection. I do also use a VLAN on this interface, so the pings go through the vlan stack!

You mention some fs.com switches (your description below the signature), can you tell us more?

It's a fs.com N5850-48S6Q, 48 port 10 Gbit + 6 port 40 Gbit. There are only 6 ports with 10 G in use, and 2 with 1 G, basically no traffic.

[...]

I have a feeling that maybe not all config options were correctly moved to the newer kernel. Or there's a big bug somewhere ... (which would seem rather weird, for me to be the first one to discover this)

I really appreciate that you report this. This is an intermittent issue, which often results in people not reporting it. Even if we find this to be caused by some process running on your system, or a bad config, it is really important that we find the root cause.

I'll rebuild the 5.9 kernel on one of the 3.10 machines and see if it makes a difference ...
-- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer

On Wed, 04 Nov 2020 16:23:12 +0100 Thomas Rosenstein via Bloat wrote:

General Info:

Routers are connected between each other with 10G Mellanox Connect-X cards via 10G SFP+ DAC cables via a 10G Switch from fs.com
Latency generally is around 0.18 ms between all routers (4).
Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3.
2 of the 4 routers are connected upstream with a 1G connection (separate port, same network card)
All routers have the full internet routing tables, i.e. 80k entries for IPv6 and 830k entries for IPv4
Conntrack is disabled (-j NOTRACK)
Kernel 5.4.60 (custom)
2x Xeon X5670 @ 2.93 GHz
96 GB RAM
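The intermittent spikes quoted throughout this thread can be flagged automatically instead of eyeballed. A small awk sketch (the 1.0 ms threshold is an arbitrary choice for this sub-millisecond LAN; the demo feeds two canned lines, while in practice you would pipe a live `ping host` through it):

```shell
# Flag ping replies whose RTT exceeds a threshold (here 1.0 ms).
printf '64 bytes from h: icmp_seq=10 ttl=64 time=0.169 ms\n64 bytes from h: icmp_seq=11 ttl=64 time=5.53 ms\n' |
awk -F'time=' '/time=/ { split($2, a, " "); if (a[1] + 0 > 1.0) print "spike:", $0 }'
# -> spike: 64 bytes from h: icmp_seq=11 ttl=64 time=5.53 ms
```

Timestamping the flagged lines (e.g. `ping -D`) makes it easy to correlate spikes with cron jobs or service activity.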
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On Fri, 06 Nov 2020 10:18:10 +0100 "Thomas Rosenstein" wrote: > >> I just tested 5.9.4 seems to also fix it partly, I have long > >> stretches where it looks good, and then some increases again. (3.10 > >> Stock has them too, but not so high, rather 1-3 ms) > >> That you have long stretches where latency looks good is interesting information. My theory is that your system have a periodic userspace process that does a kernel syscall that takes too long, blocking network card from processing packets. (Note it can also be a kernel thread). Another theory is the NIC HW does strange things, but it is not very likely. E.g. delaying the packets before generating the IRQ interrupt, which hide it from my IRQ-to-softirq latency tool. A question: What traffic control qdisc are you using on your system? What you looked at the obvious case if any of your qdisc report a large backlog? (during the incidents) > >> for example: > >> > >> 64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms > >> 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms > >> 64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms > >> 64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms > >> 64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms > >> > >> and then again: > >> > >> 64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms > >> 64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms > >> 64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms > >> 64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms > >> 64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms These very low ping times tell me that you are measuring very close to the target machine, which is good. Here on the bufferbloat list, we are always suspicious of network equipment being use in these kind of setups. As experience tells us that this can be the cause of bufferbloat latency. You mention some fs.com switches (your desc below signature), can you tell us more? [...] 
> I have a feeling that maybe not all config options were correctly moved
> to the newer kernel.
>
> Or there's a big bug somewhere ... (which would seem rather weird for me
> to be the first one to discover this)

I really appreciate that you report this. This is an intermittent issue, which often results in people not reporting it. Even if we find this to be caused by some process running on your system, or a bad config, it is really important that we find the root cause.

> I'll rebuild the 5.9 kernel on one of the 3.10 machines and see if it
> makes a difference ...

-- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer

On Wed, 04 Nov 2020 16:23:12 +0100 Thomas Rosenstein via Bloat wrote:
> General Info:
>
> Routers are connected between each other with 10G Mellanox Connect-X
> cards via 10G SFP+ DAC cables via a 10G Switch from fs.com
> Latency generally is around 0.18 ms between all routers (4).
> Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3.
> 2 of the 4 routers are connected upstream with a 1G connection (separate
> port, same network card)
> All routers have the full internet routing tables, i.e. 80k entries for
> IPv6 and 830k entries for IPv4
> Conntrack is disabled (-j NOTRACK)
> Kernel 5.4.60 (custom)
> 2x Xeon X5670 @ 2.93 GHz
> 96 GB RAM
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On Fri, 06 Nov 2020 09:48:46 +0100 "Thomas Rosenstein" wrote: > On 5 Nov 2020, at 14:33, Jesper Dangaard Brouer wrote: > > > On Thu, 05 Nov 2020 13:22:10 +0100 > > Thomas Rosenstein via Bloat wrote: > > > >> On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote: > >> > >>> "Thomas Rosenstein" writes: > >>> > > If so, this sounds more like a driver issue, or maybe something to > > do with scheduling. Does it only happen with ICMP? You could try > > this > > tool for a userspace UDP measurement: > > It happens with all packets, therefore the transfer to backblaze > with > 40 threads goes down to ~8MB/s instead of >60MB/s > >>> > >>> Huh, right, definitely sounds like a kernel bug; or maybe the new > >>> kernel is getting the hardware into a state where it bugs out when > >>> there are lots of flows or something. > >>> > >>> You could try looking at the ethtool stats (ethtool -S) while > >>> running the test and see if any error counters go up. Here's a > >>> handy script to monitor changes in the counters: > >>> > >>> https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl > >>> > I'll try what that reports! > > > Also, what happens if you ping a host on the internet (*through* > > the router instead of *to* it)? > > Same issue, but twice pronounced, as it seems all interfaces are > affected. > So, ping on one interface and the second has the issue. > Also all traffic across the host has the issue, but on both sides, > so ping to the internet increased by 2x > >>> > >>> Right, so even an unloaded interface suffers? But this is the same > >>> NIC, right? So it could still be a hardware issue... > >>> > Yep default that CentOS ships, I just tested 4.12.5 there the > issue also does not happen. So I guess I can bisect it > then...(really don't want to 😃) > >>> > >>> Well that at least narrows it down :) > >> > >> I just tested 5.9.4 seems to also fix it partly, I have long > >> stretches where it looks good, and then some increases again. 
(3.10 > >> Stock has them too, but not so high, rather 1-3 ms) > >> > >> for example: > >> > >> 64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms > >> 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms > >> 64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms > >> 64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms > >> 64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms > >> > >> and then again: > >> > >> 64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms > >> 64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms > >> 64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms > >> 64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms > >> 64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms > >> 64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms > >> 64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms > >> 64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms > >> 64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms > >> 64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms > >> 64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms > >> 64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms > >> 64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms > >> 64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms > >> 64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms > >> 64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms > >> 64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms > >> 64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms > >> 64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms > >> 64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms > >> > >> > >> For me it looks now that there was some fix between 5.4.60 and 5.9.4 > >> ... anyone can pinpoint it? > > So, new day, same issue! > > I upgraded now all routers to 5.9.4, and the issue is back ... 
> > here, when I stop it, it goes immediatly down to 0.xx ms > > 64 bytes from x.x.x.x: icmp_seq=48 ttl=64 time=1.67 ms > 64 bytes from x.x.x.x: icmp_seq=49 ttl=64 time=12.6 ms > 64 bytes from x.x.x.x: icmp_seq=50 ttl=64 time=13.8 ms > 64 bytes from x.x.x.x: icmp_seq=51 ttl=64 time=5.59 ms > 64 bytes from x.x.x.x: icmp_seq=52 ttl=64 time=5.86 ms > 64 bytes from x.x.x.x: icmp_seq=53 ttl=64 time=9.26 ms > 64 bytes from x.x.x.x: icmp_seq=54 ttl=64 time=8.28 ms > 64 bytes from x.x.x.x: icmp_seq=55 ttl=64 time=12.4 ms > 64 bytes from x.x.x.x: icmp_seq=56 ttl=64 time=0.551 ms > 64 bytes from x.x.x.x: icmp_seq=57 ttl=64 time=4.37 ms > 64 bytes from x.x.x.x: icmp_seq=58 ttl=64 time=12.1 ms > 64 bytes from x.x.x.x: icmp_seq=59 ttl=64 time=5.93 ms > 64 bytes from x.x.x.x: icmp_seq=60 ttl=64 time=6.58 ms > 64 bytes from x.x.x.x: icmp_seq=61 ttl=64 time=9.19 ms > 64 bytes from x.x.x.x: icmp_seq=62 ttl=64 time=0.124 ms > 64 bytes from x.x.x.x: icmp_seq=63 ttl=64 time=7.08 ms > 64 bytes from x.x.x.x: icmp_seq=64
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 5 Nov 2020, at 14:33, Jesper Dangaard Brouer wrote: On Thu, 05 Nov 2020 13:22:10 +0100 Thomas Rosenstein via Bloat wrote: On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote: "Thomas Rosenstein" writes: If so, this sounds more like a driver issue, or maybe something to do with scheduling. Does it only happen with ICMP? You could try this tool for a userspace UDP measurement: It happens with all packets, therefore the transfer to backblaze with 40 threads goes down to ~8MB/s instead of >60MB/s Huh, right, definitely sounds like a kernel bug; or maybe the new kernel is getting the hardware into a state where it bugs out when there are lots of flows or something. You could try looking at the ethtool stats (ethtool -S) while running the test and see if any error counters go up. Here's a handy script to monitor changes in the counters: https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl I'll try what that reports! Also, what happens if you ping a host on the internet (*through* the router instead of *to* it)? Same issue, but twice pronounced, as it seems all interfaces are affected. So, ping on one interface and the second has the issue. Also all traffic across the host has the issue, but on both sides, so ping to the internet increased by 2x Right, so even an unloaded interface suffers? But this is the same NIC, right? So it could still be a hardware issue... Yep default that CentOS ships, I just tested 4.12.5 there the issue also does not happen. So I guess I can bisect it then...(really don't want to 😃) Well that at least narrows it down :) I just tested 5.9.4 seems to also fix it partly, I have long stretches where it looks good, and then some increases again. 
(3.10 Stock has them too, but not so high, rather 1-3 ms) for example: 64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms 64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms 64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms 64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms and then again: 64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms 64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms 64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms 64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms 64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms 64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms 64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms 64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms 64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms 64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms 64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms 64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms 64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms 64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms 64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms 64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms 64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms 64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms 64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms 64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms For me it looks now that there was some fix between 5.4.60 and 5.9.4 ... anyone can pinpoint it? 
I have now retried with 3.10; here are the results. I'm pinging between the hosts with the traffic flow; upload is running at 55 MB/s (compared to ~20 MB/s on 5.9.4, and 8 MB/s on 5.4.60):

64 bytes from x.x.x.x: icmp_seq=159 ttl=64 time=0.050 ms
64 bytes from x.x.x.x: icmp_seq=160 ttl=64 time=0.056 ms
64 bytes from x.x.x.x: icmp_seq=161 ttl=64 time=0.061 ms
64 bytes from x.x.x.x: icmp_seq=162 ttl=64 time=0.072 ms
64 bytes from x.x.x.x: icmp_seq=163 ttl=64 time=0.052 ms
64 bytes from x.x.x.x: icmp_seq=164 ttl=64 time=0.053 ms
64 bytes from x.x.x.x: icmp_seq=165 ttl=64 time=0.068 ms
64 bytes from x.x.x.x: icmp_seq=166 ttl=64 time=0.050 ms
64 bytes from x.x.x.x: icmp_seq=167 ttl=64 time=0.057 ms
64 bytes from x.x.x.x: icmp_seq=168 ttl=64 time=0.051 ms
64 bytes from x.x.x.x: icmp_seq=169 ttl=64 time=0.045 ms
64 bytes from x.x.x.x: icmp_seq=170 ttl=64 time=0.138 ms
64 bytes from x.x.x.x: icmp_seq=171 ttl=64 time=0.052 ms
64 bytes from x.x.x.x: icmp_seq=172 ttl=64 time=0.049 ms
64 bytes from x.x.x.x: icmp_seq=173 ttl=64 time=0.094 ms
64 bytes from x.x.x.x: icmp_seq=174 ttl=64 time=0.050 ms
64 bytes from x.x.x.x: icmp_seq=175 ttl=64 time=0.810 ms
64 bytes from x.x.x.x: icmp_seq=176 ttl=64 time=0.077 ms
64 bytes from x.x.x.x: icmp_seq=177 ttl=64 time=0.055 ms
64 bytes from x.x.x.x: icmp_seq=178 ttl=64 time=0.049 ms
64 bytes from x.x.x.x: icmp_seq=179 ttl=64 time=0.050 ms
64 bytes from x.x.x.x: icmp_seq=180 ttl=64 time=0.073 ms
64 bytes from x.x.x.x: icmp_seq=181 ttl=64 time=0.065 ms
64 bytes from x.x.x.x: icmp_seq=182 ttl=64 time=0.123 ms
64 bytes from x.x.x.x: icmp_seq=183 ttl=64 time=0.045 ms
64 bytes from x.x.x.x: icmp_seq=184 ttl
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 5 Nov 2020, at 14:33, Jesper Dangaard Brouer wrote: On Thu, 05 Nov 2020 13:22:10 +0100 Thomas Rosenstein via Bloat wrote: On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote: "Thomas Rosenstein" writes: If so, this sounds more like a driver issue, or maybe something to do with scheduling. Does it only happen with ICMP? You could try this tool for a userspace UDP measurement: It happens with all packets, therefore the transfer to backblaze with 40 threads goes down to ~8MB/s instead of >60MB/s Huh, right, definitely sounds like a kernel bug; or maybe the new kernel is getting the hardware into a state where it bugs out when there are lots of flows or something. You could try looking at the ethtool stats (ethtool -S) while running the test and see if any error counters go up. Here's a handy script to monitor changes in the counters: https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl I'll try what that reports! Also, what happens if you ping a host on the internet (*through* the router instead of *to* it)? Same issue, but twice pronounced, as it seems all interfaces are affected. So, ping on one interface and the second has the issue. Also all traffic across the host has the issue, but on both sides, so ping to the internet increased by 2x Right, so even an unloaded interface suffers? But this is the same NIC, right? So it could still be a hardware issue... Yep default that CentOS ships, I just tested 4.12.5 there the issue also does not happen. So I guess I can bisect it then...(really don't want to 😃) Well that at least narrows it down :) I just tested 5.9.4 seems to also fix it partly, I have long stretches where it looks good, and then some increases again. 
(3.10 Stock has them too, but not so high, rather 1-3 ms) for example: 64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms 64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms 64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms 64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms and then again: 64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms 64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms 64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms 64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms 64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms 64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms 64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms 64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms 64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms 64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms 64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms 64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms 64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms 64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms 64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms 64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms 64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms 64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms 64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms 64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms For me it looks now that there was some fix between 5.4.60 and 5.9.4 ... anyone can pinpoint it? So, new day, same issue! I upgraded now all routers to 5.9.4, and the issue is back ... 
here, when I stop it, it goes immediately down to 0.xx ms

64 bytes from x.x.x.x: icmp_seq=48 ttl=64 time=1.67 ms
64 bytes from x.x.x.x: icmp_seq=49 ttl=64 time=12.6 ms
64 bytes from x.x.x.x: icmp_seq=50 ttl=64 time=13.8 ms
64 bytes from x.x.x.x: icmp_seq=51 ttl=64 time=5.59 ms
64 bytes from x.x.x.x: icmp_seq=52 ttl=64 time=5.86 ms
64 bytes from x.x.x.x: icmp_seq=53 ttl=64 time=9.26 ms
64 bytes from x.x.x.x: icmp_seq=54 ttl=64 time=8.28 ms
64 bytes from x.x.x.x: icmp_seq=55 ttl=64 time=12.4 ms
64 bytes from x.x.x.x: icmp_seq=56 ttl=64 time=0.551 ms
64 bytes from x.x.x.x: icmp_seq=57 ttl=64 time=4.37 ms
64 bytes from x.x.x.x: icmp_seq=58 ttl=64 time=12.1 ms
64 bytes from x.x.x.x: icmp_seq=59 ttl=64 time=5.93 ms
64 bytes from x.x.x.x: icmp_seq=60 ttl=64 time=6.58 ms
64 bytes from x.x.x.x: icmp_seq=61 ttl=64 time=9.19 ms
64 bytes from x.x.x.x: icmp_seq=62 ttl=64 time=0.124 ms
64 bytes from x.x.x.x: icmp_seq=63 ttl=64 time=7.08 ms
64 bytes from x.x.x.x: icmp_seq=64 ttl=64 time=9.69 ms
64 bytes from x.x.x.x: icmp_seq=65 ttl=64 time=7.52 ms
64 bytes from x.x.x.x: icmp_seq=66 ttl=64 time=14.9 ms
64 bytes from x.x.x.x: icmp_seq=67 ttl=64 time=12.6 ms
64 bytes from x.x.x.x: icmp_seq=68 ttl=64 time=2.34 ms
64 bytes from x.x.x.x: icmp_seq=69 ttl=64 time=8.97 ms
64 bytes from x.x.x.x: icmp_seq=70 ttl=64 time=0.203 ms
64 bytes from x.x.x.x: icmp_seq=71 ttl=64 time=9.10 ms
64 bytes from x.x.x.x: icmp_seq=72 ttl=64 time=3.16 ms
64 bytes from x.x.x.x: icmp_seq=73 ttl=64 time=1.88 ms
64 bytes from x.x.x.x: icmp_seq=74 ttl=64 time=11.5 ms
64 bytes from x
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On Thu, 05 Nov 2020 13:22:10 +0100 Thomas Rosenstein via Bloat wrote: > On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote: > > > "Thomas Rosenstein" writes: > > > >>> If so, this sounds more like a driver issue, or maybe something to > >>> do with scheduling. Does it only happen with ICMP? You could try this > >>> tool for a userspace UDP measurement: > >> > >> It happens with all packets, therefore the transfer to backblaze with > >> 40 threads goes down to ~8MB/s instead of >60MB/s > > > > Huh, right, definitely sounds like a kernel bug; or maybe the new > > kernel is getting the hardware into a state where it bugs out when > > there are lots of flows or something. > > > > You could try looking at the ethtool stats (ethtool -S) while > > running the test and see if any error counters go up. Here's a > > handy script to monitor changes in the counters: > > > > https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl > > > >> I'll try what that reports! > >> > >>> Also, what happens if you ping a host on the internet (*through* > >>> the router instead of *to* it)? > >> > >> Same issue, but twice pronounced, as it seems all interfaces are > >> affected. > >> So, ping on one interface and the second has the issue. > >> Also all traffic across the host has the issue, but on both sides, > >> so ping to the internet increased by 2x > > > > Right, so even an unloaded interface suffers? But this is the same > > NIC, right? So it could still be a hardware issue... > > > >> Yep default that CentOS ships, I just tested 4.12.5 there the > >> issue also does not happen. So I guess I can bisect it > >> then...(really don't want to 😃) > > > > Well that at least narrows it down :) > > I just tested 5.9.4 seems to also fix it partly, I have long > stretches where it looks good, and then some increases again. 
> (3.10 Stock has them too, but not so high, rather 1-3 ms)
>
> for example:
>
> 64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
> 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
> 64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
> 64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
> 64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms
>
> and then again:
>
> 64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
> 64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
> 64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
> 64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
> 64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
> 64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
> 64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
> 64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
> 64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
> 64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
> 64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
> 64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
> 64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
> 64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
> 64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
> 64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
> 64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
> 64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
> 64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
> 64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms
>
> For me it looks now that there was some fix between 5.4.60 and 5.9.4
> ... anyone can pinpoint it?
I have some bpftrace tools to measure this kind of latency spike here:

 [1] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/

The tool you want is: softirq_net_latency.bt

 [2] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt

Example output, see [3]:

 [3] https://bugzilla.redhat.com/show_bug.cgi?id=1795049#c8

Based on the kernel versions, I don't expect this to be the same latency issue as described in the bugzilla case [3] (as IIRC it was fixed in 4.19). It can still be a similar issue, where some userspace process is reading information from the kernel (/sys/fs/cgroup/memory/memory.stat in the BZ case) that blocks softirq from running, and results in these latency spikes.

Install guide for bpftrace [4]:

 [4] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/bpftrace/INSTALL.org

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat
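The measurement idea behind softirq_net_latency.bt — record a timestamp when the softirq is raised, another when the handler actually runs, and report the gap — can be sketched in plain Python. This is a toy model of the principle only, not the bpftrace tool itself; the 10 ms sleep stands in for whatever is blocking softirq from running:

```python
import time

# Toy model: timestamp a "raise" event and a "run" event and report the
# gap between them -- the same scheduled-vs-executed measurement that
# softirq_net_latency.bt performs with kernel tracepoints.
events = []

def raise_softirq():
    events.append(("raise", time.monotonic()))

def run_softirq():
    events.append(("run", time.monotonic()))

raise_softirq()
time.sleep(0.01)          # stand-in for softirq being blocked/delayed
run_softirq()

(_, t_raise), (_, t_run) = events
gap_ms = (t_run - t_raise) * 1000
print(f"softirq delayed {gap_ms:.1f} ms")
```

The real tool aggregates these gaps into a histogram, so occasional multi-millisecond delays (like the ping spikes in this thread) stand out from the sub-microsecond common case.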
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
"Thomas Rosenstein" writes:

> On 5 Nov 2020, at 13:38, Toke Høiland-Jørgensen wrote:
>
>> "Thomas Rosenstein" writes:
>>
>>> On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote:
>>> "Thomas Rosenstein" writes:
>>
>> If so, this sounds more like a driver issue, or maybe something to do
>> with scheduling. Does it only happen with ICMP? You could try this tool
>> for a userspace UDP measurement:
>
> It happens with all packets, therefore the transfer to backblaze with 40
> threads goes down to ~8MB/s instead of >60MB/s

Huh, right, definitely sounds like a kernel bug; or maybe the new kernel is getting the hardware into a state where it bugs out when there are lots of flows or something.

You could try looking at the ethtool stats (ethtool -S) while running the test and see if any error counters go up. Here's a handy script to monitor changes in the counters:

https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl

> I'll try what that reports!
>
>> Also, what happens if you ping a host on the internet (*through* the
>> router instead of *to* it)?
>
> Same issue, but twice pronounced, as it seems all interfaces are
> affected.
> So, ping on one interface and the second has the issue.
> Also all traffic across the host has the issue, but on both sides, so
> ping to the internet increased by 2x

Right, so even an unloaded interface suffers? But this is the same NIC, right? So it could still be a hardware issue...

> Yep default that CentOS ships, I just tested 4.12.5 there the issue also
> does not happen. So I guess I can bisect it then...(really don't want to
> 😃)

Well that at least narrows it down :)

>>> I just tested 5.9.4 seems to also fix it partly, I have long stretches
>>> where it looks good, and then some increases again.
>>> (3.10 Stock has them too, but not so high, rather 1-3 ms)
>>>
>>> for example:
>>>
>>> 64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
>>> 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
>>> 64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
>>> 64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
>>> 64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms
>>>
>>> and then again:
>>>
>>> 64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
>>> 64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
>>> 64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
>>> 64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
>>> 64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
>>> 64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
>>> 64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
>>> 64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
>>> 64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
>>> 64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
>>> 64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
>>> 64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
>>> 64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
>>> 64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
>>> 64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
>>> 64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
>>> 64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
>>> 64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
>>> 64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
>>> 64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms
>>>
>>> For me it looks now that there was some fix between 5.4.60 and 5.9.4
>>> ... anyone can pinpoint it?
>>
>> $ git log --no-merges --oneline v5.4.60..v5.9.4|wc -l
>> 72932
>>
>> Only 73k commits; should be easy, right? :)
>>
>> (In other words no, I have no idea; I'd suggest either (a) asking on
>> netdev, (b) bisecting or (c) using 5.9+ and just making peace with not
>> knowing).
>
> Guess I'll go the easy route and let it be ...
>
> I'll update all routers to the 5.9.4 and see if it fixes the traffic
> flow - will report back once more after that.

Sounds like a plan :)

>> How did you configure the new kernel? Did you start from scratch, or is
>> it based on the old centos config?
>
> first oldconfig and from there then added additional options for IB,
> NVMe, etc (which I don't really need on the routers)

OK, so you're probably building with roughly the same options in terms of scheduling granularity etc. That's good. Did you enable spectre mitigations etc on the new kernel? What's the output of `tail /sys/devices/system/cpu/vulnerabilities/*` ?

>>> mitigations are off
>>
>> Right, I just figured maybe you were hitting some threshold that
>> involved a lot of indirect calls which slowed things down due to
>> mitigations.
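The approach behind the ethtool_stats.pl script suggested above — snapshot the counters, wait an interval, snapshot again, and print only the counters that moved — can be sketched in a few lines of Python. The counter names and values here are made-up illustrations, not real `ethtool -S` output:

```python
# Diff two counter snapshots, as ethtool_stats.pl does for `ethtool -S`,
# printing only the counters that changed during the interval.
def counter_deltas(before, after, interval_s):
    return {
        name: (after[name] - before.get(name, 0)) / interval_s
        for name in after
        if after[name] != before.get(name, 0)
    }

# Illustrative snapshots taken 1 second apart (hypothetical counters):
snap1 = {"rx_packets": 1_000_000, "rx_discards_phy": 0, "tx_packets": 900_000}
snap2 = {"rx_packets": 1_008_000, "rx_discards_phy": 120, "tx_packets": 900_000}

for name, rate in counter_deltas(snap1, snap2, interval_s=1.0).items():
    print(f"{name}: {rate:.0f}/s")
```

A discard or error counter that only moves while the b2 upload is running would be exactly the kind of clue Toke is asking for.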
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 5 Nov 2020, at 13:38, Toke Høiland-Jørgensen wrote:

"Thomas Rosenstein" writes:

On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote:

"Thomas Rosenstein" writes:

If so, this sounds more like a driver issue, or maybe something to do with scheduling. Does it only happen with ICMP? You could try this tool for a userspace UDP measurement:

It happens with all packets, therefore the transfer to backblaze with 40 threads goes down to ~8MB/s instead of >60MB/s

Huh, right, definitely sounds like a kernel bug; or maybe the new kernel is getting the hardware into a state where it bugs out when there are lots of flows or something.

You could try looking at the ethtool stats (ethtool -S) while running the test and see if any error counters go up. Here's a handy script to monitor changes in the counters:

https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl

I'll try what that reports!

Also, what happens if you ping a host on the internet (*through* the router instead of *to* it)?

Same issue, but twice pronounced, as it seems all interfaces are affected. So, ping on one interface and the second has the issue. Also all traffic across the host has the issue, but on both sides, so ping to the internet increased by 2x

Right, so even an unloaded interface suffers? But this is the same NIC, right? So it could still be a hardware issue...

Yep default that CentOS ships, I just tested 4.12.5 there the issue also does not happen. So I guess I can bisect it then...(really don't want to 😃)

Well that at least narrows it down :)

I just tested 5.9.4 seems to also fix it partly, I have long stretches where it looks good, and then some increases again.
(3.10 Stock has them too, but not so high, rather 1-3 ms)

for example:

64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms

and then again:

64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms

For me it looks now that there was some fix between 5.4.60 and 5.9.4 ... anyone can pinpoint it?

$ git log --no-merges --oneline v5.4.60..v5.9.4|wc -l
72932

Only 73k commits; should be easy, right? :)

(In other words no, I have no idea; I'd suggest either (a) asking on netdev, (b) bisecting or (c) using 5.9+ and just making peace with not knowing).

Guess I'll go the easy route and let it be ...

I'll update all routers to the 5.9.4 and see if it fixes the traffic flow - will report back once more after that.

How did you configure the new kernel? Did you start from scratch, or is it based on the old centos config?

first oldconfig and from there then added additional options for IB, NVMe, etc (which I don't really need on the routers)

OK, so you're probably building with roughly the same options in terms of scheduling granularity etc. That's good. Did you enable spectre mitigations etc on the new kernel? What's the output of `tail /sys/devices/system/cpu/vulnerabilities/*` ?

mitigations are off

Right, I just figured maybe you were hitting some threshold that involved a lot of indirect calls which slowed things down due to mitigations. Guess not, then...

Thanks for the support :)

-Toke
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
"Thomas Rosenstein" writes:

> On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote:
>
>> "Thomas Rosenstein" writes:
>>
>>>> If so, this sounds more like a driver issue, or maybe something to do
>>>> with scheduling. Does it only happen with ICMP? You could try this
>>>> tool for a userspace UDP measurement:
>>>
>>> It happens with all packets, therefore the transfer to backblaze with
>>> 40 threads goes down to ~8MB/s instead of >60MB/s
>>
>> Huh, right, definitely sounds like a kernel bug; or maybe the new
>> kernel is getting the hardware into a state where it bugs out when
>> there are lots of flows or something.
>>
>> You could try looking at the ethtool stats (ethtool -S) while running
>> the test and see if any error counters go up. Here's a handy script to
>> monitor changes in the counters:
>>
>> https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
>>
>>> I'll try what that reports!
>>>
>>>> Also, what happens if you ping a host on the internet (*through* the
>>>> router instead of *to* it)?
>>>
>>> Same issue, but twice pronounced, as it seems all interfaces are
>>> affected.
>>> So, ping on one interface and the second has the issue.
>>> Also all traffic across the host has the issue, but on both sides, so
>>> ping to the internet increased by 2x
>>
>> Right, so even an unloaded interface suffers? But this is the same
>> NIC, right? So it could still be a hardware issue...
>>
>>> Yep default that CentOS ships, I just tested 4.12.5 there the issue
>>> also does not happen. So I guess I can bisect it then...(really don't
>>> want to 😃)
>>
>> Well that at least narrows it down :)
>
> I just tested 5.9.4 seems to also fix it partly, I have long stretches
> where it looks good, and then some increases again.
> (3.10 Stock has them too, but not so high, rather 1-3 ms)
>
> for example:
>
> 64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
> 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
> 64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
> 64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
> 64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms
>
> and then again:
>
> 64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
> 64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
> 64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
> 64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
> 64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
> 64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
> 64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
> 64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
> 64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
> 64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
> 64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
> 64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
> 64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
> 64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
> 64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
> 64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
> 64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
> 64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
> 64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
> 64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms
>
> For me it looks now that there was some fix between 5.4.60 and 5.9.4 ...
> anyone can pinpoint it?

$ git log --no-merges --oneline v5.4.60..v5.9.4|wc -l
72932

Only 73k commits; should be easy, right? :)

(In other words no, I have no idea; I'd suggest either (a) asking on netdev, (b) bisecting or (c) using 5.9+ and just making peace with not knowing).

How did you configure the new kernel? Did you start from scratch, or is it based on the old centos config?

>>> first oldconfig and from there then added additional options for IB,
>>> NVMe, etc (which I don't really need on the routers)
>>
>> OK, so you're probably building with roughly the same options in terms
>> of scheduling granularity etc. That's good. Did you enable spectre
>> mitigations etc on the new kernel? What's the output of
>> `tail /sys/devices/system/cpu/vulnerabilities/*` ?
>
> mitigations are off

Right, I just figured maybe you were hitting some threshold that involved a lot of indirect calls which slowed things down due to mitigations. Guess not, then...

-Toke
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote:

"Thomas Rosenstein" writes:

If so, this sounds more like a driver issue, or maybe something to do with scheduling. Does it only happen with ICMP? You could try this tool for a userspace UDP measurement:

It happens with all packets, therefore the transfer to backblaze with 40 threads goes down to ~8MB/s instead of >60MB/s

Huh, right, definitely sounds like a kernel bug; or maybe the new kernel is getting the hardware into a state where it bugs out when there are lots of flows or something.

You could try looking at the ethtool stats (ethtool -S) while running the test and see if any error counters go up. Here's a handy script to monitor changes in the counters:

https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl

I'll try what that reports!

Also, what happens if you ping a host on the internet (*through* the router instead of *to* it)?

Same issue, but twice pronounced, as it seems all interfaces are affected. So, ping on one interface and the second has the issue. Also all traffic across the host has the issue, but on both sides, so ping to the internet increased by 2x

Right, so even an unloaded interface suffers? But this is the same NIC, right? So it could still be a hardware issue...

Yep default that CentOS ships, I just tested 4.12.5 there the issue also does not happen. So I guess I can bisect it then...(really don't want to 😃)

Well that at least narrows it down :)

I just tested 5.9.4 seems to also fix it partly, I have long stretches where it looks good, and then some increases again.
(3.10 Stock has them too, but not so high, rather 1-3 ms)

for example:

64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms

and then again:

64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms

For me it looks now that there was some fix between 5.4.60 and 5.9.4 ... anyone can pinpoint it?

How did you configure the new kernel? Did you start from scratch, or is it based on the old centos config?

first oldconfig and from there then added additional options for IB, NVMe, etc (which I don't really need on the routers)

OK, so you're probably building with roughly the same options in terms of scheduling granularity etc. That's good. Did you enable spectre mitigations etc on the new kernel? What's the output of `tail /sys/devices/system/cpu/vulnerabilities/*` ?

mitigations are off

==> /sys/devices/system/cpu/vulnerabilities/itlb_multihit <==
KVM: Vulnerable

==> /sys/devices/system/cpu/vulnerabilities/l1tf <==
Mitigation: PTE Inversion; VMX: vulnerable

==> /sys/devices/system/cpu/vulnerabilities/mds <==
Vulnerable; SMT vulnerable

==> /sys/devices/system/cpu/vulnerabilities/meltdown <==
Vulnerable

==> /sys/devices/system/cpu/vulnerabilities/spec_store_bypass <==
Vulnerable

==> /sys/devices/system/cpu/vulnerabilities/spectre_v1 <==
Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers

==> /sys/devices/system/cpu/vulnerabilities/spectre_v2 <==
Vulnerable, STIBP: disabled

==> /sys/devices/system/cpu/vulnerabilities/srbds <==
Not affected

==> /sys/devices/system/cpu/vulnerabilities/tsx_async_abort <==
Not affected

Grub Boot options are:

crashkernel=896M rd.lvm.lv=cl/root net.ifnames=0 biosdevname=0 scsi_mod.use_blk_mq=1 dm_mod.use_blk_mq=y mitigations=off console=tty0 console=ttyS1,115200

-Toke
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
"Thomas Rosenstein" writes:

>> If so, this sounds more like a driver issue, or maybe something to do
>> with scheduling. Does it only happen with ICMP? You could try this tool
>> for a userspace UDP measurement:
>
> It happens with all packets, therefore the transfer to backblaze with 40
> threads goes down to ~8MB/s instead of >60MB/s

Huh, right, definitely sounds like a kernel bug; or maybe the new kernel is getting the hardware into a state where it bugs out when there are lots of flows or something.

You could try looking at the ethtool stats (ethtool -S) while running the test and see if any error counters go up. Here's a handy script to monitor changes in the counters:

https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl

> I'll try what that reports!
>
>> Also, what happens if you ping a host on the internet (*through* the
>> router instead of *to* it)?
>
> Same issue, but twice pronounced, as it seems all interfaces are
> affected.
> So, ping on one interface and the second has the issue.
> Also all traffic across the host has the issue, but on both sides, so
> ping to the internet increased by 2x

Right, so even an unloaded interface suffers? But this is the same NIC, right? So it could still be a hardware issue...

> Yep default that CentOS ships, I just tested 4.12.5 there the issue also
> does not happen. So I guess I can bisect it then...(really don't want to
> 😃)

Well that at least narrows it down :)

>> How did you configure the new kernel? Did you start from scratch, or is
>> it based on the old centos config?
>
> first oldconfig and from there then added additional options for IB,
> NVMe, etc (which I don't really need on the routers)

OK, so you're probably building with roughly the same options in terms of scheduling granularity etc. That's good. Did you enable spectre mitigations etc on the new kernel? What's the output of `tail /sys/devices/system/cpu/vulnerabilities/*` ?

-Toke
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 5 Nov 2020, at 1:10, Toke Høiland-Jørgensen wrote:

"Thomas Rosenstein" writes:

On 4 Nov 2020, at 17:10, Toke Høiland-Jørgensen wrote:

Thomas Rosenstein via Bloat writes:

Hi all,

I'm coming from the lartc mailing list, here's the original text:

=

I have multiple routers which connect to multiple upstream providers, I have noticed a high latency shift in icmp (and generally all connection) if I run b2 upload-file --threads 40 (and I can reproduce this)

What options do I have to analyze why this happens?

General Info:

Routers are connected between each other with 10G Mellanox Connect-X cards via 10G SPF+ DAC cables via a 10G Switch from fs.com
Latency generally is around 0.18 ms between all routers (4).
Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3.
2 of the 4 routers are connected upstream with a 1G connection (separate port, same network card)
All routers have the full internet routing tables, i.e. 80k entries for IPv6 and 830k entries for IPv4
Conntrack is disabled (-j NOTRACK)
Kernel 5.4.60 (custom)
2x Xeon X5670 @ 2.93 GHz
96 GB RAM
No Swap
CentOS 7

During high latency:

Latency on routers which have the traffic flow increases to 12 - 20 ms, for all interfaces; moving the stream (via bgp disable session) also moves the high latency
iperf3 performance plummets to 300 - 400 MBits
CPU load (user / system) is around 0.1%
RAM usage is around 3 - 4 GB
if_packets count is stable (around 8000 pkt/s more)

I'm not sure I get your topology. Packets are going from where to where, and what link is the bottleneck for the transfer you're doing? Are you measuring the latency along the same path?

Have you tried running 'mtr' to figure out which hop the latency is at?

I tried to draw the topology, I hope this is okay and explains better what's happening:

https://drive.google.com/file/d/15oAsxiNfsbjB9a855Q_dh6YvFZBDdY5I/view?usp=sharing

Ohh, right, you're pinging between two of the routers across a 10 Gbps link with plenty of capacity to spare, and *that* goes up by two orders of magnitude when you start the transfer, even though the transfer itself is <1Gbps? Am I understanding you correctly now?

Exactly :)

If so, this sounds more like a driver issue, or maybe something to do with scheduling. Does it only happen with ICMP? You could try this tool for a userspace UDP measurement:

It happens with all packets, therefore the transfer to backblaze with 40 threads goes down to ~8MB/s instead of >60MB/s

https://github.com/heistp/irtt/

I'll try what that reports!

Also, what happens if you ping a host on the internet (*through* the router instead of *to* it)?

Same issue, but twice pronounced, as it seems all interfaces are affected. So, ping on one interface and the second has the issue. Also all traffic across the host has the issue, but on both sides, so ping to the internet increased by 2x

And which version of the Connect-X cards are you using (or rather, which driver? mlx4?)

It's Connect-X 4 Lx cards, specifically: MCX4121A-ACAT
Driver is mlx5_core

So it must be something in the kernel tacking on a delay, I could try to do a bisect and build like 10 kernels :)

That may ultimately end up being necessary. However, when you say 'stock kernel' you mean what CentOS ships, right? If so, that's not really a 3.10 kernel - the RHEL kernels (that CentOS is based on) are... somewhat creative... about their versioning. So if you've switched to a vanilla upstream kernel you may find bisecting difficult :/

Yep default that CentOS ships, I just tested 4.12.5 there the issue also does not happen. So I guess I can bisect it then...(really don't want to 😃)

How did you configure the new kernel? Did you start from scratch, or is it based on the old centos config?

first oldconfig and from there then added additional options for IB, NVMe, etc (which I don't really need on the routers)

-Toke
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
"Thomas Rosenstein" writes:

> On 4 Nov 2020, at 17:10, Toke Høiland-Jørgensen wrote:
>
>> Thomas Rosenstein via Bloat writes:
>>
>>> Hi all,
>>>
>>> I'm coming from the lartc mailing list, here's the original text:
>>>
>>> =
>>>
>>> I have multiple routers which connect to multiple upstream providers,
>>> I have noticed a high latency shift in icmp (and generally all
>>> connection) if I run b2 upload-file --threads 40 (and I can reproduce
>>> this)
>>>
>>> What options do I have to analyze why this happens?
>>>
>>> General Info:
>>>
>>> Routers are connected between each other with 10G Mellanox Connect-X
>>> cards via 10G SPF+ DAC cables via a 10G Switch from fs.com
>>> Latency generally is around 0.18 ms between all routers (4).
>>> Throughput is 9.4 Gbit/s with 0 retransmissions when tested with
>>> iperf3.
>>> 2 of the 4 routers are connected upstream with a 1G connection
>>> (separate port, same network card)
>>> All routers have the full internet routing tables, i.e. 80k entries
>>> for IPv6 and 830k entries for IPv4
>>> Conntrack is disabled (-j NOTRACK)
>>> Kernel 5.4.60 (custom)
>>> 2x Xeon X5670 @ 2.93 GHz
>>> 96 GB RAM
>>> No Swap
>>> CentOS 7
>>>
>>> During high latency:
>>>
>>> Latency on routers which have the traffic flow increases to 12 - 20
>>> ms, for all interfaces; moving the stream (via bgp disable session)
>>> also moves the high latency
>>> iperf3 performance plummets to 300 - 400 MBits
>>> CPU load (user / system) is around 0.1%
>>> RAM usage is around 3 - 4 GB
>>> if_packets count is stable (around 8000 pkt/s more)
>>
>> I'm not sure I get your topology. Packets are going from where to
>> where, and what link is the bottleneck for the transfer you're doing?
>> Are you measuring the latency along the same path?
>>
>> Have you tried running 'mtr' to figure out which hop the latency is
>> at?
>
> I tried to draw the topology, I hope this is okay and explains better
> what's happening:
>
> https://drive.google.com/file/d/15oAsxiNfsbjB9a855Q_dh6YvFZBDdY5I/view?usp=sharing

Ohh, right, you're pinging between two of the routers across a 10 Gbps link with plenty of capacity to spare, and *that* goes up by two orders of magnitude when you start the transfer, even though the transfer itself is <1Gbps? Am I understanding you correctly now?

If so, this sounds more like a driver issue, or maybe something to do with scheduling. Does it only happen with ICMP? You could try this tool for a userspace UDP measurement:

https://github.com/heistp/irtt/

Also, what happens if you ping a host on the internet (*through* the router instead of *to* it)?

And which version of the Connect-X cards are you using (or rather, which driver? mlx4?)

> So it must be something in the kernel tacking on a delay, I could try to
> do a bisect and build like 10 kernels :)

That may ultimately end up being necessary. However, when you say 'stock kernel' you mean what CentOS ships, right? If so, that's not really a 3.10 kernel - the RHEL kernels (that CentOS is based on) are... somewhat creative... about their versioning. So if you've switched to a vanilla upstream kernel you may find bisecting difficult :/

How did you configure the new kernel? Did you start from scratch, or is it based on the old centos config?

-Toke
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 4 Nov 2020, at 17:10, Toke Høiland-Jørgensen wrote: Thomas Rosenstein via Bloat writes: Hi all, I'm coming from the lartc mailing list, here's the original text: = I have multiple routers which connect to multiple upstream providers, I have noticed a high latency shift in icmp (and generally all connection) if I run b2 upload-file --threads 40 (and I can reproduce this) What options do I have to analyze why this happens? General Info: Routers are connected between each other with 10G Mellanox Connect-X cards via 10G SPF+ DAC cables via a 10G Switch from fs.com Latency generally is around 0.18 ms between all routers (4). Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3. 2 of the 4 routers are connected upstream with a 1G connection (separate port, same network card) All routers have the full internet routing tables, i.e. 80k entries for IPv6 and 830k entries for IPv4 Conntrack is disabled (-j NOTRACK) Kernel 5.4.60 (custom) 2x Xeon X5670 @ 2.93 Ghz 96 GB RAM No Swap CentOs 7 During high latency: Latency on routers which have the traffic flow increases to 12 - 20 ms, for all interfaces, moving of the stream (via bgp disable session) moves also the high latency iperf3 performance plumets to 300 - 400 MBits CPU load (user / system) are around 0.1% Ram Usage is around 3 - 4 GB if_packets count is stable (around 8000 pkt/s more) I'm not sure I get you topology. Packets are going from where to where, and what link is the bottleneck for the transfer you're doing? Are you measuring the latency along the same path? Have you tried running 'mtr' to figure out which hop the latency is at? I tried to draw the topology, I hope this is okay and explains betters what's happening: https://drive.google.com/file/d/15oAsxiNfsbjB9a855Q_dh6YvFZBDdY5I/view?usp=sharing There is definitly no bottleneck in any of the links, the maximum on any link is 16k packets/sec and around 300 Mbit/s. 
In the iperf3 tests I can easily get up to 9.4 Gbit/s.

So it must be something in the kernel tacking on a delay, I could try
to do a bisect and build like 10 kernels :)

>> Here is the tc -s qdisc output:
>
> This indicates ("dropped 0" and "ecn_mark 0") that there's no
> backpressure on the qdisc, so something else is going on. Also, you
> said the issue goes away if you downgrade the kernel? That does
> sound odd...

Yes, indeed. I have only recently upgraded the kernel to 5.4.60 and
haven't had the issue before.

> -Toke
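[Editor's note: a hedged sketch of the bisect floated above. The
endpoints are illustrative upstream tags; note that the CentOS "3.10"
kernel is heavily patched, so the nearest known-good vanilla release
has to be found by trial first.]

```shell
# Clone the stable tree and let git narrow down the first bad commit.
# ~10 builds is about right: log2(commits between endpoints).
git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
cd linux
git bisect start
git bisect bad v5.4        # latency spikes reproduce here
git bisect good v4.19      # replace with the newest release that tests clean

# For each candidate git checks out: build, install, boot, reproduce
# the b2 upload test, then mark the result and repeat:
git bisect good            # or: git bisect bad
```

Each mark roughly halves the remaining range, which is why only on the
order of ten kernel builds are needed even across several releases.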
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
Thomas Rosenstein via Bloat writes:

> Hi all,
>
> I'm coming from the lartc mailing list, here's the original text:
>
> =
>
> I have multiple routers which connect to multiple upstream providers,
> I have noticed a high latency shift in icmp (and generally all
> connections) if I run b2 upload-file --threads 40 (and I can
> reproduce this)
>
> What options do I have to analyze why this happens?
>
> General Info:
>
> Routers are connected between each other with 10G Mellanox Connect-X
> cards via 10G SFP+ DAC cables via a 10G switch from fs.com
> Latency generally is around 0.18 ms between all routers (4).
> Throughput is 9.4 Gbit/s with 0 retransmissions when tested with
> iperf3.
> 2 of the 4 routers are connected upstream with a 1G connection
> (separate port, same network card)
> All routers have the full internet routing tables, i.e. 80k entries
> for IPv6 and 830k entries for IPv4
> Conntrack is disabled (-j NOTRACK)
> Kernel 5.4.60 (custom)
> 2x Xeon X5670 @ 2.93 GHz
> 96 GB RAM
> No swap
> CentOS 7
>
> During high latency:
>
> Latency on routers which carry the traffic flow increases to
> 12 - 20 ms, for all interfaces; moving the stream (via bgp disable
> session) also moves the high latency
> iperf3 performance plummets to 300 - 400 Mbit/s
> CPU load (user / system) is around 0.1%
> RAM usage is around 3 - 4 GB
> if_packets count is stable (around 8000 pkt/s more)

I'm not sure I get your topology. Packets are going from where to
where, and what link is the bottleneck for the transfer you're doing?
Are you measuring the latency along the same path?

Have you tried running 'mtr' to figure out which hop the latency is at?

> Here is the tc -s qdisc output:

This indicates ("dropped 0" and "ecn_mark 0") that there's no
backpressure on the qdisc, so something else is going on. Also, you
said the issue goes away if you downgrade the kernel? That does sound
odd...

-Toke
[Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
Hi all,

I'm coming from the lartc mailing list, here's the original text:

=

I have multiple routers which connect to multiple upstream providers,
I have noticed a high latency shift in icmp (and generally all
connections) if I run b2 upload-file --threads 40 (and I can reproduce
this)

What options do I have to analyze why this happens?

General Info:

Routers are connected between each other with 10G Mellanox Connect-X
cards via 10G SFP+ DAC cables via a 10G switch from fs.com
Latency generally is around 0.18 ms between all routers (4).
Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3.
2 of the 4 routers are connected upstream with a 1G connection
(separate port, same network card)
All routers have the full internet routing tables, i.e. 80k entries
for IPv6 and 830k entries for IPv4
Conntrack is disabled (-j NOTRACK)
Kernel 5.4.60 (custom)
2x Xeon X5670 @ 2.93 GHz
96 GB RAM
No swap
CentOS 7

During high latency:

Latency on routers which carry the traffic flow increases to 12 - 20
ms, for all interfaces; moving the stream (via bgp disable session)
also moves the high latency
iperf3 performance plummets to 300 - 400 Mbit/s
CPU load (user / system) is around 0.1%
RAM usage is around 3 - 4 GB
if_packets count is stable (around 8000 pkt/s more)

For b2 upload-file with 10 threads I can achieve 60 MB/s consistently;
with 40 threads the performance drops to 8 MB/s.

I do not believe that 40 tcp streams should be any problem for a
machine of that size.

Thanks for any ideas, help, pointers, things I can verify / check /
provide additionally!
===

So far I have tested:

1) Use stock kernel 3.10.0-541 -> issue does not happen
2) Set up fq_codel on the interfaces:

Here is the tc -s qdisc output:

qdisc fq_codel 8005: dev eth4 root refcnt 193 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 8374229144 bytes 10936167 pkt (dropped 0, overlimits 0 requeues 6127)
 backlog 0b 0p requeues 6127
  maxpacket 25398 drop_overlimit 0 new_flow_count 15441 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 8008: dev eth5 root refcnt 193 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 1072480080 bytes 1012973 pkt (dropped 0, overlimits 0 requeues 735)
 backlog 0b 0p requeues 735
  maxpacket 19682 drop_overlimit 0 new_flow_count 15963 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 8004: dev eth4.2300 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 8441021899 bytes 11021070 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 68130 drop_overlimit 0 new_flow_count 257055 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 8006: dev eth5.2501 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 571984459 bytes 2148377 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 7570 drop_overlimit 0 new_flow_count 11300 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 8007: dev eth5.2502 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 140132 bytes 1966724 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 19682 drop_overlimit 0 new_flow_count 76653 ecn_mark 0
  new_flows_len 0 old_flows_len 0

I have no statistics / metrics that would point to a slowdown on the
server; cpu / load / network / packets / memory all show normal, very
low load.

Are there other (hidden) metrics I can collect to analyze this issue
further?
Thanks,
Thomas
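[Editor's note: one "hidden" metric worth checking for the question
above is /proc/net/softnet_stat, a sketch of which follows. Per CPU,
column 2 counts packets dropped because the input backlog was full and
column 3 counts "time squeeze" events, where the NAPI softirq ran out
of budget mid-poll; neither shows up in ordinary interface counters,
so they can stay hidden while latency climbs.]

```shell
# Sum drops and time squeezes across all CPUs. Fields in
# /proc/net/softnet_stat are hexadecimal, one row per CPU:
drops=0; squeezed=0
while read -r _ d s _; do
    drops=$(( drops + 16#$d ))
    squeezed=$(( squeezed + 16#$s ))
done < /proc/net/softnet_stat
echo "dropped=$drops time_squeeze=$squeezed"
```

Sampling this before and during a b2 upload run would show whether the
softirq path is falling behind even though per-interface stats look
clean.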