Re: [Bloat] BBR implementations, knobs to turn?
A couple of questions:

- I guess this is Linux TCP BBRv1 (the "bbr" module)? What's the OS
  distribution and exact kernel version ("uname -r")?

- What do you mean when you say "The old server allows for more
  re-transmits"?

- If BBRv1 is suffering throughput problems due to high retransmit rates,
  then usually the retransmit rate is around 15% or higher. If the
  retransmit rate is that high on a radio link that is being tested, then
  that radio link may be having issues that should be investigated
  separately?

- Would you be able to take a tcpdump trace of the well-behaved and
  problematic traffic and share the pcap or a plot?
  https://github.com/google/bbr/blob/master/Documentation/bbr-faq.md#how-can-i-visualize-the-behavior-of-linux-tcp-bbr-connections

- Would you be able to share the output of "ss -tin" from a recently built
  "ss" binary, near the end of a long-lived test flow, for the well-behaved
  and problematic cases?
  https://github.com/google/bbr/blob/master/Documentation/bbr-faq.md#how-can-i-monitor-linux-tcp-bbr-connections

best,
neal

On Mon, Nov 16, 2020 at 10:25 AM wrote:
> I'm in the process of replacing a throughput test server. The old server
> is running a 1Gbit Ethernet card on a 1Gbit link and Ubuntu; the new one
> runs a 10Gbit card on a 40Gbit link and CentOS. Both have low load and
> Xeon processors.
>
> The purpose is for field installers to verify the bandwidth sold to
> customers, using known clients against known servers (4G and 5G fixed
> installations mainly).
>
> What I'm finding is that the new server consistently delivers slightly
> lower throughput than the old server. The old server allows for more
> re-transmits and has a slightly higher congestion window than the new
> server.
>
> Is there any way to tune BBR to allow for more re-transmits (which seem
> to be the limiting factor)? Or other suggestions?
>
> (Frankly, I think the old server is too aggressive for general-purpose
> use. It seems to starve out other TCP sessions more than the new server
> does. So for delivering regular content to users the new implementation
> seems more balanced, but that is not the target here. We want to
> stress-test the radio link.)
>
> Regards, Erik
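For reference, a minimal sketch of the captures being requested here, run
on the server side during a test flow. The interface name (eth0), the
iperf3 port (5201), and the client-address placeholder are assumptions;
substitute the real values:

  # packet trace of the test flow (headers only), for later plotting
  tcpdump -i eth0 -s 120 -w bbr-test.pcap port 5201 &

  # per-flow BBR state (cwnd, pacing rate, retransmit counts) near the
  # end of the flow; needs a reasonably recent iproute2 "ss"
  ss -tin dst <client-ip>

The "ss -tin" output makes the old/new server comparison concrete, since
it shows the congestion-control name, cwnd, and retransmits per connection
rather than inferring them from throughput.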
[Bloat] BBR implementations, knobs to turn?
I'm in the process of replacing a throughput test server. The old server
is running a 1Gbit Ethernet card on a 1Gbit link and Ubuntu; the new one
runs a 10Gbit card on a 40Gbit link and CentOS. Both have low load and
Xeon processors.

The purpose is for field installers to verify the bandwidth sold to
customers, using known clients against known servers (4G and 5G fixed
installations mainly).

What I'm finding is that the new server consistently delivers slightly
lower throughput than the old server. The old server allows for more
re-transmits and has a slightly higher congestion window than the new
server.

Is there any way to tune BBR to allow for more re-transmits (which seem to
be the limiting factor)? Or other suggestions?

(Frankly, I think the old server is too aggressive for general-purpose
use. It seems to starve out other TCP sessions more than the new server
does. So for delivering regular content to users the new implementation
seems more balanced, but that is not the target here. We want to
stress-test the radio link.)

Regards, Erik
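A minimal way to make the old/new comparison concrete on both servers
(the counter names are standard Linux; treating iperf3 or similar as the
test tool is an assumption):

  # confirm which congestion-control algorithm each server actually uses
  sysctl net.ipv4.tcp_congestion_control

  # snapshot retransmit counters before and after a test run; the delta
  # in TcpRetransSegs relative to TcpOutSegs is the retransmit rate
  nstat -az TcpRetransSegs TcpOutSegs

Comparing the TcpRetransSegs deltas between the Ubuntu and CentOS boxes
would show directly whether the old server really retransmits more.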
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 16 Nov 2020, at 13:34, Jesper Dangaard Brouer wrote:

> On Wed, 04 Nov 2020 16:23:12 +0100 Thomas Rosenstein via Bloat wrote:
>
> [...]
>
>> I have multiple routers which connect to multiple upstream providers. I
>> have noticed a high latency shift in ICMP (and generally all
>> connections) if I run "b2 upload-file --threads 40" (and I can
>> reproduce this).
>>
>> What options do I have to analyze why this happens?
>>
>> General info:
>>
>> Routers are connected to each other with 10G Mellanox Connect-X cards
>> via 10G SFP+ DAC cables via a 10G switch from fs.com
>> Latency generally is around 0.18 ms between all routers (4).
>> Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3.
>> 2 of the 4 routers are connected upstream with a 1G connection
>> (separate port, same network card)
>> All routers have the full internet routing tables, i.e. 80k entries for
>> IPv6 and 830k entries for IPv4
>> Conntrack is disabled (-j NOTRACK)
>> Kernel 5.4.60 (custom)
>> 2x Xeon X5670 @ 2.93 GHz
>
> I think I have spotted your problem... This CPU[1], the Xeon X5670, is
> more than 10 years old! It basically corresponds to the machines I used
> for my presentation at LinuxCon 2009; see slides[2]. Only with large
> frames and with massive scaling across all CPUs was I able to get close
> to 10 Gbit/s through these machines. And on top of that, I had to buy
> low-latency RAM modules to make it happen.
>
> As you can see in my slides[2], memory bandwidth and PCIe speeds were at
> the limit of making it possible at the hardware level. I had to run DDR3
> memory at 1333 MHz and tune the QuickPath Interconnect (QPI) to 6.4 GT/s
> (default 4.8 GT/s). Motherboards of this generation had both PCIe gen-1
> and gen-2 slots, and only the PCIe gen-2 slots had (barely) enough
> bandwidth. Maybe you physically placed the NIC in a PCIe gen-1 slot?
>
> On top of this, you also have a NUMA system (2x Xeon X5670), which can
> result in A LOT of "funny" issues that are really hard to
> troubleshoot...

Yes, I'm aware of the limits of what to expect, but as we agree, 60 TCP
streams at not even 200 Mbit/s shouldn't overload the PCIe bus or the
CPUs. Also, don't forget: no issues with kernel 3.10.

PCI slot is a Gen2, x8, so more than enough bandwidth there luckily ;)
But yes, they are quite old...

> [1] https://ark.intel.com/content/www/us/en/ark/products/47920/intel-xeon-processor-x5670-12m-cache-2-93-ghz-6-40-gt-s-intel-qpi.html
> [2] https://people.netfilter.org/hawk/presentations/LinuxCon2009/LinuxCon2009_JesperDangaardBrouer_final.pdf
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
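On the NUMA point, a hedged sketch for checking whether the NIC and the
busy CPUs share a node ("eth4" is the interface name used elsewhere in
this thread; it is an assumption here):

  # NUMA node the NIC's PCIe slot hangs off (-1 = no NUMA info exported)
  cat /sys/class/net/eth4/device/numa_node

  # CPU-to-node layout, to compare against where the NIC IRQs land
  numactl --hardware

If the NIC sits on one node and the forwarding load runs on the other,
every packet pays a QPI crossing, which is exactly the memory-bandwidth
limit described above.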
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On Wed, 04 Nov 2020 16:23:12 +0100 Thomas Rosenstein via Bloat wrote:

[...]

> I have multiple routers which connect to multiple upstream providers. I
> have noticed a high latency shift in ICMP (and generally all
> connections) if I run "b2 upload-file --threads 40" (and I can
> reproduce this).
>
> What options do I have to analyze why this happens?
>
> General info:
>
> Routers are connected to each other with 10G Mellanox Connect-X cards
> via 10G SFP+ DAC cables via a 10G switch from fs.com
> Latency generally is around 0.18 ms between all routers (4).
> Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3.
> 2 of the 4 routers are connected upstream with a 1G connection (separate
> port, same network card)
> All routers have the full internet routing tables, i.e. 80k entries for
> IPv6 and 830k entries for IPv4
> Conntrack is disabled (-j NOTRACK)
> Kernel 5.4.60 (custom)
> 2x Xeon X5670 @ 2.93 GHz

I think I have spotted your problem... This CPU[1], the Xeon X5670, is
more than 10 years old! It basically corresponds to the machines I used
for my presentation at LinuxCon 2009; see slides[2]. Only with large
frames and with massive scaling across all CPUs was I able to get close
to 10 Gbit/s through these machines. And on top of that, I had to buy
low-latency RAM modules to make it happen.

As you can see in my slides[2], memory bandwidth and PCIe speeds were at
the limit of making it possible at the hardware level. I had to run DDR3
memory at 1333 MHz and tune the QuickPath Interconnect (QPI) to 6.4 GT/s
(default 4.8 GT/s). Motherboards of this generation had both PCIe gen-1
and gen-2 slots, and only the PCIe gen-2 slots had (barely) enough
bandwidth. Maybe you physically placed the NIC in a PCIe gen-1 slot?

On top of this, you also have a NUMA system (2x Xeon X5670), which can
result in A LOT of "funny" issues that are really hard to troubleshoot...

[1] https://ark.intel.com/content/www/us/en/ark/products/47920/intel-xeon-processor-x5670-12m-cache-2-93-ghz-6-40-gt-s-intel-qpi.html
[2] https://people.netfilter.org/hawk/presentations/LinuxCon2009/LinuxCon2009_JesperDangaardBrouer_final.pdf

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
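A hedged sketch for answering the gen-1 vs gen-2 slot question directly
(the "03:00.0" slot address is an assumption; take the real one from the
first command, and run lspci as root for the full dump):

  # find the NIC's PCI address
  readlink /sys/class/net/eth4/device

  # compare the negotiated link (LnkSta) against the card's capability
  # (LnkCap): 2.5 GT/s means a gen-1 link, 5 GT/s means gen-2
  lspci -vv -s 03:00.0 | grep -E 'LnkCap:|LnkSta:'

A gen-2 x8 link gives roughly 4 GB/s raw, so a link that trained down to
gen-1 or a narrower width would show up here immediately.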
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On 16 Nov 2020, at 12:56, Jesper Dangaard Brouer wrote:

> On Fri, 13 Nov 2020 07:31:26 +0100 "Thomas Rosenstein" wrote:
>
>> On 12 Nov 2020, at 16:42, Jesper Dangaard Brouer wrote:
>>
>>> On Thu, 12 Nov 2020 14:42:59 +0100 "Thomas Rosenstein" wrote:
>>>
>>>>> Notice the "Adaptive" setting is on. My long-shot theory(2) is that
>>>>> this adaptive algorithm in the driver code can guess wrong (due to
>>>>> not taking TSO into account) and cause issues for
>>>>>
>>>>> Try to turn this adaptive algorithm off:
>>>>>
>>>>>   ethtool -C eth4 adaptive-rx off adaptive-tx off
>>>>>
>>>>>> [...]
>>>>>> rx-usecs: 32
>>>>>
>>>>> When you turn off "adaptive-rx" you will get 31250 interrupts/sec
>>>>> (calc: 1/(32/10^6) = 31250).
>>>>>
>>>>>> rx-frames: 64
>>>>>> [...]
>>>>>> tx-usecs-irq: 0
>>>>>> tx-frames-irq: 0
>>>>>
>>>>> [...]
>>>>
>>>> I have now updated the settings to:
>>>>
>>>>   ethtool -c eth4
>>>>   Coalesce parameters for eth4:
>>>>   Adaptive RX: off  TX: off
>>>>   stats-block-usecs: 0
>>>>   sample-interval: 0
>>>>   pkt-rate-low: 0
>>>>   pkt-rate-high: 0
>>>>   rx-usecs: 0
>>>
>>> Please put a value in rx-usecs, like 20 or 10.
>>> The value 0 is often used to signal the driver to do adaptive
>>> coalescing.
>>
>> Ok, put it now to 10.
>
> Setting it to 10 is a little aggressive, as you ask it to generate
> 100,000 interrupts per sec (calc: 1/(10/10^6) = 100,000 interrupts/sec).
> (Watch with 'vmstat 1' to see it.)
>
>> Goes a bit quicker (transfer up to 26 MB/s), but discards and PCI
>> stalls are still there.
>
> Why are you measuring in (26) MBytes/sec? (That equals 208 Mbit/s.)

yep, 208 Mbit/s

> If you still have ethtool PHY-discards, then you still have a problem.
>
>> Ping times are noticeably improved:
>
> Okay, so this means these changes did have a positive effect. So this
> can be related to the OS not getting activated fast enough by NIC
> interrupts.
>
>> 64 bytes from x.x.x.x: icmp_seq=39 ttl=64 time=0.172 ms
>> 64 bytes from x.x.x.x: icmp_seq=40 ttl=64 time=0.414 ms
>> 64 bytes from x.x.x.x: icmp_seq=41 ttl=64 time=0.183 ms
>> 64 bytes from x.x.x.x: icmp_seq=42 ttl=64 time=1.41 ms
>> 64 bytes from x.x.x.x: icmp_seq=43 ttl=64 time=0.172 ms
>> 64 bytes from x.x.x.x: icmp_seq=44 ttl=64 time=0.228 ms
>> 64 bytes from x.x.x.x: icmp_seq=46 ttl=64 time=0.120 ms
>> 64 bytes from x.x.x.x: icmp_seq=47 ttl=64 time=1.47 ms
>> 64 bytes from x.x.x.x: icmp_seq=48 ttl=64 time=0.162 ms
>> 64 bytes from x.x.x.x: icmp_seq=49 ttl=64 time=0.160 ms
>> 64 bytes from x.x.x.x: icmp_seq=50 ttl=64 time=0.158 ms
>> 64 bytes from x.x.x.x: icmp_seq=51 ttl=64 time=0.113 ms
>
> Can you try to test if disabling TSO, GRO and GSO makes a difference?
>
>   ethtool -K eth4 gso off gro off tso off

I had a call yesterday with Mellanox, and we added the following boot
options:

  intel_idle.max_cstate=0 processor.max_cstate=1 idle=poll

This completely solved the problem, but now we run with a heater and
energy consumer: nearly 2x the watts at the outlet. I had no discards,
super pings during transfer (< 0.100 ms), no outliers, and good transfer
rates of > 50 MB/s.

So it seems to be related to C-state management in newer kernel versions
being too aggressive. I would like to try to tune here a bit; maybe we
can get some input on which knobs to turn? I will read here:
https://www.kernel.org/doc/html/latest/admin-guide/pm/cpuidle.html#idle-states-representation
and related docs, I think there will be a few helpful hints.

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
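A hedged sketch of a middle ground between the default C-state policy and
idle=poll, using the cpuidle sysfs interface from the document linked
above (state numbering differs per system; list the states before
disabling anything):

  # list idle states and their exit latencies (microseconds)
  grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
  grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/latency

  # disable only the deepest state on every CPU at runtime, keeping C1;
  # the "state3" index is an assumption, pick it from the listing above
  for f in /sys/devices/system/cpu/cpu*/cpuidle/state3/disable; do
    echo 1 > "$f"
  done

This caps wakeup latency without the full power cost of idle=poll, and it
can be reverted at runtime by writing 0 back.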
Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60
On Fri, 13 Nov 2020 07:31:26 +0100 "Thomas Rosenstein" wrote:

> On 12 Nov 2020, at 16:42, Jesper Dangaard Brouer wrote:
>
>> On Thu, 12 Nov 2020 14:42:59 +0100 "Thomas Rosenstein" wrote:
>>
>>>> Notice the "Adaptive" setting is on. My long-shot theory(2) is that
>>>> this adaptive algorithm in the driver code can guess wrong (due to
>>>> not taking TSO into account) and cause issues for
>>>>
>>>> Try to turn this adaptive algorithm off:
>>>>
>>>>   ethtool -C eth4 adaptive-rx off adaptive-tx off
>>>>
>>>>> [...]
>>>>> rx-usecs: 32
>>>>
>>>> When you turn off "adaptive-rx" you will get 31250 interrupts/sec
>>>> (calc: 1/(32/10^6) = 31250).
>>>>
>>>>> rx-frames: 64
>>>>> [...]
>>>>> tx-usecs-irq: 0
>>>>> tx-frames-irq: 0
>>>>
>>>> [...]
>>>
>>> I have now updated the settings to:
>>>
>>>   ethtool -c eth4
>>>   Coalesce parameters for eth4:
>>>   Adaptive RX: off  TX: off
>>>   stats-block-usecs: 0
>>>   sample-interval: 0
>>>   pkt-rate-low: 0
>>>   pkt-rate-high: 0
>>>   rx-usecs: 0
>>
>> Please put a value in rx-usecs, like 20 or 10.
>> The value 0 is often used to signal the driver to do adaptive
>> coalescing.
>
> Ok, put it now to 10.

Setting it to 10 is a little aggressive, as you ask it to generate
100,000 interrupts per sec (calc: 1/(10/10^6) = 100,000 interrupts/sec).
(Watch with 'vmstat 1' to see it.)

> Goes a bit quicker (transfer up to 26 MB/s), but discards and PCI
> stalls are still there.

Why are you measuring in (26) MBytes/sec? (That equals 208 Mbit/s.)

If you still have ethtool PHY-discards, then you still have a problem.

> Ping times are noticeably improved:

Okay, so this means these changes did have a positive effect. So this
can be related to the OS not getting activated fast enough by NIC
interrupts.

> 64 bytes from x.x.x.x: icmp_seq=39 ttl=64 time=0.172 ms
> 64 bytes from x.x.x.x: icmp_seq=40 ttl=64 time=0.414 ms
> 64 bytes from x.x.x.x: icmp_seq=41 ttl=64 time=0.183 ms
> 64 bytes from x.x.x.x: icmp_seq=42 ttl=64 time=1.41 ms
> 64 bytes from x.x.x.x: icmp_seq=43 ttl=64 time=0.172 ms
> 64 bytes from x.x.x.x: icmp_seq=44 ttl=64 time=0.228 ms
> 64 bytes from x.x.x.x: icmp_seq=46 ttl=64 time=0.120 ms
> 64 bytes from x.x.x.x: icmp_seq=47 ttl=64 time=1.47 ms
> 64 bytes from x.x.x.x: icmp_seq=48 ttl=64 time=0.162 ms
> 64 bytes from x.x.x.x: icmp_seq=49 ttl=64 time=0.160 ms
> 64 bytes from x.x.x.x: icmp_seq=50 ttl=64 time=0.158 ms
> 64 bytes from x.x.x.x: icmp_seq=51 ttl=64 time=0.113 ms

Can you try to test if disabling TSO, GRO and GSO makes a difference?

  ethtool -K eth4 gso off gro off tso off

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
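To check the coalescing arithmetic above while a transfer runs, a small
sketch ("eth4" as elsewhere in the thread; the exact /proc/interrupts row
names vary by NIC and driver):

  # system-wide interrupt rate, shown in the "in" column
  vmstat 1

  # per-queue NIC interrupt counters; sample twice and diff the counts
  grep eth4 /proc/interrupts

With adaptive coalescing off and rx-usecs=10, the RX queues should show on
the order of 100,000 interrupts/sec under load (one interrupt per 10 µs),
so a much lower observed rate would suggest the coalescing settings are
not actually taking effect.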