Re: [Bloat] BBR implementations, knobs to turn?

2020-11-16 Thread Neal Cardwell via Bloat
A couple questions:

- I guess this is Linux TCP BBRv1 ("bbr" module)? What's the OS
distribution and exact kernel version ("uname -r")?

- What do you mean when you say "The old server allows for more
re-transmits"?

- If BBRv1 is suffering throughput problems due to high retransmit rates,
then usually the retransmit rate is around 15% or higher. If the retransmit
rate is that high on a radio link that is being tested, then that radio
link may be having issues that should be investigated separately?

- Would you be able to take a tcpdump trace of the well-behaved and
problematic traffic and share the pcap or a plot?

https://github.com/google/bbr/blob/master/Documentation/bbr-faq.md#how-can-i-visualize-the-behavior-of-linux-tcp-bbr-connections

- Would you be able to share the output of "ss -tin" from a recently built
"ss" binary, near the end of a long-lived test flow, for the well-behaved
and problematic cases?

https://github.com/google/bbr/blob/master/Documentation/bbr-faq.md#how-can-i-monitor-linux-tcp-bbr-connections
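
For reference, a minimal sketch of how one might collect both (the interface,
port, and filenames below are placeholders, not anything taken from your setup):

  # headers-only capture of the test traffic on the server side
  tcpdump -i eth0 -s 100 -w bbr-test.pcap 'tcp port 5201'   # 5201 = iperf3 default, adjust

  # near the end of a long-lived flow, dump per-socket TCP/BBR state
  ss -tin dst 192.0.2.10   # example client address
  # a rough retransmit rate is the total retrans count divided by segs_out
  # from the same ss output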

best,
neal



On Mon, Nov 16, 2020 at 10:25 AM  wrote:

> I'm in the process of replacing a throughput test server.  The old server
> is running a 1Gbit Ethernet card on a 1Gbit link and Ubuntu.  The new one
> has a 10Gbit card on a 40Gbit link and CentOS.  Both have low load and
> Xeon processors.
>
>
> The purpose is for field installers to verify the bandwidth sold to the
> customers using known clients against known servers.  (4G and 5G fixed
> installations mainly).
>
>
> What I'm finding is that the new server is consistently delivering
> slightly lower throughput than the old server.  The old server allows for
> more re-transmits and has a slightly higher congestion window than the new
> server.
>
>
> Is there any way to tune bbr to allow for more re-transmits (which seems
> to be the limiting factor)?  Or other suggestions?
>
>
>
> (Frankly I think the old server is too aggressive for general purpose use.
> It seems to starve out other TCP sessions more than the new server.  So for
> delivering regular content to users the new implementation seems more
> balanced, but that is not the target here.  We want to stress test the
> radio link.)
>
>
> Regards Erik
> ___
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat
>
___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


[Bloat] BBR implementations, knobs to turn?

2020-11-16 Thread erik.taraldsen
I'm in the process of replacing a throughput test server.  The old server is
running a 1Gbit Ethernet card on a 1Gbit link and Ubuntu.  The new one has a
10Gbit card on a 40Gbit link and CentOS.  Both have low load and Xeon processors.


The purpose is for field installers to verify the bandwidth sold to the 
customers using known clients against known servers.  (4G and 5G fixed 
installations mainly).


What I'm finding is that the new server is consistently delivering slightly 
lower throughput than the old server.  The old server allows for more 
re-transmits and has a slightly higher congestion window than the new server.


Is there any way to tune bbr to allow for more re-transmits (which seems to be 
the limiting factor)?  Or other suggestions?



(Frankly I think the old server is too aggressive for general purpose use.  It
seems to starve out other TCP sessions more than the new server.  So for
delivering regular content to users the new implementation seems more balanced, 
but that is not the target here.  We want to stress test the radio link.)


Regards Erik
___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-16 Thread Thomas Rosenstein via Bloat



On 16 Nov 2020, at 13:34, Jesper Dangaard Brouer wrote:


On Wed, 04 Nov 2020 16:23:12 +0100
Thomas Rosenstein via Bloat  wrote:

[...]
I have multiple routers which connect to multiple upstream providers. I
have noticed a high latency shift in icmp (and generally all connections)
if I run b2 upload-file --threads 40 (and I can reproduce this).

What options do I have to analyze why this happens?

General Info:

Routers are connected between each other with 10G Mellanox Connect-X
cards via 10G SFP+ DAC cables via a 10G switch from fs.com.
Latency generally is around 0.18 ms between all routers (4).
Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3.
2 of the 4 routers are connected upstream with a 1G connection (separate
port, same network card).
All routers have the full internet routing tables, i.e. 80k entries for
IPv6 and 830k entries for IPv4.
Conntrack is disabled (-j NOTRACK).
Kernel 5.4.60 (custom)
2x Xeon X5670 @ 2.93 GHz


I think I have spotted your problem... This CPU[1] Xeon X5670 is more
than 10 years old!  It basically corresponds to the machines I used for
my presentation at LinuxCon 2009, see slides[2].  Only with large frames
and with massive scaling across all CPUs was I able to get close to
10Gbit/s through these machines.  And on top I had to buy low-latency
RAM memory-blocks to make it happen.

As you can see on my slides[2], memory bandwidth and PCIe speeds were at
the limit for making it possible on the hardware level.  I had to run
DDR3 memory at 1333MHz and tune the QuickPath Interconnect (QPI) to
6.4GT/s (default 4.8GT/s).

This generation of motherboards had both PCIe gen-1 and gen-2 slots.  Only
the PCIe gen-2 slots had barely enough bandwidth.  Maybe you physically
placed the NIC in a PCIe gen-1 slot?

On top of this, you also have a NUMA system, 2x Xeon X5670, which can
result in A LOT of "funny" issues that are really hard to troubleshoot...




Yes, I'm aware of the limits of what to expect, but as we agreed, 60 TCP
streams with not even 200 Mbit/s shouldn't overload the PCIe bus or the
CPUs.


Also, don't forget, no issues with Kernel 3.10.

The PCIe slot is a Gen2 x8, so more than enough bandwidth there, luckily ;)

But yes, they are quite old...



[1] 
https://ark.intel.com/content/www/us/en/ark/products/47920/intel-xeon-processor-x5670-12m-cache-2-93-ghz-6-40-gt-s-intel-qpi.html


[2] 
https://people.netfilter.org/hawk/presentations/LinuxCon2009/LinuxCon2009_JesperDangaardBrouer_final.pdf


--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-16 Thread Jesper Dangaard Brouer
On Wed, 04 Nov 2020 16:23:12 +0100
Thomas Rosenstein via Bloat  wrote:

[...] 
> I have multiple routers which connect to multiple upstream providers. I 
> have noticed a high latency shift in icmp (and generally all connections) 
> if I run b2 upload-file --threads 40 (and I can reproduce this).
> 
> What options do I have to analyze why this happens?
> 
> General Info:
> 
> Routers are connected between each other with 10G Mellanox Connect-X 
> cards via 10G SFP+ DAC cables via a 10G switch from fs.com.
> Latency generally is around 0.18 ms between all routers (4).
> Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3.
> 2 of the 4 routers are connected upstream with a 1G connection (separate 
> port, same network card)
> All routers have the full internet routing tables, i.e. 80k entries for 
> IPv6 and 830k entries for IPv4
> Conntrack is disabled (-j NOTRACK)
> Kernel 5.4.60 (custom)
> 2x Xeon X5670 @ 2.93 GHz

I think I have spotted your problem... This CPU[1] Xeon X5670 is more
than 10 years old!  It basically corresponds to the machines I used for
my presentation at LinuxCon 2009, see slides[2].  Only with large frames
and with massive scaling across all CPUs was I able to get close to
10Gbit/s through these machines.  And on top I had to buy low-latency
RAM memory-blocks to make it happen.

As you can see on my slides[2], memory bandwidth and PCIe speeds were at
the limit for making it possible on the hardware level.  I had to run
DDR3 memory at 1333MHz and tune the QuickPath Interconnect (QPI) to
6.4GT/s (default 4.8GT/s).

This generation of motherboards had both PCIe gen-1 and gen-2 slots.  Only
the PCIe gen-2 slots had barely enough bandwidth.  Maybe you physically
placed the NIC in a PCIe gen-1 slot?
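
A quick way to check which slot it ended up in (the PCI address 03:00.0
below is just an example, adjust to what lspci shows on your box):

  lspci | grep -i mellanox                       # find the NIC's PCI address
  lspci -s 03:00.0 -vv | grep -E 'LnkCap|LnkSta'
  # LnkSta shows the negotiated link: 2.5GT/s = gen-1, 5GT/s = gen-2;
  # a gen-1 x8 link has roughly half the raw bandwidth of a gen-2 x8 link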

On top of this, you also have a NUMA system, 2x Xeon X5670, which can
result in A LOT of "funny" issues that are really hard to troubleshoot...


[1] 
https://ark.intel.com/content/www/us/en/ark/products/47920/intel-xeon-processor-x5670-12m-cache-2-93-ghz-6-40-gt-s-intel-qpi.html

[2] 
https://people.netfilter.org/hawk/presentations/LinuxCon2009/LinuxCon2009_JesperDangaardBrouer_final.pdf

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-16 Thread Thomas Rosenstein via Bloat



On 16 Nov 2020, at 12:56, Jesper Dangaard Brouer wrote:


On Fri, 13 Nov 2020 07:31:26 +0100
"Thomas Rosenstein"  wrote:


On 12 Nov 2020, at 16:42, Jesper Dangaard Brouer wrote:


On Thu, 12 Nov 2020 14:42:59 +0100
"Thomas Rosenstein"  wrote:


Notice "Adaptive" setting is on.  My long-shot theory(2) is that
this
adaptive algorithm in the driver code can guess wrong (due to not
taking TSO into account) and cause issues for

Try to turn this adaptive algorithm off:

  ethtool -C eth4 adaptive-rx off adaptive-tx off


[...]


rx-usecs: 32


When you turn off "adaptive-rx" you will get 31250 interrupts/sec
(calc 1/(32/10^6) = 31250).


rx-frames: 64

[...]

tx-usecs-irq: 0
tx-frames-irq: 0


[...]


I have now updated the settings to:

ethtool -c eth4
Coalesce parameters for eth4:
Adaptive RX: off  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 0


Please put a value in rx-usecs, like 20 or 10.
The value 0 is often used to signal driver to do adaptive.


Ok, put it now to 10.


Setting it to 10 is a little aggressive, as you ask it to generate
100,000 interrupts per sec.  (Watch with 'vmstat 1' to see it.)

 1/(10/10^6) = 100,000 interrupts/sec

Goes a bit quicker (transfer up to 26 MB/s), but discards and PCI stalls
are still there.


Why are you measuring in (26) MBytes/sec? (equals 208 Mbit/s)


yep, 208 Mbit/s



If you still have ethtool PHY-discards, then you still have a problem.


Ping times are noticeably improved:


Okay, so this means these changes did have a positive effect.  So this
can be related to the OS not getting activated fast enough by NIC
interrupts.



64 bytes from x.x.x.x: icmp_seq=39 ttl=64 time=0.172 ms
64 bytes from x.x.x.x: icmp_seq=40 ttl=64 time=0.414 ms
64 bytes from x.x.x.x: icmp_seq=41 ttl=64 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=42 ttl=64 time=1.41 ms
64 bytes from x.x.x.x: icmp_seq=43 ttl=64 time=0.172 ms
64 bytes from x.x.x.x: icmp_seq=44 ttl=64 time=0.228 ms
64 bytes from x.x.x.x: icmp_seq=46 ttl=64 time=0.120 ms
64 bytes from x.x.x.x: icmp_seq=47 ttl=64 time=1.47 ms
64 bytes from x.x.x.x: icmp_seq=48 ttl=64 time=0.162 ms
64 bytes from x.x.x.x: icmp_seq=49 ttl=64 time=0.160 ms
64 bytes from x.x.x.x: icmp_seq=50 ttl=64 time=0.158 ms
64 bytes from x.x.x.x: icmp_seq=51 ttl=64 time=0.113 ms


Can you try to test if disabling TSO, GRO and GSO makes a difference?

 ethtool -K eth4 gso off gro off tso off



I had a call yesterday with Mellanox and we added the following boot 
options: intel_idle.max_cstate=0 processor.max_cstate=1 idle=poll


This completely solved the problem, but now we run with a heater and
energy consumer, nearly 2x the watts at the outlet.


I had no discards, super pings during transfer (< 0.100 ms), no outliers,
and good transfer rates > 50 MB/s.



So it seems to be related to C-state management in newer kernel versions
being too aggressive.
I would like to try to tune here a bit; maybe we can get some input on
which knobs to turn?


I will read here: 
https://www.kernel.org/doc/html/latest/admin-guide/pm/cpuidle.html#idle-states-representation

and related docs; I think there will be a few helpful hints.
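
A rough sketch of the kind of knobs that doc describes (state numbers and
latencies differ per CPU, so these paths are just an example I'd verify
first, not something I've run on these boxes yet):

  # list each idle state's name and worst-case exit latency (usec) on cpu0
  grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name \
         /sys/devices/system/cpu/cpu0/cpuidle/state*/latency

  # disable only the deepest state on all CPUs instead of idle=poll
  for f in /sys/devices/system/cpu/cpu*/cpuidle/state4/disable; do echo 1 > "$f"; done

  # alternatively, a process holding /dev/cpu_dma_latency open with a low
  # value caps the allowed exit latency system-wide (PM QoS)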



--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-16 Thread Jesper Dangaard Brouer
On Fri, 13 Nov 2020 07:31:26 +0100
"Thomas Rosenstein"  wrote:

> On 12 Nov 2020, at 16:42, Jesper Dangaard Brouer wrote:
> 
> > On Thu, 12 Nov 2020 14:42:59 +0100
> > "Thomas Rosenstein"  wrote:
> >  
> >>> Notice "Adaptive" setting is on.  My long-shot theory(2) is that 
> >>> this
> >>> adaptive algorithm in the driver code can guess wrong (due to not
> >>> taking TSO into account) and cause issues for
> >>>
> >>> Try to turn this adaptive algorithm off:
> >>>
> >>>   ethtool -C eth4 adaptive-rx off adaptive-tx off
> >>>  
> > [...]  
> 
>  rx-usecs: 32  
> >>>
> >>> When you turn off "adaptive-rx" you will get 31250 interrupts/sec
> >>> (calc 1/(32/10^6) = 31250).
> >>>  
>  rx-frames: 64  
> > [...]  
>  tx-usecs-irq: 0
>  tx-frames-irq: 0
>   
> >>> [...]  
> >>
> >> I have now updated the settings to:
> >>
> >> ethtool -c eth4
> >> Coalesce parameters for eth4:
> >> Adaptive RX: off  TX: off
> >> stats-block-usecs: 0
> >> sample-interval: 0
> >> pkt-rate-low: 0
> >> pkt-rate-high: 0
> >>
> >> rx-usecs: 0  
> >
> > Please put a value in rx-usecs, like 20 or 10.
> > The value 0 is often used to signal driver to do adaptive.  
> 
> Ok, put it now to 10.

Setting it to 10 is a little aggressive, as you ask it to generate
100,000 interrupts per sec.  (Watch with 'vmstat 1' to see it.)

 1/(10/10^6) = 100,000 interrupts/sec
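
For example (rx-usecs 20 is the milder of the two values I suggested; the
interface name eth4 is taken from your earlier output):

 ethtool -C eth4 rx-usecs 20   # caps the NIC at ~50,000 interrupts/sec
 vmstat 1                      # "in" column shows interrupts/sec to compare against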

> Goes a bit quicker (transfer up to 26 MB/s), but discards and pci stalls 
> are still there.

Why are you measuring in (26) MBytes/sec? (equals 208 Mbit/s)

If you still have ethtool PHY-discards, then you still have a problem.

> Ping times are noticeably improved:

Okay, so this means these changes did have a positive effect.  So this
can be related to the OS not getting activated fast enough by NIC
interrupts.

 
> 64 bytes from x.x.x.x: icmp_seq=39 ttl=64 time=0.172 ms
> 64 bytes from x.x.x.x: icmp_seq=40 ttl=64 time=0.414 ms
> 64 bytes from x.x.x.x: icmp_seq=41 ttl=64 time=0.183 ms
> 64 bytes from x.x.x.x: icmp_seq=42 ttl=64 time=1.41 ms
> 64 bytes from x.x.x.x: icmp_seq=43 ttl=64 time=0.172 ms
> 64 bytes from x.x.x.x: icmp_seq=44 ttl=64 time=0.228 ms
> 64 bytes from x.x.x.x: icmp_seq=46 ttl=64 time=0.120 ms
> 64 bytes from x.x.x.x: icmp_seq=47 ttl=64 time=1.47 ms
> 64 bytes from x.x.x.x: icmp_seq=48 ttl=64 time=0.162 ms
> 64 bytes from x.x.x.x: icmp_seq=49 ttl=64 time=0.160 ms
> 64 bytes from x.x.x.x: icmp_seq=50 ttl=64 time=0.158 ms
> 64 bytes from x.x.x.x: icmp_seq=51 ttl=64 time=0.113 ms

Can you try to test if disabling TSO, GRO and GSO makes a difference?

 ethtool -K eth4 gso off gro off tso off


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat