Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-16 Thread Thomas Rosenstein via Bloat



On 16 Nov 2020, at 13:34, Jesper Dangaard Brouer wrote:


On Wed, 04 Nov 2020 16:23:12 +0100
Thomas Rosenstein via Bloat  wrote:

[...]
I have multiple routers which connect to multiple upstream providers. I
have noticed a high latency shift in icmp (and generally all connections)
if I run b2 upload-file --threads 40 (and I can reproduce this).

What options do I have to analyze why this happens?

General Info:

Routers are connected to each other with 10G Mellanox Connect-X
cards via 10G SFP+ DAC cables through a 10G switch from fs.com.
Latency generally is around 0.18 ms between all routers (4).
Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3.
2 of the 4 routers are connected upstream with a 1G connection (separate
port, same network card).
All routers have the full internet routing tables, i.e. 80k entries for
IPv6 and 830k entries for IPv4.
Conntrack is disabled (-j NOTRACK)
Kernel 5.4.60 (custom)
2x Xeon X5670 @ 2.93 Ghz


I think I have spotted your problem... This CPU[1] Xeon X5670 is more
than 10 years old!  It basically corresponds to the machines I used for
my presentation at LinuxCon 2009, see slides[2].  Only with large frames
and with massive scaling across all CPUs was I able to get close to
10Gbit/s through these machines.  And on top I had to buy low-latency
RAM memory-blocks to make it happen.

As you can see on my slides[2], memory bandwidth and PCIe speeds were at
the limit for making it possible on the hardware level.  I had to run
DDR3 memory at 1333MHz and tune the QuickPath Interconnect (QPI) to
6.4GT/s (default 4.8GT/s).

Motherboards of this generation had both PCIe gen-1 and gen-2 slots.  Only
the PCIe gen-2 slots had barely enough bandwidth.  Maybe you physically
placed the NIC in a PCIe gen-1 slot?

On top of this, you also have a NUMA system, 2x Xeon X5670, which can
result in A LOT of "funny" issues that are really hard to troubleshoot...




Yes, I'm aware of the limits of what to expect, but as we agree, 60 TCP
streams at not even 200 Mbit/s shouldn't overload the PCIe bus or the
CPUs.


Also, don't forget, no issues with Kernel 3.10.

PCI slot is a Gen2, x8, so more than enough bandwidth there luckily ;)

But yes, they are quite old...
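(For reference, one quick way to confirm the negotiated PCIe link speed
and width of the NIC is something like the sketch below; the bus address
04:00.0 is only a placeholder for whatever lspci reports on the actual
box:)

 lspci | grep -i mellanox
 lspci -s 04:00.0 -vvv | grep -E 'LnkCap|LnkSta'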



[1] 
https://ark.intel.com/content/www/us/en/ark/products/47920/intel-xeon-processor-x5670-12m-cache-2-93-ghz-6-40-gt-s-intel-qpi.html


[2] 
https://people.netfilter.org/hawk/presentations/LinuxCon2009/LinuxCon2009_JesperDangaardBrouer_final.pdf


--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-16 Thread Jesper Dangaard Brouer
On Wed, 04 Nov 2020 16:23:12 +0100
Thomas Rosenstein via Bloat  wrote:

[...] 
> I have multiple routers which connect to multiple upstream providers, I 
> have noticed a high latency shift in icmp (and generally all connection) 
> if I run b2 upload-file --threads 40 (and I can reproduce this)
> 
> What options do I have to analyze why this happens?
> 
> General Info:
> 
> Routers are connected between each other with 10G Mellanox Connect-X 
> cards via 10G SFP+ DAC cables via a 10G Switch from fs.com
> Latency generally is around 0.18 ms between all routers (4).
> Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3.
> 2 of the 4 routers are connected upstream with a 1G connection (separate 
> port, same network card)
> All routers have the full internet routing tables, i.e. 80k entries for 
> IPv6 and 830k entries for IPv4
> Conntrack is disabled (-j NOTRACK)
> Kernel 5.4.60 (custom)
> 2x Xeon X5670 @ 2.93 Ghz

I think I have spotted your problem... This CPU[1] Xeon X5670 is more
than 10 years old!  It basically corresponds to the machines I used for
my presentation at LinuxCon 2009, see slides[2].  Only with large frames
and with massive scaling across all CPUs was I able to get close to
10Gbit/s through these machines.  And on top I had to buy low-latency
RAM memory-blocks to make it happen.

As you can see on my slides[2], memory bandwidth and PCIe speeds were at
the limit for making it possible on the hardware level.  I had to run
DDR3 memory at 1333MHz and tune the QuickPath Interconnect (QPI) to
6.4GT/s (default 4.8GT/s).

Motherboards of this generation had both PCIe gen-1 and gen-2 slots.  Only
the PCIe gen-2 slots had barely enough bandwidth.  Maybe you physically
placed the NIC in a PCIe gen-1 slot?

On top of this, you also have a NUMA system, 2x Xeon X5670, which can
result in A LOT of "funny" issues that are really hard to troubleshoot...


[1] 
https://ark.intel.com/content/www/us/en/ark/products/47920/intel-xeon-processor-x5670-12m-cache-2-93-ghz-6-40-gt-s-intel-qpi.html

[2] 
https://people.netfilter.org/hawk/presentations/LinuxCon2009/LinuxCon2009_JesperDangaardBrouer_final.pdf

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-16 Thread Thomas Rosenstein via Bloat



On 16 Nov 2020, at 12:56, Jesper Dangaard Brouer wrote:


On Fri, 13 Nov 2020 07:31:26 +0100
"Thomas Rosenstein"  wrote:


On 12 Nov 2020, at 16:42, Jesper Dangaard Brouer wrote:


On Thu, 12 Nov 2020 14:42:59 +0100
"Thomas Rosenstein"  wrote:


Notice "Adaptive" setting is on.  My long-shot theory(2) is that
this
adaptive algorithm in the driver code can guess wrong (due to not
taking TSO into account) and cause issues for

Try to turn this adaptive algorithm off:

  ethtool -C eth4 adaptive-rx off adaptive-tx off


[...]


rx-usecs: 32


When you run with "adaptive-rx" off you will get 31250 interrupts/sec
(calc 1/(32/10^6) = 31250).


rx-frames: 64

[...]

tx-usecs-irq: 0
tx-frames-irq: 0


[...]


I have now updated the settings to:

ethtool -c eth4
Coalesce parameters for eth4:
Adaptive RX: off  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 0


Please put a value in rx-usecs, like 20 or 10.
The value 0 is often used to signal the driver to do adaptive coalescing.


Ok, put it now to 10.
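(Presumably via something like the following; the exact command is an
assumption based on the values discussed above:)

 ethtool -C eth4 rx-usecs 10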


Setting it to 10 is a little aggressive, as you ask it to generate
100,000 interrupts per sec.  (Watch with 'vmstat 1' to see it.)

 1/(10/10^6) = 100,000 interrupts/sec

Goes a bit quicker (transfer up to 26 MB/s), but discards and PCI stalls
are still there.


Why are you measuring in (26) MBytes/sec? (equal to 208 Mbit/s)


yep, 208 Mbit/s



If you still have ethtool PHY-discards, then you still have a problem.


Ping times are noticeably improved:


Okay, so these changes did have a positive effect.  This could be
related to the OS not being woken up fast enough by NIC interrupts.



64 bytes from x.x.x.x: icmp_seq=39 ttl=64 time=0.172 ms
64 bytes from x.x.x.x: icmp_seq=40 ttl=64 time=0.414 ms
64 bytes from x.x.x.x: icmp_seq=41 ttl=64 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=42 ttl=64 time=1.41 ms
64 bytes from x.x.x.x: icmp_seq=43 ttl=64 time=0.172 ms
64 bytes from x.x.x.x: icmp_seq=44 ttl=64 time=0.228 ms
64 bytes from x.x.x.x: icmp_seq=46 ttl=64 time=0.120 ms
64 bytes from x.x.x.x: icmp_seq=47 ttl=64 time=1.47 ms
64 bytes from x.x.x.x: icmp_seq=48 ttl=64 time=0.162 ms
64 bytes from x.x.x.x: icmp_seq=49 ttl=64 time=0.160 ms
64 bytes from x.x.x.x: icmp_seq=50 ttl=64 time=0.158 ms
64 bytes from x.x.x.x: icmp_seq=51 ttl=64 time=0.113 ms


Can you try to test if disabling TSO, GRO and GSO makes a difference?

 ethtool -K eth4 gso off gro off tso off



I had a call yesterday with Mellanox and we added the following boot
options: intel_idle.max_cstate=0 processor.max_cstate=1 idle=poll


This completely solved the problem, but now we run a heater and energy
hog, drawing nearly 2x the watts at the outlet.


I had no discards, super ping times during the transfer (< 0.100 ms), no
outliers, and good transfer rates > 50 MB/s.



So it seems to be related to the C-state management in newer kernel
versions being too aggressive.
I would like to try to tune this a bit; maybe we can get some input on
which knobs to turn?


I will read here: 
https://www.kernel.org/doc/html/latest/admin-guide/pm/cpuidle.html#idle-states-representation

and related docs, I think there will be a few helpful hints.
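(As a less drastic alternative to idle=poll, one option would be to
disable only the deepest idle states at runtime via sysfs; the state
numbers below are an assumption and should be checked against the
names/latencies reported on the actual box:)

 # inspect the available idle states and their exit latencies
 grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
 grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/latency

 # example: disable states 3 and deeper on all CPUs (runtime, reversible)
 for f in /sys/devices/system/cpu/cpu*/cpuidle/state[3-9]/disable; do
   echo 1 > "$f"
 done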



--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-16 Thread Jesper Dangaard Brouer
On Fri, 13 Nov 2020 07:31:26 +0100
"Thomas Rosenstein"  wrote:

> On 12 Nov 2020, at 16:42, Jesper Dangaard Brouer wrote:
> 
> > On Thu, 12 Nov 2020 14:42:59 +0100
> > "Thomas Rosenstein"  wrote:
> >  
> >>> Notice "Adaptive" setting is on.  My long-shot theory(2) is that 
> >>> this
> >>> adaptive algorithm in the driver code can guess wrong (due to not
> >>> taking TSO into account) and cause issues for
> >>>
> >>> Try to turn this adaptive algorithm off:
> >>>
> >>>   ethtool -C eth4 adaptive-rx off adaptive-tx off
> >>>  
> > [...]  
> 
>  rx-usecs: 32  
> >>>
> >>> When you run off "adaptive-rx" you will get 31250 interrupts/sec
> >>> (calc 1/(32/10^6) = 31250).
> >>>  
>  rx-frames: 64  
> > [...]  
>  tx-usecs-irq: 0
>  tx-frames-irq: 0
>   
> >>> [...]  
> >>
> >> I have now updated the settings to:
> >>
> >> ethtool -c eth4
> >> Coalesce parameters for eth4:
> >> Adaptive RX: off  TX: off
> >> stats-block-usecs: 0
> >> sample-interval: 0
> >> pkt-rate-low: 0
> >> pkt-rate-high: 0
> >>
> >> rx-usecs: 0  
> >
> > Please put a value in rx-usecs, like 20 or 10.
> > The value 0 is often used to signal driver to do adaptive.  
> 
> Ok, put it now to 10.

Setting it to 10 is a little aggressive, as you ask it to generate
100,000 interrupts per sec.  (Watch with 'vmstat 1' to see it.)

 1/(10/10^6) = 100,000 interrupts/sec

> Goes a bit quicker (transfer up to 26 MB/s), but discards and pci stalls 
> are still there.

Why are you measuring in (26) MBytes/sec? (equal to 208 Mbit/s)

If you still have ethtool PHY-discards, then you still have a problem.

> Ping times are noticable improved:

Okay, so these changes did have a positive effect.  This could be
related to the OS not being woken up fast enough by NIC interrupts.

 
> 64 bytes from x.x.x.x: icmp_seq=39 ttl=64 time=0.172 ms
> 64 bytes from x.x.x.x: icmp_seq=40 ttl=64 time=0.414 ms
> 64 bytes from x.x.x.x: icmp_seq=41 ttl=64 time=0.183 ms
> 64 bytes from x.x.x.x: icmp_seq=42 ttl=64 time=1.41 ms
> 64 bytes from x.x.x.x: icmp_seq=43 ttl=64 time=0.172 ms
> 64 bytes from x.x.x.x: icmp_seq=44 ttl=64 time=0.228 ms
> 64 bytes from x.x.x.x: icmp_seq=46 ttl=64 time=0.120 ms
> 64 bytes from x.x.x.x: icmp_seq=47 ttl=64 time=1.47 ms
> 64 bytes from x.x.x.x: icmp_seq=48 ttl=64 time=0.162 ms
> 64 bytes from x.x.x.x: icmp_seq=49 ttl=64 time=0.160 ms
> 64 bytes from x.x.x.x: icmp_seq=50 ttl=64 time=0.158 ms
> 64 bytes from x.x.x.x: icmp_seq=51 ttl=64 time=0.113 ms

Can you try to test if disabling TSO, GRO and GSO makes a difference?

 ethtool -K eth4 gso off gro off tso off


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-12 Thread Thomas Rosenstein via Bloat



On 12 Nov 2020, at 16:42, Jesper Dangaard Brouer wrote:


On Thu, 12 Nov 2020 14:42:59 +0100
"Thomas Rosenstein"  wrote:

Notice "Adaptive" setting is on.  My long-shot theory(2) is that 
this

adaptive algorithm in the driver code can guess wrong (due to not
taking TSO into account) and cause issues for

Try to turn this adaptive algorithm off:

  ethtool -C eth4 adaptive-rx off adaptive-tx off


[...]


rx-usecs: 32


When you run with "adaptive-rx" off you will get 31250 interrupts/sec
(calc 1/(32/10^6) = 31250).


rx-frames: 64

[...]

tx-usecs-irq: 0
tx-frames-irq: 0


[...]


I have now updated the settings to:

ethtool -c eth4
Coalesce parameters for eth4:
Adaptive RX: off  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 0


Please put a value in rx-usecs, like 20 or 10.
The value 0 is often used to signal the driver to do adaptive coalescing.


Ok, put it now to 10.

Goes a bit quicker (transfer up to 26 MB/s), but discards and pci stalls 
are still there.


Ping times are noticeably improved:

64 bytes from x.x.x.x: icmp_seq=39 ttl=64 time=0.172 ms
64 bytes from x.x.x.x: icmp_seq=40 ttl=64 time=0.414 ms
64 bytes from x.x.x.x: icmp_seq=41 ttl=64 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=42 ttl=64 time=1.41 ms
64 bytes from x.x.x.x: icmp_seq=43 ttl=64 time=0.172 ms
64 bytes from x.x.x.x: icmp_seq=44 ttl=64 time=0.228 ms
64 bytes from x.x.x.x: icmp_seq=46 ttl=64 time=0.120 ms
64 bytes from x.x.x.x: icmp_seq=47 ttl=64 time=1.47 ms
64 bytes from x.x.x.x: icmp_seq=48 ttl=64 time=0.162 ms
64 bytes from x.x.x.x: icmp_seq=49 ttl=64 time=0.160 ms
64 bytes from x.x.x.x: icmp_seq=50 ttl=64 time=0.158 ms
64 bytes from x.x.x.x: icmp_seq=51 ttl=64 time=0.113 ms




rx-frames: 32
rx-usecs-irq: 0
rx-frames-irq: 0




--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-12 Thread Jesper Dangaard Brouer
On Thu, 12 Nov 2020 14:42:59 +0100
"Thomas Rosenstein"  wrote:

> > Notice "Adaptive" setting is on.  My long-shot theory(2) is that this
> > adaptive algorithm in the driver code can guess wrong (due to not
> > taking TSO into account) and cause issues for
> >
> > Try to turn this adaptive algorithm off:
> >
> >   ethtool -C eth4 adaptive-rx off adaptive-tx off
> >  
[...]
> >>
> >> rx-usecs: 32  
> >
> > When you run off "adaptive-rx" you will get 31250 interrupts/sec
> > (calc 1/(32/10^6) = 31250).
> >  
> >> rx-frames: 64  
[...]
> >> tx-usecs-irq: 0
> >> tx-frames-irq: 0
> >>  
> > [...]  
> 
> I have now updated the settings to:
> 
> ethtool -c eth4
> Coalesce parameters for eth4:
> Adaptive RX: off  TX: off
> stats-block-usecs: 0
> sample-interval: 0
> pkt-rate-low: 0
> pkt-rate-high: 0
> 
> rx-usecs: 0

Please put a value in rx-usecs, like 20 or 10.
The value 0 is often used to signal the driver to do adaptive coalescing.

> rx-frames: 32
> rx-usecs-irq: 0
> rx-frames-irq: 0



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-12 Thread Thomas Rosenstein via Bloat


On 12 Nov 2020, at 14:31, Jesper Dangaard Brouer wrote:

> On Thu, 12 Nov 2020 12:26:20 +0100
> "Thomas Rosenstein"  wrote:
>
>>> Long-shot theory(2): The NIC driver IRQ interrupt coalesce
>>> adaptive-IRQ (adj via ethtool -C) is wrong. The reason it is wrong is
>>> because of TSO frames, due packet-to-byte ratio isn't taken into
>>> account.
>>
>> here are the coalese parameters: (are the same for 3.10 and 5.4.60)
>>
>> ethtool -c eth4
>> Coalesce parameters for eth4:
>> Adaptive RX: on  TX: on
>
> Notice "Adaptive" setting is on.  My long-shot theory(2) is that this
> adaptive algorithm in the driver code can guess wrong (due to not
> taking TSO into account) and cause issues for
>
> Try to turn this adaptive algorithm off:
>
>   ethtool -C eth4 adaptive-rx off adaptive-tx off
>
>> stats-block-usecs: 0
>> sample-interval: 0
>> pkt-rate-low: 0
>> pkt-rate-high: 0
>>
>> rx-usecs: 32
>
> When you run off "adaptive-rx" you will get 31250 interrupts/sec
> (calc 1/(32/10^6) = 31250).
>
>> rx-frames: 64
>
> You likely want to reduce this.
>
>> rx-usecs-irq: 0
>> rx-frames-irq: 0
>>
>> tx-usecs: 8
>> tx-frames: 128
>
> > These 128 frames also seem too large.
>
>> tx-usecs-irq: 0
>> tx-frames-irq: 0
>>
> [...]

I have now updated the settings to:

ethtool -c eth4
Coalesce parameters for eth4:
Adaptive RX: off  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 0
rx-frames: 32
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 16
tx-frames: 32
tx-usecs-irq: 0
tx-frames-irq: 0

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

but it makes absolutely no difference; discards also keep coming as before.

>>>
>>> You can test theory(2) in two ways, via disabling TSO and via
>>> adjusting
>>> ethtool --show-coalesce|--coalesce (-c|-C).  The --coalesce change can
>>> change part of theory(1), the NIC wake-up/IRQ interval.
here are the current NIC settings:

ethtool -k eth4
Features for eth4:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: off [fixed]
tx-checksum-ip-generic: on
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
tx-tcp-segmentation: off
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp-mangleid-segmentation: off
tx-tcp6-segmentation: off
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: on
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off
rx-all: off
tx-vlan-stag-hw-insert: on
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: on [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]

>
> What is the NIC ring size settings?
>
>  ethtool -g eth4
>

ethtool -g eth4
Ring parameters for eth4:
Pre-set maximums:
RX: 8192
RX Mini:0
RX Jumbo:   0
TX: 8192
Current hardware settings:
RX: 1024
RX Mini:0
RX Jumbo:   0
TX: 1024
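(If RX ring exhaustion were suspected, the rings could also be grown
towards the pre-set maximum; a sketch, assuming eth4 as above:)

 ethtool -G eth4 rx 4096 tx 4096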

> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-12 Thread Jesper Dangaard Brouer
On Thu, 12 Nov 2020 12:26:20 +0100
"Thomas Rosenstein"  wrote:

> > Long-shot theory(2): The NIC driver IRQ interrupt coalesce
> > adaptive-IRQ (adj via ethtool -C) is wrong. The reason it is wrong is
> > because of TSO frames, due packet-to-byte ratio isn't taken into
> > account.  
> 
> here are the coalese parameters: (are the same for 3.10 and 5.4.60)
> 
> ethtool -c eth4
> Coalesce parameters for eth4:
> Adaptive RX: on  TX: on

Notice "Adaptive" setting is on.  My long-shot theory(2) is that this
adaptive algorithm in the driver code can guess wrong (due to not
taking TSO into account) and cause issues for 

Try to turn this adaptive algorithm off:

  ethtool -C eth4 adaptive-rx off adaptive-tx off

> stats-block-usecs: 0
> sample-interval: 0
> pkt-rate-low: 0
> pkt-rate-high: 0
> 
> rx-usecs: 32

When you run with "adaptive-rx" off you will get 31250 interrupts/sec
(calc 1/(32/10^6) = 31250).

> rx-frames: 64

You likely want to reduce this.

> rx-usecs-irq: 0
> rx-frames-irq: 0
> 
> tx-usecs: 8
> tx-frames: 128

These 128 frames also seem too large.

> tx-usecs-irq: 0
> tx-frames-irq: 0
> 
[...]
> >
> > You can test theory(2) in two ways, via disabling TSO and via 
> > adjusting
> > ethtool --show-coalesce|--coalesce (-c|-C).  The --coalesce change can
> > change part of theory(1), the NIC wake-up/IRQ interval.  

What are the NIC ring-size settings?

 ethtool -g eth4

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-12 Thread Thomas Rosenstein via Bloat



On 12 Nov 2020, at 11:05, Jesper Dangaard Brouer wrote:


On Mon, 09 Nov 2020 15:33:46 +0100
"Thomas Rosenstein"  wrote:


On 9 Nov 2020, at 12:40, Jesper Dangaard Brouer wrote:


On Mon, 09 Nov 2020 11:09:33 +0100
"Thomas Rosenstein"  wrote:


On 9 Nov 2020, at 9:24, Jesper Dangaard Brouer wrote:


On Sat, 07 Nov 2020 14:00:04 +0100
Thomas Rosenstein via Bloat  wrote:


Here's an extract from the ethtool https://pastebin.com/cabpWGFz just in
case there's something hidden.


Yes, there is something hiding in the data from ethtool_stats.pl[1]:

(10G Mellanox Connect-X cards via 10G SFP+ DAC)

 stat:1 (  1) <= outbound_pci_stalled_wr_events /sec
 stat:339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec


I've not seen this counter 'rx_buffer_passed_thres_phy' before; looking
in the kernel driver code it is related to "rx_buffer_almost_full".

The numbers per second are excessive (but it may be related to a driver
bug, as it ends up reading "high" -> rx_buffer_almost_full_high in the
extended counters).


I have now tested with a new kernel 5.9.4 build made from 3.10 with make
oldconfig, and I noticed an interesting effect.

The first ca. 2 minutes the router behaves completely normally as with
3.10; after that the ping times go crazy.

I have recorded this with ethtool, and also the ping times.

Ethtool: (13 MB)
https://drive.google.com/file/d/1Ojp64UUw0zKwrgF_CisZb3BCdidAJYZo/view?usp=sharing

The transfer first was doing around 50 - 70 MB/s, then once the ping
times got worse it dropped to ~12 MB/s.
Around line 74324 the transfer speed drops to 12 MB/s.


Let's add them here so we can see and talk about them:

 74324 Ethtool(eth4 ) stat: 144584002 (144,584,002) <= rx_buffer_passed_thres_phy /sec
 74325 Ethtool(eth4 ) stat:  17243186 ( 17,243,186) <= rx_bytes /sec
 74326 Ethtool(eth4 ) stat:  19032079 ( 19,032,079) <= rx_bytes_phy /sec


Significant difference between RX-bytes Phy-level and OS-level.

17243186*8/10^6 = 138 Mbit/s

 74327 Ethtool(eth4 ) stat:       321 (        321) <= rx_cache_busy /sec
 74328 Ethtool(eth4 ) stat:       321 (        321) <= rx_cache_full /sec
 74329 Ethtool(eth4 ) stat:      7323 (      7,323) <= rx_cache_reuse /sec
 74330 Ethtool(eth4 ) stat:     15188 (     15,188) <= rx_csum_complete /sec
 74331 Ethtool(eth4 ) stat:         1 (          1) <= rx_csum_none /sec
 74332 Ethtool(eth4 ) stat:        96 (         96) <= rx_csum_unnecessary /sec
 74333 Ethtool(eth4 ) stat:      1242 (      1,242) <= rx_discards_phy /sec


Packets are getting rx_discards.

 74334 Ethtool(eth4 ) stat:     15285 (     15,285) <= rx_packets /sec
 74335 Ethtool(eth4 ) stat:     16655 (     16,655) <= rx_packets_phy /sec


(16655-15285 = 1370)
Again a difference between RX-packets Phy-level and OS-level.

 74336 Ethtool(eth4 ) stat:  19132497 ( 19,132,497) <= rx_prio0_bytes /sec
 74337 Ethtool(eth4 ) stat:      1264 (      1,264) <= rx_prio0_discards /sec
 74338 Ethtool(eth4 ) stat:     15474 (     15,474) <= rx_prio0_packets /sec


I have also now updated OFED on the routers and checked with 3.10 and
5.4.60 (new firmware for the adapters).

That changed absolutely nothing!

I still notice that the discards only happen under kernel 5.4.60; on
3.10 they don't.


Based on that I would blame the kernel instead of the hardware.
Do you agree with this?

I wouldn't exclude any settings being problematic for now; maybe some
defaults changed.








Seems you are right about rx_buffer_passed_thres_phy: if you check just
those lines, they appear more often once the speed dropped.
Not sure if that's the cause or an effect of the underlying problem!


Thanks for the link to Mellanox doc:
 https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters

The counter: 'rx_buffer_passed_thres_phy'
 Description:
  'The number of events where the port receive buffer was over 85% full.
   Supported from kernel 4.14'

The counter: 'tx_pause_ctrl_phy'
 Description:
 'The number of link layer pause packets transmitted on a physical
  port. If this counter is increasing, it implies that the NIC is
  congested and cannot absorb the traffic coming from the network.'


So, the NIC is clearly telling us that something is wrong. You are not
near link speed (12 MB/s => 96 Mbit/s), and still the port receive buffer
is over 85% full most of the time.  There are indications that the NIC
is congested and cannot absorb the traffic coming from the network.


I have opened a support case with Mellanox; maybe they have an idea
directly.


Their first response, based on outbound_pci_stalled_wr, was that the
PCI Express credits for the card run out.
Which means (if I interpret it correctly) that the buffer is no longer
emptied quickly enough.




Long-shot theory(1):  Could this be caused by CPUs going into too deep
sleep states? And NIC not sending wake-up signal early enough, 

Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-12 Thread Jesper Dangaard Brouer
On Mon, 09 Nov 2020 15:33:46 +0100
"Thomas Rosenstein"  wrote:

> On 9 Nov 2020, at 12:40, Jesper Dangaard Brouer wrote:
> 
> > On Mon, 09 Nov 2020 11:09:33 +0100
> > "Thomas Rosenstein"  wrote:
> >  
> >> On 9 Nov 2020, at 9:24, Jesper Dangaard Brouer wrote:
> >>  
> >>> On Sat, 07 Nov 2020 14:00:04 +0100
> >>> Thomas Rosenstein via Bloat  wrote:
> >>>  
>  Here's an extract from the ethtool https://pastebin.com/cabpWGFz 
>  just
>  in
>  case there's something hidden.  
> >>>
> >>> Yes, there is something hiding in the data from ethtool_stats.pl[1]:
> >>> (10G Mellanox Connect-X cards via 10G SPF+ DAC)
> >>>
> >>>  stat:1 (  1) <= outbound_pci_stalled_wr_events /sec
> >>>  stat:339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec
> >>>
> >>> I've not seen this counter 'rx_buffer_passed_thres_phy' before, 
> >>> looking
> >>> in the kernel driver code it is related to "rx_buffer_almost_full".
> >>> The numbers per second is excessive (but it be related to a driver 
> >>> bug
> >>> as it ends up reading "high" -> rx_buffer_almost_full_high in the
> >>> extended counters).  
> 
> I have now tested with a new kernel 5.9.4 build made from 3.10 with make 
> oldconfig and I noticed an interesting effect.
> 
> The first ca. 2 minutes the router behaves completely normal as with 
> 3.10, after that the ping times go crazy.
> 
> I have recorded this with ethtool, and also the ping times.
> 
> Ethtool: (13 MB)
> https://drive.google.com/file/d/1Ojp64UUw0zKwrgF_CisZb3BCdidAJYZo/view?usp=sharing
> 
> The transfer first was doing around 50 - 70 MB/s then once the ping 
> times go worse it dropped to ~12 MB/s.
> ca. Line 74324 the transfer speed drops to 12 MB/s

Let's add them here so we can see and talk about them:

 74324 Ethtool(eth4 ) stat: 144584002 (144,584,002) <= rx_buffer_passed_thres_phy /sec
 74325 Ethtool(eth4 ) stat:  17243186 ( 17,243,186) <= rx_bytes /sec
 74326 Ethtool(eth4 ) stat:  19032079 ( 19,032,079) <= rx_bytes_phy /sec

Significant difference between RX-bytes Phy-level and OS-level.

17243186*8/10^6 = 138 Mbit/s

 74327 Ethtool(eth4 ) stat:       321 (        321) <= rx_cache_busy /sec
 74328 Ethtool(eth4 ) stat:       321 (        321) <= rx_cache_full /sec
 74329 Ethtool(eth4 ) stat:      7323 (      7,323) <= rx_cache_reuse /sec
 74330 Ethtool(eth4 ) stat:     15188 (     15,188) <= rx_csum_complete /sec
 74331 Ethtool(eth4 ) stat:         1 (          1) <= rx_csum_none /sec
 74332 Ethtool(eth4 ) stat:        96 (         96) <= rx_csum_unnecessary /sec
 74333 Ethtool(eth4 ) stat:      1242 (      1,242) <= rx_discards_phy /sec

Packets are getting rx_discards

 74334 Ethtool(eth4 ) stat:     15285 (     15,285) <= rx_packets /sec
 74335 Ethtool(eth4 ) stat:     16655 (     16,655) <= rx_packets_phy /sec

(16655-15285 = 1370)
Again difference between RX-packets Phy-level and OS-level

 74336 Ethtool(eth4 ) stat:  19132497 ( 19,132,497) <= rx_prio0_bytes /sec
 74337 Ethtool(eth4 ) stat:      1264 (      1,264) <= rx_prio0_discards /sec
 74338 Ethtool(eth4 ) stat:     15474 (     15,474) <= rx_prio0_packets /sec



> Seems you are right about the rx_buffer_passed_thres_phy if you check 
> just those lines they appear more often once the speed dropped.
> Not sure if that's the cause or an effect of the underlying problem!

Thanks for the link to Mellanox doc:
 https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters

The counter: 'rx_buffer_passed_thres_phy'
 Description: 
  'The number of events where the port receive buffer was over 85% full.
   Supported from kernel 4.14'

The counter: 'tx_pause_ctrl_phy'
 Description:
 'The number of link layer pause packets transmitted on a physical
  port. If this counter is increasing, it implies that the NIC is
  congested and cannot absorb the traffic coming from the network.'


So, the NIC is clearly telling us that something is wrong. You are not
near link speed (12 MB/s => 96 Mbit/s), and still the port receive buffer
is over 85% full most of the time.  There are indications that the NIC
is congested and cannot absorb the traffic coming from the network.

Long-shot theory(1):  Could this be caused by CPUs going into too deep
sleep states? And NIC not sending wake-up signal early enough, e.g.
because it holds on to packets for generating a bulk at IRQ raise.

Long-shot theory(2): The NIC driver IRQ interrupt coalesce
adaptive-IRQ (adjusted via ethtool -C) is wrong. The reason it is wrong is
because of TSO frames, as the packet-to-byte ratio isn't taken into
account.

You can test theory(2) in two ways, via disabling TSO and via adjusting
ethtool --show-coalesce|--coalesce (-c|-C).  The --coalesce change can
change part of theory(1), the NIC wake-up/IRQ interval.  For CPU sleep
modes see software package 'tuned'.
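(A sketch of the 'tuned' route, assuming the package is available on
these boxes; the network-latency profile caps C-states among other
tunings:)

 yum install tuned
 tuned-adm profile network-latency
 tuned-adm active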


> Pings:
> https://drive.google.com/file/d/16phOxM5IFU

Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-09 Thread Thomas Rosenstein via Bloat



On 9 Nov 2020, at 12:40, Jesper Dangaard Brouer wrote:


On Mon, 09 Nov 2020 11:09:33 +0100
"Thomas Rosenstein"  wrote:

Could you also provide ethtool_stats for the TX interface?

Notice that the tool[1] ethtool_stats.pl supports monitoring several
interfaces at the same time, e.g. run:

 ethtool_stats.pl --sec 3 --dev eth4 --dev ethTX

And provide output as pastebin.


I have now also repeated the same test with 3.10; here are the ethtool
outputs:

https://drive.google.com/file/d/1c98MVV0JYl6Su6xZTpqwS7m-6OlbmAFp/view?usp=sharing

and the ping times:

https://drive.google.com/file/d/1xhbGJHb5jUbPsee4frbx-c-uqh-7orXY/view?usp=sharing

Sadly the parameters we were looking at are not supported below 4.14,
but I immediately saw one thing that is very different:

ethtool --statistics eth4 | grep discards
 rx_discards_phy: 0
 tx_discards_phy: 0

if we check the ethtool output from 5.9.4 we have:

 rx_discards_phy: 151793

Also, the outbound_pci_stalled_wr_events get more frequent the lower the
total bandwidth / the higher the ping is.
Logically there must be something blocking the buffers: either they are
not getting freed, or not rotated correctly, or processing is too slow.
I would exclude the processing, simply based on the 0% CPU load, and also
because it doesn't happen on 3.10.
It is also suspicious that the issue only appears after a certain time of
activity (maybe total traffic?!).
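(One way to keep an eye on the discard and PCI-stall counters while a
transfer runs is a simple watch loop; a sketch, assuming eth4 as the
receive interface:)

 watch -n 1 'ethtool -S eth4 | grep -E "discard|pause|pci_stalled"'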



___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-09 Thread Thomas Rosenstein via Bloat



On 9 Nov 2020, at 12:40, Jesper Dangaard Brouer wrote:


On Mon, 09 Nov 2020 11:09:33 +0100
"Thomas Rosenstein"  wrote:


On 9 Nov 2020, at 9:24, Jesper Dangaard Brouer wrote:


On Sat, 07 Nov 2020 14:00:04 +0100
Thomas Rosenstein via Bloat  wrote:

Here's an extract from the ethtool https://pastebin.com/cabpWGFz just in
case there's something hidden.


Yes, there is something hiding in the data from ethtool_stats.pl[1]:
(10G Mellanox Connect-X cards via 10G SFP+ DAC)

 stat:1 (  1) <= outbound_pci_stalled_wr_events /sec
 stat:339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec

I've not seen this counter 'rx_buffer_passed_thres_phy' before; looking
in the kernel driver code it is related to "rx_buffer_almost_full".
The numbers per second are excessive (but it may be related to a driver
bug, as it ends up reading "high" -> rx_buffer_almost_full_high in the
extended counters).


I have now tested with a new kernel 5.9.4 build made from 3.10 with make 
oldconfig and I noticed an interesting effect.


The first ca. 2 minutes the router behaves completely normal as with 
3.10, after that the ping times go crazy.


I have recorded this with ethtool, and also the ping times.

Ethtool: (13 MB)
https://drive.google.com/file/d/1Ojp64UUw0zKwrgF_CisZb3BCdidAJYZo/view?usp=sharing

The transfer first was doing around 50 - 70 MB/s, then once the ping
times got worse it dropped to ~12 MB/s.

Around line 74324 the transfer speed drops to 12 MB/s.

Seems you are right about rx_buffer_passed_thres_phy: if you check just
those lines, they appear more often once the speed dropped.

Not sure if that's the cause or an effect of the underlying problem!

Pings:
https://drive.google.com/file/d/16phOxM5IFU6RAl4Ua4pRqMNuLYBc4RK7/view?usp=sharing

Pause frames were activated again after the restart.

(Here a link for reference for the ethtool variables: 
https://community.mellanox.com/s/article/understanding-mlx5-ethtool-counters)

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-09 Thread Thomas Rosenstein via Bloat



On 9 Nov 2020, at 12:40, Jesper Dangaard Brouer wrote:


On Mon, 09 Nov 2020 11:09:33 +0100
"Thomas Rosenstein"  wrote:


On 9 Nov 2020, at 9:24, Jesper Dangaard Brouer wrote:


On Sat, 07 Nov 2020 14:00:04 +0100
Thomas Rosenstein via Bloat  wrote:

Here's an extract from the ethtool https://pastebin.com/cabpWGFz just in
case there's something hidden.


Yes, there is something hiding in the data from ethtool_stats.pl[1]:
(10G Mellanox Connect-X cards via 10G SFP+ DAC)

 stat:1 (  1) <= outbound_pci_stalled_wr_events /sec
 stat:339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec

I've not seen this counter 'rx_buffer_passed_thres_phy' before; looking
in the kernel driver code it is related to "rx_buffer_almost_full".
The numbers per second are excessive (but it may be related to a driver
bug, as it ends up reading "high" -> rx_buffer_almost_full_high in the
extended counters).


Notice this indication is a strong red-flag that something is wrong.


Okay, but as this is a router you also need to transmit this
(asymmetric) traffic out another interface, right?


The asymmetric traffic comes back on another router: this is router-02,
and traffic from the internet comes back on router-01.

I also added the interface names.

See the updated diagram:
https://drive.google.com/file/d/15oAsxiNfsbjB9a855Q_dh6YvFZBDdY5I/view?usp=sharing




Could you also provide ethtool_stats for the TX interface?

Notice that the tool[1] ethtool_stats.pl supports monitoring several
interfaces at the same time, e.g. run:

 ethtool_stats.pl --sec 3 --dev eth4 --dev ethTX

And provide output as pastebin.


I have disabled pause control, like Toke said via:

ethtool -A eth4 autoneg off rx off tx off
ethtool -A eth5 autoneg off rx off tx off
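(To double-check that the pause settings actually took effect, a quick
verification would be:)

 ethtool -a eth4
 ethtool -a eth5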

Afterwards an ethtool output, first "without" traffic for a few seconds, 
then with the problematic flow.


Since the output is > 512KB I had to upload it on gdrive:

https://drive.google.com/file/d/1EVKt1LseaBuD40QE-SqFvqYSeWUEcGA_/view?usp=sharing





[1]
https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl

Strange size distribution:
 stat: 19922 ( 19,922) <= rx_1519_to_2047_bytes_phy /sec
 stat:14 ( 14) <= rx_65_to_127_bytes_phy /sec


I assume it's because of the VLAN tagging, and therefore 1522 bytes per
packet with an MTU of 1500?







--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-09 Thread Toke Høiland-Jørgensen via Bloat
>> Also, it's TX, and we are only doing RX, as I said already somewhere, 
>> it's async routing, so the TX data comes via another router back.
>
> Okay, but as this is a router you also need to transmit this
> (asymmetric) traffic out another interface right.
>
> Could you also provide ethtool_stats for the TX interface?
>
> Notice that the tool[1] ethtool_stats.pl support monitoring several
> interfaces at the same time, e.g. run:
>
>  ethtool_stats.pl --sec 3 --dev eth4 --dev ethTX
>
> And provide output as pastebin.

Also, from your drawing, everything goes through the same switch, right?
And since pause frames are involved... Maybe it's the switch being
overwhelmed?

In general, I find that pause frames give more problems than they solve,
so I always turn them off...

-Toke
___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-09 Thread Jesper Dangaard Brouer
On Mon, 09 Nov 2020 11:09:33 +0100
"Thomas Rosenstein"  wrote:

> On 9 Nov 2020, at 9:24, Jesper Dangaard Brouer wrote:
> 
> > On Sat, 07 Nov 2020 14:00:04 +0100
> > Thomas Rosenstein via Bloat  wrote:
> >  
> >> Here's an extract from the ethtool https://pastebin.com/cabpWGFz just 
> >> in
> >> case there's something hidden.  
> >
> > Yes, there is something hiding in the data from ethtool_stats.pl[1]:
> > (10G Mellanox Connect-X cards via 10G SPF+ DAC)
> >
> >  stat:1 (  1) <= outbound_pci_stalled_wr_events /sec
> >  stat:339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec
> >
> > I've not seen this counter 'rx_buffer_passed_thres_phy' before, looking
> > in the kernel driver code it is related to "rx_buffer_almost_full".
> > The numbers per second is excessive (but it be related to a driver bug
> > as it ends up reading "high" -> rx_buffer_almost_full_high in the
> > extended counters).

Notice this indication is a strong red-flag that something is wrong.

> >  stat: 29583661 ( 29,583,661) <= rx_bytes /sec
> >  stat: 30343677 ( 30,343,677) <= rx_bytes_phy /sec
> >
> > You are receiving with 236 Mbit/s in 10Gbit/s link.  There is a
> > difference between what the OS sees (rx_bytes) and what the NIC
> > hardware sees (rx_bytes_phy) (diff approx 6Mbit/s).
> >
> >  stat:19552 ( 19,552) <= rx_packets /sec
> >  stat:19950 ( 19,950) <= rx_packets_phy /sec  
> 
> Could these packets be from VLAN interfaces that are not used in the OS?
> 
> >
> > Above RX packet counters also indicated HW is seeing more packets that
> > OS is receiving.
> >
> > Next counters is likely your problem:
> >
> >  stat:  718 (718) <= tx_global_pause /sec
> >  stat:   954035 (954,035) <= tx_global_pause_duration /sec
> >  stat:  714 (714) <= tx_pause_ctrl_phy /sec  
> 
> As far as I can see that's only the TX, and we are only doing RX on this 
> interface - so maybe that's irrelevant?
> 
> >
> > It looks like you have enabled Ethernet Flow-Control, and something is
> > causing pause frames to be generated.  It seem strange that this 
> > happen on a 10Gbit/s link with only 236 Mbit/s.
> >
> > The TX byte counters are also very strange:
> >
> >  stat:26063 ( 26,063) <= tx_bytes /sec
> >  stat:71950 ( 71,950) <= tx_bytes_phy /sec  
> 
> Also, it's TX, and we are only doing RX, as I said already somewhere, 
> it's async routing, so the TX data comes via another router back.

Okay, but as this is a router you also need to transmit this
(asymmetric) traffic out another interface, right?

Could you also provide ethtool_stats for the TX interface?

Notice that the tool[1] ethtool_stats.pl supports monitoring several
interfaces at the same time, e.g. run:

 ethtool_stats.pl --sec 3 --dev eth4 --dev ethTX

And provide output as pastebin.


> > [1] 
> > https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> >
> > Strange size distribution:
> >  stat: 19922 ( 19,922) <= rx_1519_to_2047_bytes_phy /sec
> >  stat:14 ( 14) <= rx_65_to_127_bytes_phy /sec  
> 

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-09 Thread Thomas Rosenstein via Bloat



On 9 Nov 2020, at 9:24, Jesper Dangaard Brouer wrote:


On Sat, 07 Nov 2020 14:00:04 +0100
Thomas Rosenstein via Bloat  wrote:

Here's an extract from the ethtool https://pastebin.com/cabpWGFz just in
case there's something hidden.


Yes, there is something hiding in the data from ethtool_stats.pl[1]:
(10G Mellanox Connect-X cards via 10G SFP+ DAC)

 stat:1 (  1) <= outbound_pci_stalled_wr_events /sec
 stat:339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec

I've not seen this counter 'rx_buffer_passed_thres_phy' before; looking
in the kernel driver code it is related to "rx_buffer_almost_full".
The numbers per second are excessive (but it may be related to a driver
bug, as it ends up reading "high" -> rx_buffer_almost_full_high in the
extended counters).

 stat: 29583661 ( 29,583,661) <= rx_bytes /sec
 stat: 30343677 ( 30,343,677) <= rx_bytes_phy /sec

You are receiving 236 Mbit/s on a 10 Gbit/s link.  There is a
difference between what the OS sees (rx_bytes) and what the NIC
hardware sees (rx_bytes_phy) (diff approx 6 Mbit/s).

 stat:19552 ( 19,552) <= rx_packets /sec
 stat:19950 ( 19,950) <= rx_packets_phy /sec


Could these packets be from VLAN interfaces that are not used in the OS?



The above RX packet counters also indicate the HW is seeing more packets
than the OS is receiving.

The next counters are likely your problem:

 stat:  718 (718) <= tx_global_pause /sec
 stat:   954035 (954,035) <= tx_global_pause_duration /sec
 stat:  714 (714) <= tx_pause_ctrl_phy /sec


As far as I can see that's only the TX, and we are only doing RX on this 
interface - so maybe that's irrelevant?




It looks like you have enabled Ethernet flow control, and something is
causing pause frames to be generated.  It seems strange that this happens
on a 10 Gbit/s link with only 236 Mbit/s.

The TX byte counters are also very strange:

 stat:26063 ( 26,063) <= tx_bytes /sec
 stat:71950 ( 71,950) <= tx_bytes_phy /sec


Also, it's TX, and we are only doing RX; as I said already somewhere,
it's asymmetric routing, so the TX data comes back via another router.




--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] 
https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl


Strange size distribution:
 stat: 19922 ( 19,922) <= rx_1519_to_2047_bytes_phy /sec
 stat:14 ( 14) <= rx_65_to_127_bytes_phy /sec

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-09 Thread Jesper Dangaard Brouer
On Sat, 07 Nov 2020 14:00:04 +0100
Thomas Rosenstein via Bloat  wrote:
 
> Here's an extract from the ethtool https://pastebin.com/cabpWGFz just in 
> case there's something hidden.

Yes, there is something hiding in the data from ethtool_stats.pl[1]:
(10G Mellanox Connect-X cards via 10G SFP+ DAC)

 stat:1 (  1) <= outbound_pci_stalled_wr_events /sec
 stat:339731557 (339,731,557) <= rx_buffer_passed_thres_phy /sec

I've not seen this counter 'rx_buffer_passed_thres_phy' before; looking
in the kernel driver code it is related to "rx_buffer_almost_full".
The numbers per second are excessive (but it may be related to a driver
bug, as it ends up reading "high" -> rx_buffer_almost_full_high in the
extended counters).

 stat: 29583661 ( 29,583,661) <= rx_bytes /sec
 stat: 30343677 ( 30,343,677) <= rx_bytes_phy /sec

You are receiving 236 Mbit/s on a 10 Gbit/s link.  There is a
difference between what the OS sees (rx_bytes) and what the NIC
hardware sees (rx_bytes_phy) (diff approx 6 Mbit/s).

 stat:19552 ( 19,552) <= rx_packets /sec
 stat:19950 ( 19,950) <= rx_packets_phy /sec

The above RX packet counters also indicate the HW is seeing more packets
than the OS is receiving.

The next counters are likely your problem:

 stat:  718 (718) <= tx_global_pause /sec
 stat:   954035 (954,035) <= tx_global_pause_duration /sec
 stat:  714 (714) <= tx_pause_ctrl_phy /sec

It looks like you have enabled Ethernet flow control, and something is
causing pause frames to be generated.  It seems strange that this happens
on a 10 Gbit/s link with only 236 Mbit/s.

The TX byte counters are also very strange:

 stat:26063 ( 26,063) <= tx_bytes /sec
 stat:71950 ( 71,950) <= tx_bytes_phy /sec

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] 
https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl

Strange size distribution:
 stat: 19922 ( 19,922) <= rx_1519_to_2047_bytes_phy /sec
 stat:14 ( 14) <= rx_65_to_127_bytes_phy /sec

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-07 Thread Sebastian Moeller
Hi Thomas,

Probably not very helpful, but iputils' ping has the -D option that is 
described as:
  -D
   Print timestamp (unix time + microseconds as in gettimeofday) before 
each line.

this potentially allows the timestamps to be translated into kernel time
to pinpoint events? Not that I would know how to translate the times
here, or whether perf and friends actually store timestamps.
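(A small sketch of what that looks like; the epoch value is purely
illustrative, and converting it back to wall-clock time with GNU date is
an assumption about the tools on the box:)

 ping -D -i 0.2 x.x.x.x
 # [1604754000.123456] 64 bytes from x.x.x.x: icmp_seq=1 ttl=64 time=0.170 ms
 date -d @1604754000.123456 '+%F %T.%N'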

Best Regards
Sebastian


> On Nov 7, 2020, at 18:01, Thomas Rosenstein via Bloat 
>  wrote:
> 
> 
> 
> On 7 Nov 2020, at 17:46, Jesper Dangaard Brouer wrote:
> 
>> 
>> Did the latency issue happen during this this perf recording?
>> (it is obviously important that you record during the issue)
> 
> yes it was recorded during the transfer, while it exhibited the issue
> 
> I let the transfer run some time, then did the perf extract, stopped the 
> transfer afterwards.
> 
>> 
>> Finding a single latency event/issue in perf.data is like looking in
>> for needle in a haystack.  It does give hints and e.g. tell me details
>> like what part of the kernel are used/activated.
>> 
>> I can see that you have likely configured XPS (Transmit Packet
>> Steering). Is this on purpose?
> 
> Not on purpose, I assume it comes from the Mellanox Card.
> 
>> 
>> -- 
>> Best regards,
>>  Jesper Dangaard Brouer
>>  MSc.CS, Principal Kernel Engineer at Red Hat
>>  LinkedIn: http://www.linkedin.com/in/brouer
> ___
> Bloat mailing list
> Bloat@lists.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-07 Thread Thomas Rosenstein via Bloat



On 7 Nov 2020, at 17:46, Jesper Dangaard Brouer wrote:



Did the latency issue happen during this perf recording?
(it is obviously important that you record during the issue)


Yes, it was recorded during the transfer, while it exhibited the issue.

I let the transfer run for some time, then did the perf extract, and
stopped the transfer afterwards.




Finding a single latency event/issue in perf.data is like looking for a
needle in a haystack.  It does give hints and e.g. tells me details
like which parts of the kernel are used/activated.

I can see that you have likely configured XPS (Transmit Packet
Steering). Is this on purpose?


Not on purpose, I assume it comes from the Mellanox Card.
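(To see what XPS is currently set to, one can dump the per-queue CPU
masks from sysfs; a sketch, assuming eth4 is the interface in question:)

 grep . /sys/class/net/eth4/queues/tx-*/xps_cpus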



--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-07 Thread Jesper Dangaard Brouer
On Sat, 07 Nov 2020 13:37:01 +0100
"Thomas Rosenstein"  wrote:

> On 6 Nov 2020, at 21:19, Jesper Dangaard Brouer wrote:
> 
> > On Fri, 06 Nov 2020 18:04:49 +0100
> > "Thomas Rosenstein"  wrote:
> >  
> >> On 6 Nov 2020, at 15:13, Jesper Dangaard Brouer wrote:
> >>
> >>
[...]
> >  
> >>> Have you tried to use 'perf record' to observe that is happening on
> >>> the system while these latency incidents happen?  (let me know if 
> >>> you want some cmdline hints)  
> >>
> >> Haven't tried this yet. If you have some hints what events to monitor
> >> I'll take them!  
> >
> > Okay to record everything (-a) on the system and save call-graph (-g),
> > and run for 5 seconds (via profiling the sleep function).
> >
> >  # perf record -g -a  sleep 5
> >
> > To view the result the simply use the 'perf report', but likely you
> > want to use option --no-children as you are profiling the kernel (and
> > not a userspace program you want to have grouped 'children' by).  I
> > also include the CPU column via '--sort cpu,comm,dso,symbol' and you
> > can select/zoom-in-on a specific CPU via '-C zero-indexed-cpu-num'.
> >
> >  # perf report --sort cpu,comm,dso,symbol --no-children
> >
> > When we ask you to provide the output, you can use the --stdio option,
> > and provide txt-info via a pastebin link as it is very long.  
> 
> Here is the output from kernel 3.10_1127 (I updated to the really newest 
> in that branch):  https://pastebin.com/5mxirXPw
> Here is the output from kernel 5.9.4: https://pastebin.com/KDZ2Ei2F

Did the latency issue happen during this perf recording?
(it is obviously important that you record during the issue)

Finding a single latency event/issue in perf.data is like looking for a
needle in a haystack.  It does give hints and e.g. tells me details
like which parts of the kernel are used/activated.

I can see that you have likely configured XPS (Transmit Packet
Steering). Is this on purpose?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-07 Thread Thomas Rosenstein via Bloat


On 7 Nov 2020, at 13:37, Thomas Rosenstein wrote:


I have also tried to reproduce the issue with the kernel on a virtual
Hyper-V machine; there I don't have any adverse effects.
But it's not 100% the same, since MASQ happens on it .. I will
restructure a bit to get a similar representation.




I have now reproduced something similar. For the VM it only happens at
120 threads, but every 5 seconds or so I have one packet that's delayed
heavily:


PING 10.10.0.69 (10.10.0.69) 56(84) bytes of data.
64 bytes from 10.10.0.69: icmp_seq=1 ttl=63 time=0.273 ms
64 bytes from 10.10.0.69: icmp_seq=2 ttl=63 time=0.260 ms
64 bytes from 10.10.0.69: icmp_seq=3 ttl=63 time=0.621 ms
64 bytes from 10.10.0.69: icmp_seq=4 ttl=63 time=0.218 ms
64 bytes from 10.10.0.69: icmp_seq=5 ttl=63 time=52.7 ms
64 bytes from 10.10.0.69: icmp_seq=6 ttl=63 time=0.261 ms
64 bytes from 10.10.0.69: icmp_seq=7 ttl=63 time=0.381 ms
64 bytes from 10.10.0.69: icmp_seq=8 ttl=63 time=0.230 ms
64 bytes from 10.10.0.69: icmp_seq=9 ttl=63 time=0.378 ms
64 bytes from 10.10.0.69: icmp_seq=10 ttl=63 time=0.268 ms
64 bytes from 10.10.0.69: icmp_seq=11 ttl=63 time=0.313 ms
64 bytes from 10.10.0.69: icmp_seq=12 ttl=63 time=0.395 ms
64 bytes from 10.10.0.69: icmp_seq=13 ttl=63 time=0.269 ms
64 bytes from 10.10.0.69: icmp_seq=14 ttl=63 time=44.1 ms
64 bytes from 10.10.0.69: icmp_seq=15 ttl=63 time=0.279 ms
64 bytes from 10.10.0.69: icmp_seq=16 ttl=63 time=0.406 ms
64 bytes from 10.10.0.69: icmp_seq=17 ttl=63 time=0.245 ms
64 bytes from 10.10.0.69: icmp_seq=18 ttl=63 time=0.202 ms
64 bytes from 10.10.0.69: icmp_seq=19 ttl=63 time=0.229 ms
64 bytes from 10.10.0.69: icmp_seq=20 ttl=63 time=0.357 ms
64 bytes from 10.10.0.69: icmp_seq=21 ttl=63 time=0.462 ms
64 bytes from 10.10.0.69: icmp_seq=22 ttl=63 time=0.300 ms
64 bytes from 10.10.0.69: icmp_seq=23 ttl=63 time=60.3 ms
64 bytes from 10.10.0.69: icmp_seq=24 ttl=63 time=0.200 ms
64 bytes from 10.10.0.69: icmp_seq=25 ttl=63 time=15.8 ms
64 bytes from 10.10.0.69: icmp_seq=26 ttl=63 time=0.296 ms
64 bytes from 10.10.0.69: icmp_seq=27 ttl=63 time=0.273 ms
64 bytes from 10.10.0.69: icmp_seq=28 ttl=63 time=0.591 ms
64 bytes from 10.10.0.69: icmp_seq=29 ttl=63 time=0.583 ms

during that time the interface is loaded with 40 Mbits.

this does not appear with 3.10 kernel:

PING 10.10.0.69 (10.10.0.69) 56(84) bytes of data.
64 bytes from 10.10.0.69: icmp_seq=1 ttl=63 time=0.322 ms
64 bytes from 10.10.0.69: icmp_seq=2 ttl=63 time=0.242 ms
64 bytes from 10.10.0.69: icmp_seq=3 ttl=63 time=0.461 ms
64 bytes from 10.10.0.69: icmp_seq=4 ttl=63 time=0.244 ms
64 bytes from 10.10.0.69: icmp_seq=5 ttl=63 time=0.429 ms
64 bytes from 10.10.0.69: icmp_seq=6 ttl=63 time=0.268 ms
64 bytes from 10.10.0.69: icmp_seq=7 ttl=63 time=0.540 ms
64 bytes from 10.10.0.69: icmp_seq=8 ttl=63 time=0.465 ms
64 bytes from 10.10.0.69: icmp_seq=9 ttl=63 time=0.237 ms
64 bytes from 10.10.0.69: icmp_seq=10 ttl=63 time=0.216 ms
64 bytes from 10.10.0.69: icmp_seq=11 ttl=63 time=0.248 ms
64 bytes from 10.10.0.69: icmp_seq=12 ttl=63 time=0.247 ms
64 bytes from 10.10.0.69: icmp_seq=13 ttl=63 time=0.282 ms
64 bytes from 10.10.0.69: icmp_seq=14 ttl=63 time=0.256 ms
64 bytes from 10.10.0.69: icmp_seq=15 ttl=63 time=0.268 ms
64 bytes from 10.10.0.69: icmp_seq=16 ttl=63 time=0.272 ms
64 bytes from 10.10.0.69: icmp_seq=17 ttl=63 time=0.236 ms
64 bytes from 10.10.0.69: icmp_seq=18 ttl=63 time=0.287 ms
64 bytes from 10.10.0.69: icmp_seq=19 ttl=63 time=0.288 ms
64 bytes from 10.10.0.69: icmp_seq=20 ttl=63 time=0.297 ms
64 bytes from 10.10.0.69: icmp_seq=21 ttl=63 time=0.319 ms
64 bytes from 10.10.0.69: icmp_seq=22 ttl=63 time=0.294 ms
64 bytes from 10.10.0.69: icmp_seq=23 ttl=63 time=0.313 ms
64 bytes from 10.10.0.69: icmp_seq=24 ttl=63 time=0.208 ms
64 bytes from 10.10.0.69: icmp_seq=25 ttl=63 time=0.222 ms
64 bytes from 10.10.0.69: icmp_seq=26 ttl=63 time=0.279 ms
64 bytes from 10.10.0.69: icmp_seq=27 ttl=63 time=0.257 ms
64 bytes from 10.10.0.69: icmp_seq=28 ttl=63 time=0.266 ms
64 bytes from 10.10.0.69: icmp_seq=29 ttl=63 time=0.398 ms
64 bytes from 10.10.0.69: icmp_seq=30 ttl=63 time=0.281 ms
64 bytes from 10.10.0.69: icmp_seq=31 ttl=63 time=0.263 ms
64 bytes from 10.10.0.69: icmp_seq=32 ttl=63 time=0.276 ms

might be a hint that it's actually reproducible.
___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-07 Thread Thomas Rosenstein via Bloat



On 7 Nov 2020, at 13:40, Jan Ceuleers wrote:


On 07/11/2020 13:37, Thomas Rosenstein via Bloat wrote:

Nonetheless, I tested it and no difference :)


The relevant test would be to run your experiment while IPv6 is
completely disabled in the kernel.


Tested with IPv6 fully off (via ipv6.disable=1), no difference!

Here's an extract from the ethtool https://pastebin.com/cabpWGFz just in 
case there's something hidden.

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-07 Thread Thomas Rosenstein via Bloat
On 7 Nov 2020, at 13:40, Jan Ceuleers wrote:

> On 07/11/2020 13:37, Thomas Rosenstein via Bloat wrote:
>> Nonetheless, I tested it and no difference :)
>
> The relevant test would be to run your experiment while IPv6 is
> completely disabled in the kernel.

sure, I'll try!
___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-07 Thread Jan Ceuleers
On 07/11/2020 13:37, Thomas Rosenstein via Bloat wrote:
> Nonetheless, I tested it and no difference :)

The relevant test would be to run your experiment while IPv6 is
completely disabled in the kernel.
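
(For reference, one way to do that on a GRUB-based CentOS box, assuming the
usual /etc/default/grub layout, is to boot with ipv6.disable=1:)

 # vi /etc/default/grub            # append ipv6.disable=1 to GRUB_CMDLINE_LINUX
 # grub2-mkconfig -o /boot/grub2/grub.cfg
 # reboot
 # cat /proc/cmdline               # verify ipv6.disable=1 is present
 # ip -6 addr show                 # should list no inet6 addresses when IPv6 is off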
___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-07 Thread Thomas Rosenstein via Bloat



On 6 Nov 2020, at 21:19, Jesper Dangaard Brouer wrote:


On Fri, 06 Nov 2020 18:04:49 +0100
"Thomas Rosenstein"  wrote:


On 6 Nov 2020, at 15:13, Jesper Dangaard Brouer wrote:


I'm using ping on IPv4, but I'll try to see if IPv6 makes any
difference!


I think you misunderstand me.  I'm not asking you to use ping6. The
gobgpd daemon updates will update both IPv4 and IPv6 routes, right?
Updating IPv6 routes is more problematic than IPv4 routes.  The IPv6
route table updates can potentially stall softirq from running, which
is what the latency tool was measuring... and it did show some outliers.


yes I did, I assumed the latency would be introduced in the traffic path 
by the lock.

Nonetheless, I tested it and no difference :)





Have you tried to use 'perf record' to observe what is happening on
the system while these latency incidents happen?  (let me know if you
want some cmdline hints)


Haven't tried this yet. If you have some hints what events to monitor
I'll take them!


Okay, to record everything (-a) on the system, save the call-graph (-g),
and run for 5 seconds (via profiling the sleep function):

 # perf record -g -a  sleep 5

To view the result, simply use 'perf report', but you likely want the
option --no-children as you are profiling the kernel (and not a
userspace program whose 'children' you want grouped).  I also include
the CPU column via '--sort cpu,comm,dso,symbol', and you can
select/zoom in on a specific CPU via '-C zero-indexed-cpu-num'.

 # perf report --sort cpu,comm,dso,symbol --no-children

When we ask you to provide the output, you can use the --stdio option
and provide the text via a pastebin link, as it is very long.


Here is the output from kernel 3.10_1127 (I updated to the really newest 
in that branch):  https://pastebin.com/5mxirXPw

Here is the output from kernel 5.9.4: https://pastebin.com/KDZ2Ei2F

I have noticed that the delays are directly related to the traffic 
flows, see below.


These tests are WITHOUT gobgpd running, so no updates to the route 
table, but the route tables are fully populated.
Also, it's ONLY outgoing traffic, the return packets are coming in on 
another router.


I have then cleared the routing tables, and the issue persists, table 
has only 78 entries.


40 threads -> sometimes higher rtt times: https://pastebin.com/Y9nd0h4h
60 threads -> always high rtt times: https://pastebin.com/JFvhtLrH

So it definitely gets worse the more connections there are.

I have also tried to reproduce the issue with the kernel on a virtual 
Hyper-V machine; there I don't see any adverse effects.
But it's not 100% the same, since MASQ happens on it ... I will restructure 
a bit to get a similar setup.


I also suspected that -j NOTRACK might be an issue; removed that 
too, no change. (It's asymmetric routing anyway.)
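
(For context, the NOTRACK rules in question are typically set in the raw 
table, something along these lines:)

 # iptables -t raw -A PREROUTING -j NOTRACK
 # iptables -t raw -A OUTPUT -j NOTRACK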


Additionally I have quit all applications except for sshd, no change!





--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-06 Thread Jesper Dangaard Brouer
On Fri, 06 Nov 2020 18:04:49 +0100
"Thomas Rosenstein"  wrote:

> On 6 Nov 2020, at 15:13, Jesper Dangaard Brouer wrote:
> 
> > On Fri, 6 Nov 2020 13:53:58 +0100
> > Jesper Dangaard Brouer  wrote:
> >  
> >> [...]  
> 
>  Could this be related to netlink? I have gobgpd running on these
>  routers, which injects routes via netlink.
>  But the churn rate during the tests is very minimal, maybe 30 - 40
>  routes every second.  
> >>
> >> Yes, this could be related.  The internal data-structure for FIB
> >> lookups is a fibtrie which is a compressed patricia tree, related to
> >> radix tree idea.  Thus, I can imagine that the kernel has to
> >> rebuild/rebalance the tree with all these updates.  
> >
> > Reading the kernel code. The IPv4 fib_trie code is very well tuned,
> > fully RCU-ified, meaning read-side is lock-free.  The resize() 
> > function code in net/ipv4/fib_trie.c has a max_work limiter to avoid it
> > using too much time.  And the update looks lock-free.
> >
> > The IPv6 update looks more scary, as it seems to take a "bh" spinlock
> > that can block softirq from running code in net/ipv6/ip6_fib.c
> > (spin_lock_bh(&f6i->fib6_table->tb6_lock).  
> 
> I'm using ping on IPv4, but I'll try to see if IPv6 makes any 
> difference!

I think you misunderstand me.  I'm not asking you to use ping6. The
gobgpd daemon updates will update both IPv4 and IPv6 routes, right?
Updating IPv6 routes is more problematic than IPv4 routes.  The IPv6
route table updates can potentially stall softirq from running, which
is what the latency tool was measuring... and it did show some outliers.


> > Have you tried to use 'perf record' to observe what is happening on 
> > the system while these latency incidents happen?  (let me know if you 
> > want some cmdline hints)  
> 
> Haven't tried this yet. If you have some hints what events to monitor 
> I'll take them!

Okay, to record everything (-a) on the system, save the call-graph (-g),
and run for 5 seconds (via profiling the sleep function):

 # perf record -g -a  sleep 5

To view the result, simply use 'perf report', but you likely want the
option --no-children as you are profiling the kernel (and not a
userspace program whose 'children' you want grouped).  I also include
the CPU column via '--sort cpu,comm,dso,symbol', and you can
select/zoom in on a specific CPU via '-C zero-indexed-cpu-num'.

 # perf report --sort cpu,comm,dso,symbol --no-children

When we ask you to provide the output, you can use the --stdio option
and provide the text via a pastebin link, as it is very long.
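
Putting the two steps together, a capture during one of the incidents could
look roughly like this (the output path is just an example):

 # perf record -g -a  sleep 5
 # perf report --sort cpu,comm,dso,symbol --no-children --stdio > /tmp/perf-report.txt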

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-06 Thread Thomas Rosenstein via Bloat



On 6 Nov 2020, at 15:13, Jesper Dangaard Brouer wrote:


On Fri, 6 Nov 2020 13:53:58 +0100
Jesper Dangaard Brouer  wrote:


[...]


Could this be related to netlink? I have gobgpd running on these
routers, which injects routes via netlink.
But the churn rate during the tests is very minimal, maybe 30 - 40
routes every second.


Yes, this could be related.  The internal data-structure for FIB
lookups is a fibtrie which is a compressed patricia tree, related to
radix tree idea.  Thus, I can imagine that the kernel has to
rebuild/rebalance the tree with all these updates.


Reading the kernel code. The IPv4 fib_trie code is very well tuned,
fully RCU-ified, meaning read-side is lock-free.  The resize() function
code in net/ipv4/fib_trie.c has a max_work limiter to avoid it using too
much time.  And the update looks lock-free.

The IPv6 update looks more scary, as it seems to take a "bh" spinlock
that can block softirq from running code in net/ipv6/ip6_fib.c
(spin_lock_bh(&f6i->fib6_table->tb6_lock)).


I'm using ping on IPv4, but I'll try to see if IPv6 makes any 
difference!




Have you tried to use 'perf record' to observe what is happening on 
the system while these latency incidents happen?  (let me know if you 
want some cmdline hints)


Haven't tried this yet. If you have some hints what events to monitor 
I'll take them!




--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-06 Thread Jesper Dangaard Brouer
On Fri, 6 Nov 2020 13:53:58 +0100
Jesper Dangaard Brouer  wrote:

> [...]
> > >
> > > Could this be related to netlink? I have gobgpd running on these 
> > > routers, which injects routes via netlink.
> > > But the churn rate during the tests is very minimal, maybe 30 - 40 
> > > routes every second.  
> 
> Yes, this could be related.  The internal data-structure for FIB
> lookups is a fibtrie which is a compressed patricia tree, related to
> radix tree idea.  Thus, I can imagine that the kernel has to
> rebuild/rebalance the tree with all these updates.

Reading the kernel code. The IPv4 fib_trie code is very well tuned,
fully RCU-ified, meaning read-side is lock-free.  The resize() function
code in net/ipv4/fib_trie.c has a max_work limiter to avoid it using too
much time.  And the update looks lock-free.

The IPv6 update looks more scary, as it seems to take a "bh" spinlock
that can block softirq from running code in net/ipv6/ip6_fib.c
(spin_lock_bh(&f6i->fib6_table->tb6_lock)).

Have you tried to use 'perf record' to observe what is happening on the system 
while these latency incidents happen?  (let me know if you want some cmdline 
hints)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-06 Thread Jesper Dangaard Brouer
On Fri, 06 Nov 2020 12:45:31 +0100
Toke Høiland-Jørgensen  wrote:

> "Thomas Rosenstein"  writes:
> 
> > On 6 Nov 2020, at 12:18, Jesper Dangaard Brouer wrote:
> >  
> >> On Fri, 06 Nov 2020 10:18:10 +0100
> >> "Thomas Rosenstein"  wrote:
> >>  
> > I just tested 5.9.4 seems to also fix it partly, I have long
> > stretches where it looks good, and then some increases again. (3.10
> > Stock has them too, but not so high, rather 1-3 ms)
> >  
> >>
> >> That you have long stretches where latency looks good is interesting
> >> information.   My theory is that your system has a periodic userspace
> >> process that does a kernel syscall that takes too long, blocking the
> >> network card from processing packets. (Note it can also be a kernel
> >> thread).  
> >
[...]
> >
> > Could this be related to netlink? I have gobgpd running on these 
> > routers, which injects routes via netlink.
> > But the churn rate during the tests is very minimal, maybe 30 - 40 
> > routes every second.

Yes, this could be related.  The internal data-structure for FIB
lookups is a fibtrie, which is a compressed patricia tree, related to the
radix tree idea.  Thus, I can imagine that the kernel has to
rebuild/rebalance the tree with all these updates.

> >
> > Otherwise we got: salt-minion, collectd, node_exporter, sshd  
> 
> collectd may be polling the interface stats; try turning that off?

It should be fairly easy for you to test the theory of whether any of these
services (except sshd) is causing this, by turning them off
individually.
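
A rough elimination loop, assuming systemd unit names matching the process
names you listed, could be:

 for svc in salt-minion collectd node_exporter; do
     systemctl stop "$svc"
     ping -c 60 10.10.0.69      # check whether the latency spikes disappear
     systemctl start "$svc"
 done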


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-06 Thread Thomas Rosenstein via Bloat



On 6 Nov 2020, at 12:45, Toke Høiland-Jørgensen wrote:


"Thomas Rosenstein"  writes:


On 6 Nov 2020, at 12:18, Jesper Dangaard Brouer wrote:


On Fri, 06 Nov 2020 10:18:10 +0100
"Thomas Rosenstein"  wrote:


I just tested 5.9.4 seems to also fix it partly, I have long
stretches where it looks good, and then some increases again. 
(3.10

Stock has them too, but not so high, rather 1-3 ms)



That you have long stretches where latency looks good is interesting
information.   My theory is that your system has a periodic userspace
process that does a kernel syscall that takes too long, blocking the
network card from processing packets. (Note it can also be a kernel
thread).


The weird part is, I first only updated router-02 and pinged to
router-04 (out of the traffic flow); there I noticed these long stretches of
ok ping.

When I also updated router-03 and router-04, the old behaviour kind of
was back, which confused me.

Could this be related to netlink? I have gobgpd running on these
routers, which injects routes via netlink.
But the churn rate during the tests is very minimal, maybe 30 - 40
routes every second.

Otherwise we got: salt-minion, collectd, node_exporter, sshd


collectd may be polling the interface stats; try turning that off?


I can, but shouldn't that also influence iperf3 performance then?





Another theory is that the NIC HW does strange things, but it is not very
likely.  E.g. delaying the packets before generating the IRQ interrupt,
which hides it from my IRQ-to-softirq latency tool.

A question: What traffic control qdisc are you using on your system?


kernel 4+ uses pfifo, but there are no dropped packets
I have also tested with fq_codel: same behaviour, and also no weirdness
in the packet queue itself

kernel 3.10 uses mq, and for the vlan interfaces noqueue


Do you mean that you only have a single pfifo qdisc on kernel 4+? Why 
is

it not using mq?


oh, actually, I just noticed that's a remnant of the previous tests, I 
had


net.core.default_qdisc = fq_codel

in the sysctl.conf... so disregard my previous wrong info
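
(For reference, the effective default can be checked or reset at runtime via
sysctl; note it only applies to qdiscs created afterwards:)

 # sysctl net.core.default_qdisc
 # sysctl -w net.core.default_qdisc=pfifo_fast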



so all kernels by default look like that, mq + pfifo_fast:

qdisc noqueue 0: dev lo root refcnt 2
qdisc mq 0: dev eth0 root
qdisc pfifo_fast 0: dev eth0 parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :5 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :6 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :7 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth0 parent :8 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1

qdisc mq 0: dev eth1 root
qdisc pfifo_fast 0: dev eth1 parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :5 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :6 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :7 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth1 parent :8 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1

qdisc mq 0: dev eth2 root
qdisc pfifo_fast 0: dev eth2 parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :5 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :6 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :7 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth2 parent :8 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1

qdisc mq 0: dev eth3 root
qdisc pfifo_fast 0: dev eth3 parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 parent :5 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 parent :6 bands 3 priomap  1 2 2 2 1 2 0 0 
1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev eth3 parent :7 band

Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-06 Thread Toke Høiland-Jørgensen via Bloat
"Thomas Rosenstein"  writes:

> On 6 Nov 2020, at 12:18, Jesper Dangaard Brouer wrote:
>
>> On Fri, 06 Nov 2020 10:18:10 +0100
>> "Thomas Rosenstein"  wrote:
>>
> I just tested 5.9.4 seems to also fix it partly, I have long
> stretches where it looks good, and then some increases again. (3.10
> Stock has them too, but not so high, rather 1-3 ms)
>
>>
>> That you have long stretches where latency looks good is interesting
>> information.   My theory is that your system has a periodic userspace
>> process that does a kernel syscall that takes too long, blocking the
>> network card from processing packets. (Note it can also be a kernel
>> thread).
>
> The weird part is, I first only updated router-02 and pinged to 
> router-04 (out of traffic flow), there I noticed these long stretches of 
> ok ping.
>
> When I updated also router-03 and router-04, the old behaviour kind of 
> was back, this confused me.
>
> Could this be related to netlink? I have gobgpd running on these 
> routers, which injects routes via netlink.
> But the churn rate during the tests is very minimal, maybe 30 - 40 
> routes every second.
>
> Otherwise we got: salt-minion, collectd, node_exporter, sshd

collectd may be polling the interface stats; try turning that off?

>>
>> Another theory is the NIC HW does strange things, but it is not very
>> likely.  E.g. delaying the packets before generating the IRQ 
>> interrupt,
>> which hides it from my IRQ-to-softirq latency tool.
>>
>> A question: What traffic control qdisc are you using on your system?
>
> kernel 4+ uses pfifo, but there's no dropped packets
> I have also tested with fq_codel, same behaviour and also no weirdness 
> in the packets queue itself
>
> kernel 3.10 uses mq, and for the vlan interfaces noqueue

Do you mean that you only have a single pfifo qdisc on kernel 4+? Why is
it not using mq?

Was there anything in the ethtool stats?
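
(e.g. a quick way to watch for counters that move during a test run,
assuming the relevant NIC is eth0:)

 # watch -d -n 1 "ethtool -S eth0 | grep -iE 'err|drop|disc|pause'"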

-Toke
___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-06 Thread Thomas Rosenstein via Bloat



On 6 Nov 2020, at 12:18, Jesper Dangaard Brouer wrote:


On Fri, 06 Nov 2020 10:18:10 +0100
"Thomas Rosenstein"  wrote:


I just tested 5.9.4 seems to also fix it partly, I have long
stretches where it looks good, and then some increases again. (3.10
Stock has them too, but not so high, rather 1-3 ms)



That you have long stretches where latency looks good is interesting
information.   My theory is that your system has a periodic userspace
process that does a kernel syscall that takes too long, blocking the
network card from processing packets. (Note it can also be a kernel
thread).


The weird part is, I first only updated router-02 and pinged to 
router-04 (out of traffic flow), there I noticed these long stretches of 
ok ping.


When I also updated router-03 and router-04, the old behaviour kind of 
was back, which confused me.


Could this be related to netlink? I have gobgpd running on these 
routers, which injects routes via netlink.
But the churn rate during the tests is very minimal, maybe 30 - 40 
routes every second.


Otherwise we got: salt-minion, collectd, node_exporter, sshd



Another theory is that the NIC HW does strange things, but it is not very
likely.  E.g. delaying the packets before generating the IRQ interrupt,
which hides it from my IRQ-to-softirq latency tool.

A question: What traffic control qdisc are you using on your system?


kernel 4+ uses pfifo, but there are no dropped packets
I have also tested with fq_codel: same behaviour, and also no weirdness 
in the packet queue itself


kernel 3.10 uses mq, and for the vlan interfaces noqueue


Here's the mail archive link for the question on lartc :

https://www.spinics.net/lists/lartc/msg23774.html



Have you looked at the obvious case of whether any of your qdiscs report a
large backlog (during the incidents)?


as said above, nothing in qdiscs or reported





for example:

64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms

and then again:

64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms


These very low ping times tell me that you are measuring very close to
the target machine, which is good.  Here on the bufferbloat list, we are
always suspicious of network equipment being used in these kinds of
setups, as experience tells us that this can be the cause of
bufferbloat latency.
yes, I'm just testing across two machines connected directly to the same 
switch, basically that's the best-case scenario apart from a direct connection.

I do also use a VLAN on this interface, so the pings go through the vlan 
stack!




You mention some fs.com switches (your description below the signature); can
you tell us more?


It's a fs.com N5850-48S6Q

48 Port 10 Gbit + 6 port 40 Gbit

there are only 6 ports with 10 G in use, and 2 with 1 G, basically no 
traffic





[...]
I have a feeling that maybe not all config options were correctly moved
to the newer kernel.

Or there's a big bug somewhere ... (which would seem rather weird for me
to be the first one to discover this)


I really appreciate that you report this.  This is a periodic issue,
which often results in people not reporting it.

Even if we find this to be caused by some process running on your
system, or a bad config, it is really important that we find the
root-cause.


I'll rebuild the 5.9 kernel on one of the 3.10 kernel and see if it
makes a difference ...


--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

On Wed, 04 Nov 2020 16:23:12 +0100
Thomas Rosenstein via Bloat  wrote:


General Info:

Routers are connected between each other with 10G Mellanox Connect-X
cards via 10G SPF+ DAC cables via a 10G Switch from fs.com
Latency generally is around 0.18 ms between all routers (4).
Throughput is 9.4 Gbit/s with 0 retransmissions when tested with 
iperf3.
2 of the 4 routers are connected upstream with a 1G connection 
(separate

port, same network card)
All routers have the full internet routing tables, i.e. 80k entries 
for

IPv6 and 830k entries for IPv4
Conntrack is disabled (-j NOTRACK)
Kernel 5.4.60 (custom)
2x Xeon X5670 @ 2.93 Ghz
96 GB RAM

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-06 Thread Jesper Dangaard Brouer
On Fri, 06 Nov 2020 10:18:10 +0100
"Thomas Rosenstein"  wrote:

> >> I just tested 5.9.4 seems to also fix it partly, I have long
> >> stretches where it looks good, and then some increases again. (3.10
> >> Stock has them too, but not so high, rather 1-3 ms)
> >>

That you have long stretches where latency looks good is interesting
information.   My theory is that your system has a periodic userspace
process that does a kernel syscall that takes too long, blocking the
network card from processing packets. (Note it can also be a kernel
thread).
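
A rough way to hunt for such a process is to log per-process CPU and
context-switch activity at one-second resolution while the spikes occur
(assuming the sysstat package is installed):

 # pidstat -u -w 1 30 > /tmp/pidstat.txt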

Another theory is that the NIC HW does strange things, but it is not very
likely.  E.g. delaying the packets before generating the IRQ interrupt,
which hides it from my IRQ-to-softirq latency tool.

A question: What traffic control qdisc are you using on your system?

Have you looked at the obvious case of whether any of your qdiscs report a
large backlog (during the incidents)?
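
(The per-qdisc backlog and drop counters can be read with tc, e.g. assuming
the 10G interface is eth0:)

 # tc -s qdisc show dev eth0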


> >> for example:
> >>
> >> 64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
> >> 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
> >> 64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
> >> 64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
> >> 64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms
> >>
> >> and then again:
> >>
> >> 64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
> >> 64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
> >> 64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
> >> 64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
> >> 64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms

These very low ping times tell me that you are measuring very close to
the target machine, which is good.  Here on the bufferbloat list, we are
always suspicious of network equipment being used in these kinds of
setups, as experience tells us that this can be the cause of
bufferbloat latency.

You mention some fs.com switches (your description below the signature); can
you tell us more?


[...]
> I have a feeling that maybe not all config options were correctly moved 
> to the newer kernel.
>
> Or there's a big bug somewhere ... (which would seem rather weird for me 
> to be the first one to discover this)

I really appreciate that you report this.  This is a periodic issue,
which often results in people not reporting it.

Even if we find this to be caused by some process running on your
system, or a bad config, it is really important that we find the
root-cause.

> I'll rebuild the 5.9 kernel on one of the 3.10 kernel and see if it 
> makes a difference ...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

On Wed, 04 Nov 2020 16:23:12 +0100
Thomas Rosenstein via Bloat  wrote:

> General Info:
> 
> Routers are connected between each other with 10G Mellanox Connect-X 
> cards via 10G SPF+ DAC cables via a 10G Switch from fs.com
> Latency generally is around 0.18 ms between all routers (4).
> Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3.
> 2 of the 4 routers are connected upstream with a 1G connection (separate 
> port, same network card)
> All routers have the full internet routing tables, i.e. 80k entries for 
> IPv6 and 830k entries for IPv4
> Conntrack is disabled (-j NOTRACK)
> Kernel 5.4.60 (custom)
> 2x Xeon X5670 @ 2.93 Ghz
> 96 GB RAM


___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-06 Thread Jesper Dangaard Brouer
On Fri, 06 Nov 2020 09:48:46 +0100
"Thomas Rosenstein"  wrote:

> On 5 Nov 2020, at 14:33, Jesper Dangaard Brouer wrote:
> 
> > On Thu, 05 Nov 2020 13:22:10 +0100
> > Thomas Rosenstein via Bloat  wrote:
> >  
> >> On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote:
> >>  
> >>> "Thomas Rosenstein"  writes:
> >>>  
> > If so, this sounds more like a driver issue, or maybe something to
> > do with scheduling. Does it only happen with ICMP? You could try 
> > this
> > tool for a userspace UDP measurement:  
> 
>  It happens with all packets, therefore the transfer to backblaze 
>  with
>  40 threads goes down to ~8MB/s instead of >60MB/s  
> >>>
> >>> Huh, right, definitely sounds like a kernel bug; or maybe the new
> >>> kernel is getting the hardware into a state where it bugs out when
> >>> there are lots of flows or something.
> >>>
> >>> You could try looking at the ethtool stats (ethtool -S) while
> >>> running the test and see if any error counters go up. Here's a
> >>> handy script to monitor changes in the counters:
> >>>
> >>> https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> >>>  
>  I'll try what that reports!
>   
> > Also, what happens if you ping a host on the internet (*through*
> > the router instead of *to* it)?  
> 
>  Same issue, but twice pronounced, as it seems all interfaces are
>  affected.
>  So, ping on one interface and the second has the issue.
>  Also all traffic across the host has the issue, but on both sides,
>  so ping to the internet increased by 2x  
> >>>
> >>> Right, so even an unloaded interface suffers? But this is the same
> >>> NIC, right? So it could still be a hardware issue...
> >>>  
>  Yep default that CentOS ships, I just tested 4.12.5 there the
>  issue also does not happen. So I guess I can bisect it
>  then...(really don't want to 😃)  
> >>>
> >>> Well that at least narrows it down :)  
> >>
> >> I just tested 5.9.4 seems to also fix it partly, I have long
> >> stretches where it looks good, and then some increases again. (3.10
> >> Stock has them too, but not so high, rather 1-3 ms)
> >>
> >> for example:
> >>
> >> 64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
> >> 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
> >> 64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
> >> 64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
> >> 64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms
> >>
> >> and then again:
> >>
> >> 64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
> >> 64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
> >> 64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
> >> 64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
> >> 64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
> >> 64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
> >> 64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
> >> 64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
> >> 64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
> >> 64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
> >> 64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
> >> 64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
> >> 64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
> >> 64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
> >> 64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
> >> 64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
> >> 64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
> >> 64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
> >> 64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
> >> 64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms
> >>
> >>
> >> For me it looks now that there was some fix between 5.4.60 and 5.9.4
> >> ... anyone can pinpoint it?  
> 
> So, new day, same issue!
> 
> I upgraded now all routers to 5.9.4, and the issue is back ...
> 
> here, when I stop it, it goes immediately down to 0.xx ms
> 
> 64 bytes from x.x.x.x: icmp_seq=48 ttl=64 time=1.67 ms
> 64 bytes from x.x.x.x: icmp_seq=49 ttl=64 time=12.6 ms
> 64 bytes from x.x.x.x: icmp_seq=50 ttl=64 time=13.8 ms
> 64 bytes from x.x.x.x: icmp_seq=51 ttl=64 time=5.59 ms
> 64 bytes from x.x.x.x: icmp_seq=52 ttl=64 time=5.86 ms
> 64 bytes from x.x.x.x: icmp_seq=53 ttl=64 time=9.26 ms
> 64 bytes from x.x.x.x: icmp_seq=54 ttl=64 time=8.28 ms
> 64 bytes from x.x.x.x: icmp_seq=55 ttl=64 time=12.4 ms
> 64 bytes from x.x.x.x: icmp_seq=56 ttl=64 time=0.551 ms
> 64 bytes from x.x.x.x: icmp_seq=57 ttl=64 time=4.37 ms
> 64 bytes from x.x.x.x: icmp_seq=58 ttl=64 time=12.1 ms
> 64 bytes from x.x.x.x: icmp_seq=59 ttl=64 time=5.93 ms
> 64 bytes from x.x.x.x: icmp_seq=60 ttl=64 time=6.58 ms
> 64 bytes from x.x.x.x: icmp_seq=61 ttl=64 time=9.19 ms
> 64 bytes from x.x.x.x: icmp_seq=62 ttl=64 time=0.124 ms
> 64 bytes from x.x.x.x: icmp_seq=63 ttl=64 time=7.08 ms
> 64 bytes from x.x.x.x: icmp_seq=64

Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-06 Thread Thomas Rosenstein via Bloat



On 5 Nov 2020, at 14:33, Jesper Dangaard Brouer wrote:


On Thu, 05 Nov 2020 13:22:10 +0100
Thomas Rosenstein via Bloat  wrote:


On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote:


"Thomas Rosenstein"  writes:


If so, this sounds more like a driver issue, or maybe something to
do with scheduling. Does it only happen with ICMP? You could try 
this

tool for a userspace UDP measurement:


It happens with all packets, therefore the transfer to backblaze with
40 threads goes down to ~8MB/s instead of >60MB/s


Huh, right, definitely sounds like a kernel bug; or maybe the new
kernel is getting the hardware into a state where it bugs out when
there are lots of flows or something.

You could try looking at the ethtool stats (ethtool -S) while
running the test and see if any error counters go up. Here's a
handy script to monitor changes in the counters:

https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl


I'll try what that reports!


Also, what happens if you ping a host on the internet (*through*
the router instead of *to* it)?


Same issue, but twice pronounced, as it seems all interfaces are
affected.
So, ping on one interface and the second has the issue.
Also all traffic across the host has the issue, but on both sides,
so ping to the internet increased by 2x


Right, so even an unloaded interface suffers? But this is the same
NIC, right? So it could still be a hardware issue...


Yep default that CentOS ships, I just tested 4.12.5 there the
issue also does not happen. So I guess I can bisect it
then...(really don't want to 😃)


Well that at least narrows it down :)


I just tested 5.9.4 seems to also fix it partly, I have long
stretches where it looks good, and then some increases again. (3.10
Stock has them too, but not so high, rather 1-3 ms)

for example:

64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms

and then again:

64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms


For me it looks now that there was some fix between 5.4.60 and 5.9.4
... anyone can pinpoint it?


I have now retried with 3.10, here's the results:

I'm pinging between the hosts with the traffic flow, upload is running 
55 MB/s (compared to ~20 MB/s on 5.9.4, and 8MB/s on 5.4.60):


64 bytes from x.x.x.x: icmp_seq=159 ttl=64 time=0.050 ms
64 bytes from x.x.x.x: icmp_seq=160 ttl=64 time=0.056 ms
64 bytes from x.x.x.x: icmp_seq=161 ttl=64 time=0.061 ms
64 bytes from x.x.x.x: icmp_seq=162 ttl=64 time=0.072 ms
64 bytes from x.x.x.x: icmp_seq=163 ttl=64 time=0.052 ms
64 bytes from x.x.x.x: icmp_seq=164 ttl=64 time=0.053 ms
64 bytes from x.x.x.x: icmp_seq=165 ttl=64 time=0.068 ms
64 bytes from x.x.x.x: icmp_seq=166 ttl=64 time=0.050 ms
64 bytes from x.x.x.x: icmp_seq=167 ttl=64 time=0.057 ms
64 bytes from x.x.x.x: icmp_seq=168 ttl=64 time=0.051 ms
64 bytes from x.x.x.x: icmp_seq=169 ttl=64 time=0.045 ms
64 bytes from x.x.x.x: icmp_seq=170 ttl=64 time=0.138 ms
64 bytes from x.x.x.x: icmp_seq=171 ttl=64 time=0.052 ms
64 bytes from x.x.x.x: icmp_seq=172 ttl=64 time=0.049 ms
64 bytes from x.x.x.x: icmp_seq=173 ttl=64 time=0.094 ms
64 bytes from x.x.x.x: icmp_seq=174 ttl=64 time=0.050 ms
64 bytes from x.x.x.x: icmp_seq=175 ttl=64 time=0.810 ms
64 bytes from x.x.x.x: icmp_seq=176 ttl=64 time=0.077 ms
64 bytes from x.x.x.x: icmp_seq=177 ttl=64 time=0.055 ms
64 bytes from x.x.x.x: icmp_seq=178 ttl=64 time=0.049 ms
64 bytes from x.x.x.x: icmp_seq=179 ttl=64 time=0.050 ms
64 bytes from x.x.x.x: icmp_seq=180 ttl=64 time=0.073 ms
64 bytes from x.x.x.x: icmp_seq=181 ttl=64 time=0.065 ms
64 bytes from x.x.x.x: icmp_seq=182 ttl=64 time=0.123 ms
64 bytes from x.x.x.x: icmp_seq=183 ttl=64 time=0.045 ms
64 bytes from x.x.x.x: icmp_seq=184 ttl

Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-06 Thread Thomas Rosenstein via Bloat



On 5 Nov 2020, at 14:33, Jesper Dangaard Brouer wrote:


On Thu, 05 Nov 2020 13:22:10 +0100
Thomas Rosenstein via Bloat  wrote:


On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote:


"Thomas Rosenstein"  writes:


If so, this sounds more like a driver issue, or maybe something to
do with scheduling. Does it only happen with ICMP? You could try 
this

tool for a userspace UDP measurement:


It happens with all packets, therefore the transfer to backblaze with
40 threads goes down to ~8MB/s instead of >60MB/s


Huh, right, definitely sounds like a kernel bug; or maybe the new
kernel is getting the hardware into a state where it bugs out when
there are lots of flows or something.

You could try looking at the ethtool stats (ethtool -S) while
running the test and see if any error counters go up. Here's a
handy script to monitor changes in the counters:

https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl


I'll try what that reports!


Also, what happens if you ping a host on the internet (*through*
the router instead of *to* it)?


Same issue, but twice pronounced, as it seems all interfaces are
affected.
So, ping on one interface and the second has the issue.
Also all traffic across the host has the issue, but on both sides,
so ping to the internet increased by 2x


Right, so even an unloaded interface suffers? But this is the same
NIC, right? So it could still be a hardware issue...


Yep default that CentOS ships, I just tested 4.12.5 there the
issue also does not happen. So I guess I can bisect it
then...(really don't want to 😃)


Well that at least narrows it down :)


I just tested 5.9.4 seems to also fix it partly, I have long
stretches where it looks good, and then some increases again. (3.10
Stock has them too, but not so high, rather 1-3 ms)

for example:

64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms

and then again:

64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms


For me it looks now that there was some fix between 5.4.60 and 5.9.4
... anyone can pinpoint it?


So, new day, same issue!

I upgraded now all routers to 5.9.4, and the issue is back ...

here, when I stop it, it goes immediately down to 0.xx ms

64 bytes from x.x.x.x: icmp_seq=48 ttl=64 time=1.67 ms
64 bytes from x.x.x.x: icmp_seq=49 ttl=64 time=12.6 ms
64 bytes from x.x.x.x: icmp_seq=50 ttl=64 time=13.8 ms
64 bytes from x.x.x.x: icmp_seq=51 ttl=64 time=5.59 ms
64 bytes from x.x.x.x: icmp_seq=52 ttl=64 time=5.86 ms
64 bytes from x.x.x.x: icmp_seq=53 ttl=64 time=9.26 ms
64 bytes from x.x.x.x: icmp_seq=54 ttl=64 time=8.28 ms
64 bytes from x.x.x.x: icmp_seq=55 ttl=64 time=12.4 ms
64 bytes from x.x.x.x: icmp_seq=56 ttl=64 time=0.551 ms
64 bytes from x.x.x.x: icmp_seq=57 ttl=64 time=4.37 ms
64 bytes from x.x.x.x: icmp_seq=58 ttl=64 time=12.1 ms
64 bytes from x.x.x.x: icmp_seq=59 ttl=64 time=5.93 ms
64 bytes from x.x.x.x: icmp_seq=60 ttl=64 time=6.58 ms
64 bytes from x.x.x.x: icmp_seq=61 ttl=64 time=9.19 ms
64 bytes from x.x.x.x: icmp_seq=62 ttl=64 time=0.124 ms
64 bytes from x.x.x.x: icmp_seq=63 ttl=64 time=7.08 ms
64 bytes from x.x.x.x: icmp_seq=64 ttl=64 time=9.69 ms
64 bytes from x.x.x.x: icmp_seq=65 ttl=64 time=7.52 ms
64 bytes from x.x.x.x: icmp_seq=66 ttl=64 time=14.9 ms
64 bytes from x.x.x.x: icmp_seq=67 ttl=64 time=12.6 ms
64 bytes from x.x.x.x: icmp_seq=68 ttl=64 time=2.34 ms
64 bytes from x.x.x.x: icmp_seq=69 ttl=64 time=8.97 ms
64 bytes from x.x.x.x: icmp_seq=70 ttl=64 time=0.203 ms
64 bytes from x.x.x.x: icmp_seq=71 ttl=64 time=9.10 ms
64 bytes from x.x.x.x: icmp_seq=72 ttl=64 time=3.16 ms
64 bytes from x.x.x.x: icmp_seq=73 ttl=64 time=1.88 ms
64 bytes from x.x.x.x: icmp_seq=74 ttl=64 time=11.5 ms
64 bytes from x

Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-05 Thread Jesper Dangaard Brouer
On Thu, 05 Nov 2020 13:22:10 +0100
Thomas Rosenstein via Bloat  wrote:

> On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote:
> 
> > "Thomas Rosenstein"  writes:
> >  
> >>> If so, this sounds more like a driver issue, or maybe something to 
> >>> do with scheduling. Does it only happen with ICMP? You could try this
> >>> tool for a userspace UDP measurement:  
> >>
> >> It happens with all packets, therefore the transfer to backblaze with 
> >> 40 threads goes down to ~8MB/s instead of >60MB/s  
> >
> > Huh, right, definitely sounds like a kernel bug; or maybe the new 
> > kernel is getting the hardware into a state where it bugs out when
> > there are lots of flows or something.
> >
> > You could try looking at the ethtool stats (ethtool -S) while
> > running the test and see if any error counters go up. Here's a
> > handy script to monitor changes in the counters:
> >
> > https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
> >  
> >> I'll try what that reports!
> >>  
> >>> Also, what happens if you ping a host on the internet (*through*
> >>> the router instead of *to* it)?  
> >>
> >> Same issue, but twice pronounced, as it seems all interfaces are
> >> affected.
> >> So, ping on one interface and the second has the issue.
> >> Also all traffic across the host has the issue, but on both sides,
> >> so ping to the internet increased by 2x  
> >
> > Right, so even an unloaded interface suffers? But this is the same 
> > NIC, right? So it could still be a hardware issue...
> >  
> >> Yep default that CentOS ships, I just tested 4.12.5 there the
> >> issue also does not happen. So I guess I can bisect it
> >> then...(really don't want to 😃)  
> >
> > Well that at least narrows it down :)  
> 
> I just tested 5.9.4 seems to also fix it partly, I have long
> stretches where it looks good, and then some increases again. (3.10
> Stock has them too, but not so high, rather 1-3 ms)
> 
> for example:
> 
> 64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
> 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
> 64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
> 64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
> 64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms
> 
> and then again:
> 
> 64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
> 64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
> 64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
> 64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
> 64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
> 64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
> 64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
> 64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
> 64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
> 64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
> 64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
> 64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
> 64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
> 64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
> 64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
> 64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
> 64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
> 64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
> 64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
> 64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms
> 
> 
> For me it looks now that there was some fix between 5.4.60 and 5.9.4
> ... anyone can pinpoint it?

I have some bpftrace tools to measure these kinds of latency spikes here:
 [1] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/

The tool you want is: softirq_net_latency.bt
 [2] 
https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt

Example output see[3]:
 [3] https://bugzilla.redhat.com/show_bug.cgi?id=1795049#c8

Based on the kernel versions, I don't expect this to be the same latency
issue as described in the bugzilla[3] case (as IIRC it was fixed in
4.19).  It can still be a similar issue, where some userspace process is
reading information from the kernel (/sys/fs/cgroup/memory/memory.stat
in the BZ case) that blocks softirq from running, and results in these
latency spikes.
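
As a rough usage sketch (assuming bpftrace is installed and the script from
[2] has been downloaded; check the script header for any threshold arguments
it expects), running the tool during an incident is just:

 # bpftrace ./softirq_net_latency.bt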



Install guide to bpftrace[4]:
 [4] 
https://github.com/xdp-project/xdp-project/blob/master/areas/mem/bpftrace/INSTALL.org
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-05 Thread Toke Høiland-Jørgensen via Bloat
"Thomas Rosenstein"  writes:

> On 5 Nov 2020, at 13:38, Toke Høiland-Jørgensen wrote:
>
>> "Thomas Rosenstein"  writes:
>>
>>> On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote:
>>>
 "Thomas Rosenstein"  writes:

>> If so, this sounds more like a driver issue, or maybe something to
>> do
>> with scheduling. Does it only happen with ICMP? You could try this
>> tool
>> for a userspace UDP measurement:
>
> It happens with all packets, therefore the transfer to backblaze 
> with
> 40
> threads goes down to ~8MB/s instead of >60MB/s

 Huh, right, definitely sounds like a kernel bug; or maybe the new
 kernel
 is getting the hardware into a state where it bugs out when there 
 are
 lots of flows or something.

 You could try looking at the ethtool stats (ethtool -S) while 
 running
 the test and see if any error counters go up. Here's a handy script 
 to
 monitor changes in the counters:

 https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl

> I'll try what that reports!
>
>> Also, what happens if you ping a host on the internet (*through* 
>> the
>> router instead of *to* it)?
>
> Same issue, but twice pronounced, as it seems all interfaces are
> affected.
> So, ping on one interface and the second has the issue.
> Also all traffic across the host has the issue, but on both sides, 
> so
> ping to the internet increased by 2x

 Right, so even an unloaded interface suffers? But this is the same
 NIC,
 right? So it could still be a hardware issue...

> Yep default that CentOS ships, I just tested 4.12.5 there the issue
> also
> does not happen. So I guess I can bisect it then...(really don't 
> want
> to
> 😃)

 Well that at least narrows it down :)
>>>
>>> I just tested 5.9.4 seems to also fix it partly, I have long 
>>> stretches
>>> where it looks good, and then some increases again. (3.10 Stock has 
>>> them
>>> too, but not so high, rather 1-3 ms)
>>>
>>> for example:
>>>
>>> 64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
>>> 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
>>> 64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
>>> 64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
>>> 64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms
>>>
>>> and then again:
>>>
>>> 64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
>>> 64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
>>> 64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
>>> 64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
>>> 64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
>>> 64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
>>> 64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
>>> 64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
>>> 64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
>>> 64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
>>> 64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
>>> 64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
>>> 64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
>>> 64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
>>> 64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
>>> 64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
>>> 64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
>>> 64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
>>> 64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
>>> 64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms
>>>
>>>
>>> For me it looks now that there was some fix between 5.4.60 and 5.9.4 
>>> ...
>>> anyone can pinpoint it?
>>
>> $ git log --no-merges --oneline v5.4.60..v5.9.4|wc -l
>> 72932
>>
>> Only 73k commits; should be easy, right? :)
>>
>> (In other words no, I have no idea; I'd suggest either (a) asking on
>> netdev, (b) bisecting or (c) using 5.9+ and just making peace with not
>> knowing).
>
> Guess I'll go the easy route and let it be ...
>
> I'll update all routers to the 5.9.4 and see if it fixes the traffic 
> flow - will report back once more after that.

Sounds like a plan :)

>>
>> How did you configure the new kernel? Did you start from scratch, 
>> or
>> is
>> it based on the old centos config?
>
> first oldconfig and from there then added additional options for 
> IB,
> NVMe, etc (which I don't really need on the routers)

 OK, so you're probably building with roughly the same options in 
 terms
 of scheduling granularity etc. That's good. Did you enable spectre
 mitigations etc on the new kernel? What's the output of
 `tail /sys/devices/system/cpu/vulnerabilities/*` ?
>>>
>>> mitigations are off
>>
>> Right, I just figured maybe you were hitting some threshold that
>> involved a lot of indirect calls which slowed things down due to
>> miti

Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-05 Thread Thomas Rosenstein via Bloat



On 5 Nov 2020, at 13:38, Toke Høiland-Jørgensen wrote:


"Thomas Rosenstein"  writes:


On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote:


"Thomas Rosenstein"  writes:


If so, this sounds more like a driver issue, or maybe something to
do
with scheduling. Does it only happen with ICMP? You could try this
tool
for a userspace UDP measurement:


It happens with all packets, therefore the transfer to backblaze with
40 threads goes down to ~8MB/s instead of >60MB/s


Huh, right, definitely sounds like a kernel bug; or maybe the new
kernel
is getting the hardware into a state where it bugs out when there 
are

lots of flows or something.

You could try looking at the ethtool stats (ethtool -S) while 
running
the test and see if any error counters go up. Here's a handy script 
to

monitor changes in the counters:

https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl


I'll try what that reports!

Also, what happens if you ping a host on the internet (*through* 
the

router instead of *to* it)?


Same issue, but twice pronounced, as it seems all interfaces are
affected.
So, ping on one interface and the second has the issue.
Also all traffic across the host has the issue, but on both sides, 
so

ping to the internet increased by 2x


Right, so even an unloaded interface suffers? But this is the same
NIC,
right? So it could still be a hardware issue...


Yep default that CentOS ships, I just tested 4.12.5 there the issue
also
does not happen. So I guess I can bisect it then...(really don't 
want

to
😃)


Well that at least narrows it down :)


I just tested 5.9.4 seems to also fix it partly, I have long 
stretches
where it looks good, and then some increases again. (3.10 Stock has 
them

too, but not so high, rather 1-3 ms)

for example:

64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms

and then again:

64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms


For me it looks now that there was some fix between 5.4.60 and 5.9.4 
...

anyone can pinpoint it?


$ git log --no-merges --oneline v5.4.60..v5.9.4|wc -l
72932

Only 73k commits; should be easy, right? :)

(In other words no, I have no idea; I'd suggest either (a) asking on
netdev, (b) bisecting or (c) using 5.9+ and just making peace with not
knowing).


Guess I'll go the easy route and let it be ...

I'll update all routers to 5.9.4 and see if it fixes the traffic 
flow - will report back once more after that.




How did you configure the new kernel? Did you start from scratch, 
or

is
it based on the old centos config?


first oldconfig and from there then added additional options for 
IB,

NVMe, etc (which I don't really need on the routers)


OK, so you're probably building with roughly the same options in 
terms

of scheduling granularity etc. That's good. Did you enable spectre
mitigations etc on the new kernel? What's the output of
`tail /sys/devices/system/cpu/vulnerabilities/*` ?


mitigations are off


Right, I just figured maybe you were hitting some threshold that
involved a lot of indirect calls which slowed things down due to
mitigations. Guess not, then...



Thanks for the support :)


-Toke

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-05 Thread Toke Høiland-Jørgensen via Bloat
"Thomas Rosenstein"  writes:

> On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote:
>
>> "Thomas Rosenstein"  writes:
>>
 If so, this sounds more like a driver issue, or maybe something to do
 with scheduling. Does it only happen with ICMP? You could try this tool
 for a userspace UDP measurement:
>>>
>>> It happens with all packets, therefore the transfer to backblaze with 
>>> 40
>>> threads goes down to ~8MB/s instead of >60MB/s
>>
>> Huh, right, definitely sounds like a kernel bug; or maybe the new 
>> kernel
>> is getting the hardware into a state where it bugs out when there are
>> lots of flows or something.
>>
>> You could try looking at the ethtool stats (ethtool -S) while running
>> the test and see if any error counters go up. Here's a handy script to
>> monitor changes in the counters:
>>
>> https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
>>
>>> I'll try what that reports!
>>>
 Also, what happens if you ping a host on the internet (*through* the
 router instead of *to* it)?
>>>
>>> Same issue, but twice pronounced, as it seems all interfaces are
>>> affected.
>>> So, ping on one interface and the second has the issue.
>>> Also all traffic across the host has the issue, but on both sides, so
>>> ping to the internet increased by 2x
>>
>> Right, so even an unloaded interface suffers? But this is the same 
>> NIC,
>> right? So it could still be a hardware issue...
>>
>>> Yep default that CentOS ships, I just tested 4.12.5 there the issue 
>>> also
>>> does not happen. So I guess I can bisect it then...(really don't want 
>>> to
>>> 😃)
>>
>> Well that at least narrows it down :)
>
> I just tested 5.9.4 seems to also fix it partly, I have long stretches 
> where it looks good, and then some increases again. (3.10 Stock has them 
> too, but not so high, rather 1-3 ms)
>
> for example:
>
> 64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
> 64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
> 64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
> 64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
> 64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms
>
> and then again:
>
> 64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
> 64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
> 64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
> 64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
> 64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
> 64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
> 64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
> 64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
> 64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
> 64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
> 64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
> 64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
> 64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
> 64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
> 64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
> 64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
> 64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
> 64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
> 64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
> 64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms
>
>
> For me it looks now that there was some fix between 5.4.60 and 5.9.4 ... 
> anyone can pinpoint it?

$ git log --no-merges --oneline v5.4.60..v5.9.4|wc -l
72932

Only 73k commits; should be easy, right? :)

(In other words no, I have no idea; I'd suggest either (a) asking on
netdev, (b) bisecting or (c) using 5.9+ and just making peace with not
knowing).
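
For the record, the mechanics of a bisect are simple enough; it's the
dozen-plus rebuild/reboot/test cycles that hurt. A rough sketch, assuming
a vanilla mainline tree (the good/bad tags below are only plausible
starting points, not verified ones):

  git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  cd linux
  git bisect start
  git bisect bad v5.4      # a kernel that shows the latency spikes
  git bisect good v4.12    # a kernel that does not
  # build + boot the commit git checks out, run the b2 upload / ping test,
  # then record the result and let git pick the next commit:
  git bisect good          # or: git bisect bad
  # repeat until git names the first bad commit, then:
  git bisect reset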

 How did you configure the new kernel? Did you start from scratch, or
 is it based on the old centos config?
>>>
>>> first oldconfig and from there then added additional options for IB,
>>> NVMe, etc (which I don't really need on the routers)
>>
>> OK, so you're probably building with roughly the same options in terms
>> of scheduling granularity etc. That's good. Did you enable spectre
>> mitigations etc on the new kernel? What's the output of
>> `tail /sys/devices/system/cpu/vulnerabilities/*` ?
>
> mitigations are off

Right, I just figured maybe you were hitting some threshold that
involved a lot of indirect calls which slowed things down due to
mitigations. Guess not, then...

-Toke
___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-05 Thread Thomas Rosenstein via Bloat



On 5 Nov 2020, at 12:21, Toke Høiland-Jørgensen wrote:


"Thomas Rosenstein"  writes:

If so, this sounds more like a driver issue, or maybe something to do
with scheduling. Does it only happen with ICMP? You could try this tool
for a userspace UDP measurement:


It happens with all packets, therefore the transfer to backblaze with 40
threads goes down to ~8MB/s instead of >60MB/s


Huh, right, definitely sounds like a kernel bug; or maybe the new kernel
is getting the hardware into a state where it bugs out when there are
lots of flows or something.

You could try looking at the ethtool stats (ethtool -S) while running
the test and see if any error counters go up. Here's a handy script to
monitor changes in the counters:

https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl


I'll try what that reports!


Also, what happens if you ping a host on the internet (*through* the
router instead of *to* it)?


Same issue, but twice pronounced, as it seems all interfaces are
affected.
So, ping on one interface and the second has the issue.
Also all traffic across the host has the issue, but on both sides, so
ping to the internet increased by 2x


Right, so even an unloaded interface suffers? But this is the same NIC,
right? So it could still be a hardware issue...

Yep default that CentOS ships, I just tested 4.12.5 there the issue also
does not happen. So I guess I can bisect it then...(really don't want to
😃)


Well that at least narrows it down :)


I just tested 5.9.4 seems to also fix it partly, I have long stretches 
where it looks good, and then some increases again. (3.10 Stock has them 
too, but not so high, rather 1-3 ms)


for example:

64 bytes from x.x.x.x: icmp_seq=10 ttl=64 time=0.169 ms
64 bytes from x.x.x.x: icmp_seq=11 ttl=64 time=5.53 ms
64 bytes from x.x.x.x: icmp_seq=12 ttl=64 time=9.44 ms
64 bytes from x.x.x.x: icmp_seq=13 ttl=64 time=0.167 ms
64 bytes from x.x.x.x: icmp_seq=14 ttl=64 time=3.88 ms

and then again:

64 bytes from x.x.x.x: icmp_seq=15 ttl=64 time=0.569 ms
64 bytes from x.x.x.x: icmp_seq=16 ttl=64 time=0.148 ms
64 bytes from x.x.x.x: icmp_seq=17 ttl=64 time=0.286 ms
64 bytes from x.x.x.x: icmp_seq=18 ttl=64 time=0.257 ms
64 bytes from x.x.x.x: icmp_seq=19 ttl=64 time=0.220 ms
64 bytes from x.x.x.x: icmp_seq=20 ttl=64 time=0.125 ms
64 bytes from x.x.x.x: icmp_seq=21 ttl=64 time=0.188 ms
64 bytes from x.x.x.x: icmp_seq=22 ttl=64 time=0.202 ms
64 bytes from x.x.x.x: icmp_seq=23 ttl=64 time=0.195 ms
64 bytes from x.x.x.x: icmp_seq=24 ttl=64 time=0.177 ms
64 bytes from x.x.x.x: icmp_seq=25 ttl=64 time=0.242 ms
64 bytes from x.x.x.x: icmp_seq=26 ttl=64 time=0.339 ms
64 bytes from x.x.x.x: icmp_seq=27 ttl=64 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=28 ttl=64 time=0.221 ms
64 bytes from x.x.x.x: icmp_seq=29 ttl=64 time=0.317 ms
64 bytes from x.x.x.x: icmp_seq=30 ttl=64 time=0.210 ms
64 bytes from x.x.x.x: icmp_seq=31 ttl=64 time=0.242 ms
64 bytes from x.x.x.x: icmp_seq=32 ttl=64 time=0.127 ms
64 bytes from x.x.x.x: icmp_seq=33 ttl=64 time=0.217 ms
64 bytes from x.x.x.x: icmp_seq=34 ttl=64 time=0.184 ms


For me it looks now that there was some fix between 5.4.60 and 5.9.4 ... 
anyone can pinpoint it?






How did you configure the new kernel? Did you start from scratch, or
is it based on the old centos config?


first oldconfig and from there then added additional options for IB,
NVMe, etc (which I don't really need on the routers)


OK, so you're probably building with roughly the same options in terms
of scheduling granularity etc. That's good. Did you enable spectre
mitigations etc on the new kernel? What's the output of
`tail /sys/devices/system/cpu/vulnerabilities/*` ?


mitigations are off

==> /sys/devices/system/cpu/vulnerabilities/itlb_multihit <==
KVM: Vulnerable

==> /sys/devices/system/cpu/vulnerabilities/l1tf <==
Mitigation: PTE Inversion; VMX: vulnerable

==> /sys/devices/system/cpu/vulnerabilities/mds <==
Vulnerable; SMT vulnerable

==> /sys/devices/system/cpu/vulnerabilities/meltdown <==
Vulnerable

==> /sys/devices/system/cpu/vulnerabilities/spec_store_bypass <==
Vulnerable

==> /sys/devices/system/cpu/vulnerabilities/spectre_v1 <==
Vulnerable: __user pointer sanitization and usercopy barriers only; no 
swapgs barriers


==> /sys/devices/system/cpu/vulnerabilities/spectre_v2 <==
Vulnerable, STIBP: disabled

==> /sys/devices/system/cpu/vulnerabilities/srbds <==
Not affected

==> /sys/devices/system/cpu/vulnerabilities/tsx_async_abort <==
Not affected


Grub Boot options are: crashkernel=896M rd.lvm.lv=cl/root net.ifnames=0 
biosdevname=0 scsi_mod.use_blk_mq=1 dm_mod.use_blk_mq=y mitigations=off 
console=tty0 console=ttyS1,115200




-Toke

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-05 Thread Toke Høiland-Jørgensen via Bloat
"Thomas Rosenstein"  writes:

>> If so, this sounds more like a driver issue, or maybe something to do
>> with scheduling. Does it only happen with ICMP? You could try this 
>> tool
>> for a userspace UDP measurement:
>
> It happens with all packets, therefore the transfer to backblaze with 40 
> threads goes down to ~8MB/s instead of >60MB/s

Huh, right, definitely sounds like a kernel bug; or maybe the new kernel
is getting the hardware into a state where it bugs out when there are
lots of flows or something.

You could try looking at the ethtool stats (ethtool -S) while running
the test and see if any error counters go up. Here's a handy script to
monitor changes in the counters:

https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
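
If fetching the script is a hassle, a bare-bones version of the same idea
is to diff two counter snapshots taken around a test run (eth4 below is a
placeholder for whichever port carries the traffic):

  ethtool -S eth4 | sort > /tmp/ethtool.before
  # ... run the 40-thread b2 upload while pinging ...
  ethtool -S eth4 | sort > /tmp/ethtool.after
  diff /tmp/ethtool.before /tmp/ethtool.after | grep -Ei 'err|drop|discard|pause'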

> I'll try what that reports!
>
>> Also, what happens if you ping a host on the internet (*through* the
>> router instead of *to* it)?
>
> Same issue, but twice pronounced, as it seems all interfaces are 
> affected.
> So, ping on one interface and the second has the issue.
> Also all traffic across the host has the issue, but on both sides, so 
> ping to the internet increased by 2x

Right, so even an unloaded interface suffers? But this is the same NIC,
right? So it could still be a hardware issue...

> Yep default that CentOS ships, I just tested 4.12.5 there the issue also 
> does not happen. So I guess I can bisect it then...(really don't want to 
> 😃)

Well that at least narrows it down :)

>>
>> How did you configure the new kernel? Did you start from scratch, or 
>> is
>> it based on the old centos config?
>
> first oldconfig and from there then added additional options for IB, 
> NVMe, etc (which I don't really need on the routers)

OK, so you're probably building with roughly the same options in terms
of scheduling granularity etc. That's good. Did you enable spectre
mitigations etc on the new kernel? What's the output of
`tail /sys/devices/system/cpu/vulnerabilities/*` ?

-Toke
___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-05 Thread Thomas Rosenstein via Bloat



On 5 Nov 2020, at 1:10, Toke Høiland-Jørgensen wrote:


"Thomas Rosenstein"  writes:


On 4 Nov 2020, at 17:10, Toke Høiland-Jørgensen wrote:


Thomas Rosenstein via Bloat  writes:


Hi all,

I'm coming from the lartc mailing list, here's the original text:

=

I have multiple routers which connect to multiple upstream providers, I
have noticed a high latency shift in icmp (and generally all connection)
if I run b2 upload-file --threads 40 (and I can reproduce this)

What options do I have to analyze why this happens?

General Info:

Routers are connected between each other with 10G Mellanox Connect-X
cards via 10G SPF+ DAC cables via a 10G Switch from fs.com
Latency generally is around 0.18 ms between all routers (4).
Throughput is 9.4 Gbit/s with 0 retransmissions when tested with
iperf3.
2 of the 4 routers are connected upstream with a 1G connection
(separate
port, same network card)
All routers have the full internet routing tables, i.e. 80k entries
for
IPv6 and 830k entries for IPv4
Conntrack is disabled (-j NOTRACK)
Kernel 5.4.60 (custom)
2x Xeon X5670 @ 2.93 Ghz
96 GB RAM
No Swap
CentOs 7

During high latency:

Latency on routers which have the traffic flow increases to 12 - 20 ms,
for all interfaces, moving of the stream (via bgp disable session) moves
also the high latency
iperf3 performance plummets to 300 - 400 MBits
CPU load (user / system) are around 0.1%
Ram Usage is around 3 - 4 GB
if_packets count is stable (around 8000 pkt/s more)


I'm not sure I get your topology. Packets are going from where to where,
and what link is the bottleneck for the transfer you're doing? Are you
measuring the latency along the same path?

Have you tried running 'mtr' to figure out which hop the latency is at?


I tried to draw the topology, I hope this is okay and explains better
what's happening:

https://drive.google.com/file/d/15oAsxiNfsbjB9a855Q_dh6YvFZBDdY5I/view?usp=sharing


Ohh, right, you're pinging between two of the routers across a 10 Gbps
link with plenty of capacity to spare, and *that* goes up by two orders
of magnitude when you start the transfer, even though the transfer
itself is <1Gbps? Am I understanding you correctly now?


Exactly :)



If so, this sounds more like a driver issue, or maybe something to do
with scheduling. Does it only happen with ICMP? You could try this tool
for a userspace UDP measurement:


It happens with all packets, therefore the transfer to backblaze with 40 
threads goes down to ~8MB/s instead of >60MB/s




https://github.com/heistp/irtt/



I'll try what that reports!


Also, what happens if you ping a host on the internet (*through* the
router instead of *to* it)?


Same issue, but twice pronounced, as it seems all interfaces are 
affected.

So, ping on one interface and the second has the issue.
Also all traffic across the host has the issue, but on both sides, so 
ping to the internet increased by 2x




And which version of the Connect-X cards are you using (or rather, which
driver? mlx4?)



It's Connect-X 4 Lx cards, specifically: MCX4121A-ACAT
Driver is mlx5_core

So it must be something in the kernel tacking on a delay, I could try to
do a bisect and build like 10 kernels :)


That may ultimately end up being necessary. However, when you say 'stock
kernel' you mean what CentOS ships, right? If so, that's not really a
3.10 kernel - the RHEL kernels (that centos is based on) are... somewhat
creative... about their versioning. So if you're switched to a vanilla
upstream kernel you may find bisecting difficult :/


Yep default that CentOS ships, I just tested 4.12.5 there the issue also 
does not happen. So I guess I can bisect it then...(really don't want to 
😃)




How did you configure the new kernel? Did you start from scratch, or is
it based on the old centos config?


first oldconfig and from there then added additional options for IB, 
NVMe, etc (which I don't really need on the routers)




-Toke

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-04 Thread Toke Høiland-Jørgensen via Bloat
"Thomas Rosenstein"  writes:

> On 4 Nov 2020, at 17:10, Toke Høiland-Jørgensen wrote:
>
>> Thomas Rosenstein via Bloat  writes:
>>
>>> Hi all,
>>>
>>> I'm coming from the lartc mailing list, here's the original text:
>>>
>>> =
>>>
>>> I have multiple routers which connect to multiple upstream providers, 
>>> I
>>> have noticed a high latency shift in icmp (and generally all 
>>> connection)
>>> if I run b2 upload-file --threads 40 (and I can reproduce this)
>>>
>>> What options do I have to analyze why this happens?
>>>
>>> General Info:
>>>
>>> Routers are connected between each other with 10G Mellanox Connect-X
>>> cards via 10G SPF+ DAC cables via a 10G Switch from fs.com
>>> Latency generally is around 0.18 ms between all routers (4).
>>> Throughput is 9.4 Gbit/s with 0 retransmissions when tested with 
>>> iperf3.
>>> 2 of the 4 routers are connected upstream with a 1G connection 
>>> (separate
>>> port, same network card)
>>> All routers have the full internet routing tables, i.e. 80k entries 
>>> for
>>> IPv6 and 830k entries for IPv4
>>> Conntrack is disabled (-j NOTRACK)
>>> Kernel 5.4.60 (custom)
>>> 2x Xeon X5670 @ 2.93 Ghz
>>> 96 GB RAM
>>> No Swap
>>> CentOs 7
>>>
>>> During high latency:
>>>
>>> Latency on routers which have the traffic flow increases to 12 - 20 
>>> ms,
>>> for all interfaces, moving of the stream (via bgp disable session) 
>>> moves
>>> also the high latency
>>> iperf3 performance plummets to 300 - 400 MBits
>>> CPU load (user / system) are around 0.1%
>>> Ram Usage is around 3 - 4 GB
>>> if_packets count is stable (around 8000 pkt/s more)
>>
>> I'm not sure I get your topology. Packets are going from where to 
>> where,
>> and what link is the bottleneck for the transfer you're doing? Are you
>> measuring the latency along the same path?
>>
>> Have you tried running 'mtr' to figure out which hop the latency is 
>> at?
>
> I tried to draw the topology, I hope this is okay and explains better 
> what's happening:
>
> https://drive.google.com/file/d/15oAsxiNfsbjB9a855Q_dh6YvFZBDdY5I/view?usp=sharing

Ohh, right, you're pinging between two of the routers across a 10 Gbps
link with plenty of capacity to spare, and *that* goes up by two orders
of magnitude when you start the transfer, even though the transfer
itself is <1Gbps? Am I understanding you correctly now?

If so, this sounds more like a driver issue, or maybe something to do
with scheduling. Does it only happen with ICMP? You could try this tool
for a userspace UDP measurement:

https://github.com/heistp/irtt/
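
(Roughly: run `irtt server` on one router and, from the other end, something
like the line below; flags are from memory, so check `irtt client -h`, and
the hostname is a placeholder.)

  # on router A:
  irtt server
  # on router B: 10 ms probe interval for one minute
  irtt client -i 10ms -d 1m routerA.example.net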

Also, what happens if you ping a host on the internet (*through* the
router instead of *to* it)?

And which version of the Connect-X cards are you using (or rather, which
driver? mlx4?)

> So it must be something in the kernel tacking on a delay, I could try to 
> do a bisect and build like 10 kernels :)

That may ultimately end up being necessary. However, when you say 'stock
kernel' you mean what CentOS ships, right? If so, that's not really a
3.10 kernel - the RHEL kernels (that centos is based on) are... somewhat
creative... about their versioning. So if you're switched to a vanilla
upstream kernel you may find bisecting difficult :/

How did you configure the new kernel? Did you start from scratch, or is
it based on the old centos config?

-Toke
___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-04 Thread Thomas Rosenstein via Bloat



On 4 Nov 2020, at 17:10, Toke Høiland-Jørgensen wrote:


Thomas Rosenstein via Bloat  writes:


Hi all,

I'm coming from the lartc mailing list, here's the original text:

=

I have multiple routers which connect to multiple upstream providers, I
have noticed a high latency shift in icmp (and generally all connection)
if I run b2 upload-file --threads 40 (and I can reproduce this)

What options do I have to analyze why this happens?

General Info:

Routers are connected between each other with 10G Mellanox Connect-X
cards via 10G SPF+ DAC cables via a 10G Switch from fs.com
Latency generally is around 0.18 ms between all routers (4).
Throughput is 9.4 Gbit/s with 0 retransmissions when tested with 
iperf3.
2 of the 4 routers are connected upstream with a 1G connection (separate
port, same network card)
All routers have the full internet routing tables, i.e. 80k entries for
IPv6 and 830k entries for IPv4
Conntrack is disabled (-j NOTRACK)
Kernel 5.4.60 (custom)
2x Xeon X5670 @ 2.93 Ghz
96 GB RAM
No Swap
CentOs 7

During high latency:

Latency on routers which have the traffic flow increases to 12 - 20 ms,
for all interfaces, moving of the stream (via bgp disable session) moves
also the high latency
iperf3 performance plummets to 300 - 400 MBits
CPU load (user / system) are around 0.1%
Ram Usage is around 3 - 4 GB
if_packets count is stable (around 8000 pkt/s more)


I'm not sure I get your topology. Packets are going from where to where,
and what link is the bottleneck for the transfer you're doing? Are you
measuring the latency along the same path?

Have you tried running 'mtr' to figure out which hop the latency is at?


I tried to draw the topology, I hope this is okay and explains better 
what's happening:


https://drive.google.com/file/d/15oAsxiNfsbjB9a855Q_dh6YvFZBDdY5I/view?usp=sharing

There is definitely no bottleneck in any of the links, the maximum on any 
link is 16k packets/sec and around 300 Mbit/s.

In the iperf3 tests I can easily get up to 9.4 Gbit/s

So it must be something in the kernel tacking on a delay, I could try to 
do a bisect and build like 10 kernels :)





Here is the tc -s qdisc output:


This indicates ("dropped 0" and "ecn_mark 0") that there's no
backpressure on the qdisc, so something else is going on.

Also, you said the issue goes away if you downgrade the kernel? That
does sound odd...


Yes, indeed. I have only recently upgraded the kernel to the 5.4.60 and 
haven't had the issue before.




-Toke

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-04 Thread Toke Høiland-Jørgensen via Bloat
Thomas Rosenstein via Bloat  writes:

> Hi all,
>
> I'm coming from the lartc mailing list, here's the original text:
>
> =
>
> I have multiple routers which connect to multiple upstream providers, I 
> have noticed a high latency shift in icmp (and generally all connection) 
> if I run b2 upload-file --threads 40 (and I can reproduce this)
>
> What options do I have to analyze why this happens?
>
> General Info:
>
> Routers are connected between each other with 10G Mellanox Connect-X 
> cards via 10G SPF+ DAC cables via a 10G Switch from fs.com
> Latency generally is around 0.18 ms between all routers (4).
> Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3.
> 2 of the 4 routers are connected upstream with a 1G connection (separate 
> port, same network card)
> All routers have the full internet routing tables, i.e. 80k entries for 
> IPv6 and 830k entries for IPv4
> Conntrack is disabled (-j NOTRACK)
> Kernel 5.4.60 (custom)
> 2x Xeon X5670 @ 2.93 Ghz
> 96 GB RAM
> No Swap
> CentOs 7
>
> During high latency:
>
> Latency on routers which have the traffic flow increases to 12 - 20 ms, 
> for all interfaces, moving of the stream (via bgp disable session) moves 
> also the high latency
> iperf3 performance plummets to 300 - 400 MBits
> CPU load (user / system) are around 0.1%
> Ram Usage is around 3 - 4 GB
> if_packets count is stable (around 8000 pkt/s more)

I'm not sure I get your topology. Packets are going from where to where,
and what link is the bottleneck for the transfer you're doing? Are you
measuring the latency along the same path?

Have you tried running 'mtr' to figure out which hop the latency is at?
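
(Something like the report below, run while the upload is active; the target
is a placeholder and -c just sets the number of probe cycles.)

  mtr --report -c 100 <some-host-beyond-the-router>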

> Here is the tc -s qdisc output:

This indicates ("dropped 0" and "ecn_mark 0") that there's no
backpressure on the qdisc, so something else is going on.

Also, you said the issue goes away if you downgrade the kernel? That
does sound odd...

-Toke
___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


[Bloat] Router congestion, slow ping/ack times with kernel 5.4.60

2020-11-04 Thread Thomas Rosenstein via Bloat

Hi all,

I'm coming from the lartc mailing list, here's the original text:

=

I have multiple routers which connect to multiple upstream providers, I 
have noticed a high latency shift in icmp (and generally all connection) 
if I run b2 upload-file --threads 40 (and I can reproduce this)


What options do I have to analyze why this happens?

General Info:

Routers are connected between each other with 10G Mellanox Connect-X 
cards via 10G SPF+ DAC cables via a 10G Switch from fs.com

Latency generally is around 0.18 ms between all routers (4).
Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3.
2 of the 4 routers are connected upstream with a 1G connection (separate 
port, same network card)
All routers have the full internet routing tables, i.e. 80k entries for 
IPv6 and 830k entries for IPv4

Conntrack is disabled (-j NOTRACK)
Kernel 5.4.60 (custom)
2x Xeon X5670 @ 2.93 Ghz
96 GB RAM
No Swap
CentOs 7

During high latency:

Latency on routers which have the traffic flow increases to 12 - 20 ms, 
for all interfaces, moving of the stream (via bgp disable session) moves 
also the high latency

iperf3 performance plummets to 300 - 400 MBits
CPU load (user / system) are around 0.1%
Ram Usage is around 3 - 4 GB
if_packets count is stable (around 8000 pkt/s more)


for b2 upload-file with 10 threads I can achieve 60 MB/s consistently, 
with 40 threads the performance drops to 8 MB/s
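
(For completeness, the reproducer is roughly the line below; the bucket and
file names are placeholders, and the account is assumed to be authorized
beforehand with `b2 authorize-account`.)

  b2 upload-file --threads 40 <bucketName> /path/to/large-local-file <remoteName>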


I do not believe that 40 tcp streams should be any problem for a machine 
of that size.


Thanks for any ideas, help, pointers, things I can verify / check / 
provide additional!


===


So far I have tested:

1) Use Stock Kernel 3.10.0-541 -> issue does not happen
2) setup fq_codel on the interfaces (a minimal attach sketch follows the 
qdisc output below):

Here is the tc -s qdisc output:

qdisc fq_codel 8005: dev eth4 root refcnt 193 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 8374229144 bytes 10936167 pkt (dropped 0, overlimits 0 requeues 6127)
 backlog 0b 0p requeues 6127
  maxpacket 25398 drop_overlimit 0 new_flow_count 15441 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 8008: dev eth5 root refcnt 193 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 1072480080 bytes 1012973 pkt (dropped 0, overlimits 0 requeues 735)
 backlog 0b 0p requeues 735
  maxpacket 19682 drop_overlimit 0 new_flow_count 15963 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 8004: dev eth4.2300 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 8441021899 bytes 11021070 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 68130 drop_overlimit 0 new_flow_count 257055 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 8006: dev eth5.2501 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 571984459 bytes 2148377 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 7570 drop_overlimit 0 new_flow_count 11300 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 8007: dev eth5.2502 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 140132 bytes 1966724 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 19682 drop_overlimit 0 new_flow_count 76653 ecn_mark 0
  new_flows_len 0 old_flows_len 0
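
(The attach step referenced above is one line per device; eth4/eth5 here
match the interfaces in the output, adjust to your own names:)

  tc qdisc replace dev eth4 root fq_codel
  tc qdisc replace dev eth5 root fq_codel
  tc -s qdisc show dev eth4    # read back the stats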


I have no statistics / metrics that would point to a slowdown on the
server; cpu / load / network / packets / memory all show normal, very low
load.
Are there other (hidden) metrics I can collect to analyze this issue
further?


Thanks
Thomas




___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat