Hi all,

I'm coming from the lartc mailing list, here's the original text:

=====

I have multiple routers which connect to multiple upstream providers. I have noticed a high latency shift in ICMP (and generally all connections) when I run b2 upload-file --threads 40, and I can reproduce this.
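Roughly how I reproduce and observe it (the bucket, file and IP below are placeholders for my real ones):

  # terminal 1: continuous latency probe towards a neighbouring router
  ping -i 0.2 <neighbour-router-ip>

  # terminal 2: trigger the upload that causes the latency shift
  b2 upload-file --threads 40 <bucket> ./testfile.bin testfile.bin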

What options do I have to analyze why this happens?

General Info:

Routers are connected to each other with 10G Mellanox ConnectX cards via 10G SFP+ DAC cables through a 10G switch from fs.com
Latency is generally around 0.18 ms between all four routers.
Throughput is 9.4 Gbit/s with 0 retransmissions when tested with iperf3.
2 of the 4 routers are connected upstream with a 1G connection (separate port, same network card). All routers carry the full internet routing tables, i.e. ~80k entries for IPv6 and ~830k entries for IPv4
Conntrack is disabled (-j NOTRACK; see the rule sketch after this list)
Kernel 5.4.60 (custom)
2x Xeon X5670 @ 2.93 Ghz
96 GB RAM
No Swap
CentOS 7
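The conntrack bypass mentioned above is roughly this (simplified; the real rules carry interface / address matches):

  # raw table: skip connection tracking for forwarded and locally generated traffic
  iptables  -t raw -A PREROUTING -j NOTRACK
  iptables  -t raw -A OUTPUT     -j NOTRACK
  ip6tables -t raw -A PREROUTING -j NOTRACK
  ip6tables -t raw -A OUTPUT     -j NOTRACK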

During high latency:

Latency on the routers carrying the traffic flow increases to 12 - 20 ms on all interfaces; moving the stream (by disabling the BGP session) moves the high latency along with it
iperf3 performance plummets to 300 - 400 Mbit/s
CPU load (user / system) is around 0.1%
RAM usage is around 3 - 4 GB
if_packets count is stable (around 8000 pkt/s above the usual rate)


For b2 upload-file with 10 threads I can achieve 60 MB/s consistently; with 40 threads the performance drops to 8 MB/s.

I do not believe that 40 TCP streams should be any problem for a machine of that size.

Thanks for any ideas, help, pointers, or anything else I can verify / check / provide in addition!

=======


So far I have tested:

1) Using the stock kernel 3.10.0-541 -> the issue does not happen
2) Setting up fq_codel on the interfaces:
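Roughly what I applied (defaults only; the exact interface list is from memory):

  # attach fq_codel with default parameters to the physical and the VLAN interfaces
  for dev in eth4 eth5 eth4.2300 eth5.2501 eth5.2502; do
      tc qdisc replace dev "$dev" root fq_codel
  done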

Here is the tc -s qdisc output:

qdisc fq_codel 8005: dev eth4 root refcnt 193 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 8374229144 bytes 10936167 pkt (dropped 0, overlimits 0 requeues 6127)
 backlog 0b 0p requeues 6127
  maxpacket 25398 drop_overlimit 0 new_flow_count 15441 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 8008: dev eth5 root refcnt 193 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 1072480080 bytes 1012973 pkt (dropped 0, overlimits 0 requeues 735)
 backlog 0b 0p requeues 735
  maxpacket 19682 drop_overlimit 0 new_flow_count 15963 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 8004: dev eth4.2300 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 8441021899 bytes 11021070 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 68130 drop_overlimit 0 new_flow_count 257055 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 8006: dev eth5.2501 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 571984459 bytes 2148377 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 7570 drop_overlimit 0 new_flow_count 11300 ecn_mark 0
  new_flows_len 0 old_flows_len 0
qdisc fq_codel 8007: dev eth5.2502 root refcnt 2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms ecn
 Sent 1401322222 bytes 1966724 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
  maxpacket 19682 drop_overlimit 0 new_flow_count 76653 ecn_mark 0
  new_flows_len 0 old_flows_len 0


I have no statistics / metrics that would point to a slowdown on the server; CPU / load / network / packets / memory all show normal, very low load. Are there other (hidden) metrics I can collect to analyze this issue further?
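In case it helps, this is roughly what I plan to sample on the routers during the next test window (assuming the Mellanox driver exposes its counters via ethtool -S; eth4 is just an example):

  # once per second: NIC drop/pause counters, softirq budget exhaustion, qdisc drops/backlog
  while sleep 1; do
      date +%T
      ethtool -S eth4 | grep -Ei 'drop|discard|pause|fifo'
      cat /proc/net/softnet_stat    # 2nd column = dropped, 3rd column = time_squeeze
      tc -s qdisc show dev eth4 | grep -E 'dropped|backlog'
  done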

Thanks
Thomas



