[Kernel-packages] [Bug 2045560] Re: TCP memory leak, slow network (arm64)

2024-04-24 Thread Jonathan Heathcote
The bug report can be found here:

https://lore.kernel.org/netdev/vi1pr01mb42407d7947b2ea448f1e04efd1...@vi1pr01mb4240.eurprd01.prod.exchangelabs.com/

The subsequently produced patch (not by me!) to fix this can be found
here:

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=3584718cf2ec

I've also verified that the above patch does fix the problem at least in
my case!

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-signed-aws-6.2 in Ubuntu.
https://bugs.launchpad.net/bugs/2045560

Title:
  TCP memory leak, slow network (arm64)

Status in linux-signed-aws-6.2 package in Ubuntu:
  Confirmed

Bug description:
  Hello! 

  We have Ubuntu OS-based servers running in both AWS and Azure clouds.
  These servers are handling thousands of connections, and we've been
  experiencing issues with TCP memory usage since upgrading to Ubuntu
  22.04.3 from 22.04.2.

  $ cat /proc/net/sockstat
  sockets: used 6642
  TCP: inuse 5962 orphan 0 tw 292 alloc 6008 mem 128989
  UDP: inuse 5 mem 0
  UDPLITE: inuse 0
  RAW: inuse 0
  FRAG: inuse 0 memory 0

  As shown in the output below, even after stopping all possible
  services and closing all open connections, the TCP memory usage
  remains high and only decreases very slowly.

  $ cat /proc/net/sockstat
  sockets: used 138
  TCP: inuse 2 orphan 0 tw 0 alloc 3 mem 128320
  UDP: inuse 3 mem 0
  UDPLITE: inuse 0
  RAW: inuse 0
  FRAG: inuse 0 memory 0

  I have attached a screenshot of linear TCP memory usage growth, which
  we believe may indicate a TCP memory leak.

  When the net.ipv4.tcp_mem limit is reached, the network slows down.

  We've never had these issues before, and the only solution we've found
  so far is to reboot the node. Do you have any suggestions on how to
  troubleshoot further?

  Thank you for any help or guidance you can provide!

  ProblemType: Bug
  DistroRelease: Ubuntu 22.04
  Package: linux-image-6.2.0-1015-aws 6.2.0-1015.15~22.04.1
  ProcVersionSignature: Ubuntu 6.2.0-1015.15~22.04.1-aws 6.2.16
  Uname: Linux 6.2.0-1015-aws aarch64
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: arm64
  CasperMD5CheckResult: unknown
  CloudArchitecture: aarch64
  CloudID: aws
  CloudName: aws
  CloudPlatform: ec2
  CloudRegion: us-west-2
  CloudSubPlatform: metadata (http://169.254.169.254)
  Date: Mon Dec  4 13:13:08 2023
  Ec2AMI: ami-095a68e28e781dfe1
  Ec2AMIManifest: (unknown)
  Ec2Architecture: arm64
  Ec2AvailabilityZone: us-west-2b
  Ec2Imageid: ami-095a68e28e781dfe1
  Ec2InstanceType: m7g.large
  Ec2Instancetype: m7g.large
  Ec2Kernel: unavailable
  Ec2Ramdisk: unavailable
  Ec2Region: us-west-2
  ProcEnviron:
   TERM=xterm-256color
   PATH=(custom, no user)
   LANG=C.UTF-8
   SHELL=/bin/bash
  RebootRequiredPkgs: Error: path contained symlinks.
  SourcePackage: linux-signed-aws-6.2
  UpgradeStatus: No upgrade log present (probably fresh install)
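
For anyone wanting to compare the two sockstat snapshots in the
description numerically, here is a minimal Python sketch. The helper
names parse_sockstat and tcp_pages_per_socket are hypothetical (they are
not part of any existing tool), and the sample text is copied from the
report. It shows that the TCP "mem" figure (counted in pages) barely
moves while the number of allocated sockets falls from 6008 to 3.

```python
def parse_sockstat(text):
    """Parse /proc/net/sockstat-style text into {protocol: {field: int}}."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if ":" not in line:
            continue
        proto, _, rest = line.partition(":")
        tokens = rest.split()
        # Fields come as alternating "name value" pairs.
        stats[proto] = {k: int(v) for k, v in zip(tokens[0::2], tokens[1::2])}
    return stats


def tcp_pages_per_socket(stats):
    """TCP memory pages per allocated TCP socket (hypothetical health metric)."""
    tcp = stats["TCP"]
    return tcp["mem"] / max(tcp["alloc"], 1)


# Snapshots copied from the bug description.
BEFORE = """\
sockets: used 6642
TCP: inuse 5962 orphan 0 tw 292 alloc 6008 mem 128989
UDP: inuse 5 mem 0
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
"""

AFTER = """\
sockets: used 138
TCP: inuse 2 orphan 0 tw 0 alloc 3 mem 128320
UDP: inuse 3 mem 0
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
"""

if __name__ == "__main__":
    for label, text in (("before", BEFORE), ("after", AFTER)):
        stats = parse_sockstat(text)
        print(label, stats["TCP"], round(tcp_pages_per_socket(stats), 1))
```

On a live system the same parser could be fed the contents of
/proc/net/sockstat; a pages-per-socket figure that keeps climbing while
the socket count stays flat or drops is the pattern described here.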

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-signed-aws-6.2/+bug/2045560/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 2045560] Re: TCP memory leak, slow network (arm64)

2024-04-22 Thread Lev Petrushchak
Hi Jonathan,

Thanks for all the information and the deep dive into this issue! Please
post the link to the kernel bug report here when it's ready.


[Kernel-packages] [Bug 2045560] Re: TCP memory leak, slow network (arm64)

2024-04-19 Thread Jonathan Heathcote
I've been digging into this, and it appears to be a regression
introduced by the following patch, which was first released in Linux
6.0: https://github.com/torvalds/linux/commit/3cd3399dd7a84

The bug is not a memory leak but rather a bug in how memory usage is
counted. Excess memory is not actually being consumed, though the bug is
still fatal since the counter controls Linux's memory pressure logic.

The (apparently) responsible patch is a performance optimisation which
attempts to reduce the frequency of writes to the system-wide counter.
I suspect it subtly misuses some atomic operation on ARM. If you revert
this patch in a recent kernel, the bug disappears.
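
To illustrate how a counting bug (as opposed to a real leak) can make
the reported figure creep upwards, here is a toy Python model. It is a
sketch under stated assumptions, not kernel code: the BatchedCounter
class, its flush threshold and the simulated "interrupt" are all made up
for illustration, and it assumes the failure mode is a lost update in a
non-atomic read-modify-write on a per-CPU reserve.

```python
class BatchedCounter:
    """Toy model: a shared counter fed by a local reserve that batches updates."""

    def __init__(self, flush_threshold=4):
        self.global_pages = 0      # the system-wide counter
        self.local_reserve = 0     # per-CPU batch not yet written back
        self.flush_threshold = flush_threshold

    def reported(self):
        """Total the model believes is allocated."""
        return self.global_pages + self.local_reserve

    def add(self, pages, interrupt=None):
        # Non-atomic read-modify-write: read, (maybe get interrupted), write.
        snapshot = self.local_reserve
        if interrupt is not None:
            interrupt()            # a nested update lands here...
        self.local_reserve = snapshot + pages   # ...and is overwritten
        if abs(self.local_reserve) >= self.flush_threshold:
            self.global_pages += self.local_reserve
            self.local_reserve = 0


def simulate(iterations, racy):
    """Charge one page and release one page per iteration; real usage stays 0."""
    counter = BatchedCounter()
    for _ in range(iterations):
        if racy:
            # The release happens in the middle of the charge and is lost.
            counter.add(+1, interrupt=lambda: counter.add(-1))
        else:
            counter.add(+1)
            counter.add(-1)
    return counter.reported()


if __name__ == "__main__":
    print("sequential:", simulate(1000, racy=False))
    print("interleaved:", simulate(1000, racy=True))
```

In the interleaved case every lost release leaves the reported total one
page too high, even though nothing is actually held. That matches the
symptom of a counter that only a reboot resets, but whether this is the
real mechanism is for the kernel maintainers to confirm.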

I am currently working on a detailed bug report for the relevant Kernel
maintainers.

NB: It appears that the "5.14" kernel shipped by Rocky (and RHEL)
includes a back-port of the offending patch, hence my seeing the bug in
that kernel version on Rocky Linux. A vanilla build of 5.14 without Red
Hat's patches does not exhibit the bug on my system.


[Kernel-packages] [Bug 2045560] Re: TCP memory leak, slow network (arm64)

2024-04-11 Thread Jonathan Heathcote
As another "me too" situation, I'm seeing the same phenomenon, though on
Rocky 9 rather than Ubuntu and on older kernels (5.14). Reporting
details here on the off chance this provides some insight.

Hardware: Ampere Altra Max, 128 cores (aarch64), ConnectX6-DX NICs
(2 x dual 100G port)
Kernel versions tested: 5.14 (Rocky 9 native kernel) and 6.8 (elrepo
kernel), both configured with 64 KB pages
OS: Rocky Linux 9
Software: nginx serving ~90k HTTPS clients at ~350 Gbit/s (a synthetic
load test)
Bare-metal (no virtualisation).

In my test environment, ~90k HTTPS connections are opened (and reused
via keepalive) and used to stream ~350 Gbit/s of traffic to a cluster of
load generators. In this scenario, TCP memory gradually creeps up until
it reaches the memory pressure threshold in /proc/sys/net/ipv4/tcp_mem
(243890 pages, or 15.6 GB in this system). At this point the growth rate
of memory usage actually increases slightly, and CPU load and response
times also rise. The system eventually reaches the ultimate limit
(365832 pages, or 23.4 GB), at which point most connections fail and all
requests receive very slow responses.
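
As a rough sketch of the arithmetic, the following Python converts the
tcp_mem page counts into bytes and classifies a sockstat "mem" reading
against the two thresholds mentioned here. The function names are
hypothetical, and the page size is assumed to be exactly 64 KiB (65536
bytes); the GB figures shift by a few percent depending on whether "KB"
is taken as 1000 or 1024 bytes, so they need not match rounded values
exactly.

```python
PAGE_SIZE_BYTES = 64 * 1024   # assumption: 64 KiB pages, as configured here

# Thresholds quoted for this system (in pages).
PRESSURE_PAGES = 243890
LIMIT_PAGES = 365832


def pages_to_gb(pages, page_size=PAGE_SIZE_BYTES):
    """Convert a page count to decimal gigabytes (10**9 bytes)."""
    return pages * page_size / 1e9


def classify(mem_pages, pressure=PRESSURE_PAGES, limit=LIMIT_PAGES):
    """Place a TCP 'mem' reading relative to the pressure and hard limits."""
    if mem_pages >= limit:
        return "over limit"
    if mem_pages >= pressure:
        return "under pressure"
    return "normal"


if __name__ == "__main__":
    for pages in (PRESSURE_PAGES, LIMIT_PAGES):
        print(pages, "pages =", round(pages_to_gb(pages), 2), "GB",
              "->", classify(pages))
```

The thresholds themselves can be read from /proc/sys/net/ipv4/tcp_mem
and the current reading from the TCP "mem" field of /proc/net/sockstat;
both are expressed in pages.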

Closing all connections or restarting nginx does not free up the memory;
only a reboot resolves the situation, as already reported in this bug.

Leaked memory appears to persist even if all connections are closed
prior to hitting any of the above limits.

Unfortunately I don't yet have any ideas on how to fix this but would be
glad to hear (and will share) any insights about what might be going on
here!

** Attachment added: "TCP memory consumption"
   
https://bugs.launchpad.net/ubuntu/+source/linux-signed-aws-6.2/+bug/2045560/+attachment/5763728/+files/boom.png


[Kernel-packages] [Bug 2045560] Re: TCP memory leak, slow network (arm64)

2023-12-13 Thread Lev Petrushchak
** Description changed:

  Hello! 
  
  We have Ubuntu OS-based servers running in both AWS and Azure clouds.
  These servers are handling thousands of connections, and we've been
  experiencing issues with TCP memory usage since upgrading to Ubuntu
- 22.04.3 from 22.04.2. This results in network slowdown
+ 22.04.3 from 22.04.2.
  
  $ cat /proc/net/sockstat
  sockets: used 6642
  TCP: inuse 5962 orphan 0 tw 292 alloc 6008 mem 128989
  UDP: inuse 5 mem 0
  UDPLITE: inuse 0
  RAW: inuse 0
  FRAG: inuse 0 memory 0
  
  As shown in the output below, even after stopping all possible services
  and closing all open connections, the TCP memory usage remains high and
  only decreases very slowly.
  
  $ cat /proc/net/sockstat
  sockets: used 138
  TCP: inuse 2 orphan 0 tw 0 alloc 3 mem 128320
  UDP: inuse 3 mem 0
  UDPLITE: inuse 0
  RAW: inuse 0
  FRAG: inuse 0 memory 0
  
  I have attached a screenshot of linear TCP memory usage growth, which we
  believe may indicate a TCP memory leak
+ 
+ When net.ipv4.tcp_mem limit is reached, it causes network slowdown
  
  We've never had these issues before, and the only solution we've found
  so far is to reboot the node. Do you have any suggestions on how to
  troubleshoot further?
  
  Thank you for any help or guidance you can provide!
  
  ProblemType: Bug
  DistroRelease: Ubuntu 22.04
  Package: linux-image-6.2.0-1015-aws 6.2.0-1015.15~22.04.1
  ProcVersionSignature: Ubuntu 6.2.0-1015.15~22.04.1-aws 6.2.16
  Uname: Linux 6.2.0-1015-aws aarch64
  ApportVersion: 2.20.11-0ubuntu82.5
  Architecture: arm64
  CasperMD5CheckResult: unknown
  CloudArchitecture: aarch64
  CloudID: aws
  CloudName: aws
  CloudPlatform: ec2
  CloudRegion: us-west-2
  CloudSubPlatform: metadata (http://169.254.169.254)
  Date: Mon Dec  4 13:13:08 2023
  Ec2AMI: ami-095a68e28e781dfe1
  Ec2AMIManifest: (unknown)
  Ec2Architecture: arm64
  Ec2AvailabilityZone: us-west-2b
  Ec2Imageid: ami-095a68e28e781dfe1
  Ec2InstanceType: m7g.large
  Ec2Instancetype: m7g.large
  Ec2Kernel: unavailable
  Ec2Ramdisk: unavailable
  Ec2Region: us-west-2
  ProcEnviron:
   TERM=xterm-256color
   PATH=(custom, no user)
   LANG=C.UTF-8
   SHELL=/bin/bash
  RebootRequiredPkgs: Error: path contained symlinks.
  SourcePackage: linux-signed-aws-6.2
  UpgradeStatus: No upgrade log present (probably fresh install)
