As another "me too" situation, I'm seeing the same phenomenon, though on Rocky 9 rather than Ubuntu and on older kernels (5.14). Reporting details here on the off chance this provides some insight.
Hardware: Ampere Altra Max, 128 cores (aarch64), ConnectX6-DX NICs (2 x dual-port 100G)
Kernel versions tested: 5.14 (Rocky 9 native kernel) and 6.8 (elrepo kernel), both configured with 64 KB pages
OS: Rocky Linux 9
Software: nginx serving ~90k HTTPS clients at ~350 Gbit/s (a synthetic load test)
Bare-metal (no virtualisation).

In my test environment, ~90k HTTPS connections are opened (and reused via keepalive) and used to stream ~350 Gbit/s of traffic to a cluster of load generators.

In this scenario, TCP memory gradually creeps up until it reaches the memory pressure threshold in /proc/sys/net/ipv4/tcp_mem (243890 pages, or 15.6 GB in this system). At this point the rate of memory growth actually increases slightly (and increased CPU load and response times are also observed). The system eventually reaches the ultimate limit (365832 pages, or 23.4 GB), at which point most connections fail and all requests receive very slow responses.

Closing all connections or restarting nginx does not free up the memory; only a reboot resolves the situation, as reported above already. Leaked memory appears to persist even if all connections are closed prior to hitting any of the above limits.

Unfortunately I don't yet have any ideas on how to fix this, but I would be glad to hear (and will share) any insights about what might be going on here!

** Attachment added: "TCP memory consumption"
https://bugs.launchpad.net/ubuntu/+source/linux-signed-aws-6.2/+bug/2045560/+attachment/5763728/+files/boom.png

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux-signed-aws-6.2 in Ubuntu.
https://bugs.launchpad.net/bugs/2045560

Title: TCP memory leak, slow network (arm64)

Status in linux-signed-aws-6.2 package in Ubuntu: Confirmed

Bug description:

Hello! 👋

We have Ubuntu OS-based servers running in both AWS and Azure clouds.
These servers are handling thousands of connections, and we've been experiencing issues with TCP memory usage since upgrading to Ubuntu 22.04.3 from 22.04.2.

$ cat /proc/net/sockstat
sockets: used 6642
TCP: inuse 5962 orphan 0 tw 292 alloc 6008 mem 128989
UDP: inuse 5 mem 0
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0

As shown in the output below, even after stopping all possible services and closing all open connections, the TCP memory usage remains high and only decreases very slowly.

$ cat /proc/net/sockstat
sockets: used 138
TCP: inuse 2 orphan 0 tw 0 alloc 3 mem 128320
UDP: inuse 3 mem 0
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0

I have attached a screenshot of linear TCP memory usage growth, which we believe may indicate a TCP memory leak. When the net.ipv4.tcp_mem limit is reached, it causes a network slowdown.

We've never had these issues before, and the only solution we've found so far is to reboot the node. Do you have any suggestions on how to troubleshoot further?

Thank you for any help or guidance you can provide!

ProblemType: Bug
DistroRelease: Ubuntu 22.04
Package: linux-image-6.2.0-1015-aws 6.2.0-1015.15~22.04.1
ProcVersionSignature: Ubuntu 6.2.0-1015.15~22.04.1-aws 6.2.16
Uname: Linux 6.2.0-1015-aws aarch64
ApportVersion: 2.20.11-0ubuntu82.5
Architecture: arm64
CasperMD5CheckResult: unknown
CloudArchitecture: aarch64
CloudID: aws
CloudName: aws
CloudPlatform: ec2
CloudRegion: us-west-2
CloudSubPlatform: metadata (http://169.254.169.254)
Date: Mon Dec 4 13:13:08 2023
Ec2AMI: ami-095a68e28e781dfe1
Ec2AMIManifest: (unknown)
Ec2Architecture: arm64
Ec2AvailabilityZone: us-west-2b
Ec2Imageid: ami-095a68e28e781dfe1
Ec2InstanceType: m7g.large
Ec2Instancetype: m7g.large
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
Ec2Region: us-west-2
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 LANG=C.UTF-8
 SHELL=/bin/bash
RebootRequiredPkgs: Error: path contained symlinks.
SourcePackage: linux-signed-aws-6.2
UpgradeStatus: No upgrade log present (probably fresh install)
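
As a rough illustration of the two /proc/net/sockstat snapshots quoted in the bug description, the Python sketch below parses the TCP line of each and compares the counters. The function name tcp_fields and the BEFORE/AFTER constants are made up for this example; the numbers are copied from the report, with each output abbreviated to its first three lines. It only restates the observation (sockets go away, the "mem" page count barely moves) and says nothing about the cause.

```python
# Snapshots copied from the bug description (abbreviated).
BEFORE = """\
sockets: used 6642
TCP: inuse 5962 orphan 0 tw 292 alloc 6008 mem 128989
UDP: inuse 5 mem 0
"""

AFTER = """\
sockets: used 138
TCP: inuse 2 orphan 0 tw 0 alloc 3 mem 128320
UDP: inuse 3 mem 0
"""


def tcp_fields(sockstat_text):
    """Return the key/value pairs of the 'TCP:' line as a dict of ints."""
    for line in sockstat_text.splitlines():
        if line.startswith("TCP:"):
            tokens = line.split()[1:]  # drop the leading "TCP:" label
            # Tokens alternate between a field name and its integer value.
            return {key: int(value) for key, value in zip(tokens[::2], tokens[1::2])}
    raise ValueError("no TCP line found in sockstat text")


before = tcp_fields(BEFORE)
after = tcp_fields(AFTER)

released = before["mem"] - after["mem"]   # pages given back after shutdown
retained = after["mem"] / before["mem"]   # fraction of pages still accounted

print(f"TCP sockets in use: {before['inuse']} -> {after['inuse']}")
print(f"TCP mem (pages):    {before['mem']} -> {after['mem']} ({released} released)")
print(f"fraction retained:  {retained:.1%}")
```

On a live system the same function could be fed the contents of /proc/net/sockstat read at intervals, to log how the "mem" counter evolves while connections are opened and closed.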