Guilherme, thank you for your kind words :) I have been trying to reproduce this bug on several other systems that I have access to in our cloud account, but I have been unable to reproduce it on a VM (with either SAN or local SSD storage). The main set of servers where we have seen this are bare-metal servers with a RAID card backed by SSDs - it's possible that the combination of resources available on the machine (CPU, RAM, disk IO) makes this bug more reproducible with my basic testcase.
I have taken a server out of production and rebooted it into a 4.15 kernel (4.15.0-141-generic) where the issue can be seen. I checked whether my testcase still reproduces the issue there, and it does - nr_writeback is currently stuck at 2641 after one iteration. I have supplied the apport-collected information from that server, which is now attached to this issue.

This is my first bug report on Launchpad, so I am as yet unfamiliar with the process for testing the potential patches I need. Are you suggesting that I follow the process to rebuild the kernel (https://wiki.ubuntu.com/Kernel/BuildYourOwnKernel) including the patches you have mentioned? Assuming that is the correct course of action, I'll attempt to follow the instructions and report back.

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1926081

Title: nr_writeback memory leak in kernel 4.15.0-137+

Status in linux package in Ubuntu: In Progress

Bug description:

Ubuntu 18.04.5 4.15.0 LTS kernels at version 4.15.0-137 and above contain a memory leak due to the inclusion of a patch from the upstream kernel, but not the fix for that patch, which was released later.

This issue manifests itself as an increasing amount of memory used by the writeback queue, which never returns to zero. This can be seen either as the value of `nr_writeback` in /proc/vmstat, or the value of `Writeback` in /proc/meminfo. Ordinarily these values should be at or around zero, but on our servers we observe the `nr_writeback` value increasing to over 8 million pages (32GB of memory at 4KB per page), at which point it isn't long before the system IO slows to a crawl (tens of Kb/s).

Our servers have 256GB of memory, and are performing many CI-related activities - this issue appears to be related to concurrent writing to disk, and can be demonstrated with a simple testcase (see later).
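
The relationship between the `nr_writeback` page count and the memory figure quoted above can be sketched in shell. This is only an illustration and not part of the original report: the function names `read_nr_writeback` and `pages_to_mib` are hypothetical, and the 4096-byte default page size is an assumption that holds on typical x86_64 systems.

```shell
#!/bin/bash
# Hypothetical helpers (names are assumptions, not from the report).

# Print the nr_writeback page count from a vmstat-format file.
# Matches the field name exactly, so nr_writeback_temp is ignored.
read_nr_writeback() {
    awk '$1 == "nr_writeback" { print $2 }' "${1:-/proc/vmstat}"
}

# Convert a page count ($1) to whole MiB, given a page size in bytes
# ($2, assumed to be 4096 if omitted).
pages_to_mib() {
    echo $(( $1 * ${2:-4096} / 1048576 ))
}

# Only attempt a live reading where /proc/vmstat exists (Linux).
if [ -r /proc/vmstat ]; then
    pages=$(read_nr_writeback)
    if [ -n "$pages" ]; then
        echo "nr_writeback: $pages pages (~$(pages_to_mib "$pages" "$(getconf PAGESIZE)") MiB)"
    fi
fi
```

For example, `pages_to_mib 8000000` prints 31250, which is roughly the 32GB figure mentioned above.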
On our heavily used systems this memory leak can result in an unstable server after 2-3 days, requiring a reboot to fix it.

After much investigation, the issue appears to be that the patch "mm: memcontrol: fix excessive complexity in memory.stat reporting" was brought into the 4.15.0-137 Ubuntu kernel (see https://launchpad.net/ubuntu/+source/linux/4.15.0-137.141) as part of "Bionic update: upstream stable patchset 2021-01-25 (LP: #1913214)". However, in the mainline kernel there was a follow-up patch, because this initial patch introduced concurrency issues. The patch "mm: memcontrol: fix NR_WRITEBACK leak in memcg and system stats" is required, and should be brought into the Ubuntu packaged kernel to fix the issues reported.

The required patch is here:
https://github.com/torvalds/linux/commit/c3cc39118c3610eb6ab4711bc624af7fc48a35fe
and was committed a few weeks after the original (broken) patch:
https://github.com/torvalds/linux/commit/a983b5ebee57209c99f68c8327072f25e0e6e3da

I have checked the release notes for Ubuntu versions -137 to -143, and none include this second patch that should fix the issue. (I checked https://people.canonical.com/~kernel/info/kernel-version-map.html for all the kernel versions, and then visited each changelog page in turn, e.g. https://launchpad.net/ubuntu/+source/linux/4.15.0-143.147, looking for "mm: memcontrol: fix NR_WRITEBACK leak in memcg and system stats".)

We do not observe this on the 5.4.0 kernel (the supported HWE kernel on 18.04.5), which includes this second patch. That kernel may also include other patches, so we do not know whether any other fixes are also required, but the one linked above is clearly needed and matches our symptoms.
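
The manual changelog check described above could be scripted along these lines. This is a rough sketch under stated assumptions: it assumes the text of each changelog page has already been saved to a local file (how it is fetched is not covered here), the function name `changelog_has_fix` is hypothetical, and it matches a fragment of the patch title rather than the whole title in case a changelog wraps the line.

```shell
#!/bin/bash
# Fragment of the follow-up patch title quoted in the report.
FIX_PATTERN='fix NR_WRITEBACK leak'

# Hypothetical helper: succeeds (exit 0) if the changelog text in
# file $1 mentions the follow-up fix, fails otherwise.
changelog_has_fix() {
    grep -qF -- "$FIX_PATTERN" "$1"
}

# Check every saved changelog file passed on the command line.
for f in "$@"; do
    if changelog_has_fix "$f"; then
        echo "$f: fix present"
    else
        echo "$f: fix NOT found"
    fi
done
```

Run with one saved changelog file per kernel version as arguments; any version reported as "fix NOT found" would be a candidate for being affected.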
Testcase:

The following is enough to permanently increase the value of `nr_writeback` on our systems (by about 2000 during most executions):

```
date
grep nr_writeback /proc/vmstat
mkdir -p /docker/testfiles/{1..5}
seq -w 1 100000 | xargs -n1 -I% sh -c 'dd if=/dev/urandom of=/docker/testfiles/1/file.% bs=4k count=10 status=none' &
seq -w 1 100000 | xargs -n1 -I% sh -c 'dd if=/dev/urandom of=/docker/testfiles/2/file.% bs=4k count=10 status=none' &
seq -w 1 100000 | xargs -n1 -I% sh -c 'dd if=/dev/urandom of=/docker/testfiles/3/file.% bs=4k count=10 status=none' &
seq -w 1 100000 | xargs -n1 -I% sh -c 'dd if=/dev/urandom of=/docker/testfiles/4/file.% bs=4k count=10 status=none' &
seq -w 1 100000 | xargs -n1 -I% sh -c 'dd if=/dev/urandom of=/docker/testfiles/5/file.% bs=4k count=10 status=none' &
wait $(jobs -p)
grep nr_writeback /proc/vmstat
date
```

Subsequent iterations of the test raise it further, and on a system doing a lot of writing from a lot of different processes, it can rise quickly.

System details:

lsb_release -rd
Description: Ubuntu 18.04.5 LTS
Release: 18.04

Affected kernel: 4.15.0-137 onwards (current latest version tried was 4.15.0-142)

e.g.
apt-cache policy linux-image-4.15.0-141-generic
linux-image-4.15.0-141-generic:
  Installed: 4.15.0-141.145
  Candidate: 4.15.0-141.145
  Version table:
 *** 4.15.0-141.145 500
        500 http://mirrors.service.networklayer.com/ubuntu bionic-updates/main amd64 Packages
        500 http://mirrors.service.networklayer.com/ubuntu bionic-security/main amd64 Packages
        100 /var/lib/dpkg/status

According to https://wiki.ubuntu.com/KernelTeam/KernelTeamBugPolicies I should include additional information from the server, but at this stage we have upgraded all our affected systems to 5.4.0, and therefore the kernel versions do not match those with this issue.
We likely have other servers, used in other services, that are not as heavily loaded and have not been as affected by this issue. I may therefore be able to get the equivalent diagnostics from there, after confirming that they demonstrate the same issue with my testcase.

Workaround:

After several weeks narrowing this down, our only option was to upgrade our servers to the 5.4 kernel, which is included as the HWE kernel in 18.04.5:

apt update && apt install --install-recommends -y linux-generic-hwe-18.04

We have now upgraded most of our heavily used systems where this is a major issue to the 5.4.0 kernel, which seemed to be our only option. We have a lot of other colleagues for whom this is not a possibility, and it seems to be affecting them to varying degrees depending on the nature of their workloads.
---
ProblemType: Bug
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116,  1 Apr 27 04:12 seq
 crw-rw---- 1 root audio 116, 33 Apr 27 04:12 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.9-0ubuntu7.23
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 18.04
HibernationDevice: RESUME=UUID=e38970cc-bdc9-406f-9f41-e8b02cfa48d7
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
MachineType: Supermicro PIO-848B-TRF4T-ST031
Package: linux (not installed)
PciMultimedia:
ProcFB: 0 astdrmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.15.0-141-generic root=UUID=102d359f-6a99-403b-ac57-ff2a5fc1246a ro
ProcVersionSignature: Ubuntu 4.15.0-141.145-generic 4.15.18
RelatedPackageVersions:
 linux-restricted-modules-4.15.0-141-generic N/A
 linux-backports-modules-4.15.0-141-generic N/A
 linux-firmware 1.173.20
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
Tags: bionic
Uname: Linux 4.15.0-141-generic x86_64
UnreportableReason: This report is about a package that is not installed.
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:
WifiSyslog:
_MarkForUpload: False
dmi.bios.date: 10/18/2016
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 2.1
dmi.board.asset.tag: IBM SoftLayer
dmi.board.name: X10QBi
dmi.board.vendor: Supermicro
dmi.board.version: 1.01A
dmi.chassis.asset.tag: IBM SoftLayer
dmi.chassis.type: 1
dmi.chassis.vendor: Supermicro
dmi.chassis.version: To Be Filled By O.E.M.
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr2.1:bd10/18/2016:svnSupermicro:pnPIO-848B-TRF4T-ST031:pvr123456789:rvnSupermicro:rnX10QBi:rvr1.01A:cvnSupermicro:ct1:cvrToBeFilledByO.E.M.:
dmi.product.family: SMC X10
dmi.product.name: PIO-848B-TRF4T-ST031
dmi.product.version: 123456789
dmi.sys.vendor: Supermicro

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1926081/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp