I did not initially run the `apport-collect` command because the servers on
which I observed this bug have already been upgraded to the 5.4.0 kernel to
mitigate the issue (as mentioned in the initial report), so the
kernel-related information it would gather may be misleading.

I will endeavour to find a server that still exhibits this issue on an
affected kernel, but I cannot do so today. Given that I have identified the
required upstream patch, I believe progress can be made on this bug without
that information.

If an `apport-collect` run from a server that has already been upgraded to the
newer kernel would still be useful, please let me know and I will see whether I
can generate the information there. Otherwise, I will try to find a server that
has not been upgraded, replicate the issue, and upload the required logs from
that server.
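For reference, the collection step on an affected host would be along the
lines of the standard invocation against this bug's ID:

```
# run on an affected server while it is still on a 4.15.0-137+ kernel
apport-collect 1926081
```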

** Changed in: linux (Ubuntu)
       Status: Incomplete => Confirmed

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1926081

Title:
  nr_writeback memory leak in kernel 4.15.0-137+

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  
  Ubuntu 18.04.5 LTS 4.15.0 kernels at version 4.15.0-137 and above contain a
  memory leak caused by the inclusion of a patch from the upstream kernel
  without the follow-up fix for that patch, which was released later.

  This issue manifests itself as an increasing amount of memory used by
  the writeback queue, which never returns to zero. It can be seen
  either as the value of `nr_writeback` in /proc/vmstat or as the value
  of `Writeback` in /proc/meminfo.

  Ordinarily these values should be at or around zero, but on our
  servers we observe `nr_writeback` climbing to over 8 million pages
  (32GB of memory), at which point it is not long before system IO
  slows to a crawl (tens of KB/s). Our servers have 256GB of memory and
  perform many CI-related activities; the issue appears to be related
  to concurrent writing to disk and can be demonstrated with a simple
  testcase (see below).
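  As an illustration (not part of the original diagnostics), the two
  counters can be checked directly; note that /proc/vmstat reports a
  page count while /proc/meminfo reports kB:

  ```
  # nr_writeback is a count of 4 KiB pages
  grep nr_writeback /proc/vmstat
  # Writeback is reported in kB
  grep Writeback /proc/meminfo
  ```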

  On our heavily used systems this memory leak can result in an unstable
  server after 2-3 days, requiring a reboot to fix it.

  After much investigation, the issue appears to be that the patch
  "mm: memcontrol: fix excessive complexity in memory.stat reporting"
  was brought into the 4.15.0-137 Ubuntu kernel (see
  https://launchpad.net/ubuntu/+source/linux/4.15.0-137.141) as part of
  "Bionic update: upstream stable patchset 2021-01-25 (LP: #1913214)".
  In the mainline kernel, however, that patch introduced concurrency
  issues and was followed up by a fix. The follow-up patch, "mm:
  memcontrol: fix NR_WRITEBACK leak in memcg and system stats", is
  required and should be brought into the Ubuntu packaged kernel to fix
  the issue reported here.

  The required patch is here:
  https://github.com/torvalds/linux/commit/c3cc39118c3610eb6ab4711bc624af7fc48a35fe
  and was committed a few weeks after the original (broken) patch:
  https://github.com/torvalds/linux/commit/a983b5ebee57209c99f68c8327072f25e0e6e3da

  I have checked the release notes for Ubuntu versions -137 to -143, and
  none include this second patch that should fix the issue. (I checked
  https://people.canonical.com/~kernel/info/kernel-version-map.html for
  all the kernel versions, and then visited each changelog page in turn,
  e.g. https://launchpad.net/ubuntu/+source/linux/4.15.0-143.147 looking
  for "mm: memcontrol: fix NR_WRITEBACK leak in memcg and system
  stats").

  We do not observe this on the 5.4.0 kernel (the supported HWE kernel
  on 18.04.5), which includes this second patch. That kernel may also
  include other patches, so we do not know whether any other fixes are
  required as well, but the one linked above definitely seems to be
  needed and matches our symptoms.

  Testcase:

  The following is enough to permanently increase the value of
  `nr_writeback` on our systems (by about 2000 during most executions):

  ```
  date
  grep nr_writeback /proc/vmstat
  mkdir -p /docker/testfiles/{1..5}

  seq -w 1 100000 | xargs -n1 -I% sh -c 'dd if=/dev/urandom of=/docker/testfiles/1/file.% bs=4k count=10 status=none' &
  seq -w 1 100000 | xargs -n1 -I% sh -c 'dd if=/dev/urandom of=/docker/testfiles/2/file.% bs=4k count=10 status=none' &
  seq -w 1 100000 | xargs -n1 -I% sh -c 'dd if=/dev/urandom of=/docker/testfiles/3/file.% bs=4k count=10 status=none' &
  seq -w 1 100000 | xargs -n1 -I% sh -c 'dd if=/dev/urandom of=/docker/testfiles/4/file.% bs=4k count=10 status=none' &
  seq -w 1 100000 | xargs -n1 -I% sh -c 'dd if=/dev/urandom of=/docker/testfiles/5/file.% bs=4k count=10 status=none' &

  wait $(jobs -p)
  grep nr_writeback /proc/vmstat
  date
  ```

  Subsequent iterations of the test raise it further, and on a system
  doing a lot of writing from a lot of different processes, it can rise
  quickly.
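  If it helps with reproduction, a simple way to watch the counter climb
  while the testcase runs (an illustrative snippet, not part of the
  original testcase):

  ```
  # print a timestamped nr_writeback reading once per second
  while true; do
      printf '%s ' "$(date +%T)"
      grep '^nr_writeback ' /proc/vmstat
      sleep 1
  done
  ```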

  System details:

  lsb_release -rd
  Description:  Ubuntu 18.04.5 LTS
  Release:      18.04

  Affected kernel: 4.15.0-137 onwards (current latest version tried was
  4.15.0-142)

  e.g.

  apt-cache policy linux-image-4.15.0-141-generic
  linux-image-4.15.0-141-generic:
    Installed: 4.15.0-141.145
    Candidate: 4.15.0-141.145
    Version table:
   *** 4.15.0-141.145 500
          500 http://mirrors.service.networklayer.com/ubuntu bionic-updates/main amd64 Packages
          500 http://mirrors.service.networklayer.com/ubuntu bionic-security/main amd64 Packages
          100 /var/lib/dpkg/status

  According to https://wiki.ubuntu.com/KernelTeam/KernelTeamBugPolicies
  I should include additional information from the server, but at this
  stage we have upgraded all our affected systems to 5.4.0, and
  therefore the kernel versions do not match those with this issue.

  We likely have other servers, used in other services, that are not as
  heavily loaded and so have not been as affected by this issue; I may
  be able to get the equivalent diagnostics from one of those after
  confirming that it demonstrates the same issue with my testcase.

  Workaround:

  After several weeks narrowing this down, our only option was to
  upgrade our servers to the 5.4 kernel, which is included as the HWE
  kernel in 18.04.5:

  apt update && apt install --install-recommends -y linux-generic-hwe-18.04
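
  After the install and a reboot, it is straightforward to confirm the
  running kernel and that the counter stays down, for example:

  ```
  # confirm the HWE kernel is now running (expect a 5.4.0-* version)
  uname -r
  # under sustained write load this should now stay at or near zero
  grep nr_writeback /proc/vmstat
  ```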

  We have now upgraded most of our heavily used systems, where this is a
  major issue, to the 5.4.0 kernel, which seemed to be our only option.
  Many other colleagues do not have that possibility, and the issue
  appears to affect them to varying degrees depending on the nature of
  their workloads.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1926081/+subscriptions
