Control: tags -1 + moreinfo

On Wed, Oct 31, 2018 at 05:21:39PM +0800, 段熊春 wrote:
> Package: linux-image-4.9.0-0.bpo.7-amd64
> Version: 4.9.110-3+deb9u2~deb8u1
>
> Package: systemd
> Version: 230-7~bpo8+2
>
> Hi guys:
> We suspect that we may have found a memory leak bug in the cgroup memory
> subsystem, leaking at about 1 GByte/hour in a specific case.
> This bug can be reproduced 100% on the mainline kernel version 4.19.
> (Tried on Debian's latest 4.14 and 4.9 kernels, with the same result.)
>
> This is what we have observed (Debian 9 Stretch, with mainline kernel
> version 4.19, kconfig attached) and how to reproduce it:
> System with cgroups enabled. A demo service which simulates "ill"
> behaviour: the program is broken and exits immediately after startup:
>
> Service code:
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(void)
> {
>         void *p = malloc(10240);
>         return 1;
> }
>
> Compile the above code and install the binary as /usr/bin/test.
>
> systemd service:
> [Service]
> ExecStart=/usr/bin/test
> Restart=always
> RestartSec=2s
> MemoryLimit=1G
> StartLimitInterval=0
>
> [Install]
> WantedBy=default.target
>
> Enable and start the above service with systemctl.
>
> Some additional information:
> With strace attached to systemd before starting the service: systemd
> mkdirs an entry under /sys/fs/cgroup/memory for that service
> (/usr/bin/test). After the service stops, rmdir removes the
> corresponding entry under /sys/fs/cgroup/memory.
> With kprobes hooked on cgroup_mkdir and cgroup_rmdir: the numbers of
> calls to cgroup_mkdir and cgroup_rmdir are equal.
> With kprobes hooked on (1) mem_cgroup_css_alloc, (2) mem_cgroup_css_free,
> (3) mem_cgroup_css_released and (4) mem_cgroup_css_offline:
> the invocation counts of mem_cgroup_css_alloc and mem_cgroup_css_offline
> are equal (call this number A);
> the invocation counts of mem_cgroup_css_free and mem_cgroup_css_released
> are equal (call this number B);
> A > B.
> With jprobes: we have collected some addresses of memcg structures.
> With the crash tool, inspecting the live kernel: the flag of the member
> named refcnt in memcg->css has changed to __PERCPU_REF_ATOMIC_DEAD, and
> memcg->css->refcnt->count keeps the same value as memcg->memory->count.
> After 24 hours, we observed that the data structure was still in use,
> and the value of both counts was 1.
> We wrote a kmod to put a memcg whose counter is 1; nothing happened
> except that the struct was freed.
> We suspect the issue may be caused by an incorrect call to try_charge
> and cancel_charge. Anyway, just a guess.
> Following is some of the inspection code we used as described above:
[...]
Can you still reproduce this issue with a recent kernel from unstable or
buster-backports? What about mainline? If so, can you please report this
upstream instead, and keep us downstream in the loop?

Regards,
Salvatore