A cgroup can remain in the dying state for a long time, pinned in memory
by any kernel object. It can be pinned by a page shared with another
cgroup (e.g. mlocked by a process in that other cgroup), by a vfs cache
object, etc.
Mostly because of percpu data, the size of a memcg structure in kernel
memory is quite large. Depending on the machine size and the kernel
config, it can easily reach hundreds of kilobytes per cgroup.

Depending on the memory pressure and the reclaim approach (which is a
separate topic), several hundred (if not a few thousand) dying cgroups
appears to be a typical number, so on a moderately sized machine the
overall memory footprint is measured in hundreds of megabytes.

So if we can't completely get rid of dying cgroups, let's make them
smaller. This patchset aims to reduce the size of a dying memory cgroup
by releasing its percpu data prematurely during cgroup removal and
using atomic counterparts instead. Currently it covers the per-memcg
vmstats_percpu and the per-memcg per-node lruvec_stat_cpu. The same
approach can be further applied to other percpu data.

Results on my test machine (32 CPUs, single node):

  With the patchset:                 Originally:

  nr_dying_descendants 0
  Slab:    66640 kB                  Slab:    67644 kB
  Percpu:   6912 kB                  Percpu:   6912 kB

  nr_dying_descendants 1000
  Slab:    85912 kB                  Slab:    84704 kB
  Percpu:  26880 kB                  Percpu:  64128 kB

So one dying cgroup went from 75 kB to 39 kB, almost half the size. The
difference will be even bigger on a larger machine (especially with
NUMA).

To test the patchset, I used the following script:

  CG=/sys/fs/cgroup/percpu_test/

  mkdir ${CG}
  echo "+memory" > ${CG}/cgroup.subtree_control

  cat ${CG}/cgroup.stat | grep nr_dying_descendants
  cat /proc/meminfo | grep -e Percpu -e Slab

  for i in `seq 1 1000`; do
    mkdir ${CG}/${i}
    echo $$ > ${CG}/${i}/cgroup.procs
    dd if=/dev/urandom of=/tmp/test-${i} count=1 2> /dev/null
    echo $$ > /sys/fs/cgroup/cgroup.procs
    rmdir ${CG}/${i}
  done

  cat /sys/fs/cgroup/cgroup.stat | grep nr_dying_descendants
  cat /proc/meminfo | grep -e Percpu -e Slab

  rmdir ${CG}

v3:
  - replaced get_cpu_mask() with cpumask_of() (by Johannes)

v2:
  - several renamings suggested by Johannes Weiner
  - added a patch that merges cpu offlining and percpu flush code

Roman Gushchin (6):
  mm: prepare to premature release of memcg->vmstats_percpu
  mm: prepare to premature release of per-node lruvec_stat_cpu
  mm: release memcg percpu data prematurely
  mm: release per-node memcg percpu data prematurely
  mm: flush memcg percpu stats and events before releasing
  mm: refactor memcg_hotplug_cpu_dead() to use
    memcg_flush_offline_percpu()

 include/linux/memcontrol.h |  66 ++++++++++----
 mm/memcontrol.c            | 179 ++++++++++++++++++++++++++++---------
 2 files changed, 186 insertions(+), 59 deletions(-)

--
2.20.1
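
P.S. For illustration only (not the actual patches), here is a minimal
sketch of the technique: a counter lives in percpu storage while the
cgroup is online and is flushed into an atomic counterpart when the
percpu data is released prematurely. The names (pcp_counter etc.) are
hypothetical, and the sketch omits the synchronization (e.g. RCU) that
real code would need to make the release safe against concurrent
updaters:

  #include <linux/atomic.h>
  #include <linux/percpu.h>

  struct pcp_counter {
          long __percpu *pcp;     /* NULL once released prematurely */
          atomic_long_t atomic;   /* fallback used by dying cgroups */
  };

  static void pcp_counter_add(struct pcp_counter *c, long delta)
  {
          long __percpu *pcp = READ_ONCE(c->pcp);

          if (likely(pcp))
                  this_cpu_add(*pcp, delta);
          else
                  atomic_long_add(delta, &c->atomic);
  }

  static void pcp_counter_release(struct pcp_counter *c)
  {
          long __percpu *pcp = c->pcp;
          int cpu;

          /* Stop new percpu updates, then flush accumulated deltas. */
          WRITE_ONCE(c->pcp, NULL);
          for_each_possible_cpu(cpu)
                  atomic_long_add(*per_cpu_ptr(pcp, cpu), &c->atomic);
          free_percpu(pcp);
  }

The point of flushing before free_percpu() is that no statistics are
lost: the dying cgroup keeps a correct aggregate value in the atomic
counter, while the much larger percpu allocation is returned to the
system.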