Re: [PATCH v4 0/7] mm: reparent slab memory on cgroup removal

2019-06-05 Thread Roman Gushchin
On Wed, Jun 05, 2019 at 12:39:24AM -0700, Greg Thelen wrote:
> Roman Gushchin wrote:
> 
> > # Why do we need this?
> >
> > We've noticed that the number of dying cgroups is steadily growing on most
> > of our hosts in production. The following investigation revealed an issue
> > in userspace memory reclaim code [1], accounting of kernel stacks [2],
> > and also the main reason: slab objects.
> >
> > The underlying problem is quite simple: any page charged to a cgroup
> > holds a reference to it, so the cgroup can't be reclaimed until all
> > charged pages are gone. If a slab object is actively used by other
> > cgroups, it won't be reclaimed, and it will prevent the origin cgroup
> > from being reclaimed.
> >
> > Slab objects, and first of all the vfs cache, are shared between cgroups
> > that use the same underlying fs and, even more importantly, between
> > multiple generations of the same workload. So if something runs
> > periodically, each time in a new cgroup (like how systemd works), we
> > accumulate multiple dying cgroups.
> >
> > Strictly speaking, pagecache isn't different here, but there is a key
> > difference: we disable protection and apply some extra pressure on LRUs
> > of dying cgroups, and these LRUs contain all charged pages.
> > My experiments show that with kernel memory accounting disabled the
> > number of dying cgroups stabilizes at a relatively small number (~100,
> > depending on memory pressure and cgroup creation rate), and with kernel
> > memory accounting enabled it grows pretty steadily up to several thousand.
> >
> > Memory cgroups are quite complex and big objects (mostly due to percpu
> > stats), so this leads to noticeable memory losses. Memory occupied by
> > dying cgroups is measured in hundreds of megabytes. I've even seen a
> > host with more than 100Gb of memory wasted on dying cgroups. It leads to
> > a degradation of performance with uptime, and generally limits the usage
> > of cgroups.
> >
> > My previous attempt [3] to fix the problem by applying extra pressure on
> > slab shrinker lists caused regressions with xfs and ext4, and has been
> > reverted [4]. The following attempts to find the right balance [5, 6]
> > were not successful.
> >
> > So instead of trying to find a balance that may not exist, let's reparent
> > the accounted slabs to the parent cgroup on cgroup removal.
> >
> >
> > # Implementation approach
> >
> > There is however a significant problem with reparenting of slab memory:
> > there is no list of charged pages. Some of them are in shrinker lists,
> > but not all. Introducing a new list is really not an option.
> >
> > But fortunately there is a way forward: every slab page has a stable pointer
> > to the corresponding kmem_cache. So the idea is to reparent kmem_caches
> > instead of slab pages.
> >
> > It's actually simpler and cheaper, but requires some underlying changes:
> > 1) Make kmem_caches hold a single reference to the memory cgroup,
> >    instead of a separate reference per slab page.
> > 2) Stop setting the page->mem_cgroup pointer for memcg slab pages and use
> >    the page->kmem_cache->memcg indirection instead. It's used only on
> >    slab page release, so it shouldn't be a big issue.
> > 3) Introduce a refcounter for non-root slab caches. It's required to
> >    be able to destroy kmem_caches when they become empty and release
> >    the associated memory cgroup.
> >
> > There is a bonus: currently we release empty kmem_caches on cgroup
> > removal, but all the others wait for the memory cgroup to be released.
> > These refactorings allow kmem_caches to be released as soon as they
> > become inactive and free.
> >
> > Some additional implementation details are provided in corresponding
> > commit messages.
> >
> > # Results
> >
> > Below is the average number of dying cgroups on two groups of our
> > production hosts. They run some sort of web frontend workload, and the
> > memory pressure is moderate. As we can see, with kernel memory
> > reparenting the number stabilizes in the 60s range; with the original
> > version it grows almost linearly and doesn't show any signs of
> > plateauing. The difference in slab and percpu usage between patched and
> > unpatched versions also grows linearly. In 7 days it exceeded 200Mb.
> >
> > day             0    1    2    3    4    5    6    7
> > original       56  362  628  752 1070 1250 1490 1560
> > patched        23   46   51   55   60   57   67   69
> > mem diff(Mb)   22   74  123  152  164  182  214  241
> 
> No objection to the idea, but a question...

Hi Greg!

> In the patched kernel, does slabinfo (or similar) show the list of
> reparented slab caches?  A pile of zombie kmem_caches is certainly better
> than a pile of zombie mem_cgroup.  But it still seems like it might cause
> degradation - does cache_reap() walk an ever growing set of zombie
> caches?

It's not a pile of zombie 

Re: [PATCH v4 0/7] mm: reparent slab memory on cgroup removal

2019-06-05 Thread Greg Thelen
Roman Gushchin wrote:

> # Why do we need this?
>
> We've noticed that the number of dying cgroups is steadily growing on most
> of our hosts in production. The following investigation revealed an issue
> in userspace memory reclaim code [1], accounting of kernel stacks [2],
> and also the main reason: slab objects.
>
> The underlying problem is quite simple: any page charged to a cgroup
> holds a reference to it, so the cgroup can't be reclaimed until all
> charged pages are gone. If a slab object is actively used by other
> cgroups, it won't be reclaimed, and it will prevent the origin cgroup
> from being reclaimed.
>
> Slab objects, and first of all the vfs cache, are shared between cgroups
> that use the same underlying fs and, even more importantly, between
> multiple generations of the same workload. So if something runs
> periodically, each time in a new cgroup (like how systemd works), we
> accumulate multiple dying cgroups.
>
> Strictly speaking, pagecache isn't different here, but there is a key
> difference: we disable protection and apply some extra pressure on LRUs
> of dying cgroups, and these LRUs contain all charged pages.
> My experiments show that with kernel memory accounting disabled the
> number of dying cgroups stabilizes at a relatively small number (~100,
> depending on memory pressure and cgroup creation rate), and with kernel
> memory accounting enabled it grows pretty steadily up to several thousand.
>
> Memory cgroups are quite complex and big objects (mostly due to percpu
> stats), so this leads to noticeable memory losses. Memory occupied by
> dying cgroups is measured in hundreds of megabytes. I've even seen a
> host with more than 100Gb of memory wasted on dying cgroups. It leads to
> a degradation of performance with uptime, and generally limits the usage
> of cgroups.
>
> My previous attempt [3] to fix the problem by applying extra pressure on
> slab shrinker lists caused regressions with xfs and ext4, and has been
> reverted [4]. The following attempts to find the right balance [5, 6]
> were not successful.
>
> So instead of trying to find a balance that may not exist, let's reparent
> the accounted slabs to the parent cgroup on cgroup removal.
>
>
> # Implementation approach
>
> There is however a significant problem with reparenting of slab memory:
> there is no list of charged pages. Some of them are in shrinker lists,
> but not all. Introducing a new list is really not an option.
>
> But fortunately there is a way forward: every slab page has a stable pointer
> to the corresponding kmem_cache. So the idea is to reparent kmem_caches
> instead of slab pages.
>
> It's actually simpler and cheaper, but requires some underlying changes:
> 1) Make kmem_caches hold a single reference to the memory cgroup,
>    instead of a separate reference per slab page.
> 2) Stop setting the page->mem_cgroup pointer for memcg slab pages and use
>    the page->kmem_cache->memcg indirection instead. It's used only on
>    slab page release, so it shouldn't be a big issue.
> 3) Introduce a refcounter for non-root slab caches. It's required to
>    be able to destroy kmem_caches when they become empty and release
>    the associated memory cgroup.
>
> There is a bonus: currently we release empty kmem_caches on cgroup
> removal, but all the others wait for the memory cgroup to be released.
> These refactorings allow kmem_caches to be released as soon as they
> become inactive and free.
>
> Some additional implementation details are provided in corresponding
> commit messages.
>
> # Results
>
> Below is the average number of dying cgroups on two groups of our
> production hosts. They run some sort of web frontend workload, and the
> memory pressure is moderate. As we can see, with kernel memory
> reparenting the number stabilizes in the 60s range; with the original
> version it grows almost linearly and doesn't show any signs of
> plateauing. The difference in slab and percpu usage between patched and
> unpatched versions also grows linearly. In 7 days it exceeded 200Mb.
>
> day             0    1    2    3    4    5    6    7
> original       56  362  628  752 1070 1250 1490 1560
> patched        23   46   51   55   60   57   67   69
> mem diff(Mb)   22   74  123  152  164  182  214  241
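
For readers following the thread, the reference scheme from the quoted
"Implementation approach" section (one memcg reference per kmem_cache,
pages reaching the memcg only through their cache, and reparenting of the
cache on cgroup removal) can be illustrated with a small userspace model.
This is only a sketch: every toy_* type and function below is hypothetical
and is not taken from the patch series or from the kernel, and locking,
RCU and percpu state are ignored entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Toy userspace model; none of these types are the kernel's own. */
struct toy_memcg {
    struct toy_memcg *parent;
    int refs;     /* references pinning this cgroup */
    int removed;  /* set once the cgroup has been removed ("dying") */
};

struct toy_cache {
    struct toy_memcg *memcg; /* the single reference held by the cache */
    int pages;               /* live slab pages belonging to this cache */
    int active;              /* cleared when the owning cgroup is removed */
};

struct toy_page {
    struct toy_cache *cache; /* pages point at the cache, not at the memcg */
};

/* Drop the cache's single memcg reference. */
static void toy_cache_release(struct toy_cache *c)
{
    c->memcg->refs--;
    c->memcg = NULL;
}

void toy_cache_create(struct toy_cache *c, struct toy_memcg *memcg)
{
    c->memcg = memcg;
    c->pages = 0;
    c->active = 1;
    memcg->refs++; /* one reference per cache, not per page */
}

void toy_page_alloc(struct toy_page *p, struct toy_cache *c)
{
    p->cache = c;
    c->pages++;    /* no memcg reference is taken here */
}

/* The page -> cache -> memcg indirection. */
struct toy_memcg *toy_page_memcg(const struct toy_page *p)
{
    return p->cache->memcg;
}

/* Returns 1 if freeing this page also released an inactive, empty cache. */
int toy_page_free(struct toy_page *p)
{
    struct toy_cache *c = p->cache;

    p->cache = NULL;
    c->pages--;
    if (!c->active && c->pages == 0) {
        toy_cache_release(c);
        return 1;
    }
    return 0;
}

/* Remove a non-root cgroup: release empty caches, reparent the rest. */
void toy_memcg_remove(struct toy_memcg *memcg,
                      struct toy_cache **caches, size_t n)
{
    assert(memcg->parent != NULL);
    for (size_t i = 0; i < n; i++) {
        struct toy_cache *c = caches[i];

        if (c->memcg != memcg)
            continue;
        c->active = 0;
        if (c->pages == 0) {
            toy_cache_release(c);
            continue;
        }
        /* Move the cache's single reference from child to parent. */
        memcg->refs--;
        memcg->parent->refs++;
        c->memcg = memcg->parent;
    }
    memcg->removed = 1;
}

int toy_memcg_can_free(const struct toy_memcg *memcg)
{
    return memcg->removed && memcg->refs == 0;
}
```

In this model a removed cgroup becomes freeable as soon as
toy_memcg_remove() returns, even while slab pages allocated on its behalf
are still live, because those pages now resolve to the parent through
their cache. The zombie that remains is the inactive toy_cache, which is
released when its last page is freed.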

No objection to the idea, but a question...

In the patched kernel, does slabinfo (or similar) show the list of
reparented slab caches?  A pile of zombie kmem_caches is certainly better
than a pile of zombie mem_cgroup.  But it still seems like it might cause
degradation - does cache_reap() walk an ever growing set of zombie
caches?

We've found it useful to add a slabinfo_full file which includes zombie
kmem_cache with their memcg_name.  This can help hunt down zombies.

> # History
>
> v4:
>   1) removed excessive memcg != parent check in memcg_deactivate_kmem_caches()
>   2) fixed rcu_read_lock() usage in memcg_charge_slab()
>   3) fixed synchronization