Re: [PATCH RFC 4/4] mm, page_alloc: add static key for should_fail_alloc_page()

2024-05-31 Thread Roman Gushchin
On Fri, May 31, 2024 at 11:33:35AM +0200, Vlastimil Babka wrote:
> Similarly to should_failslab(), remove the overhead of calling the
> noinline function should_fail_alloc_page() with a static key that guards
> the allocation hotpath callsite and is controlled by the fault and error
> injection frameworks.
> 
> Signed-off-by: Vlastimil Babka 

Reviewed-by: Roman Gushchin 

Thanks!



Re: [PATCH RFC 3/4] mm, slab: add static key for should_failslab()

2024-05-31 Thread Roman Gushchin
On Fri, May 31, 2024 at 11:33:34AM +0200, Vlastimil Babka wrote:
> Since commit 4f6923fbb352 ("mm: make should_failslab always available for
> fault injection") should_failslab() is unconditionally a noinline
> function. This adds visible overhead to the slab allocation hotpath,
> even if the function is empty. With CONFIG_FAILSLAB=y there's additional
> overhead when the functionality is not enabled by a boot parameter or
> debugfs.
> 
> The overhead can be eliminated with a static key around the callsite.
> Fault injection and error injection frameworks can now be told that
> this function has an associated static key, and are able to enable and
> disable it accordingly.
> 
> Signed-off-by: Vlastimil Babka 

Reviewed-by: Roman Gushchin 



Re: [PATCH RFC 0/4] static key support for error injection functions

2024-05-31 Thread Roman Gushchin
On Fri, May 31, 2024 at 11:33:31AM +0200, Vlastimil Babka wrote:
> Incomplete, help needed from ftrace/kprobe and bpf folks.
> 
> As previously mentioned by myself [1] and others [2] the functions
> designed for error injection can bring visible overhead in fastpaths
> such as slab or page allocation, because even if nothing hooks into them
> at a given moment, they are noninline function calls regardless of
> CONFIG_ options since commits 4f6923fbb352 ("mm: make should_failslab
> always available for fault injection") and af3b854492f3
> ("mm/page_alloc.c: allow error injection").
> 
> Live patching their callsites has also been suggested in both [1] and
> [2] threads, and this is an attempt to do that with static keys that
> guard the call sites. When disabled, the error injection functions still
> exist and are noinline, but are not being called. Any of the existing
> mechanisms that can inject errors should make sure to enable the
> respective static key. I have added that support to some of them but
> need help with the others.

I think it's a clever idea and makes total sense!
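
(For illustration only: a rough sketch of the callsite pattern discussed in
this series. The key name and the wrapper helper are invented here; this is
not the actual patch.)

DEFINE_STATIC_KEY_FALSE(should_failslab_active);

noinline int should_failslab(struct kmem_cache *s, gfp_t gfpflags);

static __always_inline int slab_fault_hook(struct kmem_cache *s, gfp_t flags)
{
	/* compiles to a NOP jump while the key is disabled */
	if (static_branch_unlikely(&should_failslab_active))
		return should_failslab(s, flags);
	return 0;
}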

> 
> - the legacy fault injection, i.e. CONFIG_FAILSLAB and
>   CONFIG_FAIL_PAGE_ALLOC is handled in Patch 1, and can be passed the
>   address of the static key if it exists. The key will be activated if the
>   fault injection probability becomes non-zero, and deactivated in the
>   opposite transition. This also removes the overhead of the evaluation
>   (on top of the noninline function call) when these mechanisms are
>   configured in the kernel but unused at the moment.
> 
> - the generic error injection using kretprobes with
>   override_function_with_return is handled in patch 2. The
>   ALLOW_ERROR_INJECTION() annotation is extended so that static key
>   address can be passed, and the framework controls it when error
>   injection is enabled or disabled in debugfs for the function.
> 
> There are two more users I know of but am not familiar enough to fix up
> myself. I hope people that are more familiar can help me here.
> 
> - ftrace seems to be using override_function_with_return from
>   #define ftrace_override_function_with_return but I found no place
>   where the latter is used. I assume it might be hidden behind more
>   macro magic? But the point is if ftrace can be instructed to act like
>   an error injection, it would also have to use some form of metadata
>   (from patch 2 presumably?) to get to the static key and control it.
> 
>   If ftrace can only observe the function being called, maybe it
>   wouldn't be wrong to just observe nothing if the static key isn't
>   enabled because nobody is doing the fault injection?
> 
> - bpftrace, as can be seen from the example in commit 4f6923fbb352
>   description. I suppose bpf is already aware what functions the
>   currently loaded bpf programs hook into, so that it could look up the
>   static key and control it. Maybe using again the metadata from patch 2,
>   or extending its own, as I've noticed there's e.g. BTF_ID(func,
>   should_failslab)
> 
> Now I realize maybe handling this at the k(ret)probe level would be
> sufficient for all cases except the legacy fault injection from Patch 1?
> Also wanted to note that, AFAIU, using the static_key_slow_dec/inc
> API (as done in patches 1/2) should allow all mechanisms to coexist
> naturally without fighting each other on the static key state, and also
> handle the reference count for e.g. active probes or bpf programs if
> there's no similar internal mechanism.
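
(Sketch of the shared reference counting described above; the helper names are
invented, only static_key_slow_inc()/static_key_slow_dec() are the real APIs.
Each mechanism takes a reference while it is active, so the branch stays
enabled as long as at least one injection user needs it.)

static void fault_attr_key_enable(struct static_key *key)
{
	if (key)
		static_key_slow_inc(key);	/* e.g. probability became non-zero */
}

static void fault_attr_key_disable(struct static_key *key)
{
	if (key)
		static_key_slow_dec(key);	/* last user gone: branch is patched out again */
}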
> 
> Patches 3 and 4 implement the static keys for the two mm fault injection
> sites in slab and page allocators. For a quick demonstration I've run a
> VM and the simple test from [1] that stresses the slab allocator and got
> this time before the series:
> 
> real    0m8.349s
> user    0m0.694s
> sys     0m7.648s
> 
> with perf showing
> 
>0.61%  nonexistent  [kernel.kallsyms]  [k] should_failslab.constprop.0
>0.00%  nonexistent  [kernel.kallsyms]  [k] should_fail_alloc_page  
> 
> And after the series
> 
> real    0m7.924s
> user    0m0.727s
> sys     0m7.191s

Is "user" increase a measurement error or it's real?

Otherwise, nice savings!



Re: Re: Re: general protection fault in refill_obj_stock

2024-04-04 Thread Roman Gushchin
On Tue, Apr 02, 2024 at 02:14:58PM +0800, Ubisectech Sirius wrote:
> > On Tue, Apr 02, 2024 at 09:50:54AM +0800, Ubisectech Sirius wrote:
> >>> On Mon, Apr 01, 2024 at 03:04:46PM +0800, Ubisectech Sirius wrote:
> >>> Hello.
> >>> We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. 
> >>> Recently, our team has discovered an issue in Linux kernel 6.7. Attached
> >>> to the email was a PoC file of the issue.
> >>
> >>> Thank you for the report!
> >>
> >>> I tried to compile and run your test program for about half an hour
> >>> on a virtual machine running 6.7 with enabled KASAN, but wasn't able
> >>> to reproduce the problem.
> >> 
> >>> Can you, please, share a bit more information? How long does it take
> >>> to reproduce? Do you mind sharing your kernel config? Is there anything 
> >>> special
> >>> about your setup? What are exact steps to reproduce the problem?
> >>> Is this problem reproducible on 6.6?
> >> 
> >> Hi. 
> >> The .config of Linux kernel 6.7 has been sent to you as an attachment.
> > Thanks!
> > How long does it take to reproduce the problem? Do you just start your
> > reproducer and wait?
> I just start the reproducer and wait without any other operation. The speed
> of reproducing this problem is very fast (less than 5 seconds).
> >> And the problem is reproducible on 6.6.
> > Hm, it rules out my recent changes.
> > Did you try any older kernels? 6.5? 6.0? Did you try to bisect the problem?
> > If it's fast to reproduce, it might be the best option.
> I have tried the 6.0, 6.3, 6.4 and 6.5 kernels. Linux kernel 6.5 gives the same
> error output, but the other versions give different output, like below:
> [ 55.306672][ T7950] KASAN: null-ptr-deref in range 
> [0x0018-0x001f]
> [ 55.307259][ T7950] CPU: 1 PID: 7950 Comm: poc Not tainted 6.3.0 #1
> [ 55.307714][ T7950] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> BIOS 1.15.0-1 04/01/2014
> [ 55.308363][ T7950] RIP: 0010:tomoyo_check_acl (security/tomoyo/domain.c:173)
> [ 55.316475][ T7950] Call Trace:
> [ 55.316713][ T7950] 
> [ 55.317353][ T7950] tomoyo_path_permission (security/tomoyo/file.c:170 
> security/tomoyo/file.c:587 security/tomoyo/file.c:573)
> [ 55.317744][ T7950] tomoyo_check_open_permission (security/tomoyo/file.c:779)
> [ 55.320152][ T7950] tomoyo_file_open (security/tomoyo/tomoyo.c:332 
> security/tomoyo/tomoyo.c:327)
> [ 55.320495][ T7950] security_file_open (security/security.c:1719 
> (discriminator 13))
> [ 55.320850][ T7950] do_dentry_open (fs/open.c:908)
> [ 55.321526][ T7950] path_openat (fs/namei.c:3561 fs/namei.c:3715)
> [ 55.322614][ T7950] do_filp_open (fs/namei.c:3743)
> [ 55.325086][ T7950] do_sys_openat2 (fs/open.c:1349)
> [ 55.326249][ T7950] __x64_sys_openat (fs/open.c:1375)
> [ 55.327428][ T7950] do_syscall_64 (arch/x86/entry/common.c:50 
> arch/x86/entry/common.c:80)
> [ 55.327756][ T7950] entry_SYSCALL_64_after_hwframe 
> (arch/x86/entry/entry_64.S:120)
> [ 55.328185][ T7950] RIP: 0033:0x7f1c4a484f29
> [ 55.328504][ T7950] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 
> 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 
> <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 37 8f 0d 00 f7 d8 64 89 01 48
> [ 55.329864][ T7950] RSP: 002b:7ffd7bfe8398 EFLAGS: 0246 ORIG_RAX: 
> 0101
> [ 55.330464][ T7950] RAX: ffda RBX:  RCX: 
> 7f1c4a484f29
> [ 55.331024][ T7950] RDX: 00141842 RSI: 2380 RDI: 
> ff9c
> [ 55.331585][ T7950] RBP: 7ffd7bfe83a0 R08:  R09: 
> 7ffd7bfe83f0
> [ 55.332148][ T7950] R10:  R11: 0246 R12: 
> 55c5e36482d0
> [ 55.332707][ T7950] R13:  R14:  R15: 
> 
> [ 55.333268][ T7950] 
> [ 55.333488][ T7950] Modules linked in:
> [ 55.340525][ T7950] ---[ end trace  ]---
> [ 55.340936][ T7950] RIP: 0010:tomoyo_check_acl (security/tomoyo/domain.c:173)
> Does it look like a different problem?
> > Also, are you running vanilla kernels or do you have some custom changes on
> > top?
> I haven't made any custom changes. 
> >Thanks!

Ok, I installed a new toolchain, built a kernel with your config and reproduced
the (a?) problem.
It definitely smells like generic memory corruption, as I get new stacktraces
every time I run it.
I got something similar to your tomoyo stacktrace, then I got something about
ima_add_template_entry() and then something else. I never saw your original
obj_cgroup_get() stack.

It seems to be connected to your very full kernel config, as I can't reproduce
anything with my original, more minimal config. It also doesn't seem to be
connected to the kernel memory accounting directly.

It would be helpful to understand which kernel config options are required to
reproduce the issue, as well as what exactly the reproducer does. I'll try to
spend some cycles on this, but can't promise much.

Thanks!



Re: Re: Re: general protection fault in refill_obj_stock

2024-04-03 Thread Roman Gushchin
On Tue, Apr 02, 2024 at 02:14:58PM +0800, Ubisectech Sirius wrote:
> > On Tue, Apr 02, 2024 at 09:50:54AM +0800, Ubisectech Sirius wrote:
> >>> On Mon, Apr 01, 2024 at 03:04:46PM +0800, Ubisectech Sirius wrote:
> >>> Hello.
> >>> We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. 
> >>> Recently, our team has discovered an issue in Linux kernel 6.7. Attached
> >>> to the email was a PoC file of the issue.
> >>
> >>> Thank you for the report!
> >>
> >>> I tried to compile and run your test program for about half an hour
> >>> on a virtual machine running 6.7 with enabled KASAN, but wasn't able
> >>> to reproduce the problem.
> >> 
> >>> Can you, please, share a bit more information? How long does it take
> >>> to reproduce? Do you mind sharing your kernel config? Is there anything 
> >>> special
> >>> about your setup? What are exact steps to reproduce the problem?
> >>> Is this problem reproducible on 6.6?
> >> 
> >> Hi. 
> >> The .config of Linux kernel 6.7 has been sent to you as an attachment.
> > Thanks!
> > How long does it take to reproduce the problem? Do you just start your
> > reproducer and wait?
> I just start the reproducer and wait without any other operation. The speed
> of reproducing this problem is very fast (less than 5 seconds).
> >> And the problem is reproducible on 6.6.
> > Hm, it rules out my recent changes.
> > Did you try any older kernels? 6.5? 6.0? Did you try to bisect the problem?
> > If it's fast to reproduce, it might be the best option.
> I have tried the 6.0, 6.3, 6.4 and 6.5 kernels. Linux kernel 6.5 gives the same
> error output, but the other versions give different output, like below:
> [ 55.306672][ T7950] KASAN: null-ptr-deref in range 
> [0x0018-0x001f]
> [ 55.307259][ T7950] CPU: 1 PID: 7950 Comm: poc Not tainted 6.3.0 #1
> [ 55.307714][ T7950] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> BIOS 1.15.0-1 04/01/2014
> [ 55.308363][ T7950] RIP: 0010:tomoyo_check_acl (security/tomoyo/domain.c:173)
> [ 55.316475][ T7950] Call Trace:
> [ 55.316713][ T7950] 
> [ 55.317353][ T7950] tomoyo_path_permission (security/tomoyo/file.c:170 
> security/tomoyo/file.c:587 security/tomoyo/file.c:573)
> [ 55.317744][ T7950] tomoyo_check_open_permission (security/tomoyo/file.c:779)
> [ 55.320152][ T7950] tomoyo_file_open (security/tomoyo/tomoyo.c:332 
> security/tomoyo/tomoyo.c:327)
> [ 55.320495][ T7950] security_file_open (security/security.c:1719 
> (discriminator 13))
> [ 55.320850][ T7950] do_dentry_open (fs/open.c:908)
> [ 55.321526][ T7950] path_openat (fs/namei.c:3561 fs/namei.c:3715)
> [ 55.322614][ T7950] do_filp_open (fs/namei.c:3743)
> [ 55.325086][ T7950] do_sys_openat2 (fs/open.c:1349)
> [ 55.326249][ T7950] __x64_sys_openat (fs/open.c:1375)
> [ 55.327428][ T7950] do_syscall_64 (arch/x86/entry/common.c:50 
> arch/x86/entry/common.c:80)
> [ 55.327756][ T7950] entry_SYSCALL_64_after_hwframe 
> (arch/x86/entry/entry_64.S:120)
> [ 55.328185][ T7950] RIP: 0033:0x7f1c4a484f29
> [ 55.328504][ T7950] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 
> 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 
> <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 37 8f 0d 00 f7 d8 64 89 01 48
> [ 55.329864][ T7950] RSP: 002b:7ffd7bfe8398 EFLAGS: 0246 ORIG_RAX: 
> 0101
> [ 55.330464][ T7950] RAX: ffda RBX:  RCX: 
> 7f1c4a484f29
> [ 55.331024][ T7950] RDX: 00141842 RSI: 2380 RDI: 
> ff9c
> [ 55.331585][ T7950] RBP: 7ffd7bfe83a0 R08:  R09: 
> 7ffd7bfe83f0
> [ 55.332148][ T7950] R10:  R11: 0246 R12: 
> 55c5e36482d0
> [ 55.332707][ T7950] R13:  R14:  R15: 
> 
> [ 55.333268][ T7950] 
> [ 55.333488][ T7950] Modules linked in:
> [ 55.340525][ T7950] ---[ end trace  ]---
> [ 55.340936][ T7950] RIP: 0010:tomoyo_check_acl (security/tomoyo/domain.c:173)
> Does it look like a different problem?

It does look different.

I can't reproduce any of those. I ran into some build-time issues when trying to
build the kernel with your config (I have a fairly old toolchain, maybe that's
the reason), but when running a more minimalistic config I do not see any issues
on 6.1, 6.6 and 6.7.
Is this some sort of all-yes config or is it somehow specially crafted? Did you
try to reproduce the problem with other kernel configs?

It all smells like memory corruption, but who knows.

Thanks!



Re: Re: general protection fault in refill_obj_stock

2024-04-01 Thread Roman Gushchin
On Tue, Apr 02, 2024 at 09:50:54AM +0800, Ubisectech Sirius wrote:
> > On Mon, Apr 01, 2024 at 03:04:46PM +0800, Ubisectech Sirius wrote:
> > Hello.
> > We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. 
> > Recently, our team has discovered an issue in Linux kernel 6.7. Attached to
> > the email was a PoC file of the issue.
> 
> > Thank you for the report!
> 
> > I tried to compile and run your test program for about half an hour
> > on a virtual machine running 6.7 with enabled KASAN, but wasn't able
> > to reproduce the problem.
> 
> > Can you, please, share a bit more information? How long does it take
> > to reproduce? Do you mind sharing your kernel config? Is there anything 
> > special
> > about your setup? What are exact steps to reproduce the problem?
> > Is this problem reproducible on 6.6?
> 
> Hi. 
> The .config of Linux kernel 6.7 has been sent to you as an attachment.

Thanks!

How long does it take to reproduce the problem? Do you just start your
reproducer and wait?

> And the problem is reproducible on 6.6.

Hm, it rules out my recent changes.
Did you try any older kernels? 6.5? 6.0? Did you try to bisect the problem?
If it's fast to reproduce, it might be the best option.

Also, are you running vanilla kernels or do you have some custom changes on top?

Thanks!



Re: general protection fault in refill_obj_stock

2024-04-01 Thread Roman Gushchin
On Mon, Apr 01, 2024 at 03:04:46PM +0800, Ubisectech Sirius wrote:
> Hello.
> We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. 
> Recently, our team has discovered an issue in Linux kernel 6.7. Attached to
> the email was a PoC file of the issue.

Thank you for the report!

I tried to compile and run your test program for about half an hour
on a virtual machine running 6.7 with enabled KASAN, but wasn't able
to reproduce the problem.

Can you, please, share a bit more information? How long does it take
to reproduce? Do you mind sharing your kernel config? Is there anything special
about your setup? What are exact steps to reproduce the problem?
Is this problem reproducible on 6.6?

It's interesting that the problem looks like a use-after-free for the objcg
pointer, but happens in the context of udev-systemd, which I believe should be
fairly stable and whose cgroup is not going anywhere.

Thanks!



Re: [PATCH 4/4] percpu: use reclaim threshold instead of running for every page

2021-04-20 Thread Roman Gushchin
On Mon, Apr 19, 2021 at 10:50:47PM +, Dennis Zhou wrote:
> The last patch implements reclaim by adding 2 additional lists where a
> chunk's lifecycle is:
>   active_slot -> to_depopulate_slot -> sidelined_slot
> 
> This worked great because we're able to nicely converge paths into
> isolation. However, it's a bit aggressive to run for every free page.
> Let's accumulate a few free pages before we do this. To do this, the new
> lifecycle is:
>   active_slot -> sidelined_slot -> to_depopulate_slot -> sidelined_slot
> 
> The transition from sidelined_slot -> to_depopulate_slot now occurs on a
> threshold, instead of going directly to the to_depopulate_slot as before.
> pcpu_nr_isolated_empty_pop_pages[] is introduced to aid with this.
> 
> Suggested-by: Roman Gushchin 
> Signed-off-by: Dennis Zhou 

Acked-by: Roman Gushchin 

Thanks, Dennis!
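
(For context, the threshold-based transition described above boils down to a
check of roughly this shape. This is a sketch: the constant and function names
are approximations, only pcpu_nr_isolated_empty_pop_pages[] comes from the
patch description.)

#define PCPU_EMPTY_POP_RECLAIM_THRESHOLD	4	/* illustrative value */

static void pcpu_maybe_depopulate(int type)
{
	/* accumulate a few empty populated pages before isolating chunks */
	if (pcpu_nr_isolated_empty_pop_pages[type] <
	    PCPU_EMPTY_POP_RECLAIM_THRESHOLD)
		return;

	/* move sidelined chunks to the to_depopulate slot and kick the worker */
	pcpu_schedule_balance_work();
}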


Re: [PATCH 2/4] percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1

2021-04-20 Thread Roman Gushchin
On Mon, Apr 19, 2021 at 10:50:45PM +, Dennis Zhou wrote:
> This prepares for adding a to_depopulate list and sidelined list after
> the free slot in the set of lists in pcpu_slot.
> 
> Signed-off-by: Dennis Zhou 

Acked-by: Roman Gushchin 


Re: [RFC] memory reserve for userspace oom-killer

2021-04-20 Thread Roman Gushchin
On Mon, Apr 19, 2021 at 06:44:02PM -0700, Shakeel Butt wrote:
> Proposal: Provide memory guarantees to userspace oom-killer.
> 
> Background:
> 
> Issues with kernel oom-killer:
> 1. Very conservative and prefer to reclaim. Applications can suffer
> for a long time.
> 2. Borrows the context of the allocator which can be resource limited
> (low sched priority or limited CPU quota).
> 3. Serialized by global lock.
> 4. Very simplistic oom victim selection policy.
> 
> These issues are resolved through userspace oom-killer by:
> 1. Ability to monitor arbitrary metrics (PSI, vmstat, memcg stats) to
> early detect suffering.
> 2. Independent process context which can be given dedicated CPU quota
> and high scheduling priority.
> 3. Can be more aggressive as required.
> 4. Can implement sophisticated business logic/policies.
> 
> Android's LMKD and Facebook's oomd are the prime examples of userspace
> oom-killers. One of the biggest challenges for userspace oom-killers
> is that they potentially have to function under intense memory pressure and
> are prone to getting stuck in memory reclaim themselves. Current userspace
> oom-killers aim to avoid this situation by preallocating user memory
> and protecting themselves from global reclaim by either mlocking or
> memory.min. However, a new allocation from the userspace oom-killer can
> still get stuck in reclaim, and policy-rich oom-killers do trigger
> new allocations through syscalls or even the heap.
> 
> Our attempt at a userspace oom-killer faces similar challenges.
> Particularly at the tail on the very highly utilized machines we have
> observed userspace oom-killer spectacularly failing in many possible
> ways in the direct reclaim. We have seen oom-killer stuck in direct
> reclaim throttling, stuck in reclaim and allocations from interrupts
> keep stealing reclaimed memory. We have even observed systems where
> all the processes were stuck in throttle_direct_reclaim() and only
> kswapd was running and the interrupts kept stealing the memory
> reclaimed by kswapd.
> 
> To reliably solve this problem, we need to give guaranteed memory to
> the userspace oom-killer. At the moment we are contemplating between
> the following options and I would like to get some feedback.
> 
> 1. prctl(PF_MEMALLOC)
> 
> The idea is to give userspace oom-killer (just one thread which is
> finding the appropriate victims and will be sending SIGKILLs) access
> to MEMALLOC reserves. Most of the time the preallocation, mlock and
> memory.min will be good enough but for rare occasions, when the
> userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
> protect it from reclaim and let the allocation dip into the memory
> reserves.
> 
> The misuse of this feature would be risky but it can be limited to
> privileged applications. Userspace oom-killer is the only appropriate
> user of this feature. This option is simple to implement.

Hello Shakeel!

If ordinary PAGE_SIZE and smaller kernel allocations start to fail,
the system is already in a relatively bad shape. Arguably the userspace
OOM killer should kick in earlier; at that point it's already a bit too late.
Allowing it to use reserves just pushes this even further, so we're risking
kernel stability for no good reason.

But I agree that throttling the oom daemon in direct reclaim makes no sense.
I wonder if we can introduce a per-task flag which will exclude the task from
throttling, but instead all (large) allocations will just fail more easily
under significant memory pressure. In this case, if there is a significant
memory shortage the oom daemon will not be fully functional (it will get -ENOMEM
for an attempt to read some stats, for example), but it will still be able to
kill some processes and make forward progress.
But maybe it can be done in userspace too: by splitting the daemon into a core
and an extended part, and avoiding doing anything beyond the bare minimum in
the core part.
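
(To make the "core part" idea a bit more concrete, here is a hypothetical
userspace sketch; the file layout, buffer size and the victim-selection stub
are made up, this is not oomd/LMKD code. The point is to pin memory up front
and keep the monitor/kill loop free of allocations.)

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

static char scratch[1 << 20];	/* preallocated buffer for reading stats */

/* Stub: a real daemon would parse PSI/memcg stats from 'buf' here. */
static pid_t pick_victim(const char *buf, ssize_t len)
{
	(void)buf; (void)len;
	return -1;
}

int main(void)
{
	int psi = open("/proc/pressure/memory", O_RDONLY);

	if (psi < 0 || mlockall(MCL_CURRENT | MCL_FUTURE)) {
		perror("setup");
		return 1;
	}
	memset(scratch, 0, sizeof(scratch));	/* fault all pages in now */

	for (;;) {
		ssize_t len;
		pid_t victim;

		lseek(psi, 0, SEEK_SET);
		len = read(psi, scratch, sizeof(scratch) - 1);
		victim = (len > 0) ? pick_victim(scratch, len) : -1;
		if (victim > 0)
			kill(victim, SIGKILL);	/* no allocations on this path */
		sleep(1);
	}
}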

> 
> 2. Mempool
> 
> The idea is to preallocate mempool with a given amount of memory for
> userspace oom-killer. Preferably this will be per-thread and
> oom-killer can preallocate mempool for its specific threads. The core
> page allocator can check before going to the reclaim path if the task
> has private access to the mempool and return page from it if yes.
> 
> This option would be more complicated than the previous option as the
> lifecycle of the page from the mempool would be more sophisticated.
> Additionally the current mempool does not handle higher order pages
> and we might need to extend it to allow such allocations. Though this
> feature might have more use-cases and it would be less risky than the
> previous option.

It looks like overkill for protecting the oom daemon, but if there
are other good use cases, maybe it's a good feature to have.

> 
> Another idea I had was to use kthread based oom-killer and provide the
> policies through eBPF program. Though I am not sure how to make it
> monitor arbitrary metrics and if that can be done without any
> 

Re: [PATCH v3 0/6] percpu: partial chunk depopulation

2021-04-16 Thread Roman Gushchin
On Sat, Apr 17, 2021 at 01:14:03AM +0530, Pratik Sampat wrote:
> 
> 
> On 17/04/21 12:39 am, Roman Gushchin wrote:
> > On Sat, Apr 17, 2021 at 12:11:37AM +0530, Pratik Sampat wrote:
> > > 
> > > On 17/04/21 12:04 am, Roman Gushchin wrote:
> > > > On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote:
> > > > > On 16/04/21 10:43 pm, Roman Gushchin wrote:
> > > > > > On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote:
> > > > > > > Hello Dennis,
> > > > > > > 
> > > > > > > I apologize for the clutter of logs before, I'm pasting the logs 
> > > > > > > of before and
> > > > > > > after the percpu test in the case of the patchset being applied 
> > > > > > > on 5.12-rc6 and
> > > > > > > the vanilla kernel 5.12-rc6.
> > > > > > > 
> > > > > > > On 16/04/21 7:48 pm, Dennis Zhou wrote:
> > > > > > > > Hello,
> > > > > > > > 
> > > > > > > > On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
> > > > > > > > > Hello Roman,
> > > > > > > > > 
> > > > > > > > > I've tried the v3 patch series on a POWER9 and an x86 KVM 
> > > > > > > > > setup.
> > > > > > > > > 
> > > > > > > > > My results of the percpu_test are as follows:
> > > > > > > > > Intel KVM 4CPU:4G
> > > > > > > > > Vanilla 5.12-rc6
> > > > > > > > > # ./percpu_test.sh
> > > > > > > > > Percpu: 1952 kB
> > > > > > > > > Percpu:   219648 kB
> > > > > > > > > Percpu:   219648 kB
> > > > > > > > > 
> > > > > > > > > 5.12-rc6 + with patchset applied
> > > > > > > > > # ./percpu_test.sh
> > > > > > > > > Percpu: 2080 kB
> > > > > > > > > Percpu:   219712 kB
> > > > > > > > > Percpu:    72672 kB
> > > > > > > > > 
> > > > > > > > > I'm able to see improvement comparable to that of what you're 
> > > > > > > > > see too.
> > > > > > > > > 
> > > > > > > > > However, on POWERPC I'm unable to reproduce these 
> > > > > > > > > improvements with the patchset in the same configuration
> > > > > > > > > 
> > > > > > > > > POWER9 KVM 4CPU:4G
> > > > > > > > > Vanilla 5.12-rc6
> > > > > > > > > # ./percpu_test.sh
> > > > > > > > > Percpu: 5888 kB
> > > > > > > > > Percpu:   118272 kB
> > > > > > > > > Percpu:   118272 kB
> > > > > > > > > 
> > > > > > > > > 5.12-rc6 + with patchset applied
> > > > > > > > > # ./percpu_test.sh
> > > > > > > > > Percpu: 6144 kB
> > > > > > > > > Percpu:   119040 kB
> > > > > > > > > Percpu:   119040 kB
> > > > > > > > > 
> > > > > > > > > I'm wondering if there's any architectural specific code that 
> > > > > > > > > needs plumbing
> > > > > > > > > here?
> > > > > > > > > 
> > > > > > > > There shouldn't be. Can you send me the percpu_stats debug 
> > > > > > > > output before
> > > > > > > > and after?
> > > > > > > I'll paste the whole debug stats before and after here.
> > > > > > > 5.12-rc6 + patchset
> > > > > > > -BEFORE-
> > > > > > > Percpu Memory Statistics
> > > > > > > Allocation Info:
> > > > > > Hm, this looks highly suspicious. Here is your stats in a more 
> > > > > > compact form:
> > > > > > 
> > > > > > Vanilla
> > > > > > 
> > > > > > nr_alloc: 9038 nr_alloc:
> >

Re: [PATCH v3 0/6] percpu: partial chunk depopulation

2021-04-16 Thread Roman Gushchin
On Sat, Apr 17, 2021 at 12:11:37AM +0530, Pratik Sampat wrote:
> 
> 
> On 17/04/21 12:04 am, Roman Gushchin wrote:
> > On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote:
> > > 
> > > On 16/04/21 10:43 pm, Roman Gushchin wrote:
> > > > On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote:
> > > > > Hello Dennis,
> > > > > 
> > > > > I apologize for the clutter of logs before, I'm pasting the logs of 
> > > > > before and
> > > > > after the percpu test in the case of the patchset being applied on 
> > > > > 5.12-rc6 and
> > > > > the vanilla kernel 5.12-rc6.
> > > > > 
> > > > > On 16/04/21 7:48 pm, Dennis Zhou wrote:
> > > > > > Hello,
> > > > > > 
> > > > > > On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
> > > > > > > Hello Roman,
> > > > > > > 
> > > > > > > I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
> > > > > > > 
> > > > > > > My results of the percpu_test are as follows:
> > > > > > > Intel KVM 4CPU:4G
> > > > > > > Vanilla 5.12-rc6
> > > > > > > # ./percpu_test.sh
> > > > > > > Percpu: 1952 kB
> > > > > > > Percpu:   219648 kB
> > > > > > > Percpu:   219648 kB
> > > > > > > 
> > > > > > > 5.12-rc6 + with patchset applied
> > > > > > > # ./percpu_test.sh
> > > > > > > Percpu: 2080 kB
> > > > > > > Percpu:   219712 kB
> > > > > > > Percpu:    72672 kB
> > > > > > > 
> > > > > > > I'm able to see improvement comparable to that of what you're see 
> > > > > > > too.
> > > > > > > 
> > > > > > > However, on POWERPC I'm unable to reproduce these improvements 
> > > > > > > with the patchset in the same configuration
> > > > > > > 
> > > > > > > POWER9 KVM 4CPU:4G
> > > > > > > Vanilla 5.12-rc6
> > > > > > > # ./percpu_test.sh
> > > > > > > Percpu: 5888 kB
> > > > > > > Percpu:   118272 kB
> > > > > > > Percpu:   118272 kB
> > > > > > > 
> > > > > > > 5.12-rc6 + with patchset applied
> > > > > > > # ./percpu_test.sh
> > > > > > > Percpu: 6144 kB
> > > > > > > Percpu:   119040 kB
> > > > > > > Percpu:   119040 kB
> > > > > > > 
> > > > > > > I'm wondering if there's any architectural specific code that 
> > > > > > > needs plumbing
> > > > > > > here?
> > > > > > > 
> > > > > > There shouldn't be. Can you send me the percpu_stats debug output 
> > > > > > before
> > > > > > and after?
> > > > > I'll paste the whole debug stats before and after here.
> > > > > 5.12-rc6 + patchset
> > > > > -BEFORE-
> > > > > Percpu Memory Statistics
> > > > > Allocation Info:
> > > > Hm, this looks highly suspicious. Here is your stats in a more compact 
> > > > form:
> > > > 
> > > > Vanilla
> > > > 
> > > > nr_alloc: 9038 nr_alloc:
> > > > 97046
> > > > nr_dealloc  : 6992 nr_dealloc  :
> > > > 94237
> > > > nr_cur_alloc: 2046 nr_cur_alloc:
> > > >  2809
> > > > nr_max_alloc: 2178 nr_max_alloc:
> > > > 90054
> > > > nr_chunks   :3 nr_chunks   :
> > > >11
> > > > nr_max_chunks   :3 nr_max_chunks   :
> > > >47
> > > > min_alloc_size  :4 min_alloc_size  :
> > > > 4
> > > > max_alloc_size  : 1072 max_alloc_size  :
> > > >  1072
> > > > empty_pop_pages 

Re: [PATCH v3 0/6] percpu: partial chunk depopulation

2021-04-16 Thread Roman Gushchin
On Fri, Apr 16, 2021 at 11:57:03PM +0530, Pratik Sampat wrote:
> 
> 
> On 16/04/21 10:43 pm, Roman Gushchin wrote:
> > On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote:
> > > Hello Dennis,
> > > 
> > > I apologize for the clutter of logs before, I'm pasting the logs of 
> > > before and
> > > after the percpu test in the case of the patchset being applied on 
> > > 5.12-rc6 and
> > > the vanilla kernel 5.12-rc6.
> > > 
> > > On 16/04/21 7:48 pm, Dennis Zhou wrote:
> > > > Hello,
> > > > 
> > > > On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
> > > > > Hello Roman,
> > > > > 
> > > > > I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
> > > > > 
> > > > > My results of the percpu_test are as follows:
> > > > > Intel KVM 4CPU:4G
> > > > > Vanilla 5.12-rc6
> > > > > # ./percpu_test.sh
> > > > > Percpu: 1952 kB
> > > > > Percpu:   219648 kB
> > > > > Percpu:   219648 kB
> > > > > 
> > > > > 5.12-rc6 + with patchset applied
> > > > > # ./percpu_test.sh
> > > > > Percpu: 2080 kB
> > > > > Percpu:   219712 kB
> > > > > Percpu:    72672 kB
> > > > > 
> > > > > I'm able to see improvement comparable to that of what you're see too.
> > > > > 
> > > > > However, on POWERPC I'm unable to reproduce these improvements with 
> > > > > the patchset in the same configuration
> > > > > 
> > > > > POWER9 KVM 4CPU:4G
> > > > > Vanilla 5.12-rc6
> > > > > # ./percpu_test.sh
> > > > > Percpu: 5888 kB
> > > > > Percpu:   118272 kB
> > > > > Percpu:   118272 kB
> > > > > 
> > > > > 5.12-rc6 + with patchset applied
> > > > > # ./percpu_test.sh
> > > > > Percpu: 6144 kB
> > > > > Percpu:   119040 kB
> > > > > Percpu:   119040 kB
> > > > > 
> > > > > I'm wondering if there's any architectural specific code that needs 
> > > > > plumbing
> > > > > here?
> > > > > 
> > > > There shouldn't be. Can you send me the percpu_stats debug output before
> > > > and after?
> > > I'll paste the whole debug stats before and after here.
> > > 5.12-rc6 + patchset
> > > -BEFORE-
> > > Percpu Memory Statistics
> > > Allocation Info:
> > 
> > Hm, this looks highly suspicious. Here is your stats in a more compact form:
> > 
> > Vanilla
> > 
> > nr_alloc: 9038 nr_alloc:
> > 97046
> > nr_dealloc  : 6992 nr_dealloc  :94237
> > nr_cur_alloc: 2046 nr_cur_alloc: 2809
> > nr_max_alloc: 2178 nr_max_alloc:90054
> > nr_chunks   :3 nr_chunks   :   11
> > nr_max_chunks   :3 nr_max_chunks   :   47
> > min_alloc_size  :4 min_alloc_size  :4
> > max_alloc_size  : 1072 max_alloc_size  : 1072
> > empty_pop_pages :5 empty_pop_pages :   29
> > 
> > 
> > Patched
> > 
> > nr_alloc: 9040 nr_alloc:
> > 97048
> > nr_dealloc  : 6994 nr_dealloc  :95002
> > nr_cur_alloc: 2046 nr_cur_alloc: 2046
> > nr_max_alloc: 2208 nr_max_alloc:90054
> > nr_chunks   :3 nr_chunks   :   48
> > nr_max_chunks   :3 nr_max_chunks   :   48
> > min_alloc_size  :4 min_alloc_size  :4
> > max_alloc_size  : 1072 max_alloc_size  : 1072
> > empty_pop_pages :   12 empty_pop_pages :   61
> > 
> > 
> > So it looks like the number of chunks got bigger, as well as the number of
> > empty_pop_pages? This contradicts to what you wrote, so can you, please, 
> > make
> > sure that the data is correct and we're not messing tw

Re: [PATCH v3 0/6] percpu: partial chunk depopulation

2021-04-16 Thread Roman Gushchin
On Fri, Apr 16, 2021 at 08:58:33PM +0530, Pratik Sampat wrote:
> Hello Dennis,
> 
> I apologize for the clutter of logs before, I'm pasting the logs of before and
> after the percpu test in the case of the patchset being applied on 5.12-rc6 
> and
> the vanilla kernel 5.12-rc6.
> 
> On 16/04/21 7:48 pm, Dennis Zhou wrote:
> > Hello,
> > 
> > On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
> > > Hello Roman,
> > > 
> > > I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
> > > 
> > > My results of the percpu_test are as follows:
> > > Intel KVM 4CPU:4G
> > > Vanilla 5.12-rc6
> > > # ./percpu_test.sh
> > > Percpu: 1952 kB
> > > Percpu:   219648 kB
> > > Percpu:   219648 kB
> > > 
> > > 5.12-rc6 + with patchset applied
> > > # ./percpu_test.sh
> > > Percpu: 2080 kB
> > > Percpu:   219712 kB
> > > Percpu:    72672 kB
> > > 
> > > I'm able to see improvement comparable to that of what you're see too.
> > > 
> > > However, on POWERPC I'm unable to reproduce these improvements with the 
> > > patchset in the same configuration
> > > 
> > > POWER9 KVM 4CPU:4G
> > > Vanilla 5.12-rc6
> > > # ./percpu_test.sh
> > > Percpu: 5888 kB
> > > Percpu:   118272 kB
> > > Percpu:   118272 kB
> > > 
> > > 5.12-rc6 + with patchset applied
> > > # ./percpu_test.sh
> > > Percpu: 6144 kB
> > > Percpu:   119040 kB
> > > Percpu:   119040 kB
> > > 
> > > I'm wondering if there's any architectural specific code that needs 
> > > plumbing
> > > here?
> > > 
> > There shouldn't be. Can you send me the percpu_stats debug output before
> > and after?
> 
> I'll paste the whole debug stats before and after here.
> 5.12-rc6 + patchset
> -BEFORE-
> Percpu Memory Statistics
> Allocation Info:


Hm, this looks highly suspicious. Here is your stats in a more compact form:

Vanilla

nr_alloc: 9038 nr_alloc:97046
nr_dealloc  : 6992 nr_dealloc  :94237
nr_cur_alloc: 2046 nr_cur_alloc: 2809
nr_max_alloc: 2178 nr_max_alloc:90054
nr_chunks   :3 nr_chunks   :   11
nr_max_chunks   :3 nr_max_chunks   :   47
min_alloc_size  :4 min_alloc_size  :4
max_alloc_size  : 1072 max_alloc_size  : 1072
empty_pop_pages :5 empty_pop_pages :   29


Patched

nr_alloc: 9040 nr_alloc:97048
nr_dealloc  : 6994 nr_dealloc  :95002
nr_cur_alloc: 2046 nr_cur_alloc: 2046
nr_max_alloc: 2208 nr_max_alloc:90054
nr_chunks   :3 nr_chunks   :   48
nr_max_chunks   :3 nr_max_chunks   :   48
min_alloc_size  :4 min_alloc_size  :4
max_alloc_size  : 1072 max_alloc_size  : 1072
empty_pop_pages :   12 empty_pop_pages :   61


So it looks like the number of chunks got bigger, as well as the number of
empty_pop_pages? This contradicts what you wrote, so can you, please, make
sure that the data is correct and we're not mixing up two cases?

So it looks like for some reason sidelined (depopulated) chunks are not getting
freed completely. But I struggle to explain why the initial empty_pop_pages is
bigger with the same number of chunks.

So, can you, please, apply the following patch and provide updated statistics?

--

From d0d2bfdb891afec6bd63790b3492b852db490640 Mon Sep 17 00:00:00 2001
From: Roman Gushchin 
Date: Fri, 16 Apr 2021 09:54:38 -0700
Subject: [PATCH] percpu: include sidelined and depopulating chunks into debug
 output

Information about sidelined chunks and chunks in the depopulate queue
could be extremely valuable for debugging different problems.

Dump information about these chunks on pair with regular chunks
in percpu slots via percpu stats interface.

Signed-off-by: Roman Gushchin 
---
 mm/percpu-internal.h |  2 ++
 mm/percpu-stats.c| 10 ++
 mm/percpu.c  |  4 ++--
 3 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 8e432

Re: [PATCH v3 0/6] percpu: partial chunk depopulation

2021-04-16 Thread Roman Gushchin
On Fri, Apr 16, 2021 at 02:18:10PM +, Dennis Zhou wrote:
> Hello,
> 
> On Fri, Apr 16, 2021 at 06:26:15PM +0530, Pratik Sampat wrote:
> > Hello Roman,
> > 
> > I've tried the v3 patch series on a POWER9 and an x86 KVM setup.
> > 
> > My results of the percpu_test are as follows:
> > Intel KVM 4CPU:4G
> > Vanilla 5.12-rc6
> > # ./percpu_test.sh
> > Percpu: 1952 kB
> > Percpu:   219648 kB
> > Percpu:   219648 kB
> > 
> > 5.12-rc6 + with patchset applied
> > # ./percpu_test.sh
> > Percpu: 2080 kB
> > Percpu:   219712 kB
> > Percpu:    72672 kB
> > 
> > I'm able to see improvement comparable to that of what you're see too.
> > 
> > However, on POWERPC I'm unable to reproduce these improvements with the 
> > patchset in the same configuration
> > 
> > POWER9 KVM 4CPU:4G
> > Vanilla 5.12-rc6
> > # ./percpu_test.sh
> > Percpu: 5888 kB
> > Percpu:   118272 kB
> > Percpu:   118272 kB
> > 
> > 5.12-rc6 + with patchset applied
> > # ./percpu_test.sh
> > Percpu: 6144 kB
> > Percpu:   119040 kB
> > Percpu:   119040 kB
> > 
> > I'm wondering if there's any architectural specific code that needs plumbing
> > here?
> > 
> 
> There shouldn't be. Can you send me the percpu_stats debug output before
> and after?

Btw, sidelined chunks are not listed in the debug output. It was actually on my
to-do list, looks like I need to prioritize it a bit.

> 
> > I will also look through the code to find the reason why POWER isn't
> > depopulating pages.
> > 
> > Thank you,
> > Pratik
> > 
> > On 08/04/21 9:27 am, Roman Gushchin wrote:
> > > In our production experience the percpu memory allocator is sometimes 
> > > struggling
> > > with returning the memory to the system. A typical example is a creation 
> > > of
> > > several thousands memory cgroups (each has several chunks of the percpu 
> > > data
> > > used for vmstats, vmevents, ref counters etc). Deletion and complete 
> > > releasing
> > > of these cgroups doesn't always lead to a shrinkage of the percpu memory,
> > > so that sometimes there are several GB's of memory wasted.
> > > 
> > > The underlying problem is the fragmentation: to release an underlying 
> > > chunk
> > > all percpu allocations should be released first. The percpu allocator 
> > > tends
> > > to top up chunks to improve the utilization. It means new small-ish 
> > > allocations
> > > (e.g. percpu ref counters) are placed onto almost filled old-ish chunks,
> > > effectively pinning them in memory.
> > > 
> > > This patchset solves this problem by implementing a partial depopulation
> > > of percpu chunks: chunks with many empty pages are being asynchronously
> > > depopulated and the pages are returned to the system.
> > > 
> > > To illustrate the problem the following script can be used:
> > > 
> > > --
> > > #!/bin/bash
> > > 
> > > cd /sys/fs/cgroup
> > > 
> > > mkdir percpu_test
> > > echo "+memory" > percpu_test/cgroup.subtree_control
> > > 
> > > cat /proc/meminfo | grep Percpu
> > > 
> > > for i in `seq 1 1000`; do
> > >  mkdir percpu_test/cg_"${i}"
> > >  for j in `seq 1 10`; do
> > >   mkdir percpu_test/cg_"${i}"_"${j}"
> > >  done
> > > done
> > > 
> > > cat /proc/meminfo | grep Percpu
> > > 
> > > for i in `seq 1 1000`; do
> > >  for j in `seq 1 10`; do
> > >   rmdir percpu_test/cg_"${i}"_"${j}"
> > >  done
> > > done
> > > 
> > > sleep 10
> > > 
> > > cat /proc/meminfo | grep Percpu
> > > 
> > > for i in `seq 1 1000`; do
> > >  rmdir percpu_test/cg_"${i}"
> > > done
> > > 
> > > rmdir percpu_test
> > > --
> > > 
> > > It creates 11000 memory cgroups and removes every 10 out of 11.
> > > It prints the initial size of the percpu memory, the size after
> > > creating all cgroups and the size after deleting most of them.
> > > 
> > > Results:
> > >vanilla:
> > >  ./percpu_test.sh
> > >  Percpu: 7488 kB
> > >  Percpu:   481152 kB

Re: [PATCH 4/7] mm: memcontrol: simplify lruvec_holds_page_lru_lock

2021-04-13 Thread Roman Gushchin
On Tue, Apr 13, 2021 at 02:51:50PM +0800, Muchun Song wrote:
> We already have a helper lruvec_memcg() to get the memcg from lruvec, we
> do not need to do it ourselves in the lruvec_holds_page_lru_lock(). So use
> lruvec_memcg() instead. And if mem_cgroup_disabled() returns false, the
> page_memcg(page) (the LRU pages) cannot be NULL. So remove the odd logic
> of "memcg = page_memcg(page) ? : root_mem_cgroup". And use lruvec_pgdat
> to simplify the code. We can have a single definition for this function
> that works for !CONFIG_MEMCG, CONFIG_MEMCG + mem_cgroup_disabled() and
> CONFIG_MEMCG.
> 
> Signed-off-by: Muchun Song 
> Acked-by: Johannes Weiner 

Acked-by: Roman Gushchin 


> ---
>  include/linux/memcontrol.h | 31 +++
>  1 file changed, 7 insertions(+), 24 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4f49865c9958..38b8d3fb24ff 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -755,22 +755,6 @@ static inline struct lruvec 
> *mem_cgroup_page_lruvec(struct page *page)
>   return mem_cgroup_lruvec(memcg, pgdat);
>  }
>  
> -static inline bool lruvec_holds_page_lru_lock(struct page *page,
> -   struct lruvec *lruvec)
> -{
> - pg_data_t *pgdat = page_pgdat(page);
> - const struct mem_cgroup *memcg;
> - struct mem_cgroup_per_node *mz;
> -
> - if (mem_cgroup_disabled())
> - return lruvec == &pgdat->__lruvec;
> -
> - mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> - memcg = page_memcg(page) ? : root_mem_cgroup;
> -
> - return lruvec->pgdat == pgdat && mz->memcg == memcg;
> -}
> -
>  struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
>  
>  struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
> @@ -1229,14 +1213,6 @@ static inline struct lruvec 
> *mem_cgroup_page_lruvec(struct page *page)
>   return >__lruvec;
>  }
>  
> -static inline bool lruvec_holds_page_lru_lock(struct page *page,
> -   struct lruvec *lruvec)
> -{
> - pg_data_t *pgdat = page_pgdat(page);
> -
> - return lruvec == &pgdat->__lruvec;
> -}
> -
>  static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page 
> *page)
>  {
>  }
> @@ -1518,6 +1494,13 @@ static inline void 
> unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
>   spin_unlock_irqrestore(&lruvec->lru_lock, flags);
>  }
>  
> +static inline bool lruvec_holds_page_lru_lock(struct page *page,
> +   struct lruvec *lruvec)
> +{
> + return lruvec_pgdat(lruvec) == page_pgdat(page) &&
> +lruvec_memcg(lruvec) == page_memcg(page);
> +}
> +
>  /* Don't lock again iff page's lruvec locked */
>  static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
>   struct lruvec *locked_lruvec)
> -- 
> 2.11.0
> 


Re: [PATCH 6/7] mm: memcontrol: move obj_cgroup_uncharge_pages() out of css_set_lock

2021-04-13 Thread Roman Gushchin
On Tue, Apr 13, 2021 at 02:51:52PM +0800, Muchun Song wrote:
> The css_set_lock is used to guard the list of inherited objcgs. So there
> is no need to uncharge kernel memory under css_set_lock. Just move it
> out of the lock.
> 
> Signed-off-by: Muchun Song 

Acked-by: Roman Gushchin 

> ---
>  mm/memcontrol.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 42d8c0f4ab1d..d9c7e44abcd0 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -289,9 +289,10 @@ static void obj_cgroup_release(struct percpu_ref *ref)
>   WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
>   nr_pages = nr_bytes >> PAGE_SHIFT;
>  
> - spin_lock_irqsave(&css_set_lock, flags);
>   if (nr_pages)
>   obj_cgroup_uncharge_pages(objcg, nr_pages);
> +
> + spin_lock_irqsave(&css_set_lock, flags);
>   list_del(&objcg->list);
>   spin_unlock_irqrestore(&css_set_lock, flags);
>  
> -- 
> 2.11.0
> 


Re: [PATCH 5/7] mm: memcontrol: simplify the logic of objcg pinning memcg

2021-04-13 Thread Roman Gushchin
On Tue, Apr 13, 2021 at 02:51:51PM +0800, Muchun Song wrote:
> The obj_cgroup_release() and memcg_reparent_objcgs() are serialized by
> the css_set_lock. We do not need to care about objcg->memcg being
> released in the process of obj_cgroup_release(). So there is no need
> to pin memcg before releasing objcg. Remove that pinning logic to
> simplify the code.
> 
> There are only two places that modifies the objcg->memcg. One is the
> initialization to objcg->memcg in the memcg_online_kmem(), another
> is objcgs reparenting in the memcg_reparent_objcgs(). It is also
> impossible for the two to run in parallel. So xchg() is unnecessary
> and it is enough to use WRITE_ONCE().
> 
> Signed-off-by: Muchun Song 
> Acked-by: Johannes Weiner 

It's a good one! It took me some time to realize that it's safe.
Thanks!

Acked-by: Roman Gushchin 

> ---
>  mm/memcontrol.c | 20 ++--
>  1 file changed, 6 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1f807448233e..42d8c0f4ab1d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -261,7 +261,6 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup 
> *objcg,
>  static void obj_cgroup_release(struct percpu_ref *ref)
>  {
>   struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
> - struct mem_cgroup *memcg;
>   unsigned int nr_bytes;
>   unsigned int nr_pages;
>   unsigned long flags;
> @@ -291,11 +290,9 @@ static void obj_cgroup_release(struct percpu_ref *ref)
>   nr_pages = nr_bytes >> PAGE_SHIFT;
>  
>   spin_lock_irqsave(&css_set_lock, flags);
> - memcg = obj_cgroup_memcg(objcg);
>   if (nr_pages)
>   obj_cgroup_uncharge_pages(objcg, nr_pages);
>   list_del(&objcg->list);
> - mem_cgroup_put(memcg);
>   spin_unlock_irqrestore(&css_set_lock, flags);
>  
>   percpu_ref_exit(ref);
> @@ -330,17 +327,12 @@ static void memcg_reparent_objcgs(struct mem_cgroup 
> *memcg,
>  
>   spin_lock_irq(&css_set_lock);
>  
> - /* Move active objcg to the parent's list */
> - xchg(&objcg->memcg, parent);
> - css_get(&parent->css);
> - list_add(&objcg->list, &parent->objcg_list);
> -
> - /* Move already reparented objcgs to the parent's list */
> - list_for_each_entry(iter, &memcg->objcg_list, list) {
> - css_get(&parent->css);
> - xchg(&iter->memcg, parent);
> - css_put(&memcg->css);
> - }
> + /* 1) Ready to reparent active objcg. */
> + list_add(&objcg->list, &memcg->objcg_list);
> + /* 2) Reparent active objcg and already reparented objcgs to parent. */
> + list_for_each_entry(iter, &memcg->objcg_list, list)
> + WRITE_ONCE(iter->memcg, parent);
> + /* 3) Move already reparented objcgs to the parent's list */
>   list_splice(&memcg->objcg_list, &parent->objcg_list);
>  
>   spin_unlock_irq(&css_set_lock);
> -- 
> 2.11.0
> 


Re: [PATCH 3/7] mm: memcontrol: remove the pgdata parameter of mem_cgroup_page_lruvec

2021-04-13 Thread Roman Gushchin
On Tue, Apr 13, 2021 at 02:51:49PM +0800, Muchun Song wrote:
> All the callers of mem_cgroup_page_lruvec() just pass page_pgdat(page)
> as the 2nd parameter to it (except isolate_migratepages_block()). But
> for isolate_migratepages_block(), the page_pgdat(page) is also equal
> to the local variable @pgdat. So mem_cgroup_page_lruvec() does not
> need the pgdat parameter. Just remove it to simplify the code.
> 
> Signed-off-by: Muchun Song 
> Acked-by: Johannes Weiner 

Acked-by: Roman Gushchin 

> ---
>  include/linux/memcontrol.h | 10 +-
>  mm/compaction.c|  2 +-
>  mm/memcontrol.c|  9 +++--
>  mm/swap.c  |  2 +-
>  mm/workingset.c|  2 +-
>  5 files changed, 11 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index c960fd49c3e8..4f49865c9958 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -743,13 +743,12 @@ static inline struct lruvec *mem_cgroup_lruvec(struct 
> mem_cgroup *memcg,
>  /**
>   * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
>   * @page: the page
> - * @pgdat: pgdat of the page
>   *
>   * This function relies on page->mem_cgroup being stable.
>   */
> -static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
> - struct pglist_data *pgdat)
> +static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page)
>  {
> + pg_data_t *pgdat = page_pgdat(page);
>   struct mem_cgroup *memcg = page_memcg(page);
>  
>   VM_WARN_ON_ONCE_PAGE(!memcg && !mem_cgroup_disabled(), page);
> @@ -1223,9 +1222,10 @@ static inline struct lruvec *mem_cgroup_lruvec(struct 
> mem_cgroup *memcg,
>   return &pgdat->__lruvec;
>  }
>  
> -static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
> - struct pglist_data *pgdat)
> +static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page)
>  {
> + pg_data_t *pgdat = page_pgdat(page);
> +
>   return &pgdat->__lruvec;
>  }
>  
> diff --git a/mm/compaction.c b/mm/compaction.c
> index caa4c36c1db3..e7da342003dd 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1033,7 +1033,7 @@ isolate_migratepages_block(struct compact_control *cc, 
> unsigned long low_pfn,
>   if (!TestClearPageLRU(page))
>   goto isolate_fail_put;
>  
> - lruvec = mem_cgroup_page_lruvec(page, pgdat);
> + lruvec = mem_cgroup_page_lruvec(page);
>  
>   /* If we already hold the lock, we can skip some rechecking */
>   if (lruvec != locked) {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9cbfff59b171..1f807448233e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1177,9 +1177,8 @@ void lruvec_memcg_debug(struct lruvec *lruvec, struct 
> page *page)
>  struct lruvec *lock_page_lruvec(struct page *page)
>  {
>   struct lruvec *lruvec;
> - struct pglist_data *pgdat = page_pgdat(page);
>  
> - lruvec = mem_cgroup_page_lruvec(page, pgdat);
> + lruvec = mem_cgroup_page_lruvec(page);
>   spin_lock(&lruvec->lru_lock);
>  
>   lruvec_memcg_debug(lruvec, page);
> @@ -1190,9 +1189,8 @@ struct lruvec *lock_page_lruvec(struct page *page)
>  struct lruvec *lock_page_lruvec_irq(struct page *page)
>  {
>   struct lruvec *lruvec;
> - struct pglist_data *pgdat = page_pgdat(page);
>  
> - lruvec = mem_cgroup_page_lruvec(page, pgdat);
> + lruvec = mem_cgroup_page_lruvec(page);
>   spin_lock_irq(&lruvec->lru_lock);
>  
>   lruvec_memcg_debug(lruvec, page);
> @@ -1203,9 +1201,8 @@ struct lruvec *lock_page_lruvec_irq(struct page *page)
>  struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long 
> *flags)
>  {
>   struct lruvec *lruvec;
> - struct pglist_data *pgdat = page_pgdat(page);
>  
> - lruvec = mem_cgroup_page_lruvec(page, pgdat);
> + lruvec = mem_cgroup_page_lruvec(page);
>   spin_lock_irqsave(&lruvec->lru_lock, *flags);
>  
>   lruvec_memcg_debug(lruvec, page);
> diff --git a/mm/swap.c b/mm/swap.c
> index a75a8265302b..e0d5699213cc 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -313,7 +313,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, 
> unsigned int nr_pages)
>  
>  void lru_note_cost_page(struct page *page)
>  {
> - lru_note_cost(mem_cgroup_page_lruvec(page, page_pgdat(page)),
> + lru_note_cost(mem_cgroup_page_lruvec(page),
> page_is_file_lru(page), thp_nr_pages(page));
>  }
>  
> diff --git 

Re: [PATCH 2/7] mm: memcontrol: bail out early when !mm in get_mem_cgroup_from_mm

2021-04-13 Thread Roman Gushchin
On Tue, Apr 13, 2021 at 02:51:48PM +0800, Muchun Song wrote:
> When mm is NULL, we do not need to hold rcu lock and call css_tryget for
> the root memcg. And we also do not need to check !mm in every iteration
> of the while loop. So bail out early when !mm.
> 
> Signed-off-by: Muchun Song 
> Acked-by: Johannes Weiner 
> Reviewed-by: Shakeel Butt 

Acked-by: Roman Gushchin 

Nice!

> ---
>  mm/memcontrol.c | 21 ++---
>  1 file changed, 10 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f229de925aa5..9cbfff59b171 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -901,20 +901,19 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct 
> mm_struct *mm)
>   if (mem_cgroup_disabled())
>   return NULL;
>  
> + /*
> +  * Page cache insertions can happen without an
> +  * actual mm context, e.g. during disk probing
> +  * on boot, loopback IO, acct() writes etc.
> +  */
> + if (unlikely(!mm))
> + return root_mem_cgroup;
> +
>   rcu_read_lock();
>   do {
> - /*
> -  * Page cache insertions can happen without an
> -  * actual mm context, e.g. during disk probing
> -  * on boot, loopback IO, acct() writes etc.
> -  */
> - if (unlikely(!mm))
> + memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> + if (unlikely(!memcg))
>   memcg = root_mem_cgroup;
> - else {
> - memcg = 
> mem_cgroup_from_task(rcu_dereference(mm->owner));
> - if (unlikely(!memcg))
> - memcg = root_mem_cgroup;
> - }
>   } while (!css_tryget(&memcg->css));
>   rcu_read_unlock();
>   return memcg;
> -- 
> 2.11.0
> 


Re: [PATCH 1/7] mm: memcontrol: fix page charging in page replacement

2021-04-13 Thread Roman Gushchin
On Tue, Apr 13, 2021 at 02:51:47PM +0800, Muchun Song wrote:
> The pages aren't accounted at the root level, so do not charge the page
> to the root memcg in page replacement. Although we do not display the
> value (mem_cgroup_usage), so there shouldn't be any actual problem,
> there is a WARN_ON_ONCE in page_counter_cancel(). Who knows if it
> will trigger? So it is better to fix it.
> 
> Signed-off-by: Muchun Song 
> Acked-by: Johannes Weiner 
> Reviewed-by: Shakeel Butt 

Acked-by: Roman Gushchin 

Thanks!
> ---
>  mm/memcontrol.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 64ada9e650a5..f229de925aa5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6806,9 +6806,11 @@ void mem_cgroup_migrate(struct page *oldpage, struct 
> page *newpage)
>   /* Force-charge the new page. The old one will be freed soon */
>   nr_pages = thp_nr_pages(newpage);
>  
> - page_counter_charge(&memcg->memory, nr_pages);
> - if (do_memsw_account())
> - page_counter_charge(&memcg->memsw, nr_pages);
> + if (!mem_cgroup_is_root(memcg)) {
> + page_counter_charge(&memcg->memory, nr_pages);
> + if (do_memsw_account())
> + page_counter_charge(&memcg->memsw, nr_pages);
> + }
>  
>   css_get(&memcg->css);
>   commit_charge(newpage, memcg);
> -- 
> 2.11.0
> 


Re: [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead

2021-04-12 Thread Roman Gushchin
On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
> With the recent introduction of the new slab memory controller, we
> eliminate the need for having separate kmemcaches for each memory
> cgroup and reduce overall kernel memory usage. However, we also add
> additional memory accounting overhead to each call of kmem_cache_alloc()
> and kmem_cache_free().
> 
> Workloads that require a lot of kmemcache allocations and
> de-allocations may experience performance regression as illustrated
> in [1].
> 
> With a simple kernel module that performs a repeated loop of 100,000,000
> kmem_cache_alloc() and kmem_cache_free() calls on a 64-byte object at module
> init, the execution times to load the kernel module with and without
> memory accounting were:
> 
>   with accounting = 6.798s
>   w/o  accounting = 1.758s
> 
> That is an increase of 5.04s (287%). With this patchset applied, the
> execution time became 4.254s. So the memory accounting overhead is now
> 2.496s which is a 50% reduction.
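
(For reference, the kind of micro-benchmark described above can be approximated
with a trivial module like the following. This is an approximation, not the
actual test code; SLAB_ACCOUNT is used here to get the accounted case.)

#include <linux/module.h>
#include <linux/slab.h>

static int __init kmem_bench_init(void)
{
	struct kmem_cache *cachep;
	unsigned long i;
	void *obj;

	/* drop SLAB_ACCOUNT to measure the unaccounted case */
	cachep = kmem_cache_create("bench64", 64, 0, SLAB_ACCOUNT, NULL);
	if (!cachep)
		return -ENOMEM;

	for (i = 0; i < 100000000UL; i++) {
		obj = kmem_cache_alloc(cachep, GFP_KERNEL);
		if (obj)
			kmem_cache_free(cachep, obj);
	}

	kmem_cache_destroy(cachep);
	return 0;
}

static void __exit kmem_bench_exit(void) { }

module_init(kmem_bench_init);
module_exit(kmem_bench_exit);
MODULE_LICENSE("GPL");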

Btw, there were two recent independent reports about benchmark
regressions caused by the introduction of the per-object accounting:
1) Xing reported a hackbench regression:
https://lkml.org/lkml/2021/1/13/1277
2) Masayoshi reported a pgbench regression:
https://www.spinics.net/lists/linux-mm/msg252540.html

I wonder if you can run them (or at least one) and attach the result
to the series? It would be very helpful.

Thank you!


Re: [PATCH 5/5] mm/memcg: Optimize user context object stock access

2021-04-12 Thread Roman Gushchin
On Fri, Apr 09, 2021 at 07:18:42PM -0400, Waiman Long wrote:
> Most kmem_cache_alloc() calls are from user context. With instrumentation
> enabled, the measured amount of kmem_cache_alloc() calls from non-task
> context was about 0.01% of the total.
> 
> The irq disable/enable sequence used in this case to access content
> from object stock is slow.  To optimize for user context access, there
> are now two object stocks for task context and interrupt context access
> respectively.
> 
> The task context object stock can be accessed after disabling preemption
> which is cheap in non-preempt kernel. The interrupt context object stock
> can only be accessed after disabling interrupt. User context code can
> access interrupt object stock, but not vice versa.
> 
> The mod_objcg_state() function is also modified to make sure that memcg
> and lruvec stat updates are done with interrupts disabled.
> 
> The downside of this change is that there are more data stored in local
> object stocks and not reflected in the charge counter and the vmstat
> arrays.  However, this is a small price to pay for better performance.

I agree, the extra memory space is not a significant concern.
I'd be more worried about the code complexity, but the result looks
nice to me!

Acked-by: Roman Gushchin 

Btw, it seems that the mm tree ran a bit off, so I had to apply this series
on top of Linus's tree to review. Please, rebase.

Thanks!
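
The access pattern under discussion, reduced to its core, looks roughly like this; the names below are illustrative, not the ones from the patch:

/*
 * Rough sketch of the two-stock idea: task context uses a stock that is
 * protected by disabling preemption only, irq context uses a separate
 * stock behind local_irq_save().  Illustrative names and fields only.
 */
struct obj_stock_sketch {
    struct obj_cgroup *cached_objcg;
    unsigned int nr_bytes;
};

struct memcg_stock_sketch {
    struct obj_stock_sketch task_obj;    /* preempt-disabled access */
    struct obj_stock_sketch irq_obj;     /* irq-disabled access */
};

static DEFINE_PER_CPU(struct memcg_stock_sketch, sketch_stock);

static struct obj_stock_sketch *get_obj_stock(unsigned long *pflags)
{
    struct memcg_stock_sketch *stock;

    if (likely(in_task())) {
        preempt_disable();               /* cheap on non-preempt kernels */
        stock = this_cpu_ptr(&sketch_stock);
        return &stock->task_obj;
    }
    local_irq_save(*pflags);             /* only needed from irq context */
    stock = this_cpu_ptr(&sketch_stock);
    return &stock->irq_obj;
}

static void put_obj_stock(unsigned long flags)
{
    if (likely(in_task()))
        preempt_enable();
    else
        local_irq_restore(flags);
}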


Re: [PATCH 4/5] mm/memcg: Separate out object stock data into its own struct

2021-04-12 Thread Roman Gushchin
On Fri, Apr 09, 2021 at 07:18:41PM -0400, Waiman Long wrote:
> The object stock data stored in struct memcg_stock_pcp are independent
> of the other page based data stored there. Separating them out into
> their own struct to highlight the independency.
> 
> Signed-off-by: Waiman Long 

Acked-by: Roman Gushchin 



Re: [PATCH 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp

2021-04-12 Thread Roman Gushchin
On Fri, Apr 09, 2021 at 07:18:40PM -0400, Waiman Long wrote:
> Before the new slab memory controller with per object byte charging,
> charging and vmstat data update happen only when new slab pages are
> allocated or freed. Now they are done with every kmem_cache_alloc()
> and kmem_cache_free(). This causes additional overhead for workloads
> that generate a lot of alloc and free calls.
> 
> The memcg_stock_pcp is used to cache byte charge for a specific
> obj_cgroup to reduce that overhead. To further reduce it, this patch
> caches the vmstat data in the memcg_stock_pcp structure as well,
> until it accumulates a page size worth of updates or other cached
> data change.

The idea makes total sense to me and also gives a hope to remove
byte-sized vmstats in the long-term.

> 
> On a 2-socket Cascade Lake server with instrumentation enabled and this
> patch applied, it was found that about 17% (946796 out of 5515184) of the
> calls to __mod_obj_stock_state() led to an actual call to
> mod_objcg_state() after initial boot. When doing a parallel kernel build,
> the figure was about 16% (21894614 out of 139780628). So caching the
> vmstat data reduces the number of calls to mod_objcg_state() by more
> than 80%.
> 
> Signed-off-by: Waiman Long 
> ---
>  mm/memcontrol.c | 78 +++--
>  mm/slab.h   | 26 +++--
>  2 files changed, 79 insertions(+), 25 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b19100c68aa0..539c3b632e47 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2220,7 +2220,10 @@ struct memcg_stock_pcp {
>  
>  #ifdef CONFIG_MEMCG_KMEM
>   struct obj_cgroup *cached_objcg;
> + struct pglist_data *cached_pgdat;
>   unsigned int nr_bytes;
> + int vmstat_idx;
> + int vmstat_bytes;
>  #endif

Because vmstat_idx can realistically take only 3 values (slab_reclaimable,
slab_unreclaimable and percpu), I wonder if it's better to have
vmstat_bytes[3] and save a bit more on the reduced number of flushes?
It must be an often case when a complex (reclaimable) kernel object has
non-reclaimable parts (e.g. kmallocs) or percpu counters.
If the difference will be too small, maybe the current form is better.

>  
>   struct work_struct work;
> @@ -3157,6 +3160,21 @@ void __memcg_kmem_uncharge_page(struct page *page, int 
> order)
>   css_put(&memcg->css);
>  }
>  
> +static inline void mod_objcg_state(struct obj_cgroup *objcg,
> +struct pglist_data *pgdat,
> +enum node_stat_item idx, int nr)
> +{
> + struct mem_cgroup *memcg;
> + struct lruvec *lruvec = NULL;
> +
> + rcu_read_lock();
> + memcg = obj_cgroup_memcg(objcg);
> + if (pgdat)
> + lruvec = mem_cgroup_lruvec(memcg, pgdat);
> + __mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
> + rcu_read_unlock();
> +}
> +
>  static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int 
> nr_bytes)
>  {
>   struct memcg_stock_pcp *stock;
> @@ -3207,6 +3225,14 @@ static void drain_obj_stock(struct memcg_stock_pcp 
> *stock)
>   stock->nr_bytes = 0;
>   }
>  
> + if (stock->vmstat_bytes) {
> + mod_objcg_state(old, stock->cached_pgdat, stock->vmstat_idx,
> + stock->vmstat_bytes);
> + stock->vmstat_bytes = 0;
> + stock->vmstat_idx = 0;
> + stock->cached_pgdat = NULL;
> + }
> +
>   obj_cgroup_put(old);
>   stock->cached_objcg = NULL;
>  }
> @@ -3251,6 +3277,48 @@ static void refill_obj_stock(struct obj_cgroup *objcg, 
> unsigned int nr_bytes)
>   local_irq_restore(flags);
>  }
>  
> +static void __mod_obj_stock_state(struct obj_cgroup *objcg,
> +   struct pglist_data *pgdat, int idx, int nr)
> +{
> + struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
> +
> + if (stock->cached_objcg != objcg) {
> + /* Output the current data as is */
> + } else if (!stock->vmstat_bytes) {
> + /* Save the current data */
> + stock->vmstat_bytes = nr;
> + stock->vmstat_idx = idx;
> + stock->cached_pgdat = pgdat;
> + nr = 0;
> + } else if ((stock->cached_pgdat != pgdat) ||
> +(stock->vmstat_idx != idx)) {
> + /* Output the cached data & save the current data */
> + swap(nr, stock->vmstat_bytes);
> + swap(idx, stock->vmstat_idx);
> + swap(pgdat, stock->cached_pgdat);
> + } else {
> + stock->vmstat_bytes += nr;
> + if (abs(nr) > PAGE_SIZE) {
> + nr = stock->vmstat_bytes;
> + stock->vmstat_bytes = 0;
> + } else {
> + nr = 0;
> + }
> + }
> + if (nr)
> + mod_objcg_state(objcg, pgdat, idx, nr);
> +}
> +
> +void mod_obj_stock_state(struct 
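
To make the vmstat_bytes[3] suggestion above a little more concrete, the cached state could roughly look like this; this is an untested sketch under the assumption that only the two slab counters and the percpu counter ever reach this path:

/* Sketch of the vmstat_bytes[3] idea; names are illustrative. */
enum stock_vmstat_idx {
    STOCK_SLAB_RECLAIMABLE,      /* NR_SLAB_RECLAIMABLE_B */
    STOCK_SLAB_UNRECLAIMABLE,    /* NR_SLAB_UNRECLAIMABLE_B */
    STOCK_PERCPU,                /* MEMCG_PERCPU_B */
    STOCK_NR_VMSTAT,
};

struct obj_stock_vmstat {
    struct pglist_data *cached_pgdat;
    int vmstat_bytes[STOCK_NR_VMSTAT];
};

/* Flush every cached counter; one mod_objcg_state() call per non-zero entry. */
static void flush_obj_stock_vmstat(struct obj_cgroup *objcg,
                                   struct obj_stock_vmstat *s)
{
    static const int idx_map[STOCK_NR_VMSTAT] = {
        NR_SLAB_RECLAIMABLE_B,
        NR_SLAB_UNRECLAIMABLE_B,
        MEMCG_PERCPU_B,
    };
    int i;

    for (i = 0; i < STOCK_NR_VMSTAT; i++) {
        if (s->vmstat_bytes[i]) {
            mod_objcg_state(objcg, s->cached_pgdat,
                            idx_map[i], s->vmstat_bytes[i]);
            s->vmstat_bytes[i] = 0;
        }
    }
}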

Re: [PATCH 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()

2021-04-12 Thread Roman Gushchin
On Fri, Apr 09, 2021 at 07:18:39PM -0400, Waiman Long wrote:
> In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
> is followed by mod_objcg_state()/mod_memcg_state(). Each of these
> function calls goes through a separate irq_save/irq_restore cycle. That
> is inefficient.  Introduce a new function obj_cgroup_uncharge_mod_state()
> that combines them with a single irq_save/irq_restore cycle.
> 
> Signed-off-by: Waiman Long 

Acked-by: Roman Gushchin 

Thanks!
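
The saving is essentially one irq save/restore pair per free. Schematically it boils down to the following; only obj_cgroup_uncharge_mod_state() itself is named in the patch, the internal helpers below are illustrative stand-ins:

/*
 * Schematic only: do the uncharge and the stat update under a single
 * irq-off section instead of two.  __uncharge_to_stock() and
 * __mod_stock_state() stand in for the patch's internal helpers.
 */
void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
                                   struct pglist_data *pgdat, int idx)
{
    unsigned long flags;

    local_irq_save(flags);
    __uncharge_to_stock(objcg, size);                  /* was one irq-off section */
    __mod_stock_state(objcg, pgdat, idx, -(int)size);  /* was another one */
    local_irq_restore(flags);
}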


Re: [PATCH 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()

2021-04-12 Thread Roman Gushchin
On Fri, Apr 09, 2021 at 07:18:38PM -0400, Waiman Long wrote:
> The caller of mod_memcg_lruvec_state() has both memcg and lruvec readily
> available. So both of them are now passed to mod_memcg_lruvec_state()
> and __mod_memcg_lruvec_state(). The __mod_memcg_lruvec_state() is
> updated to allow either of the two parameters to be set to null. This
> makes mod_memcg_lruvec_state() equivalent to mod_memcg_state() if lruvec
> is null.

This patch seems to be correct, but it's a bit hard to understand why
it's required without looking into the rest of the series. Can you, please,
add a couple of words about it? E.g. we need it to handle stats which do not
exist on the lruvec level...

Otherwise,
Acked-by: Roman Gushchin 

Thanks!

> 
> Signed-off-by: Waiman Long 
> ---
>  include/linux/memcontrol.h | 12 +++-
>  mm/memcontrol.c| 19 +--
>  mm/slab.h  |  2 +-
>  3 files changed, 21 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 0c04d39a7967..95f12996e66c 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -955,8 +955,8 @@ static inline unsigned long 
> lruvec_page_state_local(struct lruvec *lruvec,
>   return x;
>  }
>  
> -void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> -   int val);
> +void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, struct lruvec 
> *lruvec,
> +   enum node_stat_item idx, int val);
>  void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val);
>  
>  static inline void mod_lruvec_kmem_state(void *p, enum node_stat_item idx,
> @@ -969,13 +969,14 @@ static inline void mod_lruvec_kmem_state(void *p, enum 
> node_stat_item idx,
>   local_irq_restore(flags);
>  }
>  
> -static inline void mod_memcg_lruvec_state(struct lruvec *lruvec,
> +static inline void mod_memcg_lruvec_state(struct mem_cgroup *memcg,
> +   struct lruvec *lruvec,
> enum node_stat_item idx, int val)
>  {
>   unsigned long flags;
>  
>   local_irq_save(flags);
> - __mod_memcg_lruvec_state(lruvec, idx, val);
> + __mod_memcg_lruvec_state(memcg, lruvec, idx, val);
>   local_irq_restore(flags);
>  }
>  
> @@ -1369,7 +1370,8 @@ static inline unsigned long 
> lruvec_page_state_local(struct lruvec *lruvec,
>   return node_page_state(lruvec_pgdat(lruvec), idx);
>  }
>  
> -static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec,
> +static inline void __mod_memcg_lruvec_state(struct mem_cgroup *memcg,
> + struct lruvec *lruvec,
>   enum node_stat_item idx, int val)
>  {
>  }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e064ac0d850a..d66e1e38f8ac 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -799,20 +799,27 @@ parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid)
>   return mem_cgroup_nodeinfo(parent, nid);
>  }
>  
> -void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> -   int val)
> +/*
> + * Either one of memcg or lruvec can be NULL, but not both.
> + */
> +void __mod_memcg_lruvec_state(struct mem_cgroup *memcg, struct lruvec 
> *lruvec,
> +   enum node_stat_item idx, int val)
>  {
>   struct mem_cgroup_per_node *pn;
> - struct mem_cgroup *memcg;
>   long x, threshold = MEMCG_CHARGE_BATCH;
>  
> + /* Update lruvec */
>   pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
> - memcg = pn->memcg;
> +
> + if (!memcg)
> + memcg = pn->memcg;
>  
>   /* Update memcg */
>   __mod_memcg_state(memcg, idx, val);
>  
> - /* Update lruvec */
> + if (!lruvec)
> + return;
> +
>   __this_cpu_add(pn->lruvec_stat_local->count[idx], val);
>  
>   if (vmstat_item_in_bytes(idx))
> @@ -848,7 +855,7 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum 
> node_stat_item idx,
>  
>   /* Update memcg and lruvec */
>   if (!mem_cgroup_disabled())
> - __mod_memcg_lruvec_state(lruvec, idx, val);
> + __mod_memcg_lruvec_state(NULL, lruvec, idx, val);
>  }
>  
>  void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx,
> diff --git a/mm/slab.h b/mm/slab.h
> index 076582f58f68..bc6c7545e487 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -293,7 +293,7 @@ static inline void mod_objcg_state(struct obj_cgroup 
> *objcg,
>   rcu_read_lock();
>   memcg = obj_cgroup_memcg(objcg);
>   lruvec = mem_cgroup_lruvec(memcg, pgdat);
> - mod_memcg_lruvec_state(lruvec, idx, nr);
> + mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
>   rcu_read_unlock();
>  }
>  
> -- 
> 2.18.1
> 
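
A couple of words for the changelog could be as simple as showing the two call patterns the NULL handling enables; the call sites below are illustrative, not taken from the series:

static void example_updates(struct mem_cgroup *memcg, struct lruvec *lruvec,
                            int nr)
{
    /* Both levels: memcg is looked up from the lruvec as before. */
    __mod_memcg_lruvec_state(NULL, lruvec, NR_SLAB_RECLAIMABLE_B, nr);

    /*
     * memcg only: equivalent to mod_memcg_state(), useful for stats
     * that have no per-node component (e.g. MEMCG_PERCPU_B later in
     * the series).
     */
    __mod_memcg_lruvec_state(memcg, NULL, MEMCG_PERCPU_B, nr);
}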


Re: [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead

2021-04-12 Thread Roman Gushchin
On Mon, Apr 12, 2021 at 10:03:13AM -0400, Waiman Long wrote:
> On 4/9/21 9:51 PM, Roman Gushchin wrote:
> > On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
> > > With the recent introduction of the new slab memory controller, we
> > > eliminate the need for having separate kmemcaches for each memory
> > > cgroup and reduce overall kernel memory usage. However, we also add
> > > additional memory accounting overhead to each call of kmem_cache_alloc()
> > > and kmem_cache_free().
> > > 
> > > For workloads that require a lot of kmemcache allocations and
> > > de-allocations, they may experience performance regression as illustrated
> > > in [1].
> > > 
> > > With a simple kernel module that performs repeated loop of 100,000,000
> > > kmem_cache_alloc() and kmem_cache_free() of 64-byte object at module
> > > init. The execution time to load the kernel module with and without
> > > memory accounting were:
> > > 
> > >with accounting = 6.798s
> > >w/o  accounting = 1.758s
> > > 
> > > That is an increase of 5.04s (287%). With this patchset applied, the
> > > execution time became 4.254s. So the memory accounting overhead is now
> > > 2.496s which is a 50% reduction.
> > Hi Waiman!
> > 
> > Thank you for working on it, it's indeed very useful!
> > A couple of questions:
> > 1) did your config included lockdep or not?
> The test kernel is based on a production kernel config and so lockdep isn't
> enabled.
> > 2) do you have a (rough) estimation how much each change contributes
> > to the overall reduction?
> 
> I should have a better breakdown of the effect of individual patches. I
> reran the benchmarking module with turbo-boosting disabled to reduce
> run-to-run variation. The execution times were:
> 
> Before patch: time = 10.800s (with memory accounting), 2.848s (w/o
> accounting), overhead = 7.952s
> After patch 2: time = 9.140s, overhead = 6.292s
> After patch 3: time = 7.641s, overhead = 4.793s
> After patch 5: time = 6.801s, overhead = 3.953s

Thank you! If there will be v2, I'd include this information into commit logs.

> 
> Patches 1 & 4 are preparatory patches that should not affect performance.
> 
> So the memory accounting overhead was reduced by about half.

This is really great!

Thanks!


Re: [RFC PATCH v2 00/18] Use obj_cgroup APIs to charge the LRU pages

2021-04-12 Thread Roman Gushchin
On Mon, Apr 12, 2021 at 01:14:57PM -0400, Johannes Weiner wrote:
> On Fri, Apr 09, 2021 at 06:29:46PM -0700, Roman Gushchin wrote:
> > On Fri, Apr 09, 2021 at 08:29:41PM +0800, Muchun Song wrote:
> > > Since the following patchsets applied. All the kernel memory are charged
> > > with the new APIs of obj_cgroup.
> > > 
> > >   [v17,00/19] The new cgroup slab memory controller
> > >   [v5,0/7] Use obj_cgroup APIs to charge kmem pages
> > > 
> > > But user memory allocations (LRU pages) pinning memcgs for a long time -
> > > it exists at a larger scale and is causing recurring problems in the real
> > > world: page cache doesn't get reclaimed for a long time, or is used by the
> > > second, third, fourth, ... instance of the same job that was restarted 
> > > into
> > > a new cgroup every time. Unreclaimable dying cgroups pile up, waste 
> > > memory,
> > > and make page reclaim very inefficient.
> > > 
> > > We can convert LRU pages and most other raw memcg pins to the objcg 
> > > direction
> > > to fix this problem, and then the LRU pages will not pin the memcgs.
> > > 
> > > This patchset aims to make the LRU pages to drop the reference to memory
> > > cgroup by using the APIs of obj_cgroup. Finally, we can see that the 
> > > number
> > > of the dying cgroups will not increase if we run the following test 
> > > script.
> > > 
> > > ```bash
> > > #!/bin/bash
> > > 
> > > cat /proc/cgroups | grep memory
> > > 
> > > cd /sys/fs/cgroup/memory
> > > 
> > > for i in range{1..500}
> > > do
> > >   mkdir test
> > >   echo $$ > test/cgroup.procs
> > >   sleep 60 &
> > >   echo $$ > cgroup.procs
> > >   echo `cat test/cgroup.procs` > cgroup.procs
> > >   rmdir test
> > > done
> > > 
> > > cat /proc/cgroups | grep memory
> > > ```
> > > 
> > > Patch 1 aims to fix page charging in page replacement.
> > > Patch 2-5 are code cleanup and simplification.
> > > Patch 6-18 convert LRU pages pin to the objcg direction.
> > > 
> > > Any comments are welcome. Thanks.
> > 
> > Indeed the problem exists for a long time and it would be nice to fix it.
> > However I'm against merging the patchset in the current form (there are some
> > nice fixes/clean-ups, which can/must be applied independently). Let me 
> > explain
> > my concerns:
> > 
> > Back to the new slab controller discussion obj_cgroup was suggested by 
> > Johannes
> > as a union of two concepts:
> > 1) reparenting (basically an auto-pointer to a memcg in c++ terms)
> > 2) byte-sized accounting
> > 
> > I was initially against this union because I anticipated that the 
> > reparenting
> > part will be useful separately. And the time told it was true.
> 
> "The idea of moving stocks and leftovers to the memcg_ptr/obj_cgroup
> level is really good."
> 
> https://lore.kernel.org/lkml/20191025200020.ga8...@castle.dhcp.thefacebook.com/
> 
> If you recall, the main concern was how the byte charging interface
> was added to the existing page charging interface, instead of being
> layered on top of it. I suggested to do that and, since there was no
> other user for the indirection pointer, just include it in the API.
> 
> It made sense at the time, and you seemed to agree. But I also agree
> it makes sense to factor it out now that more users are materializing.

Agreed.

> 
> > I still think obj_cgroup API must be significantly reworked before being
> > applied outside of the kmem area: reparenting part must be separated
> > and moved to the cgroup core level to be used not only in the memcg
> > context but also for other controllers, which are facing similar problems.
> > Spilling obj_cgroup API in the current form over all memcg code will
> > make it more complicated and will delay it, given the amount of changes
> > and the number of potential code conflicts.
> > 
> > I'm working on the generalization of obj_cgroup API (as described above)
> > and expect to have some patches next week.
> 
> Yeah, splitting the byte charging API from the reference API and
> making the latter cgroup-generic makes sense. I'm looking forward to
> your patches.
> 
> And yes, the conflicts between that work and Muchun's patches would be
> quite large. However, most of them would come down to renames, since
> the access rules and refcounting sites will remain the same, so it
> shouldn't be too bad to rebase Muchun's patches on yours. And we can
> continue reviewing his patches for correctness for now.

Sounds good to me!

Thanks
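
For readers following along, the object in question looks roughly like this in the kernels of that time, paraphrased from include/linux/memcontrol.h: the reparenting half is the memcg pointer plus the list/rcu machinery, the byte-accounting half is nr_charged_bytes.

/* Paraphrased, not a verbatim copy of the mainline definition. */
struct obj_cgroup {
    struct percpu_ref refcnt;        /* pins the objcg itself */
    struct mem_cgroup *memcg;        /* switched to the parent on offlining */
    atomic_t nr_charged_bytes;       /* byte-sized accounting remainder */
    union {
        struct list_head list;       /* linked into memcg->objcg_list */
        struct rcu_head rcu;
    };
};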


Re: [PATCH 0/5] mm/memcg: Reduce kmemcache memory accounting overhead

2021-04-09 Thread Roman Gushchin
On Fri, Apr 09, 2021 at 07:18:37PM -0400, Waiman Long wrote:
> With the recent introduction of the new slab memory controller, we
> eliminate the need for having separate kmemcaches for each memory
> cgroup and reduce overall kernel memory usage. However, we also add
> additional memory accounting overhead to each call of kmem_cache_alloc()
> and kmem_cache_free().
> 
> For workloads that require a lot of kmemcache allocations and
> de-allocations, they may experience performance regression as illustrated
> in [1].
> 
> With a simple kernel module that performs repeated loop of 100,000,000
> kmem_cache_alloc() and kmem_cache_free() of 64-byte object at module
> init. The execution time to load the kernel module with and without
> memory accounting were:
> 
>   with accounting = 6.798s
>   w/o  accounting = 1.758s
> 
> That is an increase of 5.04s (287%). With this patchset applied, the
> execution time became 4.254s. So the memory accounting overhead is now
> 2.496s which is a 50% reduction.

Hi Waiman!

Thank you for working on it, it's indeed very useful!
A couple of questions:
1) did your config included lockdep or not?
2) do you have a (rough) estimation how much each change contributes
   to the overall reduction?

Thanks!

> 
> It was found that a major part of the memory accounting overhead
> is caused by the local_irq_save()/local_irq_restore() sequences in
> updating local stock charge bytes and vmstat array, at least in x86
> systems. There are two such sequences in kmem_cache_alloc() and two
> in kmem_cache_free(). This patchset tries to reduce the use of such
> sequences as much as possible. In fact, it eliminates them in the common
> case. Another part of this patchset to cache the vmstat data update in
> the local stock as well which also helps.
> 
> [1] 
> https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
> 
> Waiman Long (5):
>   mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()
>   mm/memcg: Introduce obj_cgroup_uncharge_mod_state()
>   mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
>   mm/memcg: Separate out object stock data into its own struct
>   mm/memcg: Optimize user context object stock access
> 
>  include/linux/memcontrol.h |  14 ++-
>  mm/memcontrol.c| 198 -
>  mm/percpu.c|   9 +-
>  mm/slab.h  |  32 +++---
>  4 files changed, 195 insertions(+), 58 deletions(-)
> 
> -- 
> 2.18.1
> 


Re: [RFC PATCH v2 00/18] Use obj_cgroup APIs to charge the LRU pages

2021-04-09 Thread Roman Gushchin
On Fri, Apr 09, 2021 at 08:29:41PM +0800, Muchun Song wrote:
> Since the following patchsets applied. All the kernel memory are charged
> with the new APIs of obj_cgroup.
> 
>   [v17,00/19] The new cgroup slab memory controller
>   [v5,0/7] Use obj_cgroup APIs to charge kmem pages
> 
> But user memory allocations (LRU pages) pinning memcgs for a long time -
> it exists at a larger scale and is causing recurring problems in the real
> world: page cache doesn't get reclaimed for a long time, or is used by the
> second, third, fourth, ... instance of the same job that was restarted into
> a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
> and make page reclaim very inefficient.
> 
> We can convert LRU pages and most other raw memcg pins to the objcg direction
> to fix this problem, and then the LRU pages will not pin the memcgs.
> 
> This patchset aims to make the LRU pages to drop the reference to memory
> cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
> of the dying cgroups will not increase if we run the following test script.
> 
> ```bash
> #!/bin/bash
> 
> cat /proc/cgroups | grep memory
> 
> cd /sys/fs/cgroup/memory
> 
> for i in range{1..500}
> do
>   mkdir test
>   echo $$ > test/cgroup.procs
>   sleep 60 &
>   echo $$ > cgroup.procs
>   echo `cat test/cgroup.procs` > cgroup.procs
>   rmdir test
> done
> 
> cat /proc/cgroups | grep memory
> ```
> 
> Patch 1 aims to fix page charging in page replacement.
> Patch 2-5 are code cleanup and simplification.
> Patch 6-18 convert LRU pages pin to the objcg direction.
> 
> Any comments are welcome. Thanks.

Indeed the problem exists for a long time and it would be nice to fix it.
However I'm against merging the patchset in the current form (there are some
nice fixes/clean-ups, which can/must be applied independently). Let me explain
my concerns:

Back to the new slab controller discussion obj_cgroup was suggested by Johannes
as a union of two concepts:
1) reparenting (basically an auto-pointer to a memcg in c++ terms)
2) byte-sized accounting

I was initially against this union because I anticipated that the reparenting
part will be useful separately. And the time told it was true.

I still think obj_cgroup API must be significantly reworked before being
applied outside of the kmem area: reparenting part must be separated
and moved to the cgroup core level to be used not only in the memcg
context but also for other controllers, which are facing similar problems.
Spilling obj_cgroup API in the current form over all memcg code will
make it more complicated and will delay it, given the amount of changes
and the number of potential code conflicts.

I'm working on the generalization of obj_cgroup API (as described above)
and expect to have some patches next week.

Thanks!


Re: [RFC PATCH v2 09/18] mm: vmscan: remove noinline_for_stack

2021-04-09 Thread Roman Gushchin
On Fri, Apr 09, 2021 at 08:29:50PM +0800, Muchun Song wrote:
> The noinline_for_stack is introduced by commit 666356297ec4 ("vmscan:
> set up pagevec as late as possible in shrink_inactive_list()"), its
> purpose is to delay the allocation of pagevec as late as possible to
> save stack memory. But the commit 2bcf88796381 ("mm: take pagevecs off
> reclaim stack") replace pagevecs by lists of pages_to_free. So we do
> not need noinline_for_stack, just remove it (let the compiler decide
> whether to inline).
> 
> Signed-off-by: Muchun Song 

Acked-by: Roman Gushchin 

> ---
>  mm/vmscan.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 64bf07cc20f2..e40b21298d77 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2015,8 +2015,8 @@ static int too_many_isolated(struct pglist_data *pgdat, 
> int file,
>   *
>   * Returns the number of pages moved to the given lruvec.
>   */
> -static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
> -  struct list_head *list)
> +static unsigned int move_pages_to_lru(struct lruvec *lruvec,
> +   struct list_head *list)
>  {
>   int nr_pages, nr_moved = 0;
>   LIST_HEAD(pages_to_free);
> @@ -2096,7 +2096,7 @@ static int current_may_throttle(void)
>   * shrink_inactive_list() is a helper for shrink_node().  It returns the 
> number
>   * of reclaimed pages
>   */
> -static noinline_for_stack unsigned long
> +static unsigned long
>  shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>struct scan_control *sc, enum lru_list lru)
>  {
> -- 
> 2.11.0
> 


[PATCH v3 6/6] percpu: implement partial chunk depopulation

2021-04-07 Thread Roman Gushchin
This patch implements partial depopulation of percpu chunks.

As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However,
to minimize memory waste it might be useful to depopulate a
partially filled chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.

This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations can't be served
using a sidelined chunk. The chunk can be moved back to a corresponding
slot if there are not enough chunks with empty populated pages.

The depopulation is scheduled on the free path. If the chunk:
  1) has more than 1/4 of its total pages free and populated,
  2) leaves the system with enough free percpu pages aside of it,
  3) isn't the reserved chunk,
  4) isn't the first chunk,
  5) isn't entirely free,
it's a good target for depopulation. If it's already depopulated
but got free populated pages, it's a good target too.
The chunk is moved to a special pcpu_depopulate_list, chunk->isolate
flag is set and the async balancing is scheduled.

The async balancing moves pcpu_depopulate_list to a local list
(because pcpu_depopulate_list can be changed when pcpu_lock is
released), and then tries to depopulate each chunk.  The depopulation
is performed in the reverse direction to keep populated pages close to
the beginning, if the global number of empty pages is reached.
Depopulated chunks are sidelined to prevent further allocations.
Skipped and fully empty chunks are returned to the corresponding slot.

On the allocation path, if there are no suitable chunks found,
the list of sidelined chunks is scanned prior to creating a new chunk.
If there is a good sidelined chunk, it's placed back to the slot
and the scanning is restarted.

Many thanks to Dennis Zhou for his great ideas and a very constructive
discussion which led to many improvements in this patchset!

Signed-off-by: Roman Gushchin 
---
 mm/percpu-internal.h |   2 +
 mm/percpu.c  | 158 ++-
 2 files changed, 158 insertions(+), 2 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 095d7eaa0db4..8e432663c41e 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -67,6 +67,8 @@ struct pcpu_chunk {
 
void*data;  /* chunk data */
boolimmutable;  /* no [de]population allowed */
+   boolisolated;   /* isolated from chunk slot 
lists */
+   booldepopulated;/* sidelined after depopulation 
*/
int start_offset;   /* the overlap with the previous
   region to have a page aligned
   base_addr */
diff --git a/mm/percpu.c b/mm/percpu.c
index 357fd6994278..5bb294e394b3 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -181,6 +181,19 @@ static LIST_HEAD(pcpu_map_extend_chunks);
  */
 int pcpu_nr_empty_pop_pages[PCPU_NR_CHUNK_TYPES];
 
+/*
+ * List of chunks with a lot of free pages.  Used to depopulate them
+ * asynchronously.
+ */
+static struct list_head pcpu_depopulate_list[PCPU_NR_CHUNK_TYPES];
+
+/*
+ * List of previously depopulated chunks.  They are not usually used for new
+ * allocations, but can be returned back to service if a need arises.
+ */
+static struct list_head pcpu_sideline_list[PCPU_NR_CHUNK_TYPES];
+
+
 /*
  * The number of populated pages in use by the allocator, protected by
  * pcpu_lock.  This number is kept per a unit per chunk (i.e. when a page gets
@@ -562,6 +575,12 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, 
int oslot)
 {
int nslot = pcpu_chunk_slot(chunk);
 
+   /*
+* Keep isolated and depopulated chunks on a sideline.
+*/
+   if (chunk->isolated || chunk->depopulated)
+   return;
+
if (oslot != nslot)
__pcpu_chunk_move(chunk, nslot, oslot < nslot);
 }
@@ -1790,6 +1809,19 @@ static void __percpu *pcpu_alloc(size_t size, size_t 
align, bool reserved,
}
}
 
+   /* search through sidelined depopulated chunks */
+   list_for_each_entry(chunk, &pcpu_sideline_list[type], list) {
+   /*
+* If the allocation can fit the chunk, place the chunk back
+* into corresponding slot and restart the scanning.
+*/
+   if (pcpu_check_chunk_hint(&chunk->chunk_md, bits, bit_align)) {
+   chunk->depopulated = false;
+   pcpu_chunk_relocate(chunk, -1);
+ 
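
Condensing the free-path heuristic from the changelog into one predicate gives roughly the following; this is a sketch, not the code from the patch, and it omits the already-depopulated and entirely-free cases while reusing the existing chunk fields:

/*
 * Sketch of "is this chunk a good depopulation target?" as described
 * in the changelog above.  Simplified; the real patch also handles
 * already-depopulated chunks with newly freed populated pages.
 */
static bool pcpu_is_depopulate_target(struct pcpu_chunk *chunk,
                                      enum pcpu_chunk_type type)
{
    /* 3) + 4): never touch the reserved or the first chunk */
    if (chunk == pcpu_reserved_chunk || chunk == pcpu_first_chunk)
        return false;

    /* 2): keep enough free populated pages aside of this chunk */
    if (pcpu_nr_empty_pop_pages[type] - chunk->nr_empty_pop_pages <
        PCPU_EMPTY_POP_PAGES_HIGH)
        return false;

    /* 1): more than 1/4 of the chunk's pages are free and populated */
    return chunk->nr_empty_pop_pages > chunk->nr_pages / 4;
}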

[PATCH v3 0/6] percpu: partial chunk depopulation

2021-04-07 Thread Roman Gushchin
In our production experience the percpu memory allocator sometimes struggles
to return memory to the system. A typical example is the creation of
several thousand memory cgroups (each has several chunks of percpu data
used for vmstats, vmevents, ref counters etc). Deletion and complete releasing
of these cgroups doesn't always lead to a shrinkage of the percpu memory,
so that sometimes there are several GB's of memory wasted.

The underlying problem is the fragmentation: to release an underlying chunk
all percpu allocations should be released first. The percpu allocator tends
to top up chunks to improve the utilization. It means new small-ish allocations
(e.g. percpu ref counters) are placed onto almost filled old-ish chunks,
effectively pinning them in memory.

This patchset solves this problem by implementing a partial depopulation
of percpu chunks: chunks with many empty pages are being asynchronously
depopulated and the pages are returned to the system.

To illustrate the problem the following script can be used:

--
#!/bin/bash

cd /sys/fs/cgroup

mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done

sleep 10

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done

rmdir percpu_test
--

It creates 11000 memory cgroups and removes 10 out of every 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.

Results:
  vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu:   481152 kB
Percpu:   481152 kB

  with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu:   481408 kB
Percpu:   135552 kB

So the total size of the percpu memory was reduced by more than 3.5 times.

v3:
  - introduced pcpu_check_chunk_hint()
  - fixed a bug related to the hint check
  - minor cosmetic changes
  - s/pretends/fixes (cc Vlastimil)

v2:
  - depopulated chunks are sidelined
  - depopulation happens in the reverse order
  - depopulate list made per-chunk type
  - better results due to better heuristics

v1:
  - depopulation heuristics changed and optimized
  - chunks are put into a separate list, depopulation scan this list
  - chunk->isolated is introduced, chunk->depopulate is dropped
  - rearranged patches a bit
  - fixed a panic discovered by krobot
  - made pcpu_nr_empty_pop_pages per chunk type
  - minor fixes

rfc:
  https://lwn.net/Articles/850508/


Roman Gushchin (6):
  percpu: fix a comment about the chunks ordering
  percpu: split __pcpu_balance_workfn()
  percpu: make pcpu_nr_empty_pop_pages per chunk type
  percpu: generalize pcpu_balance_populated()
  percpu: factor out pcpu_check_chunk_hint()
  percpu: implement partial chunk depopulation

 mm/percpu-internal.h |   4 +-
 mm/percpu-stats.c|   9 +-
 mm/percpu.c  | 306 +++
 3 files changed, 261 insertions(+), 58 deletions(-)

-- 
2.30.2



[PATCH v3 4/6] percpu: generalize pcpu_balance_populated()

2021-04-07 Thread Roman Gushchin
To prepare for the depopulation of percpu chunks, split out the
populating part of the pcpu_balance_populated() into the new
pcpu_grow_populated() (with an intention to add
pcpu_shrink_populated() in the next commit).

The goal of pcpu_balance_populated() is to determine whether
there is a shortage or an excessive amount of empty percpu pages
and call into the corresponding function.

pcpu_grow_populated() takes a desired number of pages as an argument
(nr_to_pop). If it creates a new chunk, nr_to_pop should be updated
to reflect that the new chunk could be created already populated.
Otherwise an infinite loop might appear.

Signed-off-by: Roman Gushchin 
---
 mm/percpu.c | 63 +
 1 file changed, 39 insertions(+), 24 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 61339b3d9337..e20119668c42 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1979,7 +1979,7 @@ static void pcpu_balance_free(enum pcpu_chunk_type type)
 }
 
 /**
- * pcpu_balance_populated - manage the amount of populated pages
+ * pcpu_grow_populated - populate chunk(s) to satisfy atomic allocations
  * @type: chunk type
  *
  * Maintain a certain amount of populated pages to satisfy atomic allocations.
@@ -1988,35 +1988,15 @@ static void pcpu_balance_free(enum pcpu_chunk_type type)
  * allocation causes the failure as it is possible that requests can be
  * serviced from already backed regions.
  */
-static void pcpu_balance_populated(enum pcpu_chunk_type type)
+static void pcpu_grow_populated(enum pcpu_chunk_type type, int nr_to_pop)
 {
/* gfp flags passed to underlying allocators */
const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
struct list_head *pcpu_slot = pcpu_chunk_list(type);
struct pcpu_chunk *chunk;
-   int slot, nr_to_pop, ret;
+   int slot, ret;
 
-   /*
-* Ensure there are certain number of free populated pages for
-* atomic allocs.  Fill up from the most packed so that atomic
-* allocs don't increase fragmentation.  If atomic allocation
-* failed previously, always populate the maximum amount.  This
-* should prevent atomic allocs larger than PAGE_SIZE from keeping
-* failing indefinitely; however, large atomic allocs are not
-* something we support properly and can be highly unreliable and
-* inefficient.
-*/
 retry_pop:
-   if (pcpu_atomic_alloc_failed) {
-   nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH;
-   /* best effort anyway, don't worry about synchronization */
-   pcpu_atomic_alloc_failed = false;
-   } else {
-   nr_to_pop = clamp(PCPU_EMPTY_POP_PAGES_HIGH -
- pcpu_nr_empty_pop_pages[type],
- 0, PCPU_EMPTY_POP_PAGES_HIGH);
-   }
-
for (slot = pcpu_size_to_slot(PAGE_SIZE); slot < pcpu_nr_slots; slot++) 
{
unsigned int nr_unpop = 0, rs, re;
 
@@ -2060,12 +2040,47 @@ static void pcpu_balance_populated(enum pcpu_chunk_type 
type)
if (chunk) {
spin_lock_irq(&pcpu_lock);
pcpu_chunk_relocate(chunk, -1);
+   nr_to_pop = max_t(int, 0, nr_to_pop - chunk->nr_populated);
spin_unlock_irq(&pcpu_lock);
-   goto retry_pop;
+   if (nr_to_pop)
+   goto retry_pop;
}
}
 }
 
+/**
+ * pcpu_balance_populated - manage the amount of populated pages
+ * @type: chunk type
+ *
+ * Populate or depopulate chunks to maintain a certain amount
+ * of free pages to satisfy atomic allocations, but not waste
+ * large amounts of memory.
+ */
+static void pcpu_balance_populated(enum pcpu_chunk_type type)
+{
+   int nr_to_pop;
+
+   /*
+* Ensure there are certain number of free populated pages for
+* atomic allocs.  Fill up from the most packed so that atomic
+* allocs don't increase fragmentation.  If atomic allocation
+* failed previously, always populate the maximum amount.  This
+* should prevent atomic allocs larger than PAGE_SIZE from keeping
+* failing indefinitely; however, large atomic allocs are not
+* something we support properly and can be highly unreliable and
+* inefficient.
+*/
+   if (pcpu_atomic_alloc_failed) {
+   nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH;
+   /* best effort anyway, don't worry about synchronization */
+   pcpu_atomic_alloc_failed = false;
+   pcpu_grow_populated(type, nr_to_pop);
+   } else if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_HIGH) {
+   nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH - 
pcpu_nr_empty_pop_pages[type];
+   pcpu_grow_populated(type, nr_to_pop);
+   }
+}
+
 /**
  * pcpu_balance_workfn - manage the amount of fr

[PATCH v3 3/6] percpu: make pcpu_nr_empty_pop_pages per chunk type

2021-04-07 Thread Roman Gushchin
nr_empty_pop_pages is used to guarantee that there are some free
populated pages to satisfy atomic allocations. Accounted and
non-accounted allocations are using separate sets of chunks,
so both need to have a surplus of empty pages.

This commit makes pcpu_nr_empty_pop_pages and the corresponding logic
per chunk type.

Signed-off-by: Roman Gushchin 
---
 mm/percpu-internal.h |  2 +-
 mm/percpu-stats.c|  9 +++--
 mm/percpu.c  | 14 +++---
 3 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 18b768ac7dca..095d7eaa0db4 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -87,7 +87,7 @@ extern spinlock_t pcpu_lock;
 
 extern struct list_head *pcpu_chunk_lists;
 extern int pcpu_nr_slots;
-extern int pcpu_nr_empty_pop_pages;
+extern int pcpu_nr_empty_pop_pages[];
 
 extern struct pcpu_chunk *pcpu_first_chunk;
 extern struct pcpu_chunk *pcpu_reserved_chunk;
diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
index c8400a2adbc2..f6026dbcdf6b 100644
--- a/mm/percpu-stats.c
+++ b/mm/percpu-stats.c
@@ -145,6 +145,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
int slot, max_nr_alloc;
int *buffer;
enum pcpu_chunk_type type;
+   int nr_empty_pop_pages;
 
 alloc_buffer:
spin_lock_irq(&pcpu_lock);
@@ -165,7 +166,11 @@ static int percpu_stats_show(struct seq_file *m, void *v)
goto alloc_buffer;
}
 
-#define PL(X) \
+   nr_empty_pop_pages = 0;
+   for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+   nr_empty_pop_pages += pcpu_nr_empty_pop_pages[type];
+
+#define PL(X)  \
seq_printf(m, "  %-20s: %12lld\n", #X, (long long int)pcpu_stats_ai.X)
 
seq_printf(m,
@@ -196,7 +201,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
PU(nr_max_chunks);
PU(min_alloc_size);
PU(max_alloc_size);
-   P("empty_pop_pages", pcpu_nr_empty_pop_pages);
+   P("empty_pop_pages", nr_empty_pop_pages);
seq_putc(m, '\n');
 
 #undef PU
diff --git a/mm/percpu.c b/mm/percpu.c
index 7e31e1b8725f..61339b3d9337 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -176,10 +176,10 @@ struct list_head *pcpu_chunk_lists __ro_after_init; /* 
chunk list slots */
 static LIST_HEAD(pcpu_map_extend_chunks);
 
 /*
- * The number of empty populated pages, protected by pcpu_lock.  The
- * reserved chunk doesn't contribute to the count.
+ * The number of empty populated pages by chunk type, protected by pcpu_lock.
+ * The reserved chunk doesn't contribute to the count.
  */
-int pcpu_nr_empty_pop_pages;
+int pcpu_nr_empty_pop_pages[PCPU_NR_CHUNK_TYPES];
 
 /*
  * The number of populated pages in use by the allocator, protected by
@@ -559,7 +559,7 @@ static inline void pcpu_update_empty_pages(struct 
pcpu_chunk *chunk, int nr)
 {
chunk->nr_empty_pop_pages += nr;
if (chunk != pcpu_reserved_chunk)
-   pcpu_nr_empty_pop_pages += nr;
+   pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] += nr;
 }
 
 /*
@@ -1835,7 +1835,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t 
align, bool reserved,
mutex_unlock(&pcpu_alloc_mutex);
}
 
-   if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_LOW)
+   if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_LOW)
pcpu_schedule_balance_work();
 
/* clear the areas and return address relative to base address */
@@ -2013,7 +2013,7 @@ static void pcpu_balance_populated(enum pcpu_chunk_type 
type)
pcpu_atomic_alloc_failed = false;
} else {
nr_to_pop = clamp(PCPU_EMPTY_POP_PAGES_HIGH -
- pcpu_nr_empty_pop_pages,
+ pcpu_nr_empty_pop_pages[type],
  0, PCPU_EMPTY_POP_PAGES_HIGH);
}
 
@@ -2595,7 +2595,7 @@ void __init pcpu_setup_first_chunk(const struct 
pcpu_alloc_info *ai,
 
/* link the first chunk in */
pcpu_first_chunk = chunk;
-   pcpu_nr_empty_pop_pages = pcpu_first_chunk->nr_empty_pop_pages;
+   pcpu_nr_empty_pop_pages[PCPU_CHUNK_ROOT] = 
pcpu_first_chunk->nr_empty_pop_pages;
pcpu_chunk_relocate(pcpu_first_chunk, -1);
 
/* include all regions of the first chunk */
-- 
2.30.2
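
For completeness, the chunk types indexed here are the following, paraphrased from mm/percpu-internal.h of that era; memcg-accounted allocations live in their own set of chunks, which is why each set needs its own reserve of empty populated pages:

enum pcpu_chunk_type {
    PCPU_CHUNK_ROOT,                 /* regular, non-accounted chunks */
#ifdef CONFIG_MEMCG_KMEM
    PCPU_CHUNK_MEMCG,                /* chunks for __GFP_ACCOUNT allocations */
#endif
    PCPU_NR_CHUNK_TYPES,
    PCPU_FAIL_ALLOC = PCPU_NR_CHUNK_TYPES
};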



[PATCH v3 5/6] percpu: factor out pcpu_check_chunk_hint()

2021-04-07 Thread Roman Gushchin
Factor out the pcpu_check_chunk_hint() helper, which will be useful
in the future. The new function checks if the allocation can likely
fit the given chunk.

Signed-off-by: Roman Gushchin 
---
 mm/percpu.c | 30 +-
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index e20119668c42..357fd6994278 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -306,6 +306,26 @@ static unsigned long pcpu_block_off_to_off(int index, int 
off)
return index * PCPU_BITMAP_BLOCK_BITS + off;
 }
 
+/**
+ * pcpu_check_chunk_hint - check that allocation can fit a chunk
+ * @chunk_md: chunk's block
+ * @bits: size of request in allocation units
+ * @align: alignment of area (max PAGE_SIZE)
+ *
+ * Check to see if the allocation can fit in the chunk's contig hint.
+ * This is an optimization to prevent scanning by assuming if it
+ * cannot fit in the global hint, there is memory pressure and creating
+ * a new chunk would happen soon.
+ */
+static bool pcpu_check_chunk_hint(struct pcpu_block_md *chunk_md, int bits,
+ size_t align)
+{
+   int bit_off = ALIGN(chunk_md->contig_hint_start, align) -
+   chunk_md->contig_hint_start;
+
+   return bit_off + bits <= chunk_md->contig_hint;
+}
+
 /*
  * pcpu_next_hint - determine which hint to use
  * @block: block of interest
@@ -1065,15 +1085,7 @@ static int pcpu_find_block_fit(struct pcpu_chunk *chunk, 
int alloc_bits,
struct pcpu_block_md *chunk_md = &chunk->chunk_md;
int bit_off, bits, next_off;
 
-   /*
-* Check to see if the allocation can fit in the chunk's contig hint.
-* This is an optimization to prevent scanning by assuming if it
-* cannot fit in the global hint, there is memory pressure and creating
-* a new chunk would happen soon.
-*/
-   bit_off = ALIGN(chunk_md->contig_hint_start, align) -
- chunk_md->contig_hint_start;
-   if (bit_off + alloc_bits > chunk_md->contig_hint)
+   if (!pcpu_check_chunk_hint(chunk_md, alloc_bits, align))
return -1;
 
bit_off = pcpu_next_hint(chunk_md, alloc_bits);
-- 
2.30.2



[PATCH v3 2/6] percpu: split __pcpu_balance_workfn()

2021-04-07 Thread Roman Gushchin
__pcpu_balance_workfn() became fairly big and hard to follow, but in
fact it consists of two fully independent parts, responsible for
the destruction of excessive free chunks and population of the necessary
amount of free pages.

In order to simplify the code and prepare for adding new
functionality, split it into two functions:

  1) pcpu_balance_free,
  2) pcpu_balance_populated.

Move the taking/releasing of the pcpu_alloc_mutex to an upper level
to keep the current synchronization in place.

Signed-off-by: Roman Gushchin 
Reviewed-by: Dennis Zhou 
---
 mm/percpu.c | 46 +-
 1 file changed, 29 insertions(+), 17 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 2f27123bb489..7e31e1b8725f 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1933,31 +1933,22 @@ void __percpu *__alloc_reserved_percpu(size_t size, 
size_t align)
 }
 
 /**
- * __pcpu_balance_workfn - manage the amount of free chunks and populated pages
+ * pcpu_balance_free - manage the amount of free chunks
  * @type: chunk type
  *
- * Reclaim all fully free chunks except for the first one.  This is also
- * responsible for maintaining the pool of empty populated pages.  However,
- * it is possible that this is called when physical memory is scarce causing
- * OOM killer to be triggered.  We should avoid doing so until an actual
- * allocation causes the failure as it is possible that requests can be
- * serviced from already backed regions.
+ * Reclaim all fully free chunks except for the first one.
  */
-static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
+static void pcpu_balance_free(enum pcpu_chunk_type type)
 {
-   /* gfp flags passed to underlying allocators */
-   const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
LIST_HEAD(to_free);
struct list_head *pcpu_slot = pcpu_chunk_list(type);
struct list_head *free_head = &pcpu_slot[pcpu_nr_slots - 1];
struct pcpu_chunk *chunk, *next;
-   int slot, nr_to_pop, ret;
 
/*
 * There's no reason to keep around multiple unused chunks and VM
 * areas can be scarce.  Destroy all free chunks except for one.
 */
-   mutex_lock(&pcpu_alloc_mutex);
spin_lock_irq(&pcpu_lock);
 
list_for_each_entry_safe(chunk, next, free_head, list) {
@@ -1985,6 +1976,25 @@ static void __pcpu_balance_workfn(enum pcpu_chunk_type 
type)
pcpu_destroy_chunk(chunk);
cond_resched();
}
+}
+
+/**
+ * pcpu_balance_populated - manage the amount of populated pages
+ * @type: chunk type
+ *
+ * Maintain a certain amount of populated pages to satisfy atomic allocations.
+ * It is possible that this is called when physical memory is scarce causing
+ * OOM killer to be triggered.  We should avoid doing so until an actual
+ * allocation causes the failure as it is possible that requests can be
+ * serviced from already backed regions.
+ */
+static void pcpu_balance_populated(enum pcpu_chunk_type type)
+{
+   /* gfp flags passed to underlying allocators */
+   const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
+   struct list_head *pcpu_slot = pcpu_chunk_list(type);
+   struct pcpu_chunk *chunk;
+   int slot, nr_to_pop, ret;
 
/*
 * Ensure there are certain number of free populated pages for
@@ -2054,22 +2064,24 @@ static void __pcpu_balance_workfn(enum pcpu_chunk_type 
type)
goto retry_pop;
}
}
-
-   mutex_unlock(&pcpu_alloc_mutex);
 }
 
 /**
  * pcpu_balance_workfn - manage the amount of free chunks and populated pages
  * @work: unused
  *
- * Call __pcpu_balance_workfn() for each chunk type.
+ * Call pcpu_balance_free() and pcpu_balance_populated() for each chunk type.
  */
 static void pcpu_balance_workfn(struct work_struct *work)
 {
enum pcpu_chunk_type type;
 
-   for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
-   __pcpu_balance_workfn(type);
+   for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) {
+   mutex_lock(&pcpu_alloc_mutex);
+   pcpu_balance_free(type);
+   pcpu_balance_populated(type);
+   mutex_unlock(&pcpu_alloc_mutex);
+   }
 }
 
 /**
-- 
2.30.2



[PATCH v3 1/6] percpu: fix a comment about the chunks ordering

2021-04-07 Thread Roman Gushchin
Since the commit 3e54097beb22 ("percpu: manage chunks based on
contig_bits instead of free_bytes") chunks are sorted based on the
size of the biggest continuous free area instead of the total number
of free bytes. Update the corresponding comment to reflect this.

Signed-off-by: Roman Gushchin 
---
 mm/percpu.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 6596a0a4286e..2f27123bb489 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -99,7 +99,10 @@
 
 #include "percpu-internal.h"
 
-/* the slots are sorted by free bytes left, 1-31 bytes share the same slot */
+/*
+ * The slots are sorted by the size of the biggest continuous free area.
+ * 1-31 bytes share the same slot.
+ */
 #define PCPU_SLOT_BASE_SHIFT   5
 /* chunks in slots below this are subject to being sidelined on failed alloc */
 #define PCPU_SLOT_FAIL_THRESHOLD   3
-- 
2.30.2



[PATCH v2 5/5] percpu: implement partial chunk depopulation

2021-04-07 Thread Roman Gushchin
This patch implements partial depopulation of percpu chunks.

As of now, a chunk can be depopulated only as a part of the final
destruction, if there are no more outstanding allocations. However,
to minimize memory waste it might be useful to depopulate a
partially filled chunk, if a small number of outstanding allocations
prevents the chunk from being fully reclaimed.

This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
sidelined to a special list or freed. New allocations can't be served
using a sidelined chunk. The chunk can be moved back to a corresponding
slot if there are not enough chunks with empty populated pages.

The depopulation is scheduled on the free path. If the chunk:
  1) has more than 1/4 of its total pages free and populated,
  2) leaves the system with enough free percpu pages aside of it,
  3) isn't the reserved chunk,
  4) isn't the first chunk,
  5) isn't entirely free,
it's a good target for depopulation. If it's already depopulated
but got free populated pages, it's a good target too.
The chunk is moved to a special pcpu_depopulate_list, chunk->isolate
flag is set and the async balancing is scheduled.

The async balancing moves pcpu_depopulate_list to a local list
(because pcpu_depopulate_list can be changed when pcpu_lock is
released), and then tries to depopulate each chunk.  The depopulation
is performed in the reverse direction to keep populated pages close to
the beginning, if the global number of empty pages is reached.
Depopulated chunks are sidelined to prevent further allocations.
Skipped and fully empty chunks are returned to the corresponding slot.

On the allocation path, if there are no suitable chunks found,
the list of sidelined chunks is scanned prior to creating a new chunk.
If there is a good sidelined chunk, it's placed back to the slot
and the scanning is restarted.

Many thanks to Dennis Zhou for his great ideas and a very constructive
discussion which led to many improvements in this patchset!

Signed-off-by: Roman Gushchin 
---
 mm/percpu-internal.h |   2 +
 mm/percpu.c  | 164 ++-
 2 files changed, 164 insertions(+), 2 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 095d7eaa0db4..8e432663c41e 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -67,6 +67,8 @@ struct pcpu_chunk {
 
void*data;  /* chunk data */
boolimmutable;  /* no [de]population allowed */
+   boolisolated;   /* isolated from chunk slot 
lists */
+   booldepopulated;/* sidelined after depopulation 
*/
int start_offset;   /* the overlap with the previous
   region to have a page aligned
   base_addr */
diff --git a/mm/percpu.c b/mm/percpu.c
index e20119668c42..0a5a5e84e0a4 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -181,6 +181,19 @@ static LIST_HEAD(pcpu_map_extend_chunks);
  */
 int pcpu_nr_empty_pop_pages[PCPU_NR_CHUNK_TYPES];
 
+/*
+ * List of chunks with a lot of free pages.  Used to depopulate them
+ * asynchronously.
+ */
+static struct list_head pcpu_depopulate_list[PCPU_NR_CHUNK_TYPES];
+
+/*
+ * List of previously depopulated chunks.  They are not usually used for new
+ * allocations, but can be returned back to service if a need arises.
+ */
+static struct list_head pcpu_sideline_list[PCPU_NR_CHUNK_TYPES];
+
+
 /*
  * The number of populated pages in use by the allocator, protected by
  * pcpu_lock.  This number is kept per a unit per chunk (i.e. when a page gets
@@ -542,6 +555,12 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, 
int oslot)
 {
int nslot = pcpu_chunk_slot(chunk);
 
+   /*
+* Keep isolated and depopulated chunks on a sideline.
+*/
+   if (chunk->isolated || chunk->depopulated)
+   return;
+
if (oslot != nslot)
__pcpu_chunk_move(chunk, nslot, oslot < nslot);
 }
@@ -1778,6 +1797,25 @@ static void __percpu *pcpu_alloc(size_t size, size_t 
align, bool reserved,
}
}
 
+   /* search through sidelined depopulated chunks */
+   list_for_each_entry(chunk, &pcpu_sideline_list[type], list) {
+   struct pcpu_block_md *chunk_md = &chunk->chunk_md;
+   int bit_off;
+
+   /*
+* If the allocation can fit in the chunk's contig hint,
+* place the chunk back into corresponding slot and restart
+* the scanning.
+*/
+   bit_off = ALIGN(chunk_md->contig_hint_start, align) -
+   chunk_md-

[PATCH v2 4/5] percpu: generalize pcpu_balance_populated()

2021-04-07 Thread Roman Gushchin
To prepare for the depopulation of percpu chunks, split out the
populating part of the pcpu_balance_populated() into the new
pcpu_grow_populated() (with an intention to add
pcpu_shrink_populated() in the next commit).

The goal of pcpu_balance_populated() is to determine whether
there is a shortage or an excessive amount of empty percpu pages
and call into the corresponding function.

pcpu_grow_populated() takes a desired number of pages as an argument
(nr_to_pop). If it creates a new chunk, nr_to_pop should be updated
to reflect that the new chunk could be created already populated.
Otherwise an infinite loop might appear.

Signed-off-by: Roman Gushchin 
---
 mm/percpu.c | 63 +
 1 file changed, 39 insertions(+), 24 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 61339b3d9337..e20119668c42 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1979,7 +1979,7 @@ static void pcpu_balance_free(enum pcpu_chunk_type type)
 }
 
 /**
- * pcpu_balance_populated - manage the amount of populated pages
+ * pcpu_grow_populated - populate chunk(s) to satisfy atomic allocations
  * @type: chunk type
  *
  * Maintain a certain amount of populated pages to satisfy atomic allocations.
@@ -1988,35 +1988,15 @@ static void pcpu_balance_free(enum pcpu_chunk_type type)
  * allocation causes the failure as it is possible that requests can be
  * serviced from already backed regions.
  */
-static void pcpu_balance_populated(enum pcpu_chunk_type type)
+static void pcpu_grow_populated(enum pcpu_chunk_type type, int nr_to_pop)
 {
/* gfp flags passed to underlying allocators */
const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
struct list_head *pcpu_slot = pcpu_chunk_list(type);
struct pcpu_chunk *chunk;
-   int slot, nr_to_pop, ret;
+   int slot, ret;
 
-   /*
-* Ensure there are certain number of free populated pages for
-* atomic allocs.  Fill up from the most packed so that atomic
-* allocs don't increase fragmentation.  If atomic allocation
-* failed previously, always populate the maximum amount.  This
-* should prevent atomic allocs larger than PAGE_SIZE from keeping
-* failing indefinitely; however, large atomic allocs are not
-* something we support properly and can be highly unreliable and
-* inefficient.
-*/
 retry_pop:
-   if (pcpu_atomic_alloc_failed) {
-   nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH;
-   /* best effort anyway, don't worry about synchronization */
-   pcpu_atomic_alloc_failed = false;
-   } else {
-   nr_to_pop = clamp(PCPU_EMPTY_POP_PAGES_HIGH -
- pcpu_nr_empty_pop_pages[type],
- 0, PCPU_EMPTY_POP_PAGES_HIGH);
-   }
-
for (slot = pcpu_size_to_slot(PAGE_SIZE); slot < pcpu_nr_slots; slot++) 
{
unsigned int nr_unpop = 0, rs, re;
 
@@ -2060,12 +2040,47 @@ static void pcpu_balance_populated(enum pcpu_chunk_type 
type)
if (chunk) {
spin_lock_irq(&pcpu_lock);
pcpu_chunk_relocate(chunk, -1);
+   nr_to_pop = max_t(int, 0, nr_to_pop - chunk->nr_populated);
spin_unlock_irq(&pcpu_lock);
-   goto retry_pop;
+   if (nr_to_pop)
+   goto retry_pop;
}
}
 }
 
+/**
+ * pcpu_balance_populated - manage the amount of populated pages
+ * @type: chunk type
+ *
+ * Populate or depopulate chunks to maintain a certain amount
+ * of free pages to satisfy atomic allocations, but not waste
+ * large amounts of memory.
+ */
+static void pcpu_balance_populated(enum pcpu_chunk_type type)
+{
+   int nr_to_pop;
+
+   /*
+* Ensure there are certain number of free populated pages for
+* atomic allocs.  Fill up from the most packed so that atomic
+* allocs don't increase fragmentation.  If atomic allocation
+* failed previously, always populate the maximum amount.  This
+* should prevent atomic allocs larger than PAGE_SIZE from keeping
+* failing indefinitely; however, large atomic allocs are not
+* something we support properly and can be highly unreliable and
+* inefficient.
+*/
+   if (pcpu_atomic_alloc_failed) {
+   nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH;
+   /* best effort anyway, don't worry about synchronization */
+   pcpu_atomic_alloc_failed = false;
+   pcpu_grow_populated(type, nr_to_pop);
+   } else if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_HIGH) {
+		nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH - pcpu_nr_empty_pop_pages[type];
+   pcpu_grow_populated(type, nr_to_pop);
+   }
+}
+
 /**
 * pcpu_balance_workfn - manage the amount of free chunks and populated pages

[PATCH v2 1/5] percpu: fix a comment about the chunks ordering

2021-04-07 Thread Roman Gushchin
Since the commit 3e54097beb22 ("percpu: manage chunks based on
contig_bits instead of free_bytes") chunks are sorted based on the
size of the biggest continuous free area instead of the total number
of free bytes. Update the corresponding comment to reflect this.

Signed-off-by: Roman Gushchin 
---
 mm/percpu.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 6596a0a4286e..2f27123bb489 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -99,7 +99,10 @@
 
 #include "percpu-internal.h"
 
-/* the slots are sorted by free bytes left, 1-31 bytes share the same slot */
+/*
+ * The slots are sorted by the size of the biggest continuous free area.
+ * 1-31 bytes share the same slot.
+ */
 #define PCPU_SLOT_BASE_SHIFT   5
 /* chunks in slots below this are subject to being sidelined on failed alloc */
 #define PCPU_SLOT_FAIL_THRESHOLD   3
-- 
2.30.2



[PATCH v2 3/5] percpu: make pcpu_nr_empty_pop_pages per chunk type

2021-04-07 Thread Roman Gushchin
nr_empty_pop_pages is used to guarantee that there are some free
populated pages to satisfy atomic allocations. Accounted and
non-accounted allocations are using separate sets of chunks,
so both need to have a surplus of empty pages.

This commit makes pcpu_nr_empty_pop_pages and the corresponding logic
per chunk type.
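
Not part of the patch, but a minimal sketch of the failure mode the
per-type counters avoid (assuming CONFIG_MEMCG_KMEM, so the memcg-aware
PCPU_CHUNK_MEMCG type exists): an atomic accounted allocation can only be
served from the memcg-aware chunks, so the low watermark has to be checked
against that chunk type's own surplus, not a single global counter.

static void sketch_check_accounted_watermark(void)
{
	/*
	 * A GFP_ATOMIC | __GFP_ACCOUNT allocation can only be served
	 * from memcg-aware chunks; a healthy PCPU_CHUNK_ROOT counter
	 * says nothing about their empty-page surplus.
	 */
	if (pcpu_nr_empty_pop_pages[PCPU_CHUNK_MEMCG] < PCPU_EMPTY_POP_PAGES_LOW)
		pcpu_schedule_balance_work();
}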

Signed-off-by: Roman Gushchin 
---
 mm/percpu-internal.h |  2 +-
 mm/percpu-stats.c|  9 +++--
 mm/percpu.c  | 14 +++---
 3 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 18b768ac7dca..095d7eaa0db4 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -87,7 +87,7 @@ extern spinlock_t pcpu_lock;
 
 extern struct list_head *pcpu_chunk_lists;
 extern int pcpu_nr_slots;
-extern int pcpu_nr_empty_pop_pages;
+extern int pcpu_nr_empty_pop_pages[];
 
 extern struct pcpu_chunk *pcpu_first_chunk;
 extern struct pcpu_chunk *pcpu_reserved_chunk;
diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
index c8400a2adbc2..f6026dbcdf6b 100644
--- a/mm/percpu-stats.c
+++ b/mm/percpu-stats.c
@@ -145,6 +145,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
int slot, max_nr_alloc;
int *buffer;
enum pcpu_chunk_type type;
+   int nr_empty_pop_pages;
 
 alloc_buffer:
	spin_lock_irq(&pcpu_lock);
@@ -165,7 +166,11 @@ static int percpu_stats_show(struct seq_file *m, void *v)
goto alloc_buffer;
}
 
-#define PL(X) \
+   nr_empty_pop_pages = 0;
+   for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+   nr_empty_pop_pages += pcpu_nr_empty_pop_pages[type];
+
+#define PL(X)  \
seq_printf(m, "  %-20s: %12lld\n", #X, (long long int)pcpu_stats_ai.X)
 
seq_printf(m,
@@ -196,7 +201,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
PU(nr_max_chunks);
PU(min_alloc_size);
PU(max_alloc_size);
-   P("empty_pop_pages", pcpu_nr_empty_pop_pages);
+   P("empty_pop_pages", nr_empty_pop_pages);
seq_putc(m, '\n');
 
 #undef PU
diff --git a/mm/percpu.c b/mm/percpu.c
index 7e31e1b8725f..61339b3d9337 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -176,10 +176,10 @@ struct list_head *pcpu_chunk_lists __ro_after_init; /* 
chunk list slots */
 static LIST_HEAD(pcpu_map_extend_chunks);
 
 /*
- * The number of empty populated pages, protected by pcpu_lock.  The
- * reserved chunk doesn't contribute to the count.
+ * The number of empty populated pages by chunk type, protected by pcpu_lock.
+ * The reserved chunk doesn't contribute to the count.
  */
-int pcpu_nr_empty_pop_pages;
+int pcpu_nr_empty_pop_pages[PCPU_NR_CHUNK_TYPES];
 
 /*
  * The number of populated pages in use by the allocator, protected by
@@ -559,7 +559,7 @@ static inline void pcpu_update_empty_pages(struct 
pcpu_chunk *chunk, int nr)
 {
chunk->nr_empty_pop_pages += nr;
if (chunk != pcpu_reserved_chunk)
-   pcpu_nr_empty_pop_pages += nr;
+   pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] += nr;
 }
 
 /*
@@ -1835,7 +1835,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t 
align, bool reserved,
		mutex_unlock(&pcpu_alloc_mutex);
}
 
-   if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_LOW)
+   if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_LOW)
pcpu_schedule_balance_work();
 
/* clear the areas and return address relative to base address */
@@ -2013,7 +2013,7 @@ static void pcpu_balance_populated(enum pcpu_chunk_type 
type)
pcpu_atomic_alloc_failed = false;
} else {
nr_to_pop = clamp(PCPU_EMPTY_POP_PAGES_HIGH -
- pcpu_nr_empty_pop_pages,
+ pcpu_nr_empty_pop_pages[type],
  0, PCPU_EMPTY_POP_PAGES_HIGH);
}
 
@@ -2595,7 +2595,7 @@ void __init pcpu_setup_first_chunk(const struct 
pcpu_alloc_info *ai,
 
/* link the first chunk in */
pcpu_first_chunk = chunk;
-   pcpu_nr_empty_pop_pages = pcpu_first_chunk->nr_empty_pop_pages;
+	pcpu_nr_empty_pop_pages[PCPU_CHUNK_ROOT] = pcpu_first_chunk->nr_empty_pop_pages;
pcpu_chunk_relocate(pcpu_first_chunk, -1);
 
/* include all regions of the first chunk */
-- 
2.30.2



[PATCH v2 0/5] percpu: partial chunk depopulation

2021-04-07 Thread Roman Gushchin
In our production experience, the percpu memory allocator sometimes struggles
to return memory to the system. A typical example is the creation of
several thousand memory cgroups (each has several chunks of the percpu data
used for vmstats, vmevents, ref counters etc). Deletion and complete releasing
of these cgroups doesn't always lead to a shrinkage of the percpu memory,
so that sometimes there are several GB's of memory wasted.

The underlying problem is the fragmentation: to release an underlying chunk
all percpu allocations should be released first. The percpu allocator tends
to top up chunks to improve the utilization. It means new small-ish allocations
(e.g. percpu ref counters) are placed onto almost filled old-ish chunks,
effectively pinning them in memory.

This patchset attempts to solve this problem by implementing a partial
depopulation of percpu chunks: chunks with many empty pages are being
asynchronously depopulated and the pages are returned to the system.

To illustrate the problem the following script can be used:

--
#!/bin/bash

cd /sys/fs/cgroup

mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done

sleep 10

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done

rmdir percpu_test
--

It creates 11000 memory cgroups and removes 10 out of every 11 of them.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.

Results:
  vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu:   481152 kB
Percpu:   481152 kB

  with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu:   481408 kB
Percpu:   135552 kB

So the total size of the percpu memory was reduced by more than 3.5 times.


v2:
  - depopulated chunks are sidelined
  - depopulation happens in the reverse order
  - depopulate list made per-chunk type
  - better results due to better heuristics

v1:
  - depopulation heuristics changed and optimized
  - chunks are put into a separate list, depopulation scan this list
  - chunk->isolated is introduced, chunk->depopulate is dropped
  - rearranged patches a bit
  - fixed a panic discovered by krobot
  - made pcpu_nr_empty_pop_pages per chunk type
  - minor fixes

rfc:
  https://lwn.net/Articles/850508/


Roman Gushchin (5):
  percpu: fix a comment about the chunks ordering
  percpu: split __pcpu_balance_workfn()
  percpu: make pcpu_nr_empty_pop_pages per chunk type
  percpu: generalize pcpu_balance_populated()
  percpu: implement partial chunk depopulation

 mm/percpu-internal.h |   4 +-
 mm/percpu-stats.c|   9 +-
 mm/percpu.c  | 282 ---
 3 files changed, 246 insertions(+), 49 deletions(-)

-- 
2.30.2



[PATCH v2 2/5] percpu: split __pcpu_balance_workfn()

2021-04-07 Thread Roman Gushchin
__pcpu_balance_workfn() became fairly big and hard to follow, but in
fact it consists of two fully independent parts, responsible for
the destruction of excessive free chunks and the population of the necessary
amount of free pages.

In order to simplify the code and prepare for adding new
functionality, split it into two functions:

  1) pcpu_balance_free,
  2) pcpu_balance_populated.

Move the taking/releasing of the pcpu_alloc_mutex to an upper level
to keep the current synchronization in place.

Signed-off-by: Roman Gushchin 
Reviewed-by: Dennis Zhou 
---
 mm/percpu.c | 46 +-
 1 file changed, 29 insertions(+), 17 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 2f27123bb489..7e31e1b8725f 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1933,31 +1933,22 @@ void __percpu *__alloc_reserved_percpu(size_t size, 
size_t align)
 }
 
 /**
- * __pcpu_balance_workfn - manage the amount of free chunks and populated pages
+ * pcpu_balance_free - manage the amount of free chunks
  * @type: chunk type
  *
- * Reclaim all fully free chunks except for the first one.  This is also
- * responsible for maintaining the pool of empty populated pages.  However,
- * it is possible that this is called when physical memory is scarce causing
- * OOM killer to be triggered.  We should avoid doing so until an actual
- * allocation causes the failure as it is possible that requests can be
- * serviced from already backed regions.
+ * Reclaim all fully free chunks except for the first one.
  */
-static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
+static void pcpu_balance_free(enum pcpu_chunk_type type)
 {
-   /* gfp flags passed to underlying allocators */
-   const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
LIST_HEAD(to_free);
struct list_head *pcpu_slot = pcpu_chunk_list(type);
	struct list_head *free_head = &pcpu_slot[pcpu_nr_slots - 1];
struct pcpu_chunk *chunk, *next;
-   int slot, nr_to_pop, ret;
 
/*
 * There's no reason to keep around multiple unused chunks and VM
 * areas can be scarce.  Destroy all free chunks except for one.
 */
-	mutex_lock(&pcpu_alloc_mutex);
	spin_lock_irq(&pcpu_lock);
 
list_for_each_entry_safe(chunk, next, free_head, list) {
@@ -1985,6 +1976,25 @@ static void __pcpu_balance_workfn(enum pcpu_chunk_type 
type)
pcpu_destroy_chunk(chunk);
cond_resched();
}
+}
+
+/**
+ * pcpu_balance_populated - manage the amount of populated pages
+ * @type: chunk type
+ *
+ * Maintain a certain amount of populated pages to satisfy atomic allocations.
+ * It is possible that this is called when physical memory is scarce causing
+ * OOM killer to be triggered.  We should avoid doing so until an actual
+ * allocation causes the failure as it is possible that requests can be
+ * serviced from already backed regions.
+ */
+static void pcpu_balance_populated(enum pcpu_chunk_type type)
+{
+   /* gfp flags passed to underlying allocators */
+   const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
+   struct list_head *pcpu_slot = pcpu_chunk_list(type);
+   struct pcpu_chunk *chunk;
+   int slot, nr_to_pop, ret;
 
/*
 * Ensure there are certain number of free populated pages for
@@ -2054,22 +2064,24 @@ static void __pcpu_balance_workfn(enum pcpu_chunk_type 
type)
goto retry_pop;
}
}
-
-	mutex_unlock(&pcpu_alloc_mutex);
 }
 
 /**
  * pcpu_balance_workfn - manage the amount of free chunks and populated pages
  * @work: unused
  *
- * Call __pcpu_balance_workfn() for each chunk type.
+ * Call pcpu_balance_free() and pcpu_balance_populated() for each chunk type.
  */
 static void pcpu_balance_workfn(struct work_struct *work)
 {
enum pcpu_chunk_type type;
 
-   for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
-   __pcpu_balance_workfn(type);
+   for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) {
+		mutex_lock(&pcpu_alloc_mutex);
+		pcpu_balance_free(type);
+		pcpu_balance_populated(type);
+		mutex_unlock(&pcpu_alloc_mutex);
+   }
 }
 
 /**
-- 
2.30.2



Re: [PATCH v4 1/8] mm/cma: change cma mutex to irq safe spinlock

2021-04-07 Thread Roman Gushchin
On Mon, Apr 05, 2021 at 04:00:36PM -0700, Mike Kravetz wrote:
> cma_release is currently a sleepable operation because the bitmap
> manipulation is protected by cma->lock mutex. Hugetlb code which relies
> on cma_release for CMA backed (giga) hugetlb pages, however, needs to be
> irq safe.
> 
> The lock doesn't protect any sleepable operation so it can be changed to
> a (irq aware) spin lock. The bitmap processing should be quite fast in
> typical case but if cma sizes grow to TB then we will likely need to
> replace the lock by a more optimized bitmap implementation.
> 
> Signed-off-by: Mike Kravetz 

Acked-by: Roman Gushchin 

> ---
>  mm/cma.c   | 18 +-
>  mm/cma.h   |  2 +-
>  mm/cma_debug.c |  8 
>  3 files changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/cma.c b/mm/cma.c
> index f3bca4178c7f..995e15480937 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -24,7 +24,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -83,13 +82,14 @@ static void cma_clear_bitmap(struct cma *cma, unsigned 
> long pfn,
>unsigned long count)
>  {
>   unsigned long bitmap_no, bitmap_count;
> + unsigned long flags;
>  
>   bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
>   bitmap_count = cma_bitmap_pages_to_bits(cma, count);
>  
> -	mutex_lock(&cma->lock);
> +	spin_lock_irqsave(&cma->lock, flags);
>   	bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irqrestore(&cma->lock, flags);
>  }
>  
>  static void __init cma_activate_area(struct cma *cma)
> @@ -118,7 +118,7 @@ static void __init cma_activate_area(struct cma *cma)
>pfn += pageblock_nr_pages)
>   init_cma_reserved_pageblock(pfn_to_page(pfn));
>  
> -	mutex_init(&cma->lock);
> +	spin_lock_init(&cma->lock);
>  
>  #ifdef CONFIG_CMA_DEBUGFS
>   	INIT_HLIST_HEAD(&cma->mem_head);
> @@ -392,7 +392,7 @@ static void cma_debug_show_areas(struct cma *cma)
>   unsigned long nr_part, nr_total = 0;
>   unsigned long nbits = cma_bitmap_maxno(cma);
>  
> -	mutex_lock(&cma->lock);
> +	spin_lock_irq(&cma->lock);
>   pr_info("number of available pages: ");
>   for (;;) {
>   next_zero_bit = find_next_zero_bit(cma->bitmap, nbits, start);
> @@ -407,7 +407,7 @@ static void cma_debug_show_areas(struct cma *cma)
>   start = next_zero_bit + nr_zero;
>   }
>   pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count);
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irq(&cma->lock);
>  }
>  #else
>  static inline void cma_debug_show_areas(struct cma *cma) { }
> @@ -454,12 +454,12 @@ struct page *cma_alloc(struct cma *cma, unsigned long 
> count,
>   goto out;
>  
>   for (;;) {
> -	mutex_lock(&cma->lock);
> +	spin_lock_irq(&cma->lock);
>   bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
>   bitmap_maxno, start, bitmap_count, mask,
>   offset);
>   if (bitmap_no >= bitmap_maxno) {
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irq(&cma->lock);
>   break;
>   }
>   bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
> @@ -468,7 +468,7 @@ struct page *cma_alloc(struct cma *cma, unsigned long 
> count,
>* our exclusive use. If the migration fails we will take the
>* lock again and unmark it.
>*/
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irq(&cma->lock);
>  
>   pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
>   ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
> diff --git a/mm/cma.h b/mm/cma.h
> index 68ffad4e430d..2c775877eae2 100644
> --- a/mm/cma.h
> +++ b/mm/cma.h
> @@ -15,7 +15,7 @@ struct cma {
>   unsigned long   count;
>   unsigned long   *bitmap;
>   unsigned int order_per_bit; /* Order of pages represented by one bit */
> -	struct mutex	lock;
> + spinlock_t  lock;
>  #ifdef CONFIG_CMA_DEBUGFS
>   struct hlist_head mem_head;
>   spinlock_t mem_head_lock;
> diff --git a/mm/cma_debug.c b/mm/cma_debug.c
> index d5bf8aa34fdc..2e7704955f4f 100644
> --- a/mm/cma_debug.c
> +++ b/mm/cma_debug.c
> @@ -36,10 +36,10 @@ static int cma_used_get(void *data, u64 *val)
>   struct cma *cma = data;
>   unsigned long used;
> 

Re: High kmalloc-32 slab cache consumption with 10k containers

2021-04-05 Thread Roman Gushchin
On Mon, Apr 05, 2021 at 11:08:26AM -0700, Yang Shi wrote:
> On Sun, Apr 4, 2021 at 10:49 PM Bharata B Rao  wrote:
> >
> > Hi,
> >
> > When running 10k (more-or-less-empty-)containers on a bare-metal Power9
> > server(160 CPUs, 2 NUMA nodes, 256G memory), it is seen that memory
> > consumption increases quite a lot (around 172G) when the containers are
> > running. Most of it comes from slab (149G) and within slab, the majority of
> > it comes from kmalloc-32 cache (102G)
> >
> > The major allocator of kmalloc-32 slab cache happens to be the list_head
> > allocations of list_lru_one list. These lists are created whenever a
> > FS mount happens. Specially two such lists are registered by alloc_super(),
> > one for dentry and another for inode shrinker list. And these lists
> > are created for all possible NUMA nodes and for all given memcgs
> > (memcg_nr_cache_ids to be particular)
> >
> > If,
> >
> > A = Nr allocation request per mount: 2 (one for dentry and inode list)
> > B = Nr NUMA possible nodes
> > C = memcg_nr_cache_ids
> > D = size of each kmalloc-32 object: 32 bytes,
> >
> > then for every mount, the amount of memory consumed by kmalloc-32 slab
> > cache for list_lru creation is A*B*C*D bytes.
> 
> Yes, this is exactly what the current implementation does.
> 
> >
> > Following factors contribute to the excessive allocations:
> >
> > - Lists are created for possible NUMA nodes.
> 
> Yes, because filesystem caches (dentry and inode) are NUMA aware.
> 
> > - memcg_nr_cache_ids grows in bulk (see memcg_alloc_cache_id() and 
> > additional
> >   list_lrus are created when it grows. Thus we end up creating list_lru_one
> >   list_heads even for those memcgs which are yet to be created.
> > >   For example, when 10k memcgs are created, memcg_nr_cache_ids reaches
> >   a value of 12286.
> > - When a memcg goes offline, the list elements are drained to the parent
> >   memcg, but the list_head entry remains.
> > - The lists are destroyed only when the FS is unmounted. So list_heads
> >   for non-existing memcgs remain and continue to contribute to the
> >   kmalloc-32 allocation. This is presumably done for performance
> >   reason as they get reused when new memcgs are created, but they end up
> >   consuming slab memory until then.
> 
> The current implementation has list_lrus attached with super_block. So
> the list can't be freed until the super block is unmounted.
> 
> I'm looking into consolidating list_lrus more closely with memcgs. It
> means the list_lrus will have the same life cycles as memcgs rather
> than filesystems. This may be able to improve some. But I'm supposed
> the filesystem will be unmounted once the container exits and the
> memcgs will get offlined for your usecase.
> 
> > - In case of containers, a few file systems get mounted and are specific
> >   to the container namespace and hence to a particular memcg, but we
> >   end up creating lists for all the memcgs.
> 
> Yes, because the kernel is *NOT* aware of containers.
> 
> >   As an example, if 7 FS mounts are done for every container and when
> >   10k containers are created, we end up creating 2*7*12286 list_lru_one
> >   lists for each NUMA node. It appears that no elements will get added
> >   to other than 2*7=14 of them in the case of containers.
> >
> > One straight forward way to prevent this excessive list_lru_one
> > allocations is to limit the list_lru_one creation only to the
> > relevant memcg. However I don't see an easy way to figure out
> > that relevant memcg from FS mount path (alloc_super())
> >
> > As an alternative approach, I have this below hack that does lazy
> > list_lru creation. The memcg-specific list is created and initialized
> > only when there is a request to add an element to that particular
> > list. Though I am not sure about the full impact of this change
> > on the owners of the lists and also the performance impact of this,
> > the overall savings look good.
> 
> It is fine to reduce the memory consumption for your usecase, but I'm
> not sure if this would incur any noticeable overhead for vfs
> operations since list_lru_add() should be called quite often, but it
> just needs to allocate the list for once (for each memcg +
> filesystem), so the overhead might be fine.
> 
> And I'm wondering how much memory can be saved for real life workload.
> I don't expect most containers are idle in production environments.
> 
> Added some more memcg/list_lru experts in this loop, they may have better 
> ideas.
> 
> >
> > Used memory
> > Before  During  After
> > W/o patch   23G     172G    40G
> > W/  patch   23G     69G     29G
> >
> > Slab consumption
> > Before  During  After
> > W/o patch   1.5G    149G    22G
> > W/  patch   1.5G    45G     10G
> >
> > Number of kmalloc-32 allocations
> > Before  During  After
> > W/o patch  

Re: [PATCH] mm: memcontrol: fix forget to obtain the ref to objcg in split_page_memcg

2021-04-02 Thread Roman Gushchin
On Fri, Apr 02, 2021 at 06:04:54PM -0700, Andrew Morton wrote:
> On Wed, 31 Mar 2021 20:35:02 -0700 Roman Gushchin  wrote:
> 
> > On Thu, Apr 01, 2021 at 11:31:16AM +0800, Miaohe Lin wrote:
> > > On 2021/4/1 11:01, Muchun Song wrote:
> > > > Christian Borntraeger reported a warning about "percpu ref
> > > > (obj_cgroup_release) <= 0 (-1) after switching to atomic".
> > > > Because we forgot to obtain the reference to the objcg and
> > > > wrongly obtain the reference of memcg.
> > > > 
> > > > Reported-by: Christian Borntraeger 
> > > > Signed-off-by: Muchun Song 
> > > 
> > > Thanks for the patch.
> > > Is a Fixes tag needed?
> > 
> > No, as the original patch hasn't been merged into the Linus's tree yet.
> > So the fix can be simply squashed.
> 
> Help.  Which is "the original patch"?

"mm: memcontrol: use obj_cgroup APIs to charge kmem pages"


[PATCH v1 3/5] percpu: generalize pcpu_balance_populated()

2021-04-01 Thread Roman Gushchin
To prepare for the depopulation of percpu chunks, split out the
populating part of the pcpu_balance_populated() into the new
pcpu_grow_populated() (with an intention to add
pcpu_shrink_populated() in the next commit).

The goal of pcpu_balance_populated() is to determine whether
there is a shortage or an excessive amount of empty percpu pages
and call into the corresponding function.

pcpu_grow_populated() takes a desired number of pages as an argument
(nr_to_pop). If it creates a new chunk, nr_to_pop should be updated
to account for the fact that the new chunk can already come populated.
Otherwise an infinite loop might occur.
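
A condensed sketch of that corner case (taken from the hunk below,
slightly simplified): once a new chunk has been created, the remaining
target is reduced by whatever the chunk already has populated before
retrying.

	if (chunk) {
		spin_lock_irq(&pcpu_lock);
		pcpu_chunk_relocate(chunk, -1);
		/* the fresh chunk may already cover part of the target */
		nr_to_pop = max_t(int, 0, nr_to_pop - chunk->nr_populated);
		spin_unlock_irq(&pcpu_lock);
		if (nr_to_pop)
			goto retry_pop;	/* only ask for what is still missing */
	}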

Signed-off-by: Roman Gushchin 
---
 mm/percpu.c | 63 +
 1 file changed, 39 insertions(+), 24 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 0eeeb4e7a2f9..25a181328353 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1976,7 +1976,7 @@ static void pcpu_balance_free(enum pcpu_chunk_type type)
 }
 
 /**
- * pcpu_balance_populated - manage the amount of populated pages
+ * pcpu_grow_populated - populate chunk(s) to satisfy atomic allocations
  * @type: chunk type
  *
  * Maintain a certain amount of populated pages to satisfy atomic allocations.
@@ -1985,35 +1985,15 @@ static void pcpu_balance_free(enum pcpu_chunk_type type)
  * allocation causes the failure as it is possible that requests can be
  * serviced from already backed regions.
  */
-static void pcpu_balance_populated(enum pcpu_chunk_type type)
+static void pcpu_grow_populated(enum pcpu_chunk_type type, int nr_to_pop)
 {
/* gfp flags passed to underlying allocators */
const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
struct list_head *pcpu_slot = pcpu_chunk_list(type);
struct pcpu_chunk *chunk;
-   int slot, nr_to_pop, ret;
+   int slot, ret;
 
-   /*
-* Ensure there are certain number of free populated pages for
-* atomic allocs.  Fill up from the most packed so that atomic
-* allocs don't increase fragmentation.  If atomic allocation
-* failed previously, always populate the maximum amount.  This
-* should prevent atomic allocs larger than PAGE_SIZE from keeping
-* failing indefinitely; however, large atomic allocs are not
-* something we support properly and can be highly unreliable and
-* inefficient.
-*/
 retry_pop:
-   if (pcpu_atomic_alloc_failed) {
-   nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH;
-   /* best effort anyway, don't worry about synchronization */
-   pcpu_atomic_alloc_failed = false;
-   } else {
-   nr_to_pop = clamp(PCPU_EMPTY_POP_PAGES_HIGH -
- pcpu_nr_empty_pop_pages[type],
- 0, PCPU_EMPTY_POP_PAGES_HIGH);
-   }
-
	for (slot = pcpu_size_to_slot(PAGE_SIZE); slot < pcpu_nr_slots; slot++) {
unsigned int nr_unpop = 0, rs, re;
 
@@ -2057,12 +2037,47 @@ static void pcpu_balance_populated(enum pcpu_chunk_type 
type)
if (chunk) {
			spin_lock_irq(&pcpu_lock);
			pcpu_chunk_relocate(chunk, -1);
+			nr_to_pop = max_t(int, 0, nr_to_pop - chunk->nr_populated);
			spin_unlock_irq(&pcpu_lock);
-			goto retry_pop;
+			if (nr_to_pop)
+				goto retry_pop;
}
}
 }
 
+/**
+ * pcpu_balance_populated - manage the amount of populated pages
+ * @type: chunk type
+ *
+ * Populate or depopulate chunks to maintain a certain amount
+ * of free pages to satisfy atomic allocations, but not waste
+ * large amounts of memory.
+ */
+static void pcpu_balance_populated(enum pcpu_chunk_type type)
+{
+   int nr_to_pop;
+
+   /*
+* Ensure there are certain number of free populated pages for
+* atomic allocs.  Fill up from the most packed so that atomic
+* allocs don't increase fragmentation.  If atomic allocation
+* failed previously, always populate the maximum amount.  This
+* should prevent atomic allocs larger than PAGE_SIZE from keeping
+* failing indefinitely; however, large atomic allocs are not
+* something we support properly and can be highly unreliable and
+* inefficient.
+*/
+   if (pcpu_atomic_alloc_failed) {
+   nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH;
+   /* best effort anyway, don't worry about synchronization */
+   pcpu_atomic_alloc_failed = false;
+   pcpu_grow_populated(type, nr_to_pop);
+   } else if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_HIGH) {
+		nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH - pcpu_nr_empty_pop_pages[type];
+   pcpu_grow_populated(type, nr_to_pop);
+   }
+}
+
 /**
 * pcpu_balance_workfn - manage the amount of free chunks and populated pages

[PATCH v1 5/5] percpu: implement partial chunk depopulation

2021-04-01 Thread Roman Gushchin
This patch implements partial depopulation of percpu chunks.

As of now, a chunk can be depopulated only as part of its final
destruction, when there are no more outstanding allocations. However,
to minimize memory waste it might be useful to depopulate a
partially filled chunk, when a small number of outstanding allocations
prevents the chunk from being fully reclaimed.

This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is isolated beforehand. After the depopulation the chunk is
returned to its original slot (but is appended to the tail of the list
to minimize the chances of being repopulated).

Because the pcpu_lock is dropped while calling pcpu_depopulate_chunk(),
the chunk can be concurrently moved to a different slot. To prevent
this, bool chunk->isolated flag is introduced. If set, the chunk can't
be moved to a different slot.

The depopulation is scheduled on the free path. A chunk is considered
a good target for depopulation (see the sketch below) if it:
  1) has more than 1/8 of its total pages free and populated,
  2) leaves the system with enough free percpu pages aside from it,
  3) isn't the reserved chunk,
  4) isn't the first chunk,
  5) isn't entirely free.
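
A rough sketch of that free-path check (the helper name is made up and
the exact comparisons may differ slightly from the patch):

/* called with pcpu_lock held (sketch only) */
static bool sketch_depopulate_candidate(struct pcpu_chunk *chunk,
					enum pcpu_chunk_type type)
{
	if (chunk == pcpu_first_chunk || chunk == pcpu_reserved_chunk)
		return false;		/* 3) and 4) */
	if (chunk->free_bytes == pcpu_unit_size)
		return false;		/* 5) entirely free: destroy it instead */
	if (chunk->nr_empty_pop_pages < chunk->nr_pages / 8)
		return false;		/* 1) too few empty populated pages */
	if (pcpu_nr_empty_pop_pages[type] - chunk->nr_empty_pop_pages <
	    PCPU_EMPTY_POP_PAGES_HIGH)
		return false;		/* 2) the rest of the system needs them */
	return true;
}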

If so, the chunk is moved to a special pcpu_depopulate_list,
its chunk->isolated flag is set and the async balancing is scheduled.

The async balancing moves pcpu_depopulate_list to a local list
(because pcpu_depopulate_list can be changed while pcpu_lock is
released), and then tries to depopulate each chunk. Successfully
or not, at the end all chunks are returned to appropriate slots
and their isolated flags are cleared.
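
A minimal sketch of that final step (simplified):

	/* under pcpu_lock, after the depopulation attempts */
	list_for_each_entry_safe(chunk, tmp, &to_depopulate, list) {
		chunk->isolated = false;
		/* put the chunk back on a slot list (appended to the tail) */
		pcpu_chunk_relocate(chunk, -1);
	}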

Many thanks to Dennis Zhou for his great ideas and a very constructive
discussion which led to many improvements in this patchset!

Signed-off-by: Roman Gushchin 
---
 mm/percpu-internal.h |   1 +
 mm/percpu.c  | 101 ++-
 2 files changed, 100 insertions(+), 2 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 095d7eaa0db4..ff318752915d 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -67,6 +67,7 @@ struct pcpu_chunk {
 
	void	*data;		/* chunk data */
	bool	immutable;	/* no [de]population allowed */
+	bool	isolated;	/* isolated from chunk slot lists */
int start_offset;   /* the overlap with the previous
   region to have a page aligned
   base_addr */
diff --git a/mm/percpu.c b/mm/percpu.c
index e20119668c42..dae0b870e10a 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -181,6 +181,12 @@ static LIST_HEAD(pcpu_map_extend_chunks);
  */
 int pcpu_nr_empty_pop_pages[PCPU_NR_CHUNK_TYPES];
 
+/*
+ * List of chunks with a lot of free pages. Used to depopulate them
+ * asynchronously.
+ */
+static LIST_HEAD(pcpu_depopulate_list);
+
 /*
  * The number of populated pages in use by the allocator, protected by
  * pcpu_lock.  This number is kept per a unit per chunk (i.e. when a page gets
@@ -542,7 +548,7 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, 
int oslot)
 {
int nslot = pcpu_chunk_slot(chunk);
 
-   if (oslot != nslot)
+   if (!chunk->isolated && oslot != nslot)
__pcpu_chunk_move(chunk, nslot, oslot < nslot);
 }
 
@@ -2048,6 +2054,82 @@ static void pcpu_grow_populated(enum pcpu_chunk_type 
type, int nr_to_pop)
}
 }
 
+/**
+ * pcpu_shrink_populated - scan chunks and release unused pages to the system
+ * @type: chunk type
+ *
+ * Scan over all chunks, find those marked with the depopulate flag and
+ * try to release unused pages to the system. On every attempt clear the
+ * chunk's depopulate flag to avoid wasting CPU by scanning the same
+ * chunk again and again.
+ */
+static void pcpu_shrink_populated(enum pcpu_chunk_type type)
+{
+   struct pcpu_block_md *block;
+   struct pcpu_chunk *chunk, *tmp;
+   LIST_HEAD(to_depopulate);
+   int i, start;
+
+	spin_lock_irq(&pcpu_lock);
+
+	list_splice_init(&pcpu_depopulate_list, &to_depopulate);
+
+	list_for_each_entry_safe(chunk, tmp, &to_depopulate, list) {
+   WARN_ON(chunk->immutable);
+
+   for (i = 0, start = -1; i < chunk->nr_pages; i++) {
+   /*
+* If the chunk has no empty pages or
+* we're short on empty pages in general,
+* just put the chunk back into the original slot.
+*/
+   if (!chunk->nr_empty_pop_pages ||
+   pcpu_nr_empty_pop_pages[type] <
+   PCPU_EMPTY_POP_PAGES_HIGH)
+   break;
+
+   /*
+   

[PATCH v1 4/5] percpu: fix a comment about the chunks ordering

2021-04-01 Thread Roman Gushchin
Since the commit 3e54097beb22 ("percpu: manage chunks based on
contig_bits instead of free_bytes") chunks are sorted based on the
size of the biggest continuous free area instead of the total number
of free bytes. Update the corresponding comment to reflect this.

Signed-off-by: Roman Gushchin 
---
 mm/percpu.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 25a181328353..e20119668c42 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -99,7 +99,10 @@
 
 #include "percpu-internal.h"
 
-/* the slots are sorted by free bytes left, 1-31 bytes share the same slot */
+/*
+ * The slots are sorted by the size of the biggest continuous free area.
+ * 1-31 bytes share the same slot.
+ */
 #define PCPU_SLOT_BASE_SHIFT   5
 /* chunks in slots below this are subject to being sidelined on failed alloc */
 #define PCPU_SLOT_FAIL_THRESHOLD   3
-- 
2.30.2



[PATCH v1 2/5] percpu: make pcpu_nr_empty_pop_pages per chunk type

2021-04-01 Thread Roman Gushchin
nr_empty_pop_pages is used to guarantee that there are some free
populated pages to satisfy atomic allocations. Accounted and
non-accounted allocations are using separate sets of chunks,
so both need to have a surplus of empty pages.

This commit makes pcpu_nr_empty_pop_pages and the corresponding logic
per chunk type.

Signed-off-by: Roman Gushchin 
---
 mm/percpu-internal.h |  2 +-
 mm/percpu-stats.c|  9 +++--
 mm/percpu.c  | 14 +++---
 3 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 18b768ac7dca..095d7eaa0db4 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -87,7 +87,7 @@ extern spinlock_t pcpu_lock;
 
 extern struct list_head *pcpu_chunk_lists;
 extern int pcpu_nr_slots;
-extern int pcpu_nr_empty_pop_pages;
+extern int pcpu_nr_empty_pop_pages[];
 
 extern struct pcpu_chunk *pcpu_first_chunk;
 extern struct pcpu_chunk *pcpu_reserved_chunk;
diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
index c8400a2adbc2..f6026dbcdf6b 100644
--- a/mm/percpu-stats.c
+++ b/mm/percpu-stats.c
@@ -145,6 +145,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
int slot, max_nr_alloc;
int *buffer;
enum pcpu_chunk_type type;
+   int nr_empty_pop_pages;
 
 alloc_buffer:
	spin_lock_irq(&pcpu_lock);
@@ -165,7 +166,11 @@ static int percpu_stats_show(struct seq_file *m, void *v)
goto alloc_buffer;
}
 
-#define PL(X) \
+   nr_empty_pop_pages = 0;
+   for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
+   nr_empty_pop_pages += pcpu_nr_empty_pop_pages[type];
+
+#define PL(X)  \
seq_printf(m, "  %-20s: %12lld\n", #X, (long long int)pcpu_stats_ai.X)
 
seq_printf(m,
@@ -196,7 +201,7 @@ static int percpu_stats_show(struct seq_file *m, void *v)
PU(nr_max_chunks);
PU(min_alloc_size);
PU(max_alloc_size);
-   P("empty_pop_pages", pcpu_nr_empty_pop_pages);
+   P("empty_pop_pages", nr_empty_pop_pages);
seq_putc(m, '\n');
 
 #undef PU
diff --git a/mm/percpu.c b/mm/percpu.c
index 5b505a459028..0eeeb4e7a2f9 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -173,10 +173,10 @@ struct list_head *pcpu_chunk_lists __ro_after_init; /* 
chunk list slots */
 static LIST_HEAD(pcpu_map_extend_chunks);
 
 /*
- * The number of empty populated pages, protected by pcpu_lock.  The
- * reserved chunk doesn't contribute to the count.
+ * The number of empty populated pages by chunk type, protected by pcpu_lock.
+ * The reserved chunk doesn't contribute to the count.
  */
-int pcpu_nr_empty_pop_pages;
+int pcpu_nr_empty_pop_pages[PCPU_NR_CHUNK_TYPES];
 
 /*
  * The number of populated pages in use by the allocator, protected by
@@ -556,7 +556,7 @@ static inline void pcpu_update_empty_pages(struct 
pcpu_chunk *chunk, int nr)
 {
chunk->nr_empty_pop_pages += nr;
if (chunk != pcpu_reserved_chunk)
-   pcpu_nr_empty_pop_pages += nr;
+   pcpu_nr_empty_pop_pages[pcpu_chunk_type(chunk)] += nr;
 }
 
 /*
@@ -1832,7 +1832,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t 
align, bool reserved,
		mutex_unlock(&pcpu_alloc_mutex);
}
 
-   if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_LOW)
+   if (pcpu_nr_empty_pop_pages[type] < PCPU_EMPTY_POP_PAGES_LOW)
pcpu_schedule_balance_work();
 
/* clear the areas and return address relative to base address */
@@ -2010,7 +2010,7 @@ static void pcpu_balance_populated(enum pcpu_chunk_type 
type)
pcpu_atomic_alloc_failed = false;
} else {
nr_to_pop = clamp(PCPU_EMPTY_POP_PAGES_HIGH -
- pcpu_nr_empty_pop_pages,
+ pcpu_nr_empty_pop_pages[type],
  0, PCPU_EMPTY_POP_PAGES_HIGH);
}
 
@@ -2592,7 +2592,7 @@ void __init pcpu_setup_first_chunk(const struct 
pcpu_alloc_info *ai,
 
/* link the first chunk in */
pcpu_first_chunk = chunk;
-   pcpu_nr_empty_pop_pages = pcpu_first_chunk->nr_empty_pop_pages;
+	pcpu_nr_empty_pop_pages[PCPU_CHUNK_ROOT] = pcpu_first_chunk->nr_empty_pop_pages;
pcpu_chunk_relocate(pcpu_first_chunk, -1);
 
/* include all regions of the first chunk */
-- 
2.30.2



[PATCH v1 1/5] percpu: split __pcpu_balance_workfn()

2021-04-01 Thread Roman Gushchin
__pcpu_balance_workfn() became fairly big and hard to follow, but in
fact it consists of two fully independent parts, responsible for
the destruction of excessive free chunks and the population of the necessary
amount of free pages.

In order to simplify the code and prepare for adding new
functionality, split it into two functions:

  1) pcpu_balance_free,
  2) pcpu_balance_populated.

Move the taking/releasing of the pcpu_alloc_mutex to an upper level
to keep the current synchronization in place.

Signed-off-by: Roman Gushchin 
Reviewed-by: Dennis Zhou 
---
 mm/percpu.c | 46 +-
 1 file changed, 29 insertions(+), 17 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 6596a0a4286e..5b505a459028 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1930,31 +1930,22 @@ void __percpu *__alloc_reserved_percpu(size_t size, 
size_t align)
 }
 
 /**
- * __pcpu_balance_workfn - manage the amount of free chunks and populated pages
+ * pcpu_balance_free - manage the amount of free chunks
  * @type: chunk type
  *
- * Reclaim all fully free chunks except for the first one.  This is also
- * responsible for maintaining the pool of empty populated pages.  However,
- * it is possible that this is called when physical memory is scarce causing
- * OOM killer to be triggered.  We should avoid doing so until an actual
- * allocation causes the failure as it is possible that requests can be
- * serviced from already backed regions.
+ * Reclaim all fully free chunks except for the first one.
  */
-static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
+static void pcpu_balance_free(enum pcpu_chunk_type type)
 {
-   /* gfp flags passed to underlying allocators */
-   const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
LIST_HEAD(to_free);
struct list_head *pcpu_slot = pcpu_chunk_list(type);
	struct list_head *free_head = &pcpu_slot[pcpu_nr_slots - 1];
struct pcpu_chunk *chunk, *next;
-   int slot, nr_to_pop, ret;
 
/*
 * There's no reason to keep around multiple unused chunks and VM
 * areas can be scarce.  Destroy all free chunks except for one.
 */
-	mutex_lock(&pcpu_alloc_mutex);
	spin_lock_irq(&pcpu_lock);
 
list_for_each_entry_safe(chunk, next, free_head, list) {
@@ -1982,6 +1973,25 @@ static void __pcpu_balance_workfn(enum pcpu_chunk_type 
type)
pcpu_destroy_chunk(chunk);
cond_resched();
}
+}
+
+/**
+ * pcpu_balance_populated - manage the amount of populated pages
+ * @type: chunk type
+ *
+ * Maintain a certain amount of populated pages to satisfy atomic allocations.
+ * It is possible that this is called when physical memory is scarce causing
+ * OOM killer to be triggered.  We should avoid doing so until an actual
+ * allocation causes the failure as it is possible that requests can be
+ * serviced from already backed regions.
+ */
+static void pcpu_balance_populated(enum pcpu_chunk_type type)
+{
+   /* gfp flags passed to underlying allocators */
+   const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
+   struct list_head *pcpu_slot = pcpu_chunk_list(type);
+   struct pcpu_chunk *chunk;
+   int slot, nr_to_pop, ret;
 
/*
 * Ensure there are certain number of free populated pages for
@@ -2051,22 +2061,24 @@ static void __pcpu_balance_workfn(enum pcpu_chunk_type 
type)
goto retry_pop;
}
}
-
-	mutex_unlock(&pcpu_alloc_mutex);
 }
 
 /**
  * pcpu_balance_workfn - manage the amount of free chunks and populated pages
  * @work: unused
  *
- * Call __pcpu_balance_workfn() for each chunk type.
+ * Call pcpu_balance_free() and pcpu_balance_populated() for each chunk type.
  */
 static void pcpu_balance_workfn(struct work_struct *work)
 {
enum pcpu_chunk_type type;
 
-   for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
-   __pcpu_balance_workfn(type);
+   for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) {
+		mutex_lock(&pcpu_alloc_mutex);
+		pcpu_balance_free(type);
+		pcpu_balance_populated(type);
+		mutex_unlock(&pcpu_alloc_mutex);
+   }
 }
 
 /**
-- 
2.30.2



[PATCH v1 0/5] percpu: partial chunk depopulation

2021-04-01 Thread Roman Gushchin
In our production experience, the percpu memory allocator sometimes struggles
to return memory to the system. A typical example is the creation of
several thousand memory cgroups (each has several chunks of the percpu data
used for vmstats, vmevents, ref counters etc). Deletion and complete releasing
of these cgroups doesn't always lead to a shrinkage of the percpu memory.

The underlying problem is the fragmentation: to release an underlying chunk
all percpu allocations should be released first. The percpu allocator tends
to top up chunks to improve the utilization. It means new small-ish allocations
(e.g. percpu ref counters) are placed onto almost filled old-ish chunks,
effectively pinning them in memory.

This patchset attempts to solve this problem by implementing a partial
depopulation of percpu chunks: chunks with many empty pages are being
asynchronously depopulated and the pages are returned to the system.

To illustrate the problem the following script can be used:

--
#!/bin/bash

cd /sys/fs/cgroup

mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done

sleep 10

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done

rmdir percpu_test
--

It creates 11000 memory cgroups and removes 10 out of every 11 of them.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.

Results:
  vanilla:
./percpu_test.sh
Percpu: 7488 kB
Percpu:   481152 kB
Percpu:   481152 kB

  with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu:   481408 kB
Percpu:   159488 kB

So the total size of the percpu memory was reduced by 3 times.

v2:
  - depopulation heuristics changed and optimized
  - chunks are put into a separate list, depopulation scan this list
  - chunk->isolated is introduced, chunk->depopulate is dropped
  - rearranged patches a bit
  - fixed a panic discovered by krobot
  - made pcpu_nr_empty_pop_pages per chunk type
  - minor fixes

rfc:
  https://lwn.net/Articles/850508/


Roman Gushchin (5):
  percpu: split __pcpu_balance_workfn()
  percpu: make pcpu_nr_empty_pop_pages per chunk type
  percpu: generalize pcpu_balance_populated()
  percpu: fix a comment about the chunks ordering
  percpu: implement partial chunk depopulation

 mm/percpu-internal.h |   3 +-
 mm/percpu-stats.c|   9 +-
 mm/percpu.c  | 219 ++-
 3 files changed, 182 insertions(+), 49 deletions(-)

-- 
2.30.2



Re: [External] Re: [RFC PATCH 00/15] Use obj_cgroup APIs to charge the LRU pages

2021-04-01 Thread Roman Gushchin
On Fri, Apr 02, 2021 at 12:07:42AM +0800, Muchun Song wrote:
> On Wed, Mar 31, 2021 at 11:17 PM Johannes Weiner  wrote:
> >
> > On Tue, Mar 30, 2021 at 03:05:42PM -0700, Roman Gushchin wrote:
> > > On Tue, Mar 30, 2021 at 05:30:10PM -0400, Johannes Weiner wrote:
> > > > On Tue, Mar 30, 2021 at 11:58:31AM -0700, Roman Gushchin wrote:
> > > > > On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote:
> > > > > > On Tue, Mar 30, 2021 at 3:20 AM Muchun Song 
> > > > > >  wrote:
> > > > > > >
> > > > > > > Since the following patchsets applied. All the kernel memory are 
> > > > > > > charged
> > > > > > > with the new APIs of obj_cgroup.
> > > > > > >
> > > > > > > [v17,00/19] The new cgroup slab memory controller
> > > > > > > [v5,0/7] Use obj_cgroup APIs to charge kmem pages
> > > > > > >
> > > > > > > But user memory allocations (LRU pages) pinning memcgs for a long 
> > > > > > > time -
> > > > > > > it exists at a larger scale and is causing recurring problems in 
> > > > > > > the real
> > > > > > > world: page cache doesn't get reclaimed for a long time, or is 
> > > > > > > used by the
> > > > > > > second, third, fourth, ... instance of the same job that was 
> > > > > > > restarted into
> > > > > > > a new cgroup every time. Unreclaimable dying cgroups pile up, 
> > > > > > > waste memory,
> > > > > > > and make page reclaim very inefficient.
> > > > > > >
> > > > > > > We can convert LRU pages and most other raw memcg pins to the 
> > > > > > > objcg direction
> > > > > > > to fix this problem, and then the LRU pages will not pin the 
> > > > > > > memcgs.
> > > > > > >
> > > > > > > This patchset aims to make the LRU pages to drop the reference to 
> > > > > > > memory
> > > > > > > cgroup by using the APIs of obj_cgroup. Finally, we can see that 
> > > > > > > the number
> > > > > > > of the dying cgroups will not increase if we run the following 
> > > > > > > test script.
> > > > > > >
> > > > > > > ```bash
> > > > > > > #!/bin/bash
> > > > > > >
> > > > > > > cat /proc/cgroups | grep memory
> > > > > > >
> > > > > > > cd /sys/fs/cgroup/memory
> > > > > > >
> > > > > > > for i in range{1..500}
> > > > > > > do
> > > > > > > mkdir test
> > > > > > > echo $$ > test/cgroup.procs
> > > > > > > sleep 60 &
> > > > > > > echo $$ > cgroup.procs
> > > > > > > echo `cat test/cgroup.procs` > cgroup.procs
> > > > > > > rmdir test
> > > > > > > done
> > > > > > >
> > > > > > > cat /proc/cgroups | grep memory
> > > > > > > ```
> > > > > > >
> > > > > > > Patch 1 aims to fix page charging in page replacement.
> > > > > > > Patch 2-5 are code cleanup and simplification.
> > > > > > > Patch 6-15 convert LRU pages pin to the objcg direction.
> > > > > >
> > > > > > The main concern I have with *just* reparenting LRU pages is that 
> > > > > > for
> > > > > > the long running systems, the root memcg will become a dumping 
> > > > > > ground.
> > > > > > In addition a job running multiple times on a machine will see
> > > > > > inconsistent memory usage if it re-accesses the file pages which 
> > > > > > were
> > > > > > reparented to the root memcg.
> > > > >
> > > > > I agree, but also the reparenting is not the perfect thing in a 
> > > > > combination
> > > > > with any memory protections (e.g. memory.low).
> > > > >
> > > > > Imagine the following configuration:
> > > > > workload.slice
> > > > > - workload_gen_1.service   memory.m

Re: [PATCH] mm: memcontrol: fix forget to obtain the ref to objcg in split_page_memcg

2021-03-31 Thread Roman Gushchin
On Thu, Apr 01, 2021 at 11:31:16AM +0800, Miaohe Lin wrote:
> On 2021/4/1 11:01, Muchun Song wrote:
> > Christian Borntraeger reported a warning about "percpu ref
> > (obj_cgroup_release) <= 0 (-1) after switching to atomic".
> > Because we forgot to obtain the reference to the objcg and
> > wrongly obtain the reference of memcg.
> > 
> > Reported-by: Christian Borntraeger 
> > Signed-off-by: Muchun Song 
> 
> Thanks for the patch.
> Is a Fixes tag needed?

No, as the original patch hasn't been merged into the Linus's tree yet.
So the fix can be simply squashed.

Btw, the fix looks good to me.

Acked-by: Roman Gushchin 

> 
> > ---
> >  include/linux/memcontrol.h | 6 ++
> >  mm/memcontrol.c| 6 +-
> >  2 files changed, 11 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 0e8907957227..c960fd49c3e8 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -804,6 +804,12 @@ static inline void obj_cgroup_get(struct obj_cgroup 
> > *objcg)
> > 	percpu_ref_get(&objcg->refcnt);
> >  }
> >  
> > +static inline void obj_cgroup_get_many(struct obj_cgroup *objcg,
> > +  unsigned long nr)
> > +{
> > +	percpu_ref_get_many(&objcg->refcnt, nr);
> > +}
> > +
> >  static inline void obj_cgroup_put(struct obj_cgroup *objcg)
> >  {
> > 	percpu_ref_put(&objcg->refcnt);
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index c0b83a396299..64ada9e650a5 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -3133,7 +3133,11 @@ void split_page_memcg(struct page *head, unsigned 
> > int nr)
> >  
> > for (i = 1; i < nr; i++)
> > head[i].memcg_data = head->memcg_data;
> > -	css_get_many(&memcg->css, nr - 1);
> > +
> > +	if (PageMemcgKmem(head))
> > +		obj_cgroup_get_many(__page_objcg(head), nr - 1);
> > +	else
> > +		css_get_many(&memcg->css, nr - 1);
> >  }
> >  
> >  #ifdef CONFIG_MEMCG_SWAP
> > 
> 


Re: [RFC PATCH 00/15] Use obj_cgroup APIs to charge the LRU pages

2021-03-30 Thread Roman Gushchin
On Tue, Mar 30, 2021 at 05:30:10PM -0400, Johannes Weiner wrote:
> On Tue, Mar 30, 2021 at 11:58:31AM -0700, Roman Gushchin wrote:
> > On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote:
> > > On Tue, Mar 30, 2021 at 3:20 AM Muchun Song  
> > > wrote:
> > > >
> > > > Since the following patchsets applied. All the kernel memory are charged
> > > > with the new APIs of obj_cgroup.
> > > >
> > > > [v17,00/19] The new cgroup slab memory controller
> > > > [v5,0/7] Use obj_cgroup APIs to charge kmem pages
> > > >
> > > > But user memory allocations (LRU pages) pinning memcgs for a long time -
> > > > it exists at a larger scale and is causing recurring problems in the 
> > > > real
> > > > world: page cache doesn't get reclaimed for a long time, or is used by 
> > > > the
> > > > second, third, fourth, ... instance of the same job that was restarted 
> > > > into
> > > > a new cgroup every time. Unreclaimable dying cgroups pile up, waste 
> > > > memory,
> > > > and make page reclaim very inefficient.
> > > >
> > > > We can convert LRU pages and most other raw memcg pins to the objcg 
> > > > direction
> > > > to fix this problem, and then the LRU pages will not pin the memcgs.
> > > >
> > > > This patchset aims to make the LRU pages to drop the reference to memory
> > > > cgroup by using the APIs of obj_cgroup. Finally, we can see that the 
> > > > number
> > > > of the dying cgroups will not increase if we run the following test 
> > > > script.
> > > >
> > > > ```bash
> > > > #!/bin/bash
> > > >
> > > > cat /proc/cgroups | grep memory
> > > >
> > > > cd /sys/fs/cgroup/memory
> > > >
> > > > for i in range{1..500}
> > > > do
> > > > mkdir test
> > > > echo $$ > test/cgroup.procs
> > > > sleep 60 &
> > > > echo $$ > cgroup.procs
> > > > echo `cat test/cgroup.procs` > cgroup.procs
> > > > rmdir test
> > > > done
> > > >
> > > > cat /proc/cgroups | grep memory
> > > > ```
> > > >
> > > > Patch 1 aims to fix page charging in page replacement.
> > > > Patch 2-5 are code cleanup and simplification.
> > > > Patch 6-15 convert LRU pages pin to the objcg direction.
> > > 
> > > The main concern I have with *just* reparenting LRU pages is that for
> > > the long running systems, the root memcg will become a dumping ground.
> > > In addition a job running multiple times on a machine will see
> > > inconsistent memory usage if it re-accesses the file pages which were
> > > reparented to the root memcg.
> > 
> > I agree, but also the reparenting is not the perfect thing in a combination
> > with any memory protections (e.g. memory.low).
> > 
> > Imagine the following configuration:
> > workload.slice
> > - workload_gen_1.service   memory.min = 30G
> > - workload_gen_2.service   memory.min = 30G
> > - workload_gen_3.service   memory.min = 30G
> >   ...
> > 
> > Parent cgroup and several generations of the child cgroup, protected by a 
> > memory.low.
> > Once the memory is getting reparented, it's not protected anymore.
> 
> That doesn't sound right.
> 
> A deleted cgroup today exerts no control over its abandoned
> pages. css_reset() will blow out any control settings.

I know. Currently it works in the following way: once cgroup gen_1 is deleted,
its memory is not protected anymore, so eventually it's getting evicted and
re-faulted as gen_2 (or gen_N) memory. Muchun's patchset doesn't change this,
of course. But long-term we likely wanna re-charge such pages to new cgroups
and avoid unnecessary evictions and re-faults. Switching to obj_cgroups doesn't
help and likely will complicate this change. So I'm a bit skeptical here.

Also, in my experience the pagecache is not the main/worst memcg reference
holder (writeback structures are way worse). Pages are relatively large
(in comparison to some slab objects), and rarely there is only one page pinning
a separate memcg. And switching to obj_cgroup doesn't completely eliminate
the problem: we just switch from accumulating larger mem_cgroups to accumulating
smaller obj_cgroups.

With all this said, I'm not necessarily opposing the patchset, but it's
necessary to discuss how it fits into the long-term picture

Re: [RFC PATCH 00/15] Use obj_cgroup APIs to charge the LRU pages

2021-03-30 Thread Roman Gushchin
On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote:
> On Tue, Mar 30, 2021 at 3:20 AM Muchun Song  wrote:
> >
> > Since the following patchsets applied. All the kernel memory are charged
> > with the new APIs of obj_cgroup.
> >
> > [v17,00/19] The new cgroup slab memory controller
> > [v5,0/7] Use obj_cgroup APIs to charge kmem pages
> >
> > But user memory allocations (LRU pages) pinning memcgs for a long time -
> > it exists at a larger scale and is causing recurring problems in the real
> > world: page cache doesn't get reclaimed for a long time, or is used by the
> > second, third, fourth, ... instance of the same job that was restarted into
> > a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
> > and make page reclaim very inefficient.
> >
> > We can convert LRU pages and most other raw memcg pins to the objcg 
> > direction
> > to fix this problem, and then the LRU pages will not pin the memcgs.
> >
> > This patchset aims to make the LRU pages to drop the reference to memory
> > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
> > of the dying cgroups will not increase if we run the following test script.
> >
> > ```bash
> > #!/bin/bash
> >
> > cat /proc/cgroups | grep memory
> >
> > cd /sys/fs/cgroup/memory
> >
> > for i in range{1..500}
> > do
> > mkdir test
> > echo $$ > test/cgroup.procs
> > sleep 60 &
> > echo $$ > cgroup.procs
> > echo `cat test/cgroup.procs` > cgroup.procs
> > rmdir test
> > done
> >
> > cat /proc/cgroups | grep memory
> > ```
> >
> > Patch 1 aims to fix page charging in page replacement.
> > Patch 2-5 are code cleanup and simplification.
> > Patch 6-15 convert LRU pages pin to the objcg direction.
> 
> The main concern I have with *just* reparenting LRU pages is that for
> the long running systems, the root memcg will become a dumping ground.
> In addition a job running multiple times on a machine will see
> inconsistent memory usage if it re-accesses the file pages which were
> reparented to the root memcg.

I agree, but also the reparenting is not the perfect thing in a combination
with any memory protections (e.g. memory.low).

Imagine the following configuration:
workload.slice
- workload_gen_1.service   memory.min = 30G
- workload_gen_2.service   memory.min = 30G
- workload_gen_3.service   memory.min = 30G
  ...

Parent cgroup and several generations of the child cgroup, protected by a 
memory.low.
Once the memory is getting reparented, it's not protected anymore.

I guess we need something smarter: e.g. reassign a page to a different
cgroup if the page is activated/rotated and is currently on a dying lru.
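
Purely as a sketch of that idea (recharge_lru_page() is hypothetical and
doesn't exist; reference handling and locking are omitted):

/* hypothetical hook called when a page is activated or rotated */
static void maybe_recharge_on_rotate(struct page *page)
{
	struct mem_cgroup *memcg = page_memcg(page);

	/* still charged to an offlined ("dying") memcg? */
	if (memcg && !mem_cgroup_online(memcg))
		recharge_lru_page(page, get_mem_cgroup_from_mm(current->mm));
}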

Also, I'm somewhat concerned about the interaction of the reparenting
with the writeback and dirty throttling. How does it work together?

Thanks!


Re: [PATCH v2 1/8] mm/cma: change cma mutex to irq safe spinlock

2021-03-29 Thread Roman Gushchin
On Mon, Mar 29, 2021 at 04:23:55PM -0700, Mike Kravetz wrote:
> Ideally, cma_release could be called from any context.  However, that is
> not possible because a mutex is used to protect the per-area bitmap.
> Change the per-area bitmap mutex to an irq safe spinlock.
> 
> Signed-off-by: Mike Kravetz 

Acked-by: Roman Gushchin 

Thanks!

> ---
>  mm/cma.c   | 20 +++-
>  mm/cma.h   |  2 +-
>  mm/cma_debug.c | 10 ++
>  3 files changed, 18 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/cma.c b/mm/cma.c
> index b2393b892d3b..80875fd4487b 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -24,7 +24,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -83,13 +82,14 @@ static void cma_clear_bitmap(struct cma *cma, unsigned 
> long pfn,
>unsigned int count)
>  {
>   unsigned long bitmap_no, bitmap_count;
> + unsigned long flags;
>  
>   bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
>   bitmap_count = cma_bitmap_pages_to_bits(cma, count);
>  
> -	mutex_lock(&cma->lock);
> +	spin_lock_irqsave(&cma->lock, flags);
>   	bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
> -	mutex_unlock(&cma->lock);
> +	spin_unlock_irqrestore(&cma->lock, flags);
>  }
>  
>  static void __init cma_activate_area(struct cma *cma)
> @@ -118,7 +118,7 @@ static void __init cma_activate_area(struct cma *cma)
>pfn += pageblock_nr_pages)
>   init_cma_reserved_pageblock(pfn_to_page(pfn));
>  
> - mutex_init(&cma->lock);
> + spin_lock_init(&cma->lock);
>  
>  #ifdef CONFIG_CMA_DEBUGFS
>   INIT_HLIST_HEAD(&cma->mem_head);
> @@ -391,8 +391,9 @@ static void cma_debug_show_areas(struct cma *cma)
>   unsigned long start = 0;
>   unsigned long nr_part, nr_total = 0;
>   unsigned long nbits = cma_bitmap_maxno(cma);
> + unsigned long flags;
>  
> - mutex_lock(&cma->lock);
> + spin_lock_irqsave(&cma->lock, flags);
>   pr_info("number of available pages: ");
>   for (;;) {
>   next_zero_bit = find_next_zero_bit(cma->bitmap, nbits, start);
> @@ -407,7 +408,7 @@ static void cma_debug_show_areas(struct cma *cma)
>   start = next_zero_bit + nr_zero;
>   }
>   pr_cont("=> %lu free of %lu total pages\n", nr_total, cma->count);
> - mutex_unlock(&cma->lock);
> + spin_unlock_irqrestore(&cma->lock, flags);
>  }
>  #else
>  static inline void cma_debug_show_areas(struct cma *cma) { }
> @@ -430,6 +431,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
> unsigned int align,
>   unsigned long pfn = -1;
>   unsigned long start = 0;
>   unsigned long bitmap_maxno, bitmap_no, bitmap_count;
> + unsigned long flags;
>   size_t i;
>   struct page *page = NULL;
>   int ret = -ENOMEM;
> @@ -454,12 +456,12 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
> unsigned int align,
>   goto out;
>  
>   for (;;) {
> - mutex_lock(&cma->lock);
> + spin_lock_irqsave(&cma->lock, flags);
>   bitmap_no = bitmap_find_next_zero_area_off(cma->bitmap,
>   bitmap_maxno, start, bitmap_count, mask,
>   offset);
>   if (bitmap_no >= bitmap_maxno) {
> - mutex_unlock(&cma->lock);
> + spin_unlock_irqrestore(&cma->lock, flags);
>   break;
>   }
>   bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
> @@ -468,7 +470,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
> unsigned int align,
>* our exclusive use. If the migration fails we will take the
>* lock again and unmark it.
>*/
> - mutex_unlock(&cma->lock);
> + spin_unlock_irqrestore(&cma->lock, flags);
>  
>   pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
>   ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
> diff --git a/mm/cma.h b/mm/cma.h
> index 68ffad4e430d..2c775877eae2 100644
> --- a/mm/cma.h
> +++ b/mm/cma.h
> @@ -15,7 +15,7 @@ struct cma {
>   unsigned long   count;
>   unsigned long   *bitmap;
>   unsigned int order_per_bit; /* Order of pages represented by one bit */
> - struct mutex    lock;
> + spinlock_t  lock;
>  #ifdef CONFIG_CMA_DEBUGFS
>   struct hlist_head mem_head;
>   spinlock_t mem_head_lock;
> diff --git a/mm/cma_debug.c b/mm/cma_debug.c
> index d5bf8aa34fdc..6379cfbfd568 100644
> --- a/mm/cma_debug.c

Re: [PATCH v2 2/8] hugetlb: no need to drop hugetlb_lock to call cma_release

2021-03-29 Thread Roman Gushchin
On Mon, Mar 29, 2021 at 04:23:56PM -0700, Mike Kravetz wrote:
> Now that cma_release is non-blocking and irq safe, there is no need to
> drop hugetlb_lock before calling.
> 
> Signed-off-by: Mike Kravetz 
> ---
>  mm/hugetlb.c | 6 --
>  1 file changed, 6 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3c3e4baa4156..1d62f0492e7b 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1353,14 +1353,8 @@ static void update_and_free_page(struct hstate *h, 
> struct page *page)
>   set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
>   set_page_refcounted(page);
>   if (hstate_is_gigantic(h)) {
> - /*
> -  * Temporarily drop the hugetlb_lock, because
> -  * we might block in free_gigantic_page().
> -  */
> - spin_unlock(&hugetlb_lock);
>   destroy_compound_gigantic_page(page, huge_page_order(h));
>   free_gigantic_page(page, huge_page_order(h));
> - spin_lock(&hugetlb_lock);
>   } else {
>   __free_pages(page, huge_page_order(h));
>   }
> -- 
> 2.30.2
> 

Acked-by: Roman Gushchin 

Thanks!


Re: [PATCH rfc 3/4] percpu: on demand chunk depopulation

2021-03-29 Thread Roman Gushchin
On Mon, Mar 29, 2021 at 11:12:34PM +, Dennis Zhou wrote:
> On Mon, Mar 29, 2021 at 01:10:10PM -0700, Roman Gushchin wrote:
> > On Mon, Mar 29, 2021 at 07:21:24PM +, Dennis Zhou wrote:
> > > On Wed, Mar 24, 2021 at 12:06:25PM -0700, Roman Gushchin wrote:
> > > > To return unused memory to the system schedule an async
> > > > depopulation of percpu chunks.
> > > > 
> > > > To balance between scanning too much and creating an overhead because
> > > > of the pcpu_lock contention and scanning not enough, let's track an
> > > > amount of chunks to scan and mark chunks which are potentially a good
> > > > target for the depopulation with a new boolean flag.  The async
> > > > depopulation work will clear the flag after trying to depopulate a
> > > > chunk (successfully or not).
> > > > 
> > > > This commit suggest the following logic: if a chunk
> > > >   1) has more than 1/4 of total pages free and populated
> > > >   2) isn't a reserved chunk
> > > >   3) isn't entirely free
> > > >   4) isn't alone in the corresponding slot
> > > 
> > > I'm not sure I like the check for alone that much. The reason being what
> > > about some odd case where each slot has a single chunk, but every slot
> > > is populated. It doesn't really make sense to keep them all around.
> > 
> > Yeah, I agree, I'm not sure either. Maybe we can just look at the total
> > number of populated empty pages and make sure it's not too low and not
> > too high. Btw, we should probably double PCPU_EMPTY_POP_PAGES_LOW/HIGH
> > if memcg accounting is on.
> > 
> 
> Hmmm. pcpu_nr_populated and pcpu_nr_empty_pop_pages should probably be
> per chunk type now that you mention it.
> 
> > > 
> > > I think there is some decision making we can do here to handle packing
> > > post depopulation allocations into a handful of chunks. Depopulated
> > > chunks could be sidelined with say a flag ->depopulated to prevent the
> > > first attempt of allocations from using them. And then we could bring
> > > back a chunk 1 by 1 somehow to attempt to suffice the allocation.
> > > I'm not too sure if this is a good idea, just a thought.
> > 
> > I thought about it in this way: depopulated chunks are no different from
> > new chunks, which are not yet fully populated. And they are naturally
> > de-prioritized by being located in higher slots (and at the tail of the 
> > list).
> > So I'm not sure we should handle them any differently.
> > 
> 
> I'm thinking of the following. Imagine 3 chunks, A and B in slot X, and
> C in slot X+1. If B gets depopulated followed by A getting exhausted,
> which chunk B or C should be used? If C is fully populated, we might
> want to use that one.
> 
> I see that the priority is chunks at the very end, but I don't want to
> take something that doesn't reasonable generalize to any slot PAGE_SIZE
> and up. Or it should explicitly try to tackle only say the last N slots
> (but preferably the former).
> 
> > > 
> > > > it's a good target for depopulation.
> > > > 
> > > > If there are 2 or more of such chunks, an async depopulation
> > > > is scheduled.
> > > > 
> > > > Because chunk population and depopulation are opposite processes
> > > > which make a little sense together, split out the shrinking part of
> > > > pcpu_balance_populated() into pcpu_grow_populated() and make
> > > > pcpu_balance_populated() calling into pcpu_grow_populated() or
> > > > pcpu_shrink_populated() conditionally.
> > > > 
> > > > Signed-off-by: Roman Gushchin 
> > > > ---
> > > >  mm/percpu-internal.h |   1 +
> > > >  mm/percpu.c  | 111 ---
> > > >  2 files changed, 85 insertions(+), 27 deletions(-)
> > > > 
> > > > diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
> > > > index 18b768ac7dca..1c5b92af02eb 100644
> > > > --- a/mm/percpu-internal.h
> > > > +++ b/mm/percpu-internal.h
> > > > @@ -67,6 +67,7 @@ struct pcpu_chunk {
> > > >  
> > > > void*data;  /* chunk data */
> > > > boolimmutable;  /* no [de]population 
> > > > allowed */
> > > > +   booldepopulate; /* depopulation hint */
> > > > int start_offset;   /* the overl

Re: [PATCH rfc 3/4] percpu: on demand chunk depopulation

2021-03-29 Thread Roman Gushchin
On Mon, Mar 29, 2021 at 07:21:24PM +, Dennis Zhou wrote:
> On Wed, Mar 24, 2021 at 12:06:25PM -0700, Roman Gushchin wrote:
> > To return unused memory to the system schedule an async
> > depopulation of percpu chunks.
> > 
> > To balance between scanning too much and creating an overhead because
> > of the pcpu_lock contention and scanning not enough, let's track an
> > amount of chunks to scan and mark chunks which are potentially a good
> > target for the depopulation with a new boolean flag.  The async
> > depopulation work will clear the flag after trying to depopulate a
> > chunk (successfully or not).
> > 
> > This commit suggests the following logic: if a chunk
> >   1) has more than 1/4 of total pages free and populated
> >   2) isn't a reserved chunk
> >   3) isn't entirely free
> >   4) isn't alone in the corresponding slot
> 
> I'm not sure I like the check for alone that much. The reason being what
> about some odd case where each slot has a single chunk, but every slot
> is populated. It doesn't really make sense to keep them all around.

Yeah, I agree, I'm not sure either. Maybe we can just look at the total
number of populated empty pages and make sure it's not too low and not
too high. Btw, we should probably double PCPU_EMPTY_POP_PAGES_LOW/HIGH
if memcg accounting is on.
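
Something like this, as a rough illustration (IIRC the current values are
2 and 4; whether to key it off CONFIG_MEMCG_KMEM or the number of chunk
types is an open question):

#ifdef CONFIG_MEMCG_KMEM
#define PCPU_EMPTY_POP_PAGES_LOW	(2 * 2)
#define PCPU_EMPTY_POP_PAGES_HIGH	(4 * 2)
#else
#define PCPU_EMPTY_POP_PAGES_LOW	2
#define PCPU_EMPTY_POP_PAGES_HIGH	4
#endif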

> 
> I think there is some decision making we can do here to handle packing
> post depopulation allocations into a handful of chunks. Depopulated
> chunks could be sidelined with say a flag ->depopulated to prevent the
> first attempt of allocations from using them. And then we could bring
> back a chunk 1 by 1 somehow to attempt to suffice the allocation.
> I'm not too sure if this is a good idea, just a thought.

I thought about it in this way: depopulated chunks are no different from
new chunks, which are not yet fully populated. And they are naturally
de-prioritized by being located in higher slots (and at the tail of the list).
So I'm not sure we should handle them any differently.

> 
> > it's a good target for depopulation.
> > 
> > If there are 2 or more of such chunks, an async depopulation
> > is scheduled.
> > 
> > Because chunk population and depopulation are opposite processes
> > which make a little sense together, split out the shrinking part of
> > pcpu_balance_populated() into pcpu_grow_populated() and make
> > pcpu_balance_populated() calling into pcpu_grow_populated() or
> > pcpu_shrink_populated() conditionally.
> > 
> > Signed-off-by: Roman Gushchin 
> > ---
> >  mm/percpu-internal.h |   1 +
> >  mm/percpu.c  | 111 ---
> >  2 files changed, 85 insertions(+), 27 deletions(-)
> > 
> > diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
> > index 18b768ac7dca..1c5b92af02eb 100644
> > --- a/mm/percpu-internal.h
> > +++ b/mm/percpu-internal.h
> > @@ -67,6 +67,7 @@ struct pcpu_chunk {
> >  
> > void*data;  /* chunk data */
> > boolimmutable;  /* no [de]population allowed */
> > +   booldepopulate; /* depopulation hint */
> > int start_offset;   /* the overlap with the previous
> >region to have a page aligned
> >base_addr */
> > diff --git a/mm/percpu.c b/mm/percpu.c
> > index 015d076893f5..148137f0fc0b 100644
> > --- a/mm/percpu.c
> > +++ b/mm/percpu.c
> > @@ -178,6 +178,12 @@ static LIST_HEAD(pcpu_map_extend_chunks);
> >   */
> >  int pcpu_nr_empty_pop_pages;
> >  
> > +/*
> > + * Track the number of chunks with a lot of free memory.
> > + * It's used to release unused pages to the system.
> > + */
> > +static int pcpu_nr_chunks_to_depopulate;
> > +
> >  /*
> >   * The number of populated pages in use by the allocator, protected by
> >   * pcpu_lock.  This number is kept per a unit per chunk (i.e. when a page 
> > gets
> > @@ -1955,6 +1961,11 @@ static void pcpu_balance_free(enum pcpu_chunk_type 
> > type)
> > if (chunk == list_first_entry(free_head, struct pcpu_chunk, 
> > list))
> > continue;
> >  
> > +   if (chunk->depopulate) {
> > +   chunk->depopulate = false;
> > +   pcpu_nr_chunks_to_depopulate--;
> > +   }
> > +
> > list_move(&chunk->list, &to_free);
> > }
> >  
> > @@ -1976,7 +1987,7 @@ static void pcpu_balance_free(enum pcpu_chunk_type 

Re: [PATCH rfc 1/4] percpu: implement partial chunk depopulation

2021-03-29 Thread Roman Gushchin
On Mon, Mar 29, 2021 at 07:28:49PM +, Dennis Zhou wrote:
> On Mon, Mar 29, 2021 at 11:29:22AM -0700, Roman Gushchin wrote:
> > On Mon, Mar 29, 2021 at 05:20:55PM +, Dennis Zhou wrote:
> > > On Wed, Mar 24, 2021 at 12:06:23PM -0700, Roman Gushchin wrote:
> > > > This patch implements partial depopulation of percpu chunks.
> > > > 
> > > > As now, a chunk can be depopulated only as a part of the final
> > > > destruction, when there are no more outstanding allocations. However
> > > > to minimize a memory waste, it might be useful to depopulate a
> > > > partially filed chunk, if a small number of outstanding allocations
> > > > prevents the chunk from being reclaimed.
> > > > 
> > > > This patch implements the following depopulation process: it scans
> > > > over the chunk pages, looks for a range of empty and populated pages
> > > > and performs the depopulation. To avoid races with new allocations,
> > > > the chunk is previously isolated. After the depopulation the chunk is
> > > > returned to the original slot (but is appended to the tail of the list
> > > > to minimize the chances of population).
> > > > 
> > > > Because the pcpu_lock is dropped while calling pcpu_depopulate_chunk(),
> > > > the chunk can be concurrently moved to a different slot. So we need
> > > > to isolate it again on each step. pcpu_alloc_mutex is held, so the
> > > > chunk can't be populated/depopulated asynchronously.
> > > > 
> > > > Signed-off-by: Roman Gushchin 
> > > > ---
> > > >  mm/percpu.c | 90 +
> > > >  1 file changed, 90 insertions(+)
> > > > 
> > > > diff --git a/mm/percpu.c b/mm/percpu.c
> > > > index 6596a0a4286e..78c55c73fa28 100644
> > > > --- a/mm/percpu.c
> > > > +++ b/mm/percpu.c
> > > > @@ -2055,6 +2055,96 @@ static void __pcpu_balance_workfn(enum 
> > > > pcpu_chunk_type type)
> > > > mutex_unlock(_alloc_mutex);
> > > >  }
> > > >  
> > > > +/**
> > > > + * pcpu_shrink_populated - scan chunks and release unused pages to the 
> > > > system
> > > > + * @type: chunk type
> > > > + *
> > > > + * Scan over all chunks, find those marked with the depopulate flag and
> > > > + * try to release unused pages to the system. On every attempt clear 
> > > > the
> > > > + * chunk's depopulate flag to avoid wasting CPU by scanning the same
> > > > + * chunk again and again.
> > > > + */
> > > > +static void pcpu_shrink_populated(enum pcpu_chunk_type type)
> > > > +{
> > > > +   struct list_head *pcpu_slot = pcpu_chunk_list(type);
> > > > +   struct pcpu_chunk *chunk;
> > > > +   int slot, i, off, start;
> > > > +
> > > > +   spin_lock_irq(_lock);
> > > > +   for (slot = pcpu_nr_slots - 1; slot >= 0; slot--) {
> > > > +restart:
> > > > +   list_for_each_entry(chunk, _slot[slot], list) {
> > > > +   bool isolated = false;
> > > > +
> > > > +   if (pcpu_nr_empty_pop_pages < 
> > > > PCPU_EMPTY_POP_PAGES_HIGH)
> > > > +   break;
> > > > +
> > > 
> > > Deallocation makes me a little worried for the atomic case as now we
> > > could in theory pathologically scan deallocated chunks before finding a
> > > populated one.
> > > 
> > > I wonder if we should do something like once a chunk gets depopulated,
> > > it gets deprioritized and then only once we exhaust looking through
> > > allocated chunks we then find a depopulated chunk and add it back into
> > > the rotation. Possibly just add another set of slots? I guess it adds a
> > > few dimensions to pcpu_slots after the memcg change.
> > 
> > Please, take a look at patch 3 in the series ("percpu: on demand chunk 
> > depopulation").
> > Chunks considered to be a good target for the depopulation are in advance
> > marked with a special flag, so we'll actually try to depopulate only
> > few chunks at once. While the total number of chunks is fairly low,
> > I think it should work.
> > 
> > Another option is to link all such chunks into a list and scan over it,
> > instead of iterating over all slots.

Re: [PATCH rfc 1/4] percpu: implement partial chunk depopulation

2021-03-29 Thread Roman Gushchin
On Mon, Mar 29, 2021 at 05:20:55PM +, Dennis Zhou wrote:
> On Wed, Mar 24, 2021 at 12:06:23PM -0700, Roman Gushchin wrote:
> > This patch implements partial depopulation of percpu chunks.
> > 
> > As of now, a chunk can be depopulated only as a part of the final
> > destruction, when there are no more outstanding allocations. However,
> > to minimize memory waste, it might be useful to depopulate a
> > partially filled chunk, if a small number of outstanding allocations
> > prevents the chunk from being reclaimed.
> > 
> > This patch implements the following depopulation process: it scans
> > over the chunk pages, looks for a range of empty and populated pages
> > and performs the depopulation. To avoid races with new allocations,
> > the chunk is previously isolated. After the depopulation the chunk is
> > returned to the original slot (but is appended to the tail of the list
> > to minimize the chances of population).
> > 
> > Because the pcpu_lock is dropped while calling pcpu_depopulate_chunk(),
> > the chunk can be concurrently moved to a different slot. So we need
> > to isolate it again on each step. pcpu_alloc_mutex is held, so the
> > chunk can't be populated/depopulated asynchronously.
> > 
> > Signed-off-by: Roman Gushchin 
> > ---
> >  mm/percpu.c | 90 +
> >  1 file changed, 90 insertions(+)
> > 
> > diff --git a/mm/percpu.c b/mm/percpu.c
> > index 6596a0a4286e..78c55c73fa28 100644
> > --- a/mm/percpu.c
> > +++ b/mm/percpu.c
> > @@ -2055,6 +2055,96 @@ static void __pcpu_balance_workfn(enum 
> > pcpu_chunk_type type)
> > mutex_unlock(&pcpu_alloc_mutex);
> >  }
> >  
> > +/**
> > + * pcpu_shrink_populated - scan chunks and release unused pages to the 
> > system
> > + * @type: chunk type
> > + *
> > + * Scan over all chunks, find those marked with the depopulate flag and
> > + * try to release unused pages to the system. On every attempt clear the
> > + * chunk's depopulate flag to avoid wasting CPU by scanning the same
> > + * chunk again and again.
> > + */
> > +static void pcpu_shrink_populated(enum pcpu_chunk_type type)
> > +{
> > +   struct list_head *pcpu_slot = pcpu_chunk_list(type);
> > +   struct pcpu_chunk *chunk;
> > +   int slot, i, off, start;
> > +
> > +   spin_lock_irq(&pcpu_lock);
> > +   for (slot = pcpu_nr_slots - 1; slot >= 0; slot--) {
> > +restart:
> > +   list_for_each_entry(chunk, &pcpu_slot[slot], list) {
> > +   bool isolated = false;
> > +
> > +   if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_HIGH)
> > +   break;
> > +
> 
> Deallocation makes me a little worried for the atomic case as now we
> could in theory pathologically scan deallocated chunks before finding a
> populated one.
> 
> I wonder if we should do something like once a chunk gets depopulated,
> it gets deprioritized and then only once we exhaust looking through
> allocated chunks we then find a depopulated chunk and add it back into
> the rotation. Possibly just add another set of slots? I guess it adds a
> few dimensions to pcpu_slots after the memcg change.

Please, take a look at patch 3 in the series ("percpu: on demand chunk 
depopulation").
Chunks considered to be a good target for the depopulation are in advance
marked with a special flag, so we'll actually try to depopulate only
few chunks at once. While the total number of chunks is fairly low,
I think it should work.

Another option is to link all such chunks into a list and scan over it,
instead of iterating over all slots.

Adding new dimensions to pcpu_slots is an option too, but I hope we can avoid
this, as it would complicate the code.
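
For the list idea above, a minimal sketch (the depop_list member and the
helper are hypothetical, nothing like this exists in the current version):

/* Chunks flagged for depopulation, walked by the async work. */
static LIST_HEAD(pcpu_depopulate_list);		/* protected by pcpu_lock */

static void pcpu_mark_for_depopulation(struct pcpu_chunk *chunk)
{
	lockdep_assert_held(&pcpu_lock);

	if (!chunk->depopulate) {
		chunk->depopulate = true;
		list_add_tail(&chunk->depop_list, &pcpu_depopulate_list);
	}
}

The async work would then scan only this list instead of iterating over
all slots.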

> 
> > +   for (i = 0, start = -1; i < chunk->nr_pages; i++) {
> > +   if (!chunk->nr_empty_pop_pages)
> > +   break;
> > +
> > +   /*
> > +* If the page is empty and populated, start or
> > +* extend the [start, i) range.
> > +*/
> > +   if (test_bit(i, chunk->populated)) {
> > +   off = find_first_bit(
> > +   pcpu_index_alloc_map(chunk, i),
> > +   PCPU_BITMAP_BLOCK_BITS);
> > +   if (off >= PCPU_BITMAP_BLOCK_BITS) {
> > + 

Re: [PATCH rfc 2/4] percpu: split __pcpu_balance_workfn()

2021-03-29 Thread Roman Gushchin
On Mon, Mar 29, 2021 at 05:28:23PM +, Dennis Zhou wrote:
> On Wed, Mar 24, 2021 at 12:06:24PM -0700, Roman Gushchin wrote:
> > __pcpu_balance_workfn() became fairly big and hard to follow, but in
> > fact it consists of two fully independent parts, responsible for
> > the destruction of excessive free chunks and the population of the
> > necessary amount of free pages.
> > 
> > In order to simplify the code and prepare for adding new
> > functionality, split it into two functions:
> > 
> >   1) pcpu_balance_free,
> >   2) pcpu_balance_populated.
> > 
> > Move the taking/releasing of the pcpu_alloc_mutex to an upper level
> > to keep the current synchronization in place.
> > 
> > Signed-off-by: Roman Gushchin 
> > ---
> >  mm/percpu.c | 46 +-
> >  1 file changed, 29 insertions(+), 17 deletions(-)
> > 
> > diff --git a/mm/percpu.c b/mm/percpu.c
> > index 78c55c73fa28..015d076893f5 100644
> > --- a/mm/percpu.c
> > +++ b/mm/percpu.c
> > @@ -1930,31 +1930,22 @@ void __percpu *__alloc_reserved_percpu(size_t size, 
> > size_t align)
> >  }
> >  
> >  /**
> > - * __pcpu_balance_workfn - manage the amount of free chunks and populated 
> > pages
> > + * pcpu_balance_free - manage the amount of free chunks
> >   * @type: chunk type
> >   *
> > - * Reclaim all fully free chunks except for the first one.  This is also
> > - * responsible for maintaining the pool of empty populated pages.  However,
> > - * it is possible that this is called when physical memory is scarce 
> > causing
> > - * OOM killer to be triggered.  We should avoid doing so until an actual
> > - * allocation causes the failure as it is possible that requests can be
> > - * serviced from already backed regions.
> > + * Reclaim all fully free chunks except for the first one.
> >   */
> > -static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
> > +static void pcpu_balance_free(enum pcpu_chunk_type type)
> >  {
> > -   /* gfp flags passed to underlying allocators */
> > -   const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
> > LIST_HEAD(to_free);
> > struct list_head *pcpu_slot = pcpu_chunk_list(type);
> > struct list_head *free_head = &pcpu_slot[pcpu_nr_slots - 1];
> > struct pcpu_chunk *chunk, *next;
> > -   int slot, nr_to_pop, ret;
> >  
> > /*
> >  * There's no reason to keep around multiple unused chunks and VM
> >  * areas can be scarce.  Destroy all free chunks except for one.
> >  */
> > -   mutex_lock(&pcpu_alloc_mutex);
> > spin_lock_irq(&pcpu_lock);
> >  
> > list_for_each_entry_safe(chunk, next, free_head, list) {
> > @@ -1982,6 +1973,25 @@ static void __pcpu_balance_workfn(enum 
> > pcpu_chunk_type type)
> > pcpu_destroy_chunk(chunk);
> > cond_resched();
> > }
> > +}
> > +
> > +/**
> > + * pcpu_balance_populated - manage the amount of populated pages
> > + * @type: chunk type
> > + *
> > + * Maintain a certain amount of populated pages to satisfy atomic 
> > allocations.
> > + * It is possible that this is called when physical memory is scarce 
> > causing
> > + * OOM killer to be triggered.  We should avoid doing so until an actual
> > + * allocation causes the failure as it is possible that requests can be
> > + * serviced from already backed regions.
> > + */
> > +static void pcpu_balance_populated(enum pcpu_chunk_type type)
> > +{
> > +   /* gfp flags passed to underlying allocators */
> > +   const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
> > +   struct list_head *pcpu_slot = pcpu_chunk_list(type);
> > +   struct pcpu_chunk *chunk;
> > +   int slot, nr_to_pop, ret;
> >  
> > /*
> >  * Ensure there are certain number of free populated pages for
> > @@ -2051,8 +2061,6 @@ static void __pcpu_balance_workfn(enum 
> > pcpu_chunk_type type)
> > goto retry_pop;
> > }
> > }
> > -
> > -   mutex_unlock(&pcpu_alloc_mutex);
> >  }
> >  
> >  /**
> > @@ -2149,14 +2157,18 @@ static void pcpu_shrink_populated(enum 
> > pcpu_chunk_type type)
> >   * pcpu_balance_workfn - manage the amount of free chunks and populated 
> > pages
> >   * @work: unused
> >   *
> > - * Call __pcpu_balance_workfn() for each chunk type.
> > + * Call pcpu_balance_free() and pcpu_balance_populated() for each chunk 
> > type.
> >   */
> >  static void 

Re: [percpu] 28c9dada65: invoked_oom-killer:gfp_mask=0x

2021-03-29 Thread Roman Gushchin
On Mon, Mar 29, 2021 at 04:37:15PM +0800, kernel test robot wrote:
> 
> 
> Greeting,
> 
> FYI, we noticed the following commit (built with gcc-9):
> 
> commit: 28c9dada655513462112084f5f1769ee49d6fe87 ("[PATCH rfc 3/4] percpu: on 
> demand chunk depopulation")
> url: 
> https://github.com/0day-ci/linux/commits/Roman-Gushchin/percpu-partial-chunk-depopulation/20210325-030746
> base: https://git.kernel.org/cgit/linux/kernel/git/dennis/percpu.git for-next
> 
> in testcase: trinity
> version: trinity-i386-4d2343bd-1_20200320
> with following parameters:
> 
>   group: group-01
> 
> test-description: Trinity is a linux system call fuzz tester.
> test-url: http://codemonkey.org.uk/projects/trinity/ 
> 
> 
> on test machine: qemu-system-i386 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> 
> caused below changes (please refer to attached dmesg/kmsg for entire 
> log/backtrace):
> 
> 
> +----------------------------------------------------------+------------+------------+
> |                                                          | 0a6ed3d793 | 28c9dada65 |
> +----------------------------------------------------------+------------+------------+
> | boot_successes                                           | 6          | 0          |
> | boot_failures                                            | 0          | 6          |
> | invoked_oom-killer:gfp_mask=0x                           | 0          | 6          |
> | Mem-Info                                                 | 0          | 6          |
> | Out_of_memory_and_no_killable_processes                  | 0          | 6          |
> | Kernel_panic-not_syncing:System_is_deadlocked_on_memory  | 0          | 6          |
> +----------------------------------------------------------+------------+------------+

Found and fixed the problem. V2 will include the fix.

Thanks!


Re: [PATCH 1/8] mm: cma: introduce cma_release_nowait()

2021-03-25 Thread Roman Gushchin
On Thu, Mar 25, 2021 at 01:12:51PM -0700, Minchan Kim wrote:
> On Thu, Mar 25, 2021 at 06:15:11PM +0100, David Hildenbrand wrote:
> > On 25.03.21 17:56, Mike Kravetz wrote:
> > > On 3/25/21 3:22 AM, Michal Hocko wrote:
> > > > On Thu 25-03-21 10:56:38, David Hildenbrand wrote:
> > > > > On 25.03.21 01:28, Mike Kravetz wrote:
> > > > > > From: Roman Gushchin 
> > > > > > 
> > > > > > cma_release() has to lock the cma_lock mutex to clear the cma 
> > > > > > bitmap.
> > > > > > It makes it a blocking function, which complicates its usage from
> > > > > > non-blocking contexts. For instance, hugetlbfs code is temporarily
> > > > > > dropping the hugetlb_lock spinlock to call cma_release().
> > > > > > 
> > > > > > This patch introduces a non-blocking cma_release_nowait(), which
> > > > > > postpones the cma bitmap clearance. It's done later from a work
> > > > > > context. The first page in the cma allocation is used to store
> > > > > > the work struct. Because CMA allocations and de-allocations are
> > > > > > usually not that frequent, a single global workqueue is used.
> > > > > > 
> > > > > > To make sure that subsequent cma_alloc() call will pass, cma_alloc()
> > > > > > flushes the cma_release_wq workqueue. To avoid a performance
> > > > > > regression in the case when only cma_release() is used, gate it
> > > > > > by a per-cma area flag, which is set by the first call
> > > > > > of cma_release_nowait().
> > > > > > 
> > > > > > Signed-off-by: Roman Gushchin 
> > > > > > [mike.krav...@oracle.com: rebased to 
> > > > > > v5.12-rc3-mmotm-2021-03-17-22-24]
> > > > > > Signed-off-by: Mike Kravetz 
> > > > > > ---
> > > > > 
> > > > > 
> > > > > 1. Is there a real reason this is a mutex and not a spin lock? It 
> > > > > seems to
> > > > > only protect the bitmap. Are bitmaps that huge that we spend a 
> > > > > significant
> > > > > amount of time in there?
> > > > 
> > > > Good question. Looking at the code it doesn't seem that there is any
> > > > blockable operation or any heavy lifting done under the lock.
> > > > 7ee793a62fa8 ("cma: Remove potential deadlock situation") has introduced
> > > > the lock and there was a simple bitmat protection back then. I suspect
> > > > the patch just followed the cma_mutex lead and used the same type of the
> > > > lock. cma_mutex used to protect alloc_contig_range which is sleepable.
> > > > 
> > > > This all suggests that there is no real reason to use a sleepable lock
> > > > at all and we do not need all this heavy lifting.
> > > > 
> > > 
> > > When Roman first proposed these patches, I brought up the same issue:
> > > 
> > > https://lore.kernel.org/linux-mm/20201022023352.gc300...@carbon.dhcp.thefacebook.com/
> > > 
> > > Previously, Roman proposed replacing the mutex with a spinlock but
> > > Joonsoo was opposed.
> > > 
> > > Adding Joonsoo on Cc:
> > > 
> > 
> > There has to be a good reason not to. And if there is a good reason,
> > lockless clearing might be one feasible alternative.
> 
> I also don't think nowait variant is good idea. If the scanning of
> bitmap is *really* significant, it might be signal that we need to
> introduce different technique or data structure not bitmap rather
> than a new API variant.

I'd also prefer to just replace the mutex with a spinlock rather than fiddling
with a delayed release.

Thanks!


[PATCH rfc 2/4] percpu: split __pcpu_balance_workfn()

2021-03-24 Thread Roman Gushchin
__pcpu_balance_workfn() became fairly big and hard to follow, but in
fact it consists of two fully independent parts, responsible for
the destruction of excessive free chunks and the population of the
necessary amount of free pages.

In order to simplify the code and prepare for adding new
functionality, split it into two functions:

  1) pcpu_balance_free,
  2) pcpu_balance_populated.

Move the taking/releasing of the pcpu_alloc_mutex to an upper level
to keep the current synchronization in place.

Signed-off-by: Roman Gushchin 
---
 mm/percpu.c | 46 +-
 1 file changed, 29 insertions(+), 17 deletions(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 78c55c73fa28..015d076893f5 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1930,31 +1930,22 @@ void __percpu *__alloc_reserved_percpu(size_t size, 
size_t align)
 }
 
 /**
- * __pcpu_balance_workfn - manage the amount of free chunks and populated pages
+ * pcpu_balance_free - manage the amount of free chunks
  * @type: chunk type
  *
- * Reclaim all fully free chunks except for the first one.  This is also
- * responsible for maintaining the pool of empty populated pages.  However,
- * it is possible that this is called when physical memory is scarce causing
- * OOM killer to be triggered.  We should avoid doing so until an actual
- * allocation causes the failure as it is possible that requests can be
- * serviced from already backed regions.
+ * Reclaim all fully free chunks except for the first one.
  */
-static void __pcpu_balance_workfn(enum pcpu_chunk_type type)
+static void pcpu_balance_free(enum pcpu_chunk_type type)
 {
-   /* gfp flags passed to underlying allocators */
-   const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
LIST_HEAD(to_free);
struct list_head *pcpu_slot = pcpu_chunk_list(type);
struct list_head *free_head = &pcpu_slot[pcpu_nr_slots - 1];
struct pcpu_chunk *chunk, *next;
-   int slot, nr_to_pop, ret;
 
/*
 * There's no reason to keep around multiple unused chunks and VM
 * areas can be scarce.  Destroy all free chunks except for one.
 */
-   mutex_lock(&pcpu_alloc_mutex);
spin_lock_irq(&pcpu_lock);
 
list_for_each_entry_safe(chunk, next, free_head, list) {
@@ -1982,6 +1973,25 @@ static void __pcpu_balance_workfn(enum pcpu_chunk_type 
type)
pcpu_destroy_chunk(chunk);
cond_resched();
}
+}
+
+/**
+ * pcpu_balance_populated - manage the amount of populated pages
+ * @type: chunk type
+ *
+ * Maintain a certain amount of populated pages to satisfy atomic allocations.
+ * It is possible that this is called when physical memory is scarce causing
+ * OOM killer to be triggered.  We should avoid doing so until an actual
+ * allocation causes the failure as it is possible that requests can be
+ * serviced from already backed regions.
+ */
+static void pcpu_balance_populated(enum pcpu_chunk_type type)
+{
+   /* gfp flags passed to underlying allocators */
+   const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
+   struct list_head *pcpu_slot = pcpu_chunk_list(type);
+   struct pcpu_chunk *chunk;
+   int slot, nr_to_pop, ret;
 
/*
 * Ensure there are certain number of free populated pages for
@@ -2051,8 +2061,6 @@ static void __pcpu_balance_workfn(enum pcpu_chunk_type 
type)
goto retry_pop;
}
}
-
-   mutex_unlock(&pcpu_alloc_mutex);
 }
 
 /**
@@ -2149,14 +2157,18 @@ static void pcpu_shrink_populated(enum pcpu_chunk_type 
type)
  * pcpu_balance_workfn - manage the amount of free chunks and populated pages
  * @work: unused
  *
- * Call __pcpu_balance_workfn() for each chunk type.
+ * Call pcpu_balance_free() and pcpu_balance_populated() for each chunk type.
  */
 static void pcpu_balance_workfn(struct work_struct *work)
 {
enum pcpu_chunk_type type;
 
-   for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++)
-   __pcpu_balance_workfn(type);
+   for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) {
+   mutex_lock(&pcpu_alloc_mutex);
+   pcpu_balance_free(type);
+   pcpu_balance_populated(type);
+   mutex_unlock(&pcpu_alloc_mutex);
+   }
 }
 
 /**
-- 
2.30.2



[PATCH rfc 4/4] percpu: fix a comment about the chunks ordering

2021-03-24 Thread Roman Gushchin
Since the commit 3e54097beb22 ("percpu: manage chunks based on
contig_bits instead of free_bytes") chunks are sorted based on the
size of the biggest continuous free area instead of the total number
of free bytes. Update the corresponding comment to reflect this.

Signed-off-by: Roman Gushchin 
---
 mm/percpu.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/percpu.c b/mm/percpu.c
index 148137f0fc0b..08fb6e5d3232 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -99,7 +99,10 @@
 
 #include "percpu-internal.h"
 
-/* the slots are sorted by free bytes left, 1-31 bytes share the same slot */
+/*
+ * The slots are sorted by the size of the biggest continuous free area.
+ * 1-31 bytes share the same slot.
+ */
 #define PCPU_SLOT_BASE_SHIFT   5
 /* chunks in slots below this are subject to being sidelined on failed alloc */
 #define PCPU_SLOT_FAIL_THRESHOLD   3
-- 
2.30.2



[PATCH rfc 0/4] percpu: partial chunk depopulation

2021-03-24 Thread Roman Gushchin
In our production experience the percpu memory allocator is sometimes struggling
with returning the memory to the system. A typical example is the creation of
several thousand memory cgroups (each has several chunks of percpu data
used for vmstats, vmevents, ref counters etc). Deleting and completely releasing
these cgroups doesn't always lead to a shrinkage of the percpu memory.

The underlying problem is the fragmentation: to release an underlying chunk
all percpu allocations should be released first. The percpu allocator tends
to top up chunks to improve the utilization. It means new small-ish allocations
(e.g. percpu ref counters) are placed onto almost filled old-ish chunks,
effectively pinning them in memory.

This patchset aims to solve this problem by implementing partial
depopulation of percpu chunks: chunks with many empty pages are
asynchronously depopulated and the pages are returned to the system.

To illustrate the problem the following script can be used:

--
#!/bin/bash

cd /sys/fs/cgroup

mkdir percpu_test
echo "+memory" > percpu_test/cgroup.subtree_control

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
mkdir percpu_test/cg_"${i}"
for j in `seq 1 10`; do
mkdir percpu_test/cg_"${i}"_"${j}"
done
done

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
for j in `seq 1 10`; do
rmdir percpu_test/cg_"${i}"_"${j}"
done
done

sleep 10

cat /proc/meminfo | grep Percpu

for i in `seq 1 1000`; do
rmdir percpu_test/cg_"${i}"
done

rmdir percpu_test
--

It creates 11000 memory cgroups and removes every 10 out of 11.
It prints the initial size of the percpu memory, the size after
creating all cgroups and the size after deleting most of them.

Results:
  vanilla:
$ ./percpu_test.sh
Percpu: 7296 kB
Percpu:   481024 kB
Percpu:   481024 kB

  with this patchset applied:
./percpu_test.sh
Percpu: 7488 kB
Percpu:   481152 kB
Percpu:   153920 kB

So the total size of the percpu memory was reduced by ~3 times.


Roman Gushchin (4):
  percpu: implement partial chunk depopulation
  percpu: split __pcpu_balance_workfn()
  percpu: on demand chunk depopulation
  percpu: fix a comment about the chunks ordering

 mm/percpu-internal.h |   1 +
 mm/percpu.c  | 242 ---
 2 files changed, 203 insertions(+), 40 deletions(-)

-- 
2.30.2



[PATCH rfc 1/4] percpu: implement partial chunk depopulation

2021-03-24 Thread Roman Gushchin
This patch implements partial depopulation of percpu chunks.

As of now, a chunk can be depopulated only as a part of the final
destruction, when there are no more outstanding allocations. However,
to minimize memory waste, it might be useful to depopulate a
partially filled chunk, if a small number of outstanding allocations
prevents the chunk from being reclaimed.

This patch implements the following depopulation process: it scans
over the chunk pages, looks for a range of empty and populated pages
and performs the depopulation. To avoid races with new allocations,
the chunk is previously isolated. After the depopulation the chunk is
returned to the original slot (but is appended to the tail of the list
to minimize the chances of population).

Because the pcpu_lock is dropped while calling pcpu_depopulate_chunk(),
the chunk can be concurrently moved to a different slot. So we need
to isolate it again on each step. pcpu_alloc_mutex is held, so the
chunk can't be populated/depopulated asynchronously.

Signed-off-by: Roman Gushchin 
---
 mm/percpu.c | 90 +
 1 file changed, 90 insertions(+)

diff --git a/mm/percpu.c b/mm/percpu.c
index 6596a0a4286e..78c55c73fa28 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -2055,6 +2055,96 @@ static void __pcpu_balance_workfn(enum pcpu_chunk_type 
type)
mutex_unlock(&pcpu_alloc_mutex);
 }
 
+/**
+ * pcpu_shrink_populated - scan chunks and release unused pages to the system
+ * @type: chunk type
+ *
+ * Scan over all chunks, find those marked with the depopulate flag and
+ * try to release unused pages to the system. On every attempt clear the
+ * chunk's depopulate flag to avoid wasting CPU by scanning the same
+ * chunk again and again.
+ */
+static void pcpu_shrink_populated(enum pcpu_chunk_type type)
+{
+   struct list_head *pcpu_slot = pcpu_chunk_list(type);
+   struct pcpu_chunk *chunk;
+   int slot, i, off, start;
+
+   spin_lock_irq(&pcpu_lock);
+   for (slot = pcpu_nr_slots - 1; slot >= 0; slot--) {
+restart:
+   list_for_each_entry(chunk, &pcpu_slot[slot], list) {
+   bool isolated = false;
+
+   if (pcpu_nr_empty_pop_pages < PCPU_EMPTY_POP_PAGES_HIGH)
+   break;
+
+   for (i = 0, start = -1; i < chunk->nr_pages; i++) {
+   if (!chunk->nr_empty_pop_pages)
+   break;
+
+   /*
+* If the page is empty and populated, start or
+* extend the [start, i) range.
+*/
+   if (test_bit(i, chunk->populated)) {
+   off = find_first_bit(
+   pcpu_index_alloc_map(chunk, i),
+   PCPU_BITMAP_BLOCK_BITS);
+   if (off >= PCPU_BITMAP_BLOCK_BITS) {
+   if (start == -1)
+   start = i;
+   continue;
+   }
+   }
+
+   /*
+* Otherwise check if there is an active range,
+* and if yes, depopulate it.
+*/
+   if (start == -1)
+   continue;
+
+   /*
+* Isolate the chunk, so new allocations
+* wouldn't be served using this chunk.
+* Async releases can still happen.
+*/
+   if (!list_empty(&chunk->list)) {
+   list_del_init(&chunk->list);
+   isolated = true;
+   }
+
+   spin_unlock_irq(&pcpu_lock);
+   pcpu_depopulate_chunk(chunk, start, i);
+   cond_resched();
+   spin_lock_irq(&pcpu_lock);
+
+   pcpu_chunk_depopulated(chunk, start, i);
+
+   /*
+* Reset the range and continue.
+*/
+   start = -1;
+   }
+
+   if (isolated) {
+   /*
+* The chunk could have been moved while
+* pcpu_lock wasn't held. Make sure we put
+* the chunk back into the slot and restart
+   

[PATCH rfc 3/4] percpu: on demand chunk depopulation

2021-03-24 Thread Roman Gushchin
To return unused memory to the system schedule an async
depopulation of percpu chunks.

To balance between scanning too much (and creating overhead because
of pcpu_lock contention) and not scanning enough, let's track the
number of chunks to scan and mark chunks which are potentially a good
target for the depopulation with a new boolean flag.  The async
depopulation work will clear the flag after trying to depopulate a
chunk (successfully or not).

This commit suggests the following logic: if a chunk
  1) has more than 1/4 of total pages free and populated
  2) isn't a reserved chunk
  3) isn't entirely free
  4) isn't alone in the corresponding slot
it's a good target for depopulation.

If there are 2 or more of such chunks, an async depopulation
is scheduled.

Because chunk population and depopulation are opposite processes
which make little sense together, split out the shrinking part of
pcpu_balance_populated() into pcpu_grow_populated() and make
pcpu_balance_populated() calling into pcpu_grow_populated() or
pcpu_shrink_populated() conditionally.

Signed-off-by: Roman Gushchin 
---
 mm/percpu-internal.h |   1 +
 mm/percpu.c  | 111 ---
 2 files changed, 85 insertions(+), 27 deletions(-)

diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 18b768ac7dca..1c5b92af02eb 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -67,6 +67,7 @@ struct pcpu_chunk {
 
void    *data;  /* chunk data */
bool    immutable;  /* no [de]population allowed */
+   bool    depopulate; /* depopulation hint */
int start_offset;   /* the overlap with the previous
   region to have a page aligned
   base_addr */
diff --git a/mm/percpu.c b/mm/percpu.c
index 015d076893f5..148137f0fc0b 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -178,6 +178,12 @@ static LIST_HEAD(pcpu_map_extend_chunks);
  */
 int pcpu_nr_empty_pop_pages;
 
+/*
+ * Track the number of chunks with a lot of free memory.
+ * It's used to release unused pages to the system.
+ */
+static int pcpu_nr_chunks_to_depopulate;
+
 /*
  * The number of populated pages in use by the allocator, protected by
  * pcpu_lock.  This number is kept per a unit per chunk (i.e. when a page gets
@@ -1955,6 +1961,11 @@ static void pcpu_balance_free(enum pcpu_chunk_type type)
if (chunk == list_first_entry(free_head, struct pcpu_chunk, 
list))
continue;
 
+   if (chunk->depopulate) {
+   chunk->depopulate = false;
+   pcpu_nr_chunks_to_depopulate--;
+   }
+
list_move(&chunk->list, &to_free);
}
 
@@ -1976,7 +1987,7 @@ static void pcpu_balance_free(enum pcpu_chunk_type type)
 }
 
 /**
- * pcpu_balance_populated - manage the amount of populated pages
+ * pcpu_grow_populated - populate chunk(s) to satisfy atomic allocations
  * @type: chunk type
  *
  * Maintain a certain amount of populated pages to satisfy atomic allocations.
@@ -1985,35 +1996,15 @@ static void pcpu_balance_free(enum pcpu_chunk_type type)
  * allocation causes the failure as it is possible that requests can be
  * serviced from already backed regions.
  */
-static void pcpu_balance_populated(enum pcpu_chunk_type type)
+static void pcpu_grow_populated(enum pcpu_chunk_type type, int nr_to_pop)
 {
/* gfp flags passed to underlying allocators */
const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
struct list_head *pcpu_slot = pcpu_chunk_list(type);
struct pcpu_chunk *chunk;
-   int slot, nr_to_pop, ret;
+   int slot, ret;
 
-   /*
-* Ensure there are certain number of free populated pages for
-* atomic allocs.  Fill up from the most packed so that atomic
-* allocs don't increase fragmentation.  If atomic allocation
-* failed previously, always populate the maximum amount.  This
-* should prevent atomic allocs larger than PAGE_SIZE from keeping
-* failing indefinitely; however, large atomic allocs are not
-* something we support properly and can be highly unreliable and
-* inefficient.
-*/
 retry_pop:
-   if (pcpu_atomic_alloc_failed) {
-   nr_to_pop = PCPU_EMPTY_POP_PAGES_HIGH;
-   /* best effort anyway, don't worry about synchronization */
-   pcpu_atomic_alloc_failed = false;
-   } else {
-   nr_to_pop = clamp(PCPU_EMPTY_POP_PAGES_HIGH -
- pcpu_nr_empty_pop_pages,
- 0, PCPU_EMPTY_POP_PAGES_HIGH);
-   }
-
for (slot = pcpu_size_to_slot(PAGE_SIZE); slot < pcpu_nr_slots; slot++) 
{
unsigned int nr_unpop = 0, rs, re;
 
@@ -2084,9 +2075,18 @@

Re: [RFC PATCH 7/8] hugetlb: add update_and_free_page_no_sleep for irq context

2021-03-23 Thread Roman Gushchin
On Tue, Mar 23, 2021 at 11:51:04AM -0700, Mike Kravetz wrote:
> On 3/22/21 11:10 AM, Roman Gushchin wrote:
> > On Mon, Mar 22, 2021 at 10:42:23AM -0700, Mike Kravetz wrote:
> >> Cc: Roman, Christoph
> >>
> >> On 3/22/21 1:41 AM, Peter Zijlstra wrote:
> >>> On Fri, Mar 19, 2021 at 03:42:08PM -0700, Mike Kravetz wrote:
> >>>> The locks acquired in free_huge_page are irq safe.  However, in certain
> >>>> circumstances the routine update_and_free_page could sleep.  Since
> >>>> free_huge_page can be called from any context, it can not sleep.
> >>>>
> >>>> Use a waitqueue to defer freeing of pages if the operation may sleep.  A
> >>>> new routine update_and_free_page_no_sleep provides this functionality
> >>>> and is only called from free_huge_page.
> >>>>
> >>>> Note that any 'pages' sent to the workqueue for deferred freeing have
> >>>> already been removed from the hugetlb subsystem.  What is actually
> >>>> deferred is returning those base pages to the low level allocator.
> >>>
> >>> So maybe I'm stupid, but why do you need that work in hugetlb? Afaict it
> >>> should be in cma_release().
> >>
> >> My thinking (which could be totally wrong) is that cma_release makes no
> >> claims about calling context.  From the code, it is pretty clear that it
> >> can only be called from task context with no locks held.  Although,
> >> there could be code incorrectly calling it today hugetlb does.  Since
> >> hugetlb is the only code with this new requirement, it should do the
> >> work.
> >>
> >> Wait!!!  That made me remember something.
> >> Roman had code to create a non-blocking version of cma_release().
> >> https://lore.kernel.org/linux-mm/2020105308.2927890-1-g...@fb.com/
> >>
> >> There were no objections, and Christoph even thought there may be
> >> problems with callers of dma_free_contiguous.
> >>
> >> Perhaps, we should just move forward with Roman's patches to create
> >> cma_release_nowait() and avoid this workqueue stuff?
> > 
> > Sounds good to me. If it's the preferred path, I can rebase and resend
> > those patches (they been carried for some time by Zi Yan for his 1GB THP 
> > work,
> > but they are completely independent).
> 
> Thanks Roman,
> 
> Yes, this is the preferred path.  If there is a non blocking version of
> cma_release, then it makes fixup of hugetlb put_page path much easier.
> 
> If you would prefer, I can rebase your patches and send with this series.

Sounds good! Please, proceed. And, please, let me know if I can help.

Thanks!


Re: [PATCH v5 1/7] mm: memcontrol: slab: fix obtain a reference to a freeing memcg

2021-03-22 Thread Roman Gushchin
On Sat, Mar 20, 2021 at 12:38:14AM +0800, Muchun Song wrote:
> The rcu_read_lock/unlock only can guarantee that the memcg will not be
> freed, but it cannot guarantee the success of css_get (which is in the
> refill_stock when cached memcg changed) to memcg.
> 
>   rcu_read_lock()
>   memcg = obj_cgroup_memcg(old)
>   __memcg_kmem_uncharge(memcg)
>   refill_stock(memcg)
>   if (stock->cached != memcg)
>   // css_get can change the ref counter from 0 back to 1.
>   css_get(&memcg->css)
>   rcu_read_unlock()
> 
> This fix is very like the commit:
> 
>   eefbfa7fd678 ("mm: memcg/slab: fix use after free in obj_cgroup_charge")
> 
> Fix this by holding a reference to the memcg which is passed to the
> __memcg_kmem_uncharge() before calling __memcg_kmem_uncharge().
> 
> Fixes: 3de7d4f25a74 ("mm: memcg/slab: optimize objcg stock draining")
> Signed-off-by: Muchun Song 

Acked-by: Roman Gushchin 

Thanks!


Re: [PATCH v5 6/7] mm: memcontrol: inline __memcg_kmem_{un}charge() into obj_cgroup_{un}charge_pages()

2021-03-22 Thread Roman Gushchin
On Sat, Mar 20, 2021 at 12:38:19AM +0800, Muchun Song wrote:
> There is only one user of __memcg_kmem_charge(), so manually inline
> __memcg_kmem_charge() to obj_cgroup_charge_pages(). Similarly manually
> inline __memcg_kmem_uncharge() into obj_cgroup_uncharge_pages() and
> call obj_cgroup_uncharge_pages() in obj_cgroup_release().
> 
> This is just code cleanup without any functionality changes.
> 
> Signed-off-by: Muchun Song 

Acked-by: Roman Gushchin 


Re: [PATCH v5 5/7] mm: memcontrol: use obj_cgroup APIs to charge kmem pages

2021-03-22 Thread Roman Gushchin
On Sat, Mar 20, 2021 at 12:38:18AM +0800, Muchun Song wrote:
> Since Roman series "The new cgroup slab memory controller" applied. All
> slab objects are charged via the new APIs of obj_cgroup. The new APIs
> introduce a struct obj_cgroup to charge slab objects. It prevents
> long-living objects from pinning the original memory cgroup in the memory.
> But there are still some corner objects (e.g. allocations larger than
> order-1 page on SLUB) which are not charged via the new APIs. Those
> objects (include the pages which are allocated from buddy allocator
> directly) are charged as kmem pages which still hold a reference to
> the memory cgroup.
> 
> We want to reuse the obj_cgroup APIs to charge the kmem pages.
> If we do that, we should store an object cgroup pointer to
> page->memcg_data for the kmem pages.
> 
> Finally, page->memcg_data will have 3 different meanings.
> 
>   1) For the slab pages, page->memcg_data points to an object cgroups
>  vector.
> 
>   2) For the kmem pages (exclude the slab pages), page->memcg_data
>  points to an object cgroup.
> 
>   3) For the user pages (e.g. the LRU pages), page->memcg_data points
>  to a memory cgroup.
> 
> We do not change the behavior of page_memcg() and page_memcg_rcu().
> They are also suitable for LRU pages and kmem pages. Why?
> 
> Because memory allocations pinning memcgs for a long time - it exists
> at a larger scale and is causing recurring problems in the real world:
> page cache doesn't get reclaimed for a long time, or is used by the
> second, third, fourth, ... instance of the same job that was restarted
> into a new cgroup every time. Unreclaimable dying cgroups pile up,
> waste memory, and make page reclaim very inefficient.
> 
> We can convert LRU pages and most other raw memcg pins to the objcg
> direction to fix this problem, and then the page->memcg will always
> point to an object cgroup pointer. At that time, LRU pages and kmem
> pages will be treated the same. The implementation of page_memcg()
> will remove the kmem page check.
> 
> This patch aims to charge the kmem pages by using the new APIs of
> obj_cgroup. Finally, the page->memcg_data of the kmem page points to
> an object cgroup. We can use the __page_objcg() to get the object
> cgroup associated with a kmem page. Or we can use page_memcg()
> to get the memory cgroup associated with a kmem page, but caller must
> ensure that the returned memcg won't be released (e.g. acquire the
> rcu_read_lock or css_set_lock).
> 
> Signed-off-by: Muchun Song 
> Acked-by: Johannes Weiner 

Acked-by: Roman Gushchin 

Thanks!

> ---
>  include/linux/memcontrol.h | 116 
> +++--
>  mm/memcontrol.c| 110 +-
>  2 files changed, 145 insertions(+), 81 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e6dc793d587d..395a113e4a3b 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -358,6 +358,62 @@ enum page_memcg_data_flags {
>  
>  #define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1)
>  
> +static inline bool PageMemcgKmem(struct page *page);
> +
> +/*
> + * After the initialization objcg->memcg is always pointing at
> + * a valid memcg, but can be atomically swapped to the parent memcg.
> + *
> + * The caller must ensure that the returned memcg won't be released:
> + * e.g. acquire the rcu_read_lock or css_set_lock.
> + */
> +static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
> +{
> + return READ_ONCE(objcg->memcg);
> +}
> +
> +/*
> + * __page_memcg - get the memory cgroup associated with a non-kmem page
> + * @page: a pointer to the page struct
> + *
> + * Returns a pointer to the memory cgroup associated with the page,
> + * or NULL. This function assumes that the page is known to have a
> + * proper memory cgroup pointer. It's not safe to call this function
> + * against some type of pages, e.g. slab pages or ex-slab pages or
> + * kmem pages.
> + */
> +static inline struct mem_cgroup *__page_memcg(struct page *page)
> +{
> + unsigned long memcg_data = page->memcg_data;
> +
> + VM_BUG_ON_PAGE(PageSlab(page), page);
> + VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
> + VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, page);
> +
> + return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> +}
> +
> +/*
> + * __page_objcg - get the object cgroup associated with a kmem page
> + * @page: a pointer to the page struct
> + *
> + * Returns a pointer to the object cgroup associated with the

Re: [RFC PATCH 7/8] hugetlb: add update_and_free_page_no_sleep for irq context

2021-03-22 Thread Roman Gushchin
On Mon, Mar 22, 2021 at 10:42:23AM -0700, Mike Kravetz wrote:
> Cc: Roman, Christoph
> 
> On 3/22/21 1:41 AM, Peter Zijlstra wrote:
> > On Fri, Mar 19, 2021 at 03:42:08PM -0700, Mike Kravetz wrote:
> >> The locks acquired in free_huge_page are irq safe.  However, in certain
> >> circumstances the routine update_and_free_page could sleep.  Since
> >> free_huge_page can be called from any context, it can not sleep.
> >>
> >> Use a waitqueue to defer freeing of pages if the operation may sleep.  A
> >> new routine update_and_free_page_no_sleep provides this functionality
> >> and is only called from free_huge_page.
> >>
> >> Note that any 'pages' sent to the workqueue for deferred freeing have
> >> already been removed from the hugetlb subsystem.  What is actually
> >> deferred is returning those base pages to the low level allocator.
> > 
> > So maybe I'm stupid, but why do you need that work in hugetlb? Afaict it
> > should be in cma_release().
> 
> My thinking (which could be totally wrong) is that cma_release makes no
> claims about calling context.  From the code, it is pretty clear that it
> can only be called from task context with no locks held.  Although,
> there could be code incorrectly calling it today hugetlb does.  Since
> hugetlb is the only code with this new requirement, it should do the
> work.
> 
> Wait!!!  That made me remember something.
> Roman had code to create a non-blocking version of cma_release().
> https://lore.kernel.org/linux-mm/2020105308.2927890-1-g...@fb.com/
> 
> There were no objections, and Christoph even thought there may be
> problems with callers of dma_free_contiguous.
> 
> Perhaps, we should just move forward with Roman's patches to create
> cma_release_nowait() and avoid this workqueue stuff?

Sounds good to me. If it's the preferred path, I can rebase and resend
those patches (they been carried for some time by Zi Yan for his 1GB THP work,
but they are completely independent).

Thanks!


> -- 
> Mike Kravetz


Re: [PATCH v3 0/4] mm/slub: Fix count_partial() problem

2021-03-15 Thread Roman Gushchin


On Mon, Mar 15, 2021 at 07:49:57PM +0100, Vlastimil Babka wrote:
> On 3/9/21 4:25 PM, Xunlei Pang wrote:
> > count_partial() can hold n->list_lock spinlock for quite long, which
> > makes much trouble to the system. This series eliminate this problem.
> 
> Before I check the details, I have two high-level comments:
> 
> - patch 1 introduces some counting scheme that patch 4 then changes, could
> we do this in one step to avoid the churn?
> 
> - the series addresses the concern that the spinlock is being held, but
> doesn't address the fact that counting partial per-node slabs is not nearly
> enough if we want accurate active_objs in /proc/slabinfo, because there are
> also percpu slabs and per-cpu partial slabs, where we don't track the free
> objects at all. So after this series, while the readers of /proc/slabinfo
> won't block on the spinlock, they will get the same garbage data as before.
> So Christoph is not wrong to say that we can just report
> active_objs == num_objs and it won't actually break any ABI.
> At the same time somebody might actually want accurate object statistics at
> the expense of peak performance, and it would be nice to give them such an
> option in SLUB. Right now we don't provide this accuracy even with
> CONFIG_SLUB_STATS, although that option provides many additional tuning
> stats, with additional overhead.
> So my proposal would be a new config for "accurate active objects" (or just
> tie it to CONFIG_SLUB_DEBUG?) that would extend the approach of percpu
> counters in patch 4 to all alloc/free, so that it includes percpu slabs.
> Without this config enabled, let's just report active_objs == num_objs.

It sounds really good to me! The only thing, I'd avoid introducing a new option
and use CONFIG_SLUB_STATS instead.

It seems like CONFIG_SLUB_DEBUG is a more popular option than CONFIG_SLUB_STATS.
CONFIG_SLUB_DEBUG is enabled on my Fedora workstation, CONFIG_SLUB_STATS is off.
I doubt an average user needs this data, so I'd go with CONFIG_SLUB_STATS.

Thanks!
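
(For illustration, a hypothetical sketch of the per-cpu counting approach
Vlastimil describes above; this is not code from the series, and a real
implementation would keep one counter per kmem_cache rather than a single
global one.)

    /* hypothetical per-cpu delta of free objects, needs <linux/percpu.h> */
    static DEFINE_PER_CPU(long, free_objects_delta);

    static inline void note_alloc(void)
    {
            this_cpu_dec(free_objects_delta);   /* object leaves the free pool */
    }

    static inline void note_free(void)
    {
            this_cpu_inc(free_objects_delta);   /* object returns to the free pool */
    }

    /* slabinfo reader: sum the per-cpu deltas; approximate but lock-free */
    static long approx_free_objects(void)
    {
            long sum = 0;
            int cpu;

            for_each_possible_cpu(cpu)
                    sum += per_cpu(free_objects_delta, cpu);
            return sum < 0 ? 0 : sum;
    }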

> 
> Vlastimil
> 
> > v1->v2:
> > - Improved changelog and variable naming for PATCH 1~2.
> > - PATCH3 adds per-cpu counter to avoid performance regression
> >   in concurrent __slab_free().
> > 
> > v2->v3:
> > - Changed "page->inuse" to the safe "new.inuse", etc.
> > - Used CONFIG_SLUB_DEBUG and CONFIG_SYSFS condition for new counters.
> > - atomic_long_t -> unsigned long
> > 
> > [Testing]
> > There seems might be a little performance impact under extreme
> > __slab_free() concurrent calls according to my tests.
> > 
> > On my 32-cpu 2-socket physical machine:
> > Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
> > 
> > 1) perf stat --null --repeat 10 -- hackbench 20 thread 2
> > 
> > == original, no patched
> > Performance counter stats for 'hackbench 20 thread 2' (10 runs):
> > 
> >   24.536050899 seconds time elapsed 
> >  ( +-  0.24% )
> > 
> > 
> > Performance counter stats for 'hackbench 20 thread 2' (10 runs):
> > 
> >   24.588049142 seconds time elapsed 
> >  ( +-  0.35% )
> > 
> > 
> > == patched with patch1~4
> > Performance counter stats for 'hackbench 20 thread 2' (10 runs):
> > 
> >   24.670892273 seconds time elapsed 
> >  ( +-  0.29% )
> > 
> > 
> > Performance counter stats for 'hackbench 20 thread 2' (10 runs):
> > 
> >   24.746755689 seconds time elapsed 
> >  ( +-  0.21% )
> > 
> > 
> > 2) perf stat --null --repeat 10 -- hackbench 32 thread 2
> > 
> > == original, no patched
> >  Performance counter stats for 'hackbench 32 thread 2' (10 runs):
> > 
> >   39.784911855 seconds time elapsed 
> >  ( +-  0.14% )
> > 
> >  Performance counter stats for 'hackbench 32 thread 2' (10 runs):
> > 
> >   39.868687608 seconds time elapsed 
> >  ( +-  0.19% )
> > 
> > == patched with patch1~4
> >  Performance counter stats for 'hackbench 32 thread 2' (10 runs):
> > 
> >   39.681273015 seconds time elapsed 
> >  ( +-  0.21% )
> > 
> >  Performance counter stats for 'hackbench 32 thread 2' (10 runs):
> > 
> >   39.681238459 seconds time elapsed 
> >  ( +-  0.09% )
> > 
> > 
> > Xunlei Pang (4):
> >   mm/slub: Introduce two counters for partial objects
> >   mm/slub: Get rid of count_partial()
> >   percpu: Export per_cpu_sum()
> >   mm/slub: Use percpu partial free counter
> > 
> >  include/linux/percpu-defs.h   |  10 
> >  kernel/locking/percpu-rwsem.c |  10 
> >  mm/slab.h |   4 ++
> >  mm/slub.c | 120 
> > +-
> >  4 files changed, 97 insertions(+), 47 deletions(-)
> > 
> 


Re: [PATCH v3 2/4] mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem page

2021-03-10 Thread Roman Gushchin
On Tue, Mar 09, 2021 at 06:07:15PM +0800, Muchun Song wrote:
> We want to reuse the obj_cgroup APIs to charge the kmem pages.
> If we do that, we should store an object cgroup pointer to
> page->memcg_data for the kmem pages.
> 
> Finally, page->memcg_data can have 3 different meanings.
> 
>   1) For the slab pages, page->memcg_data points to an object cgroups
>  vector.
> 
>   2) For the kmem pages (exclude the slab pages), page->memcg_data
>  points to an object cgroup.
> 
>   3) For the user pages (e.g. the LRU pages), page->memcg_data points
>  to a memory cgroup.
> 
> Currently we always get the memory cgroup associated with a page via
> page_memcg() or page_memcg_rcu(). page_memcg_check() is special, it
> has to be used in cases when it's not known if a page has an
> associated memory cgroup pointer or an object cgroups vector. Because
> the page->memcg_data of the kmem page is not pointing to a memory
> cgroup in the later patch, the page_memcg() and page_memcg_rcu()
> cannot be applicable for the kmem pages. In this patch, make
> page_memcg() and page_memcg_rcu() no longer apply to the kmem pages.
> We do not change the behavior of the page_memcg_check(), it is also
> applicable for the kmem pages.
> 
> In the end, there are 3 helpers to get the memcg associated with a page.
> Usage is as follows.
> 
>   1) Get the memory cgroup associated with a non-kmem page (e.g. the LRU
>  pages).
> 
>  - page_memcg()
>  - page_memcg_rcu()
> 
>   2) Get the memory cgroup associated with a page. It has to be used in
>  cases when it's not known if a page has an associated memory cgroup
>  pointer or an object cgroups vector. Returns NULL for slab pages or
>  uncharged pages. Otherwise, returns memory cgroup for charged pages
>  (e.g. the kmem pages, the LRU pages).
> 
>  - page_memcg_check()
> 
> In some place, we use page_memcg() to check whether the page is charged.
> Now introduce page_memcg_charged() helper to do that.
> 
> This is a preparation for reparenting the kmem pages.
> 
> Signed-off-by: Muchun Song 
> ---
>  include/linux/memcontrol.h | 33 +++--
>  mm/memcontrol.c| 23 +--
>  mm/page_alloc.c|  4 ++--
>  3 files changed, 42 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e6dc793d587d..83cbcdcfcc92 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -358,14 +358,26 @@ enum page_memcg_data_flags {
>  
>  #define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1)
>  
> +/* Return true for charged page, otherwise false. */
> +static inline bool page_memcg_charged(struct page *page)
> +{
> + unsigned long memcg_data = page->memcg_data;
> +
> + VM_BUG_ON_PAGE(PageSlab(page), page);
> + VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
> +
> + return !!memcg_data;
> +}
> +
>  /*
> - * page_memcg - get the memory cgroup associated with a page
> + * page_memcg - get the memory cgroup associated with a non-kmem page
>   * @page: a pointer to the page struct
>   *
>   * Returns a pointer to the memory cgroup associated with the page,
>   * or NULL. This function assumes that the page is known to have a
>   * proper memory cgroup pointer. It's not safe to call this function
> - * against some type of pages, e.g. slab pages or ex-slab pages.
> + * against some type of pages, e.g. slab pages, kmem pages or ex-slab
> + * pages.
>   *
>   * Any of the following ensures page and memcg binding stability:
>   * - the page lock
> @@ -378,27 +390,31 @@ static inline struct mem_cgroup *page_memcg(struct page 
> *page)
>   unsigned long memcg_data = page->memcg_data;
>  
>   VM_BUG_ON_PAGE(PageSlab(page), page);
> + VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, page);
>   VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
>  
>   return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
>  }
>  
>  /*
> - * page_memcg_rcu - locklessly get the memory cgroup associated with a page
> + * page_memcg_rcu - locklessly get the memory cgroup associated with a 
> non-kmem page
>   * @page: a pointer to the page struct
>   *
>   * Returns a pointer to the memory cgroup associated with the page,
>   * or NULL. This function assumes that the page is known to have a
>   * proper memory cgroup pointer. It's not safe to call this function
> - * against some type of pages, e.g. slab pages or ex-slab pages.
> + * against some type of pages, e.g. slab pages, kmem pages or ex-slab
> + * pages.
>   */
>  static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
>  {
> + unsigned long memcg_data = READ_ONCE(page->memcg_data);
> +
>   VM_BUG_ON_PAGE(PageSlab(page), page);
> + VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, page);
>   WARN_ON_ONCE(!rcu_read_lock_held());
>  
> - return (struct mem_cgroup *)(READ_ONCE(page->memcg_data) &
> -

Re: [PATCH v3 3/4] mm: memcontrol: use obj_cgroup APIs to charge kmem pages

2021-03-10 Thread Roman Gushchin
On Tue, Mar 09, 2021 at 06:07:16PM +0800, Muchun Song wrote:
> Since Roman series "The new cgroup slab memory controller" applied. All
> slab objects are charged via the new APIs of obj_cgroup. The new APIs
> introduce a struct obj_cgroup to charge slab objects. It prevents
> long-living objects from pinning the original memory cgroup in the memory.
> But there are still some corner objects (e.g. allocations larger than
> order-1 page on SLUB) which are not charged via the new APIs. Those
> objects (include the pages which are allocated from buddy allocator
> directly) are charged as kmem pages which still hold a reference to
> the memory cgroup.
> 
> This patch aims to charge the kmem pages by using the new APIs of
> obj_cgroup. Finally, the page->memcg_data of the kmem page points to
> an object cgroup. We can use the page_objcg() to get the object
> cgroup associated with a kmem page. Or we can use page_memcg_check()
> to get the memory cgroup associated with a kmem page, but caller must
> ensure that the returned memcg won't be released (e.g. acquire the
> rcu_read_lock or css_set_lock).
> 
> Signed-off-by: Muchun Song 
> ---
>  include/linux/memcontrol.h |  63 ++--
>  mm/memcontrol.c| 119 
> ++---
>  2 files changed, 128 insertions(+), 54 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 83cbcdcfcc92..07c449af9c0f 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -370,6 +370,18 @@ static inline bool page_memcg_charged(struct page *page)
>  }
>  
>  /*
> + * After the initialization objcg->memcg is always pointing at
> + * a valid memcg, but can be atomically swapped to the parent memcg.
> + *
> + * The caller must ensure that the returned memcg won't be released:
> + * e.g. acquire the rcu_read_lock or css_set_lock.
> + */
> +static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
> +{
> + return READ_ONCE(objcg->memcg);
> +}
> +
> +/*
>   * page_memcg - get the memory cgroup associated with a non-kmem page
>   * @page: a pointer to the page struct
>   *
> @@ -422,15 +434,19 @@ static inline struct mem_cgroup *page_memcg_rcu(struct 
> page *page)
>   * @page: a pointer to the page struct
>   *
>   * Returns a pointer to the memory cgroup associated with the page,
> - * or NULL. This function unlike page_memcg() can take any  page
> + * or NULL. This function unlike page_memcg() can take any page
>   * as an argument. It has to be used in cases when it's not known if a page
> - * has an associated memory cgroup pointer or an object cgroups vector.
> + * has an associated memory cgroup pointer or an object cgroups vector or
> + * an object cgroup.
>   *
>   * Any of the following ensures page and memcg binding stability:
>   * - the page lock
>   * - LRU isolation
>   * - lock_page_memcg()
>   * - exclusive reference
> + *
> + * Should be called under rcu lock which can protect memcg associated with a
> + * kmem page from being released.

How about this:

For a non-kmem page any of the following ensures page and memcg binding 
stability:
- the page lock
- LRU isolation
- lock_page_memcg()
- exclusive reference

For a kmem page a caller should hold an rcu read lock to protect memcg 
associated
with a kmem page from being released.
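
(An editorial aside: a minimal sketch of the calling pattern that wording
implies, using page_memcg_check() from this series; do_something() is a
made-up placeholder.)

    rcu_read_lock();
    memcg = page_memcg_check(page);     /* for a kmem page this resolves objcg->memcg */
    if (memcg)
            do_something(memcg);        /* memcg cannot be released while under RCU */
    rcu_read_unlock();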

>   */
>  static inline struct mem_cgroup *page_memcg_check(struct page *page)
>  {
> @@ -443,6 +459,13 @@ static inline struct mem_cgroup *page_memcg_check(struct 
> page *page)
>   if (memcg_data & MEMCG_DATA_OBJCGS)
>   return NULL;
>  
> + if (memcg_data & MEMCG_DATA_KMEM) {
> + struct obj_cgroup *objcg;
> +
> + objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> + return obj_cgroup_memcg(objcg);
> + }
> +
>   return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
>  }
>  
> @@ -501,6 +524,25 @@ static inline struct obj_cgroup 
> **page_objcgs_check(struct page *page)
>   return (struct obj_cgroup **)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
>  }
>  
> +/*
> + * page_objcg - get the object cgroup associated with a kmem page
> + * @page: a pointer to the page struct
> + *
> + * Returns a pointer to the object cgroup associated with the kmem page,
> + * or NULL. This function assumes that the page is known to have an
> + * associated object cgroup. It's only safe to call this function
> + * against kmem pages (PageMemcgKmem() returns true).
> + */
> +static inline struct obj_cgroup *page_objcg(struct page *page)
> +{
> + unsigned long memcg_data = page->memcg_data;
> +
> + VM_BUG_ON_PAGE(PageSlab(page), page);
> + VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
> + VM_BUG_ON_PAGE(!(memcg_data & MEMCG_DATA_KMEM), page);
> +
> + return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> +}
>  #else
>  static inline struct obj_cgroup **page_objcgs(struct page *page)
>  {
> @@ -511,6 +553,11 

Re: [PATCH v3 4/4] mm: memcontrol: move PageMemcgKmem to the scope of CONFIG_MEMCG_KMEM

2021-03-10 Thread Roman Gushchin
On Tue, Mar 09, 2021 at 06:07:17PM +0800, Muchun Song wrote:
> The page only can be marked as kmem when CONFIG_MEMCG_KMEM is enabled.
> So move PageMemcgKmem() to the scope of the CONFIG_MEMCG_KMEM.
> 
> As a bonus, on !CONFIG_MEMCG_KMEM build some code can be compiled out.
> 
> Signed-off-by: Muchun Song 

Nice!

Acked-by: Roman Gushchin 

Thanks!

> ---
>  include/linux/memcontrol.h | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 07c449af9c0f..d3ca8c8e7fc3 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -469,6 +469,7 @@ static inline struct mem_cgroup *page_memcg_check(struct 
> page *page)
>   return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
>  }
>  
> +#ifdef CONFIG_MEMCG_KMEM
>  /*
>   * PageMemcgKmem - check if the page has MemcgKmem flag set
>   * @page: a pointer to the page struct
> @@ -483,7 +484,6 @@ static inline bool PageMemcgKmem(struct page *page)
>   return page->memcg_data & MEMCG_DATA_KMEM;
>  }
>  
> -#ifdef CONFIG_MEMCG_KMEM
>  /*
>   * page_objcgs - get the object cgroups vector associated with a page
>   * @page: a pointer to the page struct
> @@ -544,6 +544,11 @@ static inline struct obj_cgroup *page_objcg(struct page 
> *page)
>   return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
>  }
>  #else
> +static inline bool PageMemcgKmem(struct page *page)
> +{
> + return false;
> +}
> +
>  static inline struct obj_cgroup **page_objcgs(struct page *page)
>  {
>   return NULL;
> -- 
> 2.11.0
> 


Re: [v8 PATCH 05/13] mm: vmscan: use kvfree_rcu instead of call_rcu

2021-03-08 Thread Roman Gushchin
On Sun, Mar 07, 2021 at 10:13:04PM -0800, Shakeel Butt wrote:
> On Tue, Feb 16, 2021 at 4:13 PM Yang Shi  wrote:
> >
> > Using kvfree_rcu() to free the old shrinker_maps instead of call_rcu().
> > We don't have to define a dedicated callback for call_rcu() anymore.
> >
> > Signed-off-by: Yang Shi 
> > ---
> >  mm/vmscan.c | 7 +--
> >  1 file changed, 1 insertion(+), 6 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 2e753c2516fa..c2a309acd86b 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -192,11 +192,6 @@ static inline int shrinker_map_size(int nr_items)
> > return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned 
> > long));
> >  }
> >
> > -static void free_shrinker_map_rcu(struct rcu_head *head)
> > -{
> > -   kvfree(container_of(head, struct memcg_shrinker_map, rcu));
> > -}
> > -
> >  static int expand_one_shrinker_map(struct mem_cgroup *memcg,
> >int size, int old_size)
> >  {
> > @@ -219,7 +214,7 @@ static int expand_one_shrinker_map(struct mem_cgroup 
> > *memcg,
> > memset((void *)new->map + old_size, 0, size - old_size);
> >
> > rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
> > -   call_rcu(>rcu, free_shrinker_map_rcu);
> > +   kvfree_rcu(old);
> 
> Please use kvfree_rcu(old, rcu) instead of kvfree_rcu(old). The single
> param can call synchronize_rcu().

Oh, I didn't know about this difference. Thank you for noticing!
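
(For reference, a minimal sketch of the difference; 'rcu' is the rcu_head
embedded in struct memcg_shrinker_map, as in the code above.)

    /* two-argument form: uses the embedded rcu_head, never sleeps */
    kvfree_rcu(old, rcu);

    /* single-argument ("headless") form: no rcu_head available, so it may
     * fall back to synchronize_rcu() and therefore may sleep
     */
    kvfree_rcu(old);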


Re: [PATCH v2] MIPS: kernel: Reserve exception base early to prevent corruption

2021-03-08 Thread Roman Gushchin
On Sat, Mar 06, 2021 at 09:29:09AM +0100, Thomas Bogendoerfer wrote:
> BMIPS is one of the few platforms that do change the exception base.
> After commit 2dcb39645441 ("memblock: do not start bottom-up allocations
> with kernel_end") we started seeing BMIPS boards fail to boot with the
> built-in FDT being corrupted.
> 
> Before the cited commit, early allocations would be in the [kernel_end,
> RAM_END] range, but after commit they would be within [RAM_START +
> PAGE_SIZE, RAM_END].
> 
> The custom exception base handler that is installed by
> bmips_ebase_setup() done for BMIPS5000 CPUs ends-up trampling on the
> memory region allocated by unflatten_and_copy_device_tree() thus
> corrupting the FDT used by the kernel.
> 
> To fix this, we need to perform an early reservation of the custom
> exception space. So we reserve it already in cpu_probe() for the CPUs
> where this is fixed. For CPU with an ebase config register allocation
> of exception space will be done in trap_init().
> 
> Huge thanks to Serge for analysing and proposing a solution to this
> issue.
> 
> Fixes: 2dcb39645441 ("memblock: do not start bottom-up allocations with 
> kernel_end")
> Reported-by: Kamal Dasu 
> Debugged-by: Serge Semin 
> Signed-off-by: Thomas Bogendoerfer 
> ---
> Changes in v2:
>  - do only memblock reservation in reserve_exception_space()
>  - reserve 0..0x400 for all CPUs without ebase register and
>to addtional reserve_exception_space for BMIPS CPUs
> 
>  arch/mips/include/asm/traps.h|  3 +++
>  arch/mips/kernel/cpu-probe.c |  7 +++
>  arch/mips/kernel/cpu-r3k-probe.c |  3 +++
>  arch/mips/kernel/traps.c | 10 +-
>  4 files changed, 18 insertions(+), 5 deletions(-)

Acked-by: Roman Gushchin 

Thanks!

> 
> diff --git a/arch/mips/include/asm/traps.h b/arch/mips/include/asm/traps.h
> index 6aa8f126a43d..b710e76c9c65 100644
> --- a/arch/mips/include/asm/traps.h
> +++ b/arch/mips/include/asm/traps.h
> @@ -24,8 +24,11 @@ extern void (*board_ebase_setup)(void);
>  extern void (*board_cache_error_setup)(void);
>  
>  extern int register_nmi_notifier(struct notifier_block *nb);
> +extern void reserve_exception_space(phys_addr_t addr, unsigned long size);
>  extern char except_vec_nmi[];
>  
> +#define VECTORSPACING 0x100  /* for EI/VI mode */
> +
>  #define nmi_notifier(fn, pri)
> \
>  ({   \
>   static struct notifier_block fn##_nb = {\
> diff --git a/arch/mips/kernel/cpu-probe.c b/arch/mips/kernel/cpu-probe.c
> index 9a89637b4ecf..b565bc4b900d 100644
> --- a/arch/mips/kernel/cpu-probe.c
> +++ b/arch/mips/kernel/cpu-probe.c
> @@ -26,6 +26,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include "fpu-probe.h"
> @@ -1628,6 +1629,7 @@ static inline void cpu_probe_broadcom(struct 
> cpuinfo_mips *c, unsigned int cpu)
>   c->cputype = CPU_BMIPS3300;
>   __cpu_name[cpu] = "Broadcom BMIPS3300";
>   set_elf_platform(cpu, "bmips3300");
> + reserve_exception_space(0x400, VECTORSPACING * 64);
>   break;
>   case PRID_IMP_BMIPS43XX: {
>   int rev = c->processor_id & PRID_REV_MASK;
> @@ -1638,6 +1640,7 @@ static inline void cpu_probe_broadcom(struct 
> cpuinfo_mips *c, unsigned int cpu)
>   __cpu_name[cpu] = "Broadcom BMIPS4380";
>   set_elf_platform(cpu, "bmips4380");
>   c->options |= MIPS_CPU_RIXI;
> + reserve_exception_space(0x400, VECTORSPACING * 64);
>   } else {
>   c->cputype = CPU_BMIPS4350;
>   __cpu_name[cpu] = "Broadcom BMIPS4350";
> @@ -1654,6 +1657,7 @@ static inline void cpu_probe_broadcom(struct 
> cpuinfo_mips *c, unsigned int cpu)
>   __cpu_name[cpu] = "Broadcom BMIPS5000";
>   set_elf_platform(cpu, "bmips5000");
>   c->options |= MIPS_CPU_ULRI | MIPS_CPU_RIXI;
> + reserve_exception_space(0x1000, VECTORSPACING * 64);
>   break;
>   }
>  }
> @@ -2133,6 +2137,9 @@ void cpu_probe(void)
>   if (cpu == 0)
>   __ua_limit = ~((1ull << cpu_vmbits) - 1);
>  #endif
> +
> + if (cpu_has_mips_r2_r6)
> + reserve_exception_space(0, 0x400);
>  }
>  
>  void cpu_report(void)
> diff --git a/arch/mips/kernel/cpu-r3k-probe.c 
> b/arch/mips/kernel/cpu-r3k-probe.c
> index abdbbe8c5a43..af6

Re: [PATCH v4] memcg: charge before adding to swapcache on swapin

2021-03-05 Thread Roman Gushchin
On Fri, Mar 05, 2021 at 01:26:39PM -0800, Shakeel Butt wrote:
> Currently the kernel adds the page, allocated for swapin, to the
> swapcache before charging the page. This is fine but now we want a
> per-memcg swapcache stat which is essential for folks who wants to
> transparently migrate from cgroup v1's memsw to cgroup v2's memory and
> swap counters. In addition charging a page before exposing it to other
> parts of the kernel is a step in the right direction.
> 
> To correctly maintain the per-memcg swapcache stat, this patch has
> adopted to charge the page before adding it to swapcache. One
> challenge in this option is the failure case of add_to_swap_cache() on
> which we need to undo the mem_cgroup_charge(). Specifically undoing
> mem_cgroup_uncharge_swap() is not simple.
> 
> To resolve the issue, this patch introduces transaction like interface
> to charge a page for swapin. The function mem_cgroup_charge_swapin_page()
> initiates the charging of the page and mem_cgroup_finish_swapin_page()
> completes the charging process. So, the kernel starts the charging
> process of the page for swapin with mem_cgroup_charge_swapin_page(),
> adds the page to the swapcache and on success completes the charging
> process with mem_cgroup_finish_swapin_page().
> 
> Signed-off-by: Shakeel Butt 
> Acked-by: Johannes Weiner 
> Acked-by: Hugh Dickins 

Acked-by: Roman Gushchin 

Thanks!
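
(A rough sketch of the call sequence described in the commit message, using
the helpers declared in the diff below; signatures are simplified and most of
the error handling is elided.)

    /* allocate the page for swapin, then: */
    if (mem_cgroup_swapin_charge_page(page, mm, GFP_KERNEL, entry))
            goto out_put;                   /* charge before exposing the page */

    if (add_to_swap_cache(page, entry, gfp_mask, &shadow))
            goto out_put;                   /* page charge goes away when the page is freed */

    mem_cgroup_swapin_uncharge_swap(entry); /* success: release the swap entry's charge */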

> ---
> Changes since v3:
> - Updated the comments on introduced functions (Johannes)
> - Rename the funcations to be more clear (Hugh & Johannes)
> 
> Changes since v2:
> - fixed build for !CONFIG_MEMCG
> - simplified failure path from add_to_swap_cache()
> 
> Changes since v1:
> - Removes __GFP_NOFAIL and introduced transaction interface for charging
>   (suggested by Johannes)
> - Updated the commit message
> 
>  include/linux/memcontrol.h |  13 +
>  mm/memcontrol.c| 117 +++--
>  mm/memory.c|  14 ++---
>  mm/swap_state.c|  13 ++---
>  4 files changed, 97 insertions(+), 60 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e6dc793d587d..f522b09f2df7 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -596,6 +596,9 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup 
> *memcg)
>  }
>  
>  int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t 
> gfp_mask);
> +int mem_cgroup_swapin_charge_page(struct page *page, struct mm_struct *mm,
> +   gfp_t gfp, swp_entry_t entry);
> +void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
>  
>  void mem_cgroup_uncharge(struct page *page);
>  void mem_cgroup_uncharge_list(struct list_head *page_list);
> @@ -1141,6 +1144,16 @@ static inline int mem_cgroup_charge(struct page *page, 
> struct mm_struct *mm,
>   return 0;
>  }
>  
> +static inline int mem_cgroup_swapin_charge_page(struct page *page,
> + struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
> +{
> + return 0;
> +}
> +
> +static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry)
> +{
> +}
> +
>  static inline void mem_cgroup_uncharge(struct page *page)
>  {
>  }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2db2aeac8a9e..21c38c0b6e5a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6690,6 +6690,27 @@ void mem_cgroup_calculate_protection(struct mem_cgroup 
> *root,
>   atomic_long_read(&parent->memory.children_low_usage)));
>  }
>  
> +static int __mem_cgroup_charge(struct page *page, struct mem_cgroup *memcg,
> +gfp_t gfp)
> +{
> + unsigned int nr_pages = thp_nr_pages(page);
> + int ret;
> +
> + ret = try_charge(memcg, gfp, nr_pages);
> + if (ret)
> + goto out;
> +
> + css_get(&memcg->css);
> + commit_charge(page, memcg);
> +
> + local_irq_disable();
> + mem_cgroup_charge_statistics(memcg, page, nr_pages);
> + memcg_check_events(memcg, page);
> + local_irq_enable();
> +out:
> + return ret;
> +}
> +
>  /**
>   * mem_cgroup_charge - charge a newly allocated page to a cgroup
>   * @page: page to charge
> @@ -6699,55 +6720,71 @@ void mem_cgroup_calculate_protection(struct 
> mem_cgroup *root,
>   * Try to charge @page to the memcg that @mm belongs to, reclaiming
>   * pages according to @gfp_mask if necessary.
>   *
> + * Do not use this for pages allocated for swapin.
> + *
>   * Returns 0 on success. Otherwise, an error code is returned.
>   */
>  int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t 

Re: [PATCH v2 4/5] mm: memcontrol: introduce remote objcg charging API

2021-03-05 Thread Roman Gushchin
On Wed, Mar 03, 2021 at 01:59:16PM +0800, Muchun Song wrote:
> The remote memcg charging APIs are a mechanism to charge pages to a given
> memcg. Since all kernel memory is charged by using the obj_cgroup APIs,
> we actually want to charge kernel memory to the remote object cgroup
> instead of the memory cgroup. So introduce remote objcg charging APIs to
> charge the kmem pages by using the obj_cgroup APIs. If remote memcg
> and objcg are both set, objcg takes precedence over memcg to charge
> the kmem pages.
> 
> In the later patch, we will use those API to charge kernel memory to
> the remote objcg.

I'd abandon/postpone the rest of the patchset (patches 4 and 5) for now.
They add a lot of new code to solve a theoretical problem (please correct
me if I'm wrong), which is not a panic or data corruption, but sub-optimal
garbage collection behavior. I think we need a better motivation and/or an
implementation which makes the code simpler and smaller.
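
(For context, the usage pattern implied by the kerneldoc in the quoted patch
is roughly the following; a sketch only, buf and size are made up.)

    struct obj_cgroup *old;

    old = set_active_objcg(objcg);
    /* __GFP_ACCOUNT allocations in this scope are charged to objcg */
    buf = kmalloc(size, GFP_KERNEL | __GFP_ACCOUNT);
    set_active_objcg(old);          /* restore the previous scope; the API nests */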

> 
> Signed-off-by: Muchun Song 
> ---
>  include/linux/sched.h|  4 
>  include/linux/sched/mm.h | 38 ++
>  kernel/fork.c|  3 +++
>  mm/memcontrol.c  | 44 
>  4 files changed, 85 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ee46f5cab95b..8edcc71a0a1d 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1318,6 +1318,10 @@ struct task_struct {
>   /* Used by memcontrol for targeted memcg charge: */
>   struct mem_cgroup   *active_memcg;
>  #endif
> +#ifdef CONFIG_MEMCG_KMEM
> + /* Used by memcontrol for targeted objcg charge: */
> + struct obj_cgroup   *active_objcg;
> +#endif
>  
>  #ifdef CONFIG_BLK_CGROUP
>   struct request_queue*throttle_queue;
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 1ae08b8462a4..be1189598b09 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -330,6 +330,44 @@ set_active_memcg(struct mem_cgroup *memcg)
>  }
>  #endif
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +DECLARE_PER_CPU(struct obj_cgroup *, int_active_objcg);
> +
> +/**
> + * set_active_objcg - Starts the remote objcg kmem pages charging scope.
> + * @objcg: objcg to charge.
> + *
> + * This function marks the beginning of the remote objcg charging scope. All 
> the
> + * __GFP_ACCOUNT kmem page allocations till the end of the scope will be 
> charged
> + * to the given objcg.
> + *
> + * NOTE: This function can nest. Users must save the return value and
> + * reset the previous value after their own charging scope is over.
> + *
> + * If remote memcg and objcg are both set, objcg takes precedence over memcg
> + * to charge the kmem pages.
> + */
> +static inline struct obj_cgroup *set_active_objcg(struct obj_cgroup *objcg)
> +{
> + struct obj_cgroup *old;
> +
> + if (in_interrupt()) {
> + old = this_cpu_read(int_active_objcg);
> + this_cpu_write(int_active_objcg, objcg);
> + } else {
> + old = current->active_objcg;
> + current->active_objcg = objcg;
> + }
> +
> + return old;
> +}
> +#else
> +static inline struct obj_cgroup *set_active_objcg(struct obj_cgroup *objcg)
> +{
> + return NULL;
> +}
> +#endif
> +
>  #ifdef CONFIG_MEMBARRIER
>  enum {
>   MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY= (1U << 0),
> diff --git a/kernel/fork.c b/kernel/fork.c
> index d66cd1014211..b4b9dd5d122f 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -945,6 +945,9 @@ static struct task_struct *dup_task_struct(struct 
> task_struct *orig, int node)
>  #ifdef CONFIG_MEMCG
>   tsk->active_memcg = NULL;
>  #endif
> +#ifdef CONFIG_MEMCG_KMEM
> + tsk->active_objcg = NULL;
> +#endif
>   return tsk;
>  
>  free_stack:
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 0cf342d22547..e48d4ab0af76 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -79,6 +79,11 @@ struct mem_cgroup *root_mem_cgroup __read_mostly;
>  /* Active memory cgroup to use from an interrupt context */
>  DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +/* Active object cgroup to use from an interrupt context */
> +DEFINE_PER_CPU(struct obj_cgroup *, int_active_objcg);
> +#endif
> +
>  /* Socket memory accounting disabled? */
>  static bool cgroup_memory_nosocket;
>  
> @@ -1076,7 +1081,7 @@ static __always_inline struct mem_cgroup 
> *get_active_memcg(void)
>   return memcg;
>  }
>  
> -static __always_inline bool memcg_kmem_bypass(void)
> +static __always_inline bool memcg_charge_bypass(void)
>  {
>   /* Allow remote memcg charging from any context. */
>   if (unlikely(active_memcg()))
> @@ -1094,7 +1099,7 @@ static __always_inline bool memcg_kmem_bypass(void)
>   */
>  static __always_inline struct mem_cgroup *get_mem_cgroup_from_current(void)
>  {
> - if (memcg_kmem_bypass())
> +   

Re: [PATCH v2 3/5] mm: memcontrol: charge kmem pages by using obj_cgroup APIs

2021-03-05 Thread Roman Gushchin
On Wed, Mar 03, 2021 at 01:59:15PM +0800, Muchun Song wrote:
> Since Roman series "The new cgroup slab memory controller" applied. All
> slab objects are charged via the new APIs of obj_cgroup. The new APIs
> introduce a struct obj_cgroup to charge slab objects. It prevents
> long-living objects from pinning the original memory cgroup in the memory.
> But there are still some corner objects (e.g. allocations larger than
> order-1 page on SLUB) which are not charged via the new APIs. Those
> objects (include the pages which are allocated from buddy allocator
> directly) are charged as kmem pages which still hold a reference to
> the memory cgroup.
> 
> This patch aims to charge the kmem pages by using the new APIs of
> obj_cgroup. Finally, the page->memcg_data of the kmem page points to
> an object cgroup. We can use the page_objcg() to get the object
> cgroup associated with a kmem page. Or we can use page_memcg_check()
> to get the memory cgroup associated with a kmem page, but caller must
> ensure that the returned memcg won't be released (e.g. acquire the
> rcu_read_lock or css_set_lock).

I believe it's a good direction, but there are still things which
need to be figured out first.

> 
> Signed-off-by: Muchun Song 
> ---
>  include/linux/memcontrol.h |  63 +--
>  mm/memcontrol.c| 123 
> +++--
>  2 files changed, 133 insertions(+), 53 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 049b80246cbf..5911b9d107b0 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -370,6 +370,18 @@ static inline bool page_memcg_charged(struct page *page)
>  }
>  
>  /*
> + * After the initialization objcg->memcg is always pointing at
> + * a valid memcg, but can be atomically swapped to the parent memcg.
> + *
> + * The caller must ensure that the returned memcg won't be released:
> + * e.g. acquire the rcu_read_lock or css_set_lock.
> + */
> +static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
> +{
> + return READ_ONCE(objcg->memcg);
> +}
> +
> +/*
>   * page_memcg - get the memory cgroup associated with a non-kmem page
>   * @page: a pointer to the page struct
>   *
> @@ -421,9 +433,10 @@ static inline struct mem_cgroup *page_memcg_rcu(struct 
> page *page)
>   * @page: a pointer to the page struct
>   *
>   * Returns a pointer to the memory cgroup associated with the page,
> - * or NULL. This function unlike page_memcg() can take any  page
> + * or NULL. This function unlike page_memcg() can take any non-kmem page
>   * as an argument. It has to be used in cases when it's not known if a page
> - * has an associated memory cgroup pointer or an object cgroups vector.
> + * has an associated memory cgroup pointer or an object cgroups vector or
> + * an object cgroup.
>   *
>   * Any of the following ensures page and memcg binding stability:
>   * - the page lock
> @@ -442,6 +455,17 @@ static inline struct mem_cgroup *page_memcg_check(struct 
> page *page)
>   if (memcg_data & MEMCG_DATA_OBJCGS)
>   return NULL;
>  
> + if (memcg_data & MEMCG_DATA_KMEM) {

This is confusing: the comment above says it can't take kmem pages?

> + struct obj_cgroup *objcg;
> +
> + /*
> +  * The caller must ensure that the returned memcg won't be
> +  * released: e.g. acquire the rcu_read_lock or css_set_lock.
> +  */
> + objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> + return obj_cgroup_memcg(objcg);
> + }
> +
>   return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);

Also, the comment about page<->memcg binding stability is not correct anymore.
Taking page_lock, for example, won't protect memcg from being released,
if this is a kmem page.

_Maybe_ it's ok to just say that page_memcg_check() requires a rcu lock,
but I'm not yet quite sure. The calling convention is already complicated,
we should avoid making it even more complicated, if we can.

>  }
>  
> @@ -500,6 +524,24 @@ static inline struct obj_cgroup 
> **page_objcgs_check(struct page *page)
>   return (struct obj_cgroup **)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
>  }
>  
> +/*
> + * page_objcg - get the object cgroup associated with a kmem page
> + * @page: a pointer to the page struct
> + *
> + * Returns a pointer to the object cgroup associated with the kmem page,
> + * or NULL. This function assumes that the page is known to have an
> + * associated object cgroup. It's only safe to call this function
> + * against kmem pages (PageMemcgKmem() returns true).
> + */
> +static inline struct obj_cgroup *page_objcg(struct page *page)
> +{
> + unsigned long memcg_data = page->memcg_data;
> +
> + VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
> + VM_BUG_ON_PAGE(!(memcg_data & MEMCG_DATA_KMEM), page);
> +
> + return (struct obj_cgroup *)(memcg_data & 

Re: [PATCH v2 2/5] mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem page

2021-03-05 Thread Roman Gushchin
On Wed, Mar 03, 2021 at 01:59:14PM +0800, Muchun Song wrote:
> We want to reuse the obj_cgroup APIs to charge the kmem pages.
> If we do that, we should store an object cgroup pointer to
> page->memcg_data for the kmem pages.
> 
> Finally, page->memcg_data can have 3 different meanings.
> 
>   1) For the slab pages, page->memcg_data points to an object cgroups
>  vector.
> 
>   2) For the kmem pages (exclude the slab pages), page->memcg_data
>  points to an object cgroup.
> 
>   3) For the user pages (e.g. the LRU pages), page->memcg_data points
>  to a memory cgroup.
> 
> Currently we always get the memory cgroup associated with a page via
> page_memcg() or page_memcg_rcu(). page_memcg_check() is special, it
> has to be used in cases when it's not known if a page has an
> associated memory cgroup pointer or an object cgroups vector. Because
> the page->memcg_data of the kmem page is not pointing to a memory
> cgroup in the later patch, the page_memcg() and page_memcg_rcu()
> cannot be applicable for the kmem pages. In this patch, make
> page_memcg() and page_memcg_rcu() no longer apply to the kmem pages.
> We do not change the behavior of the page_memcg_check(), it is also
> applicable for the kmem pages.
> 
> In the end, there are 3 helpers to get the memcg associated with a page.
> Usage is as follows.
> 
>   1) Get the memory cgroup associated with a non-kmem page (e.g. the LRU
>  pages).
> 
>  - page_memcg()
>  - page_memcg_rcu()
> 
>   2) Get the memory cgroup associated with a page. It has to be used in
>  cases when it's not known if a page has an associated memory cgroup
>  pointer or an object cgroups vector. Returns NULL for slab pages or
>  uncharged pages. Otherwise, returns memory cgroup for charged pages
>  (e.g. the kmem pages, the LRU pages).
> 
>  - page_memcg_check()
> 
> In some place, we use page_memcg() to check whether the page is charged.
> Now introduce page_memcg_charged() helper to do that.
> 
> This is a preparation for reparenting the kmem pages.
> 
> Signed-off-by: Muchun Song 

This patch also looks good to me, but, please, make it safe for adding
new memcg_data flags. E.g. if someone adds a new flag with a completely
new meaning, it shouldn't break the code.

I'll ack it after another look at the final version, but overall it
looks good.

Thanks!
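
(One way to read "safe for new flags", sketched from the helpers already in
this patch: keep masking the flag bits instead of assuming none are set, so a
future flag with a new meaning cannot corrupt the returned pointer.)

    static inline struct mem_cgroup *page_memcg(struct page *page)
    {
            unsigned long memcg_data = page->memcg_data;

            VM_BUG_ON_PAGE(PageSlab(page), page);
            /* reject only the flags whose meaning is already known to differ */
            VM_BUG_ON_PAGE(memcg_data & (MEMCG_DATA_OBJCGS | MEMCG_DATA_KMEM), page);

            /* mask out all flag bits so a new flag cannot leak into the pointer */
            return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
    }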

> ---
>  include/linux/memcontrol.h | 36 
>  mm/memcontrol.c| 23 +--
>  mm/page_alloc.c|  4 ++--
>  3 files changed, 43 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e6dc793d587d..049b80246cbf 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -358,14 +358,26 @@ enum page_memcg_data_flags {
>  
>  #define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1)
>  
> +/* Return true for charged page, otherwise false. */
> +static inline bool page_memcg_charged(struct page *page)
> +{
> + unsigned long memcg_data = page->memcg_data;
> +
> + VM_BUG_ON_PAGE(PageSlab(page), page);
> + VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
> +
> + return !!memcg_data;
> +}
> +
>  /*
> - * page_memcg - get the memory cgroup associated with a page
> + * page_memcg - get the memory cgroup associated with a non-kmem page
>   * @page: a pointer to the page struct
>   *
>   * Returns a pointer to the memory cgroup associated with the page,
>   * or NULL. This function assumes that the page is known to have a
>   * proper memory cgroup pointer. It's not safe to call this function
> - * against some type of pages, e.g. slab pages or ex-slab pages.
> + * against some type of pages, e.g. slab pages, kmem pages or ex-slab
> + * pages.
>   *
>   * Any of the following ensures page and memcg binding stability:
>   * - the page lock
> @@ -378,27 +390,30 @@ static inline struct mem_cgroup *page_memcg(struct page 
> *page)
>   unsigned long memcg_data = page->memcg_data;
>  
>   VM_BUG_ON_PAGE(PageSlab(page), page);
> - VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
> + VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_FLAGS_MASK, page);
>  
> - return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
> + return (struct mem_cgroup *)memcg_data;
>  }
>  
>  /*
> - * page_memcg_rcu - locklessly get the memory cgroup associated with a page
> + * page_memcg_rcu - locklessly get the memory cgroup associated with a 
> non-kmem page
>   * @page: a pointer to the page struct
>   *
>   * Returns a pointer to the memory cgroup associated with the page,
>   * or NULL. This function assumes that the page is known to have a
>   * proper memory cgroup pointer. It's not safe to call this function
> - * against some type of pages, e.g. slab pages or ex-slab pages.
> + * against some type of pages, e.g. slab pages, kmem pages or ex-slab
> + * pages.
>   */
>  static inline struct mem_cgroup 

Re: [PATCH v2 1/5] mm: memcontrol: introduce obj_cgroup_{un}charge_page

2021-03-05 Thread Roman Gushchin
On Wed, Mar 03, 2021 at 01:59:13PM +0800, Muchun Song wrote:
> We know that the unit of slab object charging is bytes, the unit of
> kmem page charging is PAGE_SIZE. If we want to reuse obj_cgroup APIs
> to charge the kmem pages, we should pass PAGE_SIZE (as third parameter)
> to obj_cgroup_charge(). Because the size is already PAGE_SIZE, we can
> skip touch the objcg stock. And obj_cgroup_{un}charge_page() are
> introduced to charge in units of page level.
> 
> In the later patch, we also can reuse those two helpers to charge or
> uncharge a number of kernel pages to a object cgroup. This is just
> a code movement without any functional changes.
> 
> Signed-off-by: Muchun Song 

This patch looks good to me, even as a standalone refactoring.
Please, rename obj_cgroup_charge_page() to obj_cgroup_charge_pages()
and the same for uncharge. That's because the _page suffix usually means
we're dealing with a physical page (e.g. struct page * as an argument),
which is not the case here.

Please, add my Acked-by: Roman Gushchin 
after the renaming.

Thank you!

> ---
>  mm/memcontrol.c | 46 +++---
>  1 file changed, 31 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 845eec01ef9d..faae16def127 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3056,6 +3056,34 @@ static void memcg_free_cache_id(int id)
>   ida_simple_remove(_cache_ida, id);
>  }
>  
> +static inline void obj_cgroup_uncharge_page(struct obj_cgroup *objcg,
> + unsigned int nr_pages)
> +{
> + rcu_read_lock();
> + __memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages);
> + rcu_read_unlock();
> +}
> +
> +static int obj_cgroup_charge_page(struct obj_cgroup *objcg, gfp_t gfp,
> +   unsigned int nr_pages)
> +{
> + struct mem_cgroup *memcg;
> + int ret;
> +
> + rcu_read_lock();
> +retry:
> + memcg = obj_cgroup_memcg(objcg);
> + if (unlikely(!css_tryget(&memcg->css)))
> + goto retry;
> + rcu_read_unlock();
> +
> + ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
> +
> + css_put(&memcg->css);
> +
> + return ret;
> +}
> +
>  /**
>   * __memcg_kmem_charge: charge a number of kernel pages to a memcg
>   * @memcg: memory cgroup to charge
> @@ -3180,11 +3208,8 @@ static void drain_obj_stock(struct memcg_stock_pcp 
> *stock)
>   unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
>   unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1);
>  
> - if (nr_pages) {
> - rcu_read_lock();
> - __memcg_kmem_uncharge(obj_cgroup_memcg(old), nr_pages);
> - rcu_read_unlock();
> - }
> + if (nr_pages)
> + obj_cgroup_uncharge_page(old, nr_pages);
>  
>   /*
>* The leftover is flushed to the centralized per-memcg value.
> @@ -3242,7 +3267,6 @@ static void refill_obj_stock(struct obj_cgroup *objcg, 
> unsigned int nr_bytes)
>  
>  int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
>  {
> - struct mem_cgroup *memcg;
>   unsigned int nr_pages, nr_bytes;
>   int ret;
>  
> @@ -3259,24 +3283,16 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t 
> gfp, size_t size)
>* refill_obj_stock(), called from this function or
>* independently later.
>*/
> - rcu_read_lock();
> -retry:
> - memcg = obj_cgroup_memcg(objcg);
> - if (unlikely(!css_tryget(&memcg->css)))
> - goto retry;
> - rcu_read_unlock();
> -
>   nr_pages = size >> PAGE_SHIFT;
>   nr_bytes = size & (PAGE_SIZE - 1);
>  
>   if (nr_bytes)
>   nr_pages += 1;
>  
> - ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
> + ret = obj_cgroup_charge_page(objcg, gfp, nr_pages);
>   if (!ret && nr_bytes)
>   refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
>  
> - css_put(&memcg->css);
>   return ret;
>  }
>  
> -- 
> 2.11.0
> 


Re: [PATCH v2 3/4] mm: /proc/sys/vm/stat_refresh skip checking known negative stats

2021-03-03 Thread Roman Gushchin
On Mon, Mar 01, 2021 at 10:03:26PM -0800, Hugh Dickins wrote:
> vmstat_refresh() can occasionally catch nr_zone_write_pending and
> nr_writeback when they are transiently negative.  The reason is partly
> that the interrupt which decrements them in test_clear_page_writeback()
> can come in before __test_set_page_writeback() got to increment them;
> but transient negatives are still seen even when that is prevented, and
> I am not yet certain why (but see Roman's note below).  Those stats are
> not buggy, they have never been seen to drift away from 0 permanently:
> so just avoid the annoyance of showing a warning on them.
> 
> Similarly avoid showing a warning on nr_free_cma: CMA users have seen
> that one reported negative from /proc/sys/vm/stat_refresh too, but it
> does drift away permanently: I believe that's because its incrementation
> and decrementation are decided by page migratetype, but the migratetype
> of a pageblock is not guaranteed to be constant.
> 
> Roman Gushchin points out:
> For performance reasons, vmstat counters are incremented and decremented
> using per-cpu batches.  vmstat_refresh() flushes the per-cpu batches on
> all CPUs, to get values as accurate as possible; but this method is not
> atomic, so the resulting value is not always precise.  As a consequence,
> for those counters whose actual value is close to 0, a small negative
> value may occasionally be reported.  If the value is small and the state
> is transient, it is not an indication of an error.
> 
> Link: https://lore.kernel.org/linux-mm/20200714173747.3315771-1-g...@fb.com/
> Reported-by: Roman Gushchin 
> Signed-off-by: Hugh Dickins 
> ---

Oh, sorry, it looks like I missed acking it. Thank you for updating
the commit log!

Acked-by: Roman Gushchin 
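
(An illustrative, made-up interleaving of the note quoted above:)

    CPU0 per-cpu batch:  +1 for NR_WRITEBACK, not yet folded into the global counter
    CPU1 per-cpu batch:  -1 for NR_WRITEBACK, already folded in
    vmstat_refresh():    flushes the CPUs one by one and snapshots the counter
                         between the two flushes, so it reads -1 transiently
    a moment later:      CPU0's +1 is folded in and the counter is back to 0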


Re: [PATCH v3] mm: memcontrol: fix kernel stack account

2021-03-03 Thread Roman Gushchin
On Wed, Mar 03, 2021 at 11:18:43PM +0800, Muchun Song wrote:
> For simplification 991e7673859e ("mm: memcontrol: account kernel stack
> per node") has changed the per zone vmalloc backed stack pages
> accounting to per node. By doing that we have lost a certain precision
> because those pages might live in different NUMA nodes. In the end
> NR_KERNEL_STACK_KB exported to the userspace might be over estimated on
> some nodes while underestimated on others. But this is not a real world
> problem, just a problem found by reading the code. So there is no actual
> data to showing how much impact it has on users.
> 
> This doesn't impose any real problem to the correctness of the kernel
> behavior as the counter is not used for any internal processing but it
> can cause some confusion to the userspace.
> 
> Address the problem by accounting each vmalloc backing page to its own
> node.
> 
> Signed-off-by: Muchun Song 
> Reviewed-by: Shakeel Butt 
> Acked-by: Michal Hocko 
> Acked-by: Johannes Weiner 

Acked-by: Roman Gushchin 

Thanks!

> ---
> Changelog in v3:
>  - Remove BUG_ON().
>  - Update commit log.
> 
> Changelog in v2:
>  - Rework commit log suggested by Michal.
> 
>  Thanks to Michal and Shakeel for review.
> 
>  kernel/fork.c | 13 -
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index d66cd1014211..242fdad6972b 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -379,14 +379,17 @@ static void account_kernel_stack(struct task_struct 
> *tsk, int account)
>   void *stack = task_stack_page(tsk);
>   struct vm_struct *vm = task_stack_vm_area(tsk);
>  
> + if (vm) {
> + int i;
>  
> - /* All stack pages are in the same node. */
> - if (vm)
> - mod_lruvec_page_state(vm->pages[0], NR_KERNEL_STACK_KB,
> -   account * (THREAD_SIZE / 1024));
> - else
> + for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
> + mod_lruvec_page_state(vm->pages[i], NR_KERNEL_STACK_KB,
> +   account * (PAGE_SIZE / 1024));
> + } else {
> + /* All stack pages are in the same node. */
>   mod_lruvec_kmem_state(stack, NR_KERNEL_STACK_KB,
> account * (THREAD_SIZE / 1024));
> + }
>  }
>  
>  static int memcg_charge_kernel_stack(struct task_struct *tsk)
> -- 
> 2.11.0
> 


Re: [PATCH] mm: memcontrol: fix kernel stack account

2021-03-02 Thread Roman Gushchin
On Tue, Mar 02, 2021 at 08:33:20PM +0100, Michal Hocko wrote:
> On Tue 02-03-21 10:50:32, Roman Gushchin wrote:
> > On Tue, Mar 02, 2021 at 03:37:33PM +0800, Muchun Song wrote:
> > > The alloc_thread_stack_node() cannot guarantee that allocated stack pages
> > > are in the same node when CONFIG_VMAP_STACK. Because we do not specify
> > > __GFP_THISNODE to __vmalloc_node_range(). Fix it by calling
> > > mod_lruvec_page_state() for each page one by one.
> > 
> > Hm, I actually wonder if it makes any sense to split the stack over multiple
> > nodes? Maybe we should fix this instead?
> 
> While this is not really ideal I am not really sure it is an actual
> problem worth complicating the code. I am pretty sure this would grow
> into a more tricky problem quite quickly (e.g. proper memory policy
> handling).

I'd agree, and IMO accounting a couple of pages to a different node
is an even smaller problem.


Re: [PATCH] MIPS: BMIPS: Reserve exception base to prevent corruption

2021-03-02 Thread Roman Gushchin
On Mon, Mar 01, 2021 at 08:19:38PM -0800, Florian Fainelli wrote:
> BMIPS is one of the few platforms that do change the exception base.
> After commit 2dcb39645441 ("memblock: do not start bottom-up allocations
> with kernel_end") we started seeing BMIPS boards fail to boot with the
> built-in FDT being corrupted.
> 
> Before the cited commit, early allocations would be in the [kernel_end,
> RAM_END] range, but after commit they would be within [RAM_START +
> PAGE_SIZE, RAM_END].
> 
> The custom exception base handler that is installed by
> bmips_ebase_setup() done for BMIPS5000 CPUs ends-up trampling on the
> memory region allocated by unflatten_and_copy_device_tree() thus
> corrupting the FDT used by the kernel.
> 
> To fix this, we need to perform an early reservation of the custom
> exception that is going to be installed and this needs to happen at
> plat_mem_setup() time to ensure that unflatten_and_copy_device_tree()
> finds a space that is suitable, away from reserved memory.
> 
> Huge thanks to Serge for analysing and proposing a solution to this
> issue.
> 
> Fixes: 2dcb39645441 ("memblock: do not start bottom-up allocations 
> with kernel_end")
> Debugged-by: Serge Semin 
> Reported-by: Kamal Dasu 
> Signed-off-by: Florian Fainelli 

Acked-by: Roman Gushchin 

Thank you!

> ---
> Thomas,
> 
> This is intended as a stop-gap solution for 5.12-rc1 and to be picked up
> by the stable team for 5.11. We should find a safer way to avoid these
> problems for 5.13 maybe.
> 
>  arch/mips/bmips/setup.c   | 22 ++
>  arch/mips/include/asm/traps.h |  2 ++
>  2 files changed, 24 insertions(+)
> 
> diff --git a/arch/mips/bmips/setup.c b/arch/mips/bmips/setup.c
> index 31bcfa4e08b9..0088bd45b892 100644
> --- a/arch/mips/bmips/setup.c
> +++ b/arch/mips/bmips/setup.c
> @@ -149,6 +149,26 @@ void __init plat_time_init(void)
>   mips_hpt_frequency = freq;
>  }
>  
> +static void __init bmips_ebase_reserve(void)
> +{
> + phys_addr_t base, size = VECTORSPACING * 64;
> +
> + switch (current_cpu_type()) {
> + default:
> + case CPU_BMIPS4350:
> + return;
> + case CPU_BMIPS3300:
> + case CPU_BMIPS4380:
> + base = 0x0400;
> + break;
> + case CPU_BMIPS5000:
> + base = 0x1000;
> + break;
> + }
> +
> + memblock_reserve(base, size);
> +}
> +
>  void __init plat_mem_setup(void)
>  {
>   void *dtb;
> @@ -169,6 +189,8 @@ void __init plat_mem_setup(void)
>  
>   __dt_setup_arch(dtb);
>  
> + bmips_ebase_reserve();
> +
>   for (q = bmips_quirk_list; q->quirk_fn; q++) {
>   if (of_flat_dt_is_compatible(of_get_flat_dt_root(),
>q->compatible)) {
> diff --git a/arch/mips/include/asm/traps.h b/arch/mips/include/asm/traps.h
> index 6aa8f126a43d..0ba6bb7f9618 100644
> --- a/arch/mips/include/asm/traps.h
> +++ b/arch/mips/include/asm/traps.h
> @@ -14,6 +14,8 @@
>  #define MIPS_BE_FIXUP1   /* return to the fixup code */
>  #define MIPS_BE_FATAL2   /* treat as an unrecoverable 
> error */
>  
> +#define VECTORSPACING 0x100  /* for EI/VI mode */
> +
>  extern void (*board_be_init)(void);
>  extern int (*board_be_handler)(struct pt_regs *regs, int is_fixup);
>  
> -- 
> 2.25.1
> 


Re: [PATCH] mm: memcontrol: fix root_mem_cgroup charging

2021-03-02 Thread Roman Gushchin
On Tue, Mar 02, 2021 at 04:18:23PM +0800, Muchun Song wrote:
> CPU0:   CPU1:
> 
> objcg = get_obj_cgroup_from_current();
> obj_cgroup_charge(objcg);
> memcg_reparent_objcgs();
> xchg(&objcg->memcg, 
> root_mem_cgroup);
> // memcg == root_mem_cgroup
> memcg = obj_cgroup_memcg(objcg);
> __memcg_kmem_charge(memcg);
> // Do not charge to the root memcg
> try_charge(memcg);
> 
> If the objcg->memcg is reparented to the root_mem_cgroup,
> obj_cgroup_charge() can pass root_mem_cgroup as the first
> parameter to here. The root_mem_cgroup is skipped in the
> try_charge(). So the page counters of it do not update.
> 
> When we uncharge this, we will decrease the page counters
> (e.g. memory and memsw) of the root_mem_cgroup. This will
> cause the page counters of the root_mem_cgroup to be out
> of balance. Fix it by charging the page to the
> root_mem_cgroup unconditional.

Is this a problem? It seems that we do not expose root memcg's counters
except kmem and tcp. It seems that the described problem is not
applicable to the kmem counter. Please, explain.

Thanks!

> 
> Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
> Signed-off-by: Muchun Song 
> ---
>  mm/memcontrol.c | 13 +
>  1 file changed, 13 insertions(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2db2aeac8a9e..edf604824d63 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3078,6 +3078,19 @@ static int __memcg_kmem_charge(struct mem_cgroup 
> *memcg, gfp_t gfp,
>   if (ret)
>   return ret;
>  
> + /*
> +  * If the objcg->memcg is reparented to the root_mem_cgroup,
> +  * obj_cgroup_charge() can pass root_mem_cgroup as the first
> +  * parameter to here. We should charge the page to the
> +  * root_mem_cgroup unconditional to keep it's page counters
> +  * balance.
> +  */
> + if (unlikely(mem_cgroup_is_root(memcg))) {
> + page_counter_charge(&memcg->memory, nr_pages);
> + if (do_memsw_account())
> + page_counter_charge(&memcg->memsw, nr_pages);
> + }
> +
>   if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
>   !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
>  
> -- 
> 2.11.0
> 


Re: [PATCH] mm: memcontrol: fix kernel stack account

2021-03-02 Thread Roman Gushchin
On Tue, Mar 02, 2021 at 03:37:33PM +0800, Muchun Song wrote:
> The alloc_thread_stack_node() cannot guarantee that allocated stack pages
> are in the same node when CONFIG_VMAP_STACK. Because we do not specify
> __GFP_THISNODE to __vmalloc_node_range(). Fix it by calling
> mod_lruvec_page_state() for each page one by one.

Hm, I actually wonder if it makes any sense to split the stack over multiple
nodes? Maybe we should fix this instead?

> 
> Fixes: 991e7673859e ("mm: memcontrol: account kernel stack per node")
> Signed-off-by: Muchun Song 
> ---
>  kernel/fork.c | 15 ++-
>  1 file changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index d66cd1014211..6e2201feb524 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -379,14 +379,19 @@ static void account_kernel_stack(struct task_struct 
> *tsk, int account)
>   void *stack = task_stack_page(tsk);
>   struct vm_struct *vm = task_stack_vm_area(tsk);
>  
> + if (vm) {
> + int i;
>  
> - /* All stack pages are in the same node. */
> - if (vm)
> - mod_lruvec_page_state(vm->pages[0], NR_KERNEL_STACK_KB,
> -   account * (THREAD_SIZE / 1024));
> - else
> + BUG_ON(vm->nr_pages != THREAD_SIZE / PAGE_SIZE);
> +
> + for (i = 0; i < THREAD_SIZE / PAGE_SIZE; i++)
> + mod_lruvec_page_state(vm->pages[i], NR_KERNEL_STACK_KB,
> +   account * (PAGE_SIZE / 1024));
> + } else {
> + /* All stack pages are in the same node. */
>   mod_lruvec_kmem_state(stack, NR_KERNEL_STACK_KB,
> account * (THREAD_SIZE / 1024));
> + }
>  }
>  
>  static int memcg_charge_kernel_stack(struct task_struct *tsk)
> -- 
> 2.11.0
> 


Re: [PATCH 4/5] mm: memcontrol: move remote memcg charging APIs to CONFIG_MEMCG_KMEM

2021-03-02 Thread Roman Gushchin
On Mon, Mar 01, 2021 at 07:43:27PM -0800, Shakeel Butt wrote:
> On Mon, Mar 1, 2021 at 5:16 PM Roman Gushchin  wrote:
> >
> > On Mon, Mar 01, 2021 at 02:22:26PM +0800, Muchun Song wrote:
> > > The remote memcg charging APIs are a mechanism to charge kernel memory
> > > to a given memcg. So we can move the infrastructure to the scope of
> > > the CONFIG_MEMCG_KMEM.
> >
> > This is not a good idea, because there is nothing kmem-specific
> > in the idea of remote charging, and we definitely will see cases
> > when user memory is charged to the process different from the current.
> >
> 
> Indeed and which remind me: what happened to the "Charge loop device
> i/o to issuing cgroup" series? That series was doing remote charging
> for user pages.

Yeah, this is exactly what I had in mind. We're using it internally, and as I
remember there were no obstacles to upstreaming it too.
I'll ping Dan after the merge window.

Thanks!


Re: [PATCH v2 2/2] memblock: do not start bottom-up allocations with kernel_end

2021-03-02 Thread Roman Gushchin
On Mon, Mar 01, 2021 at 11:45:42AM +0200, Mike Rapoport wrote:
> On Sun, Feb 28, 2021 at 07:50:45PM -0800, Florian Fainelli wrote:
> > Hi Serge,
> > 
> > On 2/28/2021 3:08 PM, Serge Semin wrote:
> > > Hi folks,
> > > What you've got here seems to be a more complicated problem than it
> > > might look at first. Please see my comments below.
> > > 
> > > (Note I've discarded some of the email logs, which are of no interest
> > > to the discovered problem. Please also note that I haven't got any
> > > Broadcom hardware to test the solution suggested below.)
> > > 
> > > On Sun, Feb 28, 2021 at 10:19:51AM -0800, Florian Fainelli wrote:
> > >> Hi Mike,
> > >>
> > >> On 2/28/2021 1:00 AM, Mike Rapoport wrote:
> > >>> Hi Florian,
> > >>>
> > >>> On Sat, Feb 27, 2021 at 08:18:47PM -0800, Florian Fainelli wrote:
> > 
> > > 
> >  [...]
> > > 
> > 
> >  Hi Roman, Thomas and other linux-mips folks,
> > 
> >  Kamal and I have been unable to boot v5.11 on MIPS since this
> >  commit; reverting it makes our MIPS platforms boot successfully. We do
> >  not see a warning like this one in the commit message, instead what
> >  happens appear to be a corrupted Device Tree which prevents the parsing
> >  of the "rdb" node and leading to the interrupt controllers not being
> >  registered, and the system eventually not booting.
> > 
> >  The Device Tree is built-into the kernel image and resides at
> >  arch/mips/boot/dts/brcm/bcm97435svmb.dts.
> > 
> >  Do you have any idea what could be wrong with MIPS specifically here?
> > > 
> > > Most likely the problem you've discovered has been there for quite
> > > some time. The patch you are referring to just caused it to be
> > > triggered by extending the early allocation range. Before that
> > > patch was accepted, the early memory allocations had been performed
> > > in the range:
> > > [kernel_end, RAM_END].
> > > The patch changed that, so the early allocations are now done within
> > > [RAM_START + PAGE_SIZE, RAM_END].
> > > 
> > > In normal situations it's safe to do that as long as all the critical
> > > memory regions (including the memory residing below the kernel) have
> > > been reserved. But as soon as memory holding some critical structures
> > > hasn't been reserved, the kernel may allocate it, for instance for
> > > early initializations, with unpredictable and usually unpleasant
> > > consequences.
> > > 
> > >>>
> > >>> Apparently there is a memblock allocation in one of the functions called
> > >>> from arch_mem_init() between plat_mem_setup() and
> > >>> early_init_fdt_reserve_self().
> > > 
> > > Mike, alas, according to the log provided by Florian that's not the cause
> > > of the problem. Please see my considerations below.
> > > 
> > >> [...]
> > >>
> > >> [0.00] Linux version 5.11.0-g5695e5161974 (florian@localhost)
> > >> (mipsel-linux-gcc (GCC) 8.3.0, GNU ld (GNU Binutils) 2.32) #84 SMP Sun
> > >> Feb 28 10:01:50 PST 2021
> > >> [0.00] CPU0 revision is: 00025b00 (Broadcom BMIPS5200)
> > >> [0.00] FPU revision is: 00130001
> > > 
> > >> [0.00] memblock_add: [0x-0x0fff]
> > >> early_init_dt_scan_memory+0x160/0x1e0
> > >> [0.00] memblock_add: [0x2000-0x4fff]
> > >> early_init_dt_scan_memory+0x160/0x1e0
> > >> [0.00] memblock_add: [0x9000-0xcfff]
> > >> early_init_dt_scan_memory+0x160/0x1e0
> > > 
> > > Here the memory has been added to the memblock allocator.
> > > 
> > >> [0.00] MIPS: machine is Broadcom BCM97435SVMB
> > >> [0.00] earlycon: ns16550a0 at MMIO32 0x10406b00 (options '')
> > >> [0.00] printk: bootconsole [ns16550a0] enabled
> > > 
> > >> [0.00] memblock_reserve: [0x00aa7600-0x00aaa0a0]
> > >> setup_arch+0x128/0x69c
> > > 
> > > Here the fdt memory has been reserved. (Note it's built into the
> > > kernel.)
> > > 
> > >> [0.00] memblock_reserve: [0x0001-0x018313cf]
> > >> setup_arch+0x1f8/0x69c
> > > 
> > > Here the kernel itself together with built-in dtb have been reserved.
> > > So far so good.
> > > 
> > >> [0.00] Initrd not found or empty - disabling initrd
> > > 
> > >> [0.00] memblock_alloc_try_nid: 10913 bytes align=0x40 nid=-1
> > >> from=0x max_addr=0x
> > >> early_init_dt_alloc_memory_arch+0x40/0x84
> > >> [0.00] memblock_reserve: [0x1000-0x3aa0]
> > >> memblock_alloc_range_nid+0xf8/0x198
> > >> [0.00] memblock_alloc_try_nid: 32680 bytes align=0x4 nid=-1
> > >> from=0x max_addr=0x
> > >> early_init_dt_alloc_memory_arch+0x40/0x84
> > >> [0.00] memblock_reserve: [0x3aa4-0xba4b]
> > >> memblock_alloc_range_nid+0xf8/0x198
> > > 
> > > The log above most likely belongs to the call-chain:
> > > setup_arch()
> > > +-> arch_mem_init()
> > > +-> device_tree_init() - BMIPS specific method
> > > +-> 

Re: [PATCH 5/5] mm: memcontrol: use object cgroup for remote memory cgroup charging

2021-03-01 Thread Roman Gushchin
On Mon, Mar 01, 2021 at 02:22:27PM +0800, Muchun Song wrote:
> We spent a lot of energy to make slab accounting do not hold a refcount
> to memory cgroup, so the dying cgroup can be freed as soon as possible
> on cgroup offlined.
> 
> But some users of remote memory cgroup charging (e.g. bpf and fsnotify)
> hold a refcount to memory cgroup for charging to it later. Actually,
> the slab core use obj_cgroup APIs for memory cgroup charing, so we can
> hold a refcount to obj_cgroup instead of memory cgroup. In this case,
> the infrastructure of remote meory charging also do not hold a refcount
> to memory cgroup.

-cc all except mm folks

Same here, let's not switch the remote charging infra to objcg, so that we keep
the ability to use it for user pages. If we have a real problem with bpf/...,
let's solve it case by case.

Thanks!

> 
> Signed-off-by: Muchun Song 
> ---
>  fs/buffer.c  | 10 --
>  fs/notify/fanotify/fanotify.c|  6 ++--
>  fs/notify/fanotify/fanotify_user.c   |  2 +-
>  fs/notify/group.c|  3 +-
>  fs/notify/inotify/inotify_fsnotify.c |  8 ++---
>  fs/notify/inotify/inotify_user.c |  2 +-
>  include/linux/bpf.h  |  2 +-
>  include/linux/fsnotify_backend.h |  2 +-
>  include/linux/memcontrol.h   | 15 
>  include/linux/sched.h|  4 +--
>  include/linux/sched/mm.h | 28 +++
>  kernel/bpf/syscall.c | 35 +--
>  kernel/fork.c|  2 +-
>  mm/memcontrol.c  | 66 
> 
>  14 files changed, 121 insertions(+), 64 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 591547779dbd..cc99fcf66368 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -842,14 +842,16 @@ struct buffer_head *alloc_page_buffers(struct page 
> *page, unsigned long size,
>   struct buffer_head *bh, *head;
>   gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT;
>   long offset;
> - struct mem_cgroup *memcg, *old_memcg;
> + struct mem_cgroup *memcg;
> + struct obj_cgroup *objcg, *old_objcg;
>  
>   if (retry)
>   gfp |= __GFP_NOFAIL;
>  
>   /* The page lock pins the memcg */
>   memcg = page_memcg(page);
> - old_memcg = set_active_memcg(memcg);
> + objcg = get_obj_cgroup_from_mem_cgroup(memcg);
> + old_objcg = set_active_obj_cgroup(objcg);
>  
>   head = NULL;
>   offset = PAGE_SIZE;
> @@ -868,7 +870,9 @@ struct buffer_head *alloc_page_buffers(struct page *page, 
> unsigned long size,
>   set_bh_page(bh, page, offset);
>   }
>  out:
> - set_active_memcg(old_memcg);
> + set_active_obj_cgroup(old_objcg);
> + if (objcg)
> + obj_cgroup_put(objcg);
>   return head;
>  /*
>   * In case anything failed, we just free everything we got.
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 1192c9953620..04d24acfffc7 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -530,7 +530,7 @@ static struct fanotify_event *fanotify_alloc_event(struct 
> fsnotify_group *group,
>   struct inode *dirid = fanotify_dfid_inode(mask, data, data_type, dir);
>   const struct path *path = fsnotify_data_path(data, data_type);
>   unsigned int fid_mode = FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS);
> - struct mem_cgroup *old_memcg;
> + struct obj_cgroup *old_objcg;
>   struct inode *child = NULL;
>   bool name_event = false;
>  
> @@ -580,7 +580,7 @@ static struct fanotify_event *fanotify_alloc_event(struct 
> fsnotify_group *group,
>   gfp |= __GFP_RETRY_MAYFAIL;
>  
>   /* Whoever is interested in the event, pays for the allocation. */
> - old_memcg = set_active_memcg(group->memcg);
> + old_objcg = set_active_obj_cgroup(group->objcg);
>  
>   if (fanotify_is_perm_event(mask)) {
>   event = fanotify_alloc_perm_event(path, gfp);
> @@ -608,7 +608,7 @@ static struct fanotify_event *fanotify_alloc_event(struct 
> fsnotify_group *group,
>   event->pid = get_pid(task_tgid(current));
>  
>  out:
> - set_active_memcg(old_memcg);
> + set_active_obj_cgroup(old_objcg);
>   return event;
>  }
>  
> diff --git a/fs/notify/fanotify/fanotify_user.c 
> b/fs/notify/fanotify/fanotify_user.c
> index 9e0c1afac8bd..055ca36d4e0e 100644
> --- a/fs/notify/fanotify/fanotify_user.c
> +++ b/fs/notify/fanotify/fanotify_user.c
> @@ -985,7 +985,7 @@ SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, 
> unsigned int, event_f_flags)
>   group->fanotify_data.user = user;
>   group->fanotify_data.flags = flags;
>   atomic_inc(&user->fanotify_listeners);
> - group->memcg = get_mem_cgroup_from_mm(current->mm);
> + group->objcg = get_obj_cgroup_from_current();
>  
>   group->overflow_event = fanotify_alloc_overflow_event();
>   if (unlikely(!group->overflow_event)) {
> diff --git 

Re: [PATCH 4/5] mm: memcontrol: move remote memcg charging APIs to CONFIG_MEMCG_KMEM

2021-03-01 Thread Roman Gushchin
On Mon, Mar 01, 2021 at 02:22:26PM +0800, Muchun Song wrote:
> The remote memcg charing APIs is a mechanism to charge kernel memory
> to a given memcg. So we can move the infrastructure to the scope of
> the CONFIG_MEMCG_KMEM.

This is not a good idea, because there is nothing kmem-specific
in the idea of remote charging, and we will definitely see cases
when user memory is charged to a process different from the current one.
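
For context, the scope API itself is already completely generic: all
set_active_memcg() does is set the active memcg for the current context,
and whatever charge path consults it next pays to that cgroup. Schematically
(target_memcg, buf and size are just placeholders here):

	struct mem_cgroup *old_memcg;
	void *buf;

	/* everything charged in this scope goes to target_memcg,
	 * not to the cgroup of the current task */
	old_memcg = set_active_memcg(target_memcg);
	buf = kmalloc(size, GFP_KERNEL_ACCOUNT);
	set_active_memcg(old_memcg);

Today only the kmem charge paths look at the active memcg, but nothing in
the mechanism itself prevents the user-memory charge path from doing the
same, which is why hiding it under CONFIG_MEMCG_KMEM feels wrong to me.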

> 
> As a bonus, on !CONFIG_MEMCG_KMEM build some functions and variables
> can be compiled out.
> 
> Signed-off-by: Muchun Song 
> ---
>  include/linux/sched.h| 2 ++
>  include/linux/sched/mm.h | 2 +-
>  kernel/fork.c| 2 +-
>  mm/memcontrol.c  | 4 
>  4 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ee46f5cab95b..c2d488eddf85 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1314,7 +1314,9 @@ struct task_struct {
>  
>   /* Number of pages to reclaim on returning to userland: */
>   unsigned intmemcg_nr_pages_over_high;
> +#endif
>  
> +#ifdef CONFIG_MEMCG_KMEM
>   /* Used by memcontrol for targeted memcg charge: */
>   struct mem_cgroup   *active_memcg;
>  #endif
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 1ae08b8462a4..64a72975270e 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -294,7 +294,7 @@ static inline void memalloc_nocma_restore(unsigned int 
> flags)
>  }
>  #endif
>  
> -#ifdef CONFIG_MEMCG
> +#ifdef CONFIG_MEMCG_KMEM
>  DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg);
>  /**
>   * set_active_memcg - Starts the remote memcg charging scope.
> diff --git a/kernel/fork.c b/kernel/fork.c
> index d66cd1014211..d66718bc82d5 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -942,7 +942,7 @@ static struct task_struct *dup_task_struct(struct 
> task_struct *orig, int node)
>   tsk->use_memdelay = 0;
>  #endif
>  
> -#ifdef CONFIG_MEMCG
> +#ifdef CONFIG_MEMCG_KMEM
>   tsk->active_memcg = NULL;
>  #endif
>   return tsk;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 39cb8c5bf8b2..092dc4588b43 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -76,8 +76,10 @@ EXPORT_SYMBOL(memory_cgrp_subsys);
>  
>  struct mem_cgroup *root_mem_cgroup __read_mostly;
>  
> +#ifdef CONFIG_MEMCG_KMEM
>  /* Active memory cgroup to use from an interrupt context */
>  DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
> +#endif
>  
>  /* Socket memory accounting disabled? */
>  static bool cgroup_memory_nosocket;
> @@ -1054,6 +1056,7 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct 
> mm_struct *mm)
>  }
>  EXPORT_SYMBOL(get_mem_cgroup_from_mm);
>  
> +#ifdef CONFIG_MEMCG_KMEM
>  static __always_inline struct mem_cgroup *active_memcg(void)
>  {
>   if (in_interrupt())
> @@ -1074,6 +1077,7 @@ static __always_inline bool memcg_kmem_bypass(void)
>  
>   return false;
>  }
> +#endif
>  
>  /**
>   * mem_cgroup_iter - iterate over memory cgroup hierarchy
> -- 
> 2.11.0
> 


Re: [PATCH 0/5] Use obj_cgroup APIs to change kmem pages

2021-03-01 Thread Roman Gushchin
Hi Muchun!

On Mon, Mar 01, 2021 at 02:22:22PM +0800, Muchun Song wrote:
> Since Roman series "The new cgroup slab memory controller" applied. All
> slab objects are changed via the new APIs of obj_cgroup. This new APIs
> introduce a struct obj_cgroup instead of using struct mem_cgroup directly
> to charge slab objects. It prevents long-living objects from pinning the
> original memory cgroup in the memory. But there are still some corner
> objects (e.g. allocations larger than order-1 page on SLUB) which are
> not charged via the API of obj_cgroup. Those objects (include the pages
> which are allocated from buddy allocator directly) are charged as kmem
> pages which still hold a reference to the memory cgroup.

Yes, this is a good idea: large kmallocs should be treated the same
way as small ones.
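
For reference, the split happens right in the kmalloc() fast path: with SLUB,
anything above KMALLOC_MAX_CACHE_SIZE (order-1, i.e. two pages) bypasses the
slab caches and comes straight from the page allocator, which is why those
allocations end up accounted as kmem pages rather than as slab objects.
A simplified illustration (not the literal kernel code):

	if (size > KMALLOC_MAX_CACHE_SIZE)	/* PAGE_SIZE << 1 with SLUB */
		return kmalloc_large(size, flags);	/* buddy allocator */
	return __kmalloc(size, flags);			/* slab caches */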

> 
> E.g. We know that the kernel stack is charged as kmem pages because the
> size of the kernel stack can be greater than 2 pages (e.g. 16KB on x86_64
> or arm64). If we create a thread (suppose the thread stack is charged to
> memory cgroup A) and then move it from memory cgroup A to memory cgroup
> B. Because the kernel stack of the thread hold a reference to the memory
> cgroup A. The thread can pin the memory cgroup A in the memory even if
> we remove the cgroup A. If we want to see this scenario by using the
> following script. We can see that the system has added 500 dying cgroups.
> 
>   #!/bin/bash
> 
>   cat /proc/cgroups | grep memory
> 
>   cd /sys/fs/cgroup/memory
>   echo 1 > memory.move_charge_at_immigrate
> 
>   for i in range{1..500}
>   do
>   mkdir kmem_test
>   echo $$ > kmem_test/cgroup.procs
>   sleep 3600 &
>   echo $$ > cgroup.procs
>   echo `cat kmem_test/cgroup.procs` > cgroup.procs
>   rmdir kmem_test
>   done
> 
>   cat /proc/cgroups | grep memory

Well, moving processes between cgroups has always created a lot of issues
and corner cases, and this one is definitely not the worst. So this problem
looks a bit artificial, unless I'm missing something. But if it doesn't
introduce any new performance costs and doesn't make the code more complex,
I have nothing against it.

Btw, can you, please, run the spell-checker on commit logs? There are many
typos (starting from the title of the series, I guess), which make the patchset
look less appealing.

Thank you!

> 
> This patchset aims to make those kmem pages drop the reference to memory
> cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
> of the dying cgroups will not increase if we run the above test script.
> 
> Patch 1-3 are using obj_cgroup APIs to charge kmem pages. The remote
> memory cgroup charing APIs is a mechanism to charge kernel memory to a
> given memory cgroup. So I also make it use the APIs of obj_cgroup.
> Patch 4-5 are doing this.
> 
> Muchun Song (5):
>   mm: memcontrol: introduce obj_cgroup_{un}charge_page
>   mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem
> page
>   mm: memcontrol: reparent the kmem pages on cgroup removal
>   mm: memcontrol: move remote memcg charging APIs to CONFIG_MEMCG_KMEM
>   mm: memcontrol: use object cgroup for remote memory cgroup charging
> 
>  fs/buffer.c  |  10 +-
>  fs/notify/fanotify/fanotify.c|   6 +-
>  fs/notify/fanotify/fanotify_user.c   |   2 +-
>  fs/notify/group.c|   3 +-
>  fs/notify/inotify/inotify_fsnotify.c |   8 +-
>  fs/notify/inotify/inotify_user.c |   2 +-
>  include/linux/bpf.h  |   2 +-
>  include/linux/fsnotify_backend.h |   2 +-
>  include/linux/memcontrol.h   | 109 +++---
>  include/linux/sched.h|   6 +-
>  include/linux/sched/mm.h |  30 ++--
>  kernel/bpf/syscall.c |  35 ++---
>  kernel/fork.c|   4 +-
>  mm/memcontrol.c  | 276 
> ++-
>  mm/page_alloc.c  |   4 +-
>  15 files changed, 324 insertions(+), 175 deletions(-)
> 
> -- 
> 2.11.0
> 


Re: [PATCH 3/4] mm: /proc/sys/vm/stat_refresh skip checking known negative stats

2021-03-01 Thread Roman Gushchin
Mon, Mar 01, 2021 at 02:08:17PM -0800, Hugh Dickins wrote:
> On Sun, 28 Feb 2021, Roman Gushchin wrote:
> > On Thu, Feb 25, 2021 at 03:14:03PM -0800, Hugh Dickins wrote:
> > > vmstat_refresh() can occasionally catch nr_zone_write_pending and
> > > nr_writeback when they are transiently negative.  The reason is partly
> > > that the interrupt which decrements them in test_clear_page_writeback()
> > > can come in before __test_set_page_writeback() got to increment them;
> > > but transient negatives are still seen even when that is prevented, and
> > > we have not yet resolved why (Roman believes that it is an unavoidable
> > > consequence of the refresh scheduled on each cpu).  But those stats are
> > > not buggy, they have never been seen to drift away from 0 permanently:
> > > so just avoid the annoyance of showing a warning on them.
> > > 
> > > Similarly avoid showing a warning on nr_free_cma: CMA users have seen
> > > that one reported negative from /proc/sys/vm/stat_refresh too, but it
> > > does drift away permanently: I believe that's because its incrementation
> > > and decrementation are decided by page migratetype, but the migratetype
> > > of a pageblock is not guaranteed to be constant.
> > > 
> > > Use switch statements so we can most easily add or remove cases later.
> > 
> > I'm OK with the code, but I can't fully agree with the commit log. I don't
> > think there is any mystery around negative values. Let me copy-paste the
> > explanation from my original patch:
> > 
> > These warnings* are generated by the vmstat_refresh() function, which
> > assumes that atomic zone and numa counters can't go below zero. However,
> > on an SMP machine that's not quite right: due to per-cpu caching a counter
> > can in theory be as low as -(zone threshold) * NR_CPUs.
> > 
> > For instance, let's say all cma pages are in use and NR_FREE_CMA_PAGES
> > reached 0.  Then we've reclaimed a small number of cma pages on each CPU
> > except CPU0, so that most percpu NR_FREE_CMA_PAGES counters are slightly
> > positive (the atomic counter is still 0).  Then somebody on CPU0 consumes
> > all these pages.  The number of pages can easily exceed the threshold and
> > a negative value will be committed to the atomic counter.
> > 
> > * warnings about negative NR_FREE_CMA_PAGES
> 
> Hi Roman, thanks for your Acks on the others - and indeed this
> is the one on which disagreement was more to be expected.
> 
> I certainly wanted (and included below) a Link to your original patch;
> and even wondered whether to paste your description into mine.
> But I read it again and still have issues with it.
> 
> Mainly, it does not convey at all, that touching stat_refresh adds the
> per-cpu counts into the global atomics, resetting per-cpu counts to 0.
> Which does not invalidate your explanation: races might still manage
> to underflow; but it does take the "easily" out of "can easily exceed".

Hi Hugh!

It could be that "easily" simply comes from the scale (the number of machines).
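
To put a rough number on it: with the maximum per-cpu stat threshold (125, if
I remember the cap in calculate_normal_threshold() correctly) and, say, a
64-CPU machine, the global atomic can transiently read as low as
-(125 * 64) = -8000, even though the real value never dropped below zero.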

> 
> Since I don't use CMA on any machine, I cannot be sure, but it looked
> like a bad example to rely upon, because of its migratetype-based
> accounting.  If you use /proc/sys/vm/stat_refresh frequently enough,
> without suppressing the warning, I guess that uncertainty could be
> resolved by checking whether nr_free_cma is seen with negative value
> in consecutive refreshes - which would tend to support my migratetype
> theory - or only singly - which would support your raciness theory.
> 
> > 
> > Actually, the same is almost true for ANY other counter. What sets the CMA,
> > dirty and write-pending counters apart is that they can reach 0 under
> > normal conditions. Other counters usually do not reach values small enough
> > to see negatives on a reasonably sized machine.
> 
> Looking through /proc/vmstat now, yes, I can see that there are fewer
> counters which hover near 0 than I had imagined: more have a positive
> bias, or are monotonically increasing.  And I'd be lying if I said I'd
> never seen any others than nr_writeback or nr_zone_write_pending caught
> negative.  But what are you asking for?  Should the patch be changed, to
> retry the refresh_vm_stats() before warning, if it sees any negative?
> Depends on how terrible one line in dmesg is considered!
> 
> > 
> > Does it makes sense?
> 
> I'm not sure: you were not asking for the patch to be changed, but
> its commit lo
