Re: [RFC] memory reserve for userspace oom-killer

2021-04-20 Thread Suren Baghdasaryan
Hi Folks,

On Tue, Apr 20, 2021 at 12:18 PM Roman Gushchin  wrote:
>
> On Mon, Apr 19, 2021 at 06:44:02PM -0700, Shakeel Butt wrote:
> > Proposal: Provide memory guarantees to userspace oom-killer.
> >
> > Background:
> >
> > Issues with kernel oom-killer:
> > 1. Very conservative and prefer to reclaim. Applications can suffer
> > for a long time.
> > 2. Borrows the context of the allocator which can be resource limited
> > (low sched priority or limited CPU quota).
> > 3. Serialized by global lock.
> > 4. Very simplistic oom victim selection policy.
> >
> > These issues are resolved through userspace oom-killer by:
> > 1. Ability to monitor arbitrary metrics (PSI, vmstat, memcg stats) to
> > early detect suffering.
> > 2. Independent process context which can be given dedicated CPU quota
> > and high scheduling priority.
> > 3. Can be more aggressive as required.
> > 4. Can implement sophisticated business logic/policies.
> >
> > Android's LMKD and Facebook's oomd are the prime examples of userspace
> > oom-killers. One of the biggest challenges for userspace oom-killers
> > is that they may have to function under intense memory pressure and
> > are prone to getting stuck in memory reclaim themselves. Current
> > userspace oom-killers aim to avoid this situation by preallocating
> > user memory and protecting themselves from global reclaim by either
> > mlocking or memory.min. However, a new allocation from the userspace
> > oom-killer can still get stuck in reclaim, and policy-rich oom-killers
> > do trigger new allocations through syscalls or even the heap.
> >
> > Our attempt at a userspace oom-killer faces similar challenges.
> > Particularly at the tail on very highly utilized machines, we have
> > observed the userspace oom-killer failing spectacularly in many
> > possible ways in direct reclaim. We have seen the oom-killer stuck in
> > direct reclaim throttling, stuck in reclaim while allocations from
> > interrupts keep stealing the reclaimed memory. We have even observed
> > systems where all the processes were stuck in throttle_direct_reclaim()
> > and only kswapd was running, with interrupts stealing the memory it
> > reclaimed.
> >
> > To reliably solve this problem, we need to give guaranteed memory to
> > the userspace oom-killer. At the moment we are contemplating the
> > following options and I would like to get some feedback.
> >
> > 1. prctl(PF_MEMALLOC)
> >
> > The idea is to give userspace oom-killer (just one thread which is
> > finding the appropriate victims and will be sending SIGKILLs) access
> > to MEMALLOC reserves. Most of the time the preallocation, mlock and
> > memory.min will be good enough but for rare occasions, when the
> > userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
> > protect it from reclaim and let the allocation dip into the memory
> > reserves.
> >
> > The misuse of this feature would be risky but it can be limited to
> > privileged applications. Userspace oom-killer is the only appropriate
> > user of this feature. This option is simple to implement.
>
> Hello Shakeel!
>
> If ordinary PAGE_SIZE and smaller kernel allocations start to fail,
> the system is already in a relatively bad shape. Arguably the userspace
> OOM killer should kick in earlier; by then it's already a bit too late.

I tend to agree here. This is how we are trying to avoid issues with
such severe memory shortages - by tuning the killer a bit more
aggressively. But a more reliable mechanism would definitely be an
improvement.

> Allowing it to use reserves just pushes this even further, so we're
> risking kernel stability for no good reason.
>
> But I agree that throttling the oom daemon in direct reclaim makes no sense.
> I wonder if we can introduce a per-task flag which will exclude the task
> from throttling; instead, all (large) allocations will just fail more
> easily under significant memory pressure. In this case, if there is a
> significant memory shortage the oom daemon will not be fully functional
> (it will get -ENOMEM for an attempt to read some stats, for example),
> but it will still be able to kill some processes and make forward
> progress.

This sounds like a good idea to me.

> But maybe it can be done in userspace too: by splitting the daemon into
> a core and an extended part and avoiding doing anything beyond the bare
> minimum in the core part.
>
> >
> > 2. Mempool
> >
> > The idea is to preallocate mempool with a given amount of memory for
> > userspace oom-killer. Preferably this will be per-thread and
> > oom-killer can preallocate mempool for its specific threads. The core
> > page allocator can check before going to the reclaim path if the task
> > has private access to the mempool and return page from it if yes.

Re: [RFC] memory reserve for userspace oom-killer

2021-04-20 Thread Roman Gushchin
On Mon, Apr 19, 2021 at 06:44:02PM -0700, Shakeel Butt wrote:
> Proposal: Provide memory guarantees to userspace oom-killer.
> 
> Background:
> 
> Issues with kernel oom-killer:
> 1. Very conservative and prefer to reclaim. Applications can suffer
> for a long time.
> 2. Borrows the context of the allocator which can be resource limited
> (low sched priority or limited CPU quota).
> 3. Serialized by global lock.
> 4. Very simplistic oom victim selection policy.
> 
> These issues are resolved through userspace oom-killer by:
> 1. Ability to monitor arbitrary metrics (PSI, vmstat, memcg stats) to
> early detect suffering.
> 2. Independent process context which can be given dedicated CPU quota
> and high scheduling priority.
> 3. Can be more aggressive as required.
> 4. Can implement sophisticated business logic/policies.
> 
> Android's LMKD and Facebook's oomd are the prime examples of userspace
> oom-killers. One of the biggest challenges for userspace oom-killers
> is that they may have to function under intense memory pressure and
> are prone to getting stuck in memory reclaim themselves. Current
> userspace oom-killers aim to avoid this situation by preallocating
> user memory and protecting themselves from global reclaim by either
> mlocking or memory.min. However, a new allocation from the userspace
> oom-killer can still get stuck in reclaim, and policy-rich oom-killers
> do trigger new allocations through syscalls or even the heap.
> 
> Our attempt at a userspace oom-killer faces similar challenges.
> Particularly at the tail on very highly utilized machines, we have
> observed the userspace oom-killer failing spectacularly in many
> possible ways in direct reclaim. We have seen the oom-killer stuck in
> direct reclaim throttling, stuck in reclaim while allocations from
> interrupts keep stealing the reclaimed memory. We have even observed
> systems where all the processes were stuck in throttle_direct_reclaim()
> and only kswapd was running, with interrupts stealing the memory it
> reclaimed.
> 
> To reliably solve this problem, we need to give guaranteed memory to
> the userspace oom-killer. At the moment we are contemplating the
> following options and I would like to get some feedback.
> 
> 1. prctl(PF_MEMALLOC)
> 
> The idea is to give userspace oom-killer (just one thread which is
> finding the appropriate victims and will be sending SIGKILLs) access
> to MEMALLOC reserves. Most of the time the preallocation, mlock and
> memory.min will be good enough but for rare occasions, when the
> userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
> protect it from reclaim and let the allocation dip into the memory
> reserves.
> 
> The misuse of this feature would be risky but it can be limited to
> privileged applications. Userspace oom-killer is the only appropriate
> user of this feature. This option is simple to implement.

Hello Shakeel!

If ordinary PAGE_SIZE and smaller kernel allocations start to fail,
the system is already in a relatively bad shape. Arguably the userspace
OOM killer should kick in earlier; by then it's already a bit too late.
Allowing it to use reserves just pushes this even further, so we're
risking kernel stability for no good reason.

But I agree that throttling the oom daemon in direct reclaim makes no sense.
I wonder if we can introduce a per-task flag which will exclude the task
from throttling; instead, all (large) allocations will just fail more
easily under significant memory pressure. In this case, if there is a
significant memory shortage the oom daemon will not be fully functional
(it will get -ENOMEM for an attempt to read some stats, for example),
but it will still be able to kill some processes and make forward
progress.
But maybe it can be done in userspace too: by splitting the daemon into
a core and an extended part and avoiding doing anything beyond the bare
minimum in the core part.

> 
> 2. Mempool
> 
> The idea is to preallocate mempool with a given amount of memory for
> userspace oom-killer. Preferably this will be per-thread and
> oom-killer can preallocate mempool for its specific threads. The core
> page allocator can check before going to the reclaim path if the task
> has private access to the mempool and return page from it if yes.
> 
> This option would be more complicated than the previous option as the
> lifecycle of the page from the mempool would be more sophisticated.
> Additionally the current mempool does not handle higher order pages
> and we might need to extend it to allow such allocations. Though this
> feature might have more use-cases and it would be less risky than the
> previous option.

It looks like overkill for the oom daemon protection, but if there
are other good use cases, maybe it's a good feature to

Re: [RFC] memory reserve for userspace oom-killer

2021-04-20 Thread Shakeel Butt
On Mon, Apr 19, 2021 at 11:46 PM Michal Hocko  wrote:
>
> On Mon 19-04-21 18:44:02, Shakeel Butt wrote:
[...]
> > memory.min. However, a new allocation from the userspace oom-killer can
> > still get stuck in reclaim, and policy-rich oom-killers do trigger
> > new allocations through syscalls or even the heap.
>
> Can you be more specific please?
>

To decide when to kill, the oom-killer has to read a lot of metrics.
It has to open a lot of files to read them, and there will definitely
be new allocations involved in those operations. For example, reading
memory.stat does a page-size allocation. Similarly, to perform an
action the oom-killer may have to read the cgroup.procs file, which
again involves allocation inside the kernel.

Regarding sophisticated oom policies, I can give one example of our
cluster-level policy. For robustness, many user-facing jobs run a lot
of instances in a cluster to handle failures. Such jobs are tolerant
to some amount of failures, but they still have requirements to not
let the number of running instances drop below some threshold.
Normally killing such jobs is fine, but we do want to make sure that
we do not violate their cluster-level agreement. So, the userspace
oom-killer may dynamically need to confirm whether such a job can be
killed.

[...]
> > To reliably solve this problem, we need to give guaranteed memory to
> > the userspace oom-killer.
>
> There is nothing like that. Even memory reserves are a finite resource
> which can be consumed, as those reserves are shared with other users
> who are not necessarily coordinated. So before we start discussing
> making this even more muddy by handing over memory reserves to the
> userspace, we should really examine whether pre-allocation is something
> that will not work.
>

We actually explored whether we could restrict the oom-killer to
syscalls which do not do memory allocations. We concluded that is
not practical and not maintainable. Whatever list we can come up
with will be outdated soon. In addition, converting all the must-have
syscalls to not do allocations is not possible/practical.

> > At the moment we are contemplating the following options and I would
> > like to get some feedback.
> >
> > 1. prctl(PF_MEMALLOC)
> >
> > The idea is to give userspace oom-killer (just one thread which is
> > finding the appropriate victims and will be sending SIGKILLs) access
> > to MEMALLOC reserves. Most of the time the preallocation, mlock and
> > memory.min will be good enough but for rare occasions, when the
> > userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
> > protect it from reclaim and let the allocation dip into the memory
> > reserves.
>
> I do not think that handing over an unlimited ticket to the memory
> reserves to userspace is a good idea. Even the in kernel oom killer is
> bound to a partial access to reserves. So if we really want this then
> it should be in sync with and bound by the ALLOC_OOM.
>

Makes sense.

> > The misuse of this feature would be risky but it can be limited to
> > privileged applications. Userspace oom-killer is the only appropriate
> > user of this feature. This option is simple to implement.
> >
> > 2. Mempool
> >
> > The idea is to preallocate mempool with a given amount of memory for
> > userspace oom-killer. Preferably this will be per-thread and
> > oom-killer can preallocate mempool for its specific threads. The core
> > page allocator can check before going to the reclaim path if the task
> > has private access to the mempool and return page from it if yes.
>
> Could you elaborate some more on how this would be controlled from the
> userspace? A dedicated syscall? A driver?
>

I was thinking of simply prctl(SET_MEMPOOL, bytes) to assign a mempool
to a thread (not shared between threads) and prctl(RESET_MEMPOOL) to
free the mempool.

> > This option would be more complicated than the previous option as the
> > lifecycle of the page from the mempool would be more sophisticated.
> > Additionally the current mempool does not handle higher order pages
> > and we might need to extend it to allow such allocations. Though this
> > feature might have more use-cases and it would be less risky than the
> > previous option.
>
> I would tend to agree.
>
> > Another idea I had was to use kthread based oom-killer and provide the
> > policies through eBPF program. Though I am not sure how to make it
> > monitor arbitrary metrics and if that can be done without any
> > allocations.
>
> A kernel module or eBPF to implement oom decisions has already been
> discussed a few years back. But I am afraid this would be hard to wire in
> for anything except for the victim selection. I am not sure it is
> maintainable to also control when the OOM handling should trigger.

Re: [RFC] memory reserve for userspace oom-killer

2021-04-20 Thread Michal Hocko
On Mon 19-04-21 18:44:02, Shakeel Butt wrote:
> Proposal: Provide memory guarantees to userspace oom-killer.
> 
> Background:
> 
> Issues with kernel oom-killer:
> 1. Very conservative and prefer to reclaim. Applications can suffer
> for a long time.
> 2. Borrows the context of the allocator which can be resource limited
> (low sched priority or limited CPU quota).
> 3. Serialized by global lock.
> 4. Very simplistic oom victim selection policy.
> 
> These issues are resolved through userspace oom-killer by:
> 1. Ability to monitor arbitrary metrics (PSI, vmstat, memcg stats) to
> early detect suffering.
> 2. Independent process context which can be given dedicated CPU quota
> and high scheduling priority.
> 3. Can be more aggressive as required.
> 4. Can implement sophisticated business logic/policies.
> 
> Android's LMKD and Facebook's oomd are the prime examples of userspace
> oom-killers. One of the biggest challenges for userspace oom-killers
> is that they may have to function under intense memory pressure and
> are prone to getting stuck in memory reclaim themselves. Current
> userspace oom-killers aim to avoid this situation by preallocating
> user memory and protecting themselves from global reclaim by either
> mlocking or memory.min. However, a new allocation from the userspace
> oom-killer can still get stuck in reclaim, and policy-rich oom-killers
> do trigger new allocations through syscalls or even the heap.

Can you be more specific please?

> Our attempt at a userspace oom-killer faces similar challenges.
> Particularly at the tail on very highly utilized machines, we have
> observed the userspace oom-killer failing spectacularly in many
> possible ways in direct reclaim. We have seen the oom-killer stuck in
> direct reclaim throttling, stuck in reclaim while allocations from
> interrupts keep stealing the reclaimed memory. We have even observed
> systems where all the processes were stuck in throttle_direct_reclaim()
> and only kswapd was running, with interrupts stealing the memory it
> reclaimed.
> 
> To reliably solve this problem, we need to give guaranteed memory to
> the userspace oom-killer.

There is nothing like that. Even memory reserves are a finite resource
which can be consumed, as those reserves are shared with other users
who are not necessarily coordinated. So before we start discussing
making this even more muddy by handing over memory reserves to the
userspace, we should really examine whether pre-allocation is something
that will not work.

> At the moment we are contemplating the following options and I would
> like to get some feedback.
> 
> 1. prctl(PF_MEMALLOC)
> 
> The idea is to give userspace oom-killer (just one thread which is
> finding the appropriate victims and will be sending SIGKILLs) access
> to MEMALLOC reserves. Most of the time the preallocation, mlock and
> memory.min will be good enough but for rare occasions, when the
> userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
> protect it from reclaim and let the allocation dip into the memory
> reserves.

I do not think that handing over an unlimited ticket to the memory
reserves to userspace is a good idea. Even the in kernel oom killer is
bound to a partial access to reserves. So if we really want this then
it should be in sync with and bound by the ALLOC_OOM.

> The misuse of this feature would be risky but it can be limited to
> privileged applications. Userspace oom-killer is the only appropriate
> user of this feature. This option is simple to implement.
> 
> 2. Mempool
> 
> The idea is to preallocate mempool with a given amount of memory for
> userspace oom-killer. Preferably this will be per-thread and
> oom-killer can preallocate mempool for its specific threads. The core
> page allocator can check before going to the reclaim path if the task
> has private access to the mempool and return page from it if yes.

Could you elaborate some more on how this would be controlled from the
userspace? A dedicated syscall? A driver?

> This option would be more complicated than the previous option as the
> lifecycle of the page from the mempool would be more sophisticated.
> Additionally the current mempool does not handle higher order pages
> and we might need to extend it to allow such allocations. Though this
> feature might have more use-cases and it would be less risky than the
> previous option.

I would tend to agree.

> Another idea I had was to use kthread based oom-killer and provide the
> policies through eBPF program. Though I am not sure how to make it
> monitor arbitrary metrics and if that can be done without any
> allocations.

A kernel module or eBPF to implement oom decisions has already been
discussed a few years back. But I am afraid this would be hard to wire in
for anything except for the victim selection. I am not sure it is
maintainable to also control when the OOM handling should trigger.

-- 
Michal Hocko
SUSE Labs


[RFC] memory reserve for userspace oom-killer

2021-04-19 Thread Shakeel Butt
Proposal: Provide memory guarantees to userspace oom-killer.

Background:

Issues with kernel oom-killer:
1. Very conservative and prefer to reclaim. Applications can suffer
for a long time.
2. Borrows the context of the allocator which can be resource limited
(low sched priority or limited CPU quota).
3. Serialized by global lock.
4. Very simplistic oom victim selection policy.

These issues are resolved through userspace oom-killer by:
1. Ability to monitor arbitrary metrics (PSI, vmstat, memcg stats) to
early detect suffering.
2. Independent process context which can be given dedicated CPU quota
and high scheduling priority.
3. Can be more aggressive as required.
4. Can implement sophisticated business logic/policies.

Android's LMKD and Facebook's oomd are the prime examples of userspace
oom-killers. One of the biggest challenges for userspace oom-killers
is that they may have to function under intense memory pressure and
are prone to getting stuck in memory reclaim themselves. Current
userspace oom-killers aim to avoid this situation by preallocating
user memory and protecting themselves from global reclaim by either
mlocking or memory.min. However, a new allocation from the userspace
oom-killer can still get stuck in reclaim, and policy-rich oom-killers
do trigger new allocations through syscalls or even the heap.

Our attempt at a userspace oom-killer faces similar challenges.
Particularly at the tail on very highly utilized machines, we have
observed the userspace oom-killer failing spectacularly in many
possible ways in direct reclaim. We have seen the oom-killer stuck in
direct reclaim throttling, stuck in reclaim while allocations from
interrupts keep stealing the reclaimed memory. We have even observed
systems where all the processes were stuck in throttle_direct_reclaim()
and only kswapd was running, with interrupts stealing the memory it
reclaimed.

To reliably solve this problem, we need to give guaranteed memory to
the userspace oom-killer. At the moment we are contemplating the
following options and I would like to get some feedback.

1. prctl(PF_MEMALLOC)

The idea is to give userspace oom-killer (just one thread which is
finding the appropriate victims and will be sending SIGKILLs) access
to MEMALLOC reserves. Most of the time the preallocation, mlock and
memory.min will be good enough but for rare occasions, when the
userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
protect it from reclaim and let the allocation dip into the memory
reserves.

The misuse of this feature would be risky but it can be limited to
privileged applications. Userspace oom-killer is the only appropriate
user of this feature. This option is simple to implement.

2. Mempool

The idea is to preallocate mempool with a given amount of memory for
userspace oom-killer. Preferably this will be per-thread and
oom-killer can preallocate mempool for its specific threads. The core
page allocator can check before going to the reclaim path if the task
has private access to the mempool and return page from it if yes.

This option would be more complicated than the previous option as the
lifecycle of the page from the mempool would be more sophisticated.
Additionally the current mempool does not handle higher order pages
and we might need to extend it to allow such allocations. Though this
feature might have more use-cases and it would be less risky than the
previous option.

Another idea I had was to use kthread based oom-killer and provide the
policies through eBPF program. Though I am not sure how to make it
monitor arbitrary metrics and if that can be done without any
allocations.

Please do provide feedback on these approaches.

thanks,
Shakeel


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-06-18 Thread Yafang Shao
On Thu, Jun 18, 2020 at 8:37 PM Chris Down  wrote:
>
> Yafang Shao writes:
> >On Thu, Jun 18, 2020 at 5:09 AM Chris Down  wrote:
> >>
> >> Naresh Kamboju writes:
> >> >After this patch applied the reported issue got fixed.
> >>
> >> Great! Thank you Naresh and Michal for helping to get to the bottom of 
> >> this :-)
> >>
> >> I'll send out a new version tomorrow with the fixes applied and both of you
> >> credited in the changelog for the detection and fix.
> >
> >As we have already found that the usage around memory.{emin, elow} has
> >many limitations, I think memory.{emin, elow} should be used internally
> >within the memcg tree only; that is, they can only be used to calculate
> >the protection of a memcg in a specified memcg tree and should not be
> >exposed to other MM parts.
>
> I agree that the current semantics are mentally taxing and we should generally
> avoid exposing the implementation details outside of memcg where possible. Do
> you have a suggested rework? :-)

Keeping mem_cgroup_protected() as-is is my suggestion. Anyway, I
think it is bad to scatter memory.{emin, elow} here and there.
If we don't have any better idea by now, just putting all the
references to memory.{emin, elow} into one
wrapper (mem_cgroup_protected()) is the reasonable solution.

-- 
Thanks
Yafang


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-06-18 Thread Chris Down

Michal Hocko writes:

I would really prefer to do that work on top of the fixes we (used to)
have in mmotm (with the fixup).


Oh, for sure. We should reintroduce the patches with the fix, and then look at 
longer-term solutions once that's in :-)


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-06-18 Thread Michal Hocko
On Thu 18-06-20 13:37:43, Chris Down wrote:
> Yafang Shao writes:
> > On Thu, Jun 18, 2020 at 5:09 AM Chris Down  wrote:
> > > 
> > > Naresh Kamboju writes:
> > > >After this patch applied the reported issue got fixed.
> > > 
> > > Great! Thank you Naresh and Michal for helping to get to the bottom of 
> > > this :-)
> > > 
> > > I'll send out a new version tomorrow with the fixes applied and both of 
> > > you
> > > credited in the changelog for the detection and fix.
> > 
> > As we have already found that the usage around memory.{emin, elow} has
> > many limitations, I think memory.{emin, elow} should be used internally
> > within the memcg tree only; that is, they can only be used to calculate
> > the protection of a memcg in a specified memcg tree and should not be
> > exposed to other MM parts.
> 
> I agree that the current semantics are mentally taxing and we should
> generally avoid exposing the implementation details outside of memcg where
> possible. Do you have a suggested rework? :-)

I would really prefer to do that work on top of the fixes we (used to)
have in mmotm (with the fixup).
-- 
Michal Hocko
SUSE Labs


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-06-18 Thread Chris Down

Yafang Shao writes:

On Thu, Jun 18, 2020 at 5:09 AM Chris Down  wrote:


Naresh Kamboju writes:
>After this patch applied the reported issue got fixed.

Great! Thank you Naresh and Michal for helping to get to the bottom of this :-)

I'll send out a new version tomorrow with the fixes applied and both of you
credited in the changelog for the detection and fix.


As we have already found that the usage around memory.{emin, elow} has
many limitations, I think memory.{emin, elow} should be used internally
within the memcg tree only; that is, they can only be used to calculate
the protection of a memcg in a specified memcg tree and should not be
exposed to other MM parts.


I agree that the current semantics are mentally taxing and we should generally 
avoid exposing the implementation details outside of memcg where possible. Do 
you have a suggested rework? :-)


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-06-17 Thread Yafang Shao
On Thu, Jun 18, 2020 at 5:09 AM Chris Down  wrote:
>
> Naresh Kamboju writes:
> >After this patch applied the reported issue got fixed.
>
> Great! Thank you Naresh and Michal for helping to get to the bottom of this 
> :-)
>
> I'll send out a new version tomorrow with the fixes applied and both of you
> credited in the changelog for the detection and fix.

As we have already found that the usage around memory.{emin, elow} has
many limitations, I think memory.{emin, elow} should be used internally
within the memcg tree only; that is, they can only be used to calculate
the protection of a memcg in a specified memcg tree and should not be
exposed to other MM parts.

-- 
Thanks
Yafang


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-06-17 Thread Chris Down

Naresh Kamboju writes:

After this patch applied the reported issue got fixed.


Great! Thank you Naresh and Michal for helping to get to the bottom of this :-)

I'll send out a new version tomorrow with the fixes applied and both of you 
credited in the changelog for the detection and fix.


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-06-17 Thread Naresh Kamboju
On Wed, 17 Jun 2020 at 21:36, Michal Hocko  wrote:
>
> On Wed 17-06-20 21:23:05, Naresh Kamboju wrote:
> > On Wed, 17 Jun 2020 at 19:41, Michal Hocko  wrote:
> > >
> > > [Our emails have crossed]
> > >
> > > On Wed 17-06-20 14:57:58, Chris Down wrote:
> > > > Naresh Kamboju writes:
> > > > > mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8RQK14KF6XF
> > > > > mke2fs 1.43.8 (1-Jan-2018)
> > > > > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > > > > Filesystem UUID: 7c380766-0ed8-41ba-a0de-3c08e78f1891
> > > > > Superblock backups stored on blocks:
> > > > > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
> > > > > 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616,
> > > > > 78675968, 102400000, 214990848
> > > > > Allocating group tables: 0/7453 done
> > > > > Writing inode tables: 0/7453 done
> > > > > Creating journal (262144 blocks): [   51.544525] under min:0 emin:0
> > > > > [   51.845304] under min:0 emin:0
> > > > > [   51.848738] under min:0 emin:0
> > > > > [   51.858147] under min:0 emin:0
> > > > > [   51.861333] under min:0 emin:0
> > > > > [   51.862034] under min:0 emin:0
> > > > > [   51.862442] under min:0 emin:0
> > > > > [   51.862763] under min:0 emin:0
> > > >
> > > > Thanks, this helps a lot. Somehow we're entering mem_cgroup_below_min 
> > > > even
> > > > when min/emin is 0 (which should indeed be the case if you haven't set 
> > > > them
> > > > in the hierarchy).
> > > >
> > > > My guess is that page_counter_read(&memcg->memory) is 0, which means
> > > > mem_cgroup_below_min will return 1.
> > >
> > > Yes this is the case because this is likely the root memcg which skips
> > > all charges.
> > >
> > > > However, I don't know for sure why that should then result in the OOM 
> > > > killer
> > > > coming along. My guess is that since this memcg has 0 pages to scan 
> > > > anyway,
> > > > we enter premature OOM under some conditions. I don't know why we 
> > > > wouldn't
> > > > have hit that with the old version of mem_cgroup_protected that returned
> > > > MEMCG_PROT_* members, though.
> > >
> > > Not really. There is likely no other memcg to reclaim from and assuming
> > > min limit protection will result in no reclaimable memory and thus the
> > > OOM killer.
> > >
> > > > Can you please try the patch with the `>=` checks in 
> > > > mem_cgroup_below_min
> > > > and mem_cgroup_below_low changed to `>`? If that fixes it, then that 
> > > > gives a
> > > > strong hint about what's going on here.
> > >
> > > This would work but I believe an explicit check for the root memcg would
> > > be easier to spot the reasoning.
> >
> > May I request you to send debugging or proposed fix patches here.
> > I am happy to do more testing.
>
> Sure, here is the diff to test.
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index c74a8f2323f1..6b5a31672fbe 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -392,6 +392,13 @@ static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
>  	if (mem_cgroup_disabled())
>  		return false;
>  
> +	/*
> +	 * Root memcg doesn't account charges and doesn't support
> +	 * protection
> +	 */
> +	if (mem_cgroup_is_root(memcg))
> +		return false;
> +
>  	return READ_ONCE(memcg->memory.elow) >=
>  		page_counter_read(&memcg->memory);
>  }
> @@ -401,6 +408,13 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
>  	if (mem_cgroup_disabled())
>  		return false;
>  
> +	/*
> +	 * Root memcg doesn't account charges and doesn't support
> +	 * protection
> +	 */
> +	if (mem_cgroup_is_root(memcg))
> +		return false;
> +
>  	return READ_ONCE(memcg->memory.emin) >=
>  		page_counter_read(&memcg->memory);
>  }


After this patch applied the reported issue got fixed.

test log link,
https://lkft.validation.linaro.org/scheduler/job/1505417#L1429

- Naresh


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-06-17 Thread Michal Hocko
On Wed 17-06-20 21:23:05, Naresh Kamboju wrote:
> On Wed, 17 Jun 2020 at 19:41, Michal Hocko  wrote:
> >
> > [Our emails have crossed]
> >
> > On Wed 17-06-20 14:57:58, Chris Down wrote:
> > > Naresh Kamboju writes:
> > > > mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8RQK14KF6XF
> > > > mke2fs 1.43.8 (1-Jan-2018)
> > > > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > > > Filesystem UUID: 7c380766-0ed8-41ba-a0de-3c08e78f1891
> > > > Superblock backups stored on blocks:
> > > > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > > > 4096000, 7962624, 11239424, 2048, 23887872, 71663616, 78675968,
> > > > 10240, 214990848
> > > > Allocating group tables:0/7453 done
> > > > Writing inode tables:0/7453 done
> > > > Creating journal (262144 blocks): [   51.544525] under min:0 emin:0
> > > > [   51.845304] under min:0 emin:0
> > > > [   51.848738] under min:0 emin:0
> > > > [   51.858147] under min:0 emin:0
> > > > [   51.861333] under min:0 emin:0
> > > > [   51.862034] under min:0 emin:0
> > > > [   51.862442] under min:0 emin:0
> > > > [   51.862763] under min:0 emin:0
> > >
> > > Thanks, this helps a lot. Somehow we're entering mem_cgroup_below_min even
> > > when min/emin is 0 (which should indeed be the case if you haven't set 
> > > them
> > > in the hierarchy).
> > >
> > > My guess is that page_counter_read(&memcg->memory) is 0, which means
> > > mem_cgroup_below_min will return 1.
> >
> > Yes this is the case because this is likely the root memcg which skips
> > all charges.
> >
> > > However, I don't know for sure why that should then result in the OOM 
> > > killer
> > > coming along. My guess is that since this memcg has 0 pages to scan 
> > > anyway,
> > > we enter premature OOM under some conditions. I don't know why we wouldn't
> > > have hit that with the old version of mem_cgroup_protected that returned
> > > MEMCG_PROT_* members, though.
> >
> > Not really. There is likely no other memcg to reclaim from and assuming
> > min limit protection will result in no reclaimable memory and thus the
> > OOM killer.
> >
> > > Can you please try the patch with the `>=` checks in mem_cgroup_below_min
> > > and mem_cgroup_below_low changed to `>`? If that fixes it, then that 
> > > gives a
> > > strong hint about what's going on here.
> >
> > This would work but I believe an explicit check for the root memcg would
> > be easier to spot the reasoning.
> 
> May I request you to send debugging or proposed fix patches here.
> I am happy to do more testing.

Sure, here is the diff to test.

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c74a8f2323f1..6b5a31672fbe 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -392,6 +392,13 @@ static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
 	if (mem_cgroup_disabled())
 		return false;
 
+	/*
+	 * Root memcg doesn't account charges and doesn't support
+	 * protection
+	 */
+	if (mem_cgroup_is_root(memcg))
+		return false;
+
 	return READ_ONCE(memcg->memory.elow) >=
 		page_counter_read(&memcg->memory);
 }
@@ -401,6 +408,13 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
 	if (mem_cgroup_disabled())
 		return false;
 
+	/*
+	 * Root memcg doesn't account charges and doesn't support
+	 * protection
+	 */
+	if (mem_cgroup_is_root(memcg))
+		return false;
+
 	return READ_ONCE(memcg->memory.emin) >=
 		page_counter_read(&memcg->memory);
 }
-- 
Michal Hocko
SUSE Labs


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-06-17 Thread Naresh Kamboju
On Wed, 17 Jun 2020 at 19:41, Michal Hocko  wrote:
>
> [Our emails have crossed]
>
> On Wed 17-06-20 14:57:58, Chris Down wrote:
> > Naresh Kamboju writes:
> > > mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8RQK14KF6XF
> > > mke2fs 1.43.8 (1-Jan-2018)
> > > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > > Filesystem UUID: 7c380766-0ed8-41ba-a0de-3c08e78f1891
> > > Superblock backups stored on blocks:
> > > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > > 4096000, 7962624, 11239424, 2048, 23887872, 71663616, 78675968,
> > > 10240, 214990848
> > > Allocating group tables:0/7453 done
> > > Writing inode tables:0/7453 done
> > > Creating journal (262144 blocks): [   51.544525] under min:0 emin:0
> > > [   51.845304] under min:0 emin:0
> > > [   51.848738] under min:0 emin:0
> > > [   51.858147] under min:0 emin:0
> > > [   51.861333] under min:0 emin:0
> > > [   51.862034] under min:0 emin:0
> > > [   51.862442] under min:0 emin:0
> > > [   51.862763] under min:0 emin:0
> >
> > Thanks, this helps a lot. Somehow we're entering mem_cgroup_below_min even
> > when min/emin is 0 (which should indeed be the case if you haven't set them
> > in the hierarchy).
> >
> > My guess is that page_counter_read(&memcg->memory) is 0, which means
> > mem_cgroup_below_min will return 1.
>
> Yes this is the case because this is likely the root memcg which skips
> all charges.
>
> > However, I don't know for sure why that should then result in the OOM killer
> > coming along. My guess is that since this memcg has 0 pages to scan anyway,
> > we enter premature OOM under some conditions. I don't know why we wouldn't
> > have hit that with the old version of mem_cgroup_protected that returned
> > MEMCG_PROT_* members, though.
>
> Not really. There is likely no other memcg to reclaim from and assuming
> min limit protection will result in no reclaimable memory and thus the
> OOM killer.
>
> > Can you please try the patch with the `>=` checks in mem_cgroup_below_min
> > and mem_cgroup_below_low changed to `>`? If that fixes it, then that gives a
> > strong hint about what's going on here.
>
> This would work but I believe an explicit check for the root memcg would
> be easier to spot the reasoning.

May I request you to send debugging or proposed fix patches here.
I am happy to do more testing.

FYI,
Here is my repository for testing.
git: https://github.com/nareshkamboju/linux/tree/printk
branch: printk

- Naresh


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-06-17 Thread Michal Hocko
[Our emails have crossed]

On Wed 17-06-20 14:57:58, Chris Down wrote:
> Naresh Kamboju writes:
> > mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8RQK14KF6XF
> > mke2fs 1.43.8 (1-Jan-2018)
> > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > Filesystem UUID: 7c380766-0ed8-41ba-a0de-3c08e78f1891
> > Superblock backups stored on blocks:
> > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > 4096000, 7962624, 11239424, 2048, 23887872, 71663616, 78675968,
> > 10240, 214990848
> > Allocating group tables:0/7453 done
> > Writing inode tables:0/7453 done
> > Creating journal (262144 blocks): [   51.544525] under min:0 emin:0
> > [   51.845304] under min:0 emin:0
> > [   51.848738] under min:0 emin:0
> > [   51.858147] under min:0 emin:0
> > [   51.861333] under min:0 emin:0
> > [   51.862034] under min:0 emin:0
> > [   51.862442] under min:0 emin:0
> > [   51.862763] under min:0 emin:0
> 
> Thanks, this helps a lot. Somehow we're entering mem_cgroup_below_min even
> when min/emin is 0 (which should indeed be the case if you haven't set them
> in the hierarchy).
> 
> My guess is that page_counter_read(&memcg->memory) is 0, which means
> mem_cgroup_below_min will return 1.

Yes this is the case because this is likely the root memcg which skips
all charges.

> However, I don't know for sure why that should then result in the OOM killer
> coming along. My guess is that since this memcg has 0 pages to scan anyway,
> we enter premature OOM under some conditions. I don't know why we wouldn't
> have hit that with the old version of mem_cgroup_protected that returned
> MEMCG_PROT_* members, though.

Not really. There is likely no other memcg to reclaim from and assuming
min limit protection will result in no reclaimable memory and thus the
OOM killer.

> Can you please try the patch with the `>=` checks in mem_cgroup_below_min
> and mem_cgroup_below_low changed to `>`? If that fixes it, then that gives a
> strong hint about what's going on here.

This would work but I believe an explicit check for the root memcg would
be easier to spot the reasoning.

-- 
Michal Hocko
SUSE Labs


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-06-17 Thread Chris Down

Michal Hocko writes:

and it makes some sense. Except for the root memcg where we do not
account any memory. Adding if (mem_cgroup_is_root(memcg)) return false;
should do the trick. The same is the case for mem_cgroup_below_low.
Could you give it a try please just to confirm?


Oh, of course :-) This seems more likely than what I proposed, and would be 
great to test.
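[Editorial note] The two candidate fixes under discussion can be contrasted in a small userspace model. The struct and function names below are simplified stand-ins for the kernel's mem_cgroup and mem_cgroup_below_min(), not the real API:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for struct mem_cgroup. */
struct memcg_model {
	bool is_root;
	unsigned long emin;   /* effective min protection */
	unsigned long usage;  /* what page_counter_read(&memcg->memory) returns */
};

/* Current check: with usage == 0 and emin == 0, "0 >= 0" holds, so a
 * cgroup that charges nothing (the root) is misclassified as below min. */
static bool below_min_ge(const struct memcg_model *m)
{
	return m->emin >= m->usage;
}

/* Chris's suggestion: a strict ">" makes the zero/zero case fall through. */
static bool below_min_gt(const struct memcg_model *m)
{
	return m->emin > m->usage;
}

/* Michal's suggestion: bail out explicitly for the root memcg, which
 * doesn't account charges and supports no protection. */
static bool below_min_root_check(const struct memcg_model *m)
{
	if (m->is_root)
		return false;
	return m->emin >= m->usage;
}
```

With usage and emin both 0, as the root memcg reports here, only the strict comparison or the explicit root check avoids the spurious "below min" classification; a genuinely protected cgroup is still detected by both variants.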


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-06-17 Thread Michal Hocko
On Wed 17-06-20 19:07:20, Naresh Kamboju wrote:
> On Thu, 21 May 2020 at 22:04, Michal Hocko  wrote:
> >
> > On Thu 21-05-20 11:55:16, Michal Hocko wrote:
> > > On Wed 20-05-20 20:09:06, Chris Down wrote:
> > > > Hi Naresh,
> > > >
> > > > Naresh Kamboju writes:
> > > > > As part of the investigation of this issue, LKFT teammate Anders Roxell
> > > > > bisected the problem and found the bad commit(s) that caused it.
> > > > >
> > > > > The following two patches were reverted on next-20200519, the reproduction
> > > > > steps were retested, and the mkfs -t ext4 test case passed
> > > > > (the invoked oom-killer is gone now).
> > > > >
> > > > > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > > > > protection"
> > > > >This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > > > >
> > > > > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > > > > checks"
> > > > >This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> > > >
> > > > Thanks Anders and Naresh for tracking this down and reverting.
> > > >
> > > > I'll take a look tomorrow. I don't see anything immediately obviously 
> > > > wrong
> > > > in either of those commits from a (very) cursory glance, but they should
> > > > only be taking effect if protections are set.
> > >
> > > Agreed. If memory.{low,min} is not used then the patch should be
> > > effectively a nop.
> >
> > I was staring into the code and do not see anything.  Could you give the
> > following debugging patch a try and see whether it triggers?
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index cc555903a332..df2e8df0eb71 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2404,6 +2404,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> >  			 * sc->priority further than desirable.
> >  			 */
> >  			scan = max(scan, SWAP_CLUSTER_MAX);
> > +
> > +			trace_printk("scan:%lu protection:%lu\n", scan, protection);
> >  		} else {
> >  			scan = lruvec_size;
> >  		}
> > @@ -2648,6 +2650,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> >  		mem_cgroup_calculate_protection(target_memcg, memcg);
> >  
> >  		if (mem_cgroup_below_min(memcg)) {
> > +			trace_printk("under min:%lu emin:%lu\n", memcg->memory.min, memcg->memory.emin);
> >  			/*
> >  			 * Hard protection.
> >  			 * If there is no reclaimable memory, OOM.
> > @@ -2660,6 +2663,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
> >  			 * there is an unprotected supply
> >  			 * of reclaimable memory from other cgroups.
> >  			 */
> > +			trace_printk("under low:%lu elow:%lu\n", memcg->memory.low, memcg->memory.elow);
> >  			if (!sc->memcg_low_reclaim) {
> >  				sc->memcg_low_skipped = 1;
> >  				continue;
> 
> As per your suggestion on debugging this problem, I replaced trace_printk
> with printk in your patch, applied it on top of the problematic kernel, and
> here is the test output and link.
> 
> mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8RQK14KF6XF
> mke2fs 1.43.8 (1-Jan-2018)
> Creating filesystem with 244190646 4k blocks and 61054976 inodes
> Filesystem UUID: 7c380766-0ed8-41ba-a0de-3c08e78f1891
> Superblock backups stored on blocks:
> 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> 4096000, 7962624, 11239424, 2048, 23887872, 71663616, 78675968,
> 10240, 214990848
> Allocating group tables:0/7453 done
> Writing inode tables:0/7453 done
> Creating journal (262144 blocks): [   51.544525] under min:0 emin:0
> [   51.845304] under min:0 emin:0
> [   51.848738] under min:0 emin:0
> [   51.858147] under min:0 emin:0
> [   51.861333] under min:0 emin:0
> [   51.862034] under min:0 emin

Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-06-17 Thread Chris Down

Naresh Kamboju writes:

mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8RQK14KF6XF
mke2fs 1.43.8 (1-Jan-2018)
Creating filesystem with 244190646 4k blocks and 61054976 inodes
Filesystem UUID: 7c380766-0ed8-41ba-a0de-3c08e78f1891
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 2048, 23887872, 71663616, 78675968,
10240, 214990848
Allocating group tables:0/7453 done
Writing inode tables:0/7453 done
Creating journal (262144 blocks): [   51.544525] under min:0 emin:0
[   51.845304] under min:0 emin:0
[   51.848738] under min:0 emin:0
[   51.858147] under min:0 emin:0
[   51.861333] under min:0 emin:0
[   51.862034] under min:0 emin:0
[   51.862442] under min:0 emin:0
[   51.862763] under min:0 emin:0


Thanks, this helps a lot. Somehow we're entering mem_cgroup_below_min even when 
min/emin is 0 (which should indeed be the case if you haven't set them in the 
hierarchy).


My guess is that page_counter_read(&memcg->memory) is 0, which means
mem_cgroup_below_min will return 1.


However, I don't know for sure why that should then result in the OOM killer 
coming along. My guess is that since this memcg has 0 pages to scan anyway, we 
enter premature OOM under some conditions. I don't know why we wouldn't have 
hit that with the old version of mem_cgroup_protected that returned 
MEMCG_PROT_* members, though.


Can you please try the patch with the `>=` checks in mem_cgroup_below_min and 
mem_cgroup_below_low changed to `>`? If that fixes it, then that gives a strong 
hint about what's going on here.


Thanks for your help!


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-06-17 Thread Naresh Kamboju
On Thu, 21 May 2020 at 22:04, Michal Hocko  wrote:
>
> On Thu 21-05-20 11:55:16, Michal Hocko wrote:
> > On Wed 20-05-20 20:09:06, Chris Down wrote:
> > > Hi Naresh,
> > >
> > > Naresh Kamboju writes:
> > > > As part of the investigation of this issue, LKFT teammate Anders Roxell
> > > > bisected the problem and found the bad commit(s) that caused it.
> > > >
> > > > The following two patches were reverted on next-20200519, the reproduction
> > > > steps were retested, and the mkfs -t ext4 test case passed
> > > > (the invoked oom-killer is gone now).
> > > >
> > > > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > > > protection"
> > > >This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > > >
> > > > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > > > checks"
> > > >This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> > >
> > > Thanks Anders and Naresh for tracking this down and reverting.
> > >
> > > I'll take a look tomorrow. I don't see anything immediately obviously 
> > > wrong
> > > in either of those commits from a (very) cursory glance, but they should
> > > only be taking effect if protections are set.
> >
> > Agreed. If memory.{low,min} is not used then the patch should be
> > effectively a nop.
>
> I was staring into the code and do not see anything.  Could you give the
> following debugging patch a try and see whether it triggers?
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cc555903a332..df2e8df0eb71 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2404,6 +2404,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>  			 * sc->priority further than desirable.
>  			 */
>  			scan = max(scan, SWAP_CLUSTER_MAX);
> +
> +			trace_printk("scan:%lu protection:%lu\n", scan, protection);
>  		} else {
>  			scan = lruvec_size;
>  		}
> @@ -2648,6 +2650,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  		mem_cgroup_calculate_protection(target_memcg, memcg);
>  
>  		if (mem_cgroup_below_min(memcg)) {
> +			trace_printk("under min:%lu emin:%lu\n", memcg->memory.min, memcg->memory.emin);
>  			/*
>  			 * Hard protection.
>  			 * If there is no reclaimable memory, OOM.
> @@ -2660,6 +2663,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
>  			 * there is an unprotected supply
>  			 * of reclaimable memory from other cgroups.
>  			 */
> +			trace_printk("under low:%lu elow:%lu\n", memcg->memory.low, memcg->memory.elow);
>  			if (!sc->memcg_low_reclaim) {
>  				sc->memcg_low_skipped = 1;
>  				continue;

As per your suggestion on debugging this problem, I replaced trace_printk
with printk in your patch, applied it on top of the problematic kernel, and
here is the test output and link.

mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8RQK14KF6XF
mke2fs 1.43.8 (1-Jan-2018)
Creating filesystem with 244190646 4k blocks and 61054976 inodes
Filesystem UUID: 7c380766-0ed8-41ba-a0de-3c08e78f1891
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 2048, 23887872, 71663616, 78675968,
10240, 214990848
Allocating group tables:0/7453 done
Writing inode tables:0/7453 done
Creating journal (262144 blocks): [   51.544525] under min:0 emin:0
[   51.845304] under min:0 emin:0
[   51.848738] under min:0 emin:0
[   51.858147] under min:0 emin:0
[   51.861333] under min:0 emin:0
[   51.862034] under min:0 emin:0
[   51.862442] under min:0 emin:0
[   51.862763] under min:0 emin:0

Full test log link,
https://lkft.validation.linaro.org/scheduler/job/1497412#L1451

- Naresh
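[Editorial note] The decision points that the debug patch above instruments can be sketched as a standalone userspace model (simplified names, not the kernel's real types). It shows why a spurious "below min" matters: if every memcg in the reclaim loop is hard-protected, nothing is scanned, reclaim fails, and the OOM killer is invoked, which matches the repeated "under min:0 emin:0" lines in the log.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the protection checks in shrink_node_memcgs(). */
enum scan_decision { SKIP_MIN, SKIP_LOW, RECLAIM };

struct memcg_state {
	unsigned long emin;   /* effective min protection */
	unsigned long elow;   /* effective low protection */
	unsigned long usage;  /* charged pages */
};

static enum scan_decision protection_decision(const struct memcg_state *m,
					      bool memcg_low_reclaim)
{
	if (m->emin >= m->usage)
		return SKIP_MIN;   /* hard protection: never scanned; OOM if
				    * no other memcg has reclaimable memory */
	if (m->elow >= m->usage && !memcg_low_reclaim)
		return SKIP_LOW;   /* soft protection: skipped first, retried
				    * later with memcg_low_reclaim set */
	return RECLAIM;
}
```

On the unpatched kernel the root memcg reports usage 0 (no accounting), so emin 0 >= usage 0 classifies it SKIP_MIN even though no protection was configured.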


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-06-12 Thread Michal Hocko
On Fri 12-06-20 15:13:22, Naresh Kamboju wrote:
> On Thu, 11 Jun 2020 at 15:25, Michal Hocko  wrote:
> >
> > On Fri 29-05-20 11:49:20, Michal Hocko wrote:
> > > On Fri 29-05-20 02:56:44, Chris Down wrote:
> > > > Yafang Shao writes:
> > > Agreed. Even if e{low,min} might still have some rough edges I am
> > > completely puzzled how we could end up oom if none of the protection
> > > path triggers which the additional debugging should confirm. Maybe my
> > > debugging patch is incomplete or used incorrectly (maybe it would be
> > > easier to use printk rather than trace_printk?).
> >
> > It would be really great if we could move forward. While the fix (which
> > has been dropped from mmotm) is not super urgent I would really like to
> > understand how it could hit the observed behavior. Can we double check
> > that the debugging patch really doesn't trigger (e.g.
> > s@trace_printk@printk in the first step)?
> 
> Please suggest how I can get more debug information,
> e.g. by providing kernel debug patches and extra kernel configs.
> 
> I have applied your debug patch and tested on top of linux-next 20200612,
> but did not find any printk output while running the mkfs -t ext4 /drive test
> case.

Have you tried s@trace_printk@printk@ in the patch? AFAIK trace_printk
doesn't dump anything into the printk ring buffer. You would have to
look into the trace ring buffer.
-- 
Michal Hocko
SUSE Labs


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-06-12 Thread Naresh Kamboju
On Thu, 11 Jun 2020 at 15:25, Michal Hocko  wrote:
>
> On Fri 29-05-20 11:49:20, Michal Hocko wrote:
> > On Fri 29-05-20 02:56:44, Chris Down wrote:
> > > Yafang Shao writes:
> > Agreed. Even if e{low,min} might still have some rough edges I am
> > completely puzzled how we could end up oom if none of the protection
> > path triggers which the additional debugging should confirm. Maybe my
> > debugging patch is incomplete or used incorrectly (maybe it would be
> > easier to use printk rather than trace_printk?).
>
> It would be really great if we could move forward. While the fix (which
> has been dropped from mmotm) is not super urgent I would really like to
> understand how it could hit the observed behavior. Can we double check
> that the debugging patch really doesn't trigger (e.g.
> s@trace_printk@printk in the first step)?

Please suggest how I can get more debug information,
e.g. by providing kernel debug patches and extra kernel configs.

I have applied your debug patch and tested on top of linux-next 20200612,
but did not find any printk output while running the mkfs -t ext4 /drive test case.


> I have checked it again but
> do not see any potential code path which would be affected by the patch
> yet not trigger any output. But another pair of eyes would be really
> great.


---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b6d84326bdf2..d13ce7b02de4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2375,6 +2375,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 			 * sc->priority further than desirable.
 			 */
 			scan = max(scan, SWAP_CLUSTER_MAX);
+
+			trace_printk("scan:%lu protection:%lu\n", scan, protection);
 		} else {
 			scan = lruvec_size;
 		}
@@ -2618,6 +2620,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 
 		switch (mem_cgroup_protected(target_memcg, memcg)) {
 		case MEMCG_PROT_MIN:
+			trace_printk("under min:%lu emin:%lu\n", memcg->memory.min, memcg->memory.emin);
 			/*
 			 * Hard protection.
 			 * If there is no reclaimable memory, OOM.
@@ -2630,6 +2633,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 			 * there is an unprotected supply
 			 * of reclaimable memory from other cgroups.
 			 */
+			trace_printk("under low:%lu elow:%lu\n", memcg->memory.low, memcg->memory.elow);
 			if (!sc->memcg_low_reclaim) {
 				sc->memcg_low_skipped = 1;
 				continue;
-- 
2.23.0

ref:
test output:
https://lkft.validation.linaro.org/scheduler/job/1489767#L1388

Test artifacts link (kernel / modules):
https://builds.tuxbuild.com/5rRNgQqF_wHsSRptdj4A1A/
- Naresh


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-06-11 Thread Michal Hocko
On Fri 29-05-20 11:49:20, Michal Hocko wrote:
> On Fri 29-05-20 02:56:44, Chris Down wrote:
> > Yafang Shao writes:
> > > Look at this patch[1] carefully and you will find that it introduces the
> > > same issue that I tried to fix in another patch [2]. Even sadder is that
> > > these two patches are in the same patchset. Although this issue isn't
> > > related to the issue found by Naresh, we have to ask ourselves why
> > > we always make the same mistake.
> > > One possible answer is that we always forget the lifecycle of
> > > memory.emin before we read it. memory.emin doesn't have the same
> > > lifecycle as the memcg; it really has the same lifecycle as the
> > > reclaimer. IOW, once a reclaimer begins, the protection value should
> > > be set to 0; after we traverse the memcg tree we calculate a
> > > protection value for this reclaimer, and finally it disappears after
> > > the reclaimer stops. That is why I highly suggested adding a new
> > > protection member in scan_control before.
> > 
> > I agree with you that the e{min,low} lifecycle is confusing for everyone --
> > the only thing I've not seen confirmation of is any confirmed correlation
> > with the i386 oom killer issue. If you've validated that, I'd like to see
> > the data :-)
> 
> Agreed. Even if e{low,min} might still have some rough edges I am
> completely puzzled how we could end up oom if none of the protection
> path triggers which the additional debugging should confirm. Maybe my
> debugging patch is incomplete or used incorrectly (maybe it would be
> easier to use printk rather than trace_printk?).

It would be really great if we could move forward. While the fix (which
has been dropped from mmotm) is not super urgent I would really like to
understand how it could hit the observed behavior. Can we double check
that the debugging patch really doesn't trigger (e.g.
s@trace_printk@printk in the first step)? I have checked it again but
do not see any potential code path which would be affected by the patch
yet not trigger any output. But another pair of eyes would be really
great.
-- 
Michal Hocko
SUSE Labs


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-29 Thread Michal Hocko
On Fri 29-05-20 02:56:44, Chris Down wrote:
> Yafang Shao writes:
> > Look at this patch[1] carefully and you will find that it introduces the
> > same issue that I tried to fix in another patch [2]. Even sadder is that
> > these two patches are in the same patchset. Although this issue isn't
> > related to the issue found by Naresh, we have to ask ourselves why
> > we always make the same mistake.
> > One possible answer is that we always forget the lifecycle of
> > memory.emin before we read it. memory.emin doesn't have the same
> > lifecycle as the memcg; it really has the same lifecycle as the
> > reclaimer. IOW, once a reclaimer begins, the protection value should
> > be set to 0; after we traverse the memcg tree we calculate a
> > protection value for this reclaimer, and finally it disappears after
> > the reclaimer stops. That is why I highly suggested adding a new
> > protection member in scan_control before.
> 
> I agree with you that the e{min,low} lifecycle is confusing for everyone --
> the only thing I've not seen confirmation of is any confirmed correlation
> with the i386 oom killer issue. If you've validated that, I'd like to see
> the data :-)

Agreed. Even if e{low,min} might still have some rough edges I am
completely puzzled how we could end up oom if none of the protection
path triggers which the additional debugging should confirm. Maybe my
debugging patch is incomplete or used incorrectly (maybe it would be
easier to use printk rather than trace_printk?).
-- 
Michal Hocko
SUSE Labs


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-28 Thread Chris Down

Yafang Shao writes:

Look at this patch[1] carefully and you will find that it introduces the
same issue that I tried to fix in another patch [2]. Even sadder is that
these two patches are in the same patchset. Although this issue isn't
related to the issue found by Naresh, we have to ask ourselves why
we always make the same mistake.
One possible answer is that we always forget the lifecycle of
memory.emin before we read it. memory.emin doesn't have the same
lifecycle as the memcg; it really has the same lifecycle as the
reclaimer. IOW, once a reclaimer begins, the protection value should
be set to 0; after we traverse the memcg tree we calculate a
protection value for this reclaimer, and finally it disappears after
the reclaimer stops. That is why I highly suggested adding a new
protection member in scan_control before.


I agree with you that the e{min,low} lifecycle is confusing for everyone -- the 
only thing I've not seen confirmation of is any confirmed correlation with the 
i386 oom killer issue. If you've validated that, I'd like to see the data :-)


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-28 Thread Yafang Shao
On Fri, May 29, 2020 at 12:41 AM Chris Down  wrote:
>
> Naresh Kamboju writes:
> >On Thu, 28 May 2020 at 20:33, Michal Hocko  wrote:
> >>
> >> On Fri 22-05-20 02:23:09, Naresh Kamboju wrote:
> >> > My apologies!
> >> > As per the test results history this problem started happening from
> >> > Bad : next-20200430 (still reproducible on next-20200519)
> >> > Good : next-20200429
> >> >
> >> > The git tree / tag used for testing is the linux next-20200430 tag; after
> >> > reverting the following three patches, the oom-killer problem is fixed.
> >> >
> >> > Revert "mm, memcg: avoid stale protection values when cgroup is above
> >> > protection"
> >> > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> >> > checks"
> >> > Revert 
> >> > "mm-memcg-decouple-elowmin-state-mutations-from-protection-checks-fix"
> >>
> >> The discussion has fragmented and I got lost TBH.
> >> In 
> >> http://lkml.kernel.org/r/ca+g9fyudwgzx50upd+wcsdehx9vi3hpksvbawbmgrzadb0p...@mail.gmail.com
> >> you have said that none of the added tracing output has triggered. Does
> >> this still hold? Because I still have a hard time to understand how
> >> those three patches could have the observed effects.
> >
> >On the other email thread [1] this issue is concluded.
> >
> >Yafang wrote on May 22 2020,
> >
> >Regarding the root cause, my guess is that it makes a similar mistake to
> >the one I tried to fix in the previous patch: the direct reclaimer reads a
> >stale protection value. But I don't think it is worth adding another fix.
> >The best way is to revert this commit.
>
> This isn't a conclusion, just a guess (and one I think is unlikely). For this
> to reliably happen, it implies that the same race happens the same way each
> time.


Hi Chris,

Look at this patch[1] carefully and you will find that it introduces the
same issue that I tried to fix in another patch [2]. Even sadder is that
these two patches are in the same patchset. Although this issue isn't
related to the issue found by Naresh, we have to ask ourselves why
we always make the same mistake.
One possible answer is that we always forget the lifecycle of
memory.emin before we read it. memory.emin doesn't have the same
lifecycle as the memcg; it really has the same lifecycle as the
reclaimer. IOW, once a reclaimer begins, the protection value should
be set to 0; after we traverse the memcg tree we calculate a
protection value for this reclaimer, and finally it disappears after
the reclaimer stops. That is why I highly suggested adding a new
protection member in scan_control before.

[1]. 
https://lore.kernel.org/linux-mm/20200505084127.12923-3-laoar.s...@gmail.com/
[2]. 
https://lore.kernel.org/linux-mm/20200505084127.12923-2-laoar.s...@gmail.com/

-- 
Thanks
Yafang
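[Editorial note] A hypothetical userspace sketch of the scan_control-based lifecycle Yafang describes: the effective protection lives only as long as one reclaim pass, starting at zero and being recomputed while walking the memcg tree. All names below are illustrative, and the real emin propagation is prorated and considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for a memcg with a configured memory.min. */
struct memcg_node {
	unsigned long min;
	struct memcg_node *parent;
};

/* Per-reclaimer state, modeled on the kernel's scan_control. */
struct scan_control_model {
	unsigned long protection;  /* valid only for one reclaim pass */
};

/* Called when a reclaimer starts: protection always begins at zero. */
static void reclaim_begin(struct scan_control_model *sc)
{
	sc->protection = 0;
}

/* Recomputed per memcg while walking up the tree, so a concurrent
 * reclaimer can never observe another pass's stale value. */
static void calculate_protection(struct scan_control_model *sc,
				 const struct memcg_node *memcg)
{
	unsigned long p = 0;

	for (; memcg; memcg = memcg->parent)
		if (memcg->min > p)
			p = memcg->min;  /* simplified: no proration */
	sc->protection = p;
}
```

The design point is that the value "disappears" with the reclaimer: nothing persists in the memcg between passes, which is what makes the stale-read race impossible by construction.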


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-28 Thread Chris Down

Naresh Kamboju writes:

On Thu, 28 May 2020 at 20:33, Michal Hocko  wrote:


On Fri 22-05-20 02:23:09, Naresh Kamboju wrote:
> My apologies!
> As per the test results history this problem started happening from
> Bad : next-20200430 (still reproducible on next-20200519)
> Good : next-20200429
>
> The git tree / tag used for testing is the linux next-20200430 tag; after
> reverting the following three patches, the oom-killer problem is fixed.
>
> Revert "mm, memcg: avoid stale protection values when cgroup is above
> protection"
> Revert "mm, memcg: decouple e{low,min} state mutations from protection checks"
> Revert "mm-memcg-decouple-elowmin-state-mutations-from-protection-checks-fix"

The discussion has fragmented and I got lost TBH.
In 
http://lkml.kernel.org/r/ca+g9fyudwgzx50upd+wcsdehx9vi3hpksvbawbmgrzadb0p...@mail.gmail.com
you have said that none of the added tracing output has triggered. Does
this still hold? Because I still have a hard time to understand how
those three patches could have the observed effects.


On the other email thread [1] this issue is concluded.

Yafang wrote on May 22 2020,

Regarding the root cause, my guess is that it makes a similar mistake to
the one I tried to fix in the previous patch: the direct reclaimer reads a
stale protection value. But I don't think it is worth adding another fix.
The best way is to revert this commit.


This isn't a conclusion, just a guess (and one I think is unlikely). For this 
to reliably happen, it implies that the same race happens the same way each 
time.


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-28 Thread Naresh Kamboju
On Thu, 28 May 2020 at 20:33, Michal Hocko  wrote:
>
> On Fri 22-05-20 02:23:09, Naresh Kamboju wrote:
> > My apology !
> > As per the test results history this problem started happening from
> > Bad : next-20200430 (still reproducible on next-20200519)
> > Good : next-20200429
> >
> > The git tree / tag used for testing is from linux next-20200430 tag and 
> > reverted
> > following three patches and oom-killer problem fixed.
> >
> > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > protection"
> > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > checks"
> > Revert 
> > "mm-memcg-decouple-elowmin-state-mutations-from-protection-checks-fix"
>
> The discussion has fragmented and I got lost TBH.
> In http://lkml.kernel.org/r/ca+g9fyudwgzx50upd+wcsdehx9vi3hpksvbawbmgrzadb0p...@mail.gmail.com
> you have said that none of the added tracing output has triggered. Does
> this still hold? Because I still have a hard time understanding how
> those three patches could have the observed effects.

On the other email thread [1] this issue is concluded.

Yafang wrote on May 22 2020,

Regarding the root cause, my guess is it makes a similar mistake that
I tried to fix in the previous patch that the direct reclaimer read a
stale protection value.  But I don't think it is worth adding another
fix. The best way is to revert this commit.


[1]  [PATCH v3 2/2] mm, memcg: Decouple e{low,min} state mutations
from protection checks
https://lore.kernel.org/linux-mm/caloahbarz3nsur3mcnx_kbsf8ktpjhuf2kaata7mb7ocjaj...@mail.gmail.com/

- Naresh

> --
> Michal Hocko
> SUSE Labs


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-28 Thread Michal Hocko
On Fri 22-05-20 02:23:09, Naresh Kamboju wrote:
> My apology !
> As per the test results history this problem started happening from
> Bad : next-20200430 (still reproducible on next-20200519)
> Good : next-20200429
> 
> The git tree / tag used for testing is from linux next-20200430 tag and 
> reverted
> following three patches and oom-killer problem fixed.
> 
> Revert "mm, memcg: avoid stale protection values when cgroup is above
> protection"
> Revert "mm, memcg: decouple e{low,min} state mutations from protection checks"
> Revert "mm-memcg-decouple-elowmin-state-mutations-from-protection-checks-fix"

The discussion has fragmented and I got lost TBH.
In http://lkml.kernel.org/r/ca+g9fyudwgzx50upd+wcsdehx9vi3hpksvbawbmgrzadb0p...@mail.gmail.com
you have said that none of the added tracing output has triggered. Does
this still hold? Because I still have a hard time understanding how
those three patches could have the observed effects.
-- 
Michal Hocko
SUSE Labs


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-28 Thread Michal Hocko
[Sorry for a late reply - was offline for a few days]

On Thu 21-05-20 17:58:55, Johannes Weiner wrote:
> On Thu, May 21, 2020 at 01:06:28PM -0700, Hugh Dickins wrote:
[...]
> From d9e7ed15d1c9248a3fd99e35e82437549154dac7 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner 
> Date: Thu, 21 May 2020 17:44:25 -0400
> Subject: [PATCH] mm: memcontrol: prepare swap controller setup for integration
>  fix
> 
> Fix crash with cgroup_disable=memory:
> 
> > > > > + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> > > > > mke2fs 1.43.8 (1-Jan-2018)
> > > > > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > > > > Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
> > > > > Superblock backups stored on blocks:
> > > > > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 
> > > > > 2654208,
> > > > > 4096000, 7962624, 11239424, 2048, 23887872, 71663616, 78675968,
> > > > > 10240, 214990848
> > > > > Allocating group tables:0/7453   done
> > > > > Writing inode tables:0/7453   done
> > > > > Creating journal (262144 blocks): [   35.502102] BUG: kernel NULL
> > > > > pointer dereference, address: 00c8
> > > > > [   35.508372] #PF: supervisor read access in kernel mode
> > > > > [   35.513506] #PF: error_code(0x) - not-present page
> > > > > [   35.518638] *pde = 
> > > > > [   35.521514] Oops:  [#1] SMP
> > > > > [   35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
> > > > > 5.7.0-rc6-next-20200519+ #1
> > > > > [   35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > > > > 2.2 05/23/2018
> > > > > [   35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
> 
> Swap accounting used to be implied-disabled when the cgroup controller
> was disabled. Restore that for the new cgroup_memory_noswap, so that
> we bail out of this function instead of dereferencing a NULL memcg.
> 
> Reported-by: Naresh Kamboju 
> Debugged-by: Hugh Dickins 
> Debugged-by: Michal Hocko 
> Signed-off-by: Johannes Weiner 

Yes this looks better. I hope to get to your series soon to have the
full picture finally.

> ---
>  mm/memcontrol.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3e000a316b59..e3b785d6e771 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7075,7 +7075,11 @@ static struct cftype memsw_files[] = {
>  
>  static int __init mem_cgroup_swap_init(void)
>  {
> - if (mem_cgroup_disabled() || cgroup_memory_noswap)
> + /* No memory control -> no swap control */
> + if (mem_cgroup_disabled())
> + cgroup_memory_noswap = true;
> +
> + if (cgroup_memory_noswap)
>   return 0;
>  
	WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, swap_files));
> -- 
> 2.26.2

-- 
Michal Hocko
SUSE Labs


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-21 Thread Hugh Dickins
On Thu, 21 May 2020, Johannes Weiner wrote:
> On Thu, May 21, 2020 at 01:06:28PM -0700, Hugh Dickins wrote:
> > On Thu, 21 May 2020, Johannes Weiner wrote:
> > > do_memsw_account() used to be automatically false when the cgroup
> > > controller was disabled. Now that it's replaced by
> > > cgroup_memory_noswap, for which this isn't true, make the
> > > mem_cgroup_disabled() checks explicit in the swap control API.
> > > 
> > > [han...@cmpxchg.org: use mem_cgroup_disabled() in all API functions]
> > > Reported-by: Naresh Kamboju 
> > > Debugged-by: Hugh Dickins 
> > > Debugged-by: Michal Hocko 
> > > Signed-off-by: Johannes Weiner 
> > > ---
> > >  mm/memcontrol.c | 47 +--
> > >  1 file changed, 41 insertions(+), 6 deletions(-)
> > 
> > I'm certainly not against a mem_cgroup_disabled() check in the only
> > place that's been observed to need it, as a fixup to merge into your
> > original patch; but this seems rather an over-reaction - and I'm a
> > little surprised that setting mem_cgroup_disabled() doesn't just
> > force cgroup_memory_noswap, saving repetitious checks elsewhere
> > (perhaps there's a difficulty in that, I haven't looked).
> 
> Fair enough, I changed it to set the flag at initialization time if
> mem_cgroup_disabled(). I was never a fan of the old flags, where it
> was never clear what was commandline, and what was internal runtime
> state - do_swap_account? really_do_swap_account? But I think it's
> straight-forward in this case now.
> 
> > Historically, I think we've added mem_cgroup_disabled() checks
> > (accessing a cacheline we'd rather avoid) where they're necessary,
> > rather than at every "interface".
> 
> To me that always seemed like bugs waiting to happen. Like this one!
> 
> It's a jump label nowadays, so I've been liberal with these to avoid
> subtle bugs.
> 
> > And you seem to be in a very "goto out" mood today - we all have
> > our "goto out" days, alternating with our "return 0" days :)
> 
> :-)
> 
> But I agree, best to keep this fixup self-contained and defer anything
> else to separate cleanup patches.
> 
> How about the below? It survives a swaptest with cgroup_disable=memory
> for me.

I like this version *a lot*, thank you. I got worried for a bit by
the "#define cgroup_memory_noswap 1" when #ifndef CONFIG_MEMCG_SWAP,
but now realize that fits perfectly.

> 
> Hugh, I started with your patch, which is why I kept you as the
> author, but as the patch now (and arguably the previous one) is
> sufficiently different, I dropped that now. I hope that's okay.

Absolutely okay, these are yours: I was a little uncomfortable to
see me on the From line before, but it also seemed just too petty
to insist that my name be removed.

(By the way, off-topic for this particular issue, but advance warning
that I hope to post a couple of patches to __read_swap_cache_async()
before the end of the day, first being fixup to some of your mods -
I suspect you got it working well enough, and intended to come back
to check a few details later, but never quite got around to that.)

> 
> ---
> From d9e7ed15d1c9248a3fd99e35e82437549154dac7 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner 
> Date: Thu, 21 May 2020 17:44:25 -0400
> Subject: [PATCH] mm: memcontrol: prepare swap controller setup for integration
>  fix
> 
> Fix crash with cgroup_disable=memory:
> 
> > > > > + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> > > > > mke2fs 1.43.8 (1-Jan-2018)
> > > > > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > > > > Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
> > > > > Superblock backups stored on blocks:
> > > > > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 
> > > > > 2654208,
> > > > > 4096000, 7962624, 11239424, 2048, 23887872, 71663616, 78675968,
> > > > > 10240, 214990848
> > > > > Allocating group tables:0/7453   done
> > > > > Writing inode tables:0/7453   done
> > > > > Creating journal (262144 blocks): [   35.502102] BUG: kernel NULL
> > > > > pointer dereference, address: 00c8
> > > > > [   35.508372] #PF: supervisor read access in kernel mode
> > > > > [   35.513506] #PF: error_code(0x) - not-present page
> > > > > [   35.518638] *pde = 
> > > > > [   35.521514] Oops:  [#1] SMP
> > > > > [   35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
> > > > > 5.7.0-rc6-next-20200519+ #1
> > > > > [   35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > > > > 2.2 05/23/2018
> > > > > [   35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
> 
> Swap accounting used to be implied-disabled when the cgroup controller
> was disabled. Restore that for the new cgroup_memory_noswap, so that
> we bail out of this function instead of dereferencing a NULL memcg.
> 
> Reported-by: Naresh Kamboju 
> Debugged-by: Hugh Dickins 
> Debugged-by: Michal Hocko 
> Signed-off-by: Johannes Weiner 


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-21 Thread Johannes Weiner
On Thu, May 21, 2020 at 01:06:28PM -0700, Hugh Dickins wrote:
> On Thu, 21 May 2020, Johannes Weiner wrote:
> > do_memsw_account() used to be automatically false when the cgroup
> > controller was disabled. Now that it's replaced by
> > cgroup_memory_noswap, for which this isn't true, make the
> > mem_cgroup_disabled() checks explicit in the swap control API.
> > 
> > [han...@cmpxchg.org: use mem_cgroup_disabled() in all API functions]
> > Reported-by: Naresh Kamboju 
> > Debugged-by: Hugh Dickins 
> > Debugged-by: Michal Hocko 
> > Signed-off-by: Johannes Weiner 
> > ---
> >  mm/memcontrol.c | 47 +--
> >  1 file changed, 41 insertions(+), 6 deletions(-)
> 
> I'm certainly not against a mem_cgroup_disabled() check in the only
> place that's been observed to need it, as a fixup to merge into your
> original patch; but this seems rather an over-reaction - and I'm a
> little surprised that setting mem_cgroup_disabled() doesn't just
> force cgroup_memory_noswap, saving repetitious checks elsewhere
> (perhaps there's a difficulty in that, I haven't looked).

Fair enough, I changed it to set the flag at initialization time if
mem_cgroup_disabled(). I was never a fan of the old flags, where it
was never clear what was commandline, and what was internal runtime
state - do_swap_account? really_do_swap_account? But I think it's
straight-forward in this case now.

> Historically, I think we've added mem_cgroup_disabled() checks
> (accessing a cacheline we'd rather avoid) where they're necessary,
> rather than at every "interface".

To me that always seemed like bugs waiting to happen. Like this one!

It's a jump label nowadays, so I've been liberal with these to avoid
subtle bugs.

> And you seem to be in a very "goto out" mood today - we all have
> our "goto out" days, alternating with our "return 0" days :)

:-)

But I agree, best to keep this fixup self-contained and defer anything
else to separate cleanup patches.

How about the below? It survives a swaptest with cgroup_disable=memory
for me.

Hugh, I started with your patch, which is why I kept you as the
author, but as the patch now (and arguably the previous one) is
sufficiently different, I dropped that now. I hope that's okay.

---
From d9e7ed15d1c9248a3fd99e35e82437549154dac7 Mon Sep 17 00:00:00 2001
From: Johannes Weiner 
Date: Thu, 21 May 2020 17:44:25 -0400
Subject: [PATCH] mm: memcontrol: prepare swap controller setup for integration
 fix

Fix crash with cgroup_disable=memory:

> > > > + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> > > > mke2fs 1.43.8 (1-Jan-2018)
> > > > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > > > Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
> > > > Superblock backups stored on blocks:
> > > > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > > > 4096000, 7962624, 11239424, 2048, 23887872, 71663616, 78675968,
> > > > 10240, 214990848
> > > > Allocating group tables:0/7453   done
> > > > Writing inode tables:0/7453   done
> > > > Creating journal (262144 blocks): [   35.502102] BUG: kernel NULL
> > > > pointer dereference, address: 00c8
> > > > [   35.508372] #PF: supervisor read access in kernel mode
> > > > [   35.513506] #PF: error_code(0x) - not-present page
> > > > [   35.518638] *pde = 
> > > > [   35.521514] Oops:  [#1] SMP
> > > > [   35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
> > > > 5.7.0-rc6-next-20200519+ #1
> > > > [   35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > > > 2.2 05/23/2018
> > > > [   35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60

Swap accounting used to be implied-disabled when the cgroup controller
was disabled. Restore that for the new cgroup_memory_noswap, so that
we bail out of this function instead of dereferencing a NULL memcg.

Reported-by: Naresh Kamboju 
Debugged-by: Hugh Dickins 
Debugged-by: Michal Hocko 
Signed-off-by: Johannes Weiner 
---
 mm/memcontrol.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3e000a316b59..e3b785d6e771 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7075,7 +7075,11 @@ static struct cftype memsw_files[] = {
 
 static int __init mem_cgroup_swap_init(void)
 {
-   if (mem_cgroup_disabled() || cgroup_memory_noswap)
+   /* No memory control -> no swap control */
+   if (mem_cgroup_disabled())
+   cgroup_memory_noswap = true;
+
+   if (cgroup_memory_noswap)
return 0;
 
	WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, swap_files));
-- 
2.26.2



Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-21 Thread Naresh Kamboju
My apology !
As per the test results history this problem started happening from
Bad : next-20200430 (still reproducible on next-20200519)
Good : next-20200429

The tree used for testing is the linux next-20200430 tag with the
following three patches reverted, and the oom-killer problem is fixed.

Revert "mm, memcg: avoid stale protection values when cgroup is above
protection"
Revert "mm, memcg: decouple e{low,min} state mutations from protection checks"
Revert "mm-memcg-decouple-elowmin-state-mutations-from-protection-checks-fix"

Ref tree:
https://github.com/roxell/linux/commits/my-next-20200430

Build images:
https://builds.tuxbuild.com/whyTLI1O8s5HiILwpLTLtg/

Test log:
https://lkft.validation.linaro.org/scheduler/job/1444321#L1164

- Naresh


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-21 Thread Hugh Dickins
On Thu, 21 May 2020, Johannes Weiner wrote:
> 
> Very much appreciate you guys tracking it down so quickly. Sorry about
> the breakage.
> 
> I think mem_cgroup_disabled() checks are pretty good markers of public
> entry points to the memcg API, so I'd prefer that even if a bit more
> verbose. What do you think?

An explicit mem_cgroup_disabled() check would be fine, but I must admit,
the patch below is rather too verbose for my own taste.  Your call.

> 
> ---
> From cd373ec232942a9bc43ee5e7d2171352019a58fb Mon Sep 17 00:00:00 2001
> From: Hugh Dickins 
> Date: Thu, 21 May 2020 14:58:36 -0400
> Subject: [PATCH] mm: memcontrol: prepare swap controller setup for integration
>  fix
> 
> Fix crash with cgroup_disable=memory:
> 
> > > > > + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> > > > > mke2fs 1.43.8 (1-Jan-2018)
> > > > > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > > > > Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
> > > > > Superblock backups stored on blocks:
> > > > > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 
> > > > > 2654208,
> > > > > 4096000, 7962624, 11239424, 2048, 23887872, 71663616, 78675968,
> > > > > 10240, 214990848
> > > > > Allocating group tables:0/7453   done
> > > > > Writing inode tables:0/7453   done
> > > > > Creating journal (262144 blocks): [   35.502102] BUG: kernel NULL
> > > > > pointer dereference, address: 00c8
> > > > > [   35.508372] #PF: supervisor read access in kernel mode
> > > > > [   35.513506] #PF: error_code(0x) - not-present page
> > > > > [   35.518638] *pde = 
> > > > > [   35.521514] Oops:  [#1] SMP
> > > > > [   35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
> > > > > 5.7.0-rc6-next-20200519+ #1
> > > > > [   35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > > > > 2.2 05/23/2018
> > > > > [   35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
> 
> do_memsw_account() used to be automatically false when the cgroup
> controller was disabled. Now that it's replaced by
> cgroup_memory_noswap, for which this isn't true, make the
> mem_cgroup_disabled() checks explicit in the swap control API.
> 
> [han...@cmpxchg.org: use mem_cgroup_disabled() in all API functions]
> Reported-by: Naresh Kamboju 
> Debugged-by: Hugh Dickins 
> Debugged-by: Michal Hocko 
> Signed-off-by: Johannes Weiner 
> ---
>  mm/memcontrol.c | 47 +--
>  1 file changed, 41 insertions(+), 6 deletions(-)

I'm certainly not against a mem_cgroup_disabled() check in the only
place that's been observed to need it, as a fixup to merge into your
original patch; but this seems rather an over-reaction - and I'm a
little surprised that setting mem_cgroup_disabled() doesn't just
force cgroup_memory_noswap, saving repetitious checks elsewhere
(perhaps there's a difficulty in that, I haven't looked).

Historically, I think we've added mem_cgroup_disabled() checks
(accessing a cacheline we'd rather avoid) where they're necessary,
rather than at every "interface".

And you seem to be in a very "goto out" mood today - we all have
our "goto out" days, alternating with our "return 0" days :)

Hugh

> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3e000a316b59..850bca380562 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6811,6 +6811,9 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
>   VM_BUG_ON_PAGE(PageLRU(page), page);
>   VM_BUG_ON_PAGE(page_count(page), page);
>  
> + if (mem_cgroup_disabled())
> + return;
> +
>   if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
>   return;
>  
> @@ -6876,6 +6879,10 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
>   struct mem_cgroup *memcg;
>   unsigned short oldid;
>  
> + if (mem_cgroup_disabled())
> + return 0;
> +
> + /* Only cgroup2 has swap.max */
>   if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
>   return 0;
>  
> @@ -6920,6 +6927,9 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
>   struct mem_cgroup *memcg;
>   unsigned short id;
>  
> + if (mem_cgroup_disabled())
> + return;
> +
>   id = swap_cgroup_record(entry, 0, nr_pages);
>   rcu_read_lock();
>   memcg = mem_cgroup_from_id(id);
> @@ -6940,12 +6950,25 @@ long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
>  {
>   long nr_swap_pages = get_nr_swap_pages();
>  
> - if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
> - return nr_swap_pages;
> + if (mem_cgroup_disabled())
> + goto out;
> +
> + /* Swap control disabled */
> + if (cgroup_memory_noswap)
> + goto out;
> +
> + /*
> +  * Only cgroup2 has swap.max, cgroup1 does mem+sw accounting,
> +  * which does not place 

Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-21 Thread Johannes Weiner
On Thu, May 21, 2020 at 02:44:44PM +0200, Michal Hocko wrote:
> On Thu 21-05-20 05:24:27, Hugh Dickins wrote:
> > On Thu, 21 May 2020, Michal Hocko wrote:
> > > On Thu 21-05-20 16:11:11, Naresh Kamboju wrote:
> > > > On Thu, 21 May 2020 at 15:25, Michal Hocko  wrote:
> > > > >
> > > > > On Wed 20-05-20 20:09:06, Chris Down wrote:
> > > > > > Hi Naresh,
> > > > > >
> > > > > > Naresh Kamboju writes:
> > > > > > > As a part of investigation on this issue LKFT teammate Anders 
> > > > > > > Roxell
> > > > > > > git bisected the problem and found bad commit(s) which caused 
> > > > > > > this problem.
> > > > > > >
> > > > > > > The following two patches have been reverted on next-20200519 and 
> > > > > > > retested the
> > > > > > > reproducible steps and confirmed the test case mkfs -t ext4 got 
> > > > > > > PASS.
> > > > > > > ( invoked oom-killer is gone now)
> > > > > > >
> > > > > > > Revert "mm, memcg: avoid stale protection values when cgroup is 
> > > > > > > above
> > > > > > > protection"
> > > > > > >This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > > > > > >
> > > > > > > Revert "mm, memcg: decouple e{low,min} state mutations from 
> > > > > > > protection
> > > > > > > checks"
> > > > > > >This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> > > > > >
> > > > > > Thanks Anders and Naresh for tracking this down and reverting.
> > > > > >
> > > > > > I'll take a look tomorrow. I don't see anything immediately 
> > > > > > obviously wrong
> > > > > > in either of those commits from a (very) cursory glance, but they 
> > > > > > should
> > > > > > only be taking effect if protections are set.
> > > > >
> > > > > Agreed. If memory.{low,min} is not used then the patch should be
> > > > > effectively a nop. Btw. do you see the problem when booting with
> > > > > cgroup_disable=memory kernel command line parameter?
> > > > 
> > > > With extra kernel command line parameters, cgroup_disable=memory
> > > > I have noticed a different problem now.
> > > > 
> > > > + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> > > > mke2fs 1.43.8 (1-Jan-2018)
> > > > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > > > Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
> > > > Superblock backups stored on blocks:
> > > > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > > > 4096000, 7962624, 11239424, 2048, 23887872, 71663616, 78675968,
> > > > 10240, 214990848
> > > > Allocating group tables:0/7453   done
> > > > Writing inode tables:0/7453   done
> > > > Creating journal (262144 blocks): [   35.502102] BUG: kernel NULL
> > > > pointer dereference, address: 00c8
> > > > [   35.508372] #PF: supervisor read access in kernel mode
> > > > [   35.513506] #PF: error_code(0x) - not-present page
> > > > [   35.518638] *pde = 
> > > > [   35.521514] Oops:  [#1] SMP
> > > > [   35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
> > > > 5.7.0-rc6-next-20200519+ #1
> > > > [   35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > > > 2.2 05/23/2018
> > > > [   35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
> > > 
> > > Could you get faddr2line for this offset?
> > 
> > No need for that, I can help with the "cgroup_disabled=memory" crash:
> > I've been happily running with the fixup below, but haven't got to
> > send it in yet (and wouldn't normally be reading mail at this time!)
> > because of busy chasing a couple of other bugs (not necessarily mm);
> > and maybe the fix would be better with explicit mem_cgroup_disabled()
> > test, or maybe that should be where cgroup_memory_noswap is decided -
> > up to Johannes.
> 
> Thanks Hugh. I can see what is the problem now. I was looking at the
> Linus' tree 

Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-21 Thread Naresh Kamboju
On Thu, 21 May 2020 at 22:04, Michal Hocko  wrote:
>
> On Thu 21-05-20 11:55:16, Michal Hocko wrote:
> > On Wed 20-05-20 20:09:06, Chris Down wrote:
> > > Hi Naresh,
> > >
> > > Naresh Kamboju writes:
> > > > As a part of investigation on this issue LKFT teammate Anders Roxell
> > > > git bisected the problem and found bad commit(s) which caused this 
> > > > problem.
> > > >
> > > > The following two patches have been reverted on next-20200519 and 
> > > > retested the
> > > > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > > > ( invoked oom-killer is gone now)
> > > >
> > > > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > > > protection"
> > > >This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > > >
> > > > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > > > checks"
> > > >This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> > >
> > > Thanks Anders and Naresh for tracking this down and reverting.
> > >
> > > I'll take a look tomorrow. I don't see anything immediately obviously 
> > > wrong
> > > in either of those commits from a (very) cursory glance, but they should
> > > only be taking effect if protections are set.
> >
> > Agreed. If memory.{low,min} is not used then the patch should be
> > effectively a nop.
>
> I was staring into the code and did not see anything.  Could you give the
> following debugging patch a try and see whether it triggers?

These code paths were not hit, it seems, but I still see the reported problem.
Please find the detailed test log output [1]

And
One more test log with cgroup_disable=memory [2]

Test log link,
[1] https://pastebin.com/XJU7We1g
[2] https://pastebin.com/BZ0BMUVt


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-21 Thread Michal Hocko
On Thu 21-05-20 11:55:16, Michal Hocko wrote:
> On Wed 20-05-20 20:09:06, Chris Down wrote:
> > Hi Naresh,
> > 
> > Naresh Kamboju writes:
> > > As a part of investigation on this issue LKFT teammate Anders Roxell
> > > git bisected the problem and found bad commit(s) which caused this 
> > > problem.
> > > 
> > > The following two patches have been reverted on next-20200519 and 
> > > retested the
> > > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > > ( invoked oom-killer is gone now)
> > > 
> > > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > > protection"
> > >This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > > 
> > > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > > checks"
> > >This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> > 
> > Thanks Anders and Naresh for tracking this down and reverting.
> > 
> > I'll take a look tomorrow. I don't see anything immediately obviously wrong
> > in either of those commits from a (very) cursory glance, but they should
> > only be taking effect if protections are set.
> 
> Agreed. If memory.{low,min} is not used then the patch should be
> effectively a nop.

I was staring into the code and do not see anything.  Could you give the
following debugging patch a try and see whether it triggers?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index cc555903a332..df2e8df0eb71 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2404,6 +2404,8 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 * sc->priority further than desirable.
 */
scan = max(scan, SWAP_CLUSTER_MAX);
+
+		trace_printk("scan:%lu protection:%lu\n", scan, protection);
} else {
scan = lruvec_size;
}
@@ -2648,6 +2650,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
mem_cgroup_calculate_protection(target_memcg, memcg);
 
if (mem_cgroup_below_min(memcg)) {
+			trace_printk("under min:%lu emin:%lu\n", memcg->memory.min, memcg->memory.emin);
/*
 * Hard protection.
 * If there is no reclaimable memory, OOM.
@@ -2660,6 +2663,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 * there is an unprotected supply
 * of reclaimable memory from other cgroups.
 */
+			trace_printk("under low:%lu elow:%lu\n", memcg->memory.low, memcg->memory.elow);
if (!sc->memcg_low_reclaim) {
sc->memcg_low_skipped = 1;
continue;
-- 
Michal Hocko
SUSE Labs


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-21 Thread Michal Hocko
On Thu 21-05-20 05:24:27, Hugh Dickins wrote:
> On Thu, 21 May 2020, Michal Hocko wrote:
> > On Thu 21-05-20 16:11:11, Naresh Kamboju wrote:
> > > On Thu, 21 May 2020 at 15:25, Michal Hocko  wrote:
> > > >
> > > > On Wed 20-05-20 20:09:06, Chris Down wrote:
> > > > > Hi Naresh,
> > > > >
> > > > > Naresh Kamboju writes:
> > > > > > As a part of investigation on this issue LKFT teammate Anders Roxell
> > > > > > git bisected the problem and found bad commit(s) which caused this 
> > > > > > problem.
> > > > > >
> > > > > > The following two patches have been reverted on next-20200519 and 
> > > > > > retested the
> > > > > > reproducible steps and confirmed the test case mkfs -t ext4 got 
> > > > > > PASS.
> > > > > > ( invoked oom-killer is gone now)
> > > > > >
> > > > > > Revert "mm, memcg: avoid stale protection values when cgroup is 
> > > > > > above
> > > > > > protection"
> > > > > >This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > > > > >
> > > > > > Revert "mm, memcg: decouple e{low,min} state mutations from 
> > > > > > protection
> > > > > > checks"
> > > > > >This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> > > > >
> > > > > Thanks Anders and Naresh for tracking this down and reverting.
> > > > >
> > > > > I'll take a look tomorrow. I don't see anything immediately obviously 
> > > > > wrong
> > > > > in either of those commits from a (very) cursory glance, but they 
> > > > > should
> > > > > only be taking effect if protections are set.
> > > >
> > > > Agreed. If memory.{low,min} is not used then the patch should be
> > > > effectively a nop. Btw. do you see the problem when booting with
> > > > cgroup_disable=memory kernel command line parameter?
> > > 
> > > With extra kernel command line parameters, cgroup_disable=memory
> > > I have noticed a different problem now.
> > > 
> > > + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> > > mke2fs 1.43.8 (1-Jan-2018)
> > > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > > Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
> > > Superblock backups stored on blocks:
> > > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > > 4096000, 7962624, 11239424, 2048, 23887872, 71663616, 78675968,
> > > 10240, 214990848
> > > Allocating group tables:0/7453   done
> > > Writing inode tables:0/7453   done
> > > Creating journal (262144 blocks): [   35.502102] BUG: kernel NULL
> > > pointer dereference, address: 00c8
> > > [   35.508372] #PF: supervisor read access in kernel mode
> > > [   35.513506] #PF: error_code(0x) - not-present page
> > > [   35.518638] *pde = 
> > > [   35.521514] Oops:  [#1] SMP
> > > [   35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
> > > 5.7.0-rc6-next-20200519+ #1
> > > [   35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > > 2.2 05/23/2018
> > > [   35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
> > 
> > Could you get faddr2line for this offset?
> 
> No need for that, I can help with the "cgroup_disabled=memory" crash:
> I've been happily running with the fixup below, but haven't got to
> send it in yet (and wouldn't normally be reading mail at this time!)
> because of busy chasing a couple of other bugs (not necessarily mm);
> and maybe the fix would be better with explicit mem_cgroup_disabled()
> test, or maybe that should be where cgroup_memory_noswap is decided -
> up to Johannes.

Thanks Hugh. I can see what the problem is now. I was looking at
Linus' tree, which has different code there:

long nr_swap_pages = get_nr_swap_pages();

if (!do_swap_account || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
return nr_swap_pages;

which cannot crash like this, so I was really wondering what was going
on here. But there are other changes in mmotm which I haven't reviewed
yet. Looking at the -next tree now, this is fallout from "mm:
memcontrol: prepare swap controller setup for integration".

The !memcg check is slightly more cryptic than an explicit
mem_cgroup_disabled() check, but I would just leave that to Johannes as
well.

> 
> ---
> 
>  mm/memcontrol.c |3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> --- 5.7-rc6-mm1/mm/memcontrol.c   2020-05-20 12:21:56.109693740 -0700
> +++ linux/mm/memcontrol.c 2020-05-20 12:26:15.500478753 -0700
> @@ -6954,7 +6954,8 @@ long mem_cgroup_get_nr_swap_pages(struct
>  {
>   long nr_swap_pages = get_nr_swap_pages();
>  
> - if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
> + if (!memcg || cgroup_memory_noswap ||
> +	    !cgroup_subsys_on_dfl(memory_cgrp_subsys))
>   return nr_swap_pages;
>   for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
>   nr_swap_pages = min_t(long, nr_swap_pages,

-- 
Michal Hocko
SUSE Labs


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-21 Thread Hugh Dickins
On Thu, 21 May 2020, Michal Hocko wrote:
> On Thu 21-05-20 16:11:11, Naresh Kamboju wrote:
> > On Thu, 21 May 2020 at 15:25, Michal Hocko  wrote:
> > >
> > > On Wed 20-05-20 20:09:06, Chris Down wrote:
> > > > Hi Naresh,
> > > >
> > > > Naresh Kamboju writes:
> > > > > As a part of investigation on this issue LKFT teammate Anders Roxell
> > > > > git bisected the problem and found bad commit(s) which caused this 
> > > > > problem.
> > > > >
> > > > > The following two patches have been reverted on next-20200519 and 
> > > > > retested the
> > > > > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > > > > ( invoked oom-killer is gone now)
> > > > >
> > > > > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > > > > protection"
> > > > >This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > > > >
> > > > > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > > > > checks"
> > > > >This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> > > >
> > > > Thanks Anders and Naresh for tracking this down and reverting.
> > > >
> > > > I'll take a look tomorrow. I don't see anything immediately obviously 
> > > > wrong
> > > > in either of those commits from a (very) cursory glance, but they should
> > > > only be taking effect if protections are set.
> > >
> > > Agreed. If memory.{low,min} is not used then the patch should be
> > > effectively a nop. Btw. do you see the problem when booting with
> > > cgroup_disable=memory kernel command line parameter?
> > 
> > With the extra kernel command line parameter cgroup_disable=memory,
> > I have noticed a different problem now.
> > 
> > + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> > mke2fs 1.43.8 (1-Jan-2018)
> > Creating filesystem with 244190646 4k blocks and 61054976 inodes
> > Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
> > Superblock backups stored on blocks:
> > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> > 4096000, 7962624, 11239424, 2048, 23887872, 71663616, 78675968,
> > 10240, 214990848
> > Allocating group tables:0/7453   done
> > Writing inode tables:0/7453   done
> > Creating journal (262144 blocks): [   35.502102] BUG: kernel NULL
> > pointer dereference, address: 00c8
> > [   35.508372] #PF: supervisor read access in kernel mode
> > [   35.513506] #PF: error_code(0x) - not-present page
> > [   35.518638] *pde = 
> > [   35.521514] Oops:  [#1] SMP
> > [   35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
> > 5.7.0-rc6-next-20200519+ #1
> > [   35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > 2.2 05/23/2018
> > [   35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
> 
> Could you get faddr2line for this offset?

No need for that, I can help with the "cgroup_disable=memory" crash:
I've been happily running with the fixup below, but haven't got to
send it in yet (and wouldn't normally be reading mail at this time!)
because I've been busy chasing a couple of other bugs (not necessarily mm);
and maybe the fix would be better with an explicit mem_cgroup_disabled()
test, or maybe that should be done where cgroup_memory_noswap is decided -
up to Johannes.

---

 mm/memcontrol.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- 5.7-rc6-mm1/mm/memcontrol.c 2020-05-20 12:21:56.109693740 -0700
+++ linux/mm/memcontrol.c   2020-05-20 12:26:15.500478753 -0700
@@ -6954,7 +6954,8 @@ long mem_cgroup_get_nr_swap_pages(struct
 {
long nr_swap_pages = get_nr_swap_pages();
 
-   if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
+   if (!memcg || cgroup_memory_noswap ||
+	    !cgroup_subsys_on_dfl(memory_cgrp_subsys))
return nr_swap_pages;
for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
nr_swap_pages = min_t(long, nr_swap_pages,


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-21 Thread Michal Hocko
On Thu 21-05-20 16:11:11, Naresh Kamboju wrote:
> On Thu, 21 May 2020 at 15:25, Michal Hocko  wrote:
> >
> > On Wed 20-05-20 20:09:06, Chris Down wrote:
> > > Hi Naresh,
> > >
> > > Naresh Kamboju writes:
> > > > As a part of investigation on this issue LKFT teammate Anders Roxell
> > > > git bisected the problem and found bad commit(s) which caused this 
> > > > problem.
> > > >
> > > > The following two patches have been reverted on next-20200519 and 
> > > > retested the
> > > > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > > > ( invoked oom-killer is gone now)
> > > >
> > > > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > > > protection"
> > > >This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > > >
> > > > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > > > checks"
> > > >This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> > >
> > > Thanks Anders and Naresh for tracking this down and reverting.
> > >
> > > I'll take a look tomorrow. I don't see anything immediately obviously 
> > > wrong
> > > in either of those commits from a (very) cursory glance, but they should
> > > only be taking effect if protections are set.
> >
> > Agreed. If memory.{low,min} is not used then the patch should be
> > effectively a nop. Btw. do you see the problem when booting with
> > cgroup_disable=memory kernel command line parameter?
> 
> With the extra kernel command line parameter cgroup_disable=memory,
> I have noticed a different problem now.
> 
> + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> mke2fs 1.43.8 (1-Jan-2018)
> Creating filesystem with 244190646 4k blocks and 61054976 inodes
> Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
> Superblock backups stored on blocks:
> 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> 4096000, 7962624, 11239424, 2048, 23887872, 71663616, 78675968,
> 10240, 214990848
> Allocating group tables:0/7453   done
> Writing inode tables:0/7453   done
> Creating journal (262144 blocks): [   35.502102] BUG: kernel NULL
> pointer dereference, address: 00c8
> [   35.508372] #PF: supervisor read access in kernel mode
> [   35.513506] #PF: error_code(0x) - not-present page
> [   35.518638] *pde = 
> [   35.521514] Oops:  [#1] SMP
> [   35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
> 5.7.0-rc6-next-20200519+ #1
> [   35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> 2.2 05/23/2018
> [   35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60

Could you get faddr2line for this offset?

-- 
Michal Hocko
SUSE Labs


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-21 Thread Naresh Kamboju
On Thu, 21 May 2020 at 15:25, Michal Hocko  wrote:
>
> On Wed 20-05-20 20:09:06, Chris Down wrote:
> > Hi Naresh,
> >
> > Naresh Kamboju writes:
> > > As a part of investigation on this issue LKFT teammate Anders Roxell
> > > git bisected the problem and found bad commit(s) which caused this 
> > > problem.
> > >
> > > The following two patches have been reverted on next-20200519 and 
> > > retested the
> > > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > > ( invoked oom-killer is gone now)
> > >
> > > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > > protection"
> > >This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > >
> > > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > > checks"
> > >This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> >
> > Thanks Anders and Naresh for tracking this down and reverting.
> >
> > I'll take a look tomorrow. I don't see anything immediately obviously wrong
> > in either of those commits from a (very) cursory glance, but they should
> > only be taking effect if protections are set.
>
> Agreed. If memory.{low,min} is not used then the patch should be
> effectively a nop. Btw. do you see the problem when booting with
> cgroup_disable=memory kernel command line parameter?

With the extra kernel command line parameter cgroup_disable=memory,
I have noticed a different problem now.

+ mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
mke2fs 1.43.8 (1-Jan-2018)
Creating filesystem with 244190646 4k blocks and 61054976 inodes
Filesystem UUID: 3bb1a285-2cb4-44b4-b6e8-62548f3ac620
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 2048, 23887872, 71663616, 78675968,
10240, 214990848
Allocating group tables:0/7453   done
Writing inode tables:0/7453   done
Creating journal (262144 blocks): [   35.502102] BUG: kernel NULL
pointer dereference, address: 00c8
[   35.508372] #PF: supervisor read access in kernel mode
[   35.513506] #PF: error_code(0x) - not-present page
[   35.518638] *pde = 
[   35.521514] Oops:  [#1] SMP
[   35.524652] CPU: 0 PID: 145 Comm: kswapd0 Not tainted
5.7.0-rc6-next-20200519+ #1
[   35.532121] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.2 05/23/2018
[   35.539507] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
[   35.544724] Code: 00 00 80 3d 84 b5 e1 cb 00 89 c2 a1 9c a5 f5 cb
75 48 55 89 e5 57 56 53 3e 8d 74 26 00 8b 1d 88 b5 e1 cb 31 f6 eb 27
8d 76 00 <8b> 8a c8 00 00 00 8b ba bc 00 00 00 29 f9 39 c8 0f 4f c1 8b
8a 98
[   35.563461] EAX:  EBX: f5411000 ECX:  EDX: 
[   35.569718] ESI:  EDI: f4e13ea8 EBP: f4e13e10 ESP: f4e13e04
[   35.575976] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010207
[   35.582751] CR0: 80050033 CR2: 00c8 CR3: 0bef4000 CR4: 003406d0
[   35.589010] DR0:  DR1:  DR2:  DR3: 
[   35.595266] DR6: fffe0ff0 DR7: 0400
[   35.599096] Call Trace:
[   35.601544]  shrink_lruvec+0x447/0x630
[   35.605294]  ? newidle_balance.isra.100+0x8e/0x3f0
[   35.610080]  ? pick_next_task_fair+0x3a/0x320
[   35.614437]  ? deactivate_task+0xcf/0x100
[   35.618442]  ? put_prev_entity+0x1a/0xd0
[   35.622359]  ? deactivate_task+0xcf/0x100
[   35.626363]  shrink_node+0x1be/0x640
[   35.629932]  ? shrink_node+0x1be/0x640
[   35.633676]  kswapd+0x32c/0x890
[   35.636815]  ? deactivate_task+0xcf/0x100
[   35.640820]  kthread+0xf1/0x110
[   35.643963]  ? do_try_to_free_pages+0x3b0/0x3b0
[   35.648489]  ? kthread_park+0xa0/0xa0
[   35.652147]  ret_from_fork+0x1c/0x28
[   35.655726] Modules linked in: x86_pkg_temp_thermal
[   35.660605] CR2: 00c8
[   35.663916] ---[ end trace d85b8564ea55fb0d ]---
[   35.663917] BUG: kernel NULL pointer dereference, address: 00c8
[   35.663918] #PF: supervisor read access in kernel mode
[   35.668534] EIP: mem_cgroup_get_nr_swap_pages+0x28/0x60
[   35.674792] #PF: error_code(0x) - not-present page
[   35.674792] *pde = 
[   35.679921] Code: 00 00 80 3d 84 b5 e1 cb 00 89 c2 a1 9c a5 f5 cb
75 48 55 89 e5 57 56 53 3e 8d 74 26 00 8b 1d 88 b5 e1 cb 31 f6 eb 27
8d 76 00 <8b> 8a c8 00 00 00 8b ba bc 00 00 00 29 f9 39 c8 0f 4f c1 8b
8a 98
[   35.685140] Oops:  [#2] SMP
[   35.685142] CPU: 2 PID: 391 Comm: mkfs.ext4 Tainted: G  D
5.7.0-rc6-next-20200519+ #1
[   35.690278] EAX:  EBX: f5411000 ECX:  EDX: 
[   35.690279] ESI:  EDI: f4e13ea8 EBP: f4e13e10 ESP: f4e13e04
[   35.693155] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.2 05/23/2018
[   35

Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-21 Thread Michal Hocko
On Wed 20-05-20 20:09:06, Chris Down wrote:
> Hi Naresh,
> 
> Naresh Kamboju writes:
> > As a part of investigation on this issue LKFT teammate Anders Roxell
> > git bisected the problem and found bad commit(s) which caused this problem.
> > 
> > The following two patches have been reverted on next-20200519 and retested 
> > the
> > reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> > ( invoked oom-killer is gone now)
> > 
> > Revert "mm, memcg: avoid stale protection values when cgroup is above
> > protection"
> >This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> > 
> > Revert "mm, memcg: decouple e{low,min} state mutations from protection
> > checks"
> >This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
> 
> Thanks Anders and Naresh for tracking this down and reverting.
> 
> I'll take a look tomorrow. I don't see anything immediately obviously wrong
> in either of those commits from a (very) cursory glance, but they should
> only be taking effect if protections are set.

Agreed. If memory.{low,min} is not used then the patch should be
effectively a nop. Btw. do you see the problem when booting with
cgroup_disable=memory kernel command line parameter?

I suspect that something might be initialized for memcg incorrectly and
the patch just makes it more visible for some reason.
-- 
Michal Hocko
SUSE Labs


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-21 Thread Yafang Shao
On Thu, May 21, 2020 at 4:59 PM Naresh Kamboju
 wrote:
>
> On Thu, 21 May 2020 at 08:10, Yafang Shao  wrote:
> >
> > On Thu, May 21, 2020 at 2:00 AM Naresh Kamboju
> >  wrote:
> > >
> > > On Wed, 20 May 2020 at 17:26, Naresh Kamboju  
> > > wrote:
> > > >
> > > >
> > > > This issue is specific on 32-bit architectures i386 and arm on 
> > > > linux-next tree.
> > > > As per the test results history this problem started happening from
> > > > mkfs -t ext4 /dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190804A00BE5
> > > >
> > > >
> > > > Problem:
> > > > [   38.802375] dd invoked oom-killer: gfp_mask=0x100cc0(GFP_USER),
> > > > order=0, oom_score_adj=0
> > >
> > My guess is that we made the same mistake in commit "mm, memcg:
> > decouple e{low,min} state mutations from protection
> > checks" that it read a stale memcg protection in
> > mem_cgroup_below_low() and mem_cgroup_below_min().
> >
> > Below is a possible fix,
>
> Sorry. The proposed fix did not work.
> I took your patch, applied it on top of the linux-next master branch,
> and retested; mkfs -t ext4 still invoked the oom-killer.
>
> After patch applied test log link,
> https://lkft.validation.linaro.org/scheduler/job/1443936#L1168
>
>
> test  log,
> + mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
> mke2fs 1.43.8 (1-Jan-2018)
> Creating filesystem with 244190646 4k blocks and 61054976 inodes
> Filesystem UUID: ab107250-bf18-4357-a06a-67f2bfcc1048
> Superblock backups stored on blocks:
> 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> 4096000, 7962624, 11239424, 2048, 23887872, 71663616, 78675968,
> 10240, 214990848
> Allocating group tables:0/7453   done
> Writing inode tables:0/7453   done
> Creating journal (262144 blocks): [   34.423940] mkfs.ext4 invoked
> oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
> oom_score_adj=0
> [   34.433694] CPU: 0 PID: 402 Comm: mkfs.ext4 Not tainted
> 5.7.0-rc6-next-20200519+ #1
> [   34.441342] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> 2.2 05/23/2018
> [   34.448734] Call Trace:
> [   34.451196]  dump_stack+0x54/0x76
> [   34.454517]  dump_header+0x40/0x1f0
> [   34.458008]  ? oom_badness+0x1f/0x120
> [   34.461673]  ? ___ratelimit+0x6c/0xe0
> [   34.465332]  oom_kill_process+0xc9/0x110
> [   34.469255]  out_of_memory+0xd7/0x2f0
> [   34.472916]  __alloc_pages_nodemask+0xdd1/0xe90
> [   34.477446]  ? set_bh_page+0x33/0x50
> [   34.481016]  ? __xa_set_mark+0x4d/0x70
> [   34.484762]  pagecache_get_page+0xbe/0x250
> [   34.488859]  grab_cache_page_write_begin+0x1a/0x30
> [   34.493645]  block_write_begin+0x25/0x90
> [   34.497569]  blkdev_write_begin+0x1e/0x20
> [   34.501574]  ? bdev_evict_inode+0xc0/0xc0
> [   34.505578]  generic_perform_write+0x95/0x190
> [   34.509927]  __generic_file_write_iter+0xe0/0x1a0
> [   34.514626]  blkdev_write_iter+0xbf/0x1c0
> [   34.518630]  __vfs_write+0x122/0x1e0
> [   34.522200]  vfs_write+0x8f/0x1b0
> [   34.525510]  ksys_pwrite64+0x60/0x80
> [   34.529081]  __ia32_sys_ia32_pwrite64+0x16/0x20
> [   34.533604]  do_fast_syscall_32+0x66/0x240
> [   34.537697]  entry_SYSENTER_32+0xa5/0xf8
> [   34.541613] EIP: 0xb7f3c549
> [   34.544403] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01
> 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f
> 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90
> 8d 76
> [   34.563140] EAX: ffda EBX: 0003 ECX: b7830010 EDX: 0040
> [   34.569397] ESI: 3840 EDI: 0074 EBP: 07438400 ESP: bff1e650
> [   34.575654] DS: 007b ES: 007b FS:  GS: 0033 SS: 007b EFLAGS: 0246
> [   34.582453] Mem-Info:
> [   34.584732] active_anon:5713 inactive_anon:2169 isolated_anon:0
> [   34.584732]  active_file:4040 inactive_file:211204 isolated_file:0
> [   34.584732]  unevictable:0 dirty:17270 writeback:6240 unstable:0
> [   34.584732]  slab_reclaimable:5856 slab_unreclaimable:3439
> [   34.584732]  mapped:6192 shmem:2258 pagetables:178 bounce:0
> [   34.584732]  free:265105 free_pcp:1330 free_cma:0
> [   34.618483] Node 0 active_anon:22852kB inactive_anon:8676kB
> active_file:16160kB inactive_file:844816kB unevictable:0kB
> isolated(anon):0kB isolated(file):0kB mapped:24768kB dirty:69080kB
> writeback:19628kB shmem:9032kB writeback_tmp:0kB unstable:0kB
> all_unreclaimable? yes
> [   34.642354] DMA free:3588kB min:68kB low:84kB high:100kB
> reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> active_file:0kB inactive_file

Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-21 Thread Arnd Bergmann
On Thu, May 21, 2020 at 11:22 AM Naresh Kamboju
 wrote:
> On Thu, 21 May 2020 at 00:39, Chris Down  wrote:
> > Since you have i386 hardware available, and I don't, could you please apply
> > only "avoid stale protection" again and check if it only happens with that
> > commit, or requires both? That would help narrow down the suspects.

Note that Naresh is running an i386 kernel on regular 64-bit hardware that
most people have access to.

> kernel config link,
> https://builds.tuxbuild.com/8lg6WQibcwtQRRtIa0bcFA/kernel.config

Do you know if the same bug shows up running a kernel with that
configuration in qemu? I would expect it to, and that would make
it much easier to reproduce.

I would also not be surprised if it happens on all architectures but only
shows up on the 32-bit arm and x86 machines first because they have
a rather limited amount of lowmem. Maybe booting a 64-bit kernel
with "mem=512M" and then running "dd if=/dev/sda of=/dev/null bs=1M"
will also trigger it. I did not attempt to run this myself.

   Arnd


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-21 Thread Naresh Kamboju
On Thu, 21 May 2020 at 00:39, Chris Down  wrote:
>
> Hi Naresh,
>
> Naresh Kamboju writes:
> >As a part of investigation on this issue LKFT teammate Anders Roxell
> >git bisected the problem and found bad commit(s) which caused this problem.
> >
> >The following two patches have been reverted on next-20200519 and retested 
> >the
> >reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> >( invoked oom-killer is gone now)
> >
> >Revert "mm, memcg: avoid stale protection values when cgroup is above
> >protection"
> >This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
> >
> >Revert "mm, memcg: decouple e{low,min} state mutations from protection
> >checks"
> >This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
>
> Thanks Anders and Naresh for tracking this down and reverting.
>
> I'll take a look tomorrow. I don't see anything immediately obviously wrong in
> either of those commits from a (very) cursory glance, but they should only be
> taking effect if protections are set.
>
> Since you have i386 hardware available, and I don't, could you please apply
> only "avoid stale protection" again and check if it only happens with that
> commit, or requires both? That would help narrow down the suspects.

Not both.
The bad commit is
"mm, memcg: decouple e{low,min} state mutations from protection checks"

>
> Do you use any memcg protections in these tests?
I see three MEMCG configs; please find the kernel config link
below for more details.

CONFIG_MEMCG=y
CONFIG_MEMCG_SWAP=y
CONFIG_MEMCG_KMEM=y

kernel config link,
https://builds.tuxbuild.com/8lg6WQibcwtQRRtIa0bcFA/kernel.config

- Naresh


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-21 Thread Naresh Kamboju
On Thu, 21 May 2020 at 08:10, Yafang Shao  wrote:
>
> On Thu, May 21, 2020 at 2:00 AM Naresh Kamboju
>  wrote:
> >
> > On Wed, 20 May 2020 at 17:26, Naresh Kamboju  
> > wrote:
> > >
> > >
> > > This issue is specific on 32-bit architectures i386 and arm on linux-next 
> > > tree.
> > > As per the test results history this problem started happening from
> > > mkfs -t ext4 /dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190804A00BE5
> > >
> > >
> > > Problem:
> > > [   38.802375] dd invoked oom-killer: gfp_mask=0x100cc0(GFP_USER),
> > > order=0, oom_score_adj=0
> >
> My guess is that we made the same mistake in commit "mm, memcg:
> decouple e{low,min} state mutations from protection
> checks" that it read a stale memcg protection in
> mem_cgroup_below_low() and mem_cgroup_below_min().
>
> Below is a possible fix,

Sorry. The proposed fix did not work.
I took your patch, applied it on top of the linux-next master branch,
and retested; mkfs -t ext4 still invoked the oom-killer.

Test log link after applying the patch:
https://lkft.validation.linaro.org/scheduler/job/1443936#L1168


test log:
+ mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8NRK0BPF6XF
mke2fs 1.43.8 (1-Jan-2018)
Creating filesystem with 244190646 4k blocks and 61054976 inodes
Filesystem UUID: ab107250-bf18-4357-a06a-67f2bfcc1048
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 2048, 23887872, 71663616, 78675968,
10240, 214990848
Allocating group tables:0/7453           done
Writing inode tables:0/7453   done
Creating journal (262144 blocks): [   34.423940] mkfs.ext4 invoked
oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
oom_score_adj=0
[   34.433694] CPU: 0 PID: 402 Comm: mkfs.ext4 Not tainted
5.7.0-rc6-next-20200519+ #1
[   34.441342] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.2 05/23/2018
[   34.448734] Call Trace:
[   34.451196]  dump_stack+0x54/0x76
[   34.454517]  dump_header+0x40/0x1f0
[   34.458008]  ? oom_badness+0x1f/0x120
[   34.461673]  ? ___ratelimit+0x6c/0xe0
[   34.465332]  oom_kill_process+0xc9/0x110
[   34.469255]  out_of_memory+0xd7/0x2f0
[   34.472916]  __alloc_pages_nodemask+0xdd1/0xe90
[   34.477446]  ? set_bh_page+0x33/0x50
[   34.481016]  ? __xa_set_mark+0x4d/0x70
[   34.484762]  pagecache_get_page+0xbe/0x250
[   34.488859]  grab_cache_page_write_begin+0x1a/0x30
[   34.493645]  block_write_begin+0x25/0x90
[   34.497569]  blkdev_write_begin+0x1e/0x20
[   34.501574]  ? bdev_evict_inode+0xc0/0xc0
[   34.505578]  generic_perform_write+0x95/0x190
[   34.509927]  __generic_file_write_iter+0xe0/0x1a0
[   34.514626]  blkdev_write_iter+0xbf/0x1c0
[   34.518630]  __vfs_write+0x122/0x1e0
[   34.522200]  vfs_write+0x8f/0x1b0
[   34.525510]  ksys_pwrite64+0x60/0x80
[   34.529081]  __ia32_sys_ia32_pwrite64+0x16/0x20
[   34.533604]  do_fast_syscall_32+0x66/0x240
[   34.537697]  entry_SYSENTER_32+0xa5/0xf8
[   34.541613] EIP: 0xb7f3c549
[   34.544403] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01
10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f
34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90
8d 76
[   34.563140] EAX: ffda EBX: 0003 ECX: b7830010 EDX: 0040
[   34.569397] ESI: 3840 EDI: 0074 EBP: 07438400 ESP: bff1e650
[   34.575654] DS: 007b ES: 007b FS:  GS: 0033 SS: 007b EFLAGS: 0246
[   34.582453] Mem-Info:
[   34.584732] active_anon:5713 inactive_anon:2169 isolated_anon:0
[   34.584732]  active_file:4040 inactive_file:211204 isolated_file:0
[   34.584732]  unevictable:0 dirty:17270 writeback:6240 unstable:0
[   34.584732]  slab_reclaimable:5856 slab_unreclaimable:3439
[   34.584732]  mapped:6192 shmem:2258 pagetables:178 bounce:0
[   34.584732]  free:265105 free_pcp:1330 free_cma:0
[   34.618483] Node 0 active_anon:22852kB inactive_anon:8676kB
active_file:16160kB inactive_file:844816kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:24768kB dirty:69080kB
writeback:19628kB shmem:9032kB writeback_tmp:0kB unstable:0kB
all_unreclaimable? yes
[   34.642354] DMA free:3588kB min:68kB low:84kB high:100kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:0kB inactive_file:11848kB unevictable:0kB
writepending:11856kB present:15964kB managed:15876kB mlocked:0kB
kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
free_cma:0kB
[   34.670194] lowmem_reserve[]: 0 824 1947 824
[   34.674483] Normal free:4228kB min:3636kB low:4544kB high:5452kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:1136kB inactive_file:786456kB unevictable:0kB
writepending:68084kB present:884728kB managed:845324kB mlocked:0kB
kernel_stack:1104kB pagetables:0kB bounce:0kB free_pcp:3056kB
local_pcp:388

Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-20 Thread Yafang Shao
On Thu, May 21, 2020 at 2:00 AM Naresh Kamboju
 wrote:
>
> On Wed, 20 May 2020 at 17:26, Naresh Kamboju  
> wrote:
> >
> >
> > This issue is specific on 32-bit architectures i386 and arm on linux-next 
> > tree.
> > As per the test results history this problem started happening from
> > Bad : next-20200430
> > Good : next-20200429
> >
> > steps to reproduce:
> > dd if=/dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190504A00573
> > of=/dev/null bs=1M count=2048
> > or
> > mkfs -t ext4 /dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190804A00BE5
> >
> >
> > Problem:
> > [   38.802375] dd invoked oom-killer: gfp_mask=0x100cc0(GFP_USER),
> > order=0, oom_score_adj=0
>
> As a part of investigation on this issue LKFT teammate Anders Roxell
> git bisected the problem and found bad commit(s) which caused this problem.
>
> The following two patches have been reverted on next-20200519 and retested the
> reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> ( invoked oom-killer is gone now)
>
> Revert "mm, memcg: avoid stale protection values when cgroup is above
> protection"
> This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
>
> Revert "mm, memcg: decouple e{low,min} state mutations from protection
> checks"
> This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.
>

My guess is that we made the same mistake in commit "mm, memcg:
decouple e{low,min} state mutations from protection checks": it reads a
stale memcg protection in mem_cgroup_below_low() and
mem_cgroup_below_min().

Below is a possible fix,

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7a2c56fc..6591b71 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -391,20 +391,28 @@ static inline unsigned long
mem_cgroup_protection(struct mem_cgroup *root,
 void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 struct mem_cgroup *memcg);

-static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_below_low(struct mem_cgroup *root,
+   struct mem_cgroup *memcg)
 {
if (mem_cgroup_disabled())
return false;

+   if (root == memcg)
+   return false;
+
return READ_ONCE(memcg->memory.elow) >=
		page_counter_read(&memcg->memory);
 }

-static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_below_min(struct mem_cgroup *root,
+   struct mem_cgroup *memcg)
 {
if (mem_cgroup_disabled())
return false;

+   if (root == memcg)
+   return false;
+
return READ_ONCE(memcg->memory.emin) >=
		page_counter_read(&memcg->memory);
 }
@@ -896,12 +904,14 @@ static inline void
mem_cgroup_calculate_protection(struct mem_cgroup *root,
 {
 }

-static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_below_low(struct mem_cgroup *root,
+   struct mem_cgroup *memcg)
 {
return false;
 }

-static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_below_min(struct mem_cgroup *root,
+   struct mem_cgroup *memcg)
 {
return false;
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c71660e..fdcdd88 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2637,13 +2637,13 @@ static void shrink_node_memcgs(pg_data_t
*pgdat, struct scan_control *sc)

mem_cgroup_calculate_protection(target_memcg, memcg);

-   if (mem_cgroup_below_min(memcg)) {
+   if (mem_cgroup_below_min(target_memcg, memcg)) {
/*
 * Hard protection.
 * If there is no reclaimable memory, OOM.
 */
continue;
-   } else if (mem_cgroup_below_low(memcg)) {
+   } else if (mem_cgroup_below_low(target_memcg, memcg)) {
/*
 * Soft protection.
 * Respect the protection only as long as





> i386 test log shows mkfs -t ext4 pass
> https://lkft.validation.linaro.org/scheduler/job/1443405#L1200
>
> ref:
> https://lore.kernel.org/linux-mm/cover.1588092152.git.ch...@chrisdown.name/
> https://lore.kernel.org/linux-mm/ca+g9fyvzlm7n1be7ajxd8_49fogpgwwtiq7sxkvre_zoerj...@mail.gmail.com/T/#t
>
> --
> Linaro LKFT
> https://lkft.linaro.org



--
Thanks
Yafang


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-20 Thread Chris Down

Hi Naresh,

Naresh Kamboju writes:

> As a part of investigation on this issue LKFT teammate Anders Roxell
> git bisected the problem and found bad commit(s) which caused this problem.
>
> The following two patches have been reverted on next-20200519 and retested the
> reproducible steps and confirmed the test case mkfs -t ext4 got PASS.
> ( invoked oom-killer is gone now)
>
> Revert "mm, memcg: avoid stale protection values when cgroup is above
> protection"
>    This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.
>
> Revert "mm, memcg: decouple e{low,min} state mutations from protection
> checks"
>    This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.


Thanks Anders and Naresh for tracking this down and reverting.

I'll take a look tomorrow. I don't see anything immediately obviously wrong in 
either of those commits from a (very) cursory glance, but they should only be 
taking effect if protections are set.


Since you have i386 hardware available, and I don't, could you please apply 
only "avoid stale protection" again and check if it only happens with that 
commit, or requires both? That would help narrow down the suspects.


Do you use any memcg protections in these tests?

Thank you!

Chris


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-20 Thread Naresh Kamboju
On Wed, 20 May 2020 at 17:26, Naresh Kamboju  wrote:
>
>
> This issue is specific on 32-bit architectures i386 and arm on linux-next 
> tree.
> As per the test results history this problem started happening from
> Bad : next-20200430
> Good : next-20200429
>
> steps to reproduce:
> dd if=/dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190504A00573
> of=/dev/null bs=1M count=2048
> or
> mkfs -t ext4 /dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190804A00BE5
>
>
> Problem:
> [   38.802375] dd invoked oom-killer: gfp_mask=0x100cc0(GFP_USER),
> order=0, oom_score_adj=0

As part of investigating this issue, LKFT teammate Anders Roxell bisected
the problem with git and found the bad commit(s) that caused it.

The following two patches were reverted on next-20200519; the reproduction
steps were then retested and the mkfs -t ext4 test case passed
(the oom-killer invocation is gone now).

Revert "mm, memcg: avoid stale protection values when cgroup is above
protection"
This reverts commit 23a53e1c02006120f89383270d46cbd040a70bc6.

Revert "mm, memcg: decouple e{low,min} state mutations from protection
checks"
This reverts commit 7b88906ab7399b58bb088c28befe50bcce076d82.

i386 test log shows mkfs -t ext4 pass
https://lkft.validation.linaro.org/scheduler/job/1443405#L1200

ref:
https://lore.kernel.org/linux-mm/cover.1588092152.git.ch...@chrisdown.name/
https://lore.kernel.org/linux-mm/ca+g9fyvzlm7n1be7ajxd8_49fogpgwwtiq7sxkvre_zoerj...@mail.gmail.com/T/#t

-- 
Linaro LKFT
https://lkft.linaro.org


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-20 Thread Naresh Kamboju
FYI,

This issue is specific to the 32-bit architectures i386 and arm on the linux-next tree.
Per the test-results history, the problem started between:
Bad : next-20200430
Good : next-20200429

steps to reproduce:
dd if=/dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190504A00573
of=/dev/null bs=1M count=2048
or
mkfs -t ext4 /dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190804A00BE5


Problem:
[   38.802375] dd invoked oom-killer: gfp_mask=0x100cc0(GFP_USER),
order=0, oom_score_adj=0

i386 crash log:  https://pastebin.com/Hb8U89vU
arm crash log: https://pastebin.com/BD9t3JTm

On Tue, 19 May 2020 at 14:15, Michal Hocko  wrote:
>
> On Tue 19-05-20 10:11:25, Arnd Bergmann wrote:
> > On Tue, May 19, 2020 at 9:52 AM Michal Hocko  wrote:
> > >
> > > On Mon 18-05-20 19:40:55, Naresh Kamboju wrote:
> > > > Thanks for looking into this problem.
> > > >
> > > > On Sat, 2 May 2020 at 02:28, Andrew Morton  
> > > > wrote:
> > > > >
> > > > > On Fri, 1 May 2020 18:08:28 +0530 Naresh Kamboju 
> > > > >  wrote:
> > > > >
> > > > > > mkfs -t ext4 invoked oom-killer on i386 kernel running on x86_64 
> > > > > > device
> > > > > > and started happening on linux -next master branch kernel tag 
> > > > > > next-20200430
> > > > > > and next-20200501. We did not bisect this problem.
> > > [...]
> > > > Creating journal (131072 blocks): [   31.251333] mkfs.ext4 invoked
> > > > oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
> > > > oom_score_adj=0
> > > [...]
> > > > [   31.500943] DMA free:187396kB min:22528kB low:28160kB high:33792kB
> > > > reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> > > > active_file:4736kB inactive_file:431688kB unevictable:0kB
> > > > writepending:62020kB present:783360kB managed:668264kB mlocked:0kB
> > > > kernel_stack:888kB pagetables:0kB bounce:0kB free_pcp:880kB
> > > > local_pcp:216kB free_cma:163840kB
> > >
> > > This is really unexpected. You are saying this is a regular i386 and DMA
> > > should be bottom 16MB while yours is 780MB and the rest of the low mem
> > > is in the Normal zone which is completely missing here. How have you got
> > > to that configuration? I have to say I haven't seen anything like that
> > > on i386.
> >
> > I think that line comes from an ARM32 beaglebone-X15 machine showing
> > the same symptom. The i386 line from the log file that Naresh linked to at
> > https://lkft.validation.linaro.org/scheduler/job/1406110#L1223  is less
> > unusual:
>
> OK, that makes more sense! At least for the memory layout.
>
> > [   34.931663] Node 0 active_anon:21464kB inactive_anon:8688kB
> > active_file:16604kB inactive_file:849976kB unevictable:0kB
> > isolated(anon):0kB isolated(file):0kB mapped:25284kB dirty:58952kB
> > writeback:27772kB shmem:8944kB writeback_tmp:0kB unstable:0kB
> > all_unreclaimable? yes
> > [   34.955523] DMA free:3356kB min:68kB low:84kB high:100kB
> > reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> > active_file:0kB inactive_file:11964kB unevictable:0kB
> > writepending:11980kB present:15964kB managed:15876kB mlocked:0kB
> > kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
> > free_cma:0kB
> > [   34.983385] lowmem_reserve[]: 0 825 1947 825
> > [   34.987678] Normal free:3948kB min:7732kB low:8640kB high:9548kB
> > reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> > active_file:1096kB inactive_file:786400kB unevictable:0kB
> > writepending:65432kB present:884728kB managed:845576kB mlocked:0kB
> > kernel_stack:1112kB pagetables:0kB bounce:0kB free_pcp:2908kB
> > local_pcp:500kB free_cma:0kB
>
> The lowmem is really low (way below the min watermark so even memory
> reserves for high priority and atomic requests are depleted. There is
> still 786MB of inactive page cache to be reclaimed. It doesn't seem to
> be dirty or under the writeback but it still might be pinned by the
> filesystem. I would suggest watching vmscan reclaim tracepoints and
> check why the reclaim fails to reclaim anything.
> --
> Michal Hocko
> SUSE Labs


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-19 Thread Michal Hocko
On Tue 19-05-20 10:11:25, Arnd Bergmann wrote:
> On Tue, May 19, 2020 at 9:52 AM Michal Hocko  wrote:
> >
> > On Mon 18-05-20 19:40:55, Naresh Kamboju wrote:
> > > Thanks for looking into this problem.
> > >
> > > On Sat, 2 May 2020 at 02:28, Andrew Morton  
> > > wrote:
> > > >
> > > > On Fri, 1 May 2020 18:08:28 +0530 Naresh Kamboju 
> > > >  wrote:
> > > >
> > > > > mkfs -t ext4 invoked oom-killer on i386 kernel running on x86_64 
> > > > > device
> > > > > and started happening on linux -next master branch kernel tag 
> > > > > next-20200430
> > > > > and next-20200501. We did not bisect this problem.
> > [...]
> > > Creating journal (131072 blocks): [   31.251333] mkfs.ext4 invoked
> > > oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
> > > oom_score_adj=0
> > [...]
> > > [   31.500943] DMA free:187396kB min:22528kB low:28160kB high:33792kB
> > > reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> > > active_file:4736kB inactive_file:431688kB unevictable:0kB
> > > writepending:62020kB present:783360kB managed:668264kB mlocked:0kB
> > > kernel_stack:888kB pagetables:0kB bounce:0kB free_pcp:880kB
> > > local_pcp:216kB free_cma:163840kB
> >
> > This is really unexpected. You are saying this is a regular i386 and DMA
> > should be bottom 16MB while yours is 780MB and the rest of the low mem
> > is in the Normal zone which is completely missing here. How have you got
> > to that configuration? I have to say I haven't seen anything like that
> > on i386.
> 
> I think that line comes from an ARM32 beaglebone-X15 machine showing
> the same symptom. The i386 line from the log file that Naresh linked to at
> https://lkft.validation.linaro.org/scheduler/job/1406110#L1223  is less
> unusual:

OK, that makes more sense! At least for the memory layout.
 
> [   34.931663] Node 0 active_anon:21464kB inactive_anon:8688kB
> active_file:16604kB inactive_file:849976kB unevictable:0kB
> isolated(anon):0kB isolated(file):0kB mapped:25284kB dirty:58952kB
> writeback:27772kB shmem:8944kB writeback_tmp:0kB unstable:0kB
> all_unreclaimable? yes
> [   34.955523] DMA free:3356kB min:68kB low:84kB high:100kB
> reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> active_file:0kB inactive_file:11964kB unevictable:0kB
> writepending:11980kB present:15964kB managed:15876kB mlocked:0kB
> kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
> free_cma:0kB
> [   34.983385] lowmem_reserve[]: 0 825 1947 825
> [   34.987678] Normal free:3948kB min:7732kB low:8640kB high:9548kB
> reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> active_file:1096kB inactive_file:786400kB unevictable:0kB
> writepending:65432kB present:884728kB managed:845576kB mlocked:0kB
> kernel_stack:1112kB pagetables:0kB bounce:0kB free_pcp:2908kB
> local_pcp:500kB free_cma:0kB

The lowmem is really low (way below the min watermark), so even the memory
reserves for high-priority and atomic requests are depleted. There is
still 786MB of inactive page cache to be reclaimed. It doesn't seem to
be dirty or under writeback, but it still might be pinned by the
filesystem. I would suggest watching the vmscan reclaim tracepoints to
check why reclaim fails to reclaim anything.
-- 
Michal Hocko
SUSE Labs
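For reference, Michal's suggestion can be followed via the vmscan tracepoints under tracefs. A sketch, assuming root privileges and tracefs mounted at /sys/kernel/tracing (a tracing configuration fragment, not something to run on an arbitrary machine):

```shell
# Enable all vmscan reclaim tracepoints (requires root; tracefs must be mounted).
TRACING=/sys/kernel/tracing
echo 1 > "$TRACING/events/vmscan/enable"
echo 1 > "$TRACING/tracing_on"

# Reproduce the failure, e.g.:
# mkfs -t ext4 /dev/sdX

# Inspect what direct reclaim actually scanned and reclaimed.
grep -E 'mm_vmscan_direct_reclaim_(begin|end)' "$TRACING/trace" | tail

# Turn the tracepoints back off afterwards.
echo 0 > "$TRACING/events/vmscan/enable"
```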


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-19 Thread Arnd Bergmann
On Tue, May 19, 2020 at 9:52 AM Michal Hocko  wrote:
>
> On Mon 18-05-20 19:40:55, Naresh Kamboju wrote:
> > Thanks for looking into this problem.
> >
> > On Sat, 2 May 2020 at 02:28, Andrew Morton  
> > wrote:
> > >
> > > On Fri, 1 May 2020 18:08:28 +0530 Naresh Kamboju 
> > >  wrote:
> > >
> > > > mkfs -t ext4 invoked oom-killer on i386 kernel running on x86_64 device
> > > > and started happening on linux -next master branch kernel tag 
> > > > next-20200430
> > > > and next-20200501. We did not bisect this problem.
> [...]
> > Creating journal (131072 blocks): [   31.251333] mkfs.ext4 invoked
> > oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
> > oom_score_adj=0
> [...]
> > [   31.500943] DMA free:187396kB min:22528kB low:28160kB high:33792kB
> > reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> > active_file:4736kB inactive_file:431688kB unevictable:0kB
> > writepending:62020kB present:783360kB managed:668264kB mlocked:0kB
> > kernel_stack:888kB pagetables:0kB bounce:0kB free_pcp:880kB
> > local_pcp:216kB free_cma:163840kB
>
> This is really unexpected. You are saying this is a regular i386 and DMA
> should be bottom 16MB while yours is 780MB and the rest of the low mem
> is in the Normal zone which is completely missing here. How have you got
> to that configuration? I have to say I haven't seen anything like that
> on i386.

I think that line comes from an ARM32 beaglebone-X15 machine showing
the same symptom. The i386 line from the log file that Naresh linked to at
https://lkft.validation.linaro.org/scheduler/job/1406110#L1223  is less
unusual:

[   34.931663] Node 0 active_anon:21464kB inactive_anon:8688kB
active_file:16604kB inactive_file:849976kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:25284kB dirty:58952kB
writeback:27772kB shmem:8944kB writeback_tmp:0kB unstable:0kB
all_unreclaimable? yes
[   34.955523] DMA free:3356kB min:68kB low:84kB high:100kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:0kB inactive_file:11964kB unevictable:0kB
writepending:11980kB present:15964kB managed:15876kB mlocked:0kB
kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
free_cma:0kB
[   34.983385] lowmem_reserve[]: 0 825 1947 825
[   34.987678] Normal free:3948kB min:7732kB low:8640kB high:9548kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:1096kB inactive_file:786400kB unevictable:0kB
writepending:65432kB present:884728kB managed:845576kB mlocked:0kB
kernel_stack:1112kB pagetables:0kB bounce:0kB free_pcp:2908kB
local_pcp:500kB free_cma:0kB
[   35.017427] lowmem_reserve[]: 0 0 8980 0
[   35.021362] HighMem free:1049496kB min:512kB low:1748kB high:2984kB
reserved_highatomic:0KB active_anon:21464kB inactive_anon:8688kB
active_file:15508kB inactive_file:51612kB unevictable:0kB
writepending:0kB present:1149540kB managed:1149540kB mlocked:0kB
kernel_stack:0kB pagetables:712kB bounce:0kB free_pcp:1524kB
local_pcp:292kB free_cma:0kB
[   35.051717] lowmem_reserve[]: 0 0 0 0
[   35.055374] DMA: 8*4kB (UE) 1*8kB (E) 1*16kB (E) 0*32kB 0*64kB
0*128kB 1*256kB (E) 0*512kB 1*1024kB (E) 1*2048kB (E) 0*4096kB =
3384kB
[   35.067446] Normal: 27*4kB (U) 23*8kB (U) 12*16kB (UE) 12*32kB (U)
4*64kB (UE) 2*128kB (U) 2*256kB (UE) 1*512kB (E) 0*1024kB 1*2048kB (U)
0*4096kB = 4452kB
[   35.081347] HighMem: 2*4kB (UM) 0*8kB 1*16kB (M) 2*32kB (UM) 1*64kB
(U) 0*128kB 1*256kB (M) 1*512kB (M) 0*1024kB 0*2048kB 256*4096kB (M) =
1049496kB

Arnd


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-19 Thread Michal Hocko
On Mon 18-05-20 19:40:55, Naresh Kamboju wrote:
> Thanks for looking into this problem.
> 
> On Sat, 2 May 2020 at 02:28, Andrew Morton  wrote:
> >
> > On Fri, 1 May 2020 18:08:28 +0530 Naresh Kamboju 
> >  wrote:
> >
> > > mkfs -t ext4 invoked oom-killer on i386 kernel running on x86_64 device
> > > and started happening on linux -next master branch kernel tag 
> > > next-20200430
> > > and next-20200501. We did not bisect this problem.
[...]
> Creating journal (131072 blocks): [   31.251333] mkfs.ext4 invoked
> oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
> oom_score_adj=0
[...]
> [   31.500943] DMA free:187396kB min:22528kB low:28160kB high:33792kB
> reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> active_file:4736kB inactive_file:431688kB unevictable:0kB
> writepending:62020kB present:783360kB managed:668264kB mlocked:0kB
> kernel_stack:888kB pagetables:0kB bounce:0kB free_pcp:880kB
> local_pcp:216kB free_cma:163840kB

This is really unexpected. You are saying this is a regular i386 and DMA
should be bottom 16MB while yours is 780MB and the rest of the low mem
is in the Normal zone which is completely missing here. How have you got
to that configuration? I have to say I haven't seen anything like that
on i386.

The failing request is GFP_USER, so highmem is not allowed, but free
pages are way above the watermarks, so the allocation should have just
succeeded.

-- 
Michal Hocko
SUSE Labs
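The watermark arithmetic Michal refers to can be checked by hand from the meminfo dumps above. A simplified sketch of the kernel's zone_watermark_ok() test, using the kB figures as printed in the logs (the real check works in pages, adjusts min by allocation flags, and applies the per-zone lowmem_reserve):

```python
def zone_watermark_ok(free_kb, min_kb, lowmem_reserve_kb=0):
    # An allocation may proceed only if the zone stays above its min
    # watermark plus the reserve protecting it from higher-zone users.
    return free_kb > min_kb + lowmem_reserve_kb

# i386 DMA zone from the log: free 187396kB, min 22528kB -> well above min.
print(zone_watermark_ok(187396, 22528))  # True
# arm Normal zone: free 3948kB, min 7732kB -> below min, reclaim/OOM path.
print(zone_watermark_ok(3948, 7732))     # False
```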


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-18 Thread Naresh Kamboju
Thanks for looking into this problem.

On Sat, 2 May 2020 at 02:28, Andrew Morton  wrote:
>
> On Fri, 1 May 2020 18:08:28 +0530 Naresh Kamboju  
> wrote:
>
> > mkfs -t ext4 invoked oom-killer on i386 kernel running on x86_64 device
> > and started happening on linux -next master branch kernel tag next-20200430
> > and next-20200501. We did not bisect this problem.
>
> It would be wonderful if you could do so, please.  I can't immediately see
> any MM change in this area which might cause this.

We are planning a bisection soon on this problem.

>
> > metadata
> >   git branch: master
> >   git repo: 
> > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
> >   git commit: e4a08b64261ab411b15580c369a3b8fbed28bbc1
> >   git describe: next-20200430
> >   make_kernelversion: 5.7.0-rc3
> >   kernel-config:
> > https://builds.tuxbuild.com/1YrE_XUQ6odA52tSBM919w/kernel.config
> >
> > Steps to reproduce: (always reproducible)
>
> Reproducibility helps!
>
> > oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
>
> > [   34.793430]  pagecache_get_page+0xae/0x260
>
> > [   34.897923] active_anon:5366 inactive_anon:2172 isolated_anon:0
> > [   34.897923]  active_file:4151 inactive_file:212494 isolated_file:0
> > [   34.897923]  unevictable:0 dirty:16505 writeback:6520 unstable:0
>
> > [ 34.987678] Normal free:3948kB min:7732kB low:8640kB high:9548kB
> > reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> > active_file:1096kB inactive_file:786400kB unevictable:0kB
> > writepending:65432kB present:884728kB managed:845576kB mlocked:0kB
> > kernel_stack:1112kB pagetables:0kB bounce:0kB free_pcp:2908kB
> > local_pcp:500kB free_cma:0kB
>
> ZONE_NORMAL has a huge amount of clean pagecache stuck on the
> inactive list, not being reclaimed.

FYI,
this issue has already been reported here. The problem is now easily
reproducible on i386 and on arm beagleboard-X15 devices.

mkfs -t ext4 /dev/disk/by-id/ata-SanDisk_SSD_PLUS_120GB_190703A01414
mke2fs 1.43.8 (1-Jan-2018)
Discarding device blocks: done
Creating filesystem with 29306880 4k blocks and 7331840 inodes
Filesystem UUID: a838d994-0a1e-403a-88d5-444d75aecc5a
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 2048, 23887872
Allocating group tables:   0/895 done
Writing inode tables:   0/895 done
Creating journal (131072 blocks): [   31.251333] mkfs.ext4 invoked
oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
oom_score_adj=0
[   31.261172] CPU: 0 PID: 397 Comm: mkfs.ext4 Not tainted
5.7.0-rc6-next-20200518 #1
[   31.268771] Hardware name: Generic DRA74X (Flattened Device Tree)
[   31.274904] [] (unwind_backtrace) from []
(show_stack+0x10/0x14)
[   31.282685] [] (show_stack) from []
(dump_stack+0xc4/0xd8)
[   31.289940] [] (dump_stack) from []
(dump_header+0x54/0x1ec)
[   31.297367] [] (dump_header) from []
(oom_kill_process+0x18c/0x198)
[   31.305405] [] (oom_kill_process) from []
(out_of_memory+0x250/0x368)
[   31.313619] [] (out_of_memory) from []
(__alloc_pages_nodemask+0xce8/0x10bc)
[   31.322445] [] (__alloc_pages_nodemask) from []
(pagecache_get_page+0x128/0x358)
[   31.331619] [] (pagecache_get_page) from []
(grab_cache_page_write_begin+0x18/0x2c)
[   31.341054] [] (grab_cache_page_write_begin) from
[] (block_write_begin+0x20/0xc4)
[   31.350401] [] (block_write_begin) from []
(generic_perform_write+0xb8/0x1d8)
[   31.359312] [] (generic_perform_write) from []
(__generic_file_write_iter+0x164/0x1ec)
[   31.369007] [] (__generic_file_write_iter) from
[] (blkdev_write_iter+0xc8/0x1a4)
[   31.378269] [] (blkdev_write_iter) from []
(__vfs_write+0x13c/0x1cc)
[   31.386397] [] (__vfs_write) from []
(vfs_write+0xb0/0x1bc)
[   31.393738] [] (vfs_write) from []
(ksys_pwrite64+0x60/0x8c)
[   31.401167] [] (ksys_pwrite64) from []
(ret_fast_syscall+0x0/0x4c)
[   31.409115] Exception stack(0xe810dfa8 to 0xe810dff0)
[   31.414185] dfa0:   a200 000d 0003
b6952008 0040 
[   31.422395] dfc0: a200 000d a200 00b5 0040
0003b768 b6952008 00da2000
[   31.430604] dfe0: 0064 beb891b8 b6f85108 b6e38f2c
[   31.435809] Mem-Info:
[   31.438098] active_anon:5813 inactive_anon:4129 isolated_anon:0
[   31.438098]  active_file:6080 inactive_file:118548 isolated_file:0
[   31.438098]  unevictable:0 dirty:13674 writeback:7440 unstable:0
[   31.438098]  slab_reclaimable:5651 slab_unreclaimable:4566
[   31.438098]  mapped:5585 shmem:4468 pagetables:182 bounce:0
[   31.438098]  free:347556 free_pcp:608 free_cma:57235
[   31.47236

Re: [PATCH] doc: cgroup: update note about conditions when oom killer is invoked

2020-05-11 Thread Michal Hocko
On Mon 11-05-20 12:34:00, Konstantin Khlebnikov wrote:
> 
> 
> On 11/05/2020 11.39, Michal Hocko wrote:
> > On Fri 08-05-20 17:16:29, Konstantin Khlebnikov wrote:
> > > Starting from v4.19 commit 29ef680ae7c2 ("memcg, oom: move out_of_memory
> > > back to the charge path") cgroup oom killer is no longer invoked only from
> > > page faults. Now it implements the same semantics as global OOM killer:
> > > allocation context invokes OOM killer and keeps retrying until success.
> > > 
> > > Signed-off-by: Konstantin Khlebnikov 
> > 
> > Acked-by: Michal Hocko 
> > 
> > > ---
> > >   Documentation/admin-guide/cgroup-v2.rst |   17 -
> > >   1 file changed, 8 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/Documentation/admin-guide/cgroup-v2.rst 
> > > b/Documentation/admin-guide/cgroup-v2.rst
> > > index bcc80269bb6a..1bb9a8f6ebe1 100644
> > > --- a/Documentation/admin-guide/cgroup-v2.rst
> > > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > > @@ -1172,6 +1172,13 @@ PAGE_SIZE multiple when read back.
> > >   Under certain circumstances, the usage may go over the limit
> > >   temporarily.
> > > + In default configuration regular 0-order allocation always
> > > + succeed unless OOM killer choose current task as a victim.
> > > +
> > > + Some kinds of allocations don't invoke the OOM killer.
> > > + Caller could retry them differently, return into userspace
> > > + as -ENOMEM or silently ignore in cases like disk readahead.
> > 
> > I would probably add -EFAULT but the less error codes we document the
> > better.
> 
> Yeah, EFAULT was a most obscure result of memory shortage.
> Fortunately with new behaviour this shouldn't happens a lot.

Yes, it shouldn't really happen very often. gup was the most prominent
example but this one should be taken care of by triggering the OOM
killer. But I wouldn't bet my hat there are no potential cases anymore.

> Actually where it is still possible? THP always fallback to 0-order.
> I mean EFAULT could appear inside kernel only if task is killed so
> nobody would see it.

Yes fatal_signal_pending paths are ok. And no I do not have any specific
examples. But as you've said EFAULT was a real surprise so I thought it
would be nice to still keep a reference for it around. Even when it is
unlikely.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] doc: cgroup: update note about conditions when oom killer is invoked

2020-05-11 Thread Konstantin Khlebnikov
On 11/05/2020 11.39, Michal Hocko wrote:

On Fri 08-05-20 17:16:29, Konstantin Khlebnikov wrote:

Starting from v4.19 commit 29ef680ae7c2 ("memcg, oom: move out_of_memory
back to the charge path") cgroup oom killer is no longer invoked only from
page faults. Now it implements the same semantics as global OOM killer:
allocation context invokes OOM killer and keeps retrying until success.

Signed-off-by: Konstantin Khlebnikov 


Acked-by: Michal Hocko 


---
  Documentation/admin-guide/cgroup-v2.rst |   17 -
  1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guide/cgroup-v2.rst
index bcc80269bb6a..1bb9a8f6ebe1 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1172,6 +1172,13 @@ PAGE_SIZE multiple when read back.
Under certain circumstances, the usage may go over the limit
temporarily.
  
+	In default configuration regular 0-order allocation always

+   succeed unless OOM killer choose current task as a victim.
+
+   Some kinds of allocations don't invoke the OOM killer.
+   Caller could retry them differently, return into userspace
+   as -ENOMEM or silently ignore in cases like disk readahead.


I would probably add -EFAULT but the less error codes we document the
better.


Yeah, EFAULT was the most obscure result of memory shortage.
Fortunately, with the new behaviour this shouldn't happen much.

Actually, where is it still possible? THP always falls back to 0-order.
I mean, EFAULT could appear inside the kernel only if the task is killed,
so nobody would see it.




+
This is the ultimate protection mechanism.  As long as the
high limit is used and monitored properly, this limit's
utility is limited to providing the final safety net.
@@ -1228,17 +1235,9 @@ PAGE_SIZE multiple when read back.
The number of time the cgroup's memory usage was
reached the limit and allocation was about to fail.
  
-		Depending on context result could be invocation of OOM

-   killer and retrying allocation or failing allocation.
-
-   Failed allocation in its turn could be returned into
-   userspace as -ENOMEM or silently ignored in cases like
-   disk readahead.  For now OOM in memory cgroup kills
-   tasks iff shortage has happened inside page fault.
-
This event is not raised if the OOM killer is not
considered as an option, e.g. for failed high-order
-   allocations.
+   allocations or if caller asked to not retry attempts.
  
  	  oom_kill

The number of processes belonging to this cgroup


Re: [PATCH] doc: cgroup: update note about conditions when oom killer is invoked

2020-05-11 Thread Michal Hocko
On Fri 08-05-20 17:16:29, Konstantin Khlebnikov wrote:
> Starting from v4.19 commit 29ef680ae7c2 ("memcg, oom: move out_of_memory
> back to the charge path") cgroup oom killer is no longer invoked only from
> page faults. Now it implements the same semantics as global OOM killer:
> allocation context invokes OOM killer and keeps retrying until success.
> 
> Signed-off-by: Konstantin Khlebnikov 

Acked-by: Michal Hocko 

> ---
>  Documentation/admin-guide/cgroup-v2.rst |   17 -
>  1 file changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst 
> b/Documentation/admin-guide/cgroup-v2.rst
> index bcc80269bb6a..1bb9a8f6ebe1 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1172,6 +1172,13 @@ PAGE_SIZE multiple when read back.
>   Under certain circumstances, the usage may go over the limit
>   temporarily.
>  
> + In default configuration regular 0-order allocation always
> + succeed unless OOM killer choose current task as a victim.
> +
> + Some kinds of allocations don't invoke the OOM killer.
> + Caller could retry them differently, return into userspace
> + as -ENOMEM or silently ignore in cases like disk readahead.

I would probably add -EFAULT but the less error codes we document the
better.

> +
>   This is the ultimate protection mechanism.  As long as the
>   high limit is used and monitored properly, this limit's
>   utility is limited to providing the final safety net.
> @@ -1228,17 +1235,9 @@ PAGE_SIZE multiple when read back.
>   The number of time the cgroup's memory usage was
>   reached the limit and allocation was about to fail.
>  
> - Depending on context result could be invocation of OOM
> - killer and retrying allocation or failing allocation.
> -
> - Failed allocation in its turn could be returned into
> - userspace as -ENOMEM or silently ignored in cases like
> - disk readahead.  For now OOM in memory cgroup kills
> - tasks iff shortage has happened inside page fault.
> -
>   This event is not raised if the OOM killer is not
>   considered as an option, e.g. for failed high-order
> - allocations.
> + allocations or if caller asked to not retry attempts.
>  
> oom_kill
>   The number of processes belonging to this cgroup

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] doc: cgroup: update note about conditions when oom killer is invoked

2020-05-08 Thread Randy Dunlap
Hi,

On 5/8/20 7:16 AM, Konstantin Khlebnikov wrote:
> Starting from v4.19 commit 29ef680ae7c2 ("memcg, oom: move out_of_memory
> back to the charge path") cgroup oom killer is no longer invoked only from
> page faults. Now it implements the same semantics as global OOM killer:
> allocation context invokes OOM killer and keeps retrying until success.
> 
> Signed-off-by: Konstantin Khlebnikov 
> ---
>  Documentation/admin-guide/cgroup-v2.rst |   17 -
>  1 file changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst 
> b/Documentation/admin-guide/cgroup-v2.rst
> index bcc80269bb6a..1bb9a8f6ebe1 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1172,6 +1172,13 @@ PAGE_SIZE multiple when read back.
>   Under certain circumstances, the usage may go over the limit
>   temporarily.
>  
> + In default configuration regular 0-order allocation always

         allocations

> + succeed unless OOM killer choose current task as a victim.

      chooses

> +
> + Some kinds of allocations don't invoke the OOM killer.
> + Caller could retry them differently, return into userspace
> + as -ENOMEM or silently ignore in cases like disk readahead.
> +
>   This is the ultimate protection mechanism.  As long as the
>   high limit is used and monitored properly, this limit's
>   utility is limited to providing the final safety net.
> @@ -1228,17 +1235,9 @@ PAGE_SIZE multiple when read back.
>   The number of time the cgroup's memory usage was
>   reached the limit and allocation was about to fail.
>  
> - Depending on context result could be invocation of OOM
> - killer and retrying allocation or failing allocation.
> -
> - Failed allocation in its turn could be returned into
> - userspace as -ENOMEM or silently ignored in cases like
> - disk readahead.  For now OOM in memory cgroup kills
> - tasks iff shortage has happened inside page fault.
> -
>   This event is not raised if the OOM killer is not
>   considered as an option, e.g. for failed high-order
> - allocations.
> + allocations or if caller asked to not retry attempts.
>  
> oom_kill
>   The number of processes belonging to this cgroup
> 


thanks for updating the docs.
-- 
~Randy



[PATCH] doc: cgroup: update note about conditions when oom killer is invoked

2020-05-08 Thread Konstantin Khlebnikov
Starting from v4.19 commit 29ef680ae7c2 ("memcg, oom: move out_of_memory
back to the charge path") cgroup oom killer is no longer invoked only from
page faults. Now it implements the same semantics as global OOM killer:
allocation context invokes OOM killer and keeps retrying until success.

Signed-off-by: Konstantin Khlebnikov 
---
 Documentation/admin-guide/cgroup-v2.rst |   17 -
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guide/cgroup-v2.rst
index bcc80269bb6a..1bb9a8f6ebe1 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1172,6 +1172,13 @@ PAGE_SIZE multiple when read back.
Under certain circumstances, the usage may go over the limit
temporarily.
 
+   In default configuration regular 0-order allocation always
+   succeed unless OOM killer choose current task as a victim.
+
+   Some kinds of allocations don't invoke the OOM killer.
+   Caller could retry them differently, return into userspace
+   as -ENOMEM or silently ignore in cases like disk readahead.
+
This is the ultimate protection mechanism.  As long as the
high limit is used and monitored properly, this limit's
utility is limited to providing the final safety net.
@@ -1228,17 +1235,9 @@ PAGE_SIZE multiple when read back.
The number of time the cgroup's memory usage was
reached the limit and allocation was about to fail.
 
-   Depending on context result could be invocation of OOM
-   killer and retrying allocation or failing allocation.
-
-   Failed allocation in its turn could be returned into
-   userspace as -ENOMEM or silently ignored in cases like
-   disk readahead.  For now OOM in memory cgroup kills
-   tasks iff shortage has happened inside page fault.
-
This event is not raised if the OOM killer is not
considered as an option, e.g. for failed high-order
-   allocations.
+   allocations or if caller asked to not retry attempts.
 
  oom_kill
The number of processes belonging to this cgroup

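The counters this patch documents are exported one per line in each cgroup's memory.events file. A small hedged parser (the sample counter values are made up; the key names are the cgroup v2 ones):

```python
def parse_memory_events(text):
    """Parse cgroup v2 memory.events ('<key> <count>' per line) into a dict."""
    return {key: int(count) for key, count in
            (line.split() for line in text.splitlines() if line.strip())}

# Example contents of /sys/fs/cgroup/<group>/memory.events (values illustrative).
sample = "low 0\nhigh 12\nmax 7\noom 3\noom_kill 1\n"
events = parse_memory_events(sample)
print(events["max"], events["oom_kill"])  # 7 1
```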


Re: mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-01 Thread Andrew Morton
On Fri, 1 May 2020 18:08:28 +0530 Naresh Kamboju  
wrote:

> mkfs -t ext4 invoked oom-killer on i386 kernel running on x86_64 device
> and started happening on linux -next master branch kernel tag next-20200430
> and next-20200501. We did not bisect this problem.

It would be wonderful if you could do so, please.  I can't immediately see
any MM change in this area which might cause this.

> metadata
>   git branch: master
>   git repo: 
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>   git commit: e4a08b64261ab411b15580c369a3b8fbed28bbc1
>   git describe: next-20200430
>   make_kernelversion: 5.7.0-rc3
>   kernel-config:
> https://builds.tuxbuild.com/1YrE_XUQ6odA52tSBM919w/kernel.config
> 
> Steps to reproduce: (always reproducible)

Reproducibility helps!

> oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,

> [   34.793430]  pagecache_get_page+0xae/0x260

> [   34.897923] active_anon:5366 inactive_anon:2172 isolated_anon:0
> [   34.897923]  active_file:4151 inactive_file:212494 isolated_file:0
> [   34.897923]  unevictable:0 dirty:16505 writeback:6520 unstable:0

> [ 34.987678] Normal free:3948kB min:7732kB low:8640kB high:9548kB
> reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
> active_file:1096kB inactive_file:786400kB unevictable:0kB
> writepending:65432kB present:884728kB managed:845576kB mlocked:0kB
> kernel_stack:1112kB pagetables:0kB bounce:0kB free_pcp:2908kB
> local_pcp:500kB free_cma:0kB

ZONE_NORMAL has a huge amount of clean pagecache stuck on the
inactive list, not being reclaimed.
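The reported numbers make the failure mechanical; a rough sketch of the watermark decision for ZONE_NORMAL (helper names are illustrative, and the real __zone_watermark_ok() also handles allocation flags, per-order free lists, and highatomic reserves):

```python
PAGE_KB = 4  # i386 page size in kB

def watermark_ok(free_kb, min_kb, lowmem_reserve_pages, highest_zoneidx):
    """An order-0 allocation may take from this zone only if free pages
    stay above the min watermark plus the lowmem reserve protecting the
    zone from requests that could have used higher zones."""
    free_pages = free_kb // PAGE_KB
    mark = min_kb // PAGE_KB
    return free_pages > mark + lowmem_reserve_pages[highest_zoneidx]

# ZONE_NORMAL from the log: free:3948kB min:7732kB,
# lowmem_reserve[]: 0 0 8980 0 (pages). Even with zero reserve applied,
# free (987 pages) is far below min (1933 pages), so only reclaim or the
# OOM killer can make progress.
print(watermark_ok(3948, 7732, [0, 0, 8980, 0], 1))  # -> False
```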


mm: mkfs.ext4 invoked oom-killer on i386 - pagecache_get_page

2020-05-01 Thread Naresh Kamboju
mkfs -t ext4 invoked oom-killer on i386 kernel running on x86_64 device
and started happening on linux -next master branch kernel tag next-20200430
and next-20200501. We did not bisect this problem.

metadata
  git branch: master
  git repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
  git commit: e4a08b64261ab411b15580c369a3b8fbed28bbc1
  git describe: next-20200430
  make_kernelversion: 5.7.0-rc3
  kernel-config:
https://builds.tuxbuild.com/1YrE_XUQ6odA52tSBM919w/kernel.config

Steps to reproduce: (always reproducible)
---
mkfs -t ext4 

Test log:

+ mkfs -t ext4 /dev/disk/by-id/ata-TOSHIBA_MG04ACA100N_Y8RQK14KF6XF
mke2fs 1.43.8 (1-Jan-2018)
Creating filesystem with 244190646 4k blocks and 61054976 inodes
Filesystem UUID: 05e8451c-1dd6-4d94-b030-0f806653e4b4
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 2048, 23887872, 71663616, 78675968,
10240, 214990848
Allocating group tables:0/7453   done
Writing inode tables:0/7453   done
Creating journal (262144 blocks): [   34.739137] mkfs.ext4 invoked
oom-killer: gfp_mask=0x101cc0(GFP_USER|__GFP_WRITE), order=0,
oom_score_adj=0
[   34.748889] CPU: 0 PID: 393 Comm: mkfs.ext4 Not tainted
5.7.0-rc3-next-20200430 #1
[   34.756450] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.2 05/23/2018
[   34.763844] Call Trace:
[   34.766305]  dump_stack+0x54/0x6e
[   34.769629]  dump_header+0x3d/0x1c6
[   34.773126]  ? oom_badness.part.0+0x10/0x120
[   34.777397]  ? ___ratelimit+0x8f/0xdc
[   34.781056]  oom_kill_process.cold+0x9/0xe
[   34.785152]  out_of_memory+0x1ab/0x260
[   34.788898]  __alloc_pages_nodemask+0xe0e/0xec0
[   34.793430]  pagecache_get_page+0xae/0x260
[   34.797521]  grab_cache_page_write_begin+0x1c/0x30
[   34.802303]  block_write_begin+0x1e/0x90
[   34.806222]  blkdev_write_begin+0x1e/0x20
[   34.810225]  ? bdev_evict_inode+0xd0/0xd0
[   34.814230]  generic_perform_write+0x97/0x180
[   34.818579]  __generic_file_write_iter+0x140/0x1f0
[   34.823365]  blkdev_write_iter+0xc0/0x190
[   34.827376]  __vfs_write+0x132/0x1e0
[   34.830947]  ? __audit_syscall_entry+0xa8/0xe0
[   34.835385]  vfs_write+0xa1/0x1a0
[   34.838696]  ksys_pwrite64+0x50/0x80
[   34.842267]  __ia32_sys_ia32_pwrite64+0x16/0x20
[   34.846798]  do_fast_syscall_32+0x6b/0x270
[   34.850890]  entry_SYSENTER_32+0xa5/0xf8
[   34.854805] EIP: 0xb7f0d549
[   34.857596] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01
10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f
34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90
8d 76
[   34.876334] EAX: ffda EBX: 0003 ECX: b7801010 EDX: 0040
[   34.882591] ESI: 3840 EDI: 0074 EBP: 07438400 ESP: bfd266f0
[   34.47] DS: 007b ES: 007b FS:  GS: 0033 SS: 007b EFLAGS: 0246
[   34.895630] Mem-Info:
[   34.897923] active_anon:5366 inactive_anon:2172 isolated_anon:0
[   34.897923]  active_file:4151 inactive_file:212494 isolated_file:0
[   34.897923]  unevictable:0 dirty:16505 writeback:6520 unstable:0
[   34.897923]  slab_reclaimable:5855 slab_unreclaimable:3531
[   34.897923]  mapped:6321 shmem:2236 pagetables:178 bounce:0
[   34.897923]  free:264202 free_pcp:1082 free_cma:0
[   34.931663] Node 0 active_anon:21464kB inactive_anon:8688kB
active_file:16604kB inactive_file:849976kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:25284kB dirty:58952kB
writeback:27772kB shmem:8944kB writeback_tmp:0kB unstable:0kB
all_unreclaimable? yes
[   34.955523] DMA free:3356kB min:68kB low:84kB high:100kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:0kB inactive_file:11964kB unevictable:0kB
writepending:11980kB present:15964kB managed:15876kB mlocked:0kB
kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
free_cma:0kB
[   34.983385] lowmem_reserve[]: 0 825 1947 825
[   34.987678] Normal free:3948kB min:7732kB low:8640kB high:9548kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:1096kB inactive_file:786400kB unevictable:0kB
writepending:65432kB present:884728kB managed:845576kB mlocked:0kB
kernel_stack:1112kB pagetables:0kB bounce:0kB free_pcp:2908kB
local_pcp:500kB free_cma:0kB
[   35.017427] lowmem_reserve[]: 0 0 8980 0
[   35.021362] HighMem free:1049496kB min:512kB low:1748kB high:2984kB
reserved_highatomic:0KB active_anon:21464kB inactive_anon:8688kB
active_file:15508kB inactive_file:51612kB unevictable:0kB
writepending:0kB present:1149540kB managed:1149540kB mlocked:0kB
kernel_stack:0kB pagetables:712kB bounce:0kB free_pcp:1524kB
local_pcp:292kB free_cma:0kB
[   35.051717] lowmem_reserve[]: 0 0 0 0
[   35.055374] DMA: 8*4kB (UE) 1*8kB (E) 1*16kB (E) 0*32kB 0*64kB
0*128kB 1*256kB (E) 0*512kB 1*1024kB (E) 1*2048kB (E) 0*4096kB =
3384kB
[   35.067446] Normal: 27*4kB (U) 23*8kB (U) 12*16kB (UE) 12*32kB

[PATCH 4.19 183/211] memcg, oom: dont require __GFP_FS when invoking memcg OOM killer

2019-10-03 Thread Greg Kroah-Hartman
From: Tetsuo Handa 

commit f9c645621a28e37813a1de96d9cbd89cde94a1e4 upstream.

Masoud Sharbiani noticed that commit 29ef680ae7c21110 ("memcg, oom: move
out_of_memory back to the charge path") broke memcg OOM called from
__xfs_filemap_fault() path.  It turned out that try_charge() is retrying
forever without making forward progress because mem_cgroup_oom(GFP_NOFS)
cannot invoke the OOM killer due to commit 3da88fb3bacfaa33 ("mm, oom:
move GFP_NOFS check to out_of_memory").

Allowing forced charge due to being unable to invoke memcg OOM killer will
lead to global OOM situation.  Also, just returning -ENOMEM will be risky
because OOM path is lost and some paths (e.g.  get_user_pages()) will leak
-ENOMEM.  Therefore, invoking the memcg OOM killer (despite GFP_NOFS) is
the only choice for now.

Until 29ef680ae7c21110, we were able to invoke memcg OOM killer when
GFP_KERNEL reclaim failed [1].  But since 29ef680ae7c21110, we need to
invoke memcg OOM killer when GFP_NOFS reclaim failed [2].  Although in the
past we did invoke the memcg OOM killer for GFP_NOFS [3], we might get
premature memcg OOM reports due to this patch.
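A toy model of the retry loop this patch fixes (all names and values are illustrative, not the kernel's actual interfaces): when reclaim makes no progress and the OOM killer is gated on __GFP_FS, a GFP_NOFS charge retries forever.

```python
# Model the try_charge() livelock: reclaim never makes progress, and
# when the OOM killer requires __GFP_FS, a GFP_NOFS charge loops forever.
__GFP_FS = 0x80  # illustrative flag value

def try_charge(gfp_mask, oom_needs_gfp_fs, max_retries=16):
    for attempt in range(1, max_retries + 1):
        reclaimed = 0  # model the pathological case: nothing reclaimable
        if reclaimed:
            return ("charged", attempt)
        if oom_needs_gfp_fs and not (gfp_mask & __GFP_FS):
            continue  # OOM killer skipped -> retry with no forward progress
        return ("oom", attempt)  # kill a victim; charge succeeds on retry
    return ("livelock", max_retries)

GFP_NOFS = 0  # lacks __GFP_FS in this model
print(try_charge(GFP_NOFS, oom_needs_gfp_fs=True))   # -> ('livelock', 16)
print(try_charge(GFP_NOFS, oom_needs_gfp_fs=False))  # -> ('oom', 1)
```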

[1]

 leaker invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), 
nodemask=(null), order=0, oom_score_adj=0
 CPU: 0 PID: 2746 Comm: leaker Not tainted 4.18.0+ #19
 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference 
Platform, BIOS 6.00 04/13/2018
 Call Trace:
  dump_stack+0x63/0x88
  dump_header+0x67/0x27a
  ? mem_cgroup_scan_tasks+0x91/0xf0
  oom_kill_process+0x210/0x410
  out_of_memory+0x10a/0x2c0
  mem_cgroup_out_of_memory+0x46/0x80
  mem_cgroup_oom_synchronize+0x2e4/0x310
  ? high_work_func+0x20/0x20
  pagefault_out_of_memory+0x31/0x76
  mm_fault_error+0x55/0x115
  ? handle_mm_fault+0xfd/0x220
  __do_page_fault+0x433/0x4e0
  do_page_fault+0x22/0x30
  ? page_fault+0x8/0x30
  page_fault+0x1e/0x30
 RIP: 0033:0x4009f0
 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 
40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 
c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
 RSP: 002b:7ffe29ae96f0 EFLAGS: 00010206
 RAX: 001b RBX:  RCX: 01ce1000
 RDX:  RSI: 7fe5 RDI: 
 RBP: 000c R08:  R09: 7f94be09220d
 R10: 0002 R11: 0246 R12: 000186a0
 R13: 0003 R14: 7f949d845000 R15: 0280
 Task in /leaker killed as a result of limit of /leaker
 memory: usage 524288kB, limit 524288kB, failcnt 158965
 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
 kmem: usage 2016kB, limit 9007199254740988kB, failcnt 0
 Memory cgroup stats for /leaker: cache:844KB rss:521136KB rss_huge:0KB 
shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB inactive_anon:0KB 
active_anon:521224KB inactive_file:1012KB active_file:8KB unevictable:0KB
 Memory cgroup out of memory: Kill process 2746 (leaker) score 998 or sacrifice 
child
 Killed process 2746 (leaker) total-vm:536704kB, anon-rss:521176kB, 
file-rss:1208kB, shmem-rss:0kB
 oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, 
shmem-rss:0kB

[2]

 leaker invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), nodemask=(null), 
order=0, oom_score_adj=0
 CPU: 1 PID: 2746 Comm: leaker Not tainted 4.18.0+ #20
 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference 
Platform, BIOS 6.00 04/13/2018
 Call Trace:
  dump_stack+0x63/0x88
  dump_header+0x67/0x27a
  ? mem_cgroup_scan_tasks+0x91/0xf0
  oom_kill_process+0x210/0x410
  out_of_memory+0x109/0x2d0
  mem_cgroup_out_of_memory+0x46/0x80
  try_charge+0x58d/0x650
  ? __radix_tree_replace+0x81/0x100
  mem_cgroup_try_charge+0x7a/0x100
  __add_to_page_cache_locked+0x92/0x180
  add_to_page_cache_lru+0x4d/0xf0
  iomap_readpages_actor+0xde/0x1b0
  ? iomap_zero_range_actor+0x1d0/0x1d0
  iomap_apply+0xaf/0x130
  iomap_readpages+0x9f/0x150
  ? iomap_zero_range_actor+0x1d0/0x1d0
  xfs_vm_readpages+0x18/0x20 [xfs]
  read_pages+0x60/0x140
  __do_page_cache_readahead+0x193/0x1b0
  ondemand_readahead+0x16d/0x2c0
  page_cache_async_readahead+0x9a/0xd0
  filemap_fault+0x403/0x620
  ? alloc_set_pte+0x12c/0x540
  ? _cond_resched+0x14/0x30
  __xfs_filemap_fault+0x66/0x180 [xfs]
  xfs_filemap_fault+0x27/0x30 [xfs]
  __do_fault+0x19/0x40
  __handle_mm_fault+0x8e8/0xb60
  handle_mm_fault+0xfd/0x220
  __do_page_fault+0x238/0x4e0
  do_page_fault+0x22/0x30
  ? page_fault+0x8/0x30
  page_fault+0x1e/0x30
 RIP: 0033:0x4009f0
 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 
40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 
c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
 RSP: 002b:7ffda45c9290 EFLAGS: 00010206
 RAX: 001b RBX:  RCX: 01a1e000
 RDX:  RSI: 7fe5 RDI: 
 RBP: 00

[PATCH 5.3 293/344] memcg, oom: dont require __GFP_FS when invoking memcg OOM killer

2019-10-03 Thread Greg Kroah-Hartman
From: Tetsuo Handa 

commit f9c645621a28e37813a1de96d9cbd89cde94a1e4 upstream.


[PATCH 5.2 267/313] memcg, oom: dont require __GFP_FS when invoking memcg OOM killer

2019-10-03 Thread Greg Kroah-Hartman
From: Tetsuo Handa 

commit f9c645621a28e37813a1de96d9cbd89cde94a1e4 upstream.


[PATCH 4.14 164/185] memcg, oom: dont require __GFP_FS when invoking memcg OOM killer

2019-10-03 Thread Greg Kroah-Hartman
From: Tetsuo Handa 

commit f9c645621a28e37813a1de96d9cbd89cde94a1e4 upstream.


Re: oom-killer

2019-08-06 Thread Michal Hocko
On Tue 06-08-19 20:39:22, Pankaj Suryawanshi wrote:
> On Tue, Aug 6, 2019 at 8:37 PM Michal Hocko  wrote:
> >
> > On Tue 06-08-19 20:24:03, Pankaj Suryawanshi wrote:
> > > On Tue, 6 Aug, 2019, 1:46 AM Michal Hocko,  wrote:
> > > >
> > > > On Mon 05-08-19 21:04:53, Pankaj Suryawanshi wrote:
> > > > > On Mon, Aug 5, 2019 at 5:35 PM Michal Hocko  wrote:
> > > > > >
> > > > > > On Mon 05-08-19 13:56:20, Vlastimil Babka wrote:
> > > > > > > On 8/5/19 1:24 PM, Michal Hocko wrote:
> > > > > > > >> [  727.954355] CPU: 0 PID: 56 Comm: kworker/u8:2 Tainted: P
> > > > > > > >>O  4.14.65 #606
> > > > > > > > [...]
> > > > > > > >> [  728.029390] [] (oom_kill_process) from 
> > > > > > > >> [] (out_of_memory+0x140/0x368)
> > > > > > > >> [  728.037569]  r10:0001 r9:c12169bc r8:0041 
> > > > > > > >> r7:c121e680 r6:c1216588 r5:dd347d7c > [  728.045392]  
> > > > > > > >> r4:d5737080
> > > > > > > >> [  728.047929] [] (out_of_memory) from []  
> > > > > > > >> (__alloc_pages_nodemask+0x1178/0x124c)
> > > > > > > >> [  728.056798]  r7:c141e7d0 r6:c12166a4 r5: r4:1155
> > > > > > > >> [  728.062460] [] (__alloc_pages_nodemask) from 
> > > > > > > >> [] (copy_process.part.5+0x114/0x1a28)
> > > > > > > >> [  728.071764]  r10: r9:dd358000 r8: 
> > > > > > > >> r7:c1447e08 r6:c1216588 r5:00808111
> > > > > > > >> [  728.079587]  r4:d1063c00
> > > > > > > >> [  728.082119] [] (copy_process.part.5) from 
> > > > > > > >> [] (_do_fork+0xd0/0x464)
> > > > > > > >> [  728.090034]  r10: r9: r8:dd008400 
> > > > > > > >> r7: r6:c1216588 r5:d2d58ac0
> > > > > > > >> [  728.097857]  r4:00808111
> > > > > > > >
> > > > > > > > The call trace tells that this is a fork (of a usermodehelper,
> > > > > > > > but that is not all that important).
> > > > > > > > [...]
> > > > > > > >> [  728.260031] DMA free:17960kB min:16384kB low:25664kB 
> > > > > > > >> high:29760kB active_anon:3556kB inactive_anon:0kB 
> > > > > > > >> active_file:280kB inactive_file:28kB unevictable:0kB 
> > > > > > > >> writepending:0kB present:458752kB managed:422896kB mlocked:0kB 
> > > > > > > >> kernel_stack:6496kB pagetables:9904kB bounce:0kB 
> > > > > > > >> free_pcp:348kB local_pcp:0kB free_cma:0kB
> > > > > > > >> [  728.287402] lowmem_reserve[]: 0 0 579 579
> > > > > > > >
> > > > > > > > So this is the only usable zone and you are close to the min 
> > > > > > > > watermark
> > > > > > > > which means that your system is under a serious memory pressure 
> > > > > > > > but not
> > > > > > > > yet under OOM for order-0 request. The situation is not great 
> > > > > > > > though
> > > > > > >
> > > > > > > Looking at lowmem_reserve above, wonder if 579 applies here? What 
> > > > > > > does
> > > > > > > /proc/zoneinfo say?
> > > > >
> > > > >
> > > > > What is  lowmem_reserve[]: 0 0 579 579 ?
> > > >
> > > > This controls how much memory from a lower zone an allocation
> > > > request for a higher zone may consume. E.g. __GFP_HIGHMEM is
> > > > allowed to use both lowmem and highmem zones. It is preferable to use
> > > > highmem zone because other requests are not allowed to use it.
> > > >
> > > > Please see __zone_watermark_ok for more details.
> > > >
> > > >
> > > > > > This is GFP_KERNEL request essentially so there shouldn't be any 
> > > > > > lowmem
> > > > > > reserve here, no?
> > > > >
> > > > >
> > > > > Why only low 1G is accessible by kernel in 32-bit system ?
> > >
> > >
> > > 1G virtual or physical memory (I have 2GB of RAM)?
> >
> > virtual
> >
>  If I set 2G/2G, will it still work?

It would reduce the amount of memory that userspace might use. It may
work for your particular case but the fundamental restriction is still
there.
-- 
Michal Hocko
SUSE Labs


Re: oom-killer

2019-08-06 Thread Michal Hocko
On Tue 06-08-19 20:25:51, Pankaj Suryawanshi wrote:
[...]
> lowmem reserve? Is it min_free_kbytes or something else?

Nope. Lowmem reserve is a measure to protect against allocations
targeting higher zones (have a look at setup_per_zone_lowmem_reserve).
The value for each zone depends on the amount of memory managed by the
zone and a ratio which can be tuned from userspace. min_free_kbytes
controls the reclaim watermarks.
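A sketch of that computation, assuming an i386-like three-zone layout and the default ratio values (function name, layout, and numbers here are illustrative, modeled on setup_per_zone_lowmem_reserve()):

```python
# For each zone, the reserve against allocations that may use zone
# index j is the managed memory of the zones above it (i+1..j), divided
# by this zone's lowmem_reserve_ratio entry.
def lowmem_reserves(managed_pages, ratios):
    nzones = len(managed_pages)
    reserves = []
    for i in range(nzones):
        row = [0] * nzones
        upper_total = 0
        for j in range(i + 1, nzones):
            upper_total += managed_pages[j]
            row[j] = upper_total // ratios[i]
        reserves.append(row)
    return reserves

# Example managed-page counts for DMA, Normal, HighMem (4kB pages),
# with the default ratios 256 (DMA) and 32 (lowmem).
managed = [15876 // 4, 845576 // 4, 1149540 // 4]
for row in lowmem_reserves(managed, [256, 32, 1]):
    print(row)
```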

-- 
Michal Hocko
SUSE Labs


Re: oom-killer

2019-08-06 Thread Pankaj Suryawanshi
On Tue, Aug 6, 2019 at 8:37 PM Michal Hocko  wrote:
>
> On Tue 06-08-19 20:24:03, Pankaj Suryawanshi wrote:
> > On Tue, 6 Aug, 2019, 1:46 AM Michal Hocko,  wrote:
> > >
> > > On Mon 05-08-19 21:04:53, Pankaj Suryawanshi wrote:
> > > > On Mon, Aug 5, 2019 at 5:35 PM Michal Hocko  wrote:
> > > > >
> > > > > On Mon 05-08-19 13:56:20, Vlastimil Babka wrote:
> > > > > > On 8/5/19 1:24 PM, Michal Hocko wrote:
> > > > > > >> [  727.954355] CPU: 0 PID: 56 Comm: kworker/u8:2 Tainted: P  
> > > > > > >>  O  4.14.65 #606
> > > > > > > [...]
> > > > > > >> [  728.029390] [] (oom_kill_process) from [] 
> > > > > > >> (out_of_memory+0x140/0x368)
> > > > > > >> [  728.037569]  r10:0001 r9:c12169bc r8:0041 r7:c121e680 
> > > > > > >> r6:c1216588 r5:dd347d7c > [  728.045392]  r4:d5737080
> > > > > > >> [  728.047929] [] (out_of_memory) from []  
> > > > > > >> (__alloc_pages_nodemask+0x1178/0x124c)
> > > > > > >> [  728.056798]  r7:c141e7d0 r6:c12166a4 r5: r4:1155
> > > > > > >> [  728.062460] [] (__alloc_pages_nodemask) from 
> > > > > > >> [] (copy_process.part.5+0x114/0x1a28)
> > > > > > >> [  728.071764]  r10: r9:dd358000 r8: r7:c1447e08 
> > > > > > >> r6:c1216588 r5:00808111
> > > > > > >> [  728.079587]  r4:d1063c00
> > > > > > >> [  728.082119] [] (copy_process.part.5) from 
> > > > > > >> [] (_do_fork+0xd0/0x464)
> > > > > > >> [  728.090034]  r10: r9: r8:dd008400 r7: 
> > > > > > >> r6:c1216588 r5:d2d58ac0
> > > > > > >> [  728.097857]  r4:00808111
> > > > > > >
> > > > > > > The call trace tells that this is a fork (of a usermodehelper,
> > > > > > > but that is not all that important).
> > > > > > > [...]
> > > > > > >> [  728.260031] DMA free:17960kB min:16384kB low:25664kB 
> > > > > > >> high:29760kB active_anon:3556kB inactive_anon:0kB 
> > > > > > >> active_file:280kB inactive_file:28kB unevictable:0kB 
> > > > > > >> writepending:0kB present:458752kB managed:422896kB mlocked:0kB 
> > > > > > >> kernel_stack:6496kB pagetables:9904kB bounce:0kB free_pcp:348kB 
> > > > > > >> local_pcp:0kB free_cma:0kB
> > > > > > >> [  728.287402] lowmem_reserve[]: 0 0 579 579
> > > > > > >
> > > > > > > So this is the only usable zone and you are close to the min 
> > > > > > > watermark
> > > > > > > which means that your system is under a serious memory pressure 
> > > > > > > but not
> > > > > > > yet under OOM for order-0 request. The situation is not great 
> > > > > > > though
> > > > > >
> > > > > > Looking at lowmem_reserve above, wonder if 579 applies here? What 
> > > > > > does
> > > > > > /proc/zoneinfo say?
> > > >
> > > >
> > > > What is  lowmem_reserve[]: 0 0 579 579 ?
> > >
> > > This controls how much memory from a lower zone an allocation
> > > request for a higher zone may consume. E.g. __GFP_HIGHMEM is
> > > allowed to use both lowmem and highmem zones. It is preferable to use
> > > highmem zone because other requests are not allowed to use it.
> > >
> > > Please see __zone_watermark_ok for more details.
> > >
> > >
> > > > > This is GFP_KERNEL request essentially so there shouldn't be any 
> > > > > lowmem
> > > > > reserve here, no?
> > > >
> > > >
> > > > Why is only the low 1G accessible by the kernel on a 32-bit system?
> >
> >
> > 1G of virtual or physical memory (I have 2GB of RAM)?
>
> virtual
>
If I set 2G/2G, will it still work?

>
> > > https://www.kernel.org/doc/gorman/, https://lwn.net/Articles/75174/
> > > and many more articles. In short, the 32-bit virtual address space
> > > is quite small and it has to cover both the user space and the
> > > kernel. That is why we split it into 3G reserved for userspace and 1G
> > > for the kernel. The kernel can only access its 1G portion directly;
> > > everything else has to be mapped explicitly (e.g. while data is copied).
> > > Thanks Michal.
> >
> >
> > >
> > > > My system configuration is :-
> > > > 3G/1G - vmsplit
> > > > vmalloc = 480M (I think vmalloc size will set your highmem ?)
> > >
> > > No, vmalloc is part of the 1GB kernel address space.
> >
> > I read in one article that vmalloc end is fixed; if you increase the
> > vmalloc size, does it decrease highmem?
> > Total = lowmem + (vmalloc + highmem)?
>
> As the kernel uses the vmalloc area _directly_, it has to be a part of
> the kernel address space - thus a larger vmalloc area reduces the lowmem.
> --
> Michal Hocko
> SUSE Labs


Re: oom-killer

2019-08-06 Thread Michal Hocko
On Tue 06-08-19 20:24:03, Pankaj Suryawanshi wrote:
> On Tue, 6 Aug, 2019, 1:46 AM Michal Hocko,  wrote:
> >
> > On Mon 05-08-19 21:04:53, Pankaj Suryawanshi wrote:
> > > On Mon, Aug 5, 2019 at 5:35 PM Michal Hocko  wrote:
> > > >
> > > > On Mon 05-08-19 13:56:20, Vlastimil Babka wrote:
> > > > > On 8/5/19 1:24 PM, Michal Hocko wrote:
> > > > > >> [  727.954355] CPU: 0 PID: 56 Comm: kworker/u8:2 Tainted: P
> > > > > >>O  4.14.65 #606
> > > > > > [...]
> > > > > >> [  728.029390] [] (oom_kill_process) from [] 
> > > > > >> (out_of_memory+0x140/0x368)
> > > > > >> [  728.037569]  r10:0001 r9:c12169bc r8:0041 r7:c121e680 
> > > > > >> r6:c1216588 r5:dd347d7c > [  728.045392]  r4:d5737080
> > > > > >> [  728.047929] [] (out_of_memory) from []  
> > > > > >> (__alloc_pages_nodemask+0x1178/0x124c)
> > > > > >> [  728.056798]  r7:c141e7d0 r6:c12166a4 r5: r4:1155
> > > > > >> [  728.062460] [] (__alloc_pages_nodemask) from 
> > > > > >> [] (copy_process.part.5+0x114/0x1a28)
> > > > > >> [  728.071764]  r10: r9:dd358000 r8: r7:c1447e08 
> > > > > >> r6:c1216588 r5:00808111
> > > > > >> [  728.079587]  r4:d1063c00
> > > > > >> [  728.082119] [] (copy_process.part.5) from 
> > > > > >> [] (_do_fork+0xd0/0x464)
> > > > > >> [  728.090034]  r10: r9: r8:dd008400 r7: 
> > > > > >> r6:c1216588 r5:d2d58ac0
> > > > > >> [  728.097857]  r4:00808111
> > > > > >
> > > > > > The call trace tells us that this is a fork (of a usermodehelper,
> > > > > > but that is not all that important).
> > > > > > [...]
> > > > > >> [  728.260031] DMA free:17960kB min:16384kB low:25664kB 
> > > > > >> high:29760kB active_anon:3556kB inactive_anon:0kB 
> > > > > >> active_file:280kB inactive_file:28kB unevictable:0kB 
> > > > > >> writepending:0kB present:458752kB managed:422896kB mlocked:0kB 
> > > > > >> kernel_stack:6496kB pagetables:9904kB bounce:0kB free_pcp:348kB 
> > > > > >> local_pcp:0kB free_cma:0kB
> > > > > >> [  728.287402] lowmem_reserve[]: 0 0 579 579
> > > > > >
> > > > > > So this is the only usable zone and you are close to the min 
> > > > > > watermark
> > > > > > which means that your system is under a serious memory pressure but 
> > > > > > not
> > > > > > yet under OOM for order-0 request. The situation is not great though
> > > > >
> > > > > Looking at lowmem_reserve above, wonder if 579 applies here? What does
> > > > > /proc/zoneinfo say?
> > >
> > >
> > > What is  lowmem_reserve[]: 0 0 579 579 ?
> >
> > This controls how much memory from a lower zone an allocation
> > request for a higher zone may consume. E.g. __GFP_HIGHMEM is
> > allowed to use both lowmem and highmem zones. It is preferable to use
> > highmem zone because other requests are not allowed to use it.
> >
> > Please see __zone_watermark_ok for more details.
> >
> >
> > > > This is GFP_KERNEL request essentially so there shouldn't be any lowmem
> > > > reserve here, no?
> > >
> > >
> > > Why is only the low 1G accessible by the kernel on a 32-bit system?
> 
> 
> 1G of virtual or physical memory (I have 2GB of RAM)?

virtual

> > https://www.kernel.org/doc/gorman/, https://lwn.net/Articles/75174/
> > and many more articles. In short, the 32-bit virtual address space
> > is quite small and it has to cover both the user space and the
> > kernel. That is why we split it into 3G reserved for userspace and 1G
> > for the kernel. The kernel can only access its 1G portion directly;
> > everything else has to be mapped explicitly (e.g. while data is copied).
> > Thanks Michal.
> 
> 
> >
> > > My system configuration is :-
> > > 3G/1G - vmsplit
> > > vmalloc = 480M (I think vmalloc size will set your highmem ?)
> >
> > No, vmalloc is part of the 1GB kernel address space.
> 
> I read in one article that vmalloc end is fixed; if you increase the
> vmalloc size, does it decrease highmem?
> Total = lowmem + (vmalloc + highmem)?

As the kernel uses the vmalloc area _directly_, it has to be a part of
the kernel address space - thus a larger vmalloc area reduces the lowmem.
-- 
Michal Hocko
SUSE Labs


Re: oom-killer

2019-08-06 Thread Pankaj Suryawanshi
On Tue, 6 Aug, 2019, 1:46 AM Michal Hocko,  wrote:
>
> On Mon 05-08-19 21:04:53, Pankaj Suryawanshi wrote:
> > On Mon, Aug 5, 2019 at 5:35 PM Michal Hocko  wrote:
> > >
> > > On Mon 05-08-19 13:56:20, Vlastimil Babka wrote:
> > > > On 8/5/19 1:24 PM, Michal Hocko wrote:
> > > > >> [  727.954355] CPU: 0 PID: 56 Comm: kworker/u8:2 Tainted: P  
> > > > >>  O  4.14.65 #606
> > > > > [...]
> > > > >> [  728.029390] [] (oom_kill_process) from [] 
> > > > >> (out_of_memory+0x140/0x368)
> > > > >> [  728.037569]  r10:0001 r9:c12169bc r8:0041 r7:c121e680 
> > > > >> r6:c1216588 r5:dd347d7c > [  728.045392]  r4:d5737080
> > > > >> [  728.047929] [] (out_of_memory) from []  
> > > > >> (__alloc_pages_nodemask+0x1178/0x124c)
> > > > >> [  728.056798]  r7:c141e7d0 r6:c12166a4 r5: r4:1155
> > > > >> [  728.062460] [] (__alloc_pages_nodemask) from 
> > > > >> [] (copy_process.part.5+0x114/0x1a28)
> > > > >> [  728.071764]  r10: r9:dd358000 r8: r7:c1447e08 
> > > > >> r6:c1216588 r5:00808111
> > > > >> [  728.079587]  r4:d1063c00
> > > > >> [  728.082119] [] (copy_process.part.5) from [] 
> > > > >> (_do_fork+0xd0/0x464)
> > > > >> [  728.090034]  r10: r9: r8:dd008400 r7: 
> > > > >> r6:c1216588 r5:d2d58ac0
> > > > >> [  728.097857]  r4:00808111
> > > > >
> > > > > The call trace tells us that this is a fork (of a usermodehelper,
> > > > > but that is not all that important).
> > > > > [...]
> > > > >> [  728.260031] DMA free:17960kB min:16384kB low:25664kB high:29760kB 
> > > > >> active_anon:3556kB inactive_anon:0kB active_file:280kB 
> > > > >> inactive_file:28kB unevictable:0kB writepending:0kB present:458752kB 
> > > > >> managed:422896kB mlocked:0kB kernel_stack:6496kB pagetables:9904kB 
> > > > >> bounce:0kB free_pcp:348kB local_pcp:0kB free_cma:0kB
> > > > >> [  728.287402] lowmem_reserve[]: 0 0 579 579
> > > > >
> > > > > So this is the only usable zone and you are close to the min watermark
> > > > > which means that your system is under a serious memory pressure but 
> > > > > not
> > > > > yet under OOM for order-0 request. The situation is not great though
> > > >
> > > > Looking at lowmem_reserve above, wonder if 579 applies here? What does
> > > > /proc/zoneinfo say?
> >
> >
> > What is  lowmem_reserve[]: 0 0 579 579 ?
>
> This controls how much memory from a lower zone an allocation
> request for a higher zone may consume. E.g. __GFP_HIGHMEM is
> allowed to use both lowmem and highmem zones. It is preferable to use
> highmem zone because other requests are not allowed to use it.
>
> Please see __zone_watermark_ok for more details.
>
>
> > > This is GFP_KERNEL request essentially so there shouldn't be any lowmem
> > > reserve here, no?
> >
> >
> > Why is only the low 1G accessible by the kernel on a 32-bit system?


1G of virtual or physical memory (I have 2GB of RAM)?
>
>
> https://www.kernel.org/doc/gorman/, https://lwn.net/Articles/75174/
> and many more articles. In short, the 32-bit virtual address space
> is quite small and it has to cover both the user space and the
> kernel. That is why we split it into 3G reserved for userspace and 1G
> for the kernel. The kernel can only access its 1G portion directly;
> everything else has to be mapped explicitly (e.g. while data is copied).
> Thanks Michal.


>
> > My system configuration is :-
> > 3G/1G - vmsplit
> > vmalloc = 480M (I think vmalloc size will set your highmem ?)
>
> No, vmalloc is part of the 1GB kernel address space.

I read in one article that vmalloc end is fixed; if you increase the
vmalloc size, does it decrease highmem?
Total = lowmem + (vmalloc + highmem)?
>
>
> --
> Michal Hocko
> SUSE Labs


[PATCH v3] memcg, oom: don't require __GFP_FS when invoking memcg OOM killer

2019-08-06 Thread Tetsuo Handa
Masoud Sharbiani noticed that commit 29ef680ae7c21110 ("memcg, oom: move
out_of_memory back to the charge path") broke memcg OOM called from
__xfs_filemap_fault() path. It turned out that try_charge() is retrying
forever without making forward progress because mem_cgroup_oom(GFP_NOFS)
cannot invoke the OOM killer due to commit 3da88fb3bacfaa33 ("mm, oom:
move GFP_NOFS check to out_of_memory").

Allowing a forced charge because the memcg OOM killer cannot be invoked
would lead to a global OOM situation. Also, just returning -ENOMEM would
be risky because the OOM path is lost and some paths (e.g.
get_user_pages()) would leak -ENOMEM. Therefore, invoking the memcg OOM
killer (despite GFP_NOFS) is the only choice we have for now.

Until 29ef680ae7c21110~1, we were able to invoke memcg OOM killer when
GFP_KERNEL reclaim failed [1]. But since 29ef680ae7c21110, we need to
invoke memcg OOM killer when GFP_NOFS reclaim failed [2]. Although in
the past we did invoke memcg OOM killer for GFP_NOFS [3], we might get
premature memcg OOM reports due to this patch.

Signed-off-by: Tetsuo Handa 
Reported-and-tested-by: Masoud Sharbiani 
Bisected-by: Masoud Sharbiani 
Acked-by: Michal Hocko 
Fixes: 3da88fb3bacfaa33 # necessary after 29ef680ae7c21110
Cc:  # 4.19+


[1]

 leaker invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), 
nodemask=(null), order=0, oom_score_adj=0
 CPU: 0 PID: 2746 Comm: leaker Not tainted 4.18.0+ #19
 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference 
Platform, BIOS 6.00 04/13/2018
 Call Trace:
  dump_stack+0x63/0x88
  dump_header+0x67/0x27a
  ? mem_cgroup_scan_tasks+0x91/0xf0
  oom_kill_process+0x210/0x410
  out_of_memory+0x10a/0x2c0
  mem_cgroup_out_of_memory+0x46/0x80
  mem_cgroup_oom_synchronize+0x2e4/0x310
  ? high_work_func+0x20/0x20
  pagefault_out_of_memory+0x31/0x76
  mm_fault_error+0x55/0x115
  ? handle_mm_fault+0xfd/0x220
  __do_page_fault+0x433/0x4e0
  do_page_fault+0x22/0x30
  ? page_fault+0x8/0x30
  page_fault+0x1e/0x30
 RIP: 0033:0x4009f0
 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 
40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 
c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
 RSP: 002b:7ffe29ae96f0 EFLAGS: 00010206
 RAX: 001b RBX:  RCX: 01ce1000
 RDX:  RSI: 7fe5 RDI: 
 RBP: 000c R08:  R09: 7f94be09220d
 R10: 0002 R11: 0246 R12: 000186a0
 R13: 0003 R14: 7f949d845000 R15: 0280
 Task in /leaker killed as a result of limit of /leaker
 memory: usage 524288kB, limit 524288kB, failcnt 158965
 memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
 kmem: usage 2016kB, limit 9007199254740988kB, failcnt 0
 Memory cgroup stats for /leaker: cache:844KB rss:521136KB rss_huge:0KB 
shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB inactive_anon:0KB 
active_anon:521224KB inactive_file:1012KB active_file:8KB unevictable:0KB
 Memory cgroup out of memory: Kill process 2746 (leaker) score 998 or sacrifice 
child
 Killed process 2746 (leaker) total-vm:536704kB, anon-rss:521176kB, 
file-rss:1208kB, shmem-rss:0kB
 oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, 
shmem-rss:0kB


[2]

 leaker invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), nodemask=(null), 
order=0, oom_score_adj=0
 CPU: 1 PID: 2746 Comm: leaker Not tainted 4.18.0+ #20
 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference 
Platform, BIOS 6.00 04/13/2018
 Call Trace:
  dump_stack+0x63/0x88
  dump_header+0x67/0x27a
  ? mem_cgroup_scan_tasks+0x91/0xf0
  oom_kill_process+0x210/0x410
  out_of_memory+0x109/0x2d0
  mem_cgroup_out_of_memory+0x46/0x80
  try_charge+0x58d/0x650
  ? __radix_tree_replace+0x81/0x100
  mem_cgroup_try_charge+0x7a/0x100
  __add_to_page_cache_locked+0x92/0x180
  add_to_page_cache_lru+0x4d/0xf0
  iomap_readpages_actor+0xde/0x1b0
  ? iomap_zero_range_actor+0x1d0/0x1d0
  iomap_apply+0xaf/0x130
  iomap_readpages+0x9f/0x150
  ? iomap_zero_range_actor+0x1d0/0x1d0
  xfs_vm_readpages+0x18/0x20 [xfs]
  read_pages+0x60/0x140
  __do_page_cache_readahead+0x193/0x1b0
  ondemand_readahead+0x16d/0x2c0
  page_cache_async_readahead+0x9a/0xd0
  filemap_fault+0x403/0x620
  ? alloc_set_pte+0x12c/0x540
  ? _cond_resched+0x14/0x30
  __xfs_filemap_fault+0x66/0x180 [xfs]
  xfs_filemap_fault+0x27/0x30 [xfs]
  __do_fault+0x19/0x40
  __handle_mm_fault+0x8e8/0xb60
  handle_mm_fault+0xfd/0x220
  __do_page_fault+0x238/0x4e0
  do_page_fault+0x22/0x30
  ? page_fault+0x8/0x30
  page_fault+0x1e/0x30
 RIP: 0033:0x4009f0
 Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 
40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 
c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
 RSP: 002b:7ffda45c9290 EFLAGS: 00010206
 RAX: 001b RBX: 000

Re: oom-killer

2019-08-06 Thread Vlastimil Babka
On 8/5/19 5:34 PM, Pankaj Suryawanshi wrote:
> On Mon, Aug 5, 2019 at 5:35 PM Michal Hocko  wrote:
>>
>> On Mon 05-08-19 13:56:20, Vlastimil Babka wrote:
>> > On 8/5/19 1:24 PM, Michal Hocko wrote:
>> > >> [  727.954355] CPU: 0 PID: 56 Comm: kworker/u8:2 Tainted: P   O 
>> > >>  4.14.65 #606
>> > > [...]
>> > >> [  728.029390] [] (oom_kill_process) from [] 
>> > >> (out_of_memory+0x140/0x368)
>> > >> [  728.037569]  r10:0001 r9:c12169bc r8:0041 r7:c121e680 
>> > >> r6:c1216588 r5:dd347d7c > [  728.045392]  r4:d5737080
>> > >> [  728.047929] [] (out_of_memory) from []  
>> > >> (__alloc_pages_nodemask+0x1178/0x124c)
>> > >> [  728.056798]  r7:c141e7d0 r6:c12166a4 r5: r4:1155
>> > >> [  728.062460] [] (__alloc_pages_nodemask) from [] 
>> > >> (copy_process.part.5+0x114/0x1a28)
>> > >> [  728.071764]  r10: r9:dd358000 r8: r7:c1447e08 
>> > >> r6:c1216588 r5:00808111
>> > >> [  728.079587]  r4:d1063c00
>> > >> [  728.082119] [] (copy_process.part.5) from [] 
>> > >> (_do_fork+0xd0/0x464)
>> > >> [  728.090034]  r10: r9: r8:dd008400 r7: 
>> > >> r6:c1216588 r5:d2d58ac0
>> > >> [  728.097857]  r4:00808111
>> > >
>> > > The call trace tells us that this is a fork (of a usermodehelper,
>> > > but that is not all that important).
>> > > [...]
>> > >> [  728.260031] DMA free:17960kB min:16384kB low:25664kB high:29760kB 
>> > >> active_anon:3556kB inactive_anon:0kB active_file:280kB 
>> > >> inactive_file:28kB unevictable:0kB writepending:0kB present:458752kB 
>> > >> managed:422896kB mlocked:0kB kernel_stack:6496kB pagetables:9904kB 
>> > >> bounce:0kB free_pcp:348kB local_pcp:0kB free_cma:0kB
>> > >> [  728.287402] lowmem_reserve[]: 0 0 579 579
>> > >
>> > > So this is the only usable zone and you are close to the min watermark
>> > > which means that your system is under a serious memory pressure but not
>> > > yet under OOM for order-0 request. The situation is not great though
>> >
>> > Looking at lowmem_reserve above, wonder if 579 applies here? What does
>> > /proc/zoneinfo say?
> 
> 
> What is  lowmem_reserve[]: 0 0 579 579 ?
> 
> $cat /proc/sys/vm/lowmem_reserve_ratio
> 256 32  32
> 
> $cat /proc/sys/vm/min_free_kbytes
> 16384
> 
> here is cat /proc/zoneinfo (in normal situation not when oom)

Thanks, that shows the lowmem reserve was indeed 0 for the GFP_KERNEL allocation
checking watermarks in the DMA zone. The zone was probably genuinely below min
watermark when the check happened, and things changed while the allocation
failure was printing memory info.


Re: oom-killer

2019-08-05 Thread Michal Hocko
On Mon 05-08-19 21:04:53, Pankaj Suryawanshi wrote:
> On Mon, Aug 5, 2019 at 5:35 PM Michal Hocko  wrote:
> >
> > On Mon 05-08-19 13:56:20, Vlastimil Babka wrote:
> > > On 8/5/19 1:24 PM, Michal Hocko wrote:
> > > >> [  727.954355] CPU: 0 PID: 56 Comm: kworker/u8:2 Tainted: P   
> > > >> O  4.14.65 #606
> > > > [...]
> > > >> [  728.029390] [] (oom_kill_process) from [] 
> > > >> (out_of_memory+0x140/0x368)
> > > >> [  728.037569]  r10:0001 r9:c12169bc r8:0041 r7:c121e680 
> > > >> r6:c1216588 r5:dd347d7c > [  728.045392]  r4:d5737080
> > > >> [  728.047929] [] (out_of_memory) from []  
> > > >> (__alloc_pages_nodemask+0x1178/0x124c)
> > > >> [  728.056798]  r7:c141e7d0 r6:c12166a4 r5: r4:1155
> > > >> [  728.062460] [] (__alloc_pages_nodemask) from [] 
> > > >> (copy_process.part.5+0x114/0x1a28)
> > > >> [  728.071764]  r10: r9:dd358000 r8: r7:c1447e08 
> > > >> r6:c1216588 r5:00808111
> > > >> [  728.079587]  r4:d1063c00
> > > >> [  728.082119] [] (copy_process.part.5) from [] 
> > > >> (_do_fork+0xd0/0x464)
> > > >> [  728.090034]  r10: r9: r8:dd008400 r7: 
> > > >> r6:c1216588 r5:d2d58ac0
> > > >> [  728.097857]  r4:00808111
> > > >
> > > > The call trace tells us that this is a fork (of a usermodehelper,
> > > > but that is not all that important).
> > > > [...]
> > > >> [  728.260031] DMA free:17960kB min:16384kB low:25664kB high:29760kB 
> > > >> active_anon:3556kB inactive_anon:0kB active_file:280kB 
> > > >> inactive_file:28kB unevictable:0kB writepending:0kB present:458752kB 
> > > >> managed:422896kB mlocked:0kB kernel_stack:6496kB pagetables:9904kB 
> > > >> bounce:0kB free_pcp:348kB local_pcp:0kB free_cma:0kB
> > > >> [  728.287402] lowmem_reserve[]: 0 0 579 579
> > > >
> > > > So this is the only usable zone and you are close to the min watermark
> > > > which means that your system is under a serious memory pressure but not
> > > > yet under OOM for order-0 request. The situation is not great though
> > >
> > > Looking at lowmem_reserve above, wonder if 579 applies here? What does
> > > /proc/zoneinfo say?
> 
> 
> What is  lowmem_reserve[]: 0 0 579 579 ?

This controls how much memory from a lower zone an allocation
request for a higher zone may consume. E.g. __GFP_HIGHMEM is
allowed to use both lowmem and highmem zones. It is preferable to use
highmem zone because other requests are not allowed to use it.

Please see __zone_watermark_ok for more details.

> > This is GFP_KERNEL request essentially so there shouldn't be any lowmem
> > reserve here, no?
> 
> 
> Why is only the low 1G accessible by the kernel on a 32-bit system?

https://www.kernel.org/doc/gorman/, https://lwn.net/Articles/75174/
and many more articles. In short, the 32-bit virtual address space
is quite small and it has to cover both the user space and the
kernel. That is why we split it into 3G reserved for userspace and 1G
for the kernel. The kernel can only access its 1G portion directly;
everything else has to be mapped explicitly (e.g. while data is copied).

> My system configuration is :-
> 3G/1G - vmsplit
> vmalloc = 480M (I think vmalloc size will set your highmem ?)

No, vmalloc is part of the 1GB kernel address space.

-- 
Michal Hocko
SUSE Labs


Re: oom-killer

2019-08-05 Thread Pankaj Suryawanshi
On Mon, Aug 5, 2019 at 5:35 PM Michal Hocko  wrote:
>
> On Mon 05-08-19 13:56:20, Vlastimil Babka wrote:
> > On 8/5/19 1:24 PM, Michal Hocko wrote:
> > >> [  727.954355] CPU: 0 PID: 56 Comm: kworker/u8:2 Tainted: P   O  
> > >> 4.14.65 #606
> > > [...]
> > >> [  728.029390] [] (oom_kill_process) from [] 
> > >> (out_of_memory+0x140/0x368)
> > >> [  728.037569]  r10:0001 r9:c12169bc r8:0041 r7:c121e680 
> > >> r6:c1216588 r5:dd347d7c > [  728.045392]  r4:d5737080
> > >> [  728.047929] [] (out_of_memory) from []  
> > >> (__alloc_pages_nodemask+0x1178/0x124c)
> > >> [  728.056798]  r7:c141e7d0 r6:c12166a4 r5: r4:1155
> > >> [  728.062460] [] (__alloc_pages_nodemask) from [] 
> > >> (copy_process.part.5+0x114/0x1a28)
> > >> [  728.071764]  r10: r9:dd358000 r8: r7:c1447e08 
> > >> r6:c1216588 r5:00808111
> > >> [  728.079587]  r4:d1063c00
> > >> [  728.082119] [] (copy_process.part.5) from [] 
> > >> (_do_fork+0xd0/0x464)
> > >> [  728.090034]  r10: r9: r8:dd008400 r7: 
> > >> r6:c1216588 r5:d2d58ac0
> > >> [  728.097857]  r4:00808111
> > >
> > > The call trace tells us that this is a fork (of a usermodehelper,
> > > but that is not all that important).
> > > [...]
> > >> [  728.260031] DMA free:17960kB min:16384kB low:25664kB high:29760kB 
> > >> active_anon:3556kB inactive_anon:0kB active_file:280kB 
> > >> inactive_file:28kB unevictable:0kB writepending:0kB present:458752kB 
> > >> managed:422896kB mlocked:0kB kernel_stack:6496kB pagetables:9904kB 
> > >> bounce:0kB free_pcp:348kB local_pcp:0kB free_cma:0kB
> > >> [  728.287402] lowmem_reserve[]: 0 0 579 579
> > >
> > > So this is the only usable zone and you are close to the min watermark
> > > which means that your system is under a serious memory pressure but not
> > > yet under OOM for order-0 request. The situation is not great though
> >
> > Looking at lowmem_reserve above, wonder if 579 applies here? What does
> > /proc/zoneinfo say?


What is  lowmem_reserve[]: 0 0 579 579 ?

$cat /proc/sys/vm/lowmem_reserve_ratio
256 32  32

$cat /proc/sys/vm/min_free_kbytes
16384

here is cat /proc/zoneinfo (in normal situation not when oom)

$cat /proc/zoneinfo
Node 0, zone  DMA
  per-node stats
  nr_inactive_anon 120
  nr_active_anon 94870
  nr_inactive_file 101188
  nr_active_file 74656
  nr_unevictable 614
  nr_slab_reclaimable 12489
  nr_slab_unreclaimable 8519
  nr_isolated_anon 0
  nr_isolated_file 0
  workingset_refault 7163
  workingset_activate 7163
  workingset_nodereclaim 0
  nr_anon_pages 94953
  nr_mapped109148
  nr_file_pages 176502
  nr_dirty 0
  nr_writeback 0
  nr_writeback_temp 0
  nr_shmem 166
  nr_shmem_hugepages 0
  nr_shmem_pmdmapped 0
  nr_anon_transparent_hugepages 0
  nr_unstable  0
  nr_vmscan_write 0
  nr_vmscan_immediate_reclaim 0
  nr_dirtied   7701
  nr_written   6978
  pages free 49492
min  4096
low  6416
high 7440
spanned  131072
present  114688
managed  105724
protection: (0, 0, 1491, 1491)
  nr_free_pages 49492
  nr_zone_inactive_anon 0
  nr_zone_active_anon 0
  nr_zone_inactive_file 65
  nr_zone_active_file 4859
  nr_zone_unevictable 0
  nr_zone_write_pending 0
  nr_mlock 0
  nr_page_table_pages 4352
  nr_kernel_stack 9056
  nr_bounce0
  nr_zspages   0
  nr_free_cma  0
  pagesets
cpu: 0
  count: 16
  high:  186
  batch: 31
  vm stats threshold: 18
cpu: 1
  count: 138
  high:  186
  batch: 31
  vm stats threshold: 18
cpu: 2
  count: 156
  high:  186
  batch: 31
  vm stats threshold: 18
cpu: 3
  count: 170
  high:  186
  batch: 31
  vm stats threshold: 18
  node_unreclaimable:  0
  start_pfn:   131072
  node_inactive_ratio: 0
Node 0, zone   Normal
  pages free 0
min  0
low  0
high 0
spanned  0
present  0
managed  0
protection: (0, 0, 11928, 11928)
Node 0, zone  HighMem
  pages free 63096
min  128
low  8506
high 12202
spanned  393216
present  381696
managed  381696
protection: (0, 0, 0, 0)
  nr_free_pages 63096
  nr_zone_inactive_anon 120
  nr_zone_active_anon 94863
  nr_zone_inactive_file 101123
  nr_zone_active_file 69797
  nr_zone_unevictable 614
  nr_zone_write_pending 0
  nr_mlock 614
  nr_page_table_pages 1478
  nr_kernel_stack 0
  nr_bounce0
  nr_zspages   0
  nr_free_cma  62429
  pagesets
cpu: 0
  count: 30
  high:  186
  batch: 31
  vm stats threshold: 30
cpu: 1
  count: 13
  

Re: oom-killer

2019-08-05 Thread Michal Hocko
On Mon 05-08-19 13:56:20, Vlastimil Babka wrote:
> On 8/5/19 1:24 PM, Michal Hocko wrote:
> >> [  727.954355] CPU: 0 PID: 56 Comm: kworker/u8:2 Tainted: P   O  
> >> 4.14.65 #606
> > [...]
> >> [  728.029390] [] (oom_kill_process) from [] 
> >> (out_of_memory+0x140/0x368)
> >> [  728.037569]  r10:0001 r9:c12169bc r8:0041 r7:c121e680 
> >> r6:c1216588 r5:dd347d7c > [  728.045392]  r4:d5737080
> >> [  728.047929] [] (out_of_memory) from []  
> >> (__alloc_pages_nodemask+0x1178/0x124c)
> >> [  728.056798]  r7:c141e7d0 r6:c12166a4 r5: r4:1155
> >> [  728.062460] [] (__alloc_pages_nodemask) from [] 
> >> (copy_process.part.5+0x114/0x1a28)
> >> [  728.071764]  r10: r9:dd358000 r8: r7:c1447e08 
> >> r6:c1216588 r5:00808111
> >> [  728.079587]  r4:d1063c00
> >> [  728.082119] [] (copy_process.part.5) from [] 
> >> (_do_fork+0xd0/0x464)
> >> [  728.090034]  r10: r9: r8:dd008400 r7: 
> >> r6:c1216588 r5:d2d58ac0
> >> [  728.097857]  r4:00808111
> > 
> > The call trace tells us that this is a fork (of a usermodehelper,
> > but that is not all that important).
> > [...]
> >> [  728.260031] DMA free:17960kB min:16384kB low:25664kB high:29760kB 
> >> active_anon:3556kB inactive_anon:0kB active_file:280kB inactive_file:28kB 
> >> unevictable:0kB writepending:0kB present:458752kB managed:422896kB 
> >> mlocked:0kB kernel_stack:6496kB pagetables:9904kB bounce:0kB 
> >> free_pcp:348kB local_pcp:0kB free_cma:0kB
> >> [  728.287402] lowmem_reserve[]: 0 0 579 579
> > 
> > So this is the only usable zone and you are close to the min watermark
> > which means that your system is under a serious memory pressure but not
> > yet under OOM for order-0 request. The situation is not great though
> 
> Looking at lowmem_reserve above, wonder if 579 applies here? What does
> /proc/zoneinfo say?

This is GFP_KERNEL request essentially so there shouldn't be any lowmem
reserve here, no?
-- 
Michal Hocko
SUSE Labs


Re: oom-killer

2019-08-05 Thread Vlastimil Babka
On 8/5/19 1:24 PM, Michal Hocko wrote:
>> [  727.954355] CPU: 0 PID: 56 Comm: kworker/u8:2 Tainted: P   O  
>> 4.14.65 #606
> [...]
>> [  728.029390] [] (oom_kill_process) from [] 
>> (out_of_memory+0x140/0x368)
>> [  728.037569]  r10:0001 r9:c12169bc r8:0041 r7:c121e680 r6:c1216588 
>> r5:dd347d7c > [  728.045392]  r4:d5737080
>> [  728.047929] [] (out_of_memory) from []  
>> (__alloc_pages_nodemask+0x1178/0x124c)
>> [  728.056798]  r7:c141e7d0 r6:c12166a4 r5: r4:1155
>> [  728.062460] [] (__alloc_pages_nodemask) from [] 
>> (copy_process.part.5+0x114/0x1a28)
>> [  728.071764]  r10: r9:dd358000 r8: r7:c1447e08 r6:c1216588 
>> r5:00808111
>> [  728.079587]  r4:d1063c00
>> [  728.082119] [] (copy_process.part.5) from [] 
>> (_do_fork+0xd0/0x464)
>> [  728.090034]  r10: r9: r8:dd008400 r7: r6:c1216588 
>> r5:d2d58ac0
>> [  728.097857]  r4:00808111
> 
> The call trace tells us that this is a fork (of a usermodehelper,
> but that is not all that important).
> [...]
>> [  728.260031] DMA free:17960kB min:16384kB low:25664kB high:29760kB 
>> active_anon:3556kB inactive_anon:0kB active_file:280kB inactive_file:28kB 
>> unevictable:0kB writepending:0kB present:458752kB managed:422896kB 
>> mlocked:0kB kernel_stack:6496kB pagetables:9904kB bounce:0kB free_pcp:348kB 
>> local_pcp:0kB free_cma:0kB
>> [  728.287402] lowmem_reserve[]: 0 0 579 579
> 
> So this is the only usable zone and you are close to the min watermark
> which means that your system is under a serious memory pressure but not
> yet under OOM for order-0 request. The situation is not great though

Looking at lowmem_reserve above, wonder if 579 applies here? What does
/proc/zoneinfo say?

> because there is close to no reclaimable memory (look at the *_anon,
> *_file counters) and it is quite likely that compaction will stumble
> over unmovable pages very often as well.
> 
>> [  728.326634] DMA: 71*4kB (EH) 113*8kB (UH) 207*16kB (UMH) 103*32kB (UMH) 
>> 70*64kB (UMH) 27*128kB (UMH) 5*256kB (UMH) 1*512kB (H) 0*1024kB 0*2048kB 
>> 0*4096kB 0*8192kB 0*16384kB = 17524kB
> 
> This is more interesting because there seem to be order-1+ blocks to
> be used for this allocation. H stands for High atomic reserve, U for
> unmovable blocks (a GFP_KERNEL allocation belongs to this type) and M
> for movable pageblocks (see show_migration_types for all migration
> types). From the above it would mean that the allocation should pass
> through but note that the information is dumped after the last watermark
> check so the situation might have changed.
> 
> In any case your system seems to be tight on the lowmem and I would
> expect it could get to OOM in peak memory demand on top of the current
> state.
> 



Re: oom-killer

2019-08-05 Thread Michal Hocko
On Sat 03-08-19 18:53:50, Pankaj Suryawanshi wrote:
> Hello,
> 
> Below are the logs from the oom-killer. I am not able to interpret/decode
> the logs, nor to find the root cause of the oom-killer.
> 
> Note: CPU Arch: Arm 32-bit , Kernel - 4.14.65

Fixed up line wrapping and trimmed to the bare minimum

> [  727.941258] kworker/u8:2 invoked oom-killer: 
> gfp_mask=0x15080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), nodemask=(null),  order=1, 
> oom_score_adj=0

This tells us that this is an order-1 (two physically contiguous pages)
request restricted to GFP_KERNEL (GFP_KERNEL_ACCOUNT is GFP_KERNEL |
__GFP_ACCOUNT) and that means that the request can be satisfied only
from the low memory zone. This is important because you are running 32b
system and that means that only low 1G is directly addressable by the
kernel. The rest is in highmem.

> [  727.954355] CPU: 0 PID: 56 Comm: kworker/u8:2 Tainted: P   O  
> 4.14.65 #606
[...]
> [  728.029390] [] (oom_kill_process) from [] 
> (out_of_memory+0x140/0x368)
> [  728.037569]  r10:0001 r9:c12169bc r8:0041 r7:c121e680 r6:c1216588 
> r5:dd347d7c > [  728.045392]  r4:d5737080
> [  728.047929] [] (out_of_memory) from []  
> (__alloc_pages_nodemask+0x1178/0x124c)
> [  728.056798]  r7:c141e7d0 r6:c12166a4 r5: r4:1155
> [  728.062460] [] (__alloc_pages_nodemask) from [] 
> (copy_process.part.5+0x114/0x1a28)
> [  728.071764]  r10: r9:dd358000 r8: r7:c1447e08 r6:c1216588 
> r5:00808111
> [  728.079587]  r4:d1063c00
> [  728.082119] [] (copy_process.part.5) from [] 
> (_do_fork+0xd0/0x464)
> [  728.090034]  r10: r9: r8:dd008400 r7: r6:c1216588 
> r5:d2d58ac0
> [  728.097857]  r4:00808111

The call trace tells us that this is a fork (of a usermodehelper,
but that is not all that important).
[...]
> [  728.260031] DMA free:17960kB min:16384kB low:25664kB high:29760kB 
> active_anon:3556kB inactive_anon:0kB active_file:280kB inactive_file:28kB 
> unevictable:0kB writepending:0kB present:458752kB managed:422896kB 
> mlocked:0kB kernel_stack:6496kB pagetables:9904kB bounce:0kB free_pcp:348kB 
> local_pcp:0kB free_cma:0kB
> [  728.287402] lowmem_reserve[]: 0 0 579 579

So this is the only usable zone and you are close to the min watermark,
which means that your system is under serious memory pressure but not
yet under OOM for an order-0 request. The situation is not great though,
because there is close to no reclaimable memory (look at the *_anon,
*_file counters) and it is quite likely that compaction will stumble
over unmovable pages very often as well.

> [  728.326634] DMA: 71*4kB (EH) 113*8kB (UH) 207*16kB (UMH) 103*32kB (UMH) 
> 70*64kB (UMH) 27*128kB (UMH) 5*256kB (UMH) 1*512kB (H) 0*1024kB 0*2048kB 
> 0*4096kB 0*8192kB 0*16384kB = 17524kB

This is more interesting because there seem to be order-1+ blocks to
be used for this allocation. H stands for High atomic reserve, U for
unmovable blocks (a GFP_KERNEL allocation belongs to this type) and M
for movable pageblocks (see show_migration_types for all migration
types). From the above it would mean that the allocation should pass
through but note that the information is dumped after the last watermark
check so the situation might have changed.
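As a sanity check, the per-order counts in such a dump can be re-summed. A small Python sketch (the parsing format is assumed from the line quoted above):

```python
import re

def buddy_free_kb(line):
    """Sum the per-order 'count*sizekB' entries of a show_mem() free
    list dump, ignoring the migratetype letters in parentheses."""
    return sum(int(n) * int(kb)
               for n, kb in re.findall(r"(\d+)\*(\d+)kB", line))

dma = ("71*4kB (EH) 113*8kB (UH) 207*16kB (UMH) 103*32kB (UMH) "
       "70*64kB (UMH) 27*128kB (UMH) 5*256kB (UMH) 1*512kB (H) "
       "0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB")
print(buddy_free_kb(dma))  # 17524, matching the '= 17524kB' total above
```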

In any case your system seems to be tight on lowmem and I would expect
it could hit OOM during peak memory demand on top of the current state.

-- 
Michal Hocko
SUSE Labs


[PATCH 5.0 032/246] memcg: killed threads should not invoke memcg OOM killer

2019-04-04 Thread Greg Kroah-Hartman
5.0-stable review patch.  If anyone has any objections, please let me know.

--

[ Upstream commit 7775face207922ea62a4e96b9cd45abfdc7b9840 ]

If a memory cgroup contains a single process with many threads
(including different process groups sharing the mm) then it is possible
to trigger a race where the oom killer complains that there are no
oom-eligible tasks, which is both annoying and confusing because there
is no actual problem.  The race looks as follows:

P1  oom_reaper  P2
try_charge  try_charge
  mem_cgroup_out_of_memory
mutex_lock(oom_lock)
  out_of_memory
oom_kill_process(P1,P2)
 wake_oom_reaper
mutex_unlock(oom_lock)
oom_reap_task
  mutex_lock(oom_lock)
select_bad_process 
# no victim

The problem is more visible with many threads.

Fix this by checking for fatal_signal_pending from
mem_cgroup_out_of_memory when the oom_lock is already held.

The oom bypass is safe because we do the same early in the try_charge
path already.  The situation might have changed in the meantime.  It
should be safe to check for fatal_signal_pending and tsk_is_oom_victim,
but for better code readability abstract the current charge bypass
condition into should_force_charge and reuse it from that path.

Link: 
http://lkml.kernel.org/r/01370f70-e1f6-ebe4-b95e-0df21a0bc...@i-love.sakura.ne.jp
Signed-off-by: Tetsuo Handa 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Cc: David Rientjes 
Cc: Kirill Tkhai 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin 
---
 mm/memcontrol.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index af7f18b32389..79a7d2a06bba 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -248,6 +248,12 @@ enum res_type {
 iter != NULL;  \
 iter = mem_cgroup_iter(NULL, iter, NULL))
 
+static inline bool should_force_charge(void)
+{
+   return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
+   (current->flags & PF_EXITING);
+}
+
 /* Some nice accessors for the vmpressure. */
 struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg)
 {
@@ -1389,8 +1395,13 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup 
*memcg, gfp_t gfp_mask,
};
bool ret;
 
-   mutex_lock(&oom_lock);
-   ret = out_of_memory(&oc);
+   if (mutex_lock_killable(&oom_lock))
+   return true;
+   /*
+* A few threads which were not waiting at mutex_lock_killable() can
+* fail to bail out. Therefore, check again after holding oom_lock.
+*/
+   ret = should_force_charge() || out_of_memory(&oc);
mutex_unlock(&oom_lock);
return ret;
 }
@@ -2209,9 +2220,7 @@ retry:
 * bypass the last charges so that they can exit quickly and
 * free their memory.
 */
-   if (unlikely(tsk_is_oom_victim(current) ||
-fatal_signal_pending(current) ||
-current->flags & PF_EXITING))
+   if (unlikely(should_force_charge()))
goto force;
 
/*
-- 
2.19.1





[PATCH 4.19 028/187] memcg: killed threads should not invoke memcg OOM killer

2019-04-04 Thread Greg Kroah-Hartman
4.19-stable review patch.  If anyone has any objections, please let me know.

--

[ Upstream commit 7775face207922ea62a4e96b9cd45abfdc7b9840 ]

If a memory cgroup contains a single process with many threads
(including different process groups sharing the mm) then it is possible
to trigger a race where the oom killer complains that there are no
oom-eligible tasks, which is both annoying and confusing because there
is no actual problem.  The race looks as follows:

P1  oom_reaper  P2
try_charge  try_charge
  mem_cgroup_out_of_memory
mutex_lock(oom_lock)
  out_of_memory
oom_kill_process(P1,P2)
 wake_oom_reaper
mutex_unlock(oom_lock)
oom_reap_task
  mutex_lock(oom_lock)
select_bad_process 
# no victim

The problem is more visible with many threads.

Fix this by checking for fatal_signal_pending from
mem_cgroup_out_of_memory when the oom_lock is already held.

The oom bypass is safe because we do the same early in the try_charge
path already.  The situation might have changed in the meantime.  It
should be safe to check for fatal_signal_pending and tsk_is_oom_victim,
but for better code readability abstract the current charge bypass
condition into should_force_charge and reuse it from that path.

Link: 
http://lkml.kernel.org/r/01370f70-e1f6-ebe4-b95e-0df21a0bc...@i-love.sakura.ne.jp
Signed-off-by: Tetsuo Handa 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Cc: David Rientjes 
Cc: Kirill Tkhai 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin 
---
 mm/memcontrol.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9518aefd8cbb..7c712c4565e6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -248,6 +248,12 @@ enum res_type {
 iter != NULL;  \
 iter = mem_cgroup_iter(NULL, iter, NULL))
 
+static inline bool should_force_charge(void)
+{
+   return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
+   (current->flags & PF_EXITING);
+}
+
 /* Some nice accessors for the vmpressure. */
 struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg)
 {
@@ -1382,8 +1388,13 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup 
*memcg, gfp_t gfp_mask,
};
bool ret;
 
-   mutex_lock(&oom_lock);
-   ret = out_of_memory(&oc);
+   if (mutex_lock_killable(&oom_lock))
+   return true;
+   /*
+* A few threads which were not waiting at mutex_lock_killable() can
+* fail to bail out. Therefore, check again after holding oom_lock.
+*/
+   ret = should_force_charge() || out_of_memory(&oc);
mutex_unlock(&oom_lock);
return ret;
 }
@@ -2200,9 +2211,7 @@ retry:
 * bypass the last charges so that they can exit quickly and
 * free their memory.
 */
-   if (unlikely(tsk_is_oom_victim(current) ||
-fatal_signal_pending(current) ||
-current->flags & PF_EXITING))
+   if (unlikely(should_force_charge()))
goto force;
 
/*
-- 
2.19.1





[PATCH AUTOSEL 5.0 036/262] memcg: killed threads should not invoke memcg OOM killer

2019-03-27 Thread Sasha Levin
From: Tetsuo Handa 

[ Upstream commit 7775face207922ea62a4e96b9cd45abfdc7b9840 ]

If a memory cgroup contains a single process with many threads
(including different process groups sharing the mm) then it is possible
to trigger a race where the oom killer complains that there are no
oom-eligible tasks, which is both annoying and confusing because there
is no actual problem.  The race looks as follows:

P1  oom_reaper  P2
try_charge  try_charge
  mem_cgroup_out_of_memory
mutex_lock(oom_lock)
  out_of_memory
oom_kill_process(P1,P2)
 wake_oom_reaper
mutex_unlock(oom_lock)
oom_reap_task
  mutex_lock(oom_lock)
select_bad_process 
# no victim

The problem is more visible with many threads.

Fix this by checking for fatal_signal_pending from
mem_cgroup_out_of_memory when the oom_lock is already held.

The oom bypass is safe because we do the same early in the try_charge
path already.  The situation might have changed in the meantime.  It
should be safe to check for fatal_signal_pending and tsk_is_oom_victim,
but for better code readability abstract the current charge bypass
condition into should_force_charge and reuse it from that path.

Link: 
http://lkml.kernel.org/r/01370f70-e1f6-ebe4-b95e-0df21a0bc...@i-love.sakura.ne.jp
Signed-off-by: Tetsuo Handa 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Cc: David Rientjes 
Cc: Kirill Tkhai 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin 
---
 mm/memcontrol.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index af7f18b32389..79a7d2a06bba 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -248,6 +248,12 @@ enum res_type {
 iter != NULL;  \
 iter = mem_cgroup_iter(NULL, iter, NULL))
 
+static inline bool should_force_charge(void)
+{
+   return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
+   (current->flags & PF_EXITING);
+}
+
 /* Some nice accessors for the vmpressure. */
 struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg)
 {
@@ -1389,8 +1395,13 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup 
*memcg, gfp_t gfp_mask,
};
bool ret;
 
-   mutex_lock(&oom_lock);
-   ret = out_of_memory(&oc);
+   if (mutex_lock_killable(&oom_lock))
+   return true;
+   /*
+* A few threads which were not waiting at mutex_lock_killable() can
+* fail to bail out. Therefore, check again after holding oom_lock.
+*/
+   ret = should_force_charge() || out_of_memory(&oc);
mutex_unlock(&oom_lock);
return ret;
 }
@@ -2209,9 +2220,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t 
gfp_mask,
 * bypass the last charges so that they can exit quickly and
 * free their memory.
 */
-   if (unlikely(tsk_is_oom_victim(current) ||
-fatal_signal_pending(current) ||
-current->flags & PF_EXITING))
+   if (unlikely(should_force_charge()))
goto force;
 
/*
-- 
2.19.1



[PATCH AUTOSEL 4.19 026/192] memcg: killed threads should not invoke memcg OOM killer

2019-03-27 Thread Sasha Levin
From: Tetsuo Handa 

[ Upstream commit 7775face207922ea62a4e96b9cd45abfdc7b9840 ]

If a memory cgroup contains a single process with many threads
(including different process groups sharing the mm) then it is possible
to trigger a race where the oom killer complains that there are no
oom-eligible tasks, which is both annoying and confusing because there
is no actual problem.  The race looks as follows:

P1  oom_reaper  P2
try_charge  try_charge
  mem_cgroup_out_of_memory
mutex_lock(oom_lock)
  out_of_memory
oom_kill_process(P1,P2)
 wake_oom_reaper
mutex_unlock(oom_lock)
oom_reap_task
  mutex_lock(oom_lock)
select_bad_process 
# no victim

The problem is more visible with many threads.

Fix this by checking for fatal_signal_pending from
mem_cgroup_out_of_memory when the oom_lock is already held.

The oom bypass is safe because we do the same early in the try_charge
path already.  The situation might have changed in the meantime.  It
should be safe to check for fatal_signal_pending and tsk_is_oom_victim,
but for better code readability abstract the current charge bypass
condition into should_force_charge and reuse it from that path.

Link: 
http://lkml.kernel.org/r/01370f70-e1f6-ebe4-b95e-0df21a0bc...@i-love.sakura.ne.jp
Signed-off-by: Tetsuo Handa 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Cc: David Rientjes 
Cc: Kirill Tkhai 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin 
---
 mm/memcontrol.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9518aefd8cbb..7c712c4565e6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -248,6 +248,12 @@ enum res_type {
 iter != NULL;  \
 iter = mem_cgroup_iter(NULL, iter, NULL))
 
+static inline bool should_force_charge(void)
+{
+   return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
+   (current->flags & PF_EXITING);
+}
+
 /* Some nice accessors for the vmpressure. */
 struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg)
 {
@@ -1382,8 +1388,13 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup 
*memcg, gfp_t gfp_mask,
};
bool ret;
 
-   mutex_lock(&oom_lock);
-   ret = out_of_memory(&oc);
+   if (mutex_lock_killable(&oom_lock))
+   return true;
+   /*
+* A few threads which were not waiting at mutex_lock_killable() can
+* fail to bail out. Therefore, check again after holding oom_lock.
+*/
+   ret = should_force_charge() || out_of_memory(&oc);
mutex_unlock(&oom_lock);
return ret;
 }
@@ -2200,9 +2211,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t 
gfp_mask,
 * bypass the last charges so that they can exit quickly and
 * free their memory.
 */
-   if (unlikely(tsk_is_oom_victim(current) ||
-fatal_signal_pending(current) ||
-current->flags & PF_EXITING))
+   if (unlikely(should_force_charge()))
goto force;
 
/*
-- 
2.19.1



Re: [PATCH] mm, oom: OOM killer use rss size without shmem

2019-02-22 Thread Johannes Weiner
On Fri, Feb 22, 2019 at 08:10:01AM +0100, Michal Hocko wrote:
> On Fri 22-02-19 13:37:33, Junil Lee wrote:
> > The oom killer uses get_mm_rss() to estimate how much memory will be
> > reclaimed when it selects a victim task.
> > 
> > However, the rss size returned by get_mm_rss() was changed by commit
> > "mm, shmem: add internal shmem resident memory accounting", which makes
> > get_mm_rss() return a size that includes SHMEM pages.
> 
> This was actually the case even before eca56ff906bdd because SHMEM was
> just accounted to MM_FILEPAGES so this commit hasn't changed much
> really.
> 
> Besides that we cannot simply rule out SHMEM pages. They are backing
> MAP_ANON|MAP_SHARED mappings which might be unmapped and freed during
> the oom victim's exit. Moreover this is essentially the same as for
> file-backed pages or even MAP_PRIVATE|MAP_ANON pages. Both can be
> pinned by other processes, e.g. private pages via CoW mappings and
> file pages by the filesystem, or simply mlocked by another process.
> So this rather gross evaluation will never be perfect. We would
> basically have to do an exact calculation of the freeable memory of
> each process, and that is just not feasible.
> 
> That being said, I do not think the patch is an improvement in that
> direction. It just replaces one fuzzy evaluation with another that
> potentially even misses a lot of memory.

You make good points.

I think it's also worth noting that while the OOM killer is ultimately
about freeing memory, the victim algorithm is not about finding the
*optimal* amount of memory to free, but to kill the thing that is most
likely to have put the system into trouble. We're not going for
killing the smallest tasks until we're barely back over the line and
operational again, but instead we're finding the biggest offender to
stop the most likely source of unsustainable allocations. That's why
our metric is called "badness score", and not "freeable" or similar.

So even if a good chunk of the biggest task's memory is tmpfs pages
that aren't necessarily freed upon kill, from a heuristics POV it's
still the best candidate to kill.
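For reference, the badness score mentioned above reduces to a simple sum (see oom_badness() in mm/oom_kill.c); this sketch ignores oom_score_adj, and the task names and numbers are made up:

```python
PAGE_SIZE = 4096

def oom_badness(rss_pages, swap_pages, pgtable_bytes):
    """points = rss + swap entries + page-table pages, i.e. roughly
    the memory the task pins, not the amount guaranteed freeable."""
    return rss_pages + swap_pages + pgtable_bytes // PAGE_SIZE

tasks = {
    "big-daemon": oom_badness(500_000, 20_000, 8 << 20),  # heavy user
    "shell": oom_badness(1_200, 0, 64 << 10),
}
print(max(tasks, key=tasks.get))  # big-daemon is the kill candidate
```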


Re: [PATCH] mm, oom: OOM killer use rss size without shmem

2019-02-21 Thread Michal Hocko
On Fri 22-02-19 13:37:33, Junil Lee wrote:
> The oom killer uses get_mm_rss() to estimate how much memory will be
> reclaimed when it selects a victim task.
> 
> However, the rss size returned by get_mm_rss() was changed by commit
> "mm, shmem: add internal shmem resident memory accounting", which makes
> get_mm_rss() return a size that includes SHMEM pages.

This was actually the case even before eca56ff906bdd because SHMEM was
just accounted to MM_FILEPAGES so this commit hasn't changed much
really.

Besides that we cannot simply rule out SHMEM pages. They are backing
MAP_ANON|MAP_SHARED mappings which might be unmapped and freed during
the oom victim's exit. Moreover this is essentially the same as for
file-backed pages or even MAP_PRIVATE|MAP_ANON pages. Both can be
pinned by other processes, e.g. private pages via CoW mappings and
file pages by the filesystem, or simply mlocked by another process.
So this rather gross evaluation will never be perfect. We would
basically have to do an exact calculation of the freeable memory of
each process, and that is just not feasible.

That being said, I do not think the patch is an improvement in that
direction. It just replaces one fuzzy evaluation with another that
potentially even misses a lot of memory.

> The oom killer can't free SHMEM pages directly after killing the victim
> process, which leads to miscalculated victim points.
> 
> Therefore, add a new API, get_mm_rss_wo_shmem(), which returns the rss
> value excluding SHMEM pages.
> 
> Signed-off-by: Junil Lee 
> ---
>  include/linux/mm.h | 6 ++
>  mm/oom_kill.c  | 4 ++--
>  2 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2d483db..bca3acc 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1701,6 +1701,12 @@ static inline int mm_counter(struct page *page)
>   return mm_counter_file(page);
>  }
>  
> +static inline unsigned long get_mm_rss_wo_shmem(struct mm_struct *mm)
> +{
> + return get_mm_counter(mm, MM_FILEPAGES) +
> + get_mm_counter(mm, MM_ANONPAGES);
> +}
> +
>  static inline unsigned long get_mm_rss(struct mm_struct *mm)
>  {
>   return get_mm_counter(mm, MM_FILEPAGES) +
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 3a24848..e569737 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -230,7 +230,7 @@ unsigned long oom_badness(struct task_struct *p, struct 
> mem_cgroup *memcg,
>* The baseline for the badness score is the proportion of RAM that each
>* task's rss, pagetable and swap space use.
>*/
> - points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
> + points = get_mm_rss_wo_shmem(p->mm) + get_mm_counter(p->mm, 
> MM_SWAPENTS) +
>   mm_pgtables_bytes(p->mm) / PAGE_SIZE;
>   task_unlock(p);
>  
> @@ -419,7 +419,7 @@ static void dump_tasks(struct mem_cgroup *memcg, const 
> nodemask_t *nodemask)
>  
>   pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu %5hd %s\n",
> task->pid, from_kuid(&init_user_ns, task_uid(task)),
> - task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
> + task->tgid, task->mm->total_vm, 
> get_mm_rss_wo_shmem(task->mm),
>   mm_pgtables_bytes(task->mm),
>   get_mm_counter(task->mm, MM_SWAPENTS),
>   task->signal->oom_score_adj, task->comm);
> -- 
> 2.6.2
> 

-- 
Michal Hocko
SUSE Labs


[PATCH] mm, oom: OOM killer use rss size without shmem

2019-02-21 Thread Junil Lee
The oom killer uses get_mm_rss() to estimate how much memory will be
reclaimed when it selects a victim task.

However, the rss size returned by get_mm_rss() was changed by commit
"mm, shmem: add internal shmem resident memory accounting", which makes
get_mm_rss() return a size that includes SHMEM pages.

The oom killer can't free SHMEM pages directly after killing the victim
process, which leads to miscalculated victim points.

Therefore, add a new API, get_mm_rss_wo_shmem(), which returns the rss
value excluding SHMEM pages.

Signed-off-by: Junil Lee 
---
 include/linux/mm.h | 6 ++
 mm/oom_kill.c  | 4 ++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2d483db..bca3acc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1701,6 +1701,12 @@ static inline int mm_counter(struct page *page)
return mm_counter_file(page);
 }
 
+static inline unsigned long get_mm_rss_wo_shmem(struct mm_struct *mm)
+{
+   return get_mm_counter(mm, MM_FILEPAGES) +
+   get_mm_counter(mm, MM_ANONPAGES);
+}
+
 static inline unsigned long get_mm_rss(struct mm_struct *mm)
 {
return get_mm_counter(mm, MM_FILEPAGES) +
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3a24848..e569737 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -230,7 +230,7 @@ unsigned long oom_badness(struct task_struct *p, struct 
mem_cgroup *memcg,
 * The baseline for the badness score is the proportion of RAM that each
 * task's rss, pagetable and swap space use.
 */
-   points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
+   points = get_mm_rss_wo_shmem(p->mm) + get_mm_counter(p->mm, 
MM_SWAPENTS) +
mm_pgtables_bytes(p->mm) / PAGE_SIZE;
task_unlock(p);
 
@@ -419,7 +419,7 @@ static void dump_tasks(struct mem_cgroup *memcg, const 
nodemask_t *nodemask)
 
pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu %5hd %s\n",
task->pid, from_kuid(&init_user_ns, task_uid(task)),
-   task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
+   task->tgid, task->mm->total_vm, 
get_mm_rss_wo_shmem(task->mm),
mm_pgtables_bytes(task->mm),
get_mm_counter(task->mm, MM_SWAPENTS),
task->signal->oom_score_adj, task->comm);
-- 
2.6.2



[PATCH 4.19 089/148] memcg, oom: notify on oom killer invocation from the charge path

2019-01-11 Thread Greg Kroah-Hartman
4.19-stable review patch.  If anyone has any objections, please let me know.

--

From: Michal Hocko 

commit 7056d3a37d2c6b10c13e8e69adc67ec1fc65 upstream.

Burt Holzman has noticed that memcg v1 doesn't notify about OOM events via
eventfd anymore.  The reason is that 29ef680ae7c2 ("memcg, oom: move
out_of_memory back to the charge path") has moved the oom handling back to
the charge path.  While doing so the notification was left behind in
mem_cgroup_oom_synchronize.

Fix the issue by replicating the oom hierarchy locking and the
notification.

Link: http://lkml.kernel.org/r/20181224091107.18354-1-mho...@kernel.org
Fixes: 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path")
Signed-off-by: Michal Hocko 
Reported-by: Burt Holzman 
Acked-by: Johannes Weiner 
Cc: Vladimir Davydov [4.19+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman 

---
 mm/memcontrol.c |   20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1666,6 +1666,9 @@ enum oom_status {
 
 static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, 
int order)
 {
+   enum oom_status ret;
+   bool locked;
+
if (order > PAGE_ALLOC_COSTLY_ORDER)
return OOM_SKIPPED;
 
@@ -1698,10 +1701,23 @@ static enum oom_status mem_cgroup_oom(st
return OOM_ASYNC;
}
 
+   mem_cgroup_mark_under_oom(memcg);
+
+   locked = mem_cgroup_oom_trylock(memcg);
+
+   if (locked)
+   mem_cgroup_oom_notify(memcg);
+
+   mem_cgroup_unmark_under_oom(memcg);
if (mem_cgroup_out_of_memory(memcg, mask, order))
-   return OOM_SUCCESS;
+   ret = OOM_SUCCESS;
+   else
+   ret = OOM_FAILED;
+
+   if (locked)
+   mem_cgroup_oom_unlock(memcg);
 
-   return OOM_FAILED;
+   return ret;
 }
 
 /**




[PATCH 4.20 09/65] memcg, oom: notify on oom killer invocation from the charge path

2019-01-11 Thread Greg Kroah-Hartman
4.20-stable review patch.  If anyone has any objections, please let me know.

--

From: Michal Hocko 

commit 7056d3a37d2c6b10c13e8e69adc67ec1fc65 upstream.

Burt Holzman has noticed that memcg v1 doesn't notify about OOM events via
eventfd anymore.  The reason is that 29ef680ae7c2 ("memcg, oom: move
out_of_memory back to the charge path") has moved the oom handling back to
the charge path.  While doing so the notification was left behind in
mem_cgroup_oom_synchronize.

Fix the issue by replicating the oom hierarchy locking and the
notification.

Link: http://lkml.kernel.org/r/20181224091107.18354-1-mho...@kernel.org
Fixes: 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path")
Signed-off-by: Michal Hocko 
Reported-by: Burt Holzman 
Acked-by: Johannes Weiner 
Cc: Vladimir Davydov [4.19+]
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman 

---
 mm/memcontrol.c |   20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1666,6 +1666,9 @@ enum oom_status {
 
 static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, 
int order)
 {
+   enum oom_status ret;
+   bool locked;
+
if (order > PAGE_ALLOC_COSTLY_ORDER)
return OOM_SKIPPED;
 
@@ -1700,10 +1703,23 @@ static enum oom_status mem_cgroup_oom(st
return OOM_ASYNC;
}
 
+   mem_cgroup_mark_under_oom(memcg);
+
+   locked = mem_cgroup_oom_trylock(memcg);
+
+   if (locked)
+   mem_cgroup_oom_notify(memcg);
+
+   mem_cgroup_unmark_under_oom(memcg);
if (mem_cgroup_out_of_memory(memcg, mask, order))
-   return OOM_SUCCESS;
+   ret = OOM_SUCCESS;
+   else
+   ret = OOM_FAILED;
+
+   if (locked)
+   mem_cgroup_oom_unlock(memcg);
 
-   return OOM_FAILED;
+   return ret;
 }
 
 /**




[PATCH] memcg, oom: notify on oom killer invocation from the charge path

2018-12-24 Thread Michal Hocko
From: Michal Hocko 

Burt Holzman has noticed that memcg v1 doesn't notify about OOM events
via eventfd anymore. The reason is that 29ef680ae7c2 ("memcg, oom: move
out_of_memory back to the charge path") has moved the oom handling back
to the charge path. While doing so the notification was left behind in
mem_cgroup_oom_synchronize.

Fix the issue by replicating the oom hierarchy locking and the
notification.

Reported-by: Burt Holzman 
Fixes: 29ef680ae7c2 ("memcg, oom: move out_of_memory back to the charge path")
Cc: stable # 4.19+
Acked-by: Johannes Weiner 
Signed-off-by: Michal Hocko 
---
Hi Andrew,
I forgot to CC you on the patch sent as a reply to the original bug
report [1] so I am reposting with Ack from Johannes. Burt has confirmed
this is resolving the regression for him [2]. 4.20 is out but I have
marked the patch for stable so it should hit both 4.19 and 4.20.

[1] http://lkml.kernel.org/r/20181221153302.gb6...@dhcp22.suse.cz
[2] 
http://lkml.kernel.org/r/96d4815c-420f-41b7-b1e9-a741e7523...@services.fnal.gov

 mm/memcontrol.c | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6e1469b80cb7..7e6bf74ddb1e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1666,6 +1666,9 @@ enum oom_status {
 
 static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, 
int order)
 {
+   enum oom_status ret;
+   bool locked;
+
if (order > PAGE_ALLOC_COSTLY_ORDER)
return OOM_SKIPPED;
 
@@ -1700,10 +1703,23 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup 
*memcg, gfp_t mask, int
return OOM_ASYNC;
}
 
+   mem_cgroup_mark_under_oom(memcg);
+
+   locked = mem_cgroup_oom_trylock(memcg);
+
+   if (locked)
+   mem_cgroup_oom_notify(memcg);
+
+   mem_cgroup_unmark_under_oom(memcg);
if (mem_cgroup_out_of_memory(memcg, mask, order))
-   return OOM_SUCCESS;
+   ret = OOM_SUCCESS;
+   else
+   ret = OOM_FAILED;
 
-   return OOM_FAILED;
+   if (locked)
+   mem_cgroup_oom_unlock(memcg);
+
+   return ret;
 }
 
 /**
-- 
2.19.2



Re: cgroup aware oom killer (was Re: [PATCH 0/3] introduce memory.oom.group)

2018-08-20 Thread Roman Gushchin
On Sun, Aug 19, 2018 at 04:26:50PM -0700, David Rientjes wrote:
> Roman, have you had time to go through this?

Hm, I thought we had finished this part of the discussion, no?
Anyway, let me repeat my position: I don't like the interface
you proposed in that follow-up patchset, and I explained why.
If you have a new proposal, please rebase it onto the current
mm tree and we can discuss it separately.
Alternatively, we can discuss the interface first (without
the implementation), but please start a new thread with a
fresh description of the proposed interface.

Thanks!

> 
> 
> On Tue, 7 Aug 2018, David Rientjes wrote:
> 
> > On Mon, 6 Aug 2018, Roman Gushchin wrote:
> > 
> > > > In a cgroup-aware oom killer world, yes, we need the ability to specify 
> > > > that the usage of the entire subtree should be compared as a single 
> > > > entity with other cgroups.  That is necessary for user subtrees but may 
> > > > not be necessary for top-level cgroups depending on how you structure 
> > > > your 
> > > > unified cgroup hierarchy.  So it needs to be configurable, as you 
> > > > suggest, 
> > > > and you are correct it can be different than oom.group.
> > > > 
> > > > That's not the only thing we need though, as I'm sure you were 
> > > > expecting 
> > > > me to say :)
> > > > 
> > > > We need the ability to preserve existing behavior, i.e. process based 
> > > > and 
> > > > not cgroup aware, for subtrees so that our users who have clear 
> > > > expectations and tune their oom_score_adj accordingly based on how the 
> > > > oom 
> > > > killer has always chosen processes for oom kill do not suddenly regress.
> > > 
> > > Isn't the combination of oom.group=0 and oom.evaluate_together=1 
> > > describing
> > > this case? This basically means that if memcg is selected as target,
> > > the process inside will be selected using traditional per-process 
> > > approach.
> > > 
> > 
> > No, that would overload the policy and mechanism.  We want the ability to 
> > consider user-controlled subtrees as a single entity for comparison with 
> > other user subtrees to select which subtree to target.  This does not 
> > imply that users want their entire subtree oom killed.
> > 
> > > > So we need to define the policy for a subtree that is oom, and I 
> > > > suggest 
> > > > we do that as a characteristic of the cgroup that is oom ("process" vs 
> > > > "cgroup", and process would be the default to preserve what currently 
> > > > happens in a user subtree).
> > > 
> > > I'm not entirely convinced here.
> > > I do agree, that some sub-tree may have a well tuned oom_score_adj,
> > > and it's preferable to keep the current behavior.
> > > 
> > > At the same time I don't like the idea of looking at the policy of the
> > > OOMing cgroup. Why should exceeding one limit be handled differently
> > > from exceeding another? This seems to be a property of the workload,
> > > not of a limit.
> > > 
> > 
> > The limit is the property of the mem cgroup, so it's logical that the 
> > policy when reaching that limit is a property of the same mem cgroup.
> > Using the user-controlled subtree example, if we have /david and /roman, 
> > we can define our own policies on oom, we are not restricted to cgroup 
> > aware selection on the entire hierarchy.  /david/oom.policy can be 
> > "process" so that I haven't regressed with earlier kernels, and 
> > /roman/oom.policy can be "cgroup" to target the largest cgroup in your 
> > subtree.
> > 
> > Something needs to be oom killed when a mem cgroup at any level in the 
> > hierarchy is reached and reclaim has failed.  What to do when that limit 
> > is reached is a property of that cgroup.
> > 
> > > > Now, as users who rely on process selection are well aware, we have 
> > > > oom_score_adj to influence the decision of which process to oom kill.  
> > > > If 
> > > > our oom subtree is cgroup aware, we should have the ability to likewise 
> > > > influence that decision.  For example, we have high priority 
> > > > applications 
> > > > that run at the top-level that use a lot of memory and strictly oom 
> > > > killing them in all scenarios because they use a lot of memory isn't 
> > > > appropriate.  We need to be able to adjust the comparison of

Re: cgroup aware oom killer (was Re: [PATCH 0/3] introduce memory.oom.group)

2018-08-20 Thread Roman Gushchin
On Sun, Aug 19, 2018 at 04:26:50PM -0700, David Rientjes wrote:
> Roman, have you had time to go through this?

Hm, I thought we've finished this part of discussion, no?
Anyway, let me repeat my position: I don't like the interface
you've proposed in that follow-up patchset, and I explained why.
If you've a new proposal, please, rebase it to the current
mm tree, and we can discuss it separately.
Alternatively, we can discuss the interface first (without
the implementation), but, please, make a new thread with a
fresh description of a proposed interface.

Thanks!

> 
> 
> On Tue, 7 Aug 2018, David Rientjes wrote:
> 
> > On Mon, 6 Aug 2018, Roman Gushchin wrote:
> > 
> > > > In a cgroup-aware oom killer world, yes, we need the ability to specify 
> > > > that the usage of the entire subtree should be compared as a single 
> > > > entity with other cgroups.  That is necessary for user subtrees but may 
> > > > not be necessary for top-level cgroups depending on how you structure your 
> > > > unified cgroup hierarchy.  So it needs to be configurable, as you suggest, 
> > > > and you are correct it can be different than oom.group.
> > > > 
> > > > That's not the only thing we need though, as I'm sure you were expecting 
> > > > me to say :)
> > > > 
> > > > We need the ability to preserve existing behavior, i.e. process based and 
> > > > not cgroup aware, for subtrees so that our users who have clear 
> > > > expectations and tune their oom_score_adj accordingly based on how the oom 
> > > > killer has always chosen processes for oom kill do not suddenly regress.
> > > 
> > > Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing
> > > this case? This basically means that if memcg is selected as target,
> > > the process inside will be selected using the traditional per-process approach.
> > > 
> > 
> > No, that would overload the policy and mechanism.  We want the ability to 
> > consider user-controlled subtrees as a single entity for comparison with 
> > other user subtrees to select which subtree to target.  This does not 
> > imply that users want their entire subtree oom killed.
> > 
> > > > So we need to define the policy for a subtree that is oom, and I suggest 
> > > > we do that as a characteristic of the cgroup that is oom ("process" vs 
> > > > "cgroup", and process would be the default to preserve what currently 
> > > > happens in a user subtree).
> > > 
> > > I'm not entirely convinced here.
> > > I do agree that some sub-tree may have a well tuned oom_score_adj,
> > > and it's preferable to keep the current behavior.
> > > 
> > > At the same time I don't like the idea to look at the policy of the OOMing
> > > cgroup. Why should exceeding one limit be handled differently from exceeding
> > > another? This seems to be a property of the workload, not of a limit.
> > > 
> > 
> > The limit is the property of the mem cgroup, so it's logical that the 
> > policy when reaching that limit is a property of the same mem cgroup.
> > Using the user-controlled subtree example, if we have /david and /roman, 
> > we can define our own policies on oom, we are not restricted to cgroup 
> > aware selection on the entire hierarchy.  /david/oom.policy can be 
> > "process" so that I haven't regressed with earlier kernels, and 
> > /roman/oom.policy can be "cgroup" to target the largest cgroup in your 
> > subtree.
> > 
> > Something needs to be oom killed when a mem cgroup at any level in the 
> > hierarchy is reached and reclaim has failed.  What to do when that limit 
> > is reached is a property of that cgroup.
> > 
> > > > Now, as users who rely on process selection are well aware, we have 
> > > > oom_score_adj to influence the decision of which process to oom kill.  If 
> > > > our oom subtree is cgroup aware, we should have the ability to likewise 
> > > > influence that decision.  For example, we have high priority applications 
> > > > that run at the top-level that use a lot of memory and strictly oom 
> > > > killing them in all scenarios because they use a lot of memory isn't 
> > > > appropriate.  We need to be able to adjust the comparison of

cgroup aware oom killer (was Re: [PATCH 0/3] introduce memory.oom.group)

2018-08-19 Thread David Rientjes
Roman, have you had time to go through this?


On Tue, 7 Aug 2018, David Rientjes wrote:

> On Mon, 6 Aug 2018, Roman Gushchin wrote:
> 
> > > In a cgroup-aware oom killer world, yes, we need the ability to specify 
> > > that the usage of the entire subtree should be compared as a single 
> > > entity with other cgroups.  That is necessary for user subtrees but may 
> > > not be necessary for top-level cgroups depending on how you structure 
> > > your 
> > > unified cgroup hierarchy.  So it needs to be configurable, as you 
> > > suggest, 
> > > and you are correct it can be different than oom.group.
> > > 
> > > That's not the only thing we need though, as I'm sure you were expecting 
> > > me to say :)
> > > 
> > > We need the ability to preserve existing behavior, i.e. process based and 
> > > not cgroup aware, for subtrees so that our users who have clear 
> > > expectations and tune their oom_score_adj accordingly based on how the 
> > > oom 
> > > killer has always chosen processes for oom kill do not suddenly regress.
> > 
> > Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing
> > this case? This basically means that if memcg is selected as target,
> > the process inside will be selected using traditional per-process approach.
> > 
> 
> No, that would overload the policy and mechanism.  We want the ability to 
> consider user-controlled subtrees as a single entity for comparison with 
> other user subtrees to select which subtree to target.  This does not 
> imply that users want their entire subtree oom killed.
> 
> > > So we need to define the policy for a subtree that is oom, and I suggest 
> > > we do that as a characteristic of the cgroup that is oom ("process" vs 
> > > "cgroup", and process would be the default to preserve what currently 
> > > happens in a user subtree).
> > 
> > I'm not entirely convinced here.
> > I do agree, that some sub-tree may have a well tuned oom_score_adj,
> > and it's preferable to keep the current behavior.
> > 
> > At the same time I don't like the idea to look at the policy of the OOMing
> > cgroup. Why should exceeding one limit be handled differently from exceeding
> > another? This seems to be a property of the workload, not of a limit.
> > 
> 
> The limit is the property of the mem cgroup, so it's logical that the 
> policy when reaching that limit is a property of the same mem cgroup.
> Using the user-controlled subtree example, if we have /david and /roman, 
> we can define our own policies on oom, we are not restricted to cgroup 
> aware selection on the entire hierarchy.  /david/oom.policy can be 
> "process" so that I haven't regressed with earlier kernels, and 
> /roman/oom.policy can be "cgroup" to target the largest cgroup in your 
> subtree.
> 
> Something needs to be oom killed when a mem cgroup at any level in the 
> hierarchy is reached and reclaim has failed.  What to do when that limit 
> is reached is a property of that cgroup.
> 
> > > Now, as users who rely on process selection are well aware, we have 
> > > oom_score_adj to influence the decision of which process to oom kill.  If 
> > > our oom subtree is cgroup aware, we should have the ability to likewise 
> > > influence that decision.  For example, we have high priority applications 
> > > that run at the top-level that use a lot of memory and strictly oom 
> > > killing them in all scenarios because they use a lot of memory isn't 
> > > appropriate.  We need to be able to adjust the comparison of a cgroup (or 
> > > subtree) when compared to other cgroups.
> > > 
> > > I've also suggested, but did not implement in my patchset because I was 
> > > trying to define the API and find common ground first, that we have a 
> > > need 
> > > for priority based selection.  In other words, define the priority of a 
> > > subtree regardless of cgroup usage.
> > > 
> > > So with these four things, we have
> > > 
> > >  - an "oom.policy" tunable to define "cgroup" or "process" for that 
> > >subtree (and plans for "priority" in the future),
> > > 
> > >  - your "oom.evaluate_as_group" tunable to account the usage of the
> > >subtree as the cgroup's own usage for comparison with others,
> > > 
> > >  - an "oom.adj" to adjust the usage of the cgroup (local or subtree)
> > >to protect important applicat

Re: [PATCH v13 0/7] cgroup-aware OOM killer

2018-07-16 Thread Tetsuo Handa
Roman Gushchin wrote:
> On Tue, Jul 17, 2018 at 06:13:47AM +0900, Tetsuo Handa wrote:
> > No response from Roman and David...
> > 
> > Andrew, will you once drop Roman's cgroup-aware OOM killer and David's 
> > patches?
> > Roman's series has a bug which I mentioned and which can be avoided by my 
> > patch.
> > David's patch is using MMF_UNSTABLE incorrectly such that it might start 
> > selecting
> > next OOM victim without trying to reclaim any memory.
> > 
> > Since they are not responding to my mail, I suggest once dropping from 
> > linux-next.
> 
> I was in cc, and didn't think that you were expecting something from me.

Oops. I was waiting for your response. ;-)

  But Roman, my patch conflicts with your "mm, oom: cgroup-aware OOM killer" patch
  in linux-next. And it seems to me that your patch contains a bug which leads to
  premature memory allocation failure explained below.

  Can we apply my patch prior to your "mm, oom: cgroup-aware OOM killer" patch
  (which eliminates "delay" and "out:" from your patch) so that people can easily
  backport my patch? Or, do you want to apply a fix (which eliminates "delay" and
  "out:" from linux-next) prior to my patch?

> 
> I don't get, why it's necessary to drop the cgroup oom killer to merge your 
> fix?
> I'm happy to help with rebasing and everything else.

Yes, I wish you would rebase your series on top of the OOM lockup (CVE-2016-10723)
mitigation patch ( https://marc.info/?l=linux-mm=153112243424285=4 ). It is a
trivial change and easy to cleanly backport (if applied before your series).

Also, I expect you to check whether my cleanup patch which removes the "abort" path
( [PATCH 1/2] at https://marc.info/?l=linux-mm=153119509215026=4 ) helps
simplify your series. I don't know the detailed behavior of your series, but I
assume that it does not kill threads whose MMF_OOM_SKIP the current thread
should not wait for.


[patch v3 -mm 0/6] rewrite cgroup aware oom killer for general use

2018-07-13 Thread David Rientjes
There are three significant concerns about the cgroup aware oom killer as
it is implemented in -mm:

 (1) allows users to evade the oom killer by creating subcontainers or
 using other controllers since scoring is done per cgroup and not
 hierarchically,

 (2) unfairly compares the root mem cgroup using completely different
 criteria than leaf mem cgroups and allows wildly inaccurate results
 if oom_score_adj is used, and

 (3) does not allow the user to influence the decisionmaking, such that
 important subtrees cannot be preferred or biased.

This patchset fixes (1) and (2) completely and, by doing so, introduces a
completely extensible user interface that can be expanded in the future.

Concern (3) could subsequently be addressed either before or after the
cgroup-aware oom killer feature is merged.

It preserves all functionality that currently exists in -mm and extends
it to be generally useful outside of very specialized usecases.

It eliminates the mount option for the cgroup aware oom killer entirely
since it is now enabled through the root mem cgroup's oom policy.
---
v3:
 - Rebased to next-20180713

v2:
 - Rebased to next-20180322
 - Fixed get_nr_swap_pages() build bug found by kbuild test robot

 Documentation/admin-guide/cgroup-v2.rst | 100 ++-
 include/linux/cgroup-defs.h |   5 -
 include/linux/memcontrol.h  |  21 +++
 kernel/cgroup/cgroup.c  |  13 +-
 mm/memcontrol.c | 221 ++--
 5 files changed, 204 insertions(+), 156 deletions(-)


[patch v3 -mm 2/6] mm, memcg: replace cgroup aware oom killer mount option with tunable

2018-07-13 Thread David Rientjes
Now that each mem cgroup on the system has a memory.oom_policy tunable to
specify oom kill selection behavior, remove the needless "groupoom" mount
option that requires (1) the entire system to be forced, perhaps
unnecessarily, perhaps unexpectedly, into a single oom policy that
differs from the traditional per process selection, and (2) a remount to
change.

Instead of enabling the cgroup aware oom killer with the "groupoom" mount
option, set the mem cgroup subtree's memory.oom_policy to "cgroup".

The heuristic used to select a process or cgroup to kill from is
controlled by the oom mem cgroup's memory.oom_policy.  This means that if
a descendant mem cgroup has an oom policy of "none", for example, and an
oom condition originates in an ancestor with an oom policy of "cgroup",
the selection logic will treat all descendant cgroups as indivisible
memory consumers.

For example, consider an example where each mem cgroup has "memory" set
in cgroup.controllers:

mem cgroup      cgroup.procs
==========      ============
/cg1            1 process consuming 250MB
/cg2            3 processes consuming 100MB each
/cg3/cg31       2 processes consuming 100MB each
/cg3/cg32       2 processes consuming 100MB each

If the root mem cgroup's memory.oom_policy is "none", the process from
/cg1 is chosen as the victim.  If memory.oom_policy is "cgroup", a process
from /cg2 is chosen because it is in the single indivisible memory
consumer with the greatest usage.  This policy of "cgroup" is identical to
the current "groupoom" mount option, now removed.

Note that /cg3 is not the chosen victim when the oom mem cgroup policy is
"cgroup" because cgroups are treated individually without regard to
hierarchical /cg3/memory.current usage.  This will be addressed in a
follow-up patch.

This has the added benefit of allowing descendant cgroups to control their
own oom policies if they have memory.oom_policy file permissions without
being restricted to the system-wide policy.  In the above example, /cg2
and /cg3 can be either "none" or "cgroup" with the same results: the
selection heuristic depends only on the policy of the oom mem cgroup.  If
/cg2 or /cg3 themselves are oom, however, the policy is controlled by
their own oom policies, either process aware or cgroup aware.

Signed-off-by: David Rientjes 
---
 Documentation/admin-guide/cgroup-v2.rst | 78 +
 include/linux/cgroup-defs.h |  5 --
 include/linux/memcontrol.h  |  5 ++
 kernel/cgroup/cgroup.c  | 13 +
 mm/memcontrol.c | 19 +++---
 5 files changed, 56 insertions(+), 64 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1109,6 +1109,17 @@ PAGE_SIZE multiple when read back.
Documentation/filesystems/proc.txt).  This is the same policy as if
memory cgroups were not even mounted.
 
+   If "cgroup", the OOM killer will compare mem cgroups as indivisible
+   memory consumers; that is, they will compare mem cgroup usage rather
+   than process memory footprint.  See the "OOM Killer" section below.
+
+   When an OOM condition occurs, the policy is dictated by the mem
+   cgroup that is OOM (the root mem cgroup for a system-wide OOM
+   condition).  If a descendant mem cgroup has a policy of "none", for
+   example, for an OOM condition in a mem cgroup with policy "cgroup",
+   the heuristic will still compare mem cgroups as indivisible memory
+   consumers.
+
   memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined.  Unless specified
@@ -1336,43 +1347,36 @@ belonging to the affected files to ensure correct memory ownership.
 OOM Killer
 ~~
 
-Cgroup v2 memory controller implements a cgroup-aware OOM killer.
-It means that it treats cgroups as first class OOM entities.
-
-Cgroup-aware OOM logic is turned off by default and requires
-passing the "groupoom" option on mounting cgroupfs. It can also
-by remounting cgroupfs with the following command::
-
-  # mount -o remount,groupoom $MOUNT_POINT
-
-Under OOM conditions the memory controller tries to make the best
-choice of a victim, looking for a memory cgroup with the largest
-memory footprint, considering leaf cgroups and cgroups with the
-memory.oom_group option set, which are considered to be an indivisible
-memory consumers.
-
-By default, OOM killer will kill the biggest task in the selected
-memory cgroup. A user can change this behavior by enabling
-the per-cgroup memory.oom_group option. If set, it causes
-the OOM killer t

[patch v3 -mm 6/6] mm, memcg: disregard mempolicies for cgroup-aware oom killer

2018-07-13 Thread David Rientjes
The cgroup-aware oom killer currently considers the set of allowed nodes
for the allocation that triggers the oom killer and discounts usage from
disallowed nodes when comparing cgroups.

If a cgroup has both the cpuset and memory controllers enabled, it may be
possible to restrict allocations to a subset of nodes, for example.  Some
latency sensitive users use cpusets to allocate only local memory, almost
to the point of oom even though there is an abundance of available free
memory on other nodes.

The same is true for processes that mbind(2) their memory to a set of
allowed nodes.

This yields very inconsistent results by considering usage from each mem
cgroup (and perhaps its subtree) for the allocation's set of allowed nodes
for its mempolicy.  Allocating a single page for a vma that is mbind to a
now-oom node can cause a cgroup that is restricted to that node by its
cpuset controller to be oom killed when other cgroups may have much higher
overall usage.

The cgroup-aware oom killer is described as killing the largest memory
consuming cgroup (or subtree) without mentioning the mempolicy of the
allocation.  For now, discount it.  It would be possible to add an
additional oom policy for NUMA awareness if it would be generally useful
later with the extensible interface.

Signed-off-by: David Rientjes 
---
 mm/memcontrol.c | 18 ++
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2819,19 +2819,15 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
return ret;
 }
 
-static long memcg_oom_badness(struct mem_cgroup *memcg,
- const nodemask_t *nodemask)
+static long memcg_oom_badness(struct mem_cgroup *memcg)
 {
const bool is_root_memcg = memcg == root_mem_cgroup;
long points = 0;
int nid;
-   pg_data_t *pgdat;
 
for_each_node_state(nid, N_MEMORY) {
-   if (nodemask && !node_isset(nid, *nodemask))
-   continue;
+   pg_data_t *pgdat = NODE_DATA(nid);
 
-   pgdat = NODE_DATA(nid);
if (is_root_memcg) {
points += node_page_state(pgdat, NR_ACTIVE_ANON) +
  node_page_state(pgdat, NR_INACTIVE_ANON);
@@ -2867,8 +2863,7 @@ static long memcg_oom_badness(struct mem_cgroup *memcg,
  *   >0: memcg is eligible, and the returned value is an estimation
  *   of the memory footprint
  */
-static long oom_evaluate_memcg(struct mem_cgroup *memcg,
-  const nodemask_t *nodemask)
+static long oom_evaluate_memcg(struct mem_cgroup *memcg)
 {
struct css_task_iter it;
struct task_struct *task;
@@ -2902,7 +2897,7 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg,
if (eligible <= 0)
return eligible;
 
-   return memcg_oom_badness(memcg, nodemask);
+   return memcg_oom_badness(memcg);
 }
 
 static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
@@ -2962,7 +2957,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
if (memcg_has_children(iter))
continue;
 
-   score = oom_evaluate_memcg(iter, oc->nodemask);
+   score = oom_evaluate_memcg(iter);
 
/*
 * Ignore empty and non-eligible memory cgroups.
@@ -2991,8 +2986,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 
if (oc->chosen_memcg != INFLIGHT_VICTIM) {
if (root == root_mem_cgroup) {
-   group_score = oom_evaluate_memcg(root_mem_cgroup,
-oc->nodemask);
+   group_score = oom_evaluate_memcg(root_mem_cgroup);
if (group_score > leaf_score) {
/*
 * Discount the sum of all leaf scores to find


[patch v3 -mm 0/6] rewrite cgroup aware oom killer for general use

2018-07-13 Thread David Rientjes
There are three significant concerns about the cgroup aware oom killer as
it is implemented in -mm:

 (1) allows users to evade the oom killer by creating subcontainers or
 using other controllers since scoring is done per cgroup and not
 hierarchically,

 (2) unfairly compares the root mem cgroup using completely different
 criteria than leaf mem cgroups and allows wildly inaccurate results
 if oom_score_adj is used, and

 (3) does not allow the user to influence the decisionmaking, such that
 important subtrees cannot be preferred or biased.

This patchset fixes (1) and (2) completely and, by doing so, introduces a
completely extensible user interface that can be expanded in the future.

Concern (3) could subsequently be addressed either before or after the
cgroup-aware oom killer feature is merged.

It preserves all functionality that currently exists in -mm and extends
it to be generally useful outside of very specialized usecases.

It eliminates the mount option for the cgroup aware oom killer entirely
since it is now enabled through the root mem cgroup's oom policy.
---
v3:
 - Rebased to next-20180713

v2:
 - Rebased to next-20180322
 - Fixed get_nr_swap_pages() build bug found by kbuild test robot

 Documentation/admin-guide/cgroup-v2.rst | 100 ++-
 include/linux/cgroup-defs.h |   5 -
 include/linux/memcontrol.h  |  21 +++
 kernel/cgroup/cgroup.c  |  13 +-
 mm/memcontrol.c | 221 ++--
 5 files changed, 204 insertions(+), 156 deletions(-)


[patch v3 -mm 2/6] mm, memcg: replace cgroup aware oom killer mount option with tunable

2018-07-13 Thread David Rientjes
Now that each mem cgroup on the system has a memory.oom_policy tunable to
specify oom kill selection behavior, remove the needless "groupoom" mount
option that requires (1) the entire system to be forced, perhaps
unnecessarily, perhaps unexpectedly, into a single oom policy that
differs from the traditional per process selection, and (2) a remount to
change.

Instead of enabling the cgroup aware oom killer with the "groupoom" mount
option, set the mem cgroup subtree's memory.oom_policy to "cgroup".

The heuristic used to select a process or cgroup to kill from is
controlled by the oom mem cgroup's memory.oom_policy.  This means that if
a descendant mem cgroup has an oom policy of "none", for example, and an
oom condition originates in an ancestor with an oom policy of "cgroup",
the selection logic will treat all descendant cgroups as indivisible
memory consumers.

For example, consider an example where each mem cgroup has "memory" set
in cgroup.controllers:

mem cgroup  cgroup.procs
==  
/cg11 process consuming 250MB
/cg23 processes consuming 100MB each
/cg3/cg31   2 processes consuming 100MB each
/cg3/cg32   2 processes consuming 100MB each

If the root mem cgroup's memory.oom_policy is "none", the process from
/cg1 is chosen as the victim.  If memory.oom_policy is "cgroup", a process
from /cg2 is chosen because it is in the single indivisible memory
consumer with the greatest usage.  This policy of "cgroup" is identical to
to the current "groupoom" mount option, now removed.

Note that /cg3 is not the chosen victim when the oom mem cgroup policy is
"cgroup" because cgroups are treated individually without regard to
hierarchical /cg3/memory.current usage.  This will be addressed in a
follow-up patch.

This has the added benefit of allowing descendant cgroups to control their
own oom policies if they have memory.oom_policy file permissions without
being restricted to the system-wide policy.  In the above example, /cg2
and /cg3 can be either "none" or "cgroup" with the same results: the
selection heuristic depends only on the policy of the oom mem cgroup.  If
/cg2 or /cg3 themselves are oom, however, the policy is controlled by
their own oom policies, either process aware or cgroup aware.

Signed-off-by: David Rientjes 
---
 Documentation/admin-guide/cgroup-v2.rst | 78 +
 include/linux/cgroup-defs.h |  5 --
 include/linux/memcontrol.h  |  5 ++
 kernel/cgroup/cgroup.c  | 13 +
 mm/memcontrol.c | 19 +++---
 5 files changed, 56 insertions(+), 64 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guide/cgroup-v2.rst
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1109,6 +1109,17 @@ PAGE_SIZE multiple when read back.
Documentation/filesystems/proc.txt).  This is the same policy as if
memory cgroups were not even mounted.
 
+   If "cgroup", the OOM killer will compare mem cgroups as indivisible
+   memory consumers; that is, they will compare mem cgroup usage rather
+   than process memory footprint.  See the "OOM Killer" section below.
+
+   When an OOM condition occurs, the policy is dictated by the mem
+   cgroup that is OOM (the root mem cgroup for a system-wide OOM
+   condition).  If a descendant mem cgroup has a policy of "none", for
+   example, for an OOM condition in a mem cgroup with policy "cgroup",
+   the heuristic will still compare mem cgroups as indivisible memory
+   consumers.
+
   memory.events
A read-only flat-keyed file which exists on non-root cgroups.
The following entries are defined.  Unless specified
@@ -1336,43 +1347,36 @@ belonging to the affected files to ensure correct 
memory ownership.
 OOM Killer
 ~~
 
-Cgroup v2 memory controller implements a cgroup-aware OOM killer.
-It means that it treats cgroups as first class OOM entities.
-
-Cgroup-aware OOM logic is turned off by default and requires
-passing the "groupoom" option on mounting cgroupfs. It can also
-by remounting cgroupfs with the following command::
-
-  # mount -o remount,groupoom $MOUNT_POINT
-
-Under OOM conditions the memory controller tries to make the best
-choice of a victim, looking for a memory cgroup with the largest
-memory footprint, considering leaf cgroups and cgroups with the
-memory.oom_group option set, which are considered to be an indivisible
-memory consumers.
-
-By default, OOM killer will kill the biggest task in the selected
-memory cgroup. A user can change this behavior by enabling
-the per-cgroup memory.oom_group option. If set, it causes
-the OOM killer t

[patch v3 -mm 6/6] mm, memcg: disregard mempolicies for cgroup-aware oom killer

2018-07-13 Thread David Rientjes
The cgroup-aware oom killer currently considers the set of allowed nodes
for the allocation that triggers the oom killer and discounts usage from
disallowed nodes when comparing cgroups.

If a cgroup has both the cpuset and memory controllers enabled, it may be
possible to restrict allocations to a subset of nodes, for example.  Some
latency sensitive users use cpusets to allocate only local memory, almost
to the point of oom even though there is an abundance of available free
memory on other nodes.

The same is true for processes that mbind(2) their memory to a set of
allowed nodes.

This yields very inconsistent results by considering usage from each mem
cgroup (and perhaps its subtree) for the allocation's set of allowed nodes
for its mempolicy.  Allocating a single page for a vma that is mbind to a
now-oom node can cause a cgroup that is restricted to that node by its
cpuset controller to be oom killed when other cgroups may have much higher
overall usage.

The cgroup-aware oom killer is described as killing the largest memory
consuming cgroup (or subtree) without mentioning the mempolicy of the
allocation.  For now, discount it.  It would be possible to add an
additional oom policy for NUMA awareness if it would be generally useful
later with the extensible interface.

Signed-off-by: David Rientjes 
---
 mm/memcontrol.c | 18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2819,19 +2819,15 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
return ret;
 }
 
-static long memcg_oom_badness(struct mem_cgroup *memcg,
- const nodemask_t *nodemask)
+static long memcg_oom_badness(struct mem_cgroup *memcg)
 {
const bool is_root_memcg = memcg == root_mem_cgroup;
long points = 0;
int nid;
-   pg_data_t *pgdat;
 
for_each_node_state(nid, N_MEMORY) {
-   if (nodemask && !node_isset(nid, *nodemask))
-   continue;
+   pg_data_t *pgdat = NODE_DATA(nid);
 
-   pgdat = NODE_DATA(nid);
if (is_root_memcg) {
points += node_page_state(pgdat, NR_ACTIVE_ANON) +
  node_page_state(pgdat, NR_INACTIVE_ANON);
@@ -2867,8 +2863,7 @@ static long memcg_oom_badness(struct mem_cgroup *memcg,
  *   >0: memcg is eligible, and the returned value is an estimation
  *   of the memory footprint
  */
-static long oom_evaluate_memcg(struct mem_cgroup *memcg,
-  const nodemask_t *nodemask)
+static long oom_evaluate_memcg(struct mem_cgroup *memcg)
 {
struct css_task_iter it;
struct task_struct *task;
@@ -2902,7 +2897,7 @@ static long oom_evaluate_memcg(struct mem_cgroup *memcg,
if (eligible <= 0)
return eligible;
 
-   return memcg_oom_badness(memcg, nodemask);
+   return memcg_oom_badness(memcg);
 }
 
static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
@@ -2962,7 +2957,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
if (memcg_has_children(iter))
continue;
 
-   score = oom_evaluate_memcg(iter, oc->nodemask);
+   score = oom_evaluate_memcg(iter);
 
/*
 * Ignore empty and non-eligible memory cgroups.
@@ -2991,8 +2986,7 @@ static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
 
if (oc->chosen_memcg != INFLIGHT_VICTIM) {
if (root == root_mem_cgroup) {
-   group_score = oom_evaluate_memcg(root_mem_cgroup,
-oc->nodemask);
+   group_score = oom_evaluate_memcg(root_mem_cgroup);
if (group_score > leaf_score) {
/*
 * Discount the sum of all leaf scores to find


Re: [PATCH] mm,oom: Bring OOM notifier callbacks to outside of OOM killer.

2018-07-06 Thread Paul E. McKenney
On Fri, Jul 06, 2018 at 07:39:42AM +0200, Michal Hocko wrote:
> On Tue 03-07-18 09:01:01, Paul E. McKenney wrote:
> > On Tue, Jul 03, 2018 at 09:24:13AM +0200, Michal Hocko wrote:
> > > On Mon 02-07-18 14:37:14, Paul E. McKenney wrote:
> > > [...]
> > > > commit d2b8d16b97ac2859919713b2d98b8a3ad22943a2
> > > > Author: Paul E. McKenney 
> > > > Date:   Mon Jul 2 14:30:37 2018 -0700
> > > > 
> > > > rcu: Remove OOM code
> > > > 
> > > > There is reason to believe that RCU's OOM code isn't really helping
> > > > that much, given that the best it can hope to do is accelerate invoking
> > > > callbacks by a few seconds, and even then only if some CPUs have no
> > > > non-lazy callbacks, a condition that has been observed to be rare.
> > > > This commit therefore removes RCU's OOM code.  If this causes problems,
> > > > it can easily be reinserted.
> > > > 
> > > > Reported-by: Michal Hocko 
> > > > Reported-by: Tetsuo Handa 
> > > > Signed-off-by: Paul E. McKenney 
> > > 
> > > I would also note that waiting in the notifier might be a problem on its
> > > own because we are holding the oom_lock and the system cannot trigger
> > > the OOM killer while we are holding it and waiting for oom_callback_wq
> > > event. I am not familiar with the code to tell whether this can deadlock
> > > but from a quick glance I _suspect_ that we might depend on __rcu_reclaim
> > > and basically an arbitrary callback so no good.
> > > 
> > > Acked-by: Michal Hocko 
> > > 
> > > Thanks!
> > 
> > Like this?
> 
> Thanks!

Very good, queued for the merge window after next, that is, whatever
number after v4.19.  ;-)

Thanx, Paul



