Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-31 Thread David Rientjes
On Thu, 31 Aug 2017, Roman Gushchin wrote:

> So, it looks to me that we're close to an acceptable version,
> and the only remaining question is the default behavior
> (when oom_group is not set).
> 

Nit: without knowledge of the implementation, I still don't think I would 
know what an "out of memory group" is.  Out of memory doesn't necessarily 
imply a kill.  I suggest oom_kill_all or something that includes the verb.

> Michal suggests ignoring non-oom_group memcgs and comparing tasks with
> memcgs that have oom_group set. This makes the whole thing completely opt-in,
> but then we probably need another knob (or value) to select between
> "select memcg, kill biggest task" and "select memcg, kill all tasks".

It seems like that would either bias toward or bias against cgroups that 
opt in.  I suggest comparing memory cgroups at each level in the hierarchy 
based on your new badness heuristic, regardless of any tunables they have 
enabled.  Then kill either the largest process or all the processes 
attached, depending on oom_group or oom_kill_all.

> Also, as the whole thing is based on comparison between processes and
> memcgs, we probably need oom_priority for processes.

I think that, with the constraints of cgroup v2, a victim memcg must first 
be chosen, and then either a victim process attached to that memcg is 
chosen or all eligible processes attached to it are killed, depending on 
the tunable.

The simplest and clearest way to define this, in my opinion, is to 
implement a heuristic that compares sibling memcgs based on usage, as you 
have done.  This can be overridden by a memory.oom_priority that userspace 
defines, and that is enough support for userspace to change victim 
selection (no mount option needed, just set memory.oom_priority).  Then 
kill the largest process or all eligible processes attached.  We only use 
per-process priority to override process selection compared to sibling 
memcgs, but with cgroup v2 process constraints that doesn't seem to be 
within the scope of your patchset.
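
To sketch what I mean (the iterator helpers below are hypothetical
placeholders, not your patchset's interface; only memcg_oom_badness()
is from the patch):

/*
 * Rough sketch only: descend the hierarchy level by level, preferring
 * a higher memory.oom_priority and falling back to the usage-based
 * badness score.  memcg_has_children() and for_each_mem_cgroup_child()
 * are hypothetical helpers.
 */
static struct mem_cgroup *select_victim_memcg(struct mem_cgroup *root)
{
	struct mem_cgroup *parent = root;
	struct mem_cgroup *iter, *chosen;

	while (memcg_has_children(parent)) {
		chosen = NULL;
		for_each_mem_cgroup_child(parent, iter) {
			if (!chosen ||
			    iter->oom_priority > chosen->oom_priority ||
			    (iter->oom_priority == chosen->oom_priority &&
			     memcg_oom_badness(iter, NULL) >
			     memcg_oom_badness(chosen, NULL)))
				chosen = iter;
		}
		parent = chosen;
	}
	return parent;
}

The killer would then check the chosen memcg's kill-all tunable and
either pick its largest attached process or kill them all.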


Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-31 Thread Roman Gushchin
On Wed, Aug 30, 2017 at 01:56:22PM -0700, David Rientjes wrote:
> On Wed, 30 Aug 2017, Roman Gushchin wrote:
> 
> > I've spent some time implementing such a version.
> > 
> > It really became shorter and more existing code was reused;
> > however, I've run into a couple of serious issues:
> > 
> > 1) Simple summing of per-task oom_score doesn't make sense.
> >    First, we calculate oom_score per-task, while we should sum per-process
> >    values, or, better, per-mm struct. We can take only the thread-group
> >    leader's score into account, but it's also not 100% accurate.
> >    And, again, we have the question of what to do with per-task oom_score_adj,
> >    if we don't take the task's oom_score into account.
> > 
> >    Using memcg stats still looks to me like a more accurate and consistent
> >    way of estimating memcg memory footprint.
> > 
> 
> The patchset is introducing a new methodology for selecting oom victims so 
> you can define how cgroups are compared vs other cgroups with your own 
> "badness" calculation.  I think your implementation based heavily on anon 
> and unevictable lrus and unreclaimable slab is fine and you can describe 
> that detail in the documentation (along with the caveat that it is only 
> calculated for nodes in the allocation's mempolicy).  With 
> memory.oom_priority, the user has full ability to change that selection.  
> Process selection heuristics have changed over time themselves; it's not 
> something that must be backward compatible, and trying to sum the usage 
> from each of the cgroup's mm_structs and respect oom_score_adj is 
> unnecessarily complex.

I agree.

So, it looks to me that we're close to an acceptable version,
and the only remaining question is the default behavior
(when oom_group is not set).

Michal suggests ignoring non-oom_group memcgs and comparing tasks with
memcgs that have oom_group set. This makes the whole thing completely opt-in,
but then we probably need another knob (or value) to select between
"select memcg, kill biggest task" and "select memcg, kill all tasks".
Also, as the whole thing is based on comparison between processes and
memcgs, we probably need oom_priority for processes.
I'm not necessarily against these options, but I do worry about the
complexity of the resulting interface.

In my implementation we always select a victim memcg first (or a task
in the root memcg), and then kill the biggest task inside.
This actually changes the victim selection policy: by doing so
we achieve per-memcg fairness, which makes sense in a containerized
environment.
I believe it's acceptable, but I can also add a cgroup v2 mount option
to completely revert to the per-process OOM killer for those users who,
for some reason, depend on the existing victim selection policy.
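
A minimal sketch of that default flow, with made-up helper names (only
mem_cgroup_oom_kill_all() is from the patch):

/*
 * Sketch of the default policy: pick the biggest leaf memcg (root-cgroup
 * tasks compete as individual consumers), then kill either the biggest
 * task inside or, if the kill-all flag is set, every task.
 * select_biggest_consumer(), kill_all_tasks_in_memcg() and
 * kill_biggest_task_in_memcg() are hypothetical.
 */
static void oom_kill_memcg_victim(struct oom_control *oc)
{
	struct mem_cgroup *victim = select_biggest_consumer(oc);

	if (mem_cgroup_oom_kill_all(victim))
		kill_all_tasks_in_memcg(victim);
	else
		kill_biggest_task_in_memcg(victim);
}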

Any thoughts/objections?

Thanks!

Roman


Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-30 Thread David Rientjes
On Wed, 30 Aug 2017, Roman Gushchin wrote:

> I've spent some time implementing such a version.
> 
> It really became shorter and more existing code was reused;
> however, I've run into a couple of serious issues:
> 
> 1) Simple summing of per-task oom_score doesn't make sense.
>    First, we calculate oom_score per-task, while we should sum per-process
>    values, or, better, per-mm struct. We can take only the thread-group
>    leader's score into account, but it's also not 100% accurate.
>    And, again, we have the question of what to do with per-task oom_score_adj,
>    if we don't take the task's oom_score into account.
> 
>    Using memcg stats still looks to me like a more accurate and consistent
>    way of estimating memcg memory footprint.
> 

The patchset is introducing a new methodology for selecting oom victims so 
you can define how cgroups are compared vs other cgroups with your own 
"badness" calculation.  I think your implementation based heavily on anon 
and unevictable lrus and unreclaimable slab is fine and you can describe 
that detail in the documentation (along with the caveat that it is only 
calculated for nodes in the allocation's mempolicy).  With 
memory.oom_priority, the user has full ability to change that selection.  
Process selection heuristics have changed over time themselves; it's not 
something that must be backward compatible, and trying to sum the usage 
from each of the cgroup's mm_structs and respect oom_score_adj is 
unnecessarily complex.


Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-30 Thread Roman Gushchin
On Fri, Aug 25, 2017 at 10:14:03AM +0200, Michal Hocko wrote:
> On Thu 24-08-17 15:58:01, Roman Gushchin wrote:
> > On Thu, Aug 24, 2017 at 04:13:37PM +0200, Michal Hocko wrote:
> > > On Thu 24-08-17 14:58:42, Roman Gushchin wrote:
> [...]
> > > > Both ways are not ideal, and the sum over processes is not ideal either,
> > > > especially if you take oom_score_adj into account. Will you respect it?
> > > 
> > > Yes, and I do not see any reason why we shouldn't.
> > 
> > It makes things even more complicated.
> > Right now a task's oom_score can be in the (~ -total_memory, ~ +2*total_memory)
> > range, and if you start summing it, it can be multiplied by the number of
> > tasks... Weird.
> 
> oom_score_adj is just a normalized bias, so if tasks inside the oom memcg
> use it, the whole memcg will get the accumulated bias from all such tasks,
> so it is not completely off. I agree that the more tasks use the bias the
> more biased the whole memcg will be. This might or might not be a problem.
> As you are trying to reimplement the existing oom killer implementation,
> I do not think we can simply ignore an API which people are used to.
> 
> If this was a configurable oom policy then I could see how ignoring
> oom_score_adj is acceptable because it would be an explicit opt-in.
> 
> > It will also be different in case of system-wide and memcg-wide OOM.
> 
> Why, we do honor oom_score_adj for the memcg OOM now and in fact the
> kernel memcg OOM killer shouldn't be very much different from the global
> one except for the tasks scope.
> 
> > > > I've actually started with such an approach, but then found it weird.
> > > > 
> > > > > Besides that you have
> > > > > to check each task for over-killing anyway. So I do not see any
> > > > > performance merits here.
> > > > 
> > > > It's an implementation detail, and we can hopefully get rid of it at 
> > > > some point.
> > > 
> > > Well, we might do some estimations and ignore oom scopes but that
> > > sounds really complicated and error prone. Unless we have anything like
> > > that then I would start from tasks and build up what is necessary to make
> > > a decision at the higher level.
> > 
> > Seriously speaking, do you have an example where summing per-process
> > oom_score would work better?
> 
> The primary reason I am pushing for this is to have the common iterator
> code path (which we have since Vladimir has unified memcg and global oom
> paths) and only parametrize the value calculation and victim selection.
> 
> > Especially if we're talking about customizing oom_score calculation,
> > it makes no sense to me. How would you sum process timestamps?
> 
> Well, I meant you could sum oom_badness for your particular
> implementation. If we need some other policy then this wouldn't work and
> that's why I've said that I would like to preserve the current common
> code and only parametrize value calculation and victim selection...

I've spent some time implementing such a version.

It really became shorter and more existing code was reused;
however, I've run into a couple of serious issues:

1) Simple summing of per-task oom_score doesn't make sense.
   First, we calculate oom_score per-task, while we should sum per-process
   values, or, better, per-mm struct. We can take only the thread-group
   leader's score into account, but it's also not 100% accurate.
   And, again, we have the question of what to do with per-task oom_score_adj,
   if we don't take the task's oom_score into account.

   Using memcg stats still looks to me like a more accurate and consistent
   way of estimating memcg memory footprint.

2) If we're treating tasks from non-kill-all cgroups as separate oom entities,
   and comparing them with memcgs that have the kill-all flag set, we definitely
   need per-task oom_priority to provide a clear way to compare entities.

   Otherwise we need per-memcg size-based oom_score_adj, which is not
   the best idea, as we agreed earlier.

Thanks!

Roman


Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-25 Thread Michal Hocko
On Fri 25-08-17 11:39:51, Roman Gushchin wrote:
> On Fri, Aug 25, 2017 at 10:14:03AM +0200, Michal Hocko wrote:
> > On Thu 24-08-17 15:58:01, Roman Gushchin wrote:
> > > On Thu, Aug 24, 2017 at 04:13:37PM +0200, Michal Hocko wrote:
> > > > On Thu 24-08-17 14:58:42, Roman Gushchin wrote:
> > [...]
> > > > > Both ways are not ideal, and the sum over processes is not ideal either,
> > > > > especially if you take oom_score_adj into account. Will you respect
> > > > > it?
> > > > 
> > > > Yes, and I do not see any reason why we shouldn't.
> > > 
> > > It makes things even more complicated.
> > > Right now a task's oom_score can be in the (~ -total_memory, ~ +2*total_memory)
> > > range, and if you start summing it, it can be multiplied by the number of
> > > tasks... Weird.
> > 
> > oom_score_adj is just a normalized bias, so if tasks inside the oom memcg
> > use it, the whole memcg will get the accumulated bias from all such tasks,
> > so it is not completely off. I agree that the more tasks use the bias the
> > more biased the whole memcg will be. This might or might not be a problem.
> > As you are trying to reimplement the existing oom killer implementation,
> > I do not think we can simply ignore an API which people are used to.
> > 
> > If this was a configurable oom policy then I could see how ignoring
> > oom_score_adj is acceptable because it would be an explicit opt-in.
> >
> > > It will also be different in case of system-wide and memcg-wide OOM.
> > 
> > Why, we do honor oom_score_adj for the memcg OOM now and in fact the
> > kernel memcg OOM killer shouldn't be very much different from the global
> > one except for the tasks scope.
> 
> Assume you have two tasks (2GB and 1GB) in a cgroup with a 3GB limit.
> The second task has oom_score_adj +100. Total memory is 64GB, for example.
> 
> In case of a memcg-wide OOM the first task will be selected;
> in case of a system-wide OOM, the second.
> 
> Personally I don't like this, but it looks like we have to respect
> oom_score_adj set to -1000; I'll alter my patch.

I cannot say I would love how oom_score_adj works but it's been like
that for a long time and people do rely on it. So we cannot simply
change it under people's feet.
 
> > > > > I've actually started with such an approach, but then found it weird.
> > > > > 
> > > > > > Besides that you have
> > > > > > to check each task for over-killing anyway. So I do not see any
> > > > > > performance merits here.
> > > > > 
> > > > > It's an implementation detail, and we can hopefully get rid of it at 
> > > > > some point.
> > > > 
> > > > Well, we might do some estimations and ignore oom scopes but that
> > > > sounds really complicated and error prone. Unless we have anything like
> > > > that then I would start from tasks and build up what is necessary to make
> > > > a decision at the higher level.
> > > 
> > > Seriously speaking, do you have an example where summing per-process
> > > oom_score would work better?
> > 
> > The primary reason I am pushing for this is to have the common iterator
> > code path (which we have since Vladimir has unified memcg and global oom
> > paths) and only parametrize the value calculation and victim selection.
> 
> I agree, but I'm not sure that we can (and have to) totally unify the way
> oom_score is calculated for processes and cgroups.
> 
> But I'd like to see a unified oom_priority approach. This will allow
> defining an OOM killing order in a clear way, and using size-based tiebreaking
> for items of the same priority. Root-cgroup processes will be compared with
> other memory consumers by oom_priority first and oom_score afterwards.

This again changes the existing semantics, so I really think we should be
careful, and this all should be opt-in.
-- 
Michal Hocko
SUSE Labs


Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-25 Thread Roman Gushchin
Hi David!

On Wed, Aug 23, 2017 at 04:19:11PM -0700, David Rientjes wrote:
> On Wed, 23 Aug 2017, Roman Gushchin wrote:
> 
> > Traditionally, the OOM killer operates on a process level.
> > Under oom conditions, it finds the process with the highest oom score
> > and kills it.
> > 
> > This behavior doesn't suit systems with many running
> > containers well:
> > 
> > 1) There is no fairness between containers. A small container with
> > a few large processes will be chosen over a large one with a huge
> > number of small processes.
> > 
> > 2) Containers often do not expect that some random process inside
> > will be killed. In many cases a much safer behavior is to kill
> > all tasks in the container. Traditionally, this was implemented
> > in userspace, but doing it in the kernel has some advantages,
> > especially in the case of a system-wide OOM.
> > 
> > 3) Per-process oom_score_adj affects global OOM, so it's a breach
> > in the isolation.
> > 
> > To address these issues, a cgroup-aware OOM killer is introduced.
> > 
> > Under OOM conditions, it tries to find the biggest memory consumer,
> > and free memory by killing the corresponding task(s). The difference
> > from the "traditional" OOM killer is that it can treat memory cgroups
> > as memory consumers as well as single processes.
> > 
> > By default, it will look for the biggest leaf cgroup, and kill
> > the largest task inside.
> > 
> > But a user can change this behavior by enabling the per-cgroup
> > oom_kill_all_tasks option. If set, it causes the OOM killer to treat
> > the whole cgroup as an indivisible memory consumer. If it is
> > selected as an OOM victim, all belonging tasks will be killed.
> > 
> 
> I'm very happy with the rest of the patchset, but I feel that I must renew 
> my objection to memory.oom_kill_all_tasks being able to override the 
> admin's decision to set a process oom disabled.  From my 
> perspective, setting memory.oom_kill_all_tasks with an oom disabled 
> process attached that now becomes killable either (1) overrides the 
> CAP_SYS_RESOURCE oom disabled setting or (2) is lazy and doesn't modify 
> /proc/pid/oom_score_adj itself.

Changed this in v7 (to be posted soon).

Thanks!

Roman


Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-25 Thread Roman Gushchin
On Fri, Aug 25, 2017 at 10:14:03AM +0200, Michal Hocko wrote:
> On Thu 24-08-17 15:58:01, Roman Gushchin wrote:
> > On Thu, Aug 24, 2017 at 04:13:37PM +0200, Michal Hocko wrote:
> > > On Thu 24-08-17 14:58:42, Roman Gushchin wrote:
> [...]
> > > > Both ways are not ideal, and the sum over processes is not ideal either,
> > > > especially if you take oom_score_adj into account. Will you respect it?
> > > 
> > > Yes, and I do not see any reason why we shouldn't.
> > 
> > It makes things even more complicated.
> > Right now a task's oom_score can be in the (~ -total_memory, ~ +2*total_memory)
> > range, and if you start summing it, it can be multiplied by the number of
> > tasks... Weird.
> 
> oom_score_adj is just a normalized bias, so if tasks inside the oom memcg
> use it, the whole memcg will get the accumulated bias from all such tasks,
> so it is not completely off. I agree that the more tasks use the bias the
> more biased the whole memcg will be. This might or might not be a problem.
> As you are trying to reimplement the existing oom killer implementation,
> I do not think we can simply ignore an API which people are used to.
> 
> If this was a configurable oom policy then I could see how ignoring
> oom_score_adj is acceptable because it would be an explicit opt-in.
>
> > It will also be different in case of system-wide and memcg-wide OOM.
> 
> Why, we do honor oom_score_adj for the memcg OOM now and in fact the
> kernel memcg OOM killer shouldn't be very much different from the global
> one except for the tasks scope.

Assume you have two tasks (2GB and 1GB) in a cgroup with a 3GB limit.
The second task has oom_score_adj +100. Total memory is 64GB, for example.

In case of a memcg-wide OOM the first task will be selected;
in case of a system-wide OOM, the second.
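
Back-of-the-envelope arithmetic for this example, assuming the current
oom_badness() bias (points += oom_score_adj * totalpages / 1000) and
4KB pages; the numbers are illustrative only:

/*
 * memcg-wide OOM: totalpages = 3GB = 786432 pages
 *   task A (2GB): 524288 points
 *   task B (1GB): 262144 + 100 * (786432 / 1000) = 340744 points
 *   => task A is selected.
 *
 * system-wide OOM: totalpages = 64GB = 16777216 pages
 *   task A (2GB): 524288 points
 *   task B (1GB): 262144 + 100 * (16777216 / 1000) = 1939844 points
 *   => task B is selected.
 */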

Personally I don't like this, but it looks like we have to respect
oom_score_adj set to -1000; I'll alter my patch.

> 
> > > > I've actually started with such an approach, but then found it weird.
> > > > 
> > > > > Besides that you have
> > > > > to check each task for over-killing anyway. So I do not see any
> > > > > performance merits here.
> > > > 
> > > > It's an implementation detail, and we can hopefully get rid of it at 
> > > > some point.
> > > 
> > > Well, we might do some estimations and ignore oom scopes but that
> > > sounds really complicated and error prone. Unless we have anything like
> > > that then I would start from tasks and build up what is necessary to make
> > > a decision at the higher level.
> > 
> > Seriously speaking, do you have an example where summing per-process
> > oom_score would work better?
> 
> The primary reason I am pushing for this is to have the common iterator
> code path (which we have since Vladimir has unified memcg and global oom
> paths) and only parametrize the value calculation and victim selection.

I agree, but I'm not sure that we can (and have to) totally unify the way
oom_score is calculated for processes and cgroups.

But I'd like to see a unified oom_priority approach. This will allow
defining an OOM killing order in a clear way, and using size-based tiebreaking
for items of the same priority. Root-cgroup processes will be compared with
other memory consumers by oom_priority first and oom_score afterwards.
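
As an illustration only (the type and field names below are made up),
the intended ordering is:

struct oom_entity {		/* hypothetical: memcg or root-cgroup task */
	int  oom_priority;	/* higher priority is killed first */
	long oom_score;		/* size-based badness */
};

/* Returns true if a should be preferred over b as the OOM victim. */
static bool oom_entity_worse(const struct oom_entity *a,
			     const struct oom_entity *b)
{
	if (a->oom_priority != b->oom_priority)
		return a->oom_priority > b->oom_priority;
	return a->oom_score > b->oom_score;	/* size-based tiebreak */
}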

What do you think about it?

Thanks!


Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-25 Thread Michal Hocko
On Thu 24-08-17 15:58:01, Roman Gushchin wrote:
> On Thu, Aug 24, 2017 at 04:13:37PM +0200, Michal Hocko wrote:
> > On Thu 24-08-17 14:58:42, Roman Gushchin wrote:
[...]
> > > Both ways are not ideal, and the sum over processes is not ideal either,
> > > especially if you take oom_score_adj into account. Will you respect it?
> > 
> > Yes, and I do not see any reason why we shouldn't.
> 
> It makes things even more complicated.
> Right now a task's oom_score can be in the (~ -total_memory, ~ +2*total_memory)
> range, and if you start summing it, it can be multiplied by the number of
> tasks... Weird.

oom_score_adj is just a normalized bias, so if tasks inside the oom memcg
use it, the whole memcg will get the accumulated bias from all such tasks,
so it is not completely off. I agree that the more tasks use the bias the
more biased the whole memcg will be. This might or might not be a problem.
As you are trying to reimplement the existing oom killer implementation,
I do not think we can simply ignore an API which people are used to.

If this was a configurable oom policy then I could see how ignoring
oom_score_adj is acceptable because it would be an explicit opt-in.

> It will also be different in case of system-wide and memcg-wide OOM.

Why, we do honor oom_score_adj for the memcg OOM now and in fact the
kernel memcg OOM killer shouldn't be very much different from the global
one except for the tasks scope.

> > > I've actually started with such an approach, but then found it weird.
> > > 
> > > > Besides that you have
> > > > to check each task for over-killing anyway. So I do not see any
> > > > performance merits here.
> > > 
> > > It's an implementation detail, and we can hopefully get rid of it at some 
> > > point.
> > 
> > Well, we might do some estimations and ignore oom scopes but that
> > sounds really complicated and error prone. Unless we have anything like
> > that then I would start from tasks and build up what is necessary to make
> > a decision at the higher level.
> 
> Seriously speaking, do you have an example where summing per-process
> oom_score would work better?

The primary reason I am pushing for this is to have the common iterator
code path (which we have since Vladimir has unified memcg and global oom
paths) and only parametrize the value calculation and victim selection.

> Especially if we're talking about customizing oom_score calculation,
> it makes no sense to me. How would you sum process timestamps?

Well, I meant you could sum oom_badness for your particular
implementation. If we need some other policy then this wouldn't work and
that's why I've said that I would like to preserve the current common
code and only parametrize value calculation and victim selection...
-- 
Michal Hocko
SUSE Labs


Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-24 Thread Roman Gushchin
On Thu, Aug 24, 2017 at 04:13:37PM +0200, Michal Hocko wrote:
> On Thu 24-08-17 14:58:42, Roman Gushchin wrote:
> > On Thu, Aug 24, 2017 at 02:58:11PM +0200, Michal Hocko wrote:
> > > On Thu 24-08-17 13:28:46, Roman Gushchin wrote:
> > > > Hi Michal!
> > > > 
> > > There is nothing like a "better victim". We are pretty much in a
> > > catastrophic situation when we try to survive by killing userspace.
> > 
> > Not necessarily; it can be a cgroup OOM.
> 
> memcg OOM is no different. The catastrophe is scoped to the specific
> hierarchy, but tasks in that hierarchy still fail to make further
> progress.
> 
> > > We try to kill the largest because that assumes that we return the
> > > most memory from it. Now I do understand that you want to treat the
> > > memcg as a single killable entity but I find it really questionable
> > > to do a per-memcg metric and then not treat it like that and kill
> > > only a single task. Just imagine a single memcg with zillions of tasks,
> > > each very small, and you select it as the largest while killing a small
> > > task itself doesn't help to get us out of the OOM.
> > 
> > I don't think it's different from a non-containerized state: if you
> > have a zillion small tasks in the system, you'll meet the same issues.
> 
> Yes this is possible but usually you are comparing apples to apples so
> you will kill the largest offender and then go on. To be honest I really
> do hate how we try to kill a child rather than the selected victim
> for the same reason.

I do hate it too.

> 
> > > > > I guess I have asked already and we haven't reached any consensus. I do
> > > > > not like how you treat memcgs and tasks differently. Why cannot we have
> > > > > a memcg score be the sum of all its tasks?
> > > > 
> > > > It sounds like a more expensive way to get almost the same with less
> > > > accuracy. Why is it better?
> > > 
> > > because then you are comparing apples to apples?
> > 
> > Well, I can say that I compare some number of pages against some other
> > number of pages. And the relation between a page and a memcg is more
> > obvious than the relation between a page and a process.
> 
> But you are comparing different accounting systems.
>  
> > Both ways are not ideal, and the sum over processes is not ideal either,
> > especially if you take oom_score_adj into account. Will you respect it?
> 
> Yes, and I do not see any reason why we shouldn't.

It makes things even more complicated.
Right now a task's oom_score can be in the (~ -total_memory, ~ +2*total_memory) range,
and if you start summing it, it can be multiplied by the number of tasks...
Weird.
It will also be different in case of system-wide and memcg-wide OOM.
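
For reference, the range follows from the current oom_badness()
arithmetic (this is a reading of mm/oom_kill.c, not new code):

/*
 *   points  = rss + swap entries + page table pages;  // 0 .. ~totalpages
 *   points += oom_score_adj * totalpages / 1000;      // -totalpages .. +totalpages
 *
 * so the raw per-task value spans roughly (-totalpages, 2 * totalpages)
 * (the returned score is clamped to >= 1), and a naive sum over N such
 * tasks can scale the oom_score_adj bias by up to N.
 */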

> 
> > I've actually started with such an approach, but then found it weird.
> > 
> > > Besides that you have
> > > to check each task for over-killing anyway. So I do not see any
> > > performance merits here.
> > 
> > It's an implementation detail, and we can hopefully get rid of it at some 
> > point.
> 
> Well, we might do some estimations and ignore oom scopes but that
> sounds really complicated and error prone. Unless we have anything like
> that then I would start from tasks and build up what is necessary to make
> a decision at the higher level.

Seriously speaking, do you have an example where summing per-process
oom_score would work better?

Especially if we're talking about customizing oom_score calculation,
it makes no sense to me. How would you sum process timestamps?

>  
> > > > > How do you want to compare a memcg score with a task score?
> > > > 
> > > > I have to do it for tasks in root cgroups, but it shouldn't be a common 
> > > > case.
> > > 
> > > How come? I can easily imagine a setup where only some memcgs really
> > > do need a kill-all semantic while all others can live with a single
> > > task killed perfectly fine.
> > 
> > I mean, taking the unified cgroup hierarchy into account, there should not
> > be a lot of tasks in the root cgroup, if any.
> 
> Is that really the case? I would assume that the memory controller would be
> enabled only in those subtrees which really use the functionality and
> the rest will be sitting in the root memcg. It might be the case if you
> are running only containers but I am not really sure this is true in
> general.

Agreed.


Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-24 Thread Michal Hocko
On Thu 24-08-17 14:58:42, Roman Gushchin wrote:
> On Thu, Aug 24, 2017 at 02:58:11PM +0200, Michal Hocko wrote:
> > On Thu 24-08-17 13:28:46, Roman Gushchin wrote:
> > > Hi Michal!
> > > 
> > There is nothing like a "better victim". We are pretty much in a
> > catastrophic situation when we try to survive by killing userspace.
> 
> Not necessarily; it can be a cgroup OOM.

memcg OOM is no different. The catastrophe is scoped to the specific
hierarchy, but tasks in that hierarchy still fail to make further
progress.

> > We try to kill the largest because that assumes that we return the
> > most memory from it. Now I do understand that you want to treat the
> > memcg as a single killable entity but I find it really questionable
> > to do a per-memcg metric and then not treat it like that and kill
> > only a single task. Just imagine a single memcg with zillions of tasks,
> > each very small, and you select it as the largest while killing a small
> > task itself doesn't help to get us out of the OOM.
> 
> I don't think it's different from a non-containerized state: if you
> have a zillion small tasks in the system, you'll meet the same issues.

Yes this is possible but usually you are comparing apples to apples so
you will kill the largest offender and then go on. To be honest I really
do hate how we try to kill a child rather than the selected victim
for the same reason.

> > > > I guess I have asked already and we haven't reached any consensus. I do
> > > > not like how you treat memcgs and tasks differently. Why cannot we have
> > > > a memcg score be the sum of all its tasks?
> > > 
> > > It sounds like a more expensive way to get almost the same with less
> > > accuracy. Why is it better?
> > 
> > because then you are comparing apples to apples?
> 
> Well, I can say that I compare some number of pages against some other
> number of pages. And the relation between a page and a memcg is more
> obvious than the relation between a page and a process.

But you are comparing different accounting systems.
 
> Both ways are not ideal, and the sum over processes is not ideal either,
> especially if you take oom_score_adj into account. Will you respect it?

Yes, and I do not see any reason why we shouldn't.

> I've actually started with such an approach, but then found it weird.
> 
> > Besides that you have
> > to check each task for over-killing anyway. So I do not see any
> > performance merits here.
> 
> It's an implementation detail, and we can hopefully get rid of it at some 
> point.

Well, we might do some estimations and ignore oom scopes but that
sounds really complicated and error prone. Unless we have anything like
that then I would start from tasks and build up what is necessary to make
a decision at the higher level.
 
> > > > How do you want to compare a memcg score with a task score?
> > > 
> > > I have to do it for tasks in root cgroups, but it shouldn't be a common 
> > > case.
> > 
> > How come? I can easily imagine a setup where only some memcgs really
> > do need a kill-all semantic while all others can live with a single
> > task killed perfectly fine.
> 
> I mean, taking the unified cgroup hierarchy into account, there should not
> be a lot of tasks in the root cgroup, if any.

Is that really the case? I would assume that the memory controller would be
enabled only in those subtrees which really use the functionality and
the rest will be sitting in the root memcg. It might be the case if you
are running only containers but I am not really sure this is true in
general.
-- 
Michal Hocko
SUSE Labs


Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-24 Thread Roman Gushchin
On Thu, Aug 24, 2017 at 02:58:11PM +0200, Michal Hocko wrote:
> On Thu 24-08-17 13:28:46, Roman Gushchin wrote:
> > Hi Michal!
> > 
> There is nothing like a "better victim". We are pretty much in a
> catastrophic situation when we try to survive by killing a userspace.

Not necessarily; it can be a cgroup OOM.

> We try to kill the largest because that assumes that we return the
> most memory from it. Now I do understand that you want to treat the
> memcg as a single killable entity but I find it really questionable
> to do a per-memcg metric and then not treat it like that and kill
> only a single task. Just imagine a single memcg with zillions of tasks,
> each very small, and you select it as the largest while killing a small
> task itself doesn't help to get us out of the OOM.

I don't think it's different from a non-containerized state: if you
have a zillion small tasks in the system, you'll meet the same issues.

> > > I guess I have asked already and we haven't reached any consensus. I do
> > > not like how you treat memcgs and tasks differently. Why cannot we have
> > > a memcg score be the sum of all its tasks?
> > 
> > It sounds like a more expensive way to get almost the same with less
> > accuracy. Why is it better?
> 
> because then you are comparing apples to apples?

Well, I can say that I compare some number of pages against some other
number of pages. And the relation between a page and a memcg is more
obvious than the relation between a page and a process.

Both ways are not ideal, and the sum over processes is not ideal either,
especially if you take oom_score_adj into account. Will you respect it?

I've actually started with such an approach, but then found it weird.

> Besides that you have
> to check each task for over-killing anyway. So I do not see any
> performance merits here.

It's an implementation detail, and we can hopefully get rid of it at some point.

> 
> > > How do you want to compare a memcg score with a task score?
> > 
> > I have to do it for tasks in root cgroups, but it shouldn't be a common 
> > case.
> 
> How come? I can easily imagine a setup where only some memcgs really
> do need a kill-all semantic while all others can live with a single
> task killed perfectly fine.

I mean, taking the unified cgroup hierarchy into account, there should not
be a lot of tasks in the root cgroup, if any.


Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-24 Thread Michal Hocko
On Thu 24-08-17 13:28:46, Roman Gushchin wrote:
> Hi Michal!
> 
> On Thu, Aug 24, 2017 at 01:47:06PM +0200, Michal Hocko wrote:
> > This doesn't apply on top of mmotm cleanly. You are missing
> > http://lkml.kernel.org/r/20170807113839.16695-3-mho...@kernel.org
> 
> I'll rebase. Thanks!
> 
> > 
> > On Wed 23-08-17 17:51:59, Roman Gushchin wrote:
> > > Traditionally, the OOM killer operates on a process level.
> > > Under oom conditions, it finds the process with the highest oom score
> > > and kills it.
> > > 
> > > This behavior doesn't suit systems with many running
> > > containers well:
> > > 
> > > 1) There is no fairness between containers. A small container with
> > > a few large processes will be chosen over a large one with a huge
> > > number of small processes.
> > > 
> > > 2) Containers often do not expect that some random process inside
> > > will be killed. In many cases a much safer behavior is to kill
> > > all tasks in the container. Traditionally, this was implemented
> > > in userspace, but doing it in the kernel has some advantages,
> > > especially in the case of a system-wide OOM.
> > > 
> > > 3) Per-process oom_score_adj affects global OOM, so it's a breach
> > > in the isolation.
> > 
> > Please explain more. I guess you mean that an untrusted memcg could hide
> > itself from the global OOM killer by reducing the oom scores? Well you
> > need CAP_SYS_RESOURCE to reduce the current oom_score{_adj} as David has
> > already pointed out. I also agree that we absolutely must not kill an
> > oom disabled task. I am pretty sure somebody is using OOM_SCORE_ADJ_MIN
> > as a protection from an untrusted SIGKILL and inconsistent state as a
> > result. Those applications simply shouldn't behave differently in the
> > global and container contexts.
> 
> The main point of the kill_all option is to clean up the victim cgroup
> _completely_. If some tasks can survive, that means userspace should
> take care of them, look at the cgroup after oom, and kill the survivors
> manually.
> 
> If you want to rely on OOM_SCORE_ADJ_MIN, don't set kill_all.
> I really don't get the use case for this "kill all, except this and that".

OOM_SCORE_ADJ_MIN has become a de-facto contract. You cannot simply
expect that somebody would alter a specific workload for a container
just to be safe against unexpected SIGKILL. kill-all might be set up in
the memcg hierarchy, which is out of your control.

> Also, it's really confusing to respect the -1000 value and completely
> ignore -999.
> 
> I believe that any complex userspace OOM handling should use memory.high
> and handle memory shortage before an actual OOM.
> 
> > 
> > If nothing else we have to skip OOM_SCORE_ADJ_MIN tasks during the kill.
> > 
> > > To address these issues, a cgroup-aware OOM killer is introduced.
> > > 
> > > Under OOM conditions, it tries to find the biggest memory consumer,
> > > and free memory by killing the corresponding task(s). The difference
> > > from the "traditional" OOM killer is that it can treat memory cgroups
> > > as memory consumers as well as single processes.
> > > 
> > > By default, it will look for the biggest leaf cgroup, and kill
> > > the largest task inside.
> > 
> > Why? I believe that the semantic should be as simple as kill the largest
> > oom killable entity. And the entity is either a process or a memcg which
> > is marked that way.
> 
> So, you still need to compare memcgs and processes.
> 
> In my case, it's more like an exception (only processes from the root memcg,
> and only if there are no eligible cgroups with lower oom_priority).
> You suggest relying on this comparison.
> 
> > Why should we mix things and select a memcg to kill
> > a process inside it? More on that below.
> 
> To have some sort of "fairness" in a containerized environment.
> Say, one cgroup with one big task, another cgroup with many smaller tasks.
> It's not necessarily true that the first one is a better victim.

There is nothing like a "better victim". We are pretty much in a
catastrophic situation when we try to survive by killing userspace.
We try to kill the largest because that assumes that we return the
most memory from it. Now I do understand that you want to treat the
memcg as a single killable entity but I find it really questionable
to do a per-memcg metric and then not treat it like that and kill
only a single task. Just imagine a single memcg with zillions of tasks,
each very small, and you select it as the largest while killing a small
task itself doesn't help to get us out of the OOM.
 
> > > But a user can change this behavior by enabling the per-cgroup
> > > oom_kill_all_tasks option. If set, it causes the OOM killer to treat
> > > the whole cgroup as an indivisible memory consumer. If it is
> > > selected as an OOM victim, all belonging tasks will be killed.
> > > 
> > > Tasks in the root cgroup are treated as independent memory consumers,
> > > and are compared with other memory consumers (e.g. leaf cgroups).
> > > The root cgroup doesn't support the oom_kill_all_tasks feature.

Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-24 Thread Roman Gushchin
Hi Michal!

On Thu, Aug 24, 2017 at 01:47:06PM +0200, Michal Hocko wrote:
> This doesn't apply on top of mmotm cleanly. You are missing
> http://lkml.kernel.org/r/20170807113839.16695-3-mho...@kernel.org

I'll rebase. Thanks!

> 
> On Wed 23-08-17 17:51:59, Roman Gushchin wrote:
> > Traditionally, the OOM killer operates on a process level.
> > Under oom conditions, it finds the process with the highest oom score
> > and kills it.
> > 
> > This behavior doesn't suit systems with many running
> > containers well:
> > 
> > 1) There is no fairness between containers. A small container with
> > a few large processes will be chosen over a large one with a huge
> > number of small processes.
> > 
> > 2) Containers often do not expect that some random process inside
> > will be killed. In many cases a much safer behavior is to kill
> > all tasks in the container. Traditionally, this was implemented
> > in userspace, but doing it in the kernel has some advantages,
> > especially in the case of a system-wide OOM.
> > 
> > 3) Per-process oom_score_adj affects global OOM, so it's a breach
> > in the isolation.
> 
> Please explain more. I guess you mean that an untrusted memcg could hide
> itself from the global OOM killer by reducing the oom scores? Well you
> need CAP_SYS_RESOURCE to reduce the current oom_score{_adj} as David has
> already pointed out. I also agree that we absolutely must not kill an
> oom disabled task. I am pretty sure somebody is using OOM_SCORE_ADJ_MIN
> as a protection from an untrusted SIGKILL and inconsistent state as a
> result. Those applications simply shouldn't behave differently in the
> global and container contexts.

The main point of the kill_all option is to clean up the victim cgroup
_completely_. If some tasks can survive, that means userspace should
take care of them, look at the cgroup after oom, and kill the survivors
manually.

If you want to rely on OOM_SCORE_ADJ_MIN, don't set kill_all.
I really don't get the use case for this "kill all, except this and that".

Also, it's really confusing to respect the -1000 value and completely ignore -999.

I believe that any complex userspace OOM handling should use memory.high
and handle memory shortage before an actual OOM.

> 
> If nothing else we have to skip OOM_SCORE_ADJ_MIN tasks during the kill.
> 
> > To address these issues, a cgroup-aware OOM killer is introduced.
> > 
> > Under OOM conditions, it tries to find the biggest memory consumer,
> > and free memory by killing the corresponding task(s). The difference
> > from the "traditional" OOM killer is that it can treat memory cgroups
> > as memory consumers as well as single processes.
> > 
> > By default, it will look for the biggest leaf cgroup, and kill
> > the largest task inside.
> 
> Why? I believe that the semantic should be as simple as kill the largest
> oom killable entity. And the entity is either a process or a memcg which
> is marked that way.

So, you still need to compare memcgs and processes.

In my case, it's more like an exception (only processes from the root memcg,
and only if there are no eligible cgroups with lower oom_priority).
You suggest relying on this comparison.

> Why should we mix things and select a memcg to kill
> a process inside it? More on that below.

To have some sort of "fairness" in a containerized environment.
Say, one cgroup with one big task, another cgroup with many smaller tasks.
It's not necessarily true that the first one is a better victim.

> 
> > But a user can change this behavior by enabling the per-cgroup
> > oom_kill_all_tasks option. If set, it causes the OOM killer to treat
> > the whole cgroup as an indivisible memory consumer. If it is
> > selected as an OOM victim, all belonging tasks will be killed.
> > 
> > Tasks in the root cgroup are treated as independent memory consumers,
> > and are compared with other memory consumers (e.g. leaf cgroups).
> > The root cgroup doesn't support the oom_kill_all_tasks feature.
> 
> If anything you wouldn't have to treat the root memcg specially. It
> will be like any other memcg which doesn't have oom_kill_all_tasks...
>  
> [...]
> 
> > +static long memcg_oom_badness(struct mem_cgroup *memcg,
> > + const nodemask_t *nodemask)
> > +{
> > +   long points = 0;
> > +   int nid;
> > +   pg_data_t *pgdat;
> > +
> > +   for_each_node_state(nid, N_MEMORY) {
> > +   if (nodemask && !node_isset(nid, *nodemask))
> > +   continue;
> > +
> > +   points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> > +   LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> > +
> > +   pgdat = NODE_DATA(nid);
> > +   points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> > +   NR_SLAB_UNRECLAIMABLE);
> > +   }
> > +
> > +   points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> > +   (PAGE_SIZE / 1024);
> > +   points += memcg_page_state(memcg, MEMCG_SOCK);
> > +   points += memcg_page_state(memcg, MEMCG_SWAP);

Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-24 Thread Michal Hocko
This doesn't apply on top of mmotm cleanly. You are missing
http://lkml.kernel.org/r/20170807113839.16695-3-mho...@kernel.org

On Wed 23-08-17 17:51:59, Roman Gushchin wrote:
> Traditionally, the OOM killer operates on a process level.
> Under oom conditions, it finds the process with the highest oom score
> and kills it.
> 
> This behavior doesn't suit systems with many running
> containers well:
> 
> 1) There is no fairness between containers. A small container with
> a few large processes will be chosen over a large one with a huge
> number of small processes.
> 
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases a much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in the case of a system-wide OOM.
> 
> 3) Per-process oom_score_adj affects global OOM, so it's a breach
> in the isolation.

Please explain more. I guess you mean that an untrusted memcg could hide
itself from the global OOM killer by reducing the oom scores? Well you
need CAP_SYS_RESOURCE to reduce the current oom_score{_adj} as David has
already pointed out. I also agree that we absolutely must not kill an
oom disabled task. I am pretty sure somebody is using OOM_SCORE_ADJ_MIN
as a protection from an untrusted SIGKILL and inconsistent state as a
result. Those applications simply shouldn't behave differently in the
global and container contexts.

If nothing else we have to skip OOM_SCORE_ADJ_MIN tasks during the kill.

> To address these issues, a cgroup-aware OOM killer is introduced.
> 
> Under OOM conditions, it tries to find the biggest memory consumer,
> and free memory by killing the corresponding task(s). The difference
> from the "traditional" OOM killer is that it can treat memory cgroups
> as memory consumers as well as single processes.
> 
> By default, it will look for the biggest leaf cgroup, and kill
> the largest task inside.

Why? I believe that the semantic should be as simple as kill the largest
oom killable entity. And the entity is either a process or a memcg which
is marked that way. Why should we mix things and select a memcg to kill
a process inside it? More on that below.

> But a user can change this behavior by enabling the per-cgroup
> oom_kill_all_tasks option. If set, it causes the OOM killer to treat
> the whole cgroup as an indivisible memory consumer. If it is
> selected as an OOM victim, all belonging tasks will be killed.
> 
> Tasks in the root cgroup are treated as independent memory consumers,
> and are compared with other memory consumers (e.g. leaf cgroups).
> The root cgroup doesn't support the oom_kill_all_tasks feature.

If anything you wouldn't have to treat the root memcg specially. It
will be like any other memcg which doesn't have oom_kill_all_tasks...
 
[...]

> +static long memcg_oom_badness(struct mem_cgroup *memcg,
> +   const nodemask_t *nodemask)
> +{
> + long points = 0;
> + int nid;
> + pg_data_t *pgdat;
> +
> + for_each_node_state(nid, N_MEMORY) {
> + if (nodemask && !node_isset(nid, *nodemask))
> + continue;
> +
> + points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> +
> + pgdat = NODE_DATA(nid);
> + points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> + NR_SLAB_UNRECLAIMABLE);
> + }
> +
> + points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> + (PAGE_SIZE / 1024);
> + points += memcg_page_state(memcg, MEMCG_SOCK);
> + points += memcg_page_state(memcg, MEMCG_SWAP);
> +
> + return points;

I guess I have asked already and we haven't reached any consensus. I do
not like how you treat memcgs and tasks differently. Why cannot we have
a memcg score be the sum of all its tasks? How do you want to compare a
memcg score with a task score? This just smells like the outcome of the
weird semantic of selecting the largest group that I have mentioned
above.

This is a rather fundamental concern and I believe we should find a
consensus on it before going any further. I believe that users shouldn't
see any difference in the OOM behavior when memcg v2 is used and there
is no kill-all memcg. If there is such a memcg then we should treat only
those specially. But you might have really strong use cases which haven't
been presented or I've missed their importance.
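
To be concrete, the sum-of-tasks variant could sit on top of the
existing iterator; a sketch, assuming the mem_cgroup_scan_tasks() and
oom_badness() prototypes of this era, and not actual patch code:

/*
 * Sketch only: score a memcg as the sum of its tasks' badness so that
 * memcgs and standalone tasks stay within one accounting system.
 * Caveat (raised elsewhere in this thread): this iterates tasks, so
 * threads sharing a single mm would each contribute the full badness.
 */
struct badness_sum {
	const nodemask_t *nodemask;
	unsigned long totalpages;
	unsigned long points;
};

static int add_task_badness(struct task_struct *task, void *arg)
{
	struct badness_sum *sum = arg;

	sum->points += oom_badness(task, NULL, sum->nodemask,
				   sum->totalpages);
	return 0;	/* continue iterating */
}

static unsigned long memcg_sum_task_badness(struct mem_cgroup *memcg,
					    const nodemask_t *nodemask,
					    unsigned long totalpages)
{
	struct badness_sum sum = {
		.nodemask = nodemask,
		.totalpages = totalpages,
	};

	mem_cgroup_scan_tasks(memcg, add_task_badness, &sum);
	return sum.points;
}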

-- 
Michal Hocko
SUSE Labs


Re: [v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-23 Thread David Rientjes
On Wed, 23 Aug 2017, Roman Gushchin wrote:

> Traditionally, the OOM killer operates on a process level.
> Under oom conditions, it finds the process with the highest oom score
> and kills it.
> 
> This behavior doesn't suit systems with many running
> containers well:
> 
> 1) There is no fairness between containers. A small container with
> a few large processes will be chosen over a large one with a huge
> number of small processes.
> 
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases a much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in the case of a system-wide OOM.
> 
> 3) Per-process oom_score_adj affects global OOM, so it's a breach
> in the isolation.
> 
> To address these issues, a cgroup-aware OOM killer is introduced.
> 
> Under OOM conditions, it tries to find the biggest memory consumer,
> and free memory by killing the corresponding task(s). The difference
> from the "traditional" OOM killer is that it can treat memory cgroups
> as memory consumers as well as single processes.
> 
> By default, it will look for the biggest leaf cgroup, and kill
> the largest task inside.
> 
> But a user can change this behavior by enabling the per-cgroup
> oom_kill_all_tasks option. If set, it causes the OOM killer to treat
> the whole cgroup as an indivisible memory consumer. If it is
> selected as an OOM victim, all belonging tasks will be killed.
> 

I'm very happy with the rest of the patchset, but I feel that I must renew 
my objection to memory.oom_kill_all_tasks being able to override the 
admin's decision to set a process oom disabled.  From my 
perspective, setting memory.oom_kill_all_tasks with an oom disabled 
process attached that now becomes killable either (1) overrides the 
CAP_SYS_RESOURCE oom disabled setting or (2) is lazy and doesn't modify 
/proc/pid/oom_score_adj itself.

I'm not sure what is objectionable about allowing 
memory.oom_kill_all_tasks to coexist with oom disabled processes.  Just 
kill everything else so that the oom disabled process can report the oom 
condition after notification, restart the task, etc.  If it's problematic, 
then whoever declares that everything must be killed shall also modify 
/proc/pid/oom_score_adj of oom disabled processes.  If it doesn't have 
permission to change that, then I think there's a much larger concern.

> Tasks in the root cgroup are treated as independent memory consumers,
> and are compared with other memory consumers (e.g. leaf cgroups).
> The root cgroup doesn't support the oom_kill_all_tasks feature.


[v6 2/4] mm, oom: cgroup-aware OOM killer

2017-08-23 Thread Roman Gushchin
Traditionally, the OOM killer operates on a process level.
Under oom conditions, it finds the process with the highest oom score
and kills it.

This behavior doesn't suit systems with many running
containers well:

1) There is no fairness between containers. A small container with
a few large processes will be chosen over a large one with a huge
number of small processes.

2) Containers often do not expect that some random process inside
will be killed. In many cases a much safer behavior is to kill
all tasks in the container. Traditionally, this was implemented
in userspace, but doing it in the kernel has some advantages,
especially in the case of a system-wide OOM.

3) Per-process oom_score_adj affects global OOM, so it's a breach
in the isolation.

To address these issues, a cgroup-aware OOM killer is introduced.

Under OOM conditions, it tries to find the biggest memory consumer,
and free memory by killing the corresponding task(s). The difference
from the "traditional" OOM killer is that it can treat memory cgroups
as memory consumers as well as single processes.

By default, it will look for the biggest leaf cgroup, and kill
the largest task inside.

But a user can change this behavior by enabling the per-cgroup
oom_kill_all_tasks option. If set, it causes the OOM killer to treat
the whole cgroup as an indivisible memory consumer. If it is
selected as an OOM victim, all belonging tasks will be killed.

Tasks in the root cgroup are treated as independent memory consumers,
and are compared with other memory consumers (e.g. leaf cgroups).
The root cgroup doesn't support the oom_kill_all_tasks feature.

Signed-off-by: Roman Gushchin 
Cc: Michal Hocko 
Cc: Vladimir Davydov 
Cc: Johannes Weiner 
Cc: Tetsuo Handa 
Cc: David Rientjes 
Cc: Tejun Heo 
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/memcontrol.h |  33 +++
 include/linux/oom.h        |  12 ++-
 mm/memcontrol.c            | 242 +
 mm/oom_kill.c              |  92 ++---
 4 files changed, 364 insertions(+), 15 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8556f1b86d40..c57ee47c35bb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -35,6 +35,7 @@ struct mem_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct oom_control;
 
 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
@@ -199,6 +200,12 @@ struct mem_cgroup {
/* OOM-Killer disable */
int oom_kill_disable;
 
+   /* kill all tasks in the subtree in case of OOM */
+   bool oom_kill_all;
+
+   /* cached OOM score */
+   long oom_score;
+
/* handle for "memory.events" */
struct cgroup_file events_file;
 
@@ -342,6 +349,11 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
return css ? container_of(css, struct mem_cgroup, css) : NULL;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+   css_put(&memcg->css);
+}
+
 #define mem_cgroup_from_counter(counter, member)   \
container_of(counter, struct mem_cgroup, member)
 
@@ -480,6 +492,13 @@ static inline bool task_in_memcg_oom(struct task_struct *p)
 
 bool mem_cgroup_oom_synchronize(bool wait);
 
+bool mem_cgroup_select_oom_victim(struct oom_control *oc);
+
+static inline bool mem_cgroup_oom_kill_all(struct mem_cgroup *memcg)
+{
+   return memcg->oom_kill_all;
+}
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -743,6 +762,10 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
return true;
 }
 
+static inline void mem_cgroup_put(struct mem_cgroup *memcg)
+{
+}
+
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
@@ -930,6 +953,16 @@ static inline
 void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 {
 }
+
+static inline bool mem_cgroup_select_oom_victim(struct oom_control *oc)
+{
+   return false;
+}
+
+static inline bool mem_cgroup_oom_kill_all(struct mem_cgroup *memcg)
+{
+   return false;
+}
 #endif /* CONFIG_MEMCG */
 
 /* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 8a266e2be5a6..344ccb85eb74 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -7,6 +7,13 @@
 #include 
 #include 
 
+
+/*
+ * Special value returned by victim selection functions to indicate
+ * that there are inflight OOM victims.
+ */
+#define INFLIGHT_VICTIM ((void *)-1UL)
+
 struct zonelist;
 struct notifier_block;
 struct mem_cgroup;
@@ -37,7 +44,8 @@ struct oom_control {
 
/* Used by oom