Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-06-05 Thread Michal Hocko
On Fri 02-06-17 16:18:52, Roman Gushchin wrote:
> On Fri, Jun 02, 2017 at 10:43:33AM +0200, Michal Hocko wrote:
> > On Wed 31-05-17 14:01:45, Johannes Weiner wrote:
> > > On Wed, May 31, 2017 at 06:25:04PM +0200, Michal Hocko wrote:
> > > > > > +   /*
> > > > > >  * If current has a pending SIGKILL or is exiting, then automatically
> > > > > >  * select it.  The goal is to allow it to allocate so that it may
> > > > > >  * quickly exit and free its memory.
> > > > > > 
> > > > > > Please note that I haven't explored how much of the infrastructure
> > > > > > needed for the OOM decision making is available to modules. But we can
> > > > > > export a lot of what we currently have in oom_kill.c. I admit it might
> > > > > > turn out that this is simply not feasible but I would like this to be at
> > > > > > least explored before we go and implement yet another hardcoded way to
> > > > > > handle (see how I didn't use policy ;)) OOM situation.
> > > > > 
> > > > > ;)
> > > > > 
> > > > > My doubt here is mainly that we'll see many (or any) real-life cases
> > > > > materialize that cannot be handled with cgroups and scoring. These are
> > > > > powerful building blocks on which userspace can implement all kinds of
> > > > > policy and sorting algorithms.
> > > > > 
> > > > > So this seems like a lot of churn and complicated code to handle one
> > > > > extension. An extension that implements basic functionality.
> > > > 
> > > > Well, as I've said I didn't get to explore this path so I have only a
> > > > very vague idea what we would have to export to implement e.g. the
> > > > proposed oom killing strategy suggested in this thread. Unfortunately I
> > > > do not have much time for that. I do not want to block a useful work
> > > > which you have a usecase for but I would be really happy if we could
> > > > consider longer term plans before diving into a "hardcoded"
> > > > implementation. We didn't do that previously and we are left with
> > > > oom_kill_allocating_task and similar one off things.
> > > 
> > > As I understand it, killing the allocating task was simply the default
> > > before the OOM killer and was added as a compat knob. I really doubt
> > > anybody is using it at this point, and we could probably delete it.
> > 
> > I might misremember but my recollection is that SGI simply had too
> > large machines with too many processes and so the task selection was
> > very expensive.
> 
> Cgroup-aware OOM killer can be much better in case of a large number of processes,
> as we don't have to iterate over all processes locking each mm, and
> can select an appropriate cgroup based mostly on lockless counters.
> Of course, it depends on the concrete setup, but it can be much more efficient
> under the right circumstances.

Yes, I agree with that.

> > > I appreciate your concern of being too short-sighted here, but the
> > > fact that I cannot point to more usecases isn't for lack of trying. I
> > > simply don't see the endless possibilities of usecases that you do.
> > > 
> > > It's unlikely for more types of memory domains to pop up besides MMs
> > > and cgroups. (I mentioned vmas, but that just seems esoteric. And we
> > > have panic_on_oom for whole-system death. What else could there be?)
> > > 
> > > And as I pointed out, there is no real evidence that the current
> > > system for configuring preferences isn't sufficient in practice.
> > > 
> > > Those are my thoughts on exploring. I'm not sure what else to do before
> > > it feels like running off into fairly contrived hypotheticals.
> > 
> > Yes, I do not want hypotheticals to block an otherwise useful feature,
> > of course. But I haven't heard a strong argument why a module based
> > approach would be a bigger maintenance burden long term. From a very quick
> > glance over patches Roman has posted yesterday it seems that a large
> > part of the existing oom infrastructure can be reused reasonably.
> 
> I have nothing against a module based approach, but I don't think that a module
> should implement anything other than the oom score calculation
> (for a process and a cgroup).
> Maybe only some custom method for killing, but I can't really imagine anything
> reasonable except killing one "worst" process or killing whole cgroup(s).
> In case of a system-wide OOM, we have to free some memory quickly,
> and this means we can't do anything much more complex
> than killing some process(es).
> 
> So, in my understanding, what you're suggesting is not against the proposed
> approach at all. We still need to iterate over cgroups, somehow define
> their badness, find the worst one and destroy it. In my v2 I've tried
> to separate these two potentially customizable areas into two simple functions:
> mem_cgroup_oom_badness() and mem_cgroup_kill_oom_victim().

As I've said, I didn't get to look closer at your v2 yet. My point was
that we shouldn't hardcode the memcg specific 

Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-06-02 Thread Roman Gushchin
On Fri, Jun 02, 2017 at 10:43:33AM +0200, Michal Hocko wrote:
> On Wed 31-05-17 14:01:45, Johannes Weiner wrote:
> > On Wed, May 31, 2017 at 06:25:04PM +0200, Michal Hocko wrote:
> > > > > + /*
> > > > >    * If current has a pending SIGKILL or is exiting, then automatically
> > > > >    * select it.  The goal is to allow it to allocate so that it may
> > > > >    * quickly exit and free its memory.
> > > > > 
> > > > > Please note that I haven't explored how much of the infrastructure
> > > > > needed for the OOM decision making is available to modules. But we can
> > > > > export a lot of what we currently have in oom_kill.c. I admit it might
> > > > > turn out that this is simply not feasible but I would like this to be at
> > > > > least explored before we go and implement yet another hardcoded way to
> > > > > handle (see how I didn't use policy ;)) OOM situation.
> > > > 
> > > > ;)
> > > > 
> > > > My doubt here is mainly that we'll see many (or any) real-life cases
> > > > materialize that cannot be handled with cgroups and scoring. These are
> > > > powerful building blocks on which userspace can implement all kinds of
> > > > policy and sorting algorithms.
> > > > 
> > > > So this seems like a lot of churn and complicated code to handle one
> > > > extension. An extension that implements basic functionality.
> > > 
> > > Well, as I've said I didn't get to explore this path so I have only a
> > > very vague idea what we would have to export to implement e.g. the
> > > proposed oom killing strategy suggested in this thread. Unfortunately I
> > > do not have much time for that. I do not want to block a useful work
> > > which you have a usecase for but I would be really happy if we could
> > > consider longer term plans before diving into a "hardcoded"
> > > implementation. We didn't do that previously and we are left with
> > > oom_kill_allocating_task and similar one off things.
> > 
> > As I understand it, killing the allocating task was simply the default
> > before the OOM killer and was added as a compat knob. I really doubt
> > anybody is using it at this point, and we could probably delete it.
> 
> I might misremember but my recollection is that SGI simply had too
> large machines with too many processes and so the task selection was
> very expensive.

Cgroup-aware OOM killer can be much better in case of a large number of processes,
as we don't have to iterate over all processes locking each mm, and
can select an appropriate cgroup based mostly on lockless counters.
Of course, it depends on the concrete setup, but it can be much more efficient
under the right circumstances.

> 
> > I appreciate your concern of being too short-sighted here, but the
> > fact that I cannot point to more usecases isn't for lack of trying. I
> > simply don't see the endless possibilities of usecases that you do.
> > 
> > It's unlikely for more types of memory domains to pop up besides MMs
> > and cgroups. (I mentioned vmas, but that just seems esoteric. And we
> > have panic_on_oom for whole-system death. What else could there be?)
> > 
> > And as I pointed out, there is no real evidence that the current
> > system for configuring preferences isn't sufficient in practice.
> > 
> > Those are my thoughts on exploring. I'm not sure what else to do before
> > it feels like running off into fairly contrived hypotheticals.
> 
> Yes, I do not want hypotheticals to block an otherwise useful feature,
> of course. But I haven't heard a strong argument why a module based
> approach would be a bigger maintenance burden long term. From a very quick
> glance over patches Roman has posted yesterday it seems that a large
> part of the existing oom infrastructure can be reused reasonably.

I have nothing against a module based approach, but I don't think that a module
should implement anything other than the oom score calculation
(for a process and a cgroup).
Maybe only some custom method for killing, but I can't really imagine anything
reasonable except killing one "worst" process or killing whole cgroup(s).
In case of a system-wide OOM, we have to free some memory quickly,
and this means we can't do anything much more complex
than killing some process(es).

So, in my understanding, what you're suggesting is not against the proposed
approach at all. We still need to iterate over cgroups, somehow define
their badness, find the worst one and destroy it. In my v2 I've tried
to separate these two potentially customizable areas into two simple functions:
mem_cgroup_oom_badness() and mem_cgroup_kill_oom_victim().
So we can add an ability to customize these functions (and similar stuff
for processes) if we see some real examples where the proposed
functionality is insufficient.
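A rough userspace sketch of that split might look as follows. The cgroup names, the usage numbers, and the badness formula are invented for illustration; mem_cgroup_oom_badness() and mem_cgroup_kill_oom_victim() are the kernel-side hooks this merely mimics, not code this actually calls.

```python
# Toy model of the two customization points described above: a badness
# function over memory cgroups and a kill step applied to the worst one.
# The tree, the counters, and the scoring formula are illustrative
# assumptions, not the actual kernel implementation.

def memcg_oom_badness(cg):
    """Score a cgroup from its (lockless-counter style) usage numbers."""
    return cg["usage"] + cg["swap"]

def select_victim_memcg(cgroups):
    """Pick the leaf cgroup with the highest badness."""
    return max(cgroups, key=memcg_oom_badness)

def kill_memcg(victim, cgroups):
    """Model of mem_cgroup_kill_oom_victim(): remove the whole group."""
    cgroups.remove(victim)
    return victim["name"]

cgroups = [
    {"name": "/job-a", "usage": 400, "swap": 0},
    {"name": "/job-b", "usage": 900, "swap": 100},  # the runaway group
    {"name": "/job-c", "usage": 300, "swap": 50},
]
killed = kill_memcg(select_victim_memcg(cgroups), cgroups)
print(killed)  # → /job-b
```

The point of the split is visible even in the toy: only memcg_oom_badness() and kill_memcg() would need swapping to change the policy, while the selection loop stays fixed.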

Do you have any examples which can't be covered by this approach?

Thanks!

Roman
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org

Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-06-02 Thread Michal Hocko
On Wed 31-05-17 14:01:45, Johannes Weiner wrote:
> On Wed, May 31, 2017 at 06:25:04PM +0200, Michal Hocko wrote:
> > On Thu 25-05-17 13:08:05, Johannes Weiner wrote:
> > > Everything the user would want to dynamically program in the kernel,
> > > say with bpf, they could do in userspace and then update the scores
> > > for each group and task periodically.
> > 
> > I am rather skeptical about dynamic scores. oom_{score_}adj has turned
> > to mere oom disable/enable knobs from my experience.
> 
> That doesn't necessarily have to be a deficiency with the scoring
> system. I suspect that most people simply don't care as long as
> the picks for OOM victims aren't entirely stupid.
> 
> For example, we have a lot of machines that run one class of job. If
> we run OOM there isn't much preference we'd need to express; just kill
> one job - the biggest, whatever - and move on. (The biggest makes
> sense because if all jobs are basically equal it's as good as any
> other victim, but if one has a runaway bug it goes for that.)
> 
> Where we have more than one job class, it actually is mostly one hipri
> and one lopri, in which case setting a hard limit on the lopri or the
> -1000 OOM score trick is enough.
> 
> How many systems run more than two clearly distinguishable classes of
> workloads concurrently?

What about those which run different containers on a large physical
machine?

> I'm sure they exist. I'm just saying it doesn't surprise me that
> elaborate OOM scoring isn't all that wide-spread.
> 
> > > The only limitation is that you have to recalculate and update the
> > > scoring tree every once in a while, whereas a bpf program could
> > > evaluate things just-in-time. But for that to matter in practice, OOM
> > > kills would have to be a fairly hot path.
> > 
> > I am not really sure how to reliably implement "kill the memcg with the
> > largest process" strategy. And who knows how many others strategies will
> > pop out.
> 
> That seems fairly contrived.
> 
> What does it mean to divide memory into subdomains, but when you run
> out of physical memory you kill based on biggest task?

Well, the biggest task might be the runaway one and so killing it first
before you kill other innocent ones makes some sense to me.

> Sure, it frees memory and gets the system going again, so it's as good
> as any answer to overcommit gone wrong, I guess. But is that something
> you'd intentionally want to express from a userspace perspective?
> 
[...]
> > > > Maybe. But that requires somebody to tweak the scoring which can be hard
> > > > from trivial.
> > > 
> > > Why is sorting and picking in userspace harder than sorting and
> > > picking in the kernel?
> > 
> > Because the userspace score based approach would be much more racy,
> > especially in a busy system. This could lead to unexpected behavior
> > where the OOM killer would kill a memcg other than the run-away one.
> 
> How would it be easier to weigh priority against runaway detection
> inside the kernel?

You have a better chance of catching such a process at the time of the OOM
because you do the check at the moment of the OOM rather than at some point
back in time when your monitor was last able to run and check all the
existing processes (which alone can be rather time consuming, so you do
not want to do that very often).
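The timing argument can be made concrete with a toy timeline (all numbers invented): a monitor that last scored processes before a runaway started feeds the OOM killer a stale picture, while an in-kernel check evaluates usage at the OOM itself.

```python
# Toy timeline of the race described above. A userspace monitor scores
# processes periodically; a process that runs away *after* the last scan
# keeps its stale (low) score, while an at-OOM-time check sees current
# usage. All numbers are made up for illustration.

usage_at_last_scan = {"daemon": 100, "batch": 300}   # monitor's snapshot
usage_at_oom_time  = {"daemon": 2000, "batch": 300}  # daemon ran away later

stale_pick  = max(usage_at_last_scan, key=usage_at_last_scan.get)
actual_pick = max(usage_at_oom_time, key=usage_at_oom_time.get)

print(stale_pick, actual_pick)  # → batch daemon
```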

> > > > +   /*
> > > >  * If current has a pending SIGKILL or is exiting, then automatically
> > > >  * select it.  The goal is to allow it to allocate so that it may
> > > >  * quickly exit and free its memory.
> > > > 
> > > > 
> > > > Please note that I haven't explored how much of the infrastructure
> > > > needed for the OOM decision making is available to modules. But we can
> > > > export a lot of what we currently have in oom_kill.c. I admit it might
> > > > turn out that this is simply not feasible but I would like this to be at
> > > > least explored before we go and implement yet another hardcoded way to
> > > > handle (see how I didn't use policy ;)) OOM situation.
> > > 
> > > ;)
> > > 
> > > My doubt here is mainly that we'll see many (or any) real-life cases
> > > materialize that cannot be handled with cgroups and scoring. These are
> > > powerful building blocks on which userspace can implement all kinds of
> > > policy and sorting algorithms.
> > > 
> > > So this seems like a lot of churn and complicated code to handle one
> > > extension. An extension that implements basic functionality.
> > 
> > Well, as I've said I didn't get to explore this path so I have only a
> > very vague idea what we would have to export to implement e.g. the
> > proposed oom killing strategy suggested in this thread. Unfortunately I
> > do not have much time for that. I do not want to block a useful work
> > which you have a usecase for but I would be really happy if we could
> > consider longer term plans before diving into a "hardcoded"
> > implementation. We didn't do that previously and we are left with
> > 

Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-05-31 Thread Johannes Weiner
On Wed, May 31, 2017 at 06:25:04PM +0200, Michal Hocko wrote:
> On Thu 25-05-17 13:08:05, Johannes Weiner wrote:
> > Everything the user would want to dynamically program in the kernel,
> > say with bpf, they could do in userspace and then update the scores
> > for each group and task periodically.
> 
> I am rather skeptical about dynamic scores. oom_{score_}adj has turned
> to mere oom disable/enable knobs from my experience.

That doesn't necessarily have to be a deficiency with the scoring
system. I suspect that most people simply don't care as long as
the picks for OOM victims aren't entirely stupid.

For example, we have a lot of machines that run one class of job. If
we run OOM there isn't much preference we'd need to express; just kill
one job - the biggest, whatever - and move on. (The biggest makes
sense because if all jobs are basically equal it's as good as any
other victim, but if one has a runaway bug it goes for that.)

Where we have more than one job class, it actually is mostly one hipri
and one lopri, in which case setting a hard limit on the lopri or the
-1000 OOM score trick is enough.
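For reference, the "-1000 trick" boils down to something like the sketch below. The job classes and pids are invented examples; /proc/&lt;pid&gt;/oom_score_adj is the real kernel interface, and -1000 exempts a task from OOM killing.

```python
# Sketch of the "-1000 OOM score trick" for a two-class setup: exempt
# the high-priority job from the OOM killer and leave the low-priority
# one as the default victim. Job classes and pids are invented;
# /proc/<pid>/oom_score_adj is the real kernel knob.

def oom_score_adj_for(job_class):
    """Map a job class to an oom_score_adj value."""
    return -1000 if job_class == "hipri" else 0

def apply_adj(pid, adj):
    """Write the adjustment (needs privileges; not run in this dry run)."""
    with open(f"/proc/{pid}/oom_score_adj", "w") as f:
        f.write(str(adj))

jobs = {101: "hipri", 202: "lopri"}
plan = {pid: oom_score_adj_for(cls) for pid, cls in jobs.items()}
print(plan)  # → {101: -1000, 202: 0}
```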

How many systems run more than two clearly distinguishable classes of
workloads concurrently?

I'm sure they exist. I'm just saying it doesn't surprise me that
elaborate OOM scoring isn't all that wide-spread.

> > The only limitation is that you have to recalculate and update the
> > scoring tree every once in a while, whereas a bpf program could
> > evaluate things just-in-time. But for that to matter in practice, OOM
> > kills would have to be a fairly hot path.
> 
> I am not really sure how to reliably implement "kill the memcg with the
> largest process" strategy. And who knows how many others strategies will
> pop out.

That seems fairly contrived.

What does it mean to divide memory into subdomains, but when you run
out of physical memory you kill based on biggest task?

Sure, it frees memory and gets the system going again, so it's as good
as any answer to overcommit gone wrong, I guess. But is that something
you'd intentionally want to express from a userspace perspective?

> > > > > > > And both kinds of workloads (services/applications and individual
> > > > > > > processes run by users) can co-exist on the same host - consider the
> > > > > > > default systemd setup, for instance.
> > > > > > > 
> > > > > > > IMHO it would be better to give users a choice regarding what they
> > > > > > > really want for a particular cgroup in case of OOM - killing the whole
> > > > > > > cgroup or one of its descendants. For example, we could introduce a
> > > > > > > per-cgroup flag that would tell the kernel whether the cgroup can
> > > > > > > tolerate killing a descendant or not. If it can, the kernel will pick
> > > > > > > the fattest sub-cgroup or process and check it. If it cannot, it will
> > > > > > > kill the whole cgroup and all its processes and sub-cgroups.
> > > > > > 
> > > > > > The last thing we want to do, is to compare processes with cgroups.
> > > > > > I agree, that we can have some option to disable the cgroup-aware OOM at all,
> > > > > > mostly for backward-compatibility. But I don't think it should be a
> > > > > > per-cgroup configuration option, which we will support forever.
> > > > > 
> > > > > I can clearly see a demand for "this is definitely more important
> > > > > container than others so do not kill" usecases. I can also see demand
> > > > > for "do not kill this container running for X days". And more are likely
> > > > > to pop out.
> > > > 
> > > > That can all be done with scoring.
> > > 
> > > Maybe. But that requires somebody to tweak the scoring which can be far
> > > from trivial.
> > 
> > Why is sorting and picking in userspace harder than sorting and
> > picking in the kernel?
> 
> Because the userspace score based approach would be much more racy,
> especially in a busy system. This could lead to unexpected behavior
> where the OOM killer would kill a memcg other than the run-away one.

How would it be easier to weigh priority against runaway detection
inside the kernel?

> > > + /*
> > >* If current has a pending SIGKILL or is exiting, then automatically
> > >* select it.  The goal is to allow it to allocate so that it may
> > >* quickly exit and free its memory.
> > > 
> > > Please note that I haven't explored how much of the infrastructure
> > > needed for the OOM decision making is available to modules. But we can
> > > export a lot of what we currently have in oom_kill.c. I admit it might
> > > turn out that this is simply not feasible but I would like this to be at
> > > least explored before we go and implement yet another hardcoded way to
> > > handle (see how I didn't use policy ;)) OOM situation.
> > 
> > ;)
> > 
> > My doubt here is mainly that we'll see many (or any) real-life cases
> > materialize that cannot be handled with cgroups and 

Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-05-25 Thread Johannes Weiner
On Thu, May 25, 2017 at 05:38:19PM +0200, Michal Hocko wrote:
> On Tue 23-05-17 09:25:44, Johannes Weiner wrote:
> > On Tue, May 23, 2017 at 09:07:47AM +0200, Michal Hocko wrote:
> > > On Mon 22-05-17 18:01:16, Roman Gushchin wrote:
> [...]
> > > > How to react to an OOM is definitely a policy, which depends
> > > > on the workload. Nothing is changing here from how it's working now,
> > > > except that now the kernel will choose a victim cgroup, and kill the
> > > > victim cgroup rather than a process.
> > > 
> > > There is a _big_ difference. The current implementation just tries
> > > to recover from the OOM situation without caring much about the
> > > consequences on the workload. This is the last resort and a service for
> > > the _system_ to get back to a sane state. You are trying to make it more
> > > clever and workload aware, and that is inevitably going to depend on the
> > > specific workload. I really do think we cannot simply hardcode any
> > > policy into the kernel for this purpose and that is why I would like to
> > > see a discussion about how to do that in a more extensible way. This
> > > might be harder to implement now but I believe it will turn out
> > > better long term.
> > 
> > And that's where I still maintain that this isn't really a policy
> > change. Because what this code does ISN'T more clever, and the OOM
> > killer STILL IS a last-resort thing.
> 
> The thing I wanted to point out is that what and how much to kill
> definitely depends on the usecase. We currently kill all tasks which
> share the mm struct because that is the smallest unit that can unpin
> user memory. And that makes a lot of sense to me as a general default.
> I would call any attempt to guess which tasks belong to the same
> workload/job "more clever".

Yeah, I agree it needs to be configurable. But a memory domain is not
a random guess. It's a core concept of the VM at this point. The fact
that the OOM killer cannot handle it is pretty weird and goes way
beyond "I wish we could have some smarter heuristics to choose from."

> > We don't need any elaborate
> > just-in-time evaluation of what each entity is worth. We just want to
> > kill the biggest job, not the biggest MM. Just like you wouldn't want
> > just the biggest VMA unmapped and freed, since it leaves your process
> > incoherent, killing one process leaves a job incoherent.
> > 
> > I understand that making it fully configurable is a tempting thought,
> > because you'd offload all responsibility to userspace.
> 
> It is not only tempting it is also the only place which can define
> a more advanced OOM semantic sanely IMHO.

Why do you think that?

Everything the user would want to dynamically program in the kernel,
say with bpf, they could do in userspace and then update the scores
for each group and task periodically.

The only limitation is that you have to recalculate and update the
scoring tree every once in a while, whereas a bpf program could
evaluate things just-in-time. But for that to matter in practice, OOM
kills would have to be a fairly hot path.
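Such a periodic userspace updater could be as simple as the sketch below. The scoring policy (scale the score with the share of total RSS) is an invented example; /proc/&lt;pid&gt;/status and /proc/&lt;pid&gt;/oom_score_adj are the real interfaces it would read and write.

```python
# Periodic userspace score updater, as described above: compute a policy
# score per process and (optionally) write it to oom_score_adj. The
# policy shown (score proportional to RSS share) is an illustrative
# assumption, not a recommendation.

def rss_kib(pid):
    """Read VmRSS (in KiB) from the kernel-provided /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0

def score_for(rss, total):
    """Toy policy: oom_score_adj proportional to RSS share, capped at 1000."""
    return min(1000, rss * 1000 // max(total, 1))

def update_scores(rss_by_pid, write=False):
    total = sum(rss_by_pid.values())
    plan = {pid: score_for(rss, total) for pid, rss in rss_by_pid.items()}
    if write:  # writing requires privileges; skipped in this dry run
        for pid, adj in plan.items():
            with open(f"/proc/{pid}/oom_score_adj", "w") as f:
                f.write(str(adj))
    return plan

plan = update_scores({1: 500, 2: 1500})
print(plan)  # → {1: 250, 2: 750}
```

The staleness limitation discussed above is exactly the interval between two such update passes.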

> > > > > And both kinds of workloads (services/applications and individual
> > > > > processes run by users) can co-exist on the same host - consider the
> > > > > default systemd setup, for instance.
> > > > > 
> > > > > IMHO it would be better to give users a choice regarding what they
> > > > > really want for a particular cgroup in case of OOM - killing the whole
> > > > > cgroup or one of its descendants. For example, we could introduce a
> > > > > per-cgroup flag that would tell the kernel whether the cgroup can
> > > > > tolerate killing a descendant or not. If it can, the kernel will pick
> > > > > the fattest sub-cgroup or process and check it. If it cannot, it will
> > > > > kill the whole cgroup and all its processes and sub-cgroups.
> > > > 
> > > > The last thing we want to do, is to compare processes with cgroups.
> > > > I agree, that we can have some option to disable the cgroup-aware OOM at all,
> > > > mostly for backward-compatibility. But I don't think it should be a
> > > > per-cgroup configuration option, which we will support forever.
> > > 
> > > I can clearly see a demand for "this is definitely more important
> > > container than others so do not kill" usecases. I can also see demand
> > > for "do not kill this container running for X days". And more are likely
> > > to pop out.
> > 
> > That can all be done with scoring.
> 
> Maybe. But that requires somebody to tweak the scoring which can be far
> from trivial.

Why is sorting and picking in userspace harder than sorting and
picking in the kernel?

> > This was 10 years ago, and nobody has missed anything critical enough
> > to implement something beyond scoring. So I don't see why we'd need to
> > do it for cgroups all of a sudden.
> > 
> > They're nothing special, they just group together things we have been
> > OOM killing for ages. So why shouldn't we use the same config model?
> > 
> > It 

Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-05-25 Thread Michal Hocko
On Tue 23-05-17 09:25:44, Johannes Weiner wrote:
> On Tue, May 23, 2017 at 09:07:47AM +0200, Michal Hocko wrote:
> > On Mon 22-05-17 18:01:16, Roman Gushchin wrote:
[...]
> > > How to react to an OOM is definitely a policy, which depends
> > > on the workload. Nothing is changing here from how it's working now,
> > > except that now the kernel will choose a victim cgroup, and kill the
> > > victim cgroup rather than a process.
> > 
> > There is a _big_ difference. The current implementation just tries
> > to recover from the OOM situation without caring much about the
> > consequences on the workload. This is the last resort and a service for
> > the _system_ to get back to a sane state. You are trying to make it more
> > clever and workload aware, and that is inevitably going to depend on the
> > specific workload. I really do think we cannot simply hardcode any
> > policy into the kernel for this purpose and that is why I would like to
> > see a discussion about how to do that in a more extensible way. This
> > might be harder to implement now but I believe it will turn out
> > better long term.
> 
> And that's where I still maintain that this isn't really a policy
> change. Because what this code does ISN'T more clever, and the OOM
> killer STILL IS a last-resort thing.

The thing I wanted to point out is that what and how much to kill
definitely depends on the usecase. We currently kill all tasks which
share the mm struct because that is the smallest unit that can unpin
user memory. And that makes a lot of sense to me as a general default.
I would call any attempt to guess which tasks belong to the same
workload/job "more clever".

> We don't need any elaborate
> just-in-time evaluation of what each entity is worth. We just want to
> kill the biggest job, not the biggest MM. Just like you wouldn't want
> just the biggest VMA unmapped and freed, since it leaves your process
> incoherent, killing one process leaves a job incoherent.
> 
> I understand that making it fully configurable is a tempting thought,
> because you'd offload all responsibility to userspace.

It is not only tempting it is also the only place which can define
a more advanced OOM semantic sanely IMHO.

> But on the
> other hand, this was brought up years ago and nothing has happened
> since. And to me this is evidence that nobody really cares all that
> much. Because it's still a rather rare event, and there isn't much you
> cannot accomplish with periodic score adjustments.

Yes and there were no attempts since then which suggests that people
didn't care all that much. Maybe things have changed now that containers
got much more popular.

> > > > And both kinds of workloads (services/applications and individual
> > > > processes run by users) can co-exist on the same host - consider the
> > > > default systemd setup, for instance.
> > > > 
> > > > IMHO it would be better to give users a choice regarding what they
> > > > really want for a particular cgroup in case of OOM - killing the whole
> > > > cgroup or one of its descendants. For example, we could introduce a
> > > > per-cgroup flag that would tell the kernel whether the cgroup can
> > > > tolerate killing a descendant or not. If it can, the kernel will pick
> > > > the fattest sub-cgroup or process and check it. If it cannot, it will
> > > > kill the whole cgroup and all its processes and sub-cgroups.
> > > 
> > > The last thing we want to do, is to compare processes with cgroups.
> > > I agree, that we can have some option to disable the cgroup-aware OOM at all,
> > > mostly for backward-compatibility. But I don't think it should be a
> > > per-cgroup configuration option, which we will support forever.
> > 
> > I can clearly see a demand for "this is definitely more important
> > container than others so do not kill" usecases. I can also see demand
> > for "do not kill this container running for X days". And more are likely
> > to pop out.
> 
> That can all be done with scoring.

Maybe. But that requires somebody to tweak the scoring which can be far
from trivial.
 
> In fact, we HAD the oom killer consider a target's cputime/runtime
> before, and David replaced it all with simple scoring in a63d83f427fb
> ("oom: badness heuristic rewrite").

Yes, that is correct and I agree that this was definitely step in the
right direction because time based heuristics tend to behave very
unpredictably in general workloads.

> This was 10 years ago, and nobody has missed anything critical enough
> to implement something beyond scoring. So I don't see why we'd need to
> do it for cgroups all of a sudden.
> 
> They're nothing special, they just group together things we have been
> OOM killing for ages. So why shouldn't we use the same config model?
> 
> It seems to me, what we need for this patch is 1) a way to toggle
> whether the processes and subgroups of a group are interdependent or
> independent and 2) configurable OOM scoring per cgroup analogous to
> what we have per process 

Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-05-23 Thread Johannes Weiner
On Tue, May 23, 2017 at 09:07:47AM +0200, Michal Hocko wrote:
> On Mon 22-05-17 18:01:16, Roman Gushchin wrote:
> > On Sat, May 20, 2017 at 09:37:29PM +0300, Vladimir Davydov wrote:
> > > On Thu, May 18, 2017 at 05:28:04PM +0100, Roman Gushchin wrote:
> > > ...
> > > > +5-2-4. Cgroup-aware OOM Killer
> > > > +
> > > > +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> > > > +It means that it treats memory cgroups as memory consumers
> > > > +rather than individual processes. Under the OOM conditions it tries
> > > > +to find an eligible leaf memory cgroup, and kill all processes
> > > > +in this cgroup. If it's not possible (e.g. all processes belong
> > > > +to the root cgroup), it falls back to the traditional per-process
> > > > +behaviour.
> > > 
> > > I agree that the current OOM victim selection algorithm is totally
> > > unfair in a system using containers and it has been crying for rework
> > > for the last few years now, so it's great to see this finally coming.
> > > 
> > > However, I don't reckon that killing a whole leaf cgroup is always the
> > > best practice. It does make sense when cgroups are used for
> > > containerizing services or applications, because a service is unlikely
> > > to remain operational after one of its processes is gone, but one can
> > > also use cgroups to containerize processes started by a user. Kicking a
> > > user out because one of her processes has gone mad doesn't sound right to me.
> > 
> > I agree that it's not always the best practice if you're not allowed
> > to change the cgroup configuration (e.g. create new cgroups).
> > IMHO, this case is mostly covered by using the v1 cgroup interface,
> > which remains unchanged.
> 
> But there are features which are v2 only and users might really want to
> use them. So I really do not buy this v2-only argument.

I have to agree here. We won't get around making the leaf killing
opt-in or opt-out in some fashion.

> > > Another example when the policy you're suggesting fails in my opinion is
> > > in case a service (cgroup) consists of sub-services (sub-cgroups) that
> > > run processes. The main service may stop working normally if one of its
> > > sub-services is killed. So it might make sense to kill not just an
> > > individual process or a leaf cgroup, but the whole main service with all
> > > its sub-services.
> > 
> > I agree, although I do not pretend to solve all possible
> > userspace problems caused by an OOM.
> > 
> > How to react to an OOM is definitely a policy, which depends
> > on the workload. Nothing changes here from how it works now,
> > except that the kernel will now choose a victim cgroup and kill it
> > rather than a single process.
> 
> There is a _big_ difference. The current implementation just tries
> to recover from the OOM situation without caring much about the
> consequences for the workload. This is the last resort and a service for
> the _system_ to get back to a sane state. You are trying to make it more
> clever and workload aware, and that is inevitably going to depend on the
> specific workload. I really do think we cannot simply hardcode any
> policy into the kernel for this purpose, and that is why I would like to
> see a discussion about how to do that in a more extensible way. This
> might be harder to implement now but I believe it will turn out
> better in the long term.

And that's where I still maintain that this isn't really a policy
change. Because what this code does ISN'T more clever, and the OOM
killer STILL IS a last-resort thing. We don't need any elaborate
just-in-time evaluation of what each entity is worth. We just want to
kill the biggest job, not the biggest MM. Just like you wouldn't want
just the biggest VMA unmapped and freed, since it leaves your process
incoherent, killing one process leaves a job incoherent.

I understand that making it fully configurable is a tempting thought,
because you'd offload all responsibility to userspace. But on the
other hand, this was brought up years ago and nothing has happened
since. And to me this is evidence that nobody really cares all that
much. Because it's still a rather rare event, and there isn't much you
cannot accomplish with periodic score adjustments.

> > > And both kinds of workloads (services/applications and individual
> > > processes run by users) can co-exist on the same host - consider the
> > > default systemd setup, for instance.
> > > 
> > > IMHO it would be better to give users a choice regarding what they
> > > really want for a particular cgroup in case of OOM - killing the whole
> > > cgroup or one of its descendants. For example, we could introduce a
> > > per-cgroup flag that would tell the kernel whether the cgroup can
> > > tolerate killing a descendant or not. If it can, the kernel will pick
> > > the fattest sub-cgroup or process and check it. If it cannot, it will
> > > kill the whole cgroup and all its processes and sub-cgroups.
> > 
> > The last thing we want to do is to 

Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-05-23 Thread Michal Hocko
On Mon 22-05-17 18:01:16, Roman Gushchin wrote:
> On Sat, May 20, 2017 at 09:37:29PM +0300, Vladimir Davydov wrote:
> > Hello Roman,
> 
> Hi Vladimir!
> 
> > 
> > On Thu, May 18, 2017 at 05:28:04PM +0100, Roman Gushchin wrote:
> > ...
> > > +5-2-4. Cgroup-aware OOM Killer
> > > +
> > > +The cgroup v2 memory controller implements a cgroup-aware OOM killer.
> > > +It means that it treats memory cgroups as memory consumers
> > > +rather than individual processes. Under OOM conditions it tries
> > > +to find an eligible leaf memory cgroup and kills all processes
> > > +in this cgroup. If that's not possible (e.g. all processes belong
> > > +to the root cgroup), it falls back to the traditional per-process
> > > +behaviour.
> > 
> > I agree that the current OOM victim selection algorithm is totally
> > unfair in a system using containers and it has been crying for rework
> > for the last few years now, so it's great to see this finally coming.
> > 
> > However, I don't reckon that killing a whole leaf cgroup is always the
> > best practice. It does make sense when cgroups are used for
> > containerizing services or applications, because a service is unlikely
> > to remain operational after one of its processes is gone, but one can
> > also use cgroups to containerize processes started by a user. Kicking a
> > user out because one of her processes has gone mad doesn't sound right to me.
> 
> I agree that it's not always the best practice if you're not allowed
> to change the cgroup configuration (e.g. create new cgroups).
> IMHO, this case is mostly covered by using the v1 cgroup interface,
> which remains unchanged.

But there are features which are v2 only and users might really want to
use them. So I really do not buy this v2-only argument.

> If you do have control over cgroups, you can put processes into
> separate cgroups, and obtain control over OOM victim selection and killing.

Usually you do not have that control because there is a global daemon
doing the placement for you.

> > Another example when the policy you're suggesting fails in my opinion is
> > in case a service (cgroup) consists of sub-services (sub-cgroups) that
> > run processes. The main service may stop working normally if one of its
> > sub-services is killed. So it might make sense to kill not just an
> > individual process or a leaf cgroup, but the whole main service with all
> > its sub-services.
> 
> I agree, although I do not pretend to solve all possible
> userspace problems caused by an OOM.
> 
> How to react to an OOM is definitely a policy, which depends
> on the workload. Nothing changes here from how it works now,
> except that the kernel will now choose a victim cgroup and kill it
> rather than a single process.

There is a _big_ difference. The current implementation just tries
to recover from the OOM situation without caring much about the
consequences for the workload. This is the last resort and a service for
the _system_ to get back to a sane state. You are trying to make it more
clever and workload aware, and that is inevitably going to depend on the
specific workload. I really do think we cannot simply hardcode any
policy into the kernel for this purpose, and that is why I would like to
see a discussion about how to do that in a more extensible way. This
might be harder to implement now but I believe it will turn out
better in the long term.

> > And both kinds of workloads (services/applications and individual
> > processes run by users) can co-exist on the same host - consider the
> > default systemd setup, for instance.
> > 
> > IMHO it would be better to give users a choice regarding what they
> > really want for a particular cgroup in case of OOM - killing the whole
> > cgroup or one of its descendants. For example, we could introduce a
> > per-cgroup flag that would tell the kernel whether the cgroup can
> > tolerate killing a descendant or not. If it can, the kernel will pick
> > the fattest sub-cgroup or process and check it. If it cannot, it will
> > kill the whole cgroup and all its processes and sub-cgroups.
> 
> The last thing we want to do is to compare processes with cgroups.
> I agree that we can have some option to disable the cgroup-aware OOM
> altogether, mostly for backward compatibility. But I don't think it should be
> a per-cgroup configuration option that we will have to support forever.

I can clearly see a demand for "this is definitely a more important
container than others, so do not kill it" usecases. I can also see demand
for "do not kill this container running for X days". And more are likely
to pop out.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-05-22 Thread Roman Gushchin
On Sat, May 20, 2017 at 09:37:29PM +0300, Vladimir Davydov wrote:
> Hello Roman,

Hi Vladimir!

> 
> On Thu, May 18, 2017 at 05:28:04PM +0100, Roman Gushchin wrote:
> ...
> > +5-2-4. Cgroup-aware OOM Killer
> > +
> > +The cgroup v2 memory controller implements a cgroup-aware OOM killer.
> > +It means that it treats memory cgroups as memory consumers
> > +rather than individual processes. Under OOM conditions it tries
> > +to find an eligible leaf memory cgroup and kills all processes
> > +in this cgroup. If that's not possible (e.g. all processes belong
> > +to the root cgroup), it falls back to the traditional per-process
> > +behaviour.
> 
> I agree that the current OOM victim selection algorithm is totally
> unfair in a system using containers and it has been crying for rework
> for the last few years now, so it's great to see this finally coming.
> 
> However, I don't reckon that killing a whole leaf cgroup is always the
> best practice. It does make sense when cgroups are used for
> containerizing services or applications, because a service is unlikely
> to remain operational after one of its processes is gone, but one can
> also use cgroups to containerize processes started by a user. Kicking a
> user out because one of her processes has gone mad doesn't sound right to me.

I agree that it's not always the best practice if you're not allowed
to change the cgroup configuration (e.g. create new cgroups).
IMHO, this case is mostly covered by using the v1 cgroup interface,
which remains unchanged.
If you do have control over cgroups, you can put processes into
separate cgroups, and obtain control over OOM victim selection and killing.

> Another example when the policy you're suggesting fails in my opinion is
> in case a service (cgroup) consists of sub-services (sub-cgroups) that
> run processes. The main service may stop working normally if one of its
> sub-services is killed. So it might make sense to kill not just an
> individual process or a leaf cgroup, but the whole main service with all
> its sub-services.

I agree, although I do not pretend to solve all possible
userspace problems caused by an OOM.

How to react to an OOM is definitely a policy, which depends
on the workload. Nothing changes here from how it works now,
except that the kernel will now choose a victim cgroup and kill it
rather than a single process.

> And both kinds of workloads (services/applications and individual
> processes run by users) can co-exist on the same host - consider the
> default systemd setup, for instance.
> 
> IMHO it would be better to give users a choice regarding what they
> really want for a particular cgroup in case of OOM - killing the whole
> cgroup or one of its descendants. For example, we could introduce a
> per-cgroup flag that would tell the kernel whether the cgroup can
> tolerate killing a descendant or not. If it can, the kernel will pick
> the fattest sub-cgroup or process and check it. If it cannot, it will
> kill the whole cgroup and all its processes and sub-cgroups.

The last thing we want to do is to compare processes with cgroups.
I agree that we can have some option to disable the cgroup-aware OOM
altogether, mostly for backward compatibility. But I don't think it should be
a per-cgroup configuration option that we will have to support forever.

> 
> > +
> > +The memory controller tries to make the best choice of a victim cgroup.
> > +In general, it tries to select the largest cgroup, matching given
> > +node/zone requirements, but the concrete algorithm is not defined,
> > +and may be changed later.
> > +
> > +This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
> > +the memory controller considers only cgroups belonging to a sub-tree
> > +of the OOM-ing cgroup, including itself.
> ...
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index c131f7e..8d07481 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2625,6 +2625,75 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
> > return ret;
> >  }
> >  
> > +bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> > +{
> > +   struct mem_cgroup *iter;
> > +   unsigned long chosen_memcg_points;
> > +
> > +   oc->chosen_memcg = NULL;
> > +
> > +   if (mem_cgroup_disabled())
> > +   return false;
> > +
> > +   if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > +   return false;
> > +
> > +   pr_info("Choosing a victim memcg because of %s",
> > +   oc->memcg ?
> > +   "memory limit reached of cgroup " :
> > +   "out of memory\n");
> > +   if (oc->memcg) {
> > +   pr_cont_cgroup_path(oc->memcg->css.cgroup);
> > +   pr_cont("\n");
> > +   }
> > +
> > +   chosen_memcg_points = 0;
> > +
> > +   for_each_mem_cgroup_tree(iter, oc->memcg) {
> > +   unsigned long points;
> > +   int nid;
> > +
> > +   if (mem_cgroup_is_root(iter))
> > +   continue;
> > +
> > +   if 

Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-05-20 Thread Vladimir Davydov
Hello Roman,

On Thu, May 18, 2017 at 05:28:04PM +0100, Roman Gushchin wrote:
...
> +5-2-4. Cgroup-aware OOM Killer
> +
> +The cgroup v2 memory controller implements a cgroup-aware OOM killer.
> +It means that it treats memory cgroups as memory consumers
> +rather than individual processes. Under OOM conditions it tries
> +to find an eligible leaf memory cgroup and kills all processes
> +in this cgroup. If that's not possible (e.g. all processes belong
> +to the root cgroup), it falls back to the traditional per-process
> +behaviour.

I agree that the current OOM victim selection algorithm is totally
unfair in a system using containers and it has been crying for rework
for the last few years now, so it's great to see this finally coming.

However, I don't reckon that killing a whole leaf cgroup is always the
best practice. It does make sense when cgroups are used for
containerizing services or applications, because a service is unlikely
to remain operational after one of its processes is gone, but one can
also use cgroups to containerize processes started by a user. Kicking a
user out because one of her processes has gone mad doesn't sound right to me.

Another example when the policy you're suggesting fails in my opinion is
in case a service (cgroup) consists of sub-services (sub-cgroups) that
run processes. The main service may stop working normally if one of its
sub-services is killed. So it might make sense to kill not just an
individual process or a leaf cgroup, but the whole main service with all
its sub-services.

And both kinds of workloads (services/applications and individual
processes run by users) can co-exist on the same host - consider the
default systemd setup, for instance.

IMHO it would be better to give users a choice regarding what they
really want for a particular cgroup in case of OOM - killing the whole
cgroup or one of its descendants. For example, we could introduce a
per-cgroup flag that would tell the kernel whether the cgroup can
tolerate killing a descendant or not. If it can, the kernel will pick
the fattest sub-cgroup or process and check it. If it cannot, it will
kill the whole cgroup and all its processes and sub-cgroups.

> +
> +The memory controller tries to make the best choice of a victim cgroup.
> +In general, it tries to select the largest cgroup, matching given
> +node/zone requirements, but the concrete algorithm is not defined,
> +and may be changed later.
> +
> +This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
> +the memory controller considers only cgroups belonging to a sub-tree
> +of the OOM-ing cgroup, including itself.
...
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c131f7e..8d07481 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2625,6 +2625,75 @@ static inline bool memcg_has_children(struct mem_cgroup *memcg)
>   return ret;
>  }
>  
> +bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> +{
> + struct mem_cgroup *iter;
> + unsigned long chosen_memcg_points;
> +
> + oc->chosen_memcg = NULL;
> +
> + if (mem_cgroup_disabled())
> + return false;
> +
> + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> + return false;
> +
> + pr_info("Choosing a victim memcg because of %s",
> + oc->memcg ?
> + "memory limit reached of cgroup " :
> + "out of memory\n");
> + if (oc->memcg) {
> + pr_cont_cgroup_path(oc->memcg->css.cgroup);
> + pr_cont("\n");
> + }
> +
> + chosen_memcg_points = 0;
> +
> + for_each_mem_cgroup_tree(iter, oc->memcg) {
> + unsigned long points;
> + int nid;
> +
> + if (mem_cgroup_is_root(iter))
> + continue;
> +
> + if (memcg_has_children(iter))
> + continue;
> +
> + points = 0;
> + for_each_node_state(nid, N_MEMORY) {
> + if (oc->nodemask && !node_isset(nid, *oc->nodemask))
> + continue;
> + points += mem_cgroup_node_nr_lru_pages(iter, nid,
> + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> + }
> + points += mem_cgroup_get_nr_swap_pages(iter);

I guess we should also take into account kmem as well (unreclaimable
slabs, kernel stacks, socket buffers).

> +
> + pr_info("Memcg ");
> + pr_cont_cgroup_path(iter->css.cgroup);
> + pr_cont(": %lu\n", points);
> +
> + if (points > chosen_memcg_points) {
> + if (oc->chosen_memcg)
> + css_put(&oc->chosen_memcg->css);
> +
> + oc->chosen_memcg = iter;
> + css_get(&iter->css);
> +
> + chosen_memcg_points = points;
> + }
> + }
> +
> + if (oc->chosen_memcg) {
> + pr_info("Kill memcg ");
> + 

Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-05-19 Thread Michal Hocko
On Thu 18-05-17 14:11:17, Johannes Weiner wrote:
> On Thu, May 18, 2017 at 07:30:04PM +0200, Michal Hocko wrote:
> > On Thu 18-05-17 17:28:04, Roman Gushchin wrote:
> > > Traditionally, the OOM killer operates at the process level.
> > > Under OOM conditions, it finds the process with the highest oom score
> > > and kills it.
> > > 
> > > This behavior doesn't suit systems with many running
> > > containers well. There are two main issues:
> > > 
> > > 1) There is no fairness between containers. A small container with
> > > a few large processes will be chosen over a large one with a huge
> > > number of small processes.
> > > 
> > > 2) Containers often do not expect that some random process inside
> > > will be killed. So, in general, a much safer behavior is
> > > to kill the whole cgroup. Traditionally, this was implemented
> > > in userspace, but doing it in the kernel has some advantages,
> > > especially in the case of a system-wide OOM.
> > > 
> > > To address these issues, a cgroup-aware OOM killer is introduced.
> > > Under OOM conditions, it looks for the memcg with the highest oom score
> > > and kills all processes inside.
> > > 
> > > The memcg oom score is calculated as the size of the active and inactive
> > > anon LRU lists, the unevictable LRU list, and the swap usage.
> > > 
> > > For a cgroup-wide OOM, only cgroups belonging to the subtree of
> > > the OOMing cgroup are considered.
> > 
> > While this might make sense for some workloads/setups it is not a
> > generally acceptable policy IMHO. We have discussed that different OOM
> > policies might be interesting a few years back at LSFMM but there was no
> > real consensus on how to do that. One possibility was to allow bpf-like
> > mechanisms. Could you explore that path?
> 
> OOM policy is an orthogonal discussion, though.
> 
> The OOM killer's job is to pick a memory consumer to kill. Per default
> the unit of the memory consumer is a process, but cgroups allow
> grouping processes into compound consumers. Extending the OOM killer
> to respect the new definition of "consumer" is not a new policy.

I do not want to play word games here but picking a task or more tasks
is a policy from my POV but that is not all that important. My primary
point is that this new "implementation" is most probably not what people
who use memory cgroups outside of containers want. Why? Mostly because
they do not care that only a part of the memcg is still alive, pretty
much like the current global OOM behavior where a single task (or its
children) is gone all of a sudden. Why should I kill the whole user
slice just because one of its processes went wild?
 
> I don't think it's reasonable to ask the person who's trying to make
> the OOM killer support group-consumers to design a dynamic OOM policy
> framework instead.
> 
> All we want is the OOM policy, whatever it is, applied to cgroups.

And I am not dismissing this usecase. I believe it is valid but not
universally applicable when memory cgroups are deployed. That is why
I think that we need a way to define those policies in some sane way.
Our current oom policies are basically random -
/proc/sys/vm/oom_kill_allocating_task resp. /proc/sys/vm/panic_on_oom.

I am not really sure we want another hardcoded one, e.g.
/proc/sys/vm/oom_kill_container, because even that might turn out not to be
a great fit for different container usecases. Do we want to kill the
largest container or the one with the largest memory hog? Should some
containers have higher priority than others? I am pretty sure more
criteria would pop up with more usecases.

That's why I think that the current OOM killer implementation should
stay as a last resort and be process oriented and we should think about
a way to override it for particular usecases. The exact mechanism is not
completely clear to me to be honest.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-05-18 Thread Balbir Singh
On Thu, 2017-05-18 at 20:20 +0100, Roman Gushchin wrote:
> On Fri, May 19, 2017 at 04:37:27AM +1000, Balbir Singh wrote:
> > On Fri, May 19, 2017 at 3:30 AM, Michal Hocko  wrote:
> > > On Thu 18-05-17 17:28:04, Roman Gushchin wrote:
> > > > Traditionally, the OOM killer operates at the process level.
> > > > Under OOM conditions, it finds the process with the highest oom score
> > > > and kills it.
> > > > 
> > > > This behavior doesn't suit systems with many running
> > > > containers well. There are two main issues:
> > > > 
> > > > 1) There is no fairness between containers. A small container with
> > > > a few large processes will be chosen over a large one with a huge
> > > > number of small processes.
> > > > 
> > > > 2) Containers often do not expect that some random process inside
> > > > will be killed. So, in general, a much safer behavior is
> > > > to kill the whole cgroup. Traditionally, this was implemented
> > > > in userspace, but doing it in the kernel has some advantages,
> > > > especially in the case of a system-wide OOM.
> > > > 
> > > > To address these issues, a cgroup-aware OOM killer is introduced.
> > > > Under OOM conditions, it looks for the memcg with the highest oom score
> > > > and kills all processes inside.
> > > > 
> > > > The memcg oom score is calculated as the size of the active and inactive
> > > > anon LRU lists, the unevictable LRU list, and the swap usage.
> > > > 
> > > > For a cgroup-wide OOM, only cgroups belonging to the subtree of
> > > > the OOMing cgroup are considered.
> > > 
> > > While this might make sense for some workloads/setups it is not a
> > > generally acceptable policy IMHO. We have discussed that different OOM
> > > policies might be interesting a few years back at LSFMM but there was no
> > > real consensus on how to do that. One possibility was to allow bpf-like
> > > mechanisms. Could you explore that path?
> > 
> > I agree, I think it needs more thought. I wonder if the real issue is
> > something else. For example
> > 
> > 1. Did we overcommit a particular container too much?
> 
> Imagine you have a machine with multiple containers,
> each with its own process tree, and the machine is overcommitted,
> i.e. the sum of the containers' memory limits is larger than the available RAM.
> 
> In a case of a system-wide OOM some random container will be affected.
> 

The random container containing the most expensive task, yes!

> Historically, this problem was solved by some user-space daemon,
> which monitored OOM events and cleaned up affected containers.
> But this approach can't solve the main problem: non-optimal selection
> of a victim.

Why do you think the problem is non-optimal selection? Is it because
we believe that memory cgroup limits should play a role in the decision
making of a global OOM?


> 
> > 2. Do we need something like 
> > https://urldefense.proofpoint.com/v2/url?u=https-3A__lwn.net_Articles_604212_=DwIBaQ=5VD0RTtNlTh3ycd41b3MUw=jJYgtDM7QT-W-Fz_d29HYQ=9jV4id5lmsjFJj1kQjJk0auyQ3bzL27-f6Ur6ZNw36c=ElsS25CoZSPba6ke7O-EIsR7lN0psP6tDVyLnGqCMfs=
> >   to solve
> > the problem?
>
 
The URL got changed to something non-parsable, probably for security, but
could your email client please not do that?

> I don't think it's related.

I was thinking that if we had virtual memory limits and could set
some sane ones, we could avoid OOM altogether. OOM is a big hammer, and
having allocations fail is far more acceptable than killing processes.
I believe that several applications may have a much larger VM than their
actual memory usage, and that with a good overcommit/virtual memory limiter
the problem can be better tackled.

> 
> > 3. We have oom notifiers now, could those be used (assuming you are
> > interested in non-memcg-related OOMs affecting a container)?
> 
> They can be used to inform a userspace daemon about an OOM that has already
> happened, but they do not affect victim selection.

Yes, the whole point is for the OS to select the victim; the notifiers
provide an opportunity for us to do reclaim and possibly prevent the OOM.

In oom_kill, I see

	blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
	if (freed > 0)
		/* Got some memory back in the last second. */
		return true;

Could the notification to user space then decide what to clean up to free
memory? We also have event notification inside of memcg. I am trying to
understand why these are not sufficient?

We also have soft limits to push containers to a smaller size at the
time of global pressure.

> 
> > 4. How do we determine limits for these containers? From a fairness
> > perspective
> 
> Limits are usually set from some high-level understanding of the nature
> of the tasks running inside, but overcommitting the machine is
> commonplace, I assume.

Agreed, overcommit is a given and that is why we wrote the cgroup controllers.
I was wondering if the container limits not being set correctly could cause
these 

Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-05-18 Thread Johannes Weiner
On Fri, May 19, 2017 at 04:37:27AM +1000, Balbir Singh wrote:
> On Fri, May 19, 2017 at 3:30 AM, Michal Hocko  wrote:
> > On Thu 18-05-17 17:28:04, Roman Gushchin wrote:
> >> Traditionally, the OOM killer operates at the process level.
> >> Under OOM conditions, it finds the process with the highest oom score
> >> and kills it.
> >>
> >> This behavior doesn't suit systems with many running
> >> containers well. There are two main issues:
> >>
> >> 1) There is no fairness between containers. A small container with
> >> a few large processes will be chosen over a large one with a huge
> >> number of small processes.
> >>
> >> 2) Containers often do not expect that some random process inside
> >> will be killed. So, in general, a much safer behavior is
> >> to kill the whole cgroup. Traditionally, this was implemented
> >> in userspace, but doing it in the kernel has some advantages,
> >> especially in the case of a system-wide OOM.
> >>
> >> To address these issues, a cgroup-aware OOM killer is introduced.
> >> Under OOM conditions, it looks for the memcg with the highest oom score
> >> and kills all processes inside.
> >>
> >> The memcg oom score is calculated as the size of the active and inactive
> >> anon LRU lists, the unevictable LRU list, and the swap usage.
> >>
> >> For a cgroup-wide OOM, only cgroups belonging to the subtree of
> >> the OOMing cgroup are considered.
> >
> > While this might make sense for some workloads/setups it is not a
> > generally acceptable policy IMHO. We have discussed that different OOM
> > policies might be interesting a few years back at LSFMM but there was no
> > real consensus on how to do that. One possibility was to allow bpf-like
> > mechanisms. Could you explore that path?
> 
> I agree, I think it needs more thought. I wonder if the real issue is
> something else. For example
> 
> 1. Did we overcommit a particular container too much?
> 2. Do we need something like https://lwn.net/Articles/604212/ to solve
> the problem?

The occasional OOM kill is an unavoidable reality on our systems (and
I bet on most deployments). If we tried not to overcommit, we'd waste
a *lot* of memory.

The problem is when OOM happens, we really want the biggest *job* to
get killed. Before cgroups, we assumed jobs were processes. But with
cgroups, the user is able to define a group of processes as a job, and
then an individual process is no longer a first-class memory consumer.

Without a patch like this, the OOM killer will compare the sizes of
the random subparticles that the jobs in the system are composed of
and kill the single biggest particle, leaving behind the incoherent
remains of one of the jobs. That doesn't make a whole lot of sense.

If you want to determine the most expensive car in a parking lot, you
can't go off and compare the price of one car's muffler with the door
handle of another, then point to a windshield and yell "This is it!"

You need to compare the cars as a whole with each other.

> 3. We have oom notifiers now, could those be used (assuming you are interested
> in non-memcg-related OOMs affecting a container)?

Right now, we watch for OOM notifications and then have userspace kill
the rest of a job. That works - somewhat. What remains is the problem
that I described above, that comparing individual process sizes is not
meaningful when the terminal memory consumer is a cgroup.

> 4. How do we determine limits for these containers? From a fairness
> perspective

How do you mean?


Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-05-18 Thread Roman Gushchin
On Fri, May 19, 2017 at 04:37:27AM +1000, Balbir Singh wrote:
> On Fri, May 19, 2017 at 3:30 AM, Michal Hocko  wrote:
> > On Thu 18-05-17 17:28:04, Roman Gushchin wrote:
> >> Traditionally, the OOM killer operates at the process level.
> >> Under OOM conditions, it finds the process with the highest oom score
> >> and kills it.
> >>
> >> This behavior doesn't suit systems with many running
> >> containers well. There are two main issues:
> >>
> >> 1) There is no fairness between containers. A small container with
> >> a few large processes will be chosen over a large one with a huge
> >> number of small processes.
> >>
> >> 2) Containers often do not expect that some random process inside
> >> will be killed. So, in general, a much safer behavior is
> >> to kill the whole cgroup. Traditionally, this was implemented
> >> in userspace, but doing it in the kernel has some advantages,
> >> especially in the case of a system-wide OOM.
> >>
> >> To address these issues, a cgroup-aware OOM killer is introduced.
> >> Under OOM conditions, it looks for the memcg with the highest oom score
> >> and kills all processes inside.
> >>
> >> The memcg oom score is calculated as the size of the active and inactive
> >> anon LRU lists, the unevictable LRU list, and the swap usage.
> >>
> >> For a cgroup-wide OOM, only cgroups belonging to the subtree of
> >> the OOMing cgroup are considered.
> >
> > While this might make sense for some workloads/setups, it is not a
> > generally acceptable policy IMHO. We discussed the idea that different
> > OOM policies might be interesting a few years back at LSFMM, but there
> > was no real consensus on how to do that. One possibility was to allow
> > bpf-like mechanisms. Could you explore that path?
> 
> I agree, I think it needs more thought. I wonder if the real issue is
> something else. For example
> 
> 1. Did we overcommit a particular container too much?

Imagine you have a machine with multiple containers, each with its own
process tree, and the machine is overcommitted, i.e. the sum of the
containers' memory limits is larger than the amount of available RAM.

In the case of a system-wide OOM, some random container will be affected.

Historically, this problem was solved by a user-space daemon that
monitored OOM events and cleaned up the affected containers. But this
approach can't solve the main problem: non-optimal selection of a victim.

> 2. Do we need something like https://lwn.net/Articles/604212/ to solve
> the problem?

I don't think it's related.

> 3. We have oom notifiers now; could those be used (assuming you are
> interested in non-memcg-related OOMs affecting a container)?

They can be used to inform a user-space daemon about an OOM that has
already happened, but they do not affect victim selection.

> 4. How do we determine limits for these containers, from a fairness
> perspective?

Limits are usually set from some high-level understanding of the nature
of the tasks running inside, but overcommitting the machine is
commonplace, I assume.

Thank you!

Roman


Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-05-18 Thread Balbir Singh
On Fri, May 19, 2017 at 3:30 AM, Michal Hocko  wrote:
> On Thu 18-05-17 17:28:04, Roman Gushchin wrote:
>> Traditionally, the OOM killer operates at the process level.
>> Under OOM conditions, it finds the process with the highest oom score
>> and kills it.
>>
>> This behavior doesn't suit systems with many running
>> containers well. There are two main issues:
>>
>> 1) There is no fairness between containers. A small container with
>> a few large processes will be chosen over a large one with a huge
>> number of small processes.
>>
>> 2) Containers often do not expect that some random process inside
>> will be killed, so in general a much safer behavior is
>> to kill the whole cgroup. Traditionally this was implemented
>> in userspace, but doing it in the kernel has some advantages,
>> especially in the case of a system-wide OOM.
>>
>> To address these issues, a cgroup-aware OOM killer is introduced.
>> Under OOM conditions, it looks for the memcg with the highest oom score
>> and kills all processes inside.
>>
>> The memcg oom score is calculated as the size of the active and
>> inactive anon LRU lists, the unevictable LRU list, and swap.
>>
>> For a cgroup-wide OOM, only cgroups belonging to the subtree of
>> the OOMing cgroup are considered.
>
> While this might make sense for some workloads/setups, it is not a
> generally acceptable policy IMHO. We discussed the idea that different
> OOM policies might be interesting a few years back at LSFMM, but there
> was no real consensus on how to do that. One possibility was to allow
> bpf-like mechanisms. Could you explore that path?

I agree, I think it needs more thought. I wonder if the real issue is something
else. For example

1. Did we overcommit a particular container too much?
2. Do we need something like https://lwn.net/Articles/604212/ to solve
the problem?
3. We have oom notifiers now; could those be used (assuming you are
interested in non-memcg-related OOMs affecting a container)?
4. How do we determine limits for these containers, from a fairness
perspective?

Just trying to understand what leads to the issues you are seeing.

Balbir


Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

2017-05-18 Thread Michal Hocko
On Thu 18-05-17 17:28:04, Roman Gushchin wrote:
> Traditionally, the OOM killer operates at the process level.
> Under OOM conditions, it finds the process with the highest oom score
> and kills it.
>
> This behavior doesn't suit systems with many running
> containers well. There are two main issues:
>
> 1) There is no fairness between containers. A small container with
> a few large processes will be chosen over a large one with a huge
> number of small processes.
>
> 2) Containers often do not expect that some random process inside
> will be killed, so in general a much safer behavior is
> to kill the whole cgroup. Traditionally this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in the case of a system-wide OOM.
>
> To address these issues, a cgroup-aware OOM killer is introduced.
> Under OOM conditions, it looks for the memcg with the highest oom score
> and kills all processes inside.
>
> The memcg oom score is calculated as the size of the active and
> inactive anon LRU lists, the unevictable LRU list, and swap.
>
> For a cgroup-wide OOM, only cgroups belonging to the subtree of
> the OOMing cgroup are considered.

While this might make sense for some workloads/setups, it is not a
generally acceptable policy IMHO. We discussed the idea that different
OOM policies might be interesting a few years back at LSFMM, but there
was no real consensus on how to do that. One possibility was to allow
bpf-like mechanisms. Could you explore that path?
-- 
Michal Hocko
SUSE Labs