Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
On Fri 02-06-17 16:18:52, Roman Gushchin wrote:
> On Fri, Jun 02, 2017 at 10:43:33AM +0200, Michal Hocko wrote:
> > On Wed 31-05-17 14:01:45, Johannes Weiner wrote:
> > > On Wed, May 31, 2017 at 06:25:04PM +0200, Michal Hocko wrote:
> > > > > > +    /*
> > > > > >      * If current has a pending SIGKILL or is exiting, then automatically
> > > > > >      * select it. The goal is to allow it to allocate so that it may
> > > > > >      * quickly exit and free its memory.
> > > > > >
> > > > > > Please note that I haven't explored how much of the infrastructure needed for the OOM decision making is available to modules. But we can export a lot of what we currently have in oom_kill.c. I admit it might turn out that this is simply not feasible, but I would like this to be at least explored before we go and implement yet another hardcoded way to handle (see how I didn't use policy ;)) the OOM situation.
> > > > >
> > > > > ;)
> > > > >
> > > > > My doubt here is mainly that we'll see many (or any) real-life cases materialize that cannot be handled with cgroups and scoring. These are powerful building blocks on which userspace can implement all kinds of policy and sorting algorithms.
> > > > >
> > > > > So this seems like a lot of churn and complicated code to handle one extension. An extension that implements basic functionality.
> > > >
> > > > Well, as I've said, I didn't get to explore this path, so I have only a very vague idea what we would have to export to implement e.g. the proposed oom killing strategy suggested in this thread. Unfortunately I do not have much time for that. I do not want to block useful work which you have a usecase for, but I would be really happy if we could consider longer term plans before diving into a "hardcoded" implementation. We didn't do that previously and we are left with oom_kill_allocating_task and similar one-off things.
> > >
> > > As I understand it, killing the allocating task was simply the default before the OOM killer and was added as a compat knob. I really doubt anybody is using it at this point, and we could probably delete it.
> >
> > I might misremember, but my recollection is that SGI simply had too large machines with too many processes, and so the task selection was very expensive.
>
> The cgroup-aware OOM killer can be much better in the case of a large number of processes, as we don't have to iterate over all processes locking each mm, and can select an appropriate cgroup based mostly on lockless counters. Of course, it depends on the concrete setup, but it can be much more efficient under the right circumstances.

Yes, I agree with that.

> > > I appreciate your concern of being too short-sighted here, but the fact that I cannot point to more usecases isn't for lack of trying. I simply don't see the endless possibilities of usecases that you do.
> > >
> > > It's unlikely for more types of memory domains to pop up besides MMs and cgroups. (I mentioned vmas, but that just seems esoteric. And we have panic_on_oom for whole-system death. What else could there be?)
> > >
> > > And as I pointed out, there is no real evidence that the current system for configuring preferences isn't sufficient in practice.
> > >
> > > That's my thoughts on exploring. I'm not sure what else to do before it feels like running off into fairly contrived hypotheticals.
> >
> > Yes, I do not want hypotheticals to block an otherwise useful feature, of course. But I haven't heard a strong argument why a module based approach would be a bigger maintenance burden long term. From a very quick glance over the patches Roman posted yesterday, it seems that a large part of the existing oom infrastructure can be reused reasonably.
>
> I have nothing against a module based approach, but I don't think that a module should implement anything other than the oom score calculation (for a process and a cgroup). Maybe also some custom method for killing, but I can't really imagine anything reasonable except killing one "worst" process or killing whole cgroup(s). In case of a system wide OOM, we have to free some memory quickly, and this means we can't do anything much more complex than killing some process(es).
>
> So, in my understanding, what you're suggesting is not against the proposed approach at all. We still need to iterate over cgroups, somehow define their badness, find the worst one and destroy it. In my v2 I've tried to separate these two potentially customizable areas in two simple functions: mem_cgroup_oom_badness() and mem_cgroup_kill_oom_victim().

As I've said, I didn't get to look closer at your v2 yet. My point was that we shouldn't hardcode the memcg specific
Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
On Fri, Jun 02, 2017 at 10:43:33AM +0200, Michal Hocko wrote:
> On Wed 31-05-17 14:01:45, Johannes Weiner wrote:
> > On Wed, May 31, 2017 at 06:25:04PM +0200, Michal Hocko wrote:
> > > > > +    /*
> > > > >      * If current has a pending SIGKILL or is exiting, then automatically
> > > > >      * select it. The goal is to allow it to allocate so that it may
> > > > >      * quickly exit and free its memory.
> > > > >
> > > > > Please note that I haven't explored how much of the infrastructure needed for the OOM decision making is available to modules. But we can export a lot of what we currently have in oom_kill.c. I admit it might turn out that this is simply not feasible, but I would like this to be at least explored before we go and implement yet another hardcoded way to handle (see how I didn't use policy ;)) the OOM situation.
> > > >
> > > > ;)
> > > >
> > > > My doubt here is mainly that we'll see many (or any) real-life cases materialize that cannot be handled with cgroups and scoring. These are powerful building blocks on which userspace can implement all kinds of policy and sorting algorithms.
> > > >
> > > > So this seems like a lot of churn and complicated code to handle one extension. An extension that implements basic functionality.
> > >
> > > Well, as I've said, I didn't get to explore this path, so I have only a very vague idea what we would have to export to implement e.g. the proposed oom killing strategy suggested in this thread. Unfortunately I do not have much time for that. I do not want to block useful work which you have a usecase for, but I would be really happy if we could consider longer term plans before diving into a "hardcoded" implementation. We didn't do that previously and we are left with oom_kill_allocating_task and similar one-off things.
> >
> > As I understand it, killing the allocating task was simply the default before the OOM killer and was added as a compat knob. I really doubt anybody is using it at this point, and we could probably delete it.
>
> I might misremember, but my recollection is that SGI simply had too large machines with too many processes, and so the task selection was very expensive.

The cgroup-aware OOM killer can be much better in the case of a large number of processes, as we don't have to iterate over all processes locking each mm, and can select an appropriate cgroup based mostly on lockless counters. Of course, it depends on the concrete setup, but it can be much more efficient under the right circumstances.

> > I appreciate your concern of being too short-sighted here, but the fact that I cannot point to more usecases isn't for lack of trying. I simply don't see the endless possibilities of usecases that you do.
> >
> > It's unlikely for more types of memory domains to pop up besides MMs and cgroups. (I mentioned vmas, but that just seems esoteric. And we have panic_on_oom for whole-system death. What else could there be?)
> >
> > And as I pointed out, there is no real evidence that the current system for configuring preferences isn't sufficient in practice.
> >
> > That's my thoughts on exploring. I'm not sure what else to do before it feels like running off into fairly contrived hypotheticals.
>
> Yes, I do not want hypotheticals to block an otherwise useful feature, of course. But I haven't heard a strong argument why a module based approach would be a bigger maintenance burden long term. From a very quick glance over the patches Roman posted yesterday, it seems that a large part of the existing oom infrastructure can be reused reasonably.

I have nothing against a module based approach, but I don't think that a module should implement anything other than the oom score calculation (for a process and a cgroup). Maybe also some custom method for killing, but I can't really imagine anything reasonable except killing one "worst" process or killing whole cgroup(s). In case of a system wide OOM, we have to free some memory quickly, and this means we can't do anything much more complex than killing some process(es).

So, in my understanding, what you're suggesting is not against the proposed approach at all. We still need to iterate over cgroups, somehow define their badness, find the worst one and destroy it. In my v2 I've tried to separate these two potentially customizable areas in two simple functions: mem_cgroup_oom_badness() and mem_cgroup_kill_oom_victim(). So we can add an ability to customize these functions (and similar stuff for processes), if we have some real examples of where the proposed functionality is insufficient.

Do you have any examples which can't be covered by this approach?

Thanks!

Roman
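The two-function split Roman describes can be sketched as a small userspace model: a badness calculation driven only by per-cgroup counters, and a separate kill step. Everything below except that idea is invented for illustration; the real mem_cgroup_oom_badness() and mem_cgroup_kill_oom_victim() operate on struct mem_cgroup inside the kernel:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace model of the two customizable steps: struct fields and
 * helper names are illustrative, not taken from the actual patch.
 */
struct memcg {
	const char *name;
	unsigned long pages;	/* resident pages, read from a lockless counter */
	unsigned long swap;	/* swap entries charged to the group */
	int killed;
};

/* Step 1: badness of one cgroup, derived from its counters only. */
static unsigned long memcg_oom_badness(const struct memcg *cg)
{
	return cg->pages + cg->swap;
}

/* Scan the (leaf) cgroups and pick the one with the highest badness. */
struct memcg *select_victim_memcg(struct memcg *groups, size_t n)
{
	struct memcg *victim = NULL;
	unsigned long worst = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned long score = memcg_oom_badness(&groups[i]);
		if (score > worst) {
			worst = score;
			victim = &groups[i];
		}
	}
	return victim;
}

/* Step 2: the kill method; here it only marks the group as killed. */
void kill_oom_victim(struct memcg *victim)
{
	if (victim)
		victim->killed = 1;
}
```

Note how no per-process state is touched during selection, which is the efficiency argument made above: only per-group counters are read.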
Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
On Wed 31-05-17 14:01:45, Johannes Weiner wrote:
> On Wed, May 31, 2017 at 06:25:04PM +0200, Michal Hocko wrote:
> > On Thu 25-05-17 13:08:05, Johannes Weiner wrote:
> > > Everything the user would want to dynamically program in the kernel, say with bpf, they could do in userspace and then update the scores for each group and task periodically.
> >
> > I am rather skeptical about dynamic scores. oom_{score_}adj has turned into mere oom disable/enable knobs from my experience.
>
> That doesn't necessarily have to be a deficiency of the scoring system. I suspect that most people simply don't care as long as the picks for OOM victims aren't entirely stupid.
>
> For example, we have a lot of machines that run one class of job. If we run OOM there isn't much preference we'd need to express; just kill one job - the biggest, whatever - and move on. (The biggest makes sense because if all jobs are basically equal it's as good as any other victim, but if one has a runaway bug it goes for that.)
>
> Where we have more than one job class, it actually is mostly one hipri and one lopri, in which case setting a hard limit on the lopri or the -1000 OOM score trick is enough.
>
> How many systems run more than two clearly distinguishable classes of workloads concurrently?

What about those which run different containers on a large physical machine?

> I'm sure they exist. I'm just saying it doesn't surprise me that elaborate OOM scoring isn't all that widespread.
>
> > > The only limitation is that you have to recalculate and update the scoring tree every once in a while, whereas a bpf program could evaluate things just-in-time. But for that to matter in practice, OOM kills would have to be a fairly hot path.
> >
> > I am not really sure how to reliably implement a "kill the memcg with the largest process" strategy. And who knows how many other strategies will pop out.
>
> That seems fairly contrived.
>
> What does it mean to divide memory into subdomains, but when you run out of physical memory you kill based on the biggest task?

Well, the biggest task might be the runaway one, and so killing it first before you kill other innocent ones makes some sense to me.

> Sure, it frees memory and gets the system going again, so it's as good as any answer to overcommit gone wrong, I guess. But is that something you'd intentionally want to express from a userspace perspective?
> [...]
> > > > Maybe. But that requires somebody to tweak the scoring, which can be far from trivial.
> > >
> > > Why is sorting and picking in userspace harder than sorting and picking in the kernel?
> >
> > Because the userspace score based approach would be much more racy, especially in a busy system. This could lead to unexpected behavior where the OOM killer kills a different memcg than the run-away one.
>
> How would it be easier to weigh priority against runaway detection inside the kernel?

You have better chances to catch such a process at the time of the OOM, because you do the check at the time of the OOM rather than sometime back in time when your monitor was able to run and check all the existing processes (which alone can be rather time consuming, so you do not want to do that very often).

> > > > +    /*
> > > >      * If current has a pending SIGKILL or is exiting, then automatically
> > > >      * select it. The goal is to allow it to allocate so that it may
> > > >      * quickly exit and free its memory.
> > > >
> > > > Please note that I haven't explored how much of the infrastructure needed for the OOM decision making is available to modules. But we can export a lot of what we currently have in oom_kill.c. I admit it might turn out that this is simply not feasible, but I would like this to be at least explored before we go and implement yet another hardcoded way to handle (see how I didn't use policy ;)) the OOM situation.
> > >
> > > ;)
> > >
> > > My doubt here is mainly that we'll see many (or any) real-life cases materialize that cannot be handled with cgroups and scoring. These are powerful building blocks on which userspace can implement all kinds of policy and sorting algorithms.
> > >
> > > So this seems like a lot of churn and complicated code to handle one extension. An extension that implements basic functionality.
> >
> > Well, as I've said, I didn't get to explore this path, so I have only a very vague idea what we would have to export to implement e.g. the proposed oom killing strategy suggested in this thread. Unfortunately I do not have much time for that. I do not want to block useful work which you have a usecase for, but I would be really happy if we could consider longer term plans before diving into a "hardcoded" implementation. We didn't do that previously and we are left with oom_kill_allocating_task and similar one-off things.
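The module direction Michal sketches - exporting the oom_kill.c decision points so that a loadable policy can override them - could look roughly like the userspace model below. The ops structure, its fields, and all function names are hypothetical; no such interface exists in the kernel:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task record, standing in for task_struct details. */
struct oom_task {
	const char *comm;
	unsigned long rss;	/* resident pages */
};

/* Hypothetical pluggable policy: only the score calculation is swappable. */
struct oom_policy_ops {
	/* return a badness score; higher means "kill me first" */
	unsigned long (*badness)(const struct oom_task *t);
};

/* Default policy: biggest memory consumer first. */
static unsigned long default_badness(const struct oom_task *t)
{
	return t->rss;
}

static struct oom_policy_ops default_policy = { .badness = default_badness };
static struct oom_policy_ops *active_policy = &default_policy;

/* A module would swap in its own ops; the core selection loop is unchanged. */
void register_oom_policy(struct oom_policy_ops *ops)
{
	active_policy = ops ? ops : &default_policy;
}

const struct oom_task *oom_select(const struct oom_task *tasks, size_t n)
{
	const struct oom_task *victim = NULL;
	unsigned long worst = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned long score = active_policy->badness(&tasks[i]);
		if (score > worst) {
			worst = score;
			victim = &tasks[i];
		}
	}
	return victim;
}
```

This matches Roman's point above: the iterate-and-pick machinery stays common, and only the badness calculation would be module-replaceable.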
Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
On Wed, May 31, 2017 at 06:25:04PM +0200, Michal Hocko wrote:
> On Thu 25-05-17 13:08:05, Johannes Weiner wrote:
> > Everything the user would want to dynamically program in the kernel, say with bpf, they could do in userspace and then update the scores for each group and task periodically.
>
> I am rather skeptical about dynamic scores. oom_{score_}adj has turned into mere oom disable/enable knobs from my experience.

That doesn't necessarily have to be a deficiency of the scoring system. I suspect that most people simply don't care as long as the picks for OOM victims aren't entirely stupid.

For example, we have a lot of machines that run one class of job. If we run OOM there isn't much preference we'd need to express; just kill one job - the biggest, whatever - and move on. (The biggest makes sense because if all jobs are basically equal it's as good as any other victim, but if one has a runaway bug it goes for that.)

Where we have more than one job class, it actually is mostly one hipri and one lopri, in which case setting a hard limit on the lopri or the -1000 OOM score trick is enough.

How many systems run more than two clearly distinguishable classes of workloads concurrently? I'm sure they exist. I'm just saying it doesn't surprise me that elaborate OOM scoring isn't all that widespread.

> > The only limitation is that you have to recalculate and update the scoring tree every once in a while, whereas a bpf program could evaluate things just-in-time. But for that to matter in practice, OOM kills would have to be a fairly hot path.
>
> I am not really sure how to reliably implement a "kill the memcg with the largest process" strategy. And who knows how many other strategies will pop out.

That seems fairly contrived.

What does it mean to divide memory into subdomains, but when you run out of physical memory you kill based on the biggest task?

Sure, it frees memory and gets the system going again, so it's as good as any answer to overcommit gone wrong, I guess. But is that something you'd intentionally want to express from a userspace perspective?

> > > > > > > And both kinds of workloads (services/applications and individual processes run by users) can co-exist on the same host - consider the default systemd setup, for instance.
> > > > > > >
> > > > > > > IMHO it would be better to give users a choice regarding what they really want for a particular cgroup in case of OOM - killing the whole cgroup or one of its descendants. For example, we could introduce a per-cgroup flag that would tell the kernel whether the cgroup can tolerate killing a descendant or not. If it can, the kernel will pick the fattest sub-cgroup or process and check it. If it cannot, it will kill the whole cgroup and all its processes and sub-cgroups.
> > > > > >
> > > > > > The last thing we want to do is to compare processes with cgroups. I agree that we can have some option to disable the cgroup-aware OOM at all, mostly for backward-compatibility. But I don't think it should be a per-cgroup configuration option, which we will support forever.
> > > > >
> > > > > I can clearly see a demand for "this is definitely more important container than others so do not kill" usecases. I can also see demand for "do not kill this container running for X days". And more are likely to pop out.
> > > >
> > > > That can all be done with scoring.
> > >
> > > Maybe. But that requires somebody to tweak the scoring, which can be far from trivial.
> >
> > Why is sorting and picking in userspace harder than sorting and picking in the kernel?
>
> Because the userspace score based approach would be much more racy, especially in a busy system. This could lead to unexpected behavior where the OOM killer kills a different memcg than the run-away one.

How would it be easier to weigh priority against runaway detection inside the kernel?

> > > +    /*
> > >      * If current has a pending SIGKILL or is exiting, then automatically
> > >      * select it. The goal is to allow it to allocate so that it may
> > >      * quickly exit and free its memory.
> > >
> > > Please note that I haven't explored how much of the infrastructure needed for the OOM decision making is available to modules. But we can export a lot of what we currently have in oom_kill.c. I admit it might turn out that this is simply not feasible, but I would like this to be at least explored before we go and implement yet another hardcoded way to handle (see how I didn't use policy ;)) the OOM situation.
> >
> > ;)
> >
> > My doubt here is mainly that we'll see many (or any) real-life cases materialize that cannot be handled with cgroups and scoring.
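The userspace scheme Johannes describes boils down to a manager loop that periodically recomputes per-process scores and writes them to /proc/<pid>/oom_score_adj. A minimal sketch of the scoring side follows; the priority scale and the mapping function are made up for illustration, and only the [-1000, 1000] range and the meaning of -1000 ("never OOM-kill this task") come from the actual /proc interface:

```c
#include <assert.h>

/* Bounds of the real /proc/<pid>/oom_score_adj interface. */
#define OOM_SCORE_ADJ_MIN (-1000)
#define OOM_SCORE_ADJ_MAX 1000

/*
 * Map an application-defined priority (0 = most important) to an adj
 * value; "protect" applies the -1000 OOM score trick mentioned above.
 * The *250 spread is an arbitrary choice for this sketch.
 */
int priority_to_score_adj(int priority, int protect)
{
	int adj;

	if (protect)
		return OOM_SCORE_ADJ_MIN;

	adj = priority * 250;
	if (adj > OOM_SCORE_ADJ_MAX)
		adj = OOM_SCORE_ADJ_MAX;
	if (adj < OOM_SCORE_ADJ_MIN)
		adj = OOM_SCORE_ADJ_MIN;
	return adj;
}

/*
 * The real loop would then, for each managed pid, build the path
 * "/proc/<pid>/oom_score_adj", open it O_WRONLY, and write the
 * decimal value. The raciness discussed above comes from the delay
 * between computing these scores and the next OOM event.
 */
```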
Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
On Thu, May 25, 2017 at 05:38:19PM +0200, Michal Hocko wrote:
> On Tue 23-05-17 09:25:44, Johannes Weiner wrote:
> > On Tue, May 23, 2017 at 09:07:47AM +0200, Michal Hocko wrote:
> > > On Mon 22-05-17 18:01:16, Roman Gushchin wrote:
> [...]
> > > > How to react to an OOM is definitely a policy, which depends on the workload. Nothing is changing here from how it's working now, except that now the kernel will choose a victim cgroup, and kill the victim cgroup rather than a process.
> > >
> > > There is a _big_ difference. The current implementation just tries to recover from the OOM situation without caring much about the consequences for the workload. This is the last resort and a service for the _system_ to get back to a sane state. You are trying to make it more clever and workload aware, and that is inevitably going to depend on the specific workload. I really do think we cannot simply hardcode any policy into the kernel for this purpose, and that is why I would like to see a discussion about how to do that in a more extensible way. This might be harder to implement now, but I believe it will turn out better long term.
> >
> > And that's where I still maintain that this isn't really a policy change. Because what this code does ISN'T more clever, and the OOM killer STILL IS a last-resort thing.
>
> The thing I wanted to point out is that what and how much to kill definitely depends on the usecase. We currently kill all tasks which share the mm struct because that is the smallest unit that can unpin user memory. And that makes a lot of sense to me as a general default. I would call any attempt to guess tasks belonging to the same workload/job as "more clever".

Yeah, I agree it needs to be configurable. But a memory domain is not a random guess. It's a core concept of the VM at this point. The fact that the OOM killer cannot handle it is pretty weird and goes way beyond "I wish we could have some smarter heuristics to choose from."

> > We don't need any elaborate just-in-time evaluation of what each entity is worth. We just want to kill the biggest job, not the biggest MM. Just like you wouldn't want just the biggest VMA unmapped and freed, since it leaves your process incoherent, killing one process leaves a job incoherent.
> >
> > I understand that making it fully configurable is a tempting thought, because you'd offload all responsibility to userspace.
>
> It is not only tempting, it is also the only place which can define a more advanced OOM semantic sanely IMHO.

Why do you think that? Everything the user would want to dynamically program in the kernel, say with bpf, they could do in userspace and then update the scores for each group and task periodically.

The only limitation is that you have to recalculate and update the scoring tree every once in a while, whereas a bpf program could evaluate things just-in-time. But for that to matter in practice, OOM kills would have to be a fairly hot path.

> > > > > And both kinds of workloads (services/applications and individual processes run by users) can co-exist on the same host - consider the default systemd setup, for instance.
> > > > >
> > > > > IMHO it would be better to give users a choice regarding what they really want for a particular cgroup in case of OOM - killing the whole cgroup or one of its descendants. For example, we could introduce a per-cgroup flag that would tell the kernel whether the cgroup can tolerate killing a descendant or not. If it can, the kernel will pick the fattest sub-cgroup or process and check it. If it cannot, it will kill the whole cgroup and all its processes and sub-cgroups.
> > > >
> > > > The last thing we want to do is to compare processes with cgroups. I agree that we can have some option to disable the cgroup-aware OOM at all, mostly for backward-compatibility. But I don't think it should be a per-cgroup configuration option, which we will support forever.
> > >
> > > I can clearly see a demand for "this is definitely more important container than others so do not kill" usecases. I can also see demand for "do not kill this container running for X days". And more are likely to pop out.
> >
> > That can all be done with scoring.
>
> Maybe. But that requires somebody to tweak the scoring, which can be far from trivial.

Why is sorting and picking in userspace harder than sorting and picking in the kernel?

> > This was 10 years ago, and nobody has missed anything critical enough to implement something beyond scoring. So I don't see why we'd need to do it for cgroups all of a sudden.
> >
> > They're nothing special, they just group together things we have been OOM killing for ages. So why shouldn't we use the same config model?
> >
> > It seems to me, what we need for this patch is 1) a way to toggle whether the processes and subgroups of a group are interdependent or independent and 2) configurable OOM scoring per cgroup analogous to what we have per process
Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
On Tue 23-05-17 09:25:44, Johannes Weiner wrote:
> On Tue, May 23, 2017 at 09:07:47AM +0200, Michal Hocko wrote:
> > On Mon 22-05-17 18:01:16, Roman Gushchin wrote:
[...]
> > > How to react to an OOM is definitely a policy, which depends on the workload. Nothing is changing here from how it's working now, except that now the kernel will choose a victim cgroup, and kill the victim cgroup rather than a process.
> >
> > There is a _big_ difference. The current implementation just tries to recover from the OOM situation without caring much about the consequences for the workload. This is the last resort and a service for the _system_ to get back to a sane state. You are trying to make it more clever and workload aware, and that is inevitably going to depend on the specific workload. I really do think we cannot simply hardcode any policy into the kernel for this purpose, and that is why I would like to see a discussion about how to do that in a more extensible way. This might be harder to implement now, but I believe it will turn out better long term.
>
> And that's where I still maintain that this isn't really a policy change. Because what this code does ISN'T more clever, and the OOM killer STILL IS a last-resort thing.

The thing I wanted to point out is that what and how much to kill definitely depends on the usecase. We currently kill all tasks which share the mm struct because that is the smallest unit that can unpin user memory. And that makes a lot of sense to me as a general default. I would call any attempt to guess tasks belonging to the same workload/job as "more clever".

> We don't need any elaborate just-in-time evaluation of what each entity is worth. We just want to kill the biggest job, not the biggest MM. Just like you wouldn't want just the biggest VMA unmapped and freed, since it leaves your process incoherent, killing one process leaves a job incoherent.
>
> I understand that making it fully configurable is a tempting thought, because you'd offload all responsibility to userspace.

It is not only tempting, it is also the only place which can define a more advanced OOM semantic sanely IMHO.

> But on the other hand, this was brought up years ago and nothing has happened since. And to me this is evidence that nobody really cares all that much. Because it's still a rather rare event, and there isn't much you cannot accomplish with periodic score adjustments.

Yes, and there were no attempts since then, which suggests that people didn't care all that much. Maybe things have changed now that containers have become much more popular.

> > > > And both kinds of workloads (services/applications and individual processes run by users) can co-exist on the same host - consider the default systemd setup, for instance.
> > > >
> > > > IMHO it would be better to give users a choice regarding what they really want for a particular cgroup in case of OOM - killing the whole cgroup or one of its descendants. For example, we could introduce a per-cgroup flag that would tell the kernel whether the cgroup can tolerate killing a descendant or not. If it can, the kernel will pick the fattest sub-cgroup or process and check it. If it cannot, it will kill the whole cgroup and all its processes and sub-cgroups.
> > >
> > > The last thing we want to do is to compare processes with cgroups. I agree that we can have some option to disable the cgroup-aware OOM at all, mostly for backward-compatibility. But I don't think it should be a per-cgroup configuration option, which we will support forever.
> >
> > I can clearly see a demand for "this is definitely more important container than others so do not kill" usecases. I can also see demand for "do not kill this container running for X days". And more are likely to pop out.
>
> That can all be done with scoring.

Maybe. But that requires somebody to tweak the scoring, which can be far from trivial.

> In fact, we HAD the oom killer consider a target's cputime/runtime before, and David replaced it all with simple scoring in a63d83f427fb ("oom: badness heuristic rewrite").

Yes, that is correct, and I agree that this was definitely a step in the right direction, because time based heuristics tend to behave very unpredictably in general workloads.

> This was 10 years ago, and nobody has missed anything critical enough to implement something beyond scoring. So I don't see why we'd need to do it for cgroups all of a sudden.
>
> They're nothing special, they just group together things we have been OOM killing for ages. So why shouldn't we use the same config model?
>
> It seems to me, what we need for this patch is 1) a way to toggle whether the processes and subgroups of a group are interdependent or independent and 2) configurable OOM scoring per cgroup analogous to what we have per process
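For reference, the scoring introduced by a63d83f427fb can be modelled roughly as below: the badness is the task's memory footprint in pages, nudged by oom_score_adj, where each adj point is worth 0.1% of the allowed memory. This is a deliberate simplification; the real oom_badness() has additional checks (unkillable tasks, capability handling), and details have shifted across kernel versions:

```c
#include <assert.h>

#define OOM_SCORE_ADJ_MIN (-1000)

/*
 * Simplified model of the post-rewrite badness heuristic. All
 * quantities are in pages; "totalpages" is the amount of memory
 * available to the allocation context.
 */
long oom_badness_model(unsigned long rss, unsigned long swapents,
		       unsigned long pgtables, int oom_score_adj,
		       unsigned long totalpages)
{
	long points;

	/* a task at the minimum adj is never selected */
	if (oom_score_adj == OOM_SCORE_ADJ_MIN)
		return 0;

	/* base score: memory that killing the task would free */
	points = rss + swapents + pgtables;

	/* each oom_score_adj point is worth totalpages/1000 */
	points += (long)oom_score_adj * (long)(totalpages / 1000);

	return points > 0 ? points : 0;
}
```

This is why Michal calls the knob a de-facto disable/enable switch: adj values near the extremes dwarf any realistic footprint difference between tasks.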
Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
On Tue, May 23, 2017 at 09:07:47AM +0200, Michal Hocko wrote:
> On Mon 22-05-17 18:01:16, Roman Gushchin wrote:
> > On Sat, May 20, 2017 at 09:37:29PM +0300, Vladimir Davydov wrote:
> > > On Thu, May 18, 2017 at 05:28:04PM +0100, Roman Gushchin wrote:
> > > ...
> > > > +5-2-4. Cgroup-aware OOM Killer
> > > > +
> > > > +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> > > > +It means that it treats memory cgroups as memory consumers
> > > > +rather than individual processes. Under OOM conditions it tries
> > > > +to find an eligible leaf memory cgroup, and kill all processes
> > > > +in this cgroup. If it's not possible (e.g. all processes belong
> > > > +to the root cgroup), it falls back to the traditional per-process
> > > > +behaviour.
> > >
> > > I agree that the current OOM victim selection algorithm is totally unfair in a system using containers and it has been crying for rework for the last few years now, so it's great to see this finally coming.
> > >
> > > However, I don't reckon that killing a whole leaf cgroup is always the best practice. It does make sense when cgroups are used for containerizing services or applications, because a service is unlikely to remain operational after one of its processes is gone, but one can also use cgroups to containerize processes started by a user. Kicking a user out because one of her processes has gone mad doesn't sound right to me.
> >
> > I agree that it's not always the best practice, if you're not allowed to change the cgroup configuration (e.g. create new cgroups). IMHO, this case is mostly covered by using the v1 cgroup interface, which remains unchanged.
>
> But there are features which are v2 only and users might really want to use them. So I really do not buy this v2-only argument.

I have to agree here. We won't get around making the leaf killing opt-in or opt-out in some fashion.

> > > Another example where the policy you're suggesting fails, in my opinion, is the case where a service (cgroup) consists of sub-services (sub-cgroups) that run processes. The main service may stop working normally if one of its sub-services is killed. So it might make sense to kill not just an individual process or a leaf cgroup, but the whole main service with all its sub-services.
> >
> > I agree, although I do not pretend to solve all possible userspace problems caused by an OOM.
> >
> > How to react to an OOM is definitely a policy, which depends on the workload. Nothing is changing here from how it's working now, except that now the kernel will choose a victim cgroup, and kill the victim cgroup rather than a process.
>
> There is a _big_ difference. The current implementation just tries to recover from the OOM situation without caring much about the consequences for the workload. This is the last resort and a service for the _system_ to get back to a sane state. You are trying to make it more clever and workload aware, and that is inevitably going to depend on the specific workload. I really do think we cannot simply hardcode any policy into the kernel for this purpose, and that is why I would like to see a discussion about how to do that in a more extensible way. This might be harder to implement now, but I believe it will turn out better long term.

And that's where I still maintain that this isn't really a policy change. Because what this code does ISN'T more clever, and the OOM killer STILL IS a last-resort thing.

We don't need any elaborate just-in-time evaluation of what each entity is worth. We just want to kill the biggest job, not the biggest MM. Just like you wouldn't want just the biggest VMA unmapped and freed, since it leaves your process incoherent, killing one process leaves a job incoherent.

I understand that making it fully configurable is a tempting thought, because you'd offload all responsibility to userspace. But on the other hand, this was brought up years ago and nothing has happened since. And to me this is evidence that nobody really cares all that much. Because it's still a rather rare event, and there isn't much you cannot accomplish with periodic score adjustments.

> > > And both kinds of workloads (services/applications and individual processes run by users) can co-exist on the same host - consider the default systemd setup, for instance.
> > >
> > > IMHO it would be better to give users a choice regarding what they really want for a particular cgroup in case of OOM - killing the whole cgroup or one of its descendants. For example, we could introduce a per-cgroup flag that would tell the kernel whether the cgroup can tolerate killing a descendant or not. If it can, the kernel will pick the fattest sub-cgroup or process and check it. If it cannot, it will kill the whole cgroup and all its processes and sub-cgroups.
> >
> > The last thing we want to do is to compare processes with cgroups.
Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
On Mon 22-05-17 18:01:16, Roman Gushchin wrote: > On Sat, May 20, 2017 at 09:37:29PM +0300, Vladimir Davydov wrote: > > Hello Roman, > > Hi Vladimir! > > > > > On Thu, May 18, 2017 at 05:28:04PM +0100, Roman Gushchin wrote: > > ... > > > +5-2-4. Cgroup-aware OOM Killer > > > + > > > +Cgroup v2 memory controller implements a cgroup-aware OOM killer. > > > +It means that it treats memory cgroups as memory consumers > > > +rather then individual processes. Under the OOM conditions it tries > > > +to find an elegible leaf memory cgroup, and kill all processes > > > +in this cgroup. If it's not possible (e.g. all processes belong > > > +to the root cgroup), it falls back to the traditional per-process > > > +behaviour. > > > > I agree that the current OOM victim selection algorithm is totally > > unfair in a system using containers and it has been crying for rework > > for the last few years now, so it's great to see this finally coming. > > > > However, I don't reckon that killing a whole leaf cgroup is always the > > best practice. It does make sense when cgroups are used for > > containerizing services or applications, because a service is unlikely > > to remain operational after one of its processes is gone, but one can > > also use cgroups to containerize processes started by a user. Kicking a > > user out for one of her process has gone mad doesn't sound right to me. > > I agree, that it's not always a best practise, if you're not allowed > to change the cgroup configuration (e.g. create new cgroups). > IMHO, this case is mostly covered by using the v1 cgroup interface, > which remains unchanged. But there are features which are v2 only and users might really want to use it. So I really do not buy this v2-only argument. > If you do have control over cgroups, you can put processes into > separate cgroups, and obtain control over OOM victim selection and killing. Usually you do not have that control because there is a global daemon doing the placement for you. 
> > Another example when the policy you're suggesting fails in my opinion is > > in case a service (cgroup) consists of sub-services (sub-cgroups) that > > run processes. The main service may stop working normally if one of its > > sub-services is killed. So it might make sense to kill not just an > > individual process or a leaf cgroup, but the whole main service with all > > its sub-services. > > I agree, although I do not pretend for solving all possible > userspace problems caused by an OOM. > > How to react on an OOM - is definitely a policy, which depends > on the workload. Nothing is changing here from how it's working now, > except now kernel will choose a victim cgroup, and kill the victim cgroup > rather than a process. There is a _big_ difference. The current implementation just tries to recover from the OOM situation without caring much about the consequences on the workload. This is the last resort and a service for the _system_ to get back to a sane state. You are trying to make it more clever and workload aware and that is inevitably going to depend on the specific workload. I really do think we cannot simply hardcode any policy into the kernel for this purpose and that is why I would like to see a discussion about how to do that in a more extensible way. This might be harder to implement now, but I believe it will turn out better long term. > > And both kinds of workloads (services/applications and individual > processes run by users) can co-exist on the same host - consider the > default systemd setup, for instance. > > > > IMHO it would be better to give users a choice regarding what they > > really want for a particular cgroup in case of OOM - killing the whole > > cgroup or one of its descendants. For example, we could introduce a > > per-cgroup flag that would tell the kernel whether the cgroup can > > tolerate killing a descendant or not. If it can, the kernel will pick > > the fattest sub-cgroup or process and check it. 
If it cannot, it will > > kill the whole cgroup and all its processes and sub-cgroups. > > The last thing we want to do, is to compare processes with cgroups. > I agree, that we can have some option to disable the cgroup-aware OOM at all, > mostly for backward-compatibility. But I don't think it should be a > per-cgroup configuration option, which we will support forever. I can clearly see a demand for "this is definitely more important container than others so do not kill" usecases. I can also see demand for "do not kill this container running for X days". And more are likely to pop out. -- Michal Hocko SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
On Sat, May 20, 2017 at 09:37:29PM +0300, Vladimir Davydov wrote: > Hello Roman, Hi Vladimir! > > On Thu, May 18, 2017 at 05:28:04PM +0100, Roman Gushchin wrote: > ... > > +5-2-4. Cgroup-aware OOM Killer > > + > > +Cgroup v2 memory controller implements a cgroup-aware OOM killer. > > +It means that it treats memory cgroups as memory consumers > > +rather then individual processes. Under the OOM conditions it tries > > +to find an elegible leaf memory cgroup, and kill all processes > > +in this cgroup. If it's not possible (e.g. all processes belong > > +to the root cgroup), it falls back to the traditional per-process > > +behaviour. > > I agree that the current OOM victim selection algorithm is totally > unfair in a system using containers and it has been crying for rework > for the last few years now, so it's great to see this finally coming. > > However, I don't reckon that killing a whole leaf cgroup is always the > best practice. It does make sense when cgroups are used for > containerizing services or applications, because a service is unlikely > to remain operational after one of its processes is gone, but one can > also use cgroups to containerize processes started by a user. Kicking a > user out for one of her process has gone mad doesn't sound right to me. I agree, that it's not always a best practise, if you're not allowed to change the cgroup configuration (e.g. create new cgroups). IMHO, this case is mostly covered by using the v1 cgroup interface, which remains unchanged. If you do have control over cgroups, you can put processes into separate cgroups, and obtain control over OOM victim selection and killing. > Another example when the policy you're suggesting fails in my opinion is > in case a service (cgroup) consists of sub-services (sub-cgroups) that > run processes. The main service may stop working normally if one of its > sub-services is killed. 
So it might make sense to kill not just an > individual process or a leaf cgroup, but the whole main service with all > its sub-services. I agree, although I do not pretend to solve all possible userspace problems caused by an OOM. How to react to an OOM is definitely a policy, which depends on the workload. Nothing is changing here from how it's working now, except that now the kernel will choose a victim cgroup, and kill the victim cgroup rather than a process. > And both kinds of workloads (services/applications and individual > processes run by users) can co-exist on the same host - consider the > default systemd setup, for instance. > > IMHO it would be better to give users a choice regarding what they > really want for a particular cgroup in case of OOM - killing the whole > cgroup or one of its descendants. For example, we could introduce a > per-cgroup flag that would tell the kernel whether the cgroup can > tolerate killing a descendant or not. If it can, the kernel will pick > the fattest sub-cgroup or process and check it. If it cannot, it will > kill the whole cgroup and all its processes and sub-cgroups. The last thing we want to do is to compare processes with cgroups. I agree that we can have some option to disable the cgroup-aware OOM at all, mostly for backward-compatibility. But I don't think it should be a per-cgroup configuration option, which we will support forever. > > > + > > +The memory controller tries to make the best choise of a victim cgroup. > > +In general, it tries to select the largest cgroup, matching given > > +node/zone requirements, but the concrete algorithm is not defined, > > +and may be changed later. > > + > > +This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM > > +the memory controller considers only cgroups belonging to a sub-tree > > +of the OOM-ing cgroup, including itself. > ... 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index c131f7e..8d07481 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -2625,6 +2625,75 @@ static inline bool memcg_has_children(struct > > mem_cgroup *memcg) > > return ret; > > } > > > > +bool mem_cgroup_select_oom_victim(struct oom_control *oc) > > +{ > > + struct mem_cgroup *iter; > > + unsigned long chosen_memcg_points; > > + > > + oc->chosen_memcg = NULL; > > + > > + if (mem_cgroup_disabled()) > > + return false; > > + > > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) > > + return false; > > + > > + pr_info("Choosing a victim memcg because of %s", > > + oc->memcg ? > > + "memory limit reached of cgroup " : > > + "out of memory\n"); > > + if (oc->memcg) { > > + pr_cont_cgroup_path(oc->memcg->css.cgroup); > > + pr_cont("\n"); > > + } > > + > > + chosen_memcg_points = 0; > > + > > + for_each_mem_cgroup_tree(iter, oc->memcg) { > > + unsigned long points; > > + int nid; > > + > > + if (mem_cgroup_is_root(iter)) > > + continue; > > + > > + if
Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
Hello Roman, On Thu, May 18, 2017 at 05:28:04PM +0100, Roman Gushchin wrote: ... > +5-2-4. Cgroup-aware OOM Killer > + > +Cgroup v2 memory controller implements a cgroup-aware OOM killer. > +It means that it treats memory cgroups as memory consumers > +rather then individual processes. Under the OOM conditions it tries > +to find an elegible leaf memory cgroup, and kill all processes > +in this cgroup. If it's not possible (e.g. all processes belong > +to the root cgroup), it falls back to the traditional per-process > +behaviour. I agree that the current OOM victim selection algorithm is totally unfair in a system using containers and it has been crying for rework for the last few years now, so it's great to see this finally coming. However, I don't reckon that killing a whole leaf cgroup is always the best practice. It does make sense when cgroups are used for containerizing services or applications, because a service is unlikely to remain operational after one of its processes is gone, but one can also use cgroups to containerize processes started by a user. Kicking a user out because one of her processes has gone mad doesn't sound right to me. Another example where the policy you're suggesting fails, in my opinion, is when a service (cgroup) consists of sub-services (sub-cgroups) that run processes. The main service may stop working normally if one of its sub-services is killed. So it might make sense to kill not just an individual process or a leaf cgroup, but the whole main service with all its sub-services. And both kinds of workloads (services/applications and individual processes run by users) can co-exist on the same host - consider the default systemd setup, for instance. IMHO it would be better to give users a choice regarding what they really want for a particular cgroup in case of OOM - killing the whole cgroup or one of its descendants. 
For example, we could introduce a per-cgroup flag that would tell the kernel whether the cgroup can tolerate killing a descendant or not. If it can, the kernel will pick the fattest sub-cgroup or process and check it. If it cannot, it will kill the whole cgroup and all its processes and sub-cgroups. > + > +The memory controller tries to make the best choise of a victim cgroup. > +In general, it tries to select the largest cgroup, matching given > +node/zone requirements, but the concrete algorithm is not defined, > +and may be changed later. > + > +This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM > +the memory controller considers only cgroups belonging to a sub-tree > +of the OOM-ing cgroup, including itself. ... > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index c131f7e..8d07481 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -2625,6 +2625,75 @@ static inline bool memcg_has_children(struct > mem_cgroup *memcg) > return ret; > } > > +bool mem_cgroup_select_oom_victim(struct oom_control *oc) > +{ > + struct mem_cgroup *iter; > + unsigned long chosen_memcg_points; > + > + oc->chosen_memcg = NULL; > + > + if (mem_cgroup_disabled()) > + return false; > + > + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) > + return false; > + > + pr_info("Choosing a victim memcg because of %s", > + oc->memcg ? 
> + "memory limit reached of cgroup " : > + "out of memory\n"); > + if (oc->memcg) { > + pr_cont_cgroup_path(oc->memcg->css.cgroup); > + pr_cont("\n"); > + } > + > + chosen_memcg_points = 0; > + > + for_each_mem_cgroup_tree(iter, oc->memcg) { > + unsigned long points; > + int nid; > + > + if (mem_cgroup_is_root(iter)) > + continue; > + > + if (memcg_has_children(iter)) > + continue; > + > + points = 0; > + for_each_node_state(nid, N_MEMORY) { > + if (oc->nodemask && !node_isset(nid, *oc->nodemask)) > + continue; > + points += mem_cgroup_node_nr_lru_pages(iter, nid, > + LRU_ALL_ANON | BIT(LRU_UNEVICTABLE)); > + } > + points += mem_cgroup_get_nr_swap_pages(iter); I guess we should also take into account kmem as well (unreclaimable slabs, kernel stacks, socket buffers). > + > + pr_info("Memcg "); > + pr_cont_cgroup_path(iter->css.cgroup); > + pr_cont(": %lu\n", points); > + > + if (points > chosen_memcg_points) { > + if (oc->chosen_memcg) > + css_put(&oc->chosen_memcg->css); > + > + oc->chosen_memcg = iter; > + css_get(&iter->css); > + > + chosen_memcg_points = points; > + } > + } > + > + if (oc->chosen_memcg) { > + pr_info("Kill memcg "); > +
Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
On Thu 18-05-17 14:11:17, Johannes Weiner wrote: > On Thu, May 18, 2017 at 07:30:04PM +0200, Michal Hocko wrote: > > On Thu 18-05-17 17:28:04, Roman Gushchin wrote: > > > Traditionally, the OOM killer is operating on a process level. > > > Under oom conditions, it finds a process with the highest oom score > > > and kills it. > > > > > > This behavior doesn't suit well the system with many running > > > containers. There are two main issues: > > > > > > 1) There is no fairness between containers. A small container with > > > a few large processes will be chosen over a large one with huge > > > number of small processes. > > > > > > 2) Containers often do not expect that some random process inside > > > will be killed. So, in general, a much safer behavior is > > > to kill the whole cgroup. Traditionally, this was implemented > > > in userspace, but doing it in the kernel has some advantages, > > > especially in a case of a system-wide OOM. > > > > > > To address these issues, cgroup-aware OOM killer is introduced. > > > Under OOM conditions, it looks for a memcg with highest oom score, > > > and kills all processes inside. > > > > > > Memcg oom score is calculated as a size of active and inactive > > > anon LRU lists, unevictable LRU list and swap size. > > > > > > For a cgroup-wide OOM, only cgroups belonging to the subtree of > > > the OOMing cgroup are considered. > > > > While this might make sense for some workloads/setups it is not a > > generally acceptable policy IMHO. We have discussed that different OOM > > policies might be interesting few years back at LSFMM but there was no > > real consensus on how to do that. One possibility was to allow bpf like > > mechanisms. Could you explore that path? > > OOM policy is an orthogonal discussion, though. > > The OOM killer's job is to pick a memory consumer to kill. Per default > the unit of the memory consumer is a process, but cgroups allow > grouping processes into compound consumers. 
Extending the OOM killer > to respect the new definition of "consumer" is not a new policy. I do not want to play word games here, but picking a task or more tasks is a policy from my POV; that is not all that important, though. My primary point is that this new "implementation" is most probably not what people who use memory cgroups outside of containers want. Why? Mostly because they do not care that only a part of the memcg is still alive pretty much like the current global OOM behavior when a single task (or its children) are gone all of a sudden. Why should I kill the whole user slice just because one of its processes went wild? > I don't think it's reasonable to ask the person who's trying to make > the OOM killer support group-consumers to design a dynamic OOM policy > framework instead. > > All we want is the OOM policy, whatever it is, applied to cgroups. And I am not dismissing this usecase. I believe it is valid but not universally applicable when memory cgroups are deployed. That is why I think that we need a way to define those policies in some sane way. Our current oom policies are basically random - /proc/sys/vm/oom_kill_allocating_task resp. /proc/sys/vm/panic_on_oom. I am not really sure we want another hardcoded one e.g. /proc/sys/vm/oom_kill_container because even that might turn out not to be a great fit for different container usecases. Do we want to kill the largest container or the one with the largest memory hog? Should some containers have a higher priority than others? I am pretty sure more criteria would pop up with more usecases. That's why I think that the current OOM killer implementation should stay as a last resort and be process-oriented, and we should think about a way to override it for particular usecases. The exact mechanism is not completely clear to me, to be honest. 
-- Michal Hocko SUSE Labs
Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
On Thu, 2017-05-18 at 20:20 +0100, Roman Gushchin wrote: > On Fri, May 19, 2017 at 04:37:27AM +1000, Balbir Singh wrote: > > On Fri, May 19, 2017 at 3:30 AM, Michal Hocko wrote: > > > On Thu 18-05-17 17:28:04, Roman Gushchin wrote: > > > > Traditionally, the OOM killer is operating on a process level. > > > > Under oom conditions, it finds a process with the highest oom score > > > > and kills it. > > > > > > > > This behavior doesn't suit well the system with many running > > > > containers. There are two main issues: > > > > > > > > 1) There is no fairness between containers. A small container with > > > > a few large processes will be chosen over a large one with huge > > > > number of small processes. > > > > > > > > 2) Containers often do not expect that some random process inside > > > > will be killed. So, in general, a much safer behavior is > > > > to kill the whole cgroup. Traditionally, this was implemented > > > > in userspace, but doing it in the kernel has some advantages, > > > > especially in a case of a system-wide OOM. > > > > > > > > To address these issues, cgroup-aware OOM killer is introduced. > > > > Under OOM conditions, it looks for a memcg with highest oom score, > > > > and kills all processes inside. > > > > > > > > Memcg oom score is calculated as a size of active and inactive > > > > anon LRU lists, unevictable LRU list and swap size. > > > > > > > > For a cgroup-wide OOM, only cgroups belonging to the subtree of > > > > the OOMing cgroup are considered. > > > > > > While this might make sense for some workloads/setups it is not a > > > generally acceptable policy IMHO. We have discussed that different OOM > > > policies might be interesting few years back at LSFMM but there was no > > > real consensus on how to do that. One possibility was to allow bpf like > > > mechanisms. Could you explore that path? > > > > I agree, I think it needs more thought. I wonder if the real issue is > > something > > else. For example > > > > 1. 
Did we overcommit a particular container too much? > > Imagine, you have a machine with multiple containers, > each with it's own process tree, and the machine is overcommited, > i.e. sum of container's memory limits is larger the amount available RAM. > > In a case of a system-wide OOM some random container will be affected. The random container containing the most expensive task, yes! > Historically, this problem was solving by some user-space daemon, > which was monitoring OOM events and cleaning up affected containers. > But this approach can't solve the main problem: non-optimal selection > of a victim. Why do you think the problem is non-optimal selection? Is it because we believe that memory cgroup limits should play a role in the decision making of a global OOM? > > > 2. Do we need something like > > https://urldefense.proofpoint.com/v2/url?u=https-3A__lwn.net_Articles_604212_=DwIBaQ=5VD0RTtNlTh3ycd41b3MUw=jJYgtDM7QT-W-Fz_d29HYQ=9jV4id5lmsjFJj1kQjJk0auyQ3bzL27-f6Ur6ZNw36c=ElsS25CoZSPba6ke7O-EIsR7lN0psP6tDVyLnGqCMfs= > > to solve > > the problem? The URL got changed to something non-parsable, probably for security, but could your email client please not do that. > I don't think it's related. I was thinking that if we have virtual memory limits and we could set some sane ones, we could avoid OOM altogether. OOM is a big hammer and having allocations fail is far more acceptable than killing processes. I believe that several applications may have much larger VM than actual memory usage, but I believe with a good overcommit/virtual memory limiter the problem can be better tackled. > > > 3. We have oom notifiers now, could those be used (assuming you are > > interested > > in non memcg related OOM's affecting a container > > They can be used to inform an userspace daemon about an already happened OOM, > but they do not affect victim selection. 
Yes, the whole point is for the OS to select the victim, the notifiers provide an opportunity for us to do reclaim to probably prevent OOM. In oom_kill, I see blocking_notifier_call_chain(&oom_notify_list, 0, &freed); if (freed > 0) /* Got some memory back in the last second. */ return true; Could the notification to user space then decide what to clean up to free memory? We also have event notification inside of memcg. I am trying to understand why these are not sufficient. We also have soft limits to push containers to a smaller size at the time of global pressure. > > > 4. How do we determine limits for these containers? From a fariness > > perspective > > Limits are usually set from some high-level understanding of the nature > of tasks which are working inside, but overcommiting the machine is > a common place, I assume. Agreed overcommit is a given and that is why we wrote the cgroup controllers. I was wondering if the container limits not being set correctly could cause these
Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
On Fri, May 19, 2017 at 04:37:27AM +1000, Balbir Singh wrote: > On Fri, May 19, 2017 at 3:30 AM, Michal Hocko wrote: > > On Thu 18-05-17 17:28:04, Roman Gushchin wrote: > >> Traditionally, the OOM killer is operating on a process level. > >> Under oom conditions, it finds a process with the highest oom score > >> and kills it. > >> > >> This behavior doesn't suit well the system with many running > >> containers. There are two main issues: > >> > >> 1) There is no fairness between containers. A small container with > >> a few large processes will be chosen over a large one with huge > >> number of small processes. > >> > >> 2) Containers often do not expect that some random process inside > >> will be killed. So, in general, a much safer behavior is > >> to kill the whole cgroup. Traditionally, this was implemented > >> in userspace, but doing it in the kernel has some advantages, > >> especially in a case of a system-wide OOM. > >> > >> To address these issues, cgroup-aware OOM killer is introduced. > >> Under OOM conditions, it looks for a memcg with highest oom score, > >> and kills all processes inside. > >> > >> Memcg oom score is calculated as a size of active and inactive > >> anon LRU lists, unevictable LRU list and swap size. > >> > >> For a cgroup-wide OOM, only cgroups belonging to the subtree of > >> the OOMing cgroup are considered. > > > > While this might make sense for some workloads/setups it is not a > > generally acceptable policy IMHO. We have discussed that different OOM > > policies might be interesting few years back at LSFMM but there was no > > real consensus on how to do that. One possibility was to allow bpf like > > mechanisms. Could you explore that path? > > I agree, I think it needs more thought. I wonder if the real issue is > something > else. For example > > 1. Did we overcommit a particular container too much? > 2. Do we need something like https://lwn.net/Articles/604212/ to solve > the problem? 
The occasional OOM kill is an unavoidable reality on our systems (and I bet on most deployments). If we tried not to overcommit, we'd waste a *lot* of memory. The problem is when OOM happens, we really want the biggest *job* to get killed. Before cgroups, we assumed jobs were processes. But with cgroups, the user is able to define a group of processes as a job, and then an individual process is no longer a first-class memory consumer. Without a patch like this, the OOM killer will compare the sizes of the random subparticles that the jobs in the system are composed of and kill the single biggest particle, leaving behind the incoherent remains of one of the jobs. That doesn't make a whole lot of sense. If you want to determine the most expensive car in a parking lot, you can't go off and compare the price of one car's muffler with the door handle of another, then point to a wind shield and yell "This is it!" You need to compare the cars as a whole with each other. > 3. We have oom notifiers now, could those be used (assuming you are interested > in non memcg related OOM's affecting a container Right now, we watch for OOM notifications and then have userspace kill the rest of a job. That works - somewhat. What remains is the problem that I described above, that comparing individual process sizes is not meaningful when the terminal memory consumer is a cgroup. > 4. How do we determine limits for these containers? From a fariness > perspective How do you mean? 
Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
On Fri, May 19, 2017 at 04:37:27AM +1000, Balbir Singh wrote: > On Fri, May 19, 2017 at 3:30 AM, Michal Hocko wrote: > > On Thu 18-05-17 17:28:04, Roman Gushchin wrote: > >> Traditionally, the OOM killer is operating on a process level. > >> Under oom conditions, it finds a process with the highest oom score > >> and kills it. > >> > >> This behavior doesn't suit well the system with many running > >> containers. There are two main issues: > >> > >> 1) There is no fairness between containers. A small container with > >> a few large processes will be chosen over a large one with huge > >> number of small processes. > >> > >> 2) Containers often do not expect that some random process inside > >> will be killed. So, in general, a much safer behavior is > >> to kill the whole cgroup. Traditionally, this was implemented > >> in userspace, but doing it in the kernel has some advantages, > >> especially in a case of a system-wide OOM. > >> > >> To address these issues, cgroup-aware OOM killer is introduced. > >> Under OOM conditions, it looks for a memcg with highest oom score, > >> and kills all processes inside. > >> > >> Memcg oom score is calculated as a size of active and inactive > >> anon LRU lists, unevictable LRU list and swap size. > >> > >> For a cgroup-wide OOM, only cgroups belonging to the subtree of > >> the OOMing cgroup are considered. > > > > While this might make sense for some workloads/setups it is not a > > generally acceptable policy IMHO. We have discussed that different OOM > > policies might be interesting few years back at LSFMM but there was no > > real consensus on how to do that. One possibility was to allow bpf like > > mechanisms. Could you explore that path? > > I agree, I think it needs more thought. I wonder if the real issue is > something > else. For example > > 1. Did we overcommit a particular container too much? 
Imagine, you have a machine with multiple containers, each with its own process tree, and the machine is overcommitted, i.e. the sum of the containers' memory limits is larger than the amount of available RAM. In a case of a system-wide OOM some random container will be affected. Historically, this problem was solved by some user-space daemon, which was monitoring OOM events and cleaning up affected containers. But this approach can't solve the main problem: non-optimal selection of a victim. > 2. Do we need something like > https://urldefense.proofpoint.com/v2/url?u=https-3A__lwn.net_Articles_604212_=DwIBaQ=5VD0RTtNlTh3ycd41b3MUw=jJYgtDM7QT-W-Fz_d29HYQ=9jV4id5lmsjFJj1kQjJk0auyQ3bzL27-f6Ur6ZNw36c=ElsS25CoZSPba6ke7O-EIsR7lN0psP6tDVyLnGqCMfs= > to solve > the problem? I don't think it's related. > 3. We have oom notifiers now, could those be used (assuming you are interested > in non memcg related OOM's affecting a container They can be used to inform a userspace daemon about an OOM that has already happened, but they do not affect victim selection. > 4. How do we determine limits for these containers? From a fariness > perspective Limits are usually set from some high-level understanding of the nature of tasks which are working inside, but overcommitting the machine is commonplace, I assume. Thank you! Roman
Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
On Fri, May 19, 2017 at 3:30 AM, Michal Hocko wrote:
> On Thu 18-05-17 17:28:04, Roman Gushchin wrote:
>> Traditionally, the OOM killer operates at the process level.
>> Under OOM conditions, it finds the process with the highest oom score
>> and kills it.
>>
>> This behavior doesn't suit systems with many running containers well.
>> There are two main issues:
>>
>> 1) There is no fairness between containers. A small container with
>> a few large processes will be chosen over a large one with a huge
>> number of small processes.
>>
>> 2) Containers often do not expect that some random process inside
>> will be killed. So, in general, a much safer behavior is
>> to kill the whole cgroup. Traditionally, this was implemented
>> in userspace, but doing it in the kernel has some advantages,
>> especially in the case of a system-wide OOM.
>>
>> To address these issues, a cgroup-aware OOM killer is introduced.
>> Under OOM conditions, it looks for the memcg with the highest oom
>> score and kills all processes inside it.
>>
>> The memcg oom score is calculated as the size of the active and
>> inactive anon LRU lists, the unevictable LRU list, and swap.
>>
>> For a cgroup-wide OOM, only cgroups belonging to the subtree of
>> the OOMing cgroup are considered.
>
> While this might make sense for some workloads/setups, it is not a
> generally acceptable policy IMHO. We discussed that different OOM
> policies might be interesting a few years back at LSFMM, but there
> was no real consensus on how to do that. One possibility was to allow
> bpf-like mechanisms. Could you explore that path?

I agree, I think it needs more thought. I wonder if the real issue is
something else. For example

1. Did we overcommit a particular container too much?
2. Do we need something like https://lwn.net/Articles/604212/ to solve
   the problem?
3. We have oom notifiers now, could those be used (assuming you are
   interested in non-memcg-related OOMs affecting a container)?
4. How do we determine limits for these containers? From a fairness
   perspective

Just trying to understand what leads to the issues you are seeing

Balbir
Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer
On Thu 18-05-17 17:28:04, Roman Gushchin wrote:
> Traditionally, the OOM killer operates at the process level.
> Under OOM conditions, it finds the process with the highest oom score
> and kills it.
>
> This behavior doesn't suit systems with many running containers well.
> There are two main issues:
>
> 1) There is no fairness between containers. A small container with
> a few large processes will be chosen over a large one with a huge
> number of small processes.
>
> 2) Containers often do not expect that some random process inside
> will be killed. So, in general, a much safer behavior is
> to kill the whole cgroup. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in the case of a system-wide OOM.
>
> To address these issues, a cgroup-aware OOM killer is introduced.
> Under OOM conditions, it looks for the memcg with the highest oom
> score and kills all processes inside it.
>
> The memcg oom score is calculated as the size of the active and
> inactive anon LRU lists, the unevictable LRU list, and swap.
>
> For a cgroup-wide OOM, only cgroups belonging to the subtree of
> the OOMing cgroup are considered.

While this might make sense for some workloads/setups, it is not a
generally acceptable policy IMHO. We discussed that different OOM
policies might be interesting a few years back at LSFMM, but there
was no real consensus on how to do that. One possibility was to allow
bpf-like mechanisms. Could you explore that path?
--
Michal Hocko
SUSE Labs