Re: [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
On Thu 05-10-17 10:54:01, Johannes Weiner wrote:
> On Thu, Oct 05, 2017 at 03:14:19PM +0200, Michal Hocko wrote:
> > On Wed 04-10-17 16:04:53, Johannes Weiner wrote:
> > [...]
> > > That will silently ignore what the user writes to the memory.oom_group
> > > control files across the system's cgroup tree.
> > >
> > > We'll have a knob that lets the workload declare itself an indivisible
> > > memory consumer, that it would like to get killed in one piece, and
> > > it's silently ignored because of a mount option they forgot to pass.
> > >
> > > That's not good from an interface perspective.
> >
> > Yes and that is why I think a boot time knob would be the most simple
> > way. It will also open doors for more oom policies in future which I
> > believe come sooner or later.
>
> A boot time knob makes less sense to me than the mount option. It
> doesn't require a reboot to change this behavior, we shouldn't force
> the user to reboot when a runtime configuration is possible.

Do we need such runtime configurability, though? If yes, what is the
use case?

> But I don't see how dropping this patch as part of this series would
> prevent adding modular oom policies in the future?

I didn't say that dropping this patch would prevent further oom
policies. My point was that a command line option could be more generic
and allow more policies in the future.

> That said, selectable OOM policies sound like a total dead end to
> me. The kernel OOM happens way too late to be useful for any kind of
> resource policy already. Even now it won't prevent you from thrashing
> indefinitely, with only 5% of your workload's time spent productively.
>
> What kind of service quality do you have at this point?

The OOM killer is a disruptive operation which can be really costly
from the workload perspective (you are losing work), and as such the
victim selection really depends on the workload. Most workloads are
just fine with the most rudimentary kill-the-largest approach, but
think of workloads where the amount or type of work matters much more
(think of long-running computational jobs taking weeks). We cannot
really handle all of those, so I expect that we will eventually have to
provide a way to allow different policies _somehow_.

> The *minority* of our OOM situations (in terms of "this isn't making
> real progress anymore due to a lack of memory") is even *seeing* OOM
> kills at this point. And it'll get worse as storage gets faster and
> memory bigger.

This is imho a separate problem which is independent of the oom victim
selection.

> How is that useful as a resource arbitration point?
>
> Then there is the question of reliability. I mean, we still don't have
> a global OOM killer that is actually free from deadlocks.

Well, I believe that we should be deadlock free now.

> We don't
> have reserves measured to the exact requirements of reclaim that would
> guarantee recovery, the OOM reaper requires a lock that we hope isn't
> taken, etc. I wouldn't want any of my fleet to rely on this for
> regular operation - I'm just glad that, when we do mess up and hit
> this event, we don't have to reboot.
>
> It makes much more sense to monitor memory pressure from userspace and
> smartly intervene when things turn unproductive, which is a long way
> from the point where the kernel is about to *deadlock* due to memory.

Again, this is independent of the oom selection policy.

> Global OOM kills can still happen, but their goal should really be 1)
> to save the kernel, 2) respect the integrity of a memory consumer and
> 3) be comprehensible to userspace. (These patches are about 2 and 3.)

I agree on these but I would add 4) make sure that the impact on the
system is acceptable, i.e. the least disruptive possible.

> But abstracting such a rudimentary and fragile deadlock avoidance
> mechanism into higher-level resource management, or co-opting it as a
> policy enforcement tool, is crazy to me.
>
> And it seems reckless to present it as those things to our users by
> encoding any such elaborate policy interfaces.
>
>
> > > On the other hand, the only benefit of this patch is to shield users
> > > from changes to the OOM killing heuristics. Yet, it's really hard to
> > > imagine that modifying the victim selection process slightly could be
> > > called a regression in any way. We have done that many times over,
> > > without a second thought on backwards compatibility:
> > >
> > > 5e9d834a0e0c oom: sacrifice child with highest badness score for parent
> > > a63d83f427fb oom: badness heuristic rewrite
> > > 778c14affaf9 mm, oom: base root bonus on current usage
> >
> > yes we have changed that without deeper consideration. Some of those
> > changes are arguable (e.g. child sacrifice). The oom badness
> > heuristic rewrite has triggered quite some complaints AFAIR (I remember
> > Kosaki has made several attempts to revert it). I think that we are
> > trying to be more careful about user visible changes than we used to be.
Re: [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
Hello, Michal.

On Thu, Oct 05, 2017 at 03:14:19PM +0200, Michal Hocko wrote:
> Yes and that is why I think a boot time knob would be the most simple
> way. It will also open doors for more oom policies in future which I
> believe come sooner or later.

While boot params are fine for development and debugging, as a
user-interface, they aren't great.

* The user can't easily confirm whether the config they input is
  correct, and when they get it wrong, what's wrong can be pretty
  mysterious.

* While kernel params can be made r/w through /proc, people usually
  don't expect that, and using that can become really confusing because
  a lot of people use "dmesg|grep" to confirm the boot params and that
  won't agree with the setting written later.

* It can't be scoped. What if we want to choose different policies
  per delegated subtree?

* Boot params aren't the easiest to play with (again, if you're a
  developer, they are, but most users aren't developers) and are prone
  to cause deployment issues.

* In this case it's even worse, because it ends up silently ignoring a
  clearly explicit configuration in an interface file.

If the behavior differences we get from the group oom code aren't
critical (and they don't seem to be), I'd greatly prefer just enabling
it when cgroup2 is in use. If it absolutely must be opt-in even on
cgroup2, we can discuss other ways, but I'd really like to see stronger
rationales before going that route.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
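Editorial sketch: the first two bullets above are about being able to confirm what is actually in effect. A mount option, unlike a boot parameter, is reported back through the filesystem's show_options hook (the patch under discussion adds ",groupoom" to cgroup_show_options()), so it should be visible in a mounts table such as /proc/mounts. The following is a minimal userspace sketch of such a check, assuming glibc's <mntent.h> interface; the function name cgroup2_has_mount_option() is hypothetical, and "groupoom" is only the option proposed in this series.

```c
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan a mounts-format file (for example
 * /proc/mounts) for the first cgroup2 mount and check one option.
 *
 * Returns 1 if the cgroup2 mount carries `opt`, 0 if a cgroup2 mount
 * exists without it, and -1 if the file cannot be read or lists no
 * cgroup2 mount.
 */
static int cgroup2_has_mount_option(const char *path, const char *opt)
{
	FILE *fp = setmntent(path, "r");
	struct mntent *ent;
	int result = -1;

	if (!fp)
		return -1;

	while ((ent = getmntent(fp)) != NULL) {
		if (strcmp(ent->mnt_type, "cgroup2") != 0)
			continue;
		/* hasmntopt() matches whole comma-separated options. */
		result = hasmntopt(ent, opt) ? 1 : 0;
		break;
	}

	endmntent(fp);
	return result;
}
```

A caller would typically pass "/proc/mounts" as the path and the option name of interest, then treat -1 as "cgroup2 is not mounted here".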
Re: [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
On Thu, Oct 05, 2017 at 03:14:19PM +0200, Michal Hocko wrote:
> On Wed 04-10-17 16:04:53, Johannes Weiner wrote:
> [...]
> > That will silently ignore what the user writes to the memory.oom_group
> > control files across the system's cgroup tree.
> >
> > We'll have a knob that lets the workload declare itself an indivisible
> > memory consumer, that it would like to get killed in one piece, and
> > it's silently ignored because of a mount option they forgot to pass.
> >
> > That's not good from an interface perspective.
>
> Yes and that is why I think a boot time knob would be the most simple
> way. It will also open doors for more oom policies in future which I
> believe come sooner or later.

A boot time knob makes less sense to me than the mount option. It
doesn't require a reboot to change this behavior, we shouldn't force
the user to reboot when a runtime configuration is possible.

But I don't see how dropping this patch as part of this series would
prevent adding modular oom policies in the future?

That said, selectable OOM policies sound like a total dead end to
me. The kernel OOM happens way too late to be useful for any kind of
resource policy already. Even now it won't prevent you from thrashing
indefinitely, with only 5% of your workload's time spent productively.

What kind of service quality do you have at this point?

The *minority* of our OOM situations (in terms of "this isn't making
real progress anymore due to a lack of memory") is even *seeing* OOM
kills at this point. And it'll get worse as storage gets faster and
memory bigger.

How is that useful as a resource arbitration point?

Then there is the question of reliability. I mean, we still don't have
a global OOM killer that is actually free from deadlocks. We don't
have reserves measured to the exact requirements of reclaim that would
guarantee recovery, the OOM reaper requires a lock that we hope isn't
taken, etc. I wouldn't want any of my fleet to rely on this for
regular operation - I'm just glad that, when we do mess up and hit
this event, we don't have to reboot.

It makes much more sense to monitor memory pressure from userspace and
smartly intervene when things turn unproductive, which is a long way
from the point where the kernel is about to *deadlock* due to memory.

Global OOM kills can still happen, but their goal should really be 1)
to save the kernel, 2) respect the integrity of a memory consumer and
3) be comprehensible to userspace. (These patches are about 2 and 3.)

But abstracting such a rudimentary and fragile deadlock avoidance
mechanism into higher-level resource management, or co-opting it as a
policy enforcement tool, is crazy to me.

And it seems reckless to present it as those things to our users by
encoding any such elaborate policy interfaces.

> > On the other hand, the only benefit of this patch is to shield users
> > from changes to the OOM killing heuristics. Yet, it's really hard to
> > imagine that modifying the victim selection process slightly could be
> > called a regression in any way. We have done that many times over,
> > without a second thought on backwards compatibility:
> >
> > 5e9d834a0e0c oom: sacrifice child with highest badness score for parent
> > a63d83f427fb oom: badness heuristic rewrite
> > 778c14affaf9 mm, oom: base root bonus on current usage
>
> yes we have changed that without deeper consideration. Some of those
> changes are arguable (e.g. child sacrifice). The oom badness
> heuristic rewrite has triggered quite some complaints AFAIR (I remember
> Kosaki has made several attempts to revert it). I think that we are
> trying to be more careful about user visible changes than we used to be.

Whatever grumbling might have come up, it has not resulted in a revert
or a way to switch back to the old behavior. So I don't think this can
be considered an actual regression.

We change heuristics in the MM all the time. If you track for example
allocator behavior over different kernel versions, you can see how
much our caching policy, our huge page policy etc. fluctuates. The
impact of that is way bigger to regular workloads than how we go about
choosing an OOM victim.

We don't want to regress anybody, but let's also keep perspective here
and especially consider the userspace interfaces we are willing to put
in for at least the next few years, the promises we want to make, the
further fragmentation of the config space, for such a negligible risk.
Re: [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
On Thu 05-10-17 14:41:13, Roman Gushchin wrote:
> On Thu, Oct 05, 2017 at 03:14:19PM +0200, Michal Hocko wrote:
> > On Wed 04-10-17 16:04:53, Johannes Weiner wrote:
> > [...]
> > > That will silently ignore what the user writes to the memory.oom_group
> > > control files across the system's cgroup tree.
> > >
> > > We'll have a knob that lets the workload declare itself an indivisible
> > > memory consumer, that it would like to get killed in one piece, and
> > > it's silently ignored because of a mount option they forgot to pass.
> > >
> > > That's not good from an interface perspective.
> >
> > Yes and that is why I think a boot time knob would be the most simple
> > way. It will also open doors for more oom policies in future which I
> > believe come sooner or later.
>
> So, we would rely on grub config to set up OOM policy? Sounds weird.
>
> We use boot options when it's hard to implement on-the-fly switching
> (like turning socket memory accounting on/off), but that is not the case here.

Well, we define global policies with the kernel command line, so I do
not think it would be something unusual. An advantage is that you do
not have to deal with the semantics of a policy change at runtime,
which is something I am not sure we need or even want.

> > > On the other hand, the only benefit of this patch is to shield users
> > > from changes to the OOM killing heuristics. Yet, it's really hard to
> > > imagine that modifying the victim selection process slightly could be
> > > called a regression in any way. We have done that many times over,
> > > without a second thought on backwards compatibility:
> > >
> > > 5e9d834a0e0c oom: sacrifice child with highest badness score for parent
> > > a63d83f427fb oom: badness heuristic rewrite
> > > 778c14affaf9 mm, oom: base root bonus on current usage
> >
> > yes we have changed that without deeper consideration. Some of those
> > changes are arguable (e.g. child sacrifice). The oom badness
> > heuristic rewrite has triggered quite some complaints AFAIR (I remember
> > Kosaki has made several attempts to revert it). I think that we are
> > trying to be more careful about user visible changes than we used to be.
> >
> > More importantly I do not think that the current (non-memcg aware) OOM
> > policy is somehow obsolete and many people expect it to behave
> > consistently. As I've said already, I have seen many complaints that the
> > OOM killer doesn't kill the right task. Most of them were just NUMA
> > related issues where the oom report was not clear enough. I do not want
> > to repeat that again now. Memcg awareness is certainly a useful
> > heuristic but I do not see it universally applicable to all workloads.
> >
> > > Let's not make the userspace interface crap because of some misguided
> > > idea that the OOM heuristic is a hard promise to userspace. It's never
> > > been, and nobody has complained about changes in the past.
> > >
> > > This case is doubly silly, as the behavior change only applies to
> > > cgroup2, which doesn't exactly have a large base of legacy users yet.
> >
> > I agree on the interface part but I disagree with making it default just
> > because v2 is not largely adopted yet.
>
> I believe that the only real regression can be caused by active use of
> oom_score_adj. I really don't know how many cgroup v2 users are relying
> on it (hopefully, 0).

Not only. A memcg with many small tasks could regress as well.

> So, personally I would prefer to have an opt-out cgroup v2 mount option
> (sane new behavior for most users, 100% backward compatibility for rare
> strange setups), but I don't have a very strong opinion here.

I fail to see why people should have to disable the feature after they
see unexpected behavior, rather than the other way around: enabling the
feature when it is really wanted. Opt-in is more correct just from the
"least surprise" POV.

-- 
Michal Hocko
SUSE Labs
Re: [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
On Thu, Oct 05, 2017 at 03:14:19PM +0200, Michal Hocko wrote:
> On Wed 04-10-17 16:04:53, Johannes Weiner wrote:
> [...]
> > That will silently ignore what the user writes to the memory.oom_group
> > control files across the system's cgroup tree.
> >
> > We'll have a knob that lets the workload declare itself an indivisible
> > memory consumer, that it would like to get killed in one piece, and
> > it's silently ignored because of a mount option they forgot to pass.
> >
> > That's not good from an interface perspective.
>
> Yes and that is why I think a boot time knob would be the most simple
> way. It will also open doors for more oom policies in future which I
> believe come sooner or later.

So, we would rely on grub config to set up OOM policy? Sounds weird.

We use boot options when it's hard to implement on-the-fly switching
(like turning socket memory accounting on/off), but that is not the
case here.

>
> > On the other hand, the only benefit of this patch is to shield users
> > from changes to the OOM killing heuristics. Yet, it's really hard to
> > imagine that modifying the victim selection process slightly could be
> > called a regression in any way. We have done that many times over,
> > without a second thought on backwards compatibility:
> >
> > 5e9d834a0e0c oom: sacrifice child with highest badness score for parent
> > a63d83f427fb oom: badness heuristic rewrite
> > 778c14affaf9 mm, oom: base root bonus on current usage
>
> yes we have changed that without deeper consideration. Some of those
> changes are arguable (e.g. child sacrifice). The oom badness
> heuristic rewrite has triggered quite some complaints AFAIR (I remember
> Kosaki has made several attempts to revert it). I think that we are
> trying to be more careful about user visible changes than we used to be.
>
> More importantly I do not think that the current (non-memcg aware) OOM
> policy is somehow obsolete and many people expect it to behave
> consistently. As I've said already, I have seen many complaints that the
> OOM killer doesn't kill the right task. Most of them were just NUMA
> related issues where the oom report was not clear enough. I do not want
> to repeat that again now. Memcg awareness is certainly a useful
> heuristic but I do not see it universally applicable to all workloads.
>
> > Let's not make the userspace interface crap because of some misguided
> > idea that the OOM heuristic is a hard promise to userspace. It's never
> > been, and nobody has complained about changes in the past.
> >
> > This case is doubly silly, as the behavior change only applies to
> > cgroup2, which doesn't exactly have a large base of legacy users yet.
>
> I agree on the interface part but I disagree with making it default just
> because v2 is not largely adopted yet.

I believe that the only real regression can be caused by active use of
oom_score_adj. I really don't know how many cgroup v2 users are relying
on it (hopefully, 0).

So, personally I would prefer to have an opt-out cgroup v2 mount option
(sane new behavior for most users, 100% backward compatibility for rare
strange setups), but I don't have a very strong opinion here.

Thanks!
Re: [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
On Wed 04-10-17 16:04:53, Johannes Weiner wrote:
[...]
> That will silently ignore what the user writes to the memory.oom_group
> control files across the system's cgroup tree.
>
> We'll have a knob that lets the workload declare itself an indivisible
> memory consumer, that it would like to get killed in one piece, and
> it's silently ignored because of a mount option they forgot to pass.
>
> That's not good from an interface perspective.

Yes and that is why I think a boot time knob would be the most simple
way. It will also open doors for more oom policies in future which I
believe come sooner or later.

> On the other hand, the only benefit of this patch is to shield users
> from changes to the OOM killing heuristics. Yet, it's really hard to
> imagine that modifying the victim selection process slightly could be
> called a regression in any way. We have done that many times over,
> without a second thought on backwards compatibility:
>
> 5e9d834a0e0c oom: sacrifice child with highest badness score for parent
> a63d83f427fb oom: badness heuristic rewrite
> 778c14affaf9 mm, oom: base root bonus on current usage

yes we have changed that without deeper consideration. Some of those
changes are arguable (e.g. child sacrifice). The oom badness
heuristic rewrite has triggered quite some complaints AFAIR (I remember
Kosaki has made several attempts to revert it). I think that we are
trying to be more careful about user visible changes than we used to be.

More importantly I do not think that the current (non-memcg aware) OOM
policy is somehow obsolete and many people expect it to behave
consistently. As I've said already, I have seen many complaints that the
OOM killer doesn't kill the right task. Most of them were just NUMA
related issues where the oom report was not clear enough. I do not want
to repeat that again now. Memcg awareness is certainly a useful
heuristic but I do not see it universally applicable to all workloads.

> Let's not make the userspace interface crap because of some misguided
> idea that the OOM heuristic is a hard promise to userspace. It's never
> been, and nobody has complained about changes in the past.
>
> This case is doubly silly, as the behavior change only applies to
> cgroup2, which doesn't exactly have a large base of legacy users yet.

I agree on the interface part but I disagree with making it default just
because v2 is not largely adopted yet.

-- 
Michal Hocko
SUSE Labs
Re: [v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
On Wed, Oct 04, 2017 at 04:46:37PM +0100, Roman Gushchin wrote:
> Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
> OOM killer. If not set, the OOM selection is performed in
> a "traditional" per-process way.
>
> The behavior can be changed dynamically by remounting the cgroupfs.
>
> Signed-off-by: Roman Gushchin
> Cc: Michal Hocko
> Cc: Vladimir Davydov
> Cc: Johannes Weiner
> Cc: Tetsuo Handa
> Cc: David Rientjes
> Cc: Andrew Morton
> Cc: Tejun Heo
> Cc: kernel-t...@fb.com
> Cc: cgro...@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux...@kvack.org
> ---
>  include/linux/cgroup-defs.h |  5 +++++
>  kernel/cgroup/cgroup.c      | 10 ++++++++++
>  mm/memcontrol.c             |  3 +++
>  3 files changed, 18 insertions(+)
>
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 3e55bbd31ad1..cae5343a8b21 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -80,6 +80,11 @@ enum {
>  	 * Enable cpuset controller in v1 cgroup to use v2 behavior.
>  	 */
>  	CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
> +
> +	/*
> +	 * Enable cgroup-aware OOM killer.
> +	 */
> +	CGRP_GROUP_OOM = (1 << 5),
>  };
>
>  /* cftype->flags */
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index c3421ee0d230..8d8aa46ff930 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1709,6 +1709,9 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
>  		if (!strcmp(token, "nsdelegate")) {
>  			*root_flags |= CGRP_ROOT_NS_DELEGATE;
>  			continue;
> +		} else if (!strcmp(token, "groupoom")) {
> +			*root_flags |= CGRP_GROUP_OOM;
> +			continue;
>  		}
>
>  		pr_err("cgroup2: unknown option \"%s\"\n", token);
> @@ -1725,6 +1728,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
>  			cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
>  		else
>  			cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
> +
> +		if (root_flags & CGRP_GROUP_OOM)
> +			cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
> +		else
> +			cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
>  	}
>  }
>
> @@ -1732,6 +1740,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
>  {
>  	if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
>  		seq_puts(seq, ",nsdelegate");
> +	if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
> +		seq_puts(seq, ",groupoom");
>  	return 0;
>  }
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1fcd6cc353d5..2e82625bd354 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2865,6 +2865,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
>  	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
>  		return false;
>
> +	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
> +		return false;

That will silently ignore what the user writes to the memory.oom_group
control files across the system's cgroup tree.

We'll have a knob that lets the workload declare itself an indivisible
memory consumer, that it would like to get killed in one piece, and
it's silently ignored because of a mount option they forgot to pass.

That's not good from an interface perspective.

On the other hand, the only benefit of this patch is to shield users
from changes to the OOM killing heuristics. Yet, it's really hard to
imagine that modifying the victim selection process slightly could be
called a regression in any way. We have done that many times over,
without a second thought on backwards compatibility:

5e9d834a0e0c oom: sacrifice child with highest badness score for parent
a63d83f427fb oom: badness heuristic rewrite
778c14affaf9 mm, oom: base root bonus on current usage

Let's not make the userspace interface crap because of some misguided
idea that the OOM heuristic is a hard promise to userspace. It's never
been, and nobody has complained about changes in the past.

This case is doubly silly, as the behavior change only applies to
cgroup2, which doesn't exactly have a large base of legacy users yet.

Let's just drop this 5/6 patch.
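Editorial sketch: the "silently ignored" objection can be made concrete with a small userspace model. This is not the kernel's victim selection; it only mirrors the early return quoted above. The struct model_memcg type and the function model_kills_as_group() are hypothetical names introduced for illustration, and the per-cgroup boolean stands in for whatever the workload wrote to memory.oom_group.

```c
#include <stdbool.h>

/* Bit taken from the patch under review. */
#define CGRP_GROUP_OOM (1u << 5)

/* Hypothetical stand-in for a memory cgroup in this model. */
struct model_memcg {
	bool oom_group; /* value the workload wrote to memory.oom_group */
};

/*
 * Model of the gate: returns true only if the cgroup would be treated
 * as one indivisible OOM victim.
 */
static bool model_kills_as_group(unsigned int root_flags,
				 const struct model_memcg *memcg)
{
	/*
	 * Mirrors the early "return false" in the patch: without the
	 * mount option, the per-cgroup knob is never consulted.
	 */
	if (!(root_flags & CGRP_GROUP_OOM))
		return false;

	return memcg->oom_group;
}
```

In this model a cgroup with oom_group set still gets per-process treatment whenever root_flags lacks CGRP_GROUP_OOM, and nothing reports the mismatch back to whoever set the knob, which is the interface problem being described.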
[v10 5/6] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer
Add a "groupoom" cgroup v2 mount option to enable the cgroup-aware
OOM killer. If not set, the OOM selection is performed in
a "traditional" per-process way.

The behavior can be changed dynamically by remounting the cgroupfs.

Signed-off-by: Roman Gushchin
Cc: Michal Hocko
Cc: Vladimir Davydov
Cc: Johannes Weiner
Cc: Tetsuo Handa
Cc: David Rientjes
Cc: Andrew Morton
Cc: Tejun Heo
Cc: kernel-t...@fb.com
Cc: cgro...@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux...@kvack.org
---
 include/linux/cgroup-defs.h |  5 +++++
 kernel/cgroup/cgroup.c      | 10 ++++++++++
 mm/memcontrol.c             |  3 +++
 3 files changed, 18 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 3e55bbd31ad1..cae5343a8b21 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -80,6 +80,11 @@ enum {
 	 * Enable cpuset controller in v1 cgroup to use v2 behavior.
 	 */
 	CGRP_ROOT_CPUSET_V2_MODE = (1 << 4),
+
+	/*
+	 * Enable cgroup-aware OOM killer.
+	 */
+	CGRP_GROUP_OOM = (1 << 5),
 };
 
 /* cftype->flags */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c3421ee0d230..8d8aa46ff930 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1709,6 +1709,9 @@ static int parse_cgroup_root_flags(char *data, unsigned int *root_flags)
 		if (!strcmp(token, "nsdelegate")) {
 			*root_flags |= CGRP_ROOT_NS_DELEGATE;
 			continue;
+		} else if (!strcmp(token, "groupoom")) {
+			*root_flags |= CGRP_GROUP_OOM;
+			continue;
 		}
 
 		pr_err("cgroup2: unknown option \"%s\"\n", token);
@@ -1725,6 +1728,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
 			cgrp_dfl_root.flags |= CGRP_ROOT_NS_DELEGATE;
 		else
 			cgrp_dfl_root.flags &= ~CGRP_ROOT_NS_DELEGATE;
+
+		if (root_flags & CGRP_GROUP_OOM)
+			cgrp_dfl_root.flags |= CGRP_GROUP_OOM;
+		else
+			cgrp_dfl_root.flags &= ~CGRP_GROUP_OOM;
 	}
 }
 
@@ -1732,6 +1740,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
 {
 	if (cgrp_dfl_root.flags & CGRP_ROOT_NS_DELEGATE)
 		seq_puts(seq, ",nsdelegate");
+	if (cgrp_dfl_root.flags & CGRP_GROUP_OOM)
+		seq_puts(seq, ",groupoom");
 	return 0;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1fcd6cc353d5..2e82625bd354 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2865,6 +2865,9 @@ bool mem_cgroup_select_oom_victim(struct oom_control *oc)
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return false;
 
+	if (!(cgrp_dfl_root.flags & CGRP_GROUP_OOM))
+		return false;
+
 	if (oc->memcg)
 		root = oc->memcg;
 	else
-- 
2.13.6
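Editorial sketch: for readers who want to experiment with the option handling outside the kernel, the following is a minimal userspace model of the parse/show round trip the patch adds. It is not kernel code. The model_ function names are hypothetical, the value used for CGRP_ROOT_NS_DELEGATE is an assumption (only the CGRP_GROUP_OOM bit appears in the patch), the comma splitting is done by hand with strchr() instead of the kernel's helpers, and skipping empty tokens is a simplification of this sketch.

```c
#include <stdio.h>
#include <string.h>

#define CGRP_ROOT_NS_DELEGATE (1u << 3) /* assumed value, not shown in the patch */
#define CGRP_GROUP_OOM        (1u << 5) /* bit taken from the patch */

/*
 * Userspace model of parse_cgroup_root_flags(): splits a comma-separated
 * option string in place and sets one flag bit per known option.
 * Returns 0 on success, -1 on the first unknown option.
 */
static int model_parse_root_flags(char *data, unsigned int *root_flags)
{
	*root_flags = 0;

	while (data) {
		char *token = data;
		char *comma = strchr(data, ',');

		if (comma) {
			*comma = '\0';
			data = comma + 1;
		} else {
			data = NULL;
		}

		if (!*token)
			continue; /* simplification: ignore empty tokens */

		if (!strcmp(token, "nsdelegate")) {
			*root_flags |= CGRP_ROOT_NS_DELEGATE;
			continue;
		}
		if (!strcmp(token, "groupoom")) {
			*root_flags |= CGRP_GROUP_OOM;
			continue;
		}

		fprintf(stderr, "cgroup2: unknown option \"%s\"\n", token);
		return -1;
	}

	return 0;
}

/*
 * Userspace model of cgroup_show_options(): renders the flags back into
 * the ",option" form that is appended to the mount's option string.
 */
static void model_show_options(unsigned int flags, char *buf, size_t len)
{
	snprintf(buf, len, "%s%s",
		 (flags & CGRP_ROOT_NS_DELEGATE) ? ",nsdelegate" : "",
		 (flags & CGRP_GROUP_OOM) ? ",groupoom" : "");
}
```

Because the parser starts from zero on every call, parsing an option string that omits "groupoom" yields flags without that bit, which corresponds to the remount behavior described in the commit message, where apply_cgroup_root_flags() clears the bit when it is absent.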