Re: cgroup aware oom killer (was Re: [PATCH 0/3] introduce memory.oom.group)
On Sun, Aug 19, 2018 at 04:26:50PM -0700, David Rientjes wrote:
> Roman, have you had time to go through this?

Hm, I thought we've finished this part of discussion, no?
Anyway, let me repeat my position: I don't like the interface
you've proposed in that follow-up patchset, and I explained why.
If you've a new proposal, please, rebase it to the current
mm tree, and we can discuss it separately.
Alternatively, we can discuss the interface first (without
the implementation), but, please, make a new thread with a
fresh description of a proposed interface.

Thanks!

>
> On Tue, 7 Aug 2018, David Rientjes wrote:
>
> > On Mon, 6 Aug 2018, Roman Gushchin wrote:
> >
> > > > In a cgroup-aware oom killer world, yes, we need the ability to
> > > > specify that the usage of the entire subtree should be compared
> > > > as a single entity with other cgroups. That is necessary for
> > > > user subtrees but may not be necessary for top-level cgroups
> > > > depending on how you structure your unified cgroup hierarchy.
> > > > So it needs to be configurable, as you suggest, and you are
> > > > correct it can be different than oom.group.
> > > >
> > > > That's not the only thing we need though, as I'm sure you were
> > > > expecting me to say :)
> > > >
> > > > We need the ability to preserve existing behavior, i.e. process
> > > > based and not cgroup aware, for subtrees so that our users who
> > > > have clear expectations and tune their oom_score_adj accordingly
> > > > based on how the oom killer has always chosen processes for oom
> > > > kill do not suddenly regress.
> > >
> > > Isn't the combination of oom.group=0 and oom.evaluate_together=1
> > > describing this case? This basically means that if memcg is
> > > selected as target, the process inside will be selected using
> > > traditional per-process approach.
> > >
> >
> > No, that would overload the policy and mechanism. We want the
> > ability to consider user-controlled subtrees as a single entity for
> > comparison with other user subtrees to select which subtree to
> > target. This does not imply that users want their entire subtree
> > oom killed.
> >
> > > > So we need to define the policy for a subtree that is oom, and I
> > > > suggest we do that as a characteristic of the cgroup that is oom
> > > > ("process" vs "cgroup", and process would be the default to
> > > > preserve what currently happens in a user subtree).
> > >
> > > I'm not entirely convinced here.
> > > I do agree, that some sub-tree may have a well tuned oom_score_adj,
> > > and it's preferable to keep the current behavior.
> > >
> > > At the same time I don't like the idea to look at the policy of
> > > the OOMing cgroup. Why exceeding of one limit should be handled
> > > different to exceeding of another? This seems to be a property of
> > > workload, not a limit.
> > >
> >
> > The limit is the property of the mem cgroup, so it's logical that
> > the policy when reaching that limit is a property of the same mem
> > cgroup. Using the user-controlled subtree example, if we have /david
> > and /roman, we can define our own policies on oom, we are not
> > restricted to cgroup aware selection on the entire hierarchy.
> > /david/oom.policy can be "process" so that I haven't regressed with
> > earlier kernels, and /roman/oom.policy can be "cgroup" to target the
> > largest cgroup in your subtree.
> >
> > Something needs to be oom killed when a mem cgroup at any level in
> > the hierarchy is reached and reclaim has failed. What to do when
> > that limit is reached is a property of that cgroup.
> >
> > > > Now, as users who rely on process selection are well aware, we
> > > > have oom_score_adj to influence the decision of which process to
> > > > oom kill. If our oom subtree is cgroup aware, we should have the
> > > > ability to likewise influence that decision. For example, we
> > > > have high priority applications that run at the top-level that
> > > > use a lot of memory and strictly oom killing them in all
> > > > scenarios because they use a lot of memory isn't appropriate.
> > > > We need to be able to adjust the comparison of a cgroup (or
> > > > subtree) when compared to other cgroups.
> > > >
> > > > I've also suggested, but did not implement in my patchset
> > > > because I was trying to define the API and find common ground
> > > > first, that we have a need for priority based selection. In
> > > > other words, define the priority of a subtree regardless of
> > > > cgroup usage.
> > > >
> > > > So with these four things, we have
> > > >
> > > >  - an "oom.policy" tunable to define "cgroup" or "process" for
> > > >    that subtree (and plans for "priority" in the future),
> > > >
> > > >  - your "oom.evaluate_as_group" tunable to account the usage of
> > > >    the subtree as the cgroup's own usage for comparison with
> > > >    others,
> > > >
> > > >  - an
cgroup aware oom killer (was Re: [PATCH 0/3] introduce memory.oom.group)
Roman, have you had time to go through this?

On Tue, 7 Aug 2018, David Rientjes wrote:

> On Mon, 6 Aug 2018, Roman Gushchin wrote:
>
> > > In a cgroup-aware oom killer world, yes, we need the ability to
> > > specify that the usage of the entire subtree should be compared as
> > > a single entity with other cgroups. That is necessary for user
> > > subtrees but may not be necessary for top-level cgroups depending
> > > on how you structure your unified cgroup hierarchy. So it needs to
> > > be configurable, as you suggest, and you are correct it can be
> > > different than oom.group.
> > >
> > > That's not the only thing we need though, as I'm sure you were
> > > expecting me to say :)
> > >
> > > We need the ability to preserve existing behavior, i.e. process
> > > based and not cgroup aware, for subtrees so that our users who have
> > > clear expectations and tune their oom_score_adj accordingly based
> > > on how the oom killer has always chosen processes for oom kill do
> > > not suddenly regress.
> >
> > Isn't the combination of oom.group=0 and oom.evaluate_together=1
> > describing this case? This basically means that if memcg is selected
> > as target, the process inside will be selected using traditional
> > per-process approach.
> >
>
> No, that would overload the policy and mechanism. We want the ability
> to consider user-controlled subtrees as a single entity for comparison
> with other user subtrees to select which subtree to target. This does
> not imply that users want their entire subtree oom killed.
>
> > > So we need to define the policy for a subtree that is oom, and I
> > > suggest we do that as a characteristic of the cgroup that is oom
> > > ("process" vs "cgroup", and process would be the default to
> > > preserve what currently happens in a user subtree).
> >
> > I'm not entirely convinced here.
> > I do agree, that some sub-tree may have a well tuned oom_score_adj,
> > and it's preferable to keep the current behavior.
> >
> > At the same time I don't like the idea to look at the policy of the
> > OOMing cgroup. Why exceeding of one limit should be handled different
> > to exceeding of another? This seems to be a property of workload,
> > not a limit.
> >
>
> The limit is the property of the mem cgroup, so it's logical that the
> policy when reaching that limit is a property of the same mem cgroup.
> Using the user-controlled subtree example, if we have /david and
> /roman, we can define our own policies on oom, we are not restricted
> to cgroup aware selection on the entire hierarchy. /david/oom.policy
> can be "process" so that I haven't regressed with earlier kernels, and
> /roman/oom.policy can be "cgroup" to target the largest cgroup in your
> subtree.
>
> Something needs to be oom killed when a mem cgroup at any level in the
> hierarchy is reached and reclaim has failed. What to do when that
> limit is reached is a property of that cgroup.
>
> > > Now, as users who rely on process selection are well aware, we
> > > have oom_score_adj to influence the decision of which process to
> > > oom kill. If our oom subtree is cgroup aware, we should have the
> > > ability to likewise influence that decision. For example, we have
> > > high priority applications that run at the top-level that use a
> > > lot of memory and strictly oom killing them in all scenarios
> > > because they use a lot of memory isn't appropriate. We need to be
> > > able to adjust the comparison of a cgroup (or subtree) when
> > > compared to other cgroups.
> > >
> > > I've also suggested, but did not implement in my patchset because
> > > I was trying to define the API and find common ground first, that
> > > we have a need for priority based selection. In other words,
> > > define the priority of a subtree regardless of cgroup usage.
> > >
> > > So with these four things, we have
> > >
> > >  - an "oom.policy" tunable to define "cgroup" or "process" for that
> > >    subtree (and plans for "priority" in the future),
> > >
> > >  - your "oom.evaluate_as_group" tunable to account the usage of the
> > >    subtree as the cgroup's own usage for comparison with others,
> > >
> > >  - an "oom.adj" to adjust the usage of the cgroup (local or subtree)
> > >    to protect important applications and bias against unimportant
> > >    applications.
> > >
> > > This adds several tunables, which I didn't like, so I tried to
> > > overload oom.policy and oom.evaluate_as_group. When I referred to
> > > separating out the subtree usage accounting into a separate
> > > tunable, that is what I have referenced above.
> >
> > IMO, merging multiple tunables into one doesn't make it saner.
> > The real question how to make a reasonable interface with fewer
> > tunables.
> >
> > The reason behind introducing all these knobs is to provide
> > a generic solution to define OOM handling rules, but then the
> > question raises if the kernel is the best place for it.
> >
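
To make the proposal in the message above easier to follow, here is a toy userspace model of the selection logic it describes. It is only a sketch under stated assumptions, not kernel code and not taken from either patchset: the class and function names are hypothetical, "oom.policy", "oom.evaluate_as_group" and "oom.adj" are the tunables as proposed in this thread, oom.adj is assumed to be added directly to the usage being compared, and the per-process score is a simplified stand-in for the kernel's badness heuristic.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Proc:
    name: str
    rss: int                 # memory in use, arbitrary units
    oom_score_adj: int = 0   # -1000..1000, as in /proc/<pid>/oom_score_adj


@dataclass
class Cgroup:
    name: str
    procs: List[Proc] = field(default_factory=list)
    children: List["Cgroup"] = field(default_factory=list)
    policy: str = "process"          # proposed oom.policy: "process" or "cgroup"
    evaluate_as_group: bool = False  # proposed oom.evaluate_as_group
    adj: int = 0                     # proposed oom.adj (assumed: added to usage)

    def subtree_procs(self) -> List[Proc]:
        found = list(self.procs)
        for child in self.children:
            found.extend(child.subtree_procs())
        return found

    def subtree_usage(self) -> int:
        return sum(p.rss for p in self.subtree_procs())


def pick_process(cg: Cgroup) -> Optional[Proc]:
    """Per-process selection inside cg's subtree (simplified scoring)."""
    total = cg.subtree_usage()
    best, best_score = None, None
    for p in cg.subtree_procs():
        if p.oom_score_adj == -1000:   # -1000 exempts a task from oom kill
            continue
        score = p.rss + p.oom_score_adj * total // 1000
        if best_score is None or score > best_score:
            best, best_score = p, score
    return best


def candidates(cg: Cgroup) -> List[Cgroup]:
    """Cgroups below cg that compete with each other as single entities."""
    out = []
    for child in cg.children:
        if child.evaluate_as_group or not child.children:
            out.append(child)          # compared by its whole subtree usage
        else:
            out.extend(candidates(child))
    return out


def select_victim(oom_cg: Cgroup) -> Tuple[Cgroup, Optional[Proc]]:
    """The policy is read from the cgroup whose limit was reached."""
    if oom_cg.policy == "cgroup":
        cands = candidates(oom_cg)
        if cands:
            target = max(cands, key=lambda c: c.subtree_usage() + c.adj)
            # Choosing a cgroup does not imply killing all of it; one
            # process inside the target is still picked per-process.
            return target, pick_process(target)
    return oom_cg, pick_process(oom_cg)


# /david keeps process-based selection, /roman opts into cgroup-aware selection.
david = Cgroup("/david", policy="process", children=[
    Cgroup("/david/x", procs=[Proc("big", 700, oom_score_adj=-500)]),
    Cgroup("/david/y", procs=[Proc("small", 400)]),
])
roman = Cgroup("/roman", policy="cgroup", children=[
    Cgroup("/roman/batch", procs=[Proc("a", 300), Proc("b", 200)]),
    Cgroup("/roman/db", procs=[Proc("db", 600)], adj=-400),
])

for cg in (david, roman):
    target, victim = select_victim(cg)
    print(cg.name, "->", target.name, victim.name if victim else None)
```

In this model the tuned oom_score_adj in /david still decides the outcome, while in /roman the negative oom.adj on /roman/db shifts the choice to /roman/batch even though /roman/db uses more memory. Whether such rules belong in the kernel at all is the open question raised at the end of the quoted message.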