subject:"cgroup aware oom killer \(was Re\: \[PATCH 0\/3\] introduce memory.oom.group\)"

Re: cgroup aware oom killer (was Re: [PATCH 0/3] introduce memory.oom.group)

2018-08-20 Thread Roman Gushchin

On Sun, Aug 19, 2018 at 04:26:50PM -0700, David Rientjes wrote:
> Roman, have you had time to go through this?

Hm, I thought we've finished this part of discussion, no?
Anyway, let me repeat my position: I don't like the interface
you've proposed in that follow-up patchset, and I explained why.
If you've a new proposal, please, rebase it to the current
mm tree, and we can discuss it separately.
Alternatively, we can discuss the interface first (without
the implementation), but, please, make a new thread with a
fresh description of a proposed interface.

Thanks!

> 
> 
> On Tue, 7 Aug 2018, David Rientjes wrote:
> 
> > On Mon, 6 Aug 2018, Roman Gushchin wrote:
> > 
> > > > In a cgroup-aware oom killer world, yes, we need the ability to specify 
> > > > that the usage of the entire subtree should be compared as a single 
> > > > entity with other cgroups.  That is necessary for user subtrees but may 
> > > > not be necessary for top-level cgroups depending on how you structure 
> > > > your 
> > > > unified cgroup hierarchy.  So it needs to be configurable, as you 
> > > > suggest, 
> > > > and you are correct it can be different than oom.group.
> > > > 
> > > > That's not the only thing we need though, as I'm sure you were 
> > > > expecting 
> > > > me to say :)
> > > > 
> > > > We need the ability to preserve existing behavior, i.e. process based 
> > > > and 
> > > > not cgroup aware, for subtrees so that our users who have clear 
> > > > expectations and tune their oom_score_adj accordingly based on how the 
> > > > oom 
> > > > killer has always chosen processes for oom kill do not suddenly regress.
> > > 
> > > Isn't the combination of oom.group=0 and oom.evaluate_together=1 
> > > describing
> > > this case? This basically means that if memcg is selected as target,
> > > the process inside will be selected using traditional per-process 
> > > approach.
> > > 
> > 
> > No, that would overload the policy and mechanism.  We want the ability to 
> > consider user-controlled subtrees as a single entity for comparison with 
> > other user subtrees to select which subtree to target.  This does not 
> > imply that users want their entire subtree oom killed.
> > 
> > > > So we need to define the policy for a subtree that is oom, and I 
> > > > suggest 
> > > > we do that as a characteristic of the cgroup that is oom ("process" vs 
> > > > "cgroup", and process would be the default to preserve what currently 
> > > > happens in a user subtree).
> > > 
> > > I'm not entirely convinced here.
> > > I do agree, that some sub-tree may have a well tuned oom_score_adj,
> > > and it's preferable to keep the current behavior.
> > > 
> > > At the same time I don't like the idea to look at the policy of the OOMing
> > > cgroup. Why exceeding of one limit should be handled different to 
> > > exceeding
> > > of another? This seems to be a property of workload, not a limit.
> > > 
> > 
> > The limit is the property of the mem cgroup, so it's logical that the 
> > policy when reaching that limit is a property of the same mem cgroup.
> > Using the user-controlled subtree example, if we have /david and /roman, 
> > we can define our own policies on oom, we are not restricted to cgroup 
> > aware selection on the entire hierarchy.  /david/oom.policy can be 
> > "process" so that I haven't regressed with earlier kernels, and 
> > /roman/oom.policy can be "cgroup" to target the largest cgroup in your 
> > subtree.
> > 
> > Something needs to be oom killed when a mem cgroup at any level in the 
> > hierarchy is reached and reclaim has failed.  What to do when that limit 
> > is reached is a property of that cgroup.
> > 
> > > > Now, as users who rely on process selection are well aware, we have 
> > > > oom_score_adj to influence the decision of which process to oom kill.  
> > > > If 
> > > > our oom subtree is cgroup aware, we should have the ability to likewise 
> > > > influence that decision.  For example, we have high priority 
> > > > applications 
> > > > that run at the top-level that use a lot of memory and strictly oom 
> > > > killing them in all scenarios because they use a lot of memory isn't 
> > > > appropriate.  We need to be able to adjust the comparison of a cgroup 
> > > > (or 
> > > > subtree) when compared to other cgroups.
> > > > 
> > > > I've also suggested, but did not implement in my patchset because I was 
> > > > trying to define the API and find common ground first, that we have a 
> > > > need 
> > > > for priority based selection.  In other words, define the priority of a 
> > > > subtree regardless of cgroup usage.
> > > > 
> > > > So with these four things, we have
> > > > 
> > > >  - an "oom.policy" tunable to define "cgroup" or "process" for that 
> > > >subtree (and plans for "priority" in the future),
> > > > 
> > > >  - your "oom.evaluate_as_group" tunable to account the usage of the
> > > >subtree as the cgroup's own usage for comparison with others,
> > > > 
> > > >  - an

Re: cgroup aware oom killer (was Re: [PATCH 0/3] introduce memory.oom.group)

2018-08-20 Thread Roman Gushchin

On Sun, Aug 19, 2018 at 04:26:50PM -0700, David Rientjes wrote:
> Roman, have you had time to go through this?

Hm, I thought we've finished this part of discussion, no?
Anyway, let me repeat my position: I don't like the interface
you've proposed in that follow-up patchset, and I explained why.
If you've a new proposal, please, rebase it to the current
mm tree, and we can discuss it separately.
Alternatively, we can discuss the interface first (without
the implementation), but, please, make a new thread with a
fresh description of a proposed interface.

Thanks!

> 
> 
> On Tue, 7 Aug 2018, David Rientjes wrote:
> 
> > On Mon, 6 Aug 2018, Roman Gushchin wrote:
> > 
> > > > In a cgroup-aware oom killer world, yes, we need the ability to specify 
> > > > that the usage of the entire subtree should be compared as a single 
> > > > entity with other cgroups.  That is necessary for user subtrees but may 
> > > > not be necessary for top-level cgroups depending on how you structure 
> > > > your 
> > > > unified cgroup hierarchy.  So it needs to be configurable, as you 
> > > > suggest, 
> > > > and you are correct it can be different than oom.group.
> > > > 
> > > > That's not the only thing we need though, as I'm sure you were 
> > > > expecting 
> > > > me to say :)
> > > > 
> > > > We need the ability to preserve existing behavior, i.e. process based 
> > > > and 
> > > > not cgroup aware, for subtrees so that our users who have clear 
> > > > expectations and tune their oom_score_adj accordingly based on how the 
> > > > oom 
> > > > killer has always chosen processes for oom kill do not suddenly regress.
> > > 
> > > Isn't the combination of oom.group=0 and oom.evaluate_together=1 
> > > describing
> > > this case? This basically means that if memcg is selected as target,
> > > the process inside will be selected using traditional per-process 
> > > approach.
> > > 
> > 
> > No, that would overload the policy and mechanism.  We want the ability to 
> > consider user-controlled subtrees as a single entity for comparison with 
> > other user subtrees to select which subtree to target.  This does not 
> > imply that users want their entire subtree oom killed.
> > 
> > > > So we need to define the policy for a subtree that is oom, and I 
> > > > suggest 
> > > > we do that as a characteristic of the cgroup that is oom ("process" vs 
> > > > "cgroup", and process would be the default to preserve what currently 
> > > > happens in a user subtree).
> > > 
> > > I'm not entirely convinced here.
> > > I do agree, that some sub-tree may have a well tuned oom_score_adj,
> > > and it's preferable to keep the current behavior.
> > > 
> > > At the same time I don't like the idea to look at the policy of the OOMing
> > > cgroup. Why exceeding of one limit should be handled different to 
> > > exceeding
> > > of another? This seems to be a property of workload, not a limit.
> > > 
> > 
> > The limit is the property of the mem cgroup, so it's logical that the 
> > policy when reaching that limit is a property of the same mem cgroup.
> > Using the user-controlled subtree example, if we have /david and /roman, 
> > we can define our own policies on oom, we are not restricted to cgroup 
> > aware selection on the entire hierarchy.  /david/oom.policy can be 
> > "process" so that I haven't regressed with earlier kernels, and 
> > /roman/oom.policy can be "cgroup" to target the largest cgroup in your 
> > subtree.
> > 
> > Something needs to be oom killed when a mem cgroup at any level in the 
> > hierarchy is reached and reclaim has failed.  What to do when that limit 
> > is reached is a property of that cgroup.
> > 
> > > > Now, as users who rely on process selection are well aware, we have 
> > > > oom_score_adj to influence the decision of which process to oom kill.  
> > > > If 
> > > > our oom subtree is cgroup aware, we should have the ability to likewise 
> > > > influence that decision.  For example, we have high priority 
> > > > applications 
> > > > that run at the top-level that use a lot of memory and strictly oom 
> > > > killing them in all scenarios because they use a lot of memory isn't 
> > > > appropriate.  We need to be able to adjust the comparison of a cgroup 
> > > > (or 
> > > > subtree) when compared to other cgroups.
> > > > 
> > > > I've also suggested, but did not implement in my patchset because I was 
> > > > trying to define the API and find common ground first, that we have a 
> > > > need 
> > > > for priority based selection.  In other words, define the priority of a 
> > > > subtree regardless of cgroup usage.
> > > > 
> > > > So with these four things, we have
> > > > 
> > > >  - an "oom.policy" tunable to define "cgroup" or "process" for that 
> > > >subtree (and plans for "priority" in the future),
> > > > 
> > > >  - your "oom.evaluate_as_group" tunable to account the usage of the
> > > >subtree as the cgroup's own usage for comparison with others,
> > > > 
> > > >  - an

cgroup aware oom killer (was Re: [PATCH 0/3] introduce memory.oom.group)

2018-08-19 Thread David Rientjes

Roman, have you had time to go through this?


On Tue, 7 Aug 2018, David Rientjes wrote:

> On Mon, 6 Aug 2018, Roman Gushchin wrote:
> 
> > > In a cgroup-aware oom killer world, yes, we need the ability to specify 
> > > that the usage of the entire subtree should be compared as a single 
> > > entity with other cgroups.  That is necessary for user subtrees but may 
> > > not be necessary for top-level cgroups depending on how you structure 
> > > your 
> > > unified cgroup hierarchy.  So it needs to be configurable, as you 
> > > suggest, 
> > > and you are correct it can be different than oom.group.
> > > 
> > > That's not the only thing we need though, as I'm sure you were expecting 
> > > me to say :)
> > > 
> > > We need the ability to preserve existing behavior, i.e. process based and 
> > > not cgroup aware, for subtrees so that our users who have clear 
> > > expectations and tune their oom_score_adj accordingly based on how the 
> > > oom 
> > > killer has always chosen processes for oom kill do not suddenly regress.
> > 
> > Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing
> > this case? This basically means that if memcg is selected as target,
> > the process inside will be selected using traditional per-process approach.
> > 
> 
> No, that would overload the policy and mechanism.  We want the ability to 
> consider user-controlled subtrees as a single entity for comparison with 
> other user subtrees to select which subtree to target.  This does not 
> imply that users want their entire subtree oom killed.
> 
> > > So we need to define the policy for a subtree that is oom, and I suggest 
> > > we do that as a characteristic of the cgroup that is oom ("process" vs 
> > > "cgroup", and process would be the default to preserve what currently 
> > > happens in a user subtree).
> > 
> > I'm not entirely convinced here.
> > I do agree, that some sub-tree may have a well tuned oom_score_adj,
> > and it's preferable to keep the current behavior.
> > 
> > At the same time I don't like the idea to look at the policy of the OOMing
> > cgroup. Why exceeding of one limit should be handled different to exceeding
> > of another? This seems to be a property of workload, not a limit.
> > 
> 
> The limit is the property of the mem cgroup, so it's logical that the 
> policy when reaching that limit is a property of the same mem cgroup.
> Using the user-controlled subtree example, if we have /david and /roman, 
> we can define our own policies on oom, we are not restricted to cgroup 
> aware selection on the entire hierarchy.  /david/oom.policy can be 
> "process" so that I haven't regressed with earlier kernels, and 
> /roman/oom.policy can be "cgroup" to target the largest cgroup in your 
> subtree.
> 
> Something needs to be oom killed when a mem cgroup at any level in the 
> hierarchy is reached and reclaim has failed.  What to do when that limit 
> is reached is a property of that cgroup.
> 
> > > Now, as users who rely on process selection are well aware, we have 
> > > oom_score_adj to influence the decision of which process to oom kill.  If 
> > > our oom subtree is cgroup aware, we should have the ability to likewise 
> > > influence that decision.  For example, we have high priority applications 
> > > that run at the top-level that use a lot of memory and strictly oom 
> > > killing them in all scenarios because they use a lot of memory isn't 
> > > appropriate.  We need to be able to adjust the comparison of a cgroup (or 
> > > subtree) when compared to other cgroups.
> > > 
> > > I've also suggested, but did not implement in my patchset because I was 
> > > trying to define the API and find common ground first, that we have a 
> > > need 
> > > for priority based selection.  In other words, define the priority of a 
> > > subtree regardless of cgroup usage.
> > > 
> > > So with these four things, we have
> > > 
> > >  - an "oom.policy" tunable to define "cgroup" or "process" for that 
> > >subtree (and plans for "priority" in the future),
> > > 
> > >  - your "oom.evaluate_as_group" tunable to account the usage of the
> > >subtree as the cgroup's own usage for comparison with others,
> > > 
> > >  - an "oom.adj" to adjust the usage of the cgroup (local or subtree)
> > >to protect important applications and bias against unimportant
> > >applications.
> > > 
> > > This adds several tunables, which I didn't like, so I tried to overload 
> > > oom.policy and oom.evaluate_as_group.  When I referred to separating out 
> > > the subtree usage accounting into a separate tunable, that is what I have 
> > > referenced above.
> > 
> > IMO, merging multiple tunables into one doesn't make it saner.
> > The real question how to make a reasonable interface with fever tunables.
> > 
> > The reason behind introducing all these knobs is to provide
> > a generic solution to define OOM handling rules, but then the
> > question raises if the kernel is the best place for it.
> >

cgroup aware oom killer (was Re: [PATCH 0/3] introduce memory.oom.group)

2018-08-19 Thread David Rientjes

Roman, have you had time to go through this?


On Tue, 7 Aug 2018, David Rientjes wrote:

> On Mon, 6 Aug 2018, Roman Gushchin wrote:
> 
> > > In a cgroup-aware oom killer world, yes, we need the ability to specify 
> > > that the usage of the entire subtree should be compared as a single 
> > > entity with other cgroups.  That is necessary for user subtrees but may 
> > > not be necessary for top-level cgroups depending on how you structure 
> > > your 
> > > unified cgroup hierarchy.  So it needs to be configurable, as you 
> > > suggest, 
> > > and you are correct it can be different than oom.group.
> > > 
> > > That's not the only thing we need though, as I'm sure you were expecting 
> > > me to say :)
> > > 
> > > We need the ability to preserve existing behavior, i.e. process based and 
> > > not cgroup aware, for subtrees so that our users who have clear 
> > > expectations and tune their oom_score_adj accordingly based on how the 
> > > oom 
> > > killer has always chosen processes for oom kill do not suddenly regress.
> > 
> > Isn't the combination of oom.group=0 and oom.evaluate_together=1 describing
> > this case? This basically means that if memcg is selected as target,
> > the process inside will be selected using traditional per-process approach.
> > 
> 
> No, that would overload the policy and mechanism.  We want the ability to 
> consider user-controlled subtrees as a single entity for comparison with 
> other user subtrees to select which subtree to target.  This does not 
> imply that users want their entire subtree oom killed.
> 
> > > So we need to define the policy for a subtree that is oom, and I suggest 
> > > we do that as a characteristic of the cgroup that is oom ("process" vs 
> > > "cgroup", and process would be the default to preserve what currently 
> > > happens in a user subtree).
> > 
> > I'm not entirely convinced here.
> > I do agree, that some sub-tree may have a well tuned oom_score_adj,
> > and it's preferable to keep the current behavior.
> > 
> > At the same time I don't like the idea to look at the policy of the OOMing
> > cgroup. Why exceeding of one limit should be handled different to exceeding
> > of another? This seems to be a property of workload, not a limit.
> > 
> 
> The limit is the property of the mem cgroup, so it's logical that the 
> policy when reaching that limit is a property of the same mem cgroup.
> Using the user-controlled subtree example, if we have /david and /roman, 
> we can define our own policies on oom, we are not restricted to cgroup 
> aware selection on the entire hierarchy.  /david/oom.policy can be 
> "process" so that I haven't regressed with earlier kernels, and 
> /roman/oom.policy can be "cgroup" to target the largest cgroup in your 
> subtree.
> 
> Something needs to be oom killed when a mem cgroup at any level in the 
> hierarchy is reached and reclaim has failed.  What to do when that limit 
> is reached is a property of that cgroup.
> 
> > > Now, as users who rely on process selection are well aware, we have 
> > > oom_score_adj to influence the decision of which process to oom kill.  If 
> > > our oom subtree is cgroup aware, we should have the ability to likewise 
> > > influence that decision.  For example, we have high priority applications 
> > > that run at the top-level that use a lot of memory and strictly oom 
> > > killing them in all scenarios because they use a lot of memory isn't 
> > > appropriate.  We need to be able to adjust the comparison of a cgroup (or 
> > > subtree) when compared to other cgroups.
> > > 
> > > I've also suggested, but did not implement in my patchset because I was 
> > > trying to define the API and find common ground first, that we have a 
> > > need 
> > > for priority based selection.  In other words, define the priority of a 
> > > subtree regardless of cgroup usage.
> > > 
> > > So with these four things, we have
> > > 
> > >  - an "oom.policy" tunable to define "cgroup" or "process" for that 
> > >subtree (and plans for "priority" in the future),
> > > 
> > >  - your "oom.evaluate_as_group" tunable to account the usage of the
> > >subtree as the cgroup's own usage for comparison with others,
> > > 
> > >  - an "oom.adj" to adjust the usage of the cgroup (local or subtree)
> > >to protect important applications and bias against unimportant
> > >applications.
> > > 
> > > This adds several tunables, which I didn't like, so I tried to overload 
> > > oom.policy and oom.evaluate_as_group.  When I referred to separating out 
> > > the subtree usage accounting into a separate tunable, that is what I have 
> > > referenced above.
> > 
> > IMO, merging multiple tunables into one doesn't make it saner.
> > The real question how to make a reasonable interface with fever tunables.
> > 
> > The reason behind introducing all these knobs is to provide
> > a generic solution to define OOM handling rules, but then the
> > question raises if the kernel is the best place for it.
> >

Re: cgroup aware oom killer (was Re: [PATCH 0/3] introduce memory.oom.group)

Re: cgroup aware oom killer (was Re: [PATCH 0/3] introduce memory.oom.group)

cgroup aware oom killer (was Re: [PATCH 0/3] introduce memory.oom.group)

cgroup aware oom killer (was Re: [PATCH 0/3] introduce memory.oom.group)

4 matches

Site Navigation

Mail list logo

Footer information