On Fri, 26 Jan 2018 14:52:59 -0800 (PST) David Rientjes <rient...@google.com> 
wrote:

> On Fri, 26 Jan 2018, Andrew Morton wrote:
> 
> > > -ECONFUSED.  We want to have a mount option that has the sole purpose of 
> > > doing echo cgroup > /mnt/cgroup/memory.oom_policy?
> > 
> > Approximately.  Let me put it another way: can we modify your patchset
> > so that the mount option remains, and continues to have a sufficiently
> > same effect?  For backward compatibility.
> > 
> 
> The mount option would exist solely to set the oom policy of the root mem 
> cgroup, it would lose its effect of mandating that policy for any subtree 
> since it would become configurable by the user if delegated.

Why can't we propagate the mount option into the subtrees?

If the user then alters that behaviour with new added-by-David tunables
then fine, that's still backward compatible.

> Let me put it another way: if the cgroup aware oom killer is merged for 
> 4.16 without this patchset, userspace can reasonably infer the oom policy 
> from checking how cgroups were mounted.  If my followup patchset were 
> merged for 4.17, that's invalid and it becomes dependent on kernel 
> version: it could have the "groupoom" mount option but configured through 
> the root mem cgroup's memory.oom_policy to not be cgroup aware at all.

That concern seems unreasonable to me.  Is an application *really*
going to peek at the mount options to figure out what its present oom
policy is?  Well, maybe.  But that's a pretty dopey thing to do and I
wouldn't lose much sleep over breaking any such application in the very
unlikely case that such a thing was developed in that two-month window.

If that's really a concern then let's add (to Roman's patchset) a
proper interface for an application to query its own oom policy.

> That inconsistency, to me, is unfair to burden userspace with.
> 
> > > This, and fixes to fairly compare the root mem cgroup with leaf mem 
> > > cgroups, are essential before the feature is merged otherwise it yields 
> > > wildly unpredictable (and unexpected, since its interaction with 
> > > oom_score_adj isn't documented) results as I already demonstrated where 
> > > cgroups with 1GB of usage are killed instead of 6GB workers outside of 
> > > that subtree.
> > 
> > OK, so Roman's new feature is incomplete: it satisfies some use cases
> > but not others.  And we kinda have a plan to address the other use
> > cases in the future.
> > 
> 
> Those use cases are also undocumented such that the user doesn't know the 
> behavior they are opting into.  Nowhere in the patchset does it mention 
> anything about oom_score_adj other than being oom disabled.  It doesn't 
> mention that a per-process tunable now depends strictly on whether it is 
> attached to root or not.  It specifies a fair comparison between the root 
> mem cgroup and leaf mem cgroups, which is obviously incorrect by the 
> implementation itself.  So I'm not sure the user would know which use 
> cases it is valid for, which is why I've been trying to make it generally 
> purposeful and documented.

Documentation patches are nice.  We can cc:stable them too, so no huge
hurry.

> > There's nothing wrong with that!  As long as we don't break existing
> > setups while evolving the feature.  How do we do that?
> > 
> 
> We'd break the setups that actually configure their cgroups and processes 
> to abide by the current implementation since we'd need to discount 
> oom_score_adj from the the root mem cgroup usage to fix it.

Am having trouble understanding that.  Expand, please?

Can we address this (and other such) issues in the (interim)
documentation?

Reply via email to