Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-29 Thread Michal Hocko
On Fri 29-05-15 11:23:28, Tejun Heo wrote:
> Hello,
> 
> On Fri, May 29, 2015 at 04:57:39PM +0200, Michal Hocko wrote:
[...]
> > OK so you creat a task A (leader) which clones several tasks Pn with
> > CLONE_VM without CLONE_THREAD. Moving A around would control memcg
> > membership while Pn could be moved around freely to control membership
> > in other controllers (e.g. cpu to control shares). So it is something
> > like moving threads separately.
> 
> Sure, it'd behave clearly in certain cases but then again you'd have
> cases where how mm->owner changes isn't clear at all when seen from
> the userland. 

Sure. I am definitely _not_ advocating this use case! As said before, I
consider it abuse. It is just fair to point out this is a user visible
change IMO.

> e.g. When the original owner goes away, the assignment
> of the next owner is essentially arbitrary.  That's what I meant by
> saying it was already a crapshoot.  We should definitely document the
> change but this isn't likely to be an issue.  CLONE_VM &&
> !CLONE_THREAD is an extreme corner case to begin with and even the
> behavior there wasn't all that clearly defined.

That is the line of argumentation in my changelog ;)

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-29 Thread Tejun Heo
Hello,

On Fri, May 29, 2015 at 04:57:39PM +0200, Michal Hocko wrote:
> > > "
> > > It also allows several control groups that are virtually grouped by
> > > mm_struct, to exist independent of the memory controller i.e., without
> > > adding mem_cgroup's for each controller, to mm_struct.
> > > "
> > > suggests it might have been intentional. That being said, I think it was
> > 
> > I think he's talking about implmenting different controllers which may
> > want to add their own css pointer in mm_struct now wouldn't need to as
> > the mm is tagged with the owning task from which membership of all
> > controllers can be derived.  I don't think that's something we need to
> > worry about.  We haven't seen even a suggestion for such a controller
> > and even if that happens we'd be better off adding a separate field
> > for the new controller.
> 
> Maybe I've just misunderstood. My understandig was that tasks sharing
> the mm could live in different cgroups while the memory would be bound
> by a shared memcg.

Hmm it specifically goes into explaining that it's about having
different controllers sharing the owner field.

 "i.e., without adding mem_cgroup's for each controller, to mm_struct."

It seems fairly clear to me.

> > I'm a bit lost on what's cleared defined is actually changing.  It's
> > not like userland had firm control over mm->owner.  It was already a
> > crapshoot, no?
> 
> OK so you creat a task A (leader) which clones several tasks Pn with
> CLONE_VM without CLONE_THREAD. Moving A around would control memcg
> membership while Pn could be moved around freely to control membership
> in other controllers (e.g. cpu to control shares). So it is something
> like moving threads separately.

Sure, it'd behave clearly in certain cases but then again you'd have
cases where how mm->owner changes isn't clear at all when seen from
the userland.  e.g. When the original owner goes away, the assignment
of the next owner is essentially arbitrary.  That's what I meant by
saying it was already a crapshoot.  We should definitely document the
change but this isn't likely to be an issue.  CLONE_VM &&
!CLONE_THREAD is an extreme corner case to begin with and even the
behavior there wasn't all that clearly defined.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-29 Thread Michal Hocko
On Fri 29-05-15 10:07:37, Tejun Heo wrote:
> Hello,
> 
> On Fri, May 29, 2015 at 03:45:53PM +0200, Michal Hocko wrote:
> > Sure but we are talking about processes here. They just happen to share
> > mm. And this is exactly the behavior change I am talking about... With
> 
> Are we talking about CLONE_VM w/o CLONE_THREAD?  ie. two threadgroups
> sharing the same VM?

yes.

> > the owner you could emulate "threads" with this patch you cannot
> > anymore. IMO we shouldn't allow for that but just reading the original
> > commit message (cf475ad28ac35) which has added mm->owner:
> > "
> > It also allows several control groups that are virtually grouped by
> > mm_struct, to exist independent of the memory controller i.e., without
> > adding mem_cgroup's for each controller, to mm_struct.
> > "
> > suggests it might have been intentional. That being said, I think it was
> 
> I think he's talking about implmenting different controllers which may
> want to add their own css pointer in mm_struct now wouldn't need to as
> the mm is tagged with the owning task from which membership of all
> controllers can be derived.  I don't think that's something we need to
> worry about.  We haven't seen even a suggestion for such a controller
> and even if that happens we'd be better off adding a separate field
> for the new controller.

Maybe I've just misunderstood. My understandig was that tasks sharing
the mm could live in different cgroups while the memory would be bound
by a shared memcg.

> > a mistake back at the time and we should move on to a saner model. But I
> > also believe we should be really vocal when the user visible behavior
> > changes. If somebody really asks for the previous behavior I would
> > insist on a _strong_ usecase.
> 
> I'm a bit lost on what's cleared defined is actually changing.  It's
> not like userland had firm control over mm->owner.  It was already a
> crapshoot, no?

OK so you creat a task A (leader) which clones several tasks Pn with
CLONE_VM without CLONE_THREAD. Moving A around would control memcg
membership while Pn could be moved around freely to control membership
in other controllers (e.g. cpu to control shares). So it is something
like moving threads separately.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-29 Thread Tejun Heo
Hello,

On Fri, May 29, 2015 at 03:45:53PM +0200, Michal Hocko wrote:
> Sure but we are talking about processes here. They just happen to share
> mm. And this is exactly the behavior change I am talking about... With

Are we talking about CLONE_VM w/o CLONE_THREAD?  ie. two threadgroups
sharing the same VM?

> the owner you could emulate "threads" with this patch you cannot
> anymore. IMO we shouldn't allow for that but just reading the original
> commit message (cf475ad28ac35) which has added mm->owner:
> "
> It also allows several control groups that are virtually grouped by
> mm_struct, to exist independent of the memory controller i.e., without
> adding mem_cgroup's for each controller, to mm_struct.
> "
> suggests it might have been intentional. That being said, I think it was

I think he's talking about implmenting different controllers which may
want to add their own css pointer in mm_struct now wouldn't need to as
the mm is tagged with the owning task from which membership of all
controllers can be derived.  I don't think that's something we need to
worry about.  We haven't seen even a suggestion for such a controller
and even if that happens we'd be better off adding a separate field
for the new controller.

> a mistake back at the time and we should move on to a saner model. But I
> also believe we should be really vocal when the user visible behavior
> changes. If somebody really asks for the previous behavior I would
> insist on a _strong_ usecase.

I'm a bit lost on what's cleared defined is actually changing.  It's
not like userland had firm control over mm->owner.  It was already a
crapshoot, no?

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-29 Thread Michal Hocko
On Fri 29-05-15 09:10:55, Tejun Heo wrote:
> On Fri, May 29, 2015 at 02:08:38PM +0200, Michal Hocko wrote:
> > > I suppose that making mm always follow the threadgroup leader should
> > > be fine, right? 
> > 
> > That is the plan.
> 
> Cool.
> 
> > > While this wouldn't make any difference in the unified hierarchy,
> > 
> > Just to make sure I understand. "wouldn't make any difference" because
> > the API is not backward compatible right?
> 
> Hmm... because it's always per-process.  If any thread is going, the
> whole process is going together.

Sure but we are talking about processes here. They just happen to share
mm. And this is exactly the behavior change I am talking about... With
the owner you could emulate "threads" with this patch you cannot
anymore. IMO we shouldn't allow for that but just reading the original
commit message (cf475ad28ac35) which has added mm->owner:
"
It also allows several control groups that are virtually grouped by
mm_struct, to exist independent of the memory controller i.e., without
adding mem_cgroup's for each controller, to mm_struct.
"
suggests it might have been intentional. That being said, I think it was
a mistake back at the time and we should move on to a saner model. But I
also believe we should be really vocal when the user visible behavior
changes. If somebody really asks for the previous behavior I would
insist on a _strong_ usecase.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-29 Thread Tejun Heo
On Fri, May 29, 2015 at 02:08:38PM +0200, Michal Hocko wrote:
> > I suppose that making mm always follow the threadgroup leader should
> > be fine, right? 
> 
> That is the plan.

Cool.

> > While this wouldn't make any difference in the unified hierarchy,
> 
> Just to make sure I understand. "wouldn't make any difference" because
> the API is not backward compatible right?

Hmm... because it's always per-process.  If any thread is going, the
whole process is going together.

> > I think this would make more sense for traditional hierarchies.
> 
> Yes I believe so.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-29 Thread Michal Hocko
On Thu 28-05-15 17:07:42, Tejun Heo wrote:
> Hello, Johannes, Michal.
> 
> On Tue, May 26, 2015 at 10:10:11AM -0400, Johannes Weiner wrote:
> > On Tue, May 26, 2015 at 01:50:06PM +0200, Michal Hocko wrote:
> > > Please note that this patch introduces a USER VISIBLE CHANGE OF BEHAVIOR.
> > > Without mm->owner _all_ tasks associated with the mm_struct would
> > > initiate memcg migration while previously only owner of the mm_struct
> > > could do that. The original behavior was awkward though because the user
> > > task didn't have any means to find out the current owner (esp. after
> > > mm_update_next_owner) so the migration behavior was not well defined
> > > in general.
> > > New cgroup API (unified hierarchy) will discontinue tasks file which
> > > means that migrating threads will no longer be possible. In such a case
> > > having CLONE_VM without CLONE_THREAD could emulate the thread behavior
> > > but this patch prevents from isolating memcg controllers from others.
> > > Nevertheless I am not convinced such a use case would really deserve
> > > complications on the memcg code side.
> > 
> > I think such a change is okay.  The memcg semantics of moving threads
> > with the same mm into separate groups have always been arbitrary.  No
> > reasonable behavior can be expected of this, so what sane real life
> > usecase would rely on it?
> 
> I suppose that making mm always follow the threadgroup leader should
> be fine, right? 

That is the plan.

> While this wouldn't make any difference in the unified hierarchy,

Just to make sure I understand. "wouldn't make any difference" because
the API is not backward compatible right?

> I think this would make more sense for traditional hierarchies.

Yes I believe so.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-29 Thread Michal Hocko
On Thu 28-05-15 17:07:42, Tejun Heo wrote:
 Hello, Johannes, Michal.
 
 On Tue, May 26, 2015 at 10:10:11AM -0400, Johannes Weiner wrote:
  On Tue, May 26, 2015 at 01:50:06PM +0200, Michal Hocko wrote:
   Please note that this patch introduces a USER VISIBLE CHANGE OF BEHAVIOR.
   Without mm-owner _all_ tasks associated with the mm_struct would
   initiate memcg migration while previously only owner of the mm_struct
   could do that. The original behavior was awkward though because the user
   task didn't have any means to find out the current owner (esp. after
   mm_update_next_owner) so the migration behavior was not well defined
   in general.
   New cgroup API (unified hierarchy) will discontinue tasks file which
   means that migrating threads will no longer be possible. In such a case
   having CLONE_VM without CLONE_THREAD could emulate the thread behavior
   but this patch prevents from isolating memcg controllers from others.
   Nevertheless I am not convinced such a use case would really deserve
   complications on the memcg code side.
  
  I think such a change is okay.  The memcg semantics of moving threads
  with the same mm into separate groups have always been arbitrary.  No
  reasonable behavior can be expected of this, so what sane real life
  usecase would rely on it?
 
 I suppose that making mm always follow the threadgroup leader should
 be fine, right? 

That is the plan.

 While this wouldn't make any difference in the unified hierarchy,

Just to make sure I understand. wouldn't make any difference because
the API is not backward compatible right?

 I think this would make more sense for traditional hierarchies.

Yes I believe so.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-29 Thread Tejun Heo
On Fri, May 29, 2015 at 02:08:38PM +0200, Michal Hocko wrote:
  I suppose that making mm always follow the threadgroup leader should
  be fine, right? 
 
 That is the plan.

Cool.

  While this wouldn't make any difference in the unified hierarchy,
 
 Just to make sure I understand. wouldn't make any difference because
 the API is not backward compatible right?

Hmm... because it's always per-process.  If any thread is going, the
whole process is going together.

  I think this would make more sense for traditional hierarchies.
 
 Yes I believe so.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-29 Thread Michal Hocko
On Fri 29-05-15 09:10:55, Tejun Heo wrote:
 On Fri, May 29, 2015 at 02:08:38PM +0200, Michal Hocko wrote:
   I suppose that making mm always follow the threadgroup leader should
   be fine, right? 
  
  That is the plan.
 
 Cool.
 
   While this wouldn't make any difference in the unified hierarchy,
  
  Just to make sure I understand. wouldn't make any difference because
  the API is not backward compatible right?
 
 Hmm... because it's always per-process.  If any thread is going, the
 whole process is going together.

Sure but we are talking about processes here. They just happen to share
mm. And this is exactly the behavior change I am talking about... With
the owner you could emulate threads with this patch you cannot
anymore. IMO we shouldn't allow for that but just reading the original
commit message (cf475ad28ac35) which has added mm-owner:

It also allows several control groups that are virtually grouped by
mm_struct, to exist independent of the memory controller i.e., without
adding mem_cgroup's for each controller, to mm_struct.

suggests it might have been intentional. That being said, I think it was
a mistake back at the time and we should move on to a saner model. But I
also believe we should be really vocal when the user visible behavior
changes. If somebody really asks for the previous behavior I would
insist on a _strong_ usecase.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-29 Thread Tejun Heo
Hello,

On Fri, May 29, 2015 at 03:45:53PM +0200, Michal Hocko wrote:
 Sure but we are talking about processes here. They just happen to share
 mm. And this is exactly the behavior change I am talking about... With

Are we talking about CLONE_VM w/o CLONE_THREAD?  ie. two threadgroups
sharing the same VM?

 the owner you could emulate threads with this patch you cannot
 anymore. IMO we shouldn't allow for that but just reading the original
 commit message (cf475ad28ac35) which has added mm-owner:
 
 It also allows several control groups that are virtually grouped by
 mm_struct, to exist independent of the memory controller i.e., without
 adding mem_cgroup's for each controller, to mm_struct.
 
 suggests it might have been intentional. That being said, I think it was

I think he's talking about implmenting different controllers which may
want to add their own css pointer in mm_struct now wouldn't need to as
the mm is tagged with the owning task from which membership of all
controllers can be derived.  I don't think that's something we need to
worry about.  We haven't seen even a suggestion for such a controller
and even if that happens we'd be better off adding a separate field
for the new controller.

 a mistake back at the time and we should move on to a saner model. But I
 also believe we should be really vocal when the user visible behavior
 changes. If somebody really asks for the previous behavior I would
 insist on a _strong_ usecase.

I'm a bit lost on what's cleared defined is actually changing.  It's
not like userland had firm control over mm-owner.  It was already a
crapshoot, no?

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-29 Thread Tejun Heo
Hello,

On Fri, May 29, 2015 at 04:57:39PM +0200, Michal Hocko wrote:
   
   It also allows several control groups that are virtually grouped by
   mm_struct, to exist independent of the memory controller i.e., without
   adding mem_cgroup's for each controller, to mm_struct.
   
   suggests it might have been intentional. That being said, I think it was
  
  I think he's talking about implmenting different controllers which may
  want to add their own css pointer in mm_struct now wouldn't need to as
  the mm is tagged with the owning task from which membership of all
  controllers can be derived.  I don't think that's something we need to
  worry about.  We haven't seen even a suggestion for such a controller
  and even if that happens we'd be better off adding a separate field
  for the new controller.
 
 Maybe I've just misunderstood. My understandig was that tasks sharing
 the mm could live in different cgroups while the memory would be bound
 by a shared memcg.

Hmm it specifically goes into explaining that it's about having
different controllers sharing the owner field.

 i.e., without adding mem_cgroup's for each controller, to mm_struct.

It seems fairly clear to me.

  I'm a bit lost on what's cleared defined is actually changing.  It's
  not like userland had firm control over mm-owner.  It was already a
  crapshoot, no?
 
 OK so you creat a task A (leader) which clones several tasks Pn with
 CLONE_VM without CLONE_THREAD. Moving A around would control memcg
 membership while Pn could be moved around freely to control membership
 in other controllers (e.g. cpu to control shares). So it is something
 like moving threads separately.

Sure, it'd behave clearly in certain cases but then again you'd have
cases where how mm-owner changes isn't clear at all when seen from
the userland.  e.g. When the original owner goes away, the assignment
of the next owner is essentially arbitrary.  That's what I meant by
saying it was already a crapshoot.  We should definitely document the
change but this isn't likely to be an issue.  CLONE_VM 
!CLONE_THREAD is an extreme corner case to begin with and even the
behavior there wasn't all that clearly defined.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-29 Thread Michal Hocko
On Fri 29-05-15 10:07:37, Tejun Heo wrote:
 Hello,
 
 On Fri, May 29, 2015 at 03:45:53PM +0200, Michal Hocko wrote:
  Sure but we are talking about processes here. They just happen to share
  mm. And this is exactly the behavior change I am talking about... With
 
 Are we talking about CLONE_VM w/o CLONE_THREAD?  ie. two threadgroups
 sharing the same VM?

yes.

  the owner you could emulate threads with this patch you cannot
  anymore. IMO we shouldn't allow for that but just reading the original
  commit message (cf475ad28ac35) which has added mm-owner:
  
  It also allows several control groups that are virtually grouped by
  mm_struct, to exist independent of the memory controller i.e., without
  adding mem_cgroup's for each controller, to mm_struct.
  
  suggests it might have been intentional. That being said, I think it was
 
 I think he's talking about implmenting different controllers which may
 want to add their own css pointer in mm_struct now wouldn't need to as
 the mm is tagged with the owning task from which membership of all
 controllers can be derived.  I don't think that's something we need to
 worry about.  We haven't seen even a suggestion for such a controller
 and even if that happens we'd be better off adding a separate field
 for the new controller.

Maybe I've just misunderstood. My understandig was that tasks sharing
the mm could live in different cgroups while the memory would be bound
by a shared memcg.

  a mistake back at the time and we should move on to a saner model. But I
  also believe we should be really vocal when the user visible behavior
  changes. If somebody really asks for the previous behavior I would
  insist on a _strong_ usecase.
 
 I'm a bit lost on what's cleared defined is actually changing.  It's
 not like userland had firm control over mm-owner.  It was already a
 crapshoot, no?

OK so you creat a task A (leader) which clones several tasks Pn with
CLONE_VM without CLONE_THREAD. Moving A around would control memcg
membership while Pn could be moved around freely to control membership
in other controllers (e.g. cpu to control shares). So it is something
like moving threads separately.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-29 Thread Michal Hocko
On Fri 29-05-15 11:23:28, Tejun Heo wrote:
 Hello,
 
 On Fri, May 29, 2015 at 04:57:39PM +0200, Michal Hocko wrote:
[...]
  OK so you creat a task A (leader) which clones several tasks Pn with
  CLONE_VM without CLONE_THREAD. Moving A around would control memcg
  membership while Pn could be moved around freely to control membership
  in other controllers (e.g. cpu to control shares). So it is something
  like moving threads separately.
 
 Sure, it'd behave clearly in certain cases but then again you'd have
 cases where how mm-owner changes isn't clear at all when seen from
 the userland. 

Sure. I am definitely _not_ advocating this use case! As said before, I
consider it abuse. It is just fair to point out this is a user visible
change IMO.

 e.g. When the original owner goes away, the assignment
 of the next owner is essentially arbitrary.  That's what I meant by
 saying it was already a crapshoot.  We should definitely document the
 change but this isn't likely to be an issue.  CLONE_VM 
 !CLONE_THREAD is an extreme corner case to begin with and even the
 behavior there wasn't all that clearly defined.

That is the line of argumentation in my changelog ;)

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-28 Thread Tejun Heo
Hello, Johannes, Michal.

On Tue, May 26, 2015 at 10:10:11AM -0400, Johannes Weiner wrote:
> On Tue, May 26, 2015 at 01:50:06PM +0200, Michal Hocko wrote:
> > Please note that this patch introduces a USER VISIBLE CHANGE OF BEHAVIOR.
> > Without mm->owner _all_ tasks associated with the mm_struct would
> > initiate memcg migration while previously only owner of the mm_struct
> > could do that. The original behavior was awkward though because the user
> > task didn't have any means to find out the current owner (esp. after
> > mm_update_next_owner) so the migration behavior was not well defined
> > in general.
> > New cgroup API (unified hierarchy) will discontinue tasks file which
> > means that migrating threads will no longer be possible. In such a case
> > having CLONE_VM without CLONE_THREAD could emulate the thread behavior
> > but this patch prevents from isolating memcg controllers from others.
> > Nevertheless I am not convinced such a use case would really deserve
> > complications on the memcg code side.
> 
> I think such a change is okay.  The memcg semantics of moving threads
> with the same mm into separate groups have always been arbitrary.  No
> reasonable behavior can be expected of this, so what sane real life
> usecase would rely on it?

I suppose that making mm always follow the threadgroup leader should
be fine, right?  While this wouldn't make any difference in the
unified hierarchy, I think this would make more sense for traditional
hierarchies.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-28 Thread Tejun Heo
Hello, Johannes, Michal.

On Tue, May 26, 2015 at 10:10:11AM -0400, Johannes Weiner wrote:
 On Tue, May 26, 2015 at 01:50:06PM +0200, Michal Hocko wrote:
  Please note that this patch introduces a USER VISIBLE CHANGE OF BEHAVIOR.
  Without mm-owner _all_ tasks associated with the mm_struct would
  initiate memcg migration while previously only owner of the mm_struct
  could do that. The original behavior was awkward though because the user
  task didn't have any means to find out the current owner (esp. after
  mm_update_next_owner) so the migration behavior was not well defined
  in general.
  New cgroup API (unified hierarchy) will discontinue tasks file which
  means that migrating threads will no longer be possible. In such a case
  having CLONE_VM without CLONE_THREAD could emulate the thread behavior
  but this patch prevents from isolating memcg controllers from others.
  Nevertheless I am not convinced such a use case would really deserve
  complications on the memcg code side.
 
 I think such a change is okay.  The memcg semantics of moving threads
 with the same mm into separate groups have always been arbitrary.  No
 reasonable behavior can be expected of this, so what sane real life
 usecase would rely on it?

I suppose that making mm always follow the threadgroup leader should
be fine, right?  While this wouldn't make any difference in the
unified hierarchy, I think this would make more sense for traditional
hierarchies.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-27 Thread Michal Hocko
On Tue 26-05-15 13:20:19, Johannes Weiner wrote:
> On Tue, May 26, 2015 at 05:11:49PM +0200, Michal Hocko wrote:
> > On Tue 26-05-15 10:10:11, Johannes Weiner wrote:
> > > On Tue, May 26, 2015 at 01:50:06PM +0200, Michal Hocko wrote:
> > > > @@ -104,7 +105,12 @@ static inline bool mm_match_cgroup(struct 
> > > > mm_struct *mm,
> > > > bool match = false;
> > > >  
> > > > rcu_read_lock();
> > > > -   task_memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> > > > +   /*
> > > > +* rcu_dereference would be better but mem_cgroup is not a 
> > > > complete
> > > > +* type here
> > > > +*/
> > > > +   task_memcg = READ_ONCE(mm->memcg);
> > > > +   smp_read_barrier_depends();
> > > > if (task_memcg)
> > > > match = mem_cgroup_is_descendant(task_memcg, memcg);
> > > > rcu_read_unlock();
> > > 
> > > This function has only one user in rmap.  If you inline it there, you
> > > can use rcu_dereference() and get rid of the specialness & comment.
> > 
> > I am not sure I understand. struct mem_cgroup is defined in
> > mm/memcontrol.c so mm/rmap.c will not see it. Or do you suggest pulling
> > struct mem_cgroup out into a header with all the dependencies?
> 
> Yes, I think that would be preferrable.  It's weird that we have such
> a major data structure that is used all over the mm-code but only in
> the shape of pointers to an incomplete type.  It forces a bad style of
> code that uses uninlinable callbacks and accessors for even the most
> basic things.  There are a few functions in memcontrol.c that could
> instead be static inlines or should even be implemented as part of the
> code that is using them, such as

Fair enough. I was afraid of dependencies between networking and memcg
header files but it seems that only struct cg_proto is really needed for
tcp kmem controller and that one doesn't depend on any socket specific
stuff. So we are good here. 

> mem_cgroup_get_lru_size(),
> mem_cgroup_is_descendant, mem_cgroup_inactive_anon_is_low(),
> mem_cgroup_lruvec_online(), mem_cgroup_swappiness(),
> mem_cgroup_select_victim_node(), mem_cgroup_update_page_stat(), and
> mem_cgroup_events().  Your new functions fall into the same category.

Let me try how this will end up. Hopefully the code will not grow too
much.

> > @@ -486,29 +486,13 @@ void mm_set_memcg(struct mm_struct *mm, struct 
> > mem_cgroup *memcg)
> >  void mm_drop_memcg(struct mm_struct *mm)
> >  {
> > /*
> > -* This is the last reference to mm so nobody can see
> > -* this memcg
> > +* We could reset mm->memcg, but the mm goes away as this is the
> > +* last reference.
> >  */
> > if (mm->memcg)
> > css_put(>memcg->css);
> >  }
> 
> This function is supposed to be an API call to disassociate a mm from
> its memcg, but it actually doesn't do that and will leave a dangling
> pointer based on assumptions it makes about how and when the caller
> invokes it.  That's bad.  It's a subtle optimization with dependencies
> spread across two moving parts.  The result is very fragile code which
> will break things in non-obvious ways when the caller changes later on.

Fair point. The optimization is not really worth it and I will add
explicit NULLing because I would prefer to keep the function as well as
mm_set_memcg because this is easier to track and at least mm_set_memcg
needs to be called from two places (as pointed out by Oleg) and I would
really like prevent from duplication.

> And what's left standing is silly too: a memcg-specific API to call
> css_put(), even though struct cgroup_subsys_state and css_put() are
> public API already.
> 
> Both these things are a negative side effect of struct mem_cgroup
> being semi-private.  Memcg pointers are everywhere, yet we need a
> public interface indirection for every simple dereference.
> 
> > @@ -5252,10 +5236,15 @@ static void mem_cgroup_move_task(struct 
> > cgroup_subsys_state *css,
> >  
> > if (mm) {
> > /*
> > -* Commit to a new memcg. mc.to points to the destination
> > -* memcg even when the current charges are not moved.
> > +* Commit to the target memcg even when we do not move
> > +* charges.
> >  */
> > -   mm_move_memcg(mm, mc.to);
> > +   struct mem_cgroup *old_memcg = READ_ONCE(mm->memcg);
> > +   struct mem_cgroup *new_memcg = mem_cgroup_from_css(css);
> > +
> > +   mm_set_memcg(mm, new_memcg);
> > +   if (old_memcg)
> > +   css_put(_memcg->css);
> 
> "Commit" is a problematic choice of words because of its existing
> meaning in memcg of associating a page with a pre-reserved charge.
> 
> I'm not sure a comment is actually necessary here.  Reassigning
> mm->memcg when moving a process pretty straight forward IMO.

OK, will remove it.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" 

Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-27 Thread Michal Hocko
On Tue 26-05-15 19:38:22, Oleg Nesterov wrote:
> On 05/26, Michal Hocko wrote:
> >
> > On Tue 26-05-15 18:36:46, Oleg Nesterov wrote:
> > >
> > > > +static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
> > > > +{
> > > > +   if (!p->mm)
> > > > +   return NULL;
> > > > +   return rcu_dereference(p->mm->memcg);
> > > > +}
> > >
> > > Probably I missed something, but it seems that the callers do not
> > > expect it can return NULL.
> >
> > This hasn't changed by this patch. mem_cgroup_from_task was allowed to
> > return NULL even before. I've just made it static because it doesn't
> > have any external users anymore.
> 
> I see, but it could only return NULL if mem_cgroup_from_css() returns
> NULL. Now it returns NULL for sure if the caller is task_in_mem_cgroup(),
> 
>   // called when task->mm == NULL
> 
>   task_memcg = mem_cgroup_from_task(task);
>   css_get(_memcg->css);
> 
> and this css_get() doesn't look nice if task_memcg == NULL ;)

You are right of course. mem_cgroup_from_task is indeed weird. I will
add the diff below to the original patch and try to get rid of this
weird interface in a follow up patch.

> > I will double check
> 
> Yes, please. Perhaps I missed something.
> 
> > > And in fact I can't understand what mem_cgroup_from_task() actually
> > > means, with or without these changes.
> >
> > It performs task_struct->mem_cgroup mapping. We cannot use cgroup
> > mapping here because the charges are bound to mm_struct rather than
> > task.
> 
> Sure, this is what I can understand. I meant... OK, lets ignore
> "without these changes", because without these changes there are
> much more oddities ;) With these changes only ->mm == NULL case
> looks unclear.
> 
> And btw,
> 
>   if (!p->mm)
>   return NULL;
>   return rcu_dereference(p->mm->memcg);
> 
> perhaps this needs a comment. It is not clear what protects ->mm.
> But. After this series "p" is always current (if ->mm != NULL), so
> this is fine.
> 
> Nevermind. Please forget. I feel this needs a bit of cleanup, but
> we can always do this later.

Yes I will rather do that in a separate patch. Thanks!

This will go into to patch because I have indeed change the semantic of
this function and I haven't realized the subtle difference.
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aa85d5dfbe0e..ab00b6ae84e2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -471,9 +471,14 @@ static inline struct mem_cgroup 
*mem_cgroup_from_id(unsigned short id)
 
 static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
 {
-   if (!p->mm)
-   return NULL;
-   return rcu_dereference(p->mm->memcg);
+   if (p->mm)
+   return rcu_dereference(p->mm->memcg);
+
+   /*
+* If the process doesn't have mm struct anymore we have to fallback
+* to the task_css.
+*/
+   return mem_cgroup_from_css(task_css(p, memory_cgrp_id));
 }
 
 void mm_set_memcg(struct mm_struct *mm, struct mem_cgroup *memcg)
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-27 Thread Michal Hocko
On Tue 26-05-15 19:38:22, Oleg Nesterov wrote:
 On 05/26, Michal Hocko wrote:
 
  On Tue 26-05-15 18:36:46, Oleg Nesterov wrote:
  
+static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
+{
+   if (!p-mm)
+   return NULL;
+   return rcu_dereference(p-mm-memcg);
+}
  
   Probably I missed something, but it seems that the callers do not
   expect it can return NULL.
 
  This hasn't changed by this patch. mem_cgroup_from_task was allowed to
  return NULL even before. I've just made it static because it doesn't
  have any external users anymore.
 
 I see, but it could only return NULL if mem_cgroup_from_css() returns
 NULL. Now it returns NULL for sure if the caller is task_in_mem_cgroup(),
 
   // called when task-mm == NULL
 
   task_memcg = mem_cgroup_from_task(task);
   css_get(task_memcg-css);
 
 and this css_get() doesn't look nice if task_memcg == NULL ;)

You are right of course. mem_cgroup_from_task is indeed weird. I will
add the diff below to the original patch and try to get rid of this
weird interface in a follow up patch.

  I will double check
 
 Yes, please. Perhaps I missed something.
 
   And in fact I can't understand what mem_cgroup_from_task() actually
   means, with or without these changes.
 
  It performs task_struct-mem_cgroup mapping. We cannot use cgroup
  mapping here because the charges are bound to mm_struct rather than
  task.
 
 Sure, this is what I can understand. I meant... OK, lets ignore
 without these changes, because without these changes there are
 much more oddities ;) With these changes only -mm == NULL case
 looks unclear.
 
 And btw,
 
   if (!p-mm)
   return NULL;
   return rcu_dereference(p-mm-memcg);
 
 perhaps this needs a comment. It is not clear what protects -mm.
 But. After this series p is always current (if -mm != NULL), so
 this is fine.
 
 Nevermind. Please forget. I feel this needs a bit of cleanup, but
 we can always do this later.

Yes I will rather do that in a separate patch. Thanks!

This will go into to patch because I have indeed change the semantic of
this function and I haven't realized the subtle difference.
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aa85d5dfbe0e..ab00b6ae84e2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -471,9 +471,14 @@ static inline struct mem_cgroup 
*mem_cgroup_from_id(unsigned short id)
 
 static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
 {
-   if (!p-mm)
-   return NULL;
-   return rcu_dereference(p-mm-memcg);
+   if (p-mm)
+   return rcu_dereference(p-mm-memcg);
+
+   /*
+* If the process doesn't have mm struct anymore we have to fallback
+* to the task_css.
+*/
+   return mem_cgroup_from_css(task_css(p, memory_cgrp_id));
 }
 
 void mm_set_memcg(struct mm_struct *mm, struct mem_cgroup *memcg)
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-27 Thread Michal Hocko
On Tue 26-05-15 13:20:19, Johannes Weiner wrote:
 On Tue, May 26, 2015 at 05:11:49PM +0200, Michal Hocko wrote:
  On Tue 26-05-15 10:10:11, Johannes Weiner wrote:
   On Tue, May 26, 2015 at 01:50:06PM +0200, Michal Hocko wrote:
@@ -104,7 +105,12 @@ static inline bool mm_match_cgroup(struct 
mm_struct *mm,
bool match = false;
 
rcu_read_lock();
-   task_memcg = mem_cgroup_from_task(rcu_dereference(mm-owner));
+   /*
+* rcu_dereference would be better but mem_cgroup is not a 
complete
+* type here
+*/
+   task_memcg = READ_ONCE(mm-memcg);
+   smp_read_barrier_depends();
if (task_memcg)
match = mem_cgroup_is_descendant(task_memcg, memcg);
rcu_read_unlock();
   
   This function has only one user in rmap.  If you inline it there, you
   can use rcu_dereference() and get rid of the specialness  comment.
  
  I am not sure I understand. struct mem_cgroup is defined in
  mm/memcontrol.c so mm/rmap.c will not see it. Or do you suggest pulling
  struct mem_cgroup out into a header with all the dependencies?
 
 Yes, I think that would be preferrable.  It's weird that we have such
 a major data structure that is used all over the mm-code but only in
 the shape of pointers to an incomplete type.  It forces a bad style of
 code that uses uninlinable callbacks and accessors for even the most
 basic things.  There are a few functions in memcontrol.c that could
 instead be static inlines or should even be implemented as part of the
 code that is using them, such as

Fair enough. I was afraid of dependencies between networking and memcg
header files but it seems that only struct cg_proto is really needed for
tcp kmem controller and that one doesn't depend on any socket specific
stuff. So we are good here. 

 mem_cgroup_get_lru_size(),
 mem_cgroup_is_descendant, mem_cgroup_inactive_anon_is_low(),
 mem_cgroup_lruvec_online(), mem_cgroup_swappiness(),
 mem_cgroup_select_victim_node(), mem_cgroup_update_page_stat(), and
 mem_cgroup_events().  Your new functions fall into the same category.

Let me try how this will end up. Hopefully the code will not grow too
much.

  @@ -486,29 +486,13 @@ void mm_set_memcg(struct mm_struct *mm, struct 
  mem_cgroup *memcg)
   void mm_drop_memcg(struct mm_struct *mm)
   {
  /*
  -* This is the last reference to mm so nobody can see
  -* this memcg
  +* We could reset mm-memcg, but the mm goes away as this is the
  +* last reference.
   */
  if (mm-memcg)
  css_put(mm-memcg-css);
   }
 
 This function is supposed to be an API call to disassociate a mm from
 its memcg, but it actually doesn't do that and will leave a dangling
 pointer based on assumptions it makes about how and when the caller
 invokes it.  That's bad.  It's a subtle optimization with dependencies
 spread across two moving parts.  The result is very fragile code which
 will break things in non-obvious ways when the caller changes later on.

Fair point. The optimization is not really worth it and I will add
explicit NULLing because I would prefer to keep the function as well as
mm_set_memcg because this is easier to track and at least mm_set_memcg
needs to be called from two places (as pointed out by Oleg) and I would
really like prevent from duplication.

 And what's left standing is silly too: a memcg-specific API to call
 css_put(), even though struct cgroup_subsys_state and css_put() are
 public API already.
 
 Both these things are a negative side effect of struct mem_cgroup
 being semi-private.  Memcg pointers are everywhere, yet we need a
 public interface indirection for every simple dereference.
 
  @@ -5252,10 +5236,15 @@ static void mem_cgroup_move_task(struct 
  cgroup_subsys_state *css,
   
  if (mm) {
  /*
  -* Commit to a new memcg. mc.to points to the destination
  -* memcg even when the current charges are not moved.
  +* Commit to the target memcg even when we do not move
  +* charges.
   */
  -   mm_move_memcg(mm, mc.to);
  +   struct mem_cgroup *old_memcg = READ_ONCE(mm-memcg);
  +   struct mem_cgroup *new_memcg = mem_cgroup_from_css(css);
  +
  +   mm_set_memcg(mm, new_memcg);
  +   if (old_memcg)
  +   css_put(old_memcg-css);
 
 Commit is a problematic choice of words because of its existing
 meaning in memcg of associating a page with a pre-reserved charge.
 
 I'm not sure a comment is actually necessary here.  Reassigning
 mm-memcg when moving a process pretty straight forward IMO.

OK, will remove it.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-26 Thread Oleg Nesterov
On 05/26, Michal Hocko wrote:
>
> On Tue 26-05-15 18:36:46, Oleg Nesterov wrote:
> >
> > > +static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
> > > +{
> > > + if (!p->mm)
> > > + return NULL;
> > > + return rcu_dereference(p->mm->memcg);
> > > +}
> >
> > Probably I missed something, but it seems that the callers do not
> > expect it can return NULL.
>
> This hasn't changed by this patch. mem_cgroup_from_task was allowed to
> return NULL even before. I've just made it static because it doesn't
> have any external users anymore.

I see, but it could only return NULL if mem_cgroup_from_css() returns
NULL. Now it returns NULL for sure if the caller is task_in_mem_cgroup(),

// called when task->mm == NULL

task_memcg = mem_cgroup_from_task(task);
css_get(_memcg->css);

and this css_get() doesn't look nice if task_memcg == NULL ;)

> I will double check

Yes, please. Perhaps I missed something.

> > And in fact I can't understand what mem_cgroup_from_task() actually
> > means, with or without these changes.
>
> It performs task_struct->mem_cgroup mapping. We cannot use cgroup
> mapping here because the charges are bound to mm_struct rather than
> task.

Sure, this is what I can understand. I meant... OK, lets ignore
"without these changes", because without these changes there are
much more oddities ;) With these changes only ->mm == NULL case
looks unclear.

And btw,

if (!p->mm)
return NULL;
return rcu_dereference(p->mm->memcg);

perhaps this needs a comment. It is not clear what protects ->mm.
But. After this series "p" is always current (if ->mm != NULL), so
this is fine.

Nevermind. Please forget. I feel this needs a bit of cleanup, but
we can always do this later.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-26 Thread Michal Hocko
On Tue 26-05-15 18:36:46, Oleg Nesterov wrote:
> On 05/26, Michal Hocko wrote:
> >
> > @@ -426,17 +426,7 @@ struct mm_struct {
> > struct kioctx_table __rcu   *ioctx_table;
> >  #endif
> >  #ifdef CONFIG_MEMCG
> > -   /*
> > -* "owner" points to a task that is regarded as the canonical
> > -* user/owner of this mm. All of the following must be true in
> > -* order for it to be changed:
> > -*
> > -* current == mm->owner
> > -* current->mm != mm
> > -* new_owner->mm == mm
> > -* new_owner->alloc_lock is held
> > -*/
> > -   struct task_struct __rcu *owner;
> > +   struct mem_cgroup __rcu *memcg;
> 
> Yes, thanks, this is what I tried to suggest ;)
> 
> But I can't review this series. Simply because I know nothing about
> memcs. I don't even know how to use it.
> 
> Just one question,
> 
> > +static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
> > +{
> > +   if (!p->mm)
> > +   return NULL;
> > +   return rcu_dereference(p->mm->memcg);
> > +}
> 
> Probably I missed something, but it seems that the callers do not
> expect it can return NULL.

This hasn't changed by this patch. mem_cgroup_from_task was allowed to
return NULL even before. I've just made it static because it doesn't
have any external users anymore. I will double check whether we can ever
get NULL there in the real life. We have this code like that for quite
some time. Maybe this is just a heritage from the past...

> Perhaps sock_update_memcg() is fine, but
> task_in_mem_cgroup() calls it when find_lock_task_mm() fails, and in
> this case ->mm is NULL.
> 
> And in fact I can't understand what mem_cgroup_from_task() actually
> means, with or without these changes.

It performs task_struct->mem_cgroup mapping. We cannot use cgroup
mapping here because the charges are bound to mm_struct rather than
task.

> And another question. I can't understand what happens when a task
> execs... IOW, could you confirm that exec_mmap() does not need
> mm_set_memcg(mm, oldmm->memcg) ?

Right you are! Fixed thanks!
---
diff --git a/fs/exec.c b/fs/exec.c
index 2cd4def4b1d6..ea00d5a47aad 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -867,6 +867,7 @@ static int exec_mmap(struct mm_struct *mm)
up_read(_mm->mmap_sem);
BUG_ON(active_mm != old_mm);
setmax_mm_hiwater_rss(>signal->maxrss, old_mm);
+   mm_set_memcg(mm, old_mm->memcg);
mmput(old_mm);
return 0;
}
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-26 Thread Johannes Weiner
On Tue, May 26, 2015 at 05:11:49PM +0200, Michal Hocko wrote:
> On Tue 26-05-15 10:10:11, Johannes Weiner wrote:
> > On Tue, May 26, 2015 at 01:50:06PM +0200, Michal Hocko wrote:
> > > @@ -104,7 +105,12 @@ static inline bool mm_match_cgroup(struct mm_struct 
> > > *mm,
> > >   bool match = false;
> > >  
> > >   rcu_read_lock();
> > > - task_memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> > > + /*
> > > +  * rcu_dereference would be better but mem_cgroup is not a complete
> > > +  * type here
> > > +  */
> > > + task_memcg = READ_ONCE(mm->memcg);
> > > + smp_read_barrier_depends();
> > >   if (task_memcg)
> > >   match = mem_cgroup_is_descendant(task_memcg, memcg);
> > >   rcu_read_unlock();
> > 
> > This function has only one user in rmap.  If you inline it there, you
> > can use rcu_dereference() and get rid of the specialness & comment.
> 
> I am not sure I understand. struct mem_cgroup is defined in
> mm/memcontrol.c so mm/rmap.c will not see it. Or do you suggest pulling
> struct mem_cgroup out into a header with all the dependencies?

Yes, I think that would be preferrable.  It's weird that we have such
a major data structure that is used all over the mm-code but only in
the shape of pointers to an incomplete type.  It forces a bad style of
code that uses uninlinable callbacks and accessors for even the most
basic things.  There are a few functions in memcontrol.c that could
instead be static inlines or should even be implemented as part of the
code that is using them, such as mem_cgroup_get_lru_size(),
mem_cgroup_is_descendant, mem_cgroup_inactive_anon_is_low(),
mem_cgroup_lruvec_online(), mem_cgroup_swappiness(),
mem_cgroup_select_victim_node(), mem_cgroup_update_page_stat(), and
mem_cgroup_events().  Your new functions fall into the same category.

> @@ -486,29 +486,13 @@ void mm_set_memcg(struct mm_struct *mm, struct 
> mem_cgroup *memcg)
>  void mm_drop_memcg(struct mm_struct *mm)
>  {
>   /*
> -  * This is the last reference to mm so nobody can see
> -  * this memcg
> +  * We could reset mm->memcg, but the mm goes away as this is the
> +  * last reference.
>*/
>   if (mm->memcg)
>   css_put(>memcg->css);
>  }

This function is supposed to be an API call to disassociate a mm from
its memcg, but it actually doesn't do that and will leave a dangling
pointer based on assumptions it makes about how and when the caller
invokes it.  That's bad.  It's a subtle optimization with dependencies
spread across two moving parts.  The result is very fragile code which
will break things in non-obvious ways when the caller changes later on.

And what's left standing is silly too: a memcg-specific API to call
css_put(), even though struct cgroup_subsys_state and css_put() are
public API already.

Both these things are a negative side effect of struct mem_cgroup
being semi-private.  Memcg pointers are everywhere, yet we need a
public interface indirection for every simple dereference.

> @@ -5252,10 +5236,15 @@ static void mem_cgroup_move_task(struct 
> cgroup_subsys_state *css,
>  
>   if (mm) {
>   /*
> -  * Commit to a new memcg. mc.to points to the destination
> -  * memcg even when the current charges are not moved.
> +  * Commit to the target memcg even when we do not move
> +  * charges.
>*/
> - mm_move_memcg(mm, mc.to);
> + struct mem_cgroup *old_memcg = READ_ONCE(mm->memcg);
> + struct mem_cgroup *new_memcg = mem_cgroup_from_css(css);
> +
> + mm_set_memcg(mm, new_memcg);
> + if (old_memcg)
> + css_put(_memcg->css);

"Commit" is a problematic choice of words because of its existing
meaning in memcg of associating a page with a pre-reserved charge.

I'm not sure a comment is actually necessary here.  Reassigning
mm->memcg when moving a process pretty straight forward IMO.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-26 Thread Oleg Nesterov
On 05/26, Michal Hocko wrote:
>
> @@ -426,17 +426,7 @@ struct mm_struct {
>   struct kioctx_table __rcu   *ioctx_table;
>  #endif
>  #ifdef CONFIG_MEMCG
> - /*
> -  * "owner" points to a task that is regarded as the canonical
> -  * user/owner of this mm. All of the following must be true in
> -  * order for it to be changed:
> -  *
> -  * current == mm->owner
> -  * current->mm != mm
> -  * new_owner->mm == mm
> -  * new_owner->alloc_lock is held
> -  */
> - struct task_struct __rcu *owner;
> + struct mem_cgroup __rcu *memcg;

Yes, thanks, this is what I tried to suggest ;)

But I can't review this series. Simply because I know nothing about
memcs. I don't even know how to use it.

Just one question,

> +static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
> +{
> + if (!p->mm)
> + return NULL;
> + return rcu_dereference(p->mm->memcg);
> +}

Probably I missed something, but it seems that the callers do not
expect it can return NULL. Perhaps sock_update_memcg() is fine, but
task_in_mem_cgroup() calls it when find_lock_task_mm() fails, and in
this case ->mm is NULL.

And in fact I can't understand what mem_cgroup_from_task() actually
means, with or without these changes.

And another question. I can't understand what happens when a task
execs... IOW, could you confirm that exec_mmap() does not need
mm_set_memcg(mm, oldmm->memcg) ?

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-26 Thread Michal Hocko
[CCing Greg who I forgot to add the to list - sorry about that. The
thread starts here: http://marc.info/?l=linux-mm=143264102317318=2]

On Tue 26-05-15 10:10:11, Johannes Weiner wrote:
> On Tue, May 26, 2015 at 01:50:06PM +0200, Michal Hocko wrote:
> > Please note that this patch introduces a USER VISIBLE CHANGE OF BEHAVIOR.
> > Without mm->owner _all_ tasks associated with the mm_struct would
> > initiate memcg migration while previously only owner of the mm_struct
> > could do that. The original behavior was awkward though because the user
> > task didn't have any means to find out the current owner (esp. after
> > mm_update_next_owner) so the migration behavior was not well defined
> > in general.
> > New cgroup API (unified hierarchy) will discontinue tasks file which
> > means that migrating threads will no longer be possible. In such a case
> > having CLONE_VM without CLONE_THREAD could emulate the thread behavior
> > but this patch prevents from isolating memcg controllers from others.
> > Nevertheless I am not convinced such a use case would really deserve
> > complications on the memcg code side.
> 
> I think such a change is okay.  The memcg semantics of moving threads
> with the same mm into separate groups have always been arbitrary.  No
> reasonable behavior can be expected of this, so what sane real life
> usecase would rely on it?

I can imagine that threads would go to different cgroups because of
other controllers (e.g. cpu or cpuset).
AFAIR google was doing threads distribution.

> > @@ -104,7 +105,12 @@ static inline bool mm_match_cgroup(struct mm_struct 
> > *mm,
> > bool match = false;
> >  
> > rcu_read_lock();
> > -   task_memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> > +   /*
> > +* rcu_dereference would be better but mem_cgroup is not a complete
> > +* type here
> > +*/
> > +   task_memcg = READ_ONCE(mm->memcg);
> > +   smp_read_barrier_depends();
> > if (task_memcg)
> > match = mem_cgroup_is_descendant(task_memcg, memcg);
> > rcu_read_unlock();
> 
> This function has only one user in rmap.  If you inline it there, you
> can use rcu_dereference() and get rid of the specialness & comment.

I am not sure I understand. struct mem_cgroup is defined in
mm/memcontrol.c so mm/rmap.c will not see it. Or do you suggest pulling
struct mem_cgroup out into a header with all the dependencies?

> > @@ -195,6 +201,10 @@ void mem_cgroup_split_huge_fixup(struct page *head);
> >  #else /* CONFIG_MEMCG */
> >  struct mem_cgroup;
> >  
> > +void mm_drop_memcg(struct mm_struct *mm)
> > +{}
> > +void mm_set_memcg(struct mm_struct *mm, struct mem_cgroup *memcg)
> > +{}
> 
> static inline?

Of course. Fixed.
 
[...]
> > @@ -469,6 +469,46 @@ static inline struct mem_cgroup 
> > *mem_cgroup_from_id(unsigned short id)
> > return mem_cgroup_from_css(css);
> >  }
> >  
> > +static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
> > +{
> > +   if (!p->mm)
> > +   return NULL;
> > +   return rcu_dereference(p->mm->memcg);
> > +}
> > +
> > +void mm_set_memcg(struct mm_struct *mm, struct mem_cgroup *memcg)
> > +{
> > +   if (memcg)
> > +   css_get(>css);
> > +   rcu_assign_pointer(mm->memcg, memcg);
> > +}
> > +
> > +void mm_drop_memcg(struct mm_struct *mm)
> > +{
> > +   /*
> > +* This is the last reference to mm so nobody can see
> > +* this memcg
> > +*/
> > +   if (mm->memcg)
> > +   css_put(>memcg->css);
> > +}
> 
> This is really simple and obvious and has only one caller, it would be
> better to inline this into mmput().  The comment would also be easier
> to understand in conjunction with the mmdrop() in the callsite:

Same case as rmap.c.

> 
>   if (mm->memcg)
>   css_put(>memcg->css);
>   /* We could reset mm->memcg, but this will free the mm: */
>   mmdrop(mm);

I like your comment more. I will update it

> 
> The same goes for mm_set_memcg, there is no real need for obscuring a
> simple get-and-store.
> 
> > +static void mm_move_memcg(struct mm_struct *mm, struct mem_cgroup *memcg)
> > +{
> > +   struct mem_cgroup *old_memcg;
> > +
> > +   mm_set_memcg(mm, memcg);
> > +
> > +   /*
> > +* wait for all current users of the old memcg before we
> > +* release the reference.
> > +*/
> > +   old_memcg = mm->memcg;

Doh. Last minute changes... This is incorrect, of course, because I am
dropping the new memcg reference. Fixed

> > +   synchronize_rcu();
> > +   if (old_memcg)
> > +   css_put(_memcg->css);
> > +}
> 
> I'm not sure why we need that synchronize_rcu() in here, the css is
> itself protected by RCU and a failing tryget will prevent you from
> taking it outside a RCU-locked region.

Yeah, you are right. Removed.

> Aside from that, there is again exactly one place that performs this
> operation.  Please inline it into mem_cgroup_move_task().

OK, I will inline it there.

> > @@ -5204,6 +5251,12 @@ static void mem_cgroup_move_task(struct 
> > 

[RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-26 Thread Michal Hocko
mm_struct::owner keeps track of the task which is in charge for the
specific mm. This is usually the thread group leader of the task but
there are more exotic cases where this doesn't hold.

The most prominent one is when separate tasks (not in the same thread
group) share the address space (by using clone with CLONE_VM without
CLONE_THREAD). The first task will be the owner until it exits.
mm_update_next_owner will then try to find a new owner - a task which
points to the same mm_struct. There is no guarantee a new owner will
be the thread group leader though because the leader might have
exited. Even though such a thread will be still around waiting for the
remaining threads from its group, it's mm will be NULL so it cannot be
chosen.

cgroup migration code, however assumes only group leaders when migrating
via cgroup.procs (which will be the only mode in the unified hierarchy
API) while mem_cgroup_can_attach only those tasks which are owner
of the mm. So we might end up with tasks which cannot be migrated.
mm_update_next_owner could be tweaked to try harder and use a group
leader whenever possible but this will never be 100% because all the
leaders might be dead.  It seems that getting rid of the mm->owner
sounds like a better option.

The whole concept of the mm owner is a bit artificial and too tricky to
get right. All the memcg code needs is to find struct mem_cgroup from
a given mm_struct and there are only two events when the association
is either built or changed
- a new mm is created - dup_mm - when the memcg is inherited
  from the oldmm
- task associated with the mm is moved to another memcg
So it is much more easier to bind mm_struct with the mem_cgroup directly
rather than indirectly via a task. This is exactly what this patch does.

mm_set_memcg and mm_drop_memcg are exported for the core kernel to bind
an old memcg during dup_mm and releasing that memcg in mmput after the
last reference is dropped and no task sees the mm anymore. We have to be
careful and take a reference to the memcg->css so that it doesn't vanish
from under our feet.
mm_move_memcg is then used during the task migration to change the
association. This is done in mem_cgroup_move_task before charges get
moved because mem_cgroup_can_attach is too early and other controllers
might fail and we would have to handle the rollback. The race between
can_attach and attach is harmless and it existed even before.

mm->memcg conforms to standard mem_cgroup locking rules. It has to be
used inside rcu_read_{un}lock() and a reference has to be taken before the
unlock if the memcg is supposed to be used outside. mm_move_memcg will
make sure that all the preexisting users will finish before it drops the
reference to the old memcg.

Finally mem_cgroup_can_attach will allow task migration only for the
thread group leaders to conform with cgroup core requirements.

Please note that this patch introduces a USER VISIBLE CHANGE OF BEHAVIOR.
Without mm->owner _all_ tasks associated with the mm_struct would
initiate memcg migration while previously only owner of the mm_struct
could do that. The original behavior was awkward though because the user
task didn't have any means to find out the current owner (esp. after
mm_update_next_owner) so the migration behavior was not well defined
in general.
New cgroup API (unified hierarchy) will discontinue tasks file which
means that migrating threads will no longer be possible. In such a case
having CLONE_VM without CLONE_THREAD could emulate the thread behavior
but this patch prevents from isolating memcg controllers from others.
Nevertheless I am not convinced such a use case would really deserve
complications on the memcg code side.

Signed-off-by: Michal Hocko 
---
 fs/exec.c  |  1 -
 include/linux/memcontrol.h | 14 ++-
 include/linux/mm_types.h   | 12 +-
 kernel/exit.c  | 89 -
 kernel/fork.c  | 10 +
 mm/debug.c |  4 +-
 mm/memcontrol.c| 99 +++---
 7 files changed, 93 insertions(+), 136 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 02bfd980a40c..2cd4def4b1d6 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -867,7 +867,6 @@ static int exec_mmap(struct mm_struct *mm)
up_read(_mm->mmap_sem);
BUG_ON(active_mm != old_mm);
setmax_mm_hiwater_rss(>signal->maxrss, old_mm);
-   mm_update_next_owner(old_mm);
mmput(old_mm);
return 0;
}
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6c8918114804..315ec1e58acb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -67,6 +67,8 @@ enum mem_cgroup_events_index {
 };
 
 #ifdef CONFIG_MEMCG
+void mm_drop_memcg(struct mm_struct *mm);
+void mm_set_memcg(struct mm_struct *mm, struct mem_cgroup *memcg);
 void mem_cgroup_events(struct mem_cgroup 

Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-26 Thread Johannes Weiner
On Tue, May 26, 2015 at 01:50:06PM +0200, Michal Hocko wrote:
> Please note that this patch introduces a USER VISIBLE CHANGE OF BEHAVIOR.
> Without mm->owner _all_ tasks associated with the mm_struct would
> initiate memcg migration while previously only owner of the mm_struct
> could do that. The original behavior was awkward though because the user
> task didn't have any means to find out the current owner (esp. after
> mm_update_next_owner) so the migration behavior was not well defined
> in general.
> New cgroup API (unified hierarchy) will discontinue tasks file which
> means that migrating threads will no longer be possible. In such a case
> having CLONE_VM without CLONE_THREAD could emulate the thread behavior
> but this patch prevents from isolating memcg controllers from others.
> Nevertheless I am not convinced such a use case would really deserve
> complications on the memcg code side.

I think such a change is okay.  The memcg semantics of moving threads
with the same mm into separate groups have always been arbitrary.  No
reasonable behavior can be expected of this, so what sane real life
usecase would rely on it?

> @@ -104,7 +105,12 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
>   bool match = false;
>  
>   rcu_read_lock();
> - task_memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> + /*
> +  * rcu_dereference would be better but mem_cgroup is not a complete
> +  * type here
> +  */
> + task_memcg = READ_ONCE(mm->memcg);
> + smp_read_barrier_depends();
>   if (task_memcg)
>   match = mem_cgroup_is_descendant(task_memcg, memcg);
>   rcu_read_unlock();

This function has only one user in rmap.  If you inline it there, you
can use rcu_dereference() and get rid of the specialness & comment.

> @@ -195,6 +201,10 @@ void mem_cgroup_split_huge_fixup(struct page *head);
>  #else /* CONFIG_MEMCG */
>  struct mem_cgroup;
>  
> +void mm_drop_memcg(struct mm_struct *mm)
> +{}
> +void mm_set_memcg(struct mm_struct *mm, struct mem_cgroup *memcg)
> +{}

static inline?

> @@ -292,94 +292,6 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct 
> task_struct *parent)
>   }
>  }
>  
> -#ifdef CONFIG_MEMCG
> -/*
> - * A task is exiting.   If it owned this mm, find a new owner for the mm.
> - */
> -void mm_update_next_owner(struct mm_struct *mm)
> -{
> - struct task_struct *c, *g, *p = current;
> -
> -retry:
> - /*
> -  * If the exiting or execing task is not the owner, it's
> -  * someone else's problem.
> -  */
> - if (mm->owner != p)
> - return;
> - /*
> -  * The current owner is exiting/execing and there are no other
> -  * candidates.  Do not leave the mm pointing to a possibly
> -  * freed task structure.
> -  */
> - if (atomic_read(>mm_users) <= 1) {
> - mm->owner = NULL;
> - return;
> - }
> -
> - read_lock(_lock);
> - /*
> -  * Search in the children
> -  */
> - list_for_each_entry(c, >children, sibling) {
> - if (c->mm == mm)
> - goto assign_new_owner;
> - }
> -
> - /*
> -  * Search in the siblings
> -  */
> - list_for_each_entry(c, >real_parent->children, sibling) {
> - if (c->mm == mm)
> - goto assign_new_owner;
> - }
> -
> - /*
> -  * Search through everything else, we should not get here often.
> -  */
> - for_each_process(g) {
> - if (g->flags & PF_KTHREAD)
> - continue;
> - for_each_thread(g, c) {
> - if (c->mm == mm)
> - goto assign_new_owner;
> - if (c->mm)
> - break;
> - }
> - }
> - read_unlock(_lock);
> - /*
> -  * We found no owner yet mm_users > 1: this implies that we are
> -  * most likely racing with swapoff (try_to_unuse()) or /proc or
> -  * ptrace or page migration (get_task_mm()).  Mark owner as NULL.
> -  */
> - mm->owner = NULL;
> - return;
> -
> -assign_new_owner:
> - BUG_ON(c == p);
> - get_task_struct(c);
> - /*
> -  * The task_lock protects c->mm from changing.
> -  * We always want mm->owner->mm == mm
> -  */
> - task_lock(c);
> - /*
> -  * Delay read_unlock() till we have the task_lock()
> -  * to ensure that c does not slip away underneath us
> -  */
> - read_unlock(_lock);
> - if (c->mm != mm) {
> - task_unlock(c);
> - put_task_struct(c);
> - goto retry;
> - }
> - mm->owner = c;
> - task_unlock(c);
> - put_task_struct(c);
> -}
> -#endif /* CONFIG_MEMCG */

w00t!

> @@ -469,6 +469,46 @@ static inline struct mem_cgroup 
> *mem_cgroup_from_id(unsigned short id)
>   return mem_cgroup_from_css(css);
>  }
>  
> +static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
> +{
> + if 

Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-26 Thread Oleg Nesterov
On 05/26, Michal Hocko wrote:

 @@ -426,17 +426,7 @@ struct mm_struct {
   struct kioctx_table __rcu   *ioctx_table;
  #endif
  #ifdef CONFIG_MEMCG
 - /*
 -  * owner points to a task that is regarded as the canonical
 -  * user/owner of this mm. All of the following must be true in
 -  * order for it to be changed:
 -  *
 -  * current == mm-owner
 -  * current-mm != mm
 -  * new_owner-mm == mm
 -  * new_owner-alloc_lock is held
 -  */
 - struct task_struct __rcu *owner;
 + struct mem_cgroup __rcu *memcg;

Yes, thanks, this is what I tried to suggest ;)

But I can't review this series. Simply because I know nothing about
memcs. I don't even know how to use it.

Just one question,

 +static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
 +{
 + if (!p-mm)
 + return NULL;
 + return rcu_dereference(p-mm-memcg);
 +}

Probably I missed something, but it seems that the callers do not
expect it can return NULL. Perhaps sock_update_memcg() is fine, but
task_in_mem_cgroup() calls it when find_lock_task_mm() fails, and in
this case -mm is NULL.

And in fact I can't understand what mem_cgroup_from_task() actually
means, with or without these changes.

And another question. I can't understand what happens when a task
execs... IOW, could you confirm that exec_mmap() does not need
mm_set_memcg(mm, oldmm-memcg) ?

Oleg.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-26 Thread Michal Hocko
mm_struct::owner keeps track of the task which is in charge for the
specific mm. This is usually the thread group leader of the task but
there are more exotic cases where this doesn't hold.

The most prominent one is when separate tasks (not in the same thread
group) share the address space (by using clone with CLONE_VM without
CLONE_THREAD). The first task will be the owner until it exits.
mm_update_next_owner will then try to find a new owner - a task which
points to the same mm_struct. There is no guarantee a new owner will
be the thread group leader though because the leader might have
exited. Even though such a thread will be still around waiting for the
remaining threads from its group, it's mm will be NULL so it cannot be
chosen.

cgroup migration code, however assumes only group leaders when migrating
via cgroup.procs (which will be the only mode in the unified hierarchy
API) while mem_cgroup_can_attach only those tasks which are owner
of the mm. So we might end up with tasks which cannot be migrated.
mm_update_next_owner could be tweaked to try harder and use a group
leader whenever possible but this will never be 100% because all the
leaders might be dead.  It seems that getting rid of the mm-owner
sounds like a better option.

The whole concept of the mm owner is a bit artificial and too tricky to
get right. All the memcg code needs is to find struct mem_cgroup from
a given mm_struct and there are only two events when the association
is either built or changed
- a new mm is created - dup_mm - when the memcg is inherited
  from the oldmm
- task associated with the mm is moved to another memcg
So it is much more easier to bind mm_struct with the mem_cgroup directly
rather than indirectly via a task. This is exactly what this patch does.

mm_set_memcg and mm_drop_memcg are exported for the core kernel to bind
an old memcg during dup_mm and releasing that memcg in mmput after the
last reference is dropped and no task sees the mm anymore. We have to be
careful and take a reference to the memcg-css so that it doesn't vanish
from under our feet.
mm_move_memcg is then used during the task migration to change the
association. This is done in mem_cgroup_move_task before charges get
moved because mem_cgroup_can_attach is too early and other controllers
might fail and we would have to handle the rollback. The race between
can_attach and attach is harmless and it existed even before.

mm-memcg conforms to standard mem_cgroup locking rules. It has to be
used inside rcu_read_{un}lock() and a reference has to be taken before the
unlock if the memcg is supposed to be used outside. mm_move_memcg will
make sure that all the preexisting users will finish before it drops the
reference to the old memcg.

Finally mem_cgroup_can_attach will allow task migration only for the
thread group leaders to conform with cgroup core requirements.

Please note that this patch introduces a USER VISIBLE CHANGE OF BEHAVIOR.
Without mm-owner _all_ tasks associated with the mm_struct would
initiate memcg migration while previously only owner of the mm_struct
could do that. The original behavior was awkward though because the user
task didn't have any means to find out the current owner (esp. after
mm_update_next_owner) so the migration behavior was not well defined
in general.
New cgroup API (unified hierarchy) will discontinue tasks file which
means that migrating threads will no longer be possible. In such a case
having CLONE_VM without CLONE_THREAD could emulate the thread behavior
but this patch prevents from isolating memcg controllers from others.
Nevertheless I am not convinced such a use case would really deserve
complications on the memcg code side.

Signed-off-by: Michal Hocko mho...@suse.cz
---
 fs/exec.c  |  1 -
 include/linux/memcontrol.h | 14 ++-
 include/linux/mm_types.h   | 12 +-
 kernel/exit.c  | 89 -
 kernel/fork.c  | 10 +
 mm/debug.c |  4 +-
 mm/memcontrol.c| 99 +++---
 7 files changed, 93 insertions(+), 136 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 02bfd980a40c..2cd4def4b1d6 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -867,7 +867,6 @@ static int exec_mmap(struct mm_struct *mm)
up_read(old_mm-mmap_sem);
BUG_ON(active_mm != old_mm);
setmax_mm_hiwater_rss(tsk-signal-maxrss, old_mm);
-   mm_update_next_owner(old_mm);
mmput(old_mm);
return 0;
}
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6c8918114804..315ec1e58acb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -67,6 +67,8 @@ enum mem_cgroup_events_index {
 };
 
 #ifdef CONFIG_MEMCG
+void mm_drop_memcg(struct mm_struct *mm);
+void mm_set_memcg(struct mm_struct *mm, struct mem_cgroup *memcg);
 void mem_cgroup_events(struct 

Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-26 Thread Michal Hocko
[CCing Greg who I forgot to add the to list - sorry about that. The
thread starts here: http://marc.info/?l=linux-mmm=143264102317318w=2]

On Tue 26-05-15 10:10:11, Johannes Weiner wrote:
 On Tue, May 26, 2015 at 01:50:06PM +0200, Michal Hocko wrote:
  Please note that this patch introduces a USER VISIBLE CHANGE OF BEHAVIOR.
  Without mm-owner _all_ tasks associated with the mm_struct would
  initiate memcg migration while previously only owner of the mm_struct
  could do that. The original behavior was awkward though because the user
  task didn't have any means to find out the current owner (esp. after
  mm_update_next_owner) so the migration behavior was not well defined
  in general.
  New cgroup API (unified hierarchy) will discontinue tasks file which
  means that migrating threads will no longer be possible. In such a case
  having CLONE_VM without CLONE_THREAD could emulate the thread behavior
  but this patch prevents from isolating memcg controllers from others.
  Nevertheless I am not convinced such a use case would really deserve
  complications on the memcg code side.
 
 I think such a change is okay.  The memcg semantics of moving threads
 with the same mm into separate groups have always been arbitrary.  No
 reasonable behavior can be expected of this, so what sane real life
 usecase would rely on it?

I can imagine that threads would go to different cgroups because of
other controllers (e.g. cpu or cpuset).
AFAIR google was doing threads distribution.

  @@ -104,7 +105,12 @@ static inline bool mm_match_cgroup(struct mm_struct 
  *mm,
  bool match = false;
   
  rcu_read_lock();
  -   task_memcg = mem_cgroup_from_task(rcu_dereference(mm-owner));
  +   /*
  +* rcu_dereference would be better but mem_cgroup is not a complete
  +* type here
  +*/
  +   task_memcg = READ_ONCE(mm-memcg);
  +   smp_read_barrier_depends();
  if (task_memcg)
  match = mem_cgroup_is_descendant(task_memcg, memcg);
  rcu_read_unlock();
 
 This function has only one user in rmap.  If you inline it there, you
 can use rcu_dereference() and get rid of the specialness  comment.

I am not sure I understand. struct mem_cgroup is defined in
mm/memcontrol.c so mm/rmap.c will not see it. Or do you suggest pulling
struct mem_cgroup out into a header with all the dependencies?

  @@ -195,6 +201,10 @@ void mem_cgroup_split_huge_fixup(struct page *head);
   #else /* CONFIG_MEMCG */
   struct mem_cgroup;
   
  +void mm_drop_memcg(struct mm_struct *mm)
  +{}
  +void mm_set_memcg(struct mm_struct *mm, struct mem_cgroup *memcg)
  +{}
 
 static inline?

Of course. Fixed.
 
[...]
  @@ -469,6 +469,46 @@ static inline struct mem_cgroup 
  *mem_cgroup_from_id(unsigned short id)
  return mem_cgroup_from_css(css);
   }
   
  +static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
  +{
  +   if (!p-mm)
  +   return NULL;
  +   return rcu_dereference(p-mm-memcg);
  +}
  +
  +void mm_set_memcg(struct mm_struct *mm, struct mem_cgroup *memcg)
  +{
  +   if (memcg)
  +   css_get(memcg-css);
  +   rcu_assign_pointer(mm-memcg, memcg);
  +}
  +
  +void mm_drop_memcg(struct mm_struct *mm)
  +{
  +   /*
  +* This is the last reference to mm so nobody can see
  +* this memcg
  +*/
  +   if (mm-memcg)
  +   css_put(mm-memcg-css);
  +}
 
 This is really simple and obvious and has only one caller, it would be
 better to inline this into mmput().  The comment would also be easier
 to understand in conjunction with the mmdrop() in the callsite:

Same case as rmap.c.

 
   if (mm-memcg)
   css_put(mm-memcg-css);
   /* We could reset mm-memcg, but this will free the mm: */
   mmdrop(mm);

I like your comment more. I will update it

 
 The same goes for mm_set_memcg, there is no real need for obscuring a
 simple get-and-store.
 
  +static void mm_move_memcg(struct mm_struct *mm, struct mem_cgroup *memcg)
  +{
  +   struct mem_cgroup *old_memcg;
  +
  +   mm_set_memcg(mm, memcg);
  +
  +   /*
  +* wait for all current users of the old memcg before we
  +* release the reference.
  +*/
  +   old_memcg = mm-memcg;

Doh. Last minute changes... This is incorrect, of course, because I am
dropping the new memcg reference. Fixed

  +   synchronize_rcu();
  +   if (old_memcg)
  +   css_put(old_memcg-css);
  +}
 
 I'm not sure why we need that synchronize_rcu() in here, the css is
 itself protected by RCU and a failing tryget will prevent you from
 taking it outside a RCU-locked region.

Yeah, you are right. Removed.

 Aside from that, there is again exactly one place that performs this
 operation.  Please inline it into mem_cgroup_move_task().

OK, I will inline it there.

  @@ -5204,6 +5251,12 @@ static void mem_cgroup_move_task(struct 
  cgroup_subsys_state *css,
  struct mm_struct *mm = get_task_mm(p);
   
  if (mm) {
  +   /*
  +* Commit to a new memcg. mc.to points to the destination
  +* 

Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-26 Thread Johannes Weiner
On Tue, May 26, 2015 at 01:50:06PM +0200, Michal Hocko wrote:
 Please note that this patch introduces a USER VISIBLE CHANGE OF BEHAVIOR.
 Without mm-owner _all_ tasks associated with the mm_struct would
 initiate memcg migration while previously only owner of the mm_struct
 could do that. The original behavior was awkward though because the user
 task didn't have any means to find out the current owner (esp. after
 mm_update_next_owner) so the migration behavior was not well defined
 in general.
 New cgroup API (unified hierarchy) will discontinue tasks file which
 means that migrating threads will no longer be possible. In such a case
 having CLONE_VM without CLONE_THREAD could emulate the thread behavior
 but this patch prevents from isolating memcg controllers from others.
 Nevertheless I am not convinced such a use case would really deserve
 complications on the memcg code side.

I think such a change is okay.  The memcg semantics of moving threads
with the same mm into separate groups have always been arbitrary.  No
reasonable behavior can be expected of this, so what sane real life
usecase would rely on it?

 @@ -104,7 +105,12 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
   bool match = false;
  
   rcu_read_lock();
 - task_memcg = mem_cgroup_from_task(rcu_dereference(mm-owner));
 + /*
 +  * rcu_dereference would be better but mem_cgroup is not a complete
 +  * type here
 +  */
 + task_memcg = READ_ONCE(mm-memcg);
 + smp_read_barrier_depends();
   if (task_memcg)
   match = mem_cgroup_is_descendant(task_memcg, memcg);
   rcu_read_unlock();

This function has only one user in rmap.  If you inline it there, you
can use rcu_dereference() and get rid of the specialness  comment.

 @@ -195,6 +201,10 @@ void mem_cgroup_split_huge_fixup(struct page *head);
  #else /* CONFIG_MEMCG */
  struct mem_cgroup;
  
 +void mm_drop_memcg(struct mm_struct *mm)
 +{}
 +void mm_set_memcg(struct mm_struct *mm, struct mem_cgroup *memcg)
 +{}

static inline?

 @@ -292,94 +292,6 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct 
 task_struct *parent)
   }
  }
  
 -#ifdef CONFIG_MEMCG
 -/*
 - * A task is exiting.   If it owned this mm, find a new owner for the mm.
 - */
 -void mm_update_next_owner(struct mm_struct *mm)
 -{
 - struct task_struct *c, *g, *p = current;
 -
 -retry:
 - /*
 -  * If the exiting or execing task is not the owner, it's
 -  * someone else's problem.
 -  */
 - if (mm-owner != p)
 - return;
 - /*
 -  * The current owner is exiting/execing and there are no other
 -  * candidates.  Do not leave the mm pointing to a possibly
 -  * freed task structure.
 -  */
 - if (atomic_read(mm-mm_users) = 1) {
 - mm-owner = NULL;
 - return;
 - }
 -
 - read_lock(tasklist_lock);
 - /*
 -  * Search in the children
 -  */
 - list_for_each_entry(c, p-children, sibling) {
 - if (c-mm == mm)
 - goto assign_new_owner;
 - }
 -
 - /*
 -  * Search in the siblings
 -  */
 - list_for_each_entry(c, p-real_parent-children, sibling) {
 - if (c-mm == mm)
 - goto assign_new_owner;
 - }
 -
 - /*
 -  * Search through everything else, we should not get here often.
 -  */
 - for_each_process(g) {
 - if (g-flags  PF_KTHREAD)
 - continue;
 - for_each_thread(g, c) {
 - if (c-mm == mm)
 - goto assign_new_owner;
 - if (c-mm)
 - break;
 - }
 - }
 - read_unlock(tasklist_lock);
 - /*
 -  * We found no owner yet mm_users  1: this implies that we are
 -  * most likely racing with swapoff (try_to_unuse()) or /proc or
 -  * ptrace or page migration (get_task_mm()).  Mark owner as NULL.
 -  */
 - mm-owner = NULL;
 - return;
 -
 -assign_new_owner:
 - BUG_ON(c == p);
 - get_task_struct(c);
 - /*
 -  * The task_lock protects c-mm from changing.
 -  * We always want mm-owner-mm == mm
 -  */
 - task_lock(c);
 - /*
 -  * Delay read_unlock() till we have the task_lock()
 -  * to ensure that c does not slip away underneath us
 -  */
 - read_unlock(tasklist_lock);
 - if (c-mm != mm) {
 - task_unlock(c);
 - put_task_struct(c);
 - goto retry;
 - }
 - mm-owner = c;
 - task_unlock(c);
 - put_task_struct(c);
 -}
 -#endif /* CONFIG_MEMCG */

w00t!

 @@ -469,6 +469,46 @@ static inline struct mem_cgroup 
 *mem_cgroup_from_id(unsigned short id)
   return mem_cgroup_from_css(css);
  }
  
 +static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
 +{
 + if (!p-mm)
 + return NULL;
 + return rcu_dereference(p-mm-memcg);
 +}
 +
 +void mm_set_memcg(struct mm_struct *mm, 

Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-26 Thread Johannes Weiner
On Tue, May 26, 2015 at 05:11:49PM +0200, Michal Hocko wrote:
 On Tue 26-05-15 10:10:11, Johannes Weiner wrote:
  On Tue, May 26, 2015 at 01:50:06PM +0200, Michal Hocko wrote:
   @@ -104,7 +105,12 @@ static inline bool mm_match_cgroup(struct mm_struct 
   *mm,
 bool match = false;

 rcu_read_lock();
   - task_memcg = mem_cgroup_from_task(rcu_dereference(mm-owner));
   + /*
   +  * rcu_dereference would be better but mem_cgroup is not a complete
   +  * type here
   +  */
   + task_memcg = READ_ONCE(mm-memcg);
   + smp_read_barrier_depends();
 if (task_memcg)
 match = mem_cgroup_is_descendant(task_memcg, memcg);
 rcu_read_unlock();
  
  This function has only one user in rmap.  If you inline it there, you
  can use rcu_dereference() and get rid of the specialness  comment.
 
 I am not sure I understand. struct mem_cgroup is defined in
 mm/memcontrol.c so mm/rmap.c will not see it. Or do you suggest pulling
 struct mem_cgroup out into a header with all the dependencies?

Yes, I think that would be preferrable.  It's weird that we have such
a major data structure that is used all over the mm-code but only in
the shape of pointers to an incomplete type.  It forces a bad style of
code that uses uninlinable callbacks and accessors for even the most
basic things.  There are a few functions in memcontrol.c that could
instead be static inlines or should even be implemented as part of the
code that is using them, such as mem_cgroup_get_lru_size(),
mem_cgroup_is_descendant, mem_cgroup_inactive_anon_is_low(),
mem_cgroup_lruvec_online(), mem_cgroup_swappiness(),
mem_cgroup_select_victim_node(), mem_cgroup_update_page_stat(), and
mem_cgroup_events().  Your new functions fall into the same category.

 @@ -486,29 +486,13 @@ void mm_set_memcg(struct mm_struct *mm, struct 
 mem_cgroup *memcg)
  void mm_drop_memcg(struct mm_struct *mm)
  {
   /*
 -  * This is the last reference to mm so nobody can see
 -  * this memcg
 +  * We could reset mm-memcg, but the mm goes away as this is the
 +  * last reference.
*/
   if (mm-memcg)
   css_put(mm-memcg-css);
  }

This function is supposed to be an API call to disassociate a mm from
its memcg, but it actually doesn't do that and will leave a dangling
pointer based on assumptions it makes about how and when the caller
invokes it.  That's bad.  It's a subtle optimization with dependencies
spread across two moving parts.  The result is very fragile code which
will break things in non-obvious ways when the caller changes later on.

And what's left standing is silly too: a memcg-specific API to call
css_put(), even though struct cgroup_subsys_state and css_put() are
public API already.

Both these things are a negative side effect of struct mem_cgroup
being semi-private.  Memcg pointers are everywhere, yet we need a
public interface indirection for every simple dereference.

 @@ -5252,10 +5236,15 @@ static void mem_cgroup_move_task(struct 
 cgroup_subsys_state *css,
  
   if (mm) {
   /*
 -  * Commit to a new memcg. mc.to points to the destination
 -  * memcg even when the current charges are not moved.
 +  * Commit to the target memcg even when we do not move
 +  * charges.
*/
 - mm_move_memcg(mm, mc.to);
 + struct mem_cgroup *old_memcg = READ_ONCE(mm-memcg);
 + struct mem_cgroup *new_memcg = mem_cgroup_from_css(css);
 +
 + mm_set_memcg(mm, new_memcg);
 + if (old_memcg)
 + css_put(old_memcg-css);

Commit is a problematic choice of words because of its existing
meaning in memcg of associating a page with a pre-reserved charge.

I'm not sure a comment is actually necessary here.  Reassigning
mm-memcg when moving a process pretty straight forward IMO.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-26 Thread Michal Hocko
On Tue 26-05-15 18:36:46, Oleg Nesterov wrote:
 On 05/26, Michal Hocko wrote:
 
  @@ -426,17 +426,7 @@ struct mm_struct {
  struct kioctx_table __rcu   *ioctx_table;
   #endif
   #ifdef CONFIG_MEMCG
  -   /*
  -* owner points to a task that is regarded as the canonical
  -* user/owner of this mm. All of the following must be true in
  -* order for it to be changed:
  -*
  -* current == mm-owner
  -* current-mm != mm
  -* new_owner-mm == mm
  -* new_owner-alloc_lock is held
  -*/
  -   struct task_struct __rcu *owner;
  +   struct mem_cgroup __rcu *memcg;
 
 Yes, thanks, this is what I tried to suggest ;)
 
 But I can't review this series. Simply because I know nothing about
 memcs. I don't even know how to use it.
 
 Just one question,
 
  +static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
  +{
  +   if (!p-mm)
  +   return NULL;
  +   return rcu_dereference(p-mm-memcg);
  +}
 
 Probably I missed something, but it seems that the callers do not
 expect it can return NULL.

This hasn't changed by this patch. mem_cgroup_from_task was allowed to
return NULL even before. I've just made it static because it doesn't
have any external users anymore. I will double check whether we can ever
get NULL there in the real life. We have this code like that for quite
some time. Maybe this is just a heritage from the past...

 Perhaps sock_update_memcg() is fine, but
 task_in_mem_cgroup() calls it when find_lock_task_mm() fails, and in
 this case -mm is NULL.
 
 And in fact I can't understand what mem_cgroup_from_task() actually
 means, with or without these changes.

It performs task_struct-mem_cgroup mapping. We cannot use cgroup
mapping here because the charges are bound to mm_struct rather than
task.

 And another question. I can't understand what happens when a task
 execs... IOW, could you confirm that exec_mmap() does not need
 mm_set_memcg(mm, oldmm-memcg) ?

Right you are! Fixed thanks!
---
diff --git a/fs/exec.c b/fs/exec.c
index 2cd4def4b1d6..ea00d5a47aad 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -867,6 +867,7 @@ static int exec_mmap(struct mm_struct *mm)
up_read(old_mm-mmap_sem);
BUG_ON(active_mm != old_mm);
setmax_mm_hiwater_rss(tsk-signal-maxrss, old_mm);
+   mm_set_memcg(mm, old_mm-memcg);
mmput(old_mm);
return 0;
}
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 3/3] memcg: get rid of mm_struct::owner

2015-05-26 Thread Oleg Nesterov
On 05/26, Michal Hocko wrote:

 On Tue 26-05-15 18:36:46, Oleg Nesterov wrote:
 
   +static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
   +{
   + if (!p-mm)
   + return NULL;
   + return rcu_dereference(p-mm-memcg);
   +}
 
  Probably I missed something, but it seems that the callers do not
  expect it can return NULL.

 This hasn't changed by this patch. mem_cgroup_from_task was allowed to
 return NULL even before. I've just made it static because it doesn't
 have any external users anymore.

I see, but it could only return NULL if mem_cgroup_from_css() returns
NULL. Now it returns NULL for sure if the caller is task_in_mem_cgroup(),

// called when task-mm == NULL

task_memcg = mem_cgroup_from_task(task);
css_get(task_memcg-css);

and this css_get() doesn't look nice if task_memcg == NULL ;)

 I will double check

Yes, please. Perhaps I missed something.

  And in fact I can't understand what mem_cgroup_from_task() actually
  means, with or without these changes.

 It performs task_struct-mem_cgroup mapping. We cannot use cgroup
 mapping here because the charges are bound to mm_struct rather than
 task.

Sure, this is what I can understand. I meant... OK, lets ignore
without these changes, because without these changes there are
much more oddities ;) With these changes only -mm == NULL case
looks unclear.

And btw,

if (!p-mm)
return NULL;
return rcu_dereference(p-mm-memcg);

perhaps this needs a comment. It is not clear what protects -mm.
But. After this series p is always current (if -mm != NULL), so
this is fine.

Nevermind. Please forget. I feel this needs a bit of cleanup, but
we can always do this later.

Oleg.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/