Re: [PATCH] memcg: stop warning on memcg_propagate_kmem

2013-02-04 Thread Lord Glauber Costa of Sealand
On 02/04/2013 12:36 PM, Michal Hocko wrote:
> On Mon 04-02-13 12:04:06, Glauber Costa wrote:
>> On 02/04/2013 11:57 AM, Michal Hocko wrote:
>>> On Sun 03-02-13 20:29:01, Hugh Dickins wrote:
>>>> Whilst I run the risk of a flogging for disloyalty to the Lord of Sealand,
>>>> I do have CONFIG_MEMCG=y CONFIG_MEMCG_KMEM not set, and grow tired of the
>>>> "mm/memcontrol.c:4972:12: warning: `memcg_propagate_kmem' defined but not
>>>> used [-Wunused-function]" seen in 3.8-rc: move the #ifdef outwards.
>>>>
>>>> Signed-off-by: Hugh Dickins 
>>>
>>> Acked-by: Michal Hocko 
>>>
>>> Hmm, if you are not too tired then moving the function downwards to
>>> where it is called (memcg_init_kmem) will reduce the number of ifdefs.
>>> But this can wait for a bigger clean up which is getting due:
>>> git grep "def.*CONFIG_MEMCG_KMEM" mm/memcontrol.c | wc -l
>>> 12
>>>
>>
>> The problem is that I was usually keeping things in clearly separated
>> blocks, like this :
>>
>> #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
>> struct tcp_memcontrol tcp_mem;
>> #endif
>> #if defined(CONFIG_MEMCG_KMEM)
>> /* analogous to slab_common's slab_caches list. per-memcg */
>> struct list_head memcg_slab_caches;
>> /* Not a spinlock, we can take a lot of time walking the list */
>> struct mutex slab_caches_mutex;
>> /* Index in the kmem_cache->memcg_params->memcg_caches array */
>> int kmemcg_id;
>> #endif
>>
>> If it would be preferable to everybody, this could be easily rewritten as:
>>
>> #if defined(CONFIG_MEMCG_KMEM)
>> #if defined(CONFIG_INET)
>> struct tcp_memcontrol tcp_mem;
>> #endif
>> /* analogous to slab_common's slab_caches list. per-memcg */
>> struct list_head memcg_slab_caches;
>> /* Not a spinlock, we can take a lot of time walking the list */
>> struct mutex slab_caches_mutex;
>> /* Index in the kmem_cache->memcg_params->memcg_caches array */
>> int kmemcg_id;
>> #endif
> 
> I was rather interested in reducing the CONFIG_MEMCG_KMEM block; the above
> example doesn't bother me that much.
>  
>> This would allow us to collapse some blocks a bit down as well.
>>
>> It doesn't bother me *that* much, though.
> 
> Yes and a quick attempt shows that a clean up would bring a lot of
> churn.
> 
And some of that churn is because there are circular dependencies, so we
would have to start adding forward declarations here and there to make it
all work. That is part of the reason why I kept the blocks separate.


Re: [PATCH] memcg: stop warning on memcg_propagate_kmem

2013-02-04 Thread Lord Glauber Costa of Sealand
On 02/04/2013 11:57 AM, Michal Hocko wrote:
> On Sun 03-02-13 20:29:01, Hugh Dickins wrote:
>> Whilst I run the risk of a flogging for disloyalty to the Lord of Sealand,
>> I do have CONFIG_MEMCG=y CONFIG_MEMCG_KMEM not set, and grow tired of the
>> "mm/memcontrol.c:4972:12: warning: `memcg_propagate_kmem' defined but not
>> used [-Wunused-function]" seen in 3.8-rc: move the #ifdef outwards.
>>
>> Signed-off-by: Hugh Dickins 
> 
> Acked-by: Michal Hocko 
> 
> Hmm, if you are not too tired then moving the function downwards to
> where it is called (memcg_init_kmem) will reduce the number of ifdefs.
> But this can wait for a bigger clean up which is getting due:
> git grep "def.*CONFIG_MEMCG_KMEM" mm/memcontrol.c | wc -l
> 12
> 

The problem is that I was usually keeping things in clearly separated
blocks, like this :

#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
struct tcp_memcontrol tcp_mem;
#endif
#if defined(CONFIG_MEMCG_KMEM)
/* analogous to slab_common's slab_caches list. per-memcg */
struct list_head memcg_slab_caches;
/* Not a spinlock, we can take a lot of time walking the list */
struct mutex slab_caches_mutex;
/* Index in the kmem_cache->memcg_params->memcg_caches array */
int kmemcg_id;
#endif

If it would be preferable to everybody, this could be easily rewritten as:

#if defined(CONFIG_MEMCG_KMEM)
#if defined(CONFIG_INET)
struct tcp_memcontrol tcp_mem;
#endif
/* analogous to slab_common's slab_caches list. per-memcg */
struct list_head memcg_slab_caches;
/* Not a spinlock, we can take a lot of time walking the list */
struct mutex slab_caches_mutex;
/* Index in the kmem_cache->memcg_params->memcg_caches array */
int kmemcg_id;
#endif

This would allow us to collapse some blocks a bit down as well.

It doesn't bother me *that* much, though.

> Thanks
>> ---
>>
>>  mm/memcontrol.c |4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> --- 3.8-rc6/mm/memcontrol.c  2012-12-22 09:43:27.628015582 -0800
>> +++ linux/mm/memcontrol.c  2013-02-02 16:56:06.188325771 -0800
>> @@ -4969,6 +4969,7 @@ out:
>>  return ret;
>>  }
>>  
>> +#ifdef CONFIG_MEMCG_KMEM
>>  static int memcg_propagate_kmem(struct mem_cgroup *memcg)
>>  {
>>  int ret = 0;
>> @@ -4977,7 +4978,6 @@ static int memcg_propagate_kmem(struct m
>>  goto out;
>>  
>>  memcg->kmem_account_flags = parent->kmem_account_flags;
>> -#ifdef CONFIG_MEMCG_KMEM
>>  /*
>>   * When that happen, we need to disable the static branch only on those
>>   * memcgs that enabled it. To achieve this, we would be forced to
>> @@ -5003,10 +5003,10 @@ static int memcg_propagate_kmem(struct m
>>  mutex_lock(&set_limit_mutex);
>>  ret = memcg_update_cache_sizes(memcg);
>>  mutex_unlock(&set_limit_mutex);
>> -#endif
>>  out:
>>  return ret;
>>  }
>> +#endif /* CONFIG_MEMCG_KMEM */
>>  
>>  /*
>>   * The user of this function is...
> 


Re: [PATCH] memcg: stop warning on memcg_propagate_kmem

2013-02-03 Thread Lord Glauber Costa of Sealand
On 02/04/2013 08:29 AM, Hugh Dickins wrote:
> Whilst I run the risk of a flogging for disloyalty to the Lord of Sealand,
> I do have CONFIG_MEMCG=y CONFIG_MEMCG_KMEM not set, and grow tired of the
> "mm/memcontrol.c:4972:12: warning: `memcg_propagate_kmem' defined but not
> used [-Wunused-function]" seen in 3.8-rc: move the #ifdef outwards.
> 

Thanks my dear Hugh,

This is no disloyalty at all, and your braveness is indeed much appreciated.

My bad for letting that one slip


Re: [PATCHv2 8/9] zswap: add to mm/

2013-01-29 Thread Lord Glauber Costa of Sealand
On 01/28/2013 07:27 PM, Seth Jennings wrote:
> Yes, I prototyped a shrinker interface for zswap, but, as we both
> figured, it shrinks the zswap compressed pool too aggressively to the
> point of being useless.
Can't you advertise a smaller number of objects than you actually have?

Since the shrinker would never try to shrink more objects than you
advertised, you could control pressure this way.
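
To make the idea concrete, here is a rough sketch against the shrinker
interface of the time (a single ->shrink() callback): when it is called with
sc->nr_to_scan == 0 it only has to report a count, so reporting a fraction of
what is really stored caps how much the VM will later ask to be evicted.
zswap_stored_pages and zswap_evict_pages() are hypothetical names used only
for illustration, not something from the posted series:

static int zswap_shrink(struct shrinker *s, struct shrink_control *sc)
{
        unsigned long stored = atomic_read(&zswap_stored_pages);

        /* count pass: advertise only a fraction of what we really hold */
        if (!sc->nr_to_scan)
                return stored >> 3;

        /* scan pass: evict and report how many freeable objects remain */
        return zswap_evict_pages(sc->nr_to_scan);
}

static struct shrinker zswap_shrinker = {
        .shrink = zswap_shrink,
        .seeks  = DEFAULT_SEEKS,
};
/* registered with register_shrinker(&zswap_shrinker) at init time */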


Re: [PATCH review 3/6] userns: Recommend use of memory control groups.

2013-01-28 Thread Lord Glauber Costa of Sealand
On 01/28/2013 08:19 PM, Eric W. Biederman wrote:
> Lord Glauber Costa of Sealand  writes:
> 
>> On 01/28/2013 12:14 PM, Eric W. Biederman wrote:
>>> Lord Glauber Costa of Sealand  writes:
>>>
>>>> I just saw in a later patch of yours that your concern here seems not
>>>> limited to RAM backed by tmpfs, but extends to things like the internal
>>>> structures for userns, to avoid patterns of the form: 'for (;;)
>>>> unshare(...)'
>>>>
>>>> Humm, it does seem sensible. The kernel memory controller aims to
>>>> prevent exactly things like that. But they all existed already before
>>>> userns: there are destructive patterns like that with sockets, dentries,
>>>> processes, and pretty much every other resource in the kernel. So
>>>> although the recommendation per se makes sense, I am wondering whether
>>>> it is worth mentioning anything in the user_ns config?
>>>
>>> The config might be overkill.  However I have already gotten bug reports
>>> about there being no limits.
>>>
>>> So someone needs to stop and connect the dots and say: 
>> Absolutely, and I am all for it
>>
>>> "If you care this is what you can do." 
>>
>> How about we say it, then?
>>
>> The current text is quite cryptic in this respect, in the sense that it
>> doesn't give ordinary people enough information about what problems are
>> involved.
>>
>> Of course, maybe the Kconfig text is not the best place for having all
>> the info: but don't we have some place in Documentation/ where we could
>> put this, and then refer people there from Kconfig?
> 
> At this point I have written the best text I can.
> 
> Please feel free to look at my tree at:
> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git 
> for-next
> 
Will do soon, thanks for your effort.

> and send me a patch on top of that to improve the wording.
> 
> At this point I have done my best to connect the dots for people who
> care, that the memory control group is what they need to limit what
> people can do with user namespaces.
> 
> My hope is that there is at least a passing mention in the next user
> namespace article on lwn.
It would definitely be helpful. Let's hope someone from there is reading! =)

> 
> For two pieces of software that were designed to complement each other
> I find it a bit surprising how many people (including myself) need the
> connection made that memory control groups and user namespaces should go
> together.

Well, I've said many times here that I am less than satisfied with the
fact that the connection between namespaces and cgroups is so loose.
There are many situations, like virtualizing the proc files and friends,
where I believe we could benefit from knowing whether or not cgroups and
namespaces are being used at the same time.

But since, after considering a lot of alternatives, I could never come up
with one that was really clean, I guess just communicating it extensively
is the best we can do so far.


[PATCH] cfq: fix lock imbalance with failed allocations

2013-01-28 Thread Lord Glauber Costa of Sealand
From: Glauber Costa 

While stress-running very-small container scenarios with the Kernel
Memory Controller, I've run into a lockdep-detected lock imbalance in
cfq-iosched.c.

I'll apologize beforehand for not posting a backlog: I didn't anticipate
it would be so hard to reproduce, so I didn't save my serial output and
went directly to debugging. It turns out that it did not happen again in
more than 20 runs, making it quite a rare pattern.

But here is my analysis:

When we are in very low-memory situations, we will arrive at
cfq_find_alloc_queue and may not find a queue, having to resort to the
oom queue, in an rcu-locked condition:

  if (!cfqq || cfqq == &cfqd->oom_cfqq)
  [ ... ]

Next, we will release the rcu lock, and try to allocate a queue,
retrying if we succeed:

  rcu_read_unlock();
  spin_unlock_irq(cfqd->queue->queue_lock);
  new_cfqq = kmem_cache_alloc_node(cfq_pool,
  gfp_mask | __GFP_ZERO,
  cfqd->queue->node);
   spin_lock_irq(cfqd->queue->queue_lock);
   if (new_cfqq)
   goto retry;

We are unlocked at this point, but it should be fine, since we will
reacquire the rcu_read_lock when we retry.

Except of course, that we may not retry: the allocation may very well
fail and we'll keep on going through the flow:

The next branch is:

if (cfqq) {
[ ... ]
} else
cfqq = &cfqd->oom_cfqq;

And right before exiting, we'll issue rcu_read_unlock().

Being already unlocked, this is the likely source of our imbalance.
Since cfqq is either already NULL or made NULL in the first statement of
the outer branch, the only viable alternative here seems to be to
return the oom queue right away in case of allocation failure.

Please review the following patch and apply if you agree with my
analysis.

Signed-off-by: Glauber Costa 
Cc: Jens Axboe 
Cc: Andrew Morton 
Cc: Tejun Heo 
---
 block/cfq-iosched.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index fb52df9..d52437a 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3205,6 +3205,8 @@ retry:
spin_lock_irq(cfqd->queue->queue_lock);
if (new_cfqq)
goto retry;
+   else
+   return &cfqd->oom_cfqq;
} else {
cfqq = kmem_cache_alloc_node(cfq_pool,
gfp_mask | __GFP_ZERO,
-- 
1.8.1


Re: [PATCH review 2/6] userns: Allow any uid or gid mappings that don't overlap.

2013-01-28 Thread Lord Glauber Costa of Sealand
Hello Mr. Someone.

On 01/28/2013 06:28 PM, Aristeu Rozanski wrote:
> On Fri, Jan 25, 2013 at 06:21:00PM -0800, Eric W. Biederman wrote:
>> When I initially wrote the code for /proc/<pid>/uid_map.  I was lazy
>> and avoided duplicate mappings by the simple expedient of ensuring the
>> first number in a new extent was greater than any number in the
>> previous extent.
>>
>> Unfortunately that precludes a number of valid mappings, and someone
>> noticed and complained.  So use a simple check to ensure that ranges
>> in the mapping extents don't overlap.
> 
> Acked-by: Someone 
> 

Documentation/SubmittingPatches:

"then you just add a line saying

Signed-off-by: Random J Developer 

using your real name (sorry, no pseudonyms or anonymous contributions.)"

I know how it feels, but that is how it goes. You'll have to change that.


Re: [PATCH review 3/6] userns: Recommend use of memory control groups.

2013-01-28 Thread Lord Glauber Costa of Sealand
On 01/28/2013 12:14 PM, Eric W. Biederman wrote:
> Lord Glauber Costa of Sealand  writes:
> 
>> I just saw in a later patch of yours that your concern here seems not
>> limited to RAM backed by tmpfs, but extends to things like the internal
>> structures for userns, to avoid patterns of the form: 'for (;;)
>> unshare(...)'
>>
>> Humm, it does seem sensible. The kernel memory controller aims to
>> prevent exactly things like that. But they all existed already before
>> userns: there are destructive patterns like that with sockets, dentries,
>> processes, and pretty much every other resource in the kernel. So
>> although the recommendation per se makes sense, I am wondering whether
>> it is worth mentioning anything in the user_ns config?
> 
> The config might be overkill.  However I have already gotten bug reports
> about there being no limits.
> 
> So someone needs to stop and connect the dots and say: 
Absolutely, and I am all for it

> "If you care this is what you can do." 

How about we say it, then?

The current text is quite cryptic in this respect, in the sense that it
doesn't give ordinary people enough information about what problems are
involved.

Of course, maybe the Kconfig text is not the best place for having all
the info: but don't we have some place in Documentation/ where we could
put this, and then refer people there from Kconfig?


Re: [PATCH review 3/6] userns: Recommend use of memory control groups.

2013-01-27 Thread Lord Glauber Costa of Sealand
On 01/28/2013 11:37 AM, Lord Glauber Costa of Sealand wrote:
> On 01/26/2013 06:22 AM, Eric W. Biederman wrote:
>>
>> In the help text describing user namespaces recommend use of memory
>> control groups.  In many cases memory control groups are the only
>> mechanism there is to limit how much memory a user who can create
>> user namespaces can use.
>>
>> Signed-off-by: "Eric W. Biederman" 
>> ---
>>  Documentation/namespaces/resource-control.txt |   10 ++
>>  init/Kconfig  |7 +++
>>  2 files changed, 17 insertions(+), 0 deletions(-)
>>  create mode 100644 Documentation/namespaces/resource-control.txt
>>
>> diff --git a/Documentation/namespaces/resource-control.txt 
>> b/Documentation/namespaces/resource-control.txt
>> new file mode 100644
>> index 000..3d8178a
>> --- /dev/null
>> +++ b/Documentation/namespaces/resource-control.txt
>> @@ -0,0 +1,10 @@
>> +There are a lot of kinds of objects in the kernel that don't have
>> +individual limits or that have limits that are ineffective when a set
>> +of processes is allowed to switch user ids.  With user namespaces
>> +enabled in a kernel for people who don't trust their users or their
>> +users programs to play nice this problems becomes more acute.
>> +
>> +Therefore it is recommended that memory control groups be enabled in
>> +kernels that enable user namespaces, and it is further recommended
>> +that userspace configure memory control groups to limit how much
>> +memory users they don't trust to play nice can use.
>> diff --git a/init/Kconfig b/init/Kconfig
>> index 7d30240..c8c58bd 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -1035,6 +1035,13 @@ config USER_NS
>>  help
>>This allows containers, i.e. vservers, to use user namespaces
>>to provide different user info for different servers.
>> +
>> +  When user namespaces are enabled in the kernel it is
>> +  recommended that the MEMCG and MEMCG_KMEM options also be
>> +  enabled and that user-space use the memory control groups to
>> +  limit the amount of memory a memory unprivileged users can
>> +  use.
>> +
>>If unsure, say N.
> 
> Since this becomes an official recommendation that people will likely
> follow, are we really that much concerned about the types of abuses the
> MEMCG_KMEM will prevent? Those are mostly metadata-based abuses users
> could do in their own local disks without mounting anything extra (and
> things that look like that)
> 
> Unless there is a specific concern here, shouldn't we say "... that the
> MEMCG (and possibly MEMCG_KMEM) options..." ?
> 
> 
I just saw in a later patch of yours that your concern here seems not
limited to RAM backed by tmpfs, but extends to things like the internal
structures for userns, to avoid patterns of the form: 'for (;;)
unshare(...)'

Humm, it does seem sensible. The kernel memory controller aims to
prevent exactly things like that. But they all existed already before
userns: there are destructive patterns like that with sockets, dentries,
processes, and pretty much every other resource in the kernel. So
although the recommendation per se makes sense, I am wondering whether
it is worth mentioning anything in the user_ns config?





Re: [PATCH review 3/6] userns: Recommend use of memory control groups.

2013-01-27 Thread Lord Glauber Costa of Sealand
On 01/26/2013 06:22 AM, Eric W. Biederman wrote:
> 
> In the help text describing user namespaces recommend use of memory
> control groups.  In many cases memory control groups are the only
> mechanism there is to limit how much memory a user who can create
> user namespaces can use.
> 
> Signed-off-by: "Eric W. Biederman" 
> ---
>  Documentation/namespaces/resource-control.txt |   10 ++
>  init/Kconfig  |7 +++
>  2 files changed, 17 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/namespaces/resource-control.txt
> 
> diff --git a/Documentation/namespaces/resource-control.txt 
> b/Documentation/namespaces/resource-control.txt
> new file mode 100644
> index 000..3d8178a
> --- /dev/null
> +++ b/Documentation/namespaces/resource-control.txt
> @@ -0,0 +1,10 @@
> +There are a lot of kinds of objects in the kernel that don't have
> +individual limits or that have limits that are ineffective when a set
> +of processes is allowed to switch user ids.  With user namespaces
> +enabled in a kernel for people who don't trust their users or their
> +users programs to play nice this problems becomes more acute.
> +
> +Therefore it is recommended that memory control groups be enabled in
> +kernels that enable user namespaces, and it is further recommended
> +that userspace configure memory control groups to limit how much
> +memory users they don't trust to play nice can use.
> diff --git a/init/Kconfig b/init/Kconfig
> index 7d30240..c8c58bd 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1035,6 +1035,13 @@ config USER_NS
>   help
> This allows containers, i.e. vservers, to use user namespaces
> to provide different user info for different servers.
> +
> +   When user namespaces are enabled in the kernel it is
> +   recommended that the MEMCG and MEMCG_KMEM options also be
> +   enabled and that user-space use the memory control groups to
> +   limit the amount of memory a memory unprivileged users can
> +   use.
> +
> If unsure, say N.

Since this becomes an official recommendation that people will likely
follow, are we really that much concerned about the types of abuses the
MEMCG_KMEM will prevent? Those are mostly metadata-based abuses users
could do in their own local disks without mounting anything extra (and
things that look like that)

Unless there is a specific concern here, shouldn't we say "... that the
MEMCG (and possibly MEMCG_KMEM) options..." ?
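
(For concreteness, and only as an illustration: the setup the recommendation
has in mind is basically a per-user memcg with a hard limit. The paths, group
name and 512M value below are examples; the plain limit only needs MEMCG,
while MEMCG_KMEM would add a separate memory.kmem.limit_in_bytes knob on top.)

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch: confine an untrusted user's session to a hard-limited memcg. */
static void confine_session(pid_t session_pid)
{
        int fd;

        mkdir("/sys/fs/cgroup/memory/untrusted", 0755);

        fd = open("/sys/fs/cgroup/memory/untrusted/memory.limit_in_bytes", O_WRONLY);
        if (fd >= 0) {
                write(fd, "512M", 4);          /* hard cap for the whole group */
                close(fd);
        }

        fd = open("/sys/fs/cgroup/memory/untrusted/tasks", O_WRONLY);
        if (fd >= 0) {
                dprintf(fd, "%d\n", session_pid);  /* move the session in */
                close(fd);
        }
}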



[PATCH v6 06/12] cpuacct: don't actually do anything.

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Glauber Costa 

All the information we have that is needed for cpuusage (and
cpuusage_percpu) is present in schedstats. It is already recorded
in a sane hierarchical way.

If we have CONFIG_SCHEDSTATS, we don't really need to do any extra
work. All former functions become empty inlines.

Signed-off-by: Glauber Costa 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
Cc: Kay Sievers 
Cc: Lennart Poettering 
Cc: Dave Jones 
Cc: Ben Hutchings 
Cc: Paul Turner 
---
 kernel/sched/core.c  | 102 ++-
 kernel/sched/sched.h |  10 +++--
 2 files changed, 90 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a62b771..f8a9acf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7267,6 +7267,7 @@ void sched_move_task(struct task_struct *tsk)
task_rq_unlock(rq, tsk, &flags);
 }
 
+#ifndef CONFIG_SCHEDSTATS
 void task_group_charge(struct task_struct *tsk, u64 cputime)
 {
struct task_group *tg;
@@ -7284,6 +7285,7 @@ void task_group_charge(struct task_struct *tsk, u64 
cputime)
 
rcu_read_unlock();
 }
+#endif
 #endif /* CONFIG_CGROUP_SCHED */
 
 #if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_CFS_BANDWIDTH)
@@ -7640,22 +7642,92 @@ cpu_cgroup_exit(struct cgroup *cgrp, struct cgroup 
*old_cgrp,
sched_move_task(task);
 }
 
-static u64 task_group_cpuusage_read(struct task_group *tg, int cpu)
+/*
+ * Take rq->lock to make 64-bit write safe on 32-bit platforms.
+ */
+static inline void lock_rq_dword(int cpu)
 {
-   u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
-   u64 data;
-
 #ifndef CONFIG_64BIT
-   /*
-* Take rq->lock to make 64-bit read safe on 32-bit platforms.
-*/
raw_spin_lock_irq(&cpu_rq(cpu)->lock);
-   data = *cpuusage;
+#endif
+}
+
+static inline void unlock_rq_dword(int cpu)
+{
+#ifndef CONFIG_64BIT
raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+#endif
+}
+
+#ifdef CONFIG_SCHEDSTATS
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline u64 cfs_exec_clock(struct task_group *tg, int cpu)
+{
+   return tg->cfs_rq[cpu]->exec_clock - tg->cfs_rq[cpu]->prev_exec_clock;
+}
+
+static inline void cfs_exec_clock_reset(struct task_group *tg, int cpu)
+{
+   tg->cfs_rq[cpu]->prev_exec_clock = tg->cfs_rq[cpu]->exec_clock;
+}
 #else
-   data = *cpuusage;
+static inline u64 cfs_exec_clock(struct task_group *tg, int cpu)
+{
+}
+
+static inline void cfs_exec_clock_reset(struct task_group *tg, int cpu)
+{
+}
+#endif
+#ifdef CONFIG_RT_GROUP_SCHED
+static inline u64 rt_exec_clock(struct task_group *tg, int cpu)
+{
+   return tg->rt_rq[cpu]->exec_clock - tg->rt_rq[cpu]->prev_exec_clock;
+}
+
+static inline void rt_exec_clock_reset(struct task_group *tg, int cpu)
+{
+   tg->rt_rq[cpu]->prev_exec_clock = tg->rt_rq[cpu]->exec_clock;
+}
+#else
+static inline u64 rt_exec_clock(struct task_group *tg, int cpu)
+{
+   return 0;
+}
+
+static inline void rt_exec_clock_reset(struct task_group *tg, int cpu)
+{
+}
 #endif
 
+static u64 task_group_cpuusage_read(struct task_group *tg, int cpu)
+{
+   u64 ret = 0;
+
+   lock_rq_dword(cpu);
+   ret = cfs_exec_clock(tg, cpu) + rt_exec_clock(tg, cpu);
+   unlock_rq_dword(cpu);
+
+   return ret;
+}
+
+static void task_group_cpuusage_write(struct task_group *tg, int cpu, u64 val)
+{
+   lock_rq_dword(cpu);
+   cfs_exec_clock_reset(tg, cpu);
+   rt_exec_clock_reset(tg, cpu);
+   unlock_rq_dword(cpu);
+}
+#else
+static u64 task_group_cpuusage_read(struct task_group *tg, int cpu)
+{
+   u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
+   u64 data;
+
+   lock_rq_dword(cpu);
+   data = *cpuusage;
+   unlock_rq_dword(cpu);
+
return data;
 }
 
@@ -7663,17 +7735,11 @@ static void task_group_cpuusage_write(struct task_group 
*tg, int cpu, u64 val)
 {
u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
 
-#ifndef CONFIG_64BIT
-   /*
-* Take rq->lock to make 64-bit write safe on 32-bit platforms.
-*/
-   raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+   lock_rq_dword(cpu);
*cpuusage = val;
-   raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
-   *cpuusage = val;
-#endif
+   unlock_rq_dword(cpu);
 }
+#endif
 
 /* return total cpu usage (in nanoseconds) of a group */
 static u64 cpucg_cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 01ca8a4..640aa14 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -597,8 +597,6 @@ static inline void set_task_rq(struct task_struct *p, 
unsigned int cpu)
 #endif
 }
 
-extern void task_group_charge(struct task_struct *tsk, u64 cputime);
-
 #else /* CONFIG_CGROUP_SCHED */
 
 static inline void set_task_rq(struct task_struct *p, unsigned int cpu) { }
@@ -606,10 +604,14 @@ static inline struct task_group *task_group(struct 
task_struct *p)
 {
return NULL;
 }
-static inline void task_group_charge(struct task_struct *tsk, u64 

[PATCH v6 04/12] cgroup, sched: deprecate cpuacct

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Tejun Heo 

Now that cpu serves the same files as cpuacct and using cpuacct
separately from cpu is deprecated, we can deprecate cpuacct.  To avoid
disturbing userland which has been co-mounting cpu and cpuacct,
implement some hackery in cgroup core so that cpuacct co-mounting
still works even if cpuacct is disabled.

The goal of this patch is to accelerate disabling and removal of
cpuacct by decoupling kernel-side deprecation from userland changes.
Userland is recommended to do the following.

* If /proc/cgroups lists cpuacct, always co-mount it with cpu under
  e.g. /sys/fs/cgroup/cpu.

* Optionally create symlinks for compatibility -
  e.g. /sys/fs/cgroup/cpuacct and /sys/fs/cgroup/cpu,cpuacct both
  pointing to /sys/fs/cgroup/cpu - whether cpuacct exists or not.

This compatibility hack will eventually go away.
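
For illustration only, the userland side of that recommendation could look
roughly like this (mount point and symlink paths are examples, nothing here
is mandated by the patch):

#include <sys/mount.h>
#include <unistd.h>

/* Co-mount cpu and cpuacct on one hierarchy and keep the old names
 * reachable via symlinks.  Error handling omitted for brevity. */
static void setup_cpu_cgroup_compat(void)
{
        mount("cgroup", "/sys/fs/cgroup/cpu", "cgroup", 0, "cpu,cpuacct");
        symlink("/sys/fs/cgroup/cpu", "/sys/fs/cgroup/cpuacct");
        symlink("/sys/fs/cgroup/cpu", "/sys/fs/cgroup/cpu,cpuacct");
}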

[ glom...@parallels.com: subsys_bits => subsys_mask ]

Signed-off-by: Tejun Heo 
Cc: Peter Zijlstra 
Cc: Glauber Costa 
Cc: Michal Hocko 
Cc: Kay Sievers 
Cc: Lennart Poettering 
Cc: Dave Jones 
Cc: Ben Hutchings 
Cc: Paul Turner 
---
 init/Kconfig| 11 ++-
 kernel/cgroup.c | 47 ++-
 kernel/sched/core.c |  2 ++
 3 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 7d30240..4e411ac 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -815,11 +815,20 @@ config PROC_PID_CPUSET
default y
 
 config CGROUP_CPUACCT
-   bool "Simple CPU accounting cgroup subsystem"
+   bool "DEPRECATED: Simple CPU accounting cgroup subsystem"
+   default n
help
  Provides a simple Resource Controller for monitoring the
  total CPU consumed by the tasks in a cgroup.
 
+ This cgroup subsystem is deprecated.  The CPU cgroup
+ subsystem serves the same accounting files and "cpuacct"
+ mount option is ignored if specified with "cpu".  As long as
+ userland co-mounts cpu and cpuacct, disabling this
+ controller should be mostly unnoticeable - one notable
+ difference is that /proc/PID/cgroup won't list cpuacct
+ anymore.
+
 config RESOURCE_COUNTERS
bool "Resource counters"
help
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 0750669d..4ddb335 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1136,6 +1136,7 @@ static int parse_cgroupfs_options(char *data, struct 
cgroup_sb_opts *opts)
unsigned long mask = (unsigned long)-1;
int i;
bool module_pin_failed = false;
+   bool cpuacct_requested = false;
 
BUG_ON(!mutex_is_locked(&cgroup_mutex));
 
@@ -1225,8 +1226,13 @@ static int parse_cgroupfs_options(char *data, struct 
cgroup_sb_opts *opts)
 
break;
}
-   if (i == CGROUP_SUBSYS_COUNT)
+   /* handle deprecated cpuacct specially, see below */
+   if (!strcmp(token, "cpuacct")) {
+   cpuacct_requested = true;
+   one_ss = true;
+   } else if (i == CGROUP_SUBSYS_COUNT) {
return -ENOENT;
+   }
}
 
/*
@@ -1253,12 +1259,29 @@ static int parse_cgroupfs_options(char *data, struct 
cgroup_sb_opts *opts)
 * this creates some discrepancies in /proc/cgroups and
 * /proc/PID/cgroup.
 *
+* Accept and ignore "cpuacct" option if comounted with "cpu" even
+* when cpuacct itself is disabled to allow quick disabling and
+* removal of cpuacct.  This will be removed eventually.
+*
 * https://lkml.org/lkml/2012/9/13/542
 */
+   if (cpuacct_requested) {
+   bool comounted = false;
+
+#if IS_ENABLED(CONFIG_CGROUP_SCHED)
+   comounted = opts->subsys_mask & (1 << cpu_cgroup_subsys_id);
+#endif
+   if (!comounted) {
+   pr_warning("cgroup: mounting cpuacct separately from 
cpu is deprecated\n");
+#if !IS_ENABLED(CONFIG_CGROUP_CPUACCT)
+   return -EINVAL;
+#endif
+   }
+   }
 #if IS_ENABLED(CONFIG_CGROUP_SCHED) && IS_ENABLED(CONFIG_CGROUP_CPUACCT)
-   if ((opts->subsys_bits & (1 << cpu_cgroup_subsys_id)) &&
-   (opts->subsys_bits & (1 << cpuacct_subsys_id)))
-   opts->subsys_bits &= ~(1 << cpuacct_subsys_id);
+   if ((opts->subsys_mask & (1 << cpu_cgroup_subsys_id)) &&
+   (opts->subsys_mask & (1 << cpuacct_subsys_id)))
+   opts->subsys_mask &= ~(1 << cpuacct_subsys_id);
 #endif
/*
 * Option noprefix was introduced just for backward compatibility
@@ -4806,6 +4829,7 @@ const struct file_operations proc_cgroup_operations = {
 /* Display information about each subsystem and each hierarchy */
 static int proc_cgroupstats_show(struct seq_file *m, void *v)
 {
+   struct cgroup_subsys *ss;
int i;
 
seq_puts(m, "#subsys_name\thierarchy\tnum_cgroups\tenabled\n");
@@ -4816,7 +4840,7 

[PATCH v6 00/12] per-cgroup cpu-stat

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Glauber Costa 

Hi all,

This is an attempt to provide userspace with enough information to reconstruct
a per-container version of files like "/proc/stat". In particular, we are
interested in knowing the per-cgroup slices of user time, system time, wait
time, number of processes, and a variety of other statistics.

This task is made more complicated by the fact that multiple controllers are
involved in collecting those statistics: cpu and cpuacct. So the first thing I
am doing here is resurrecting Tejun's patches that aim at deprecating cpuacct.

This is one of the major differences from earlier attempts: all data is provided
by the cpu controller, resulting in greater simplicity. Please note, however,
that this patchset only goes as far as deprecating it: cpuacct can still be
mounted separately from the cpu cgroup if the user so wishes.

This also tries to hook into the existing scheduler hierarchy walks instead of
providing new ones.
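
As a sketch of what the reconstruction looks like from userspace (the path
below assumes cpu is mounted at /sys/fs/cgroup/cpu and, per the patches in
this series, serves the cpuacct.* files; it is an example, not a requirement),
a tool could read a group's accumulated cpu time like this:

#include <stdio.h>

/* Return the group's total cpu usage in nanoseconds, or 0 on error. */
static unsigned long long read_group_usage_ns(const char *group)
{
        char path[256];
        unsigned long long ns = 0;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/fs/cgroup/cpu/%s/cpuacct.usage", group);
        f = fopen(path, "r");
        if (!f)
                return 0;
        if (fscanf(f, "%llu", &ns) != 1)
                ns = 0;
        fclose(f);
        return ns;
}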


Glauber Costa (8):
  don't call cpuacct_charge in stop_task.c
  sched: adjust exec_clock to use it as cpu usage metric
  cpuacct: don't actually do anything.
  sched: document the cpu cgroup.
  sched: account guest time per-cgroup as well.
  sched: record per-cgroup number of context switches
  sched: change nr_context_switches calculation.
  sched: introduce cgroup file stat_percpu

Peter Zijlstra (1):
  sched: Push put_prev_task() into pick_next_task()

Tejun Heo (3):
  cgroup: implement CFTYPE_NO_PREFIX
  cgroup, sched: let cpu serve the same files as cpuacct
  cgroup, sched: deprecate cpuacct

 Documentation/cgroups/cpu.txt | 100 +++
 include/linux/cgroup.h|   1 +
 include/linux/sched.h |   8 +-
 init/Kconfig  |  11 +-
 kernel/cgroup.c   |  57 ++-
 kernel/sched/core.c   | 387 --
 kernel/sched/cputime.c|  29 +++-
 kernel/sched/fair.c   |  39 -
 kernel/sched/idle_task.c  |   9 +-
 kernel/sched/rt.c |  42 +++--
 kernel/sched/sched.h  |  28 ++-
 kernel/sched/stop_task.c  |   8 +-
 12 files changed, 672 insertions(+), 47 deletions(-)
 create mode 100644 Documentation/cgroups/cpu.txt

-- 
1.8.1


[PATCH v6 03/12] cgroup, sched: let cpu serve the same files as cpuacct

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Tejun Heo 

cpuacct being on a separate hierarchy is one of the main cgroup
related complaints from scheduler side and the consensus seems to be

* Allowing cpuacct to be a separate controller was a mistake.  In
  general multiple controllers on the same type of resource should be
  avoided, especially accounting-only ones.

* Statistics provided by cpuacct are useful and should instead be
  served by cpu.

This patch makes cpu maintain and serve all cpuacct.* files and make
cgroup core ignore cpuacct if it's co-mounted with cpu.  This is a
step in deprecating cpuacct.  The next patch will allow disabling or
dropping cpuacct without affecting userland too much.

Note that this creates some discrepancies in /proc/cgroups and
/proc/PID/cgroup.  The co-mounted cpuacct won't be reflected correctly
there.  cpuacct will eventually be removed completely, probably except
for the statistics filenames, and I'd like to keep the amount of
compatibility hackery to a minimum.

The cpu statistics implementation isn't optimized in any way.  It's
mostly a verbatim copy from cpuacct.  The goal is allowing quick
disabling and removal of CONFIG_CGROUP_CPUACCT and creating a base on
top of which cpu can implement proper optimization.

[ glommer: don't call *_charge in stop_task.c ]
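
[ Illustration only: co-mounting the two controllers from userspace, which is
  the case the parse_cgroupfs_options() hunk below handles by dropping the
  cpuacct bit.  A sketch using mount(2); the target path is an assumption. ]

#include <sys/mount.h>

/* with this patch, asking for cpu,cpuacct behaves as if only cpu was given,
 * and the cpu controller serves the cpuacct.* files itself */
static int comount_cpu_cpuacct(const char *target)
{
	return mount("cgroup", target, "cgroup", 0, "cpu,cpuacct");
}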

Signed-off-by: Tejun Heo 
Signed-off-by: Glauber Costa 
Cc: Peter Zijlstra 
Cc: Michal Hocko 
Cc: Kay Sievers 
Cc: Lennart Poettering 
Cc: Dave Jones 
Cc: Ben Hutchings 
Cc: Paul Turner 
---
 kernel/cgroup.c|  13 
 kernel/sched/core.c| 173 +
 kernel/sched/cputime.c |  19 +-
 kernel/sched/fair.c|   1 +
 kernel/sched/rt.c  |   1 +
 kernel/sched/sched.h   |   7 ++
 6 files changed, 212 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2f98398..0750669d 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1248,6 +1248,19 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
/* Consistency checks */
 
/*
+* cpuacct is deprecated and cpu will serve the same stat files.
+* If co-mount with cpu is requested, ignore cpuacct.  Note that
+* this creates some discrepancies in /proc/cgroups and
+* /proc/PID/cgroup.
+*
+* https://lkml.org/lkml/2012/9/13/542
+*/
+#if IS_ENABLED(CONFIG_CGROUP_SCHED) && IS_ENABLED(CONFIG_CGROUP_CPUACCT)
+   if ((opts->subsys_bits & (1 << cpu_cgroup_subsys_id)) &&
+   (opts->subsys_bits & (1 << cpuacct_subsys_id)))
+   opts->subsys_bits &= ~(1 << cpuacct_subsys_id);
+#endif
+   /*
 * Option noprefix was introduced just for backward compatibility
 * with the old cpuset, so we allow noprefix only if mounting just
 * the cpuset subsystem.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 257002c..6516694 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6811,6 +6811,7 @@ int in_sched_functions(unsigned long addr)
 #ifdef CONFIG_CGROUP_SCHED
 struct task_group root_task_group;
 LIST_HEAD(task_groups);
+static DEFINE_PER_CPU(u64, root_tg_cpuusage);
 #endif
 
 DECLARE_PER_CPU(cpumask_var_t, load_balance_tmpmask);
@@ -6869,6 +6870,8 @@ void __init sched_init(void)
 #endif /* CONFIG_RT_GROUP_SCHED */
 
 #ifdef CONFIG_CGROUP_SCHED
+   root_task_group.cpustat = &kernel_cpustat;
+   root_task_group.cpuusage = &root_tg_cpuusage;
list_add(&root_task_group.list, &task_groups);
INIT_LIST_HEAD(&root_task_group.children);
INIT_LIST_HEAD(&root_task_group.siblings);
@@ -7152,6 +7155,8 @@ static void free_sched_group(struct task_group *tg)
free_fair_sched_group(tg);
free_rt_sched_group(tg);
autogroup_free(tg);
+   free_percpu(tg->cpuusage);
+   free_percpu(tg->cpustat);
kfree(tg);
 }
 
@@ -7165,6 +7170,11 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!tg)
return ERR_PTR(-ENOMEM);
 
+   tg->cpuusage = alloc_percpu(u64);
+   tg->cpustat = alloc_percpu(struct kernel_cpustat);
+   if (!tg->cpuusage || !tg->cpustat)
+   goto err;
+
if (!alloc_fair_sched_group(tg, parent))
goto err;
 
@@ -7256,6 +7266,24 @@ void sched_move_task(struct task_struct *tsk)
 
task_rq_unlock(rq, tsk, &flags);
 }
+
+void task_group_charge(struct task_struct *tsk, u64 cputime)
+{
+   struct task_group *tg;
+   int cpu = task_cpu(tsk);
+
+   rcu_read_lock();
+
+   tg = container_of(task_subsys_state(tsk, cpu_cgroup_subsys_id),
+ struct task_group, css);
+
+   for (; tg; tg = tg->parent) {
+   u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
+   *cpuusage += cputime;
+   }
+
+   rcu_read_unlock();
+}
 #endif /* CONFIG_CGROUP_SCHED */
 
 #if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_CFS_BANDWIDTH)
@@ -7612,6 +7640,134 @@ cpu_cgroup_exit(struct cgroup *cgrp, struct 

[PATCH v6 08/12] sched: account guest time per-cgroup as well.

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Glauber Costa 

We already track multiple tick statistics per-cgroup, using
the task_group_account_field facility. This patch accounts
guest_time in that manner as well.
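
[ Illustration only: task_group_account_field() is introduced earlier in the
  series; the sketch below is the assumed semantics, not its actual body.
  The _sketch name and the bare hierarchy walk are illustrative. ]

static void task_group_account_field_sketch(struct task_struct *p,
					    int index, u64 tmp)
{
	struct task_group *tg;

	/* charge the field at every level of the hierarchy, root included */
	for (tg = task_group(p); tg; tg = tg->parent)
		this_cpu_ptr(tg->cpustat)->cpustat[index] += tmp;
}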

Signed-off-by: Glauber Costa 
CC: Peter Zijlstra 
CC: Paul Turner 
---
 kernel/sched/cputime.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index a4332f9..0685e71 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -190,8 +190,6 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
 static void account_guest_time(struct task_struct *p, cputime_t cputime,
   cputime_t cputime_scaled)
 {
-   u64 *cpustat = kcpustat_this_cpu->cpustat;
-
/* Add guest time to process. */
p->utime += cputime;
p->utimescaled += cputime_scaled;
@@ -200,11 +198,11 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
 
/* Add guest time to cpustat. */
if (TASK_NICE(p) > 0) {
-   cpustat[CPUTIME_NICE] += (__force u64) cputime;
-   cpustat[CPUTIME_GUEST_NICE] += (__force u64) cputime;
+   task_group_account_field(p, CPUTIME_NICE, (__force u64) cputime);
+   task_group_account_field(p, CPUTIME_GUEST_NICE, (__force u64) cputime);
} else {
-   cpustat[CPUTIME_USER] += (__force u64) cputime;
-   cpustat[CPUTIME_GUEST] += (__force u64) cputime;
+   task_group_account_field(p, CPUTIME_USER, (__force u64) cputime);
+   task_group_account_field(p, CPUTIME_GUEST, (__force u64) cputime);
}
 }
 
-- 
1.8.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v6 07/12] sched: document the cpu cgroup.

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Glauber Costa 

The CPU cgroup is, so far, undocumented. Although data exists in the
Documentation directory about its functioning, it is usually spread,
and/or presented in the context of something else. This file
consolidates all cgroup-related information about it.

Signed-off-by: Glauber Costa 
---
 Documentation/cgroups/cpu.txt | 82 +++
 1 file changed, 82 insertions(+)
 create mode 100644 Documentation/cgroups/cpu.txt

diff --git a/Documentation/cgroups/cpu.txt b/Documentation/cgroups/cpu.txt
new file mode 100644
index 000..e0ea075
--- /dev/null
+++ b/Documentation/cgroups/cpu.txt
@@ -0,0 +1,82 @@
+CPU Controller
+--
+
+The CPU controller is responsible for grouping tasks together that will be
+viewed by the scheduler as a single unit. The CFS scheduler will first divide
+CPU time equally between all entities in the same level, and then proceed by
+doing the same in the next level. Basic use cases for that are described in the
+main cgroup documentation file, cgroups.txt.
+
+Users of this functionality should be aware that deep hierarchies will of
+course impose scheduler overhead, since the scheduler will have to take extra
+steps and look up additional data structures to make its final decision.
+
+Through the CPU controller, the scheduler is also able to cap the CPU
+utilization of a particular group. This is particularly useful in environments
+in which CPU is paid for by the hour, and one values predictability over
+performance.
+
+CPU Accounting
+--
+
+The CPU cgroup will also provide additional files under the prefix "cpuacct".
+Those files provide accounting statistics and were previously provided by the
+separate cpuacct controller. Although the cpuacct controller will still be kept
+around for compatibility reasons, its usage is discouraged. If both the CPU and
+cpuacct controllers are present in the system, distributors are encouraged to
+always mount them together.
+
+Files
+-
+
+The CPU controller exposes the following files to the user:
+
+ - cpu.shares: The weight of each group living in the same hierarchy, that
+ translates into the amount of CPU it is expected to get. Upon cgroup creation,
+ each group gets assigned a default of 1024. The percentage of CPU assigned to
+ the cgroup is the value of shares divided by the sum of all shares in all
+ cgroups in the same level.
+
+ - cpu.cfs_period_us: The duration in microseconds of each scheduler period, for
+ bandwidth decisions. This defaults to 100000us, or 100ms. Larger periods will
+ improve throughput at the expense of latency, since the scheduler will be able
+ to sustain a cpu-bound workload for longer. The opposite is true for smaller
+ periods. Note that this only affects non-RT tasks that are scheduled by the
+ CFS scheduler.
+
+- cpu.cfs_quota_us: The maximum time in microseconds during each cfs_period_us
+  for which the current group will be allowed to run. For instance, if it is set
+  to half of cfs_period_us, the cgroup will only be able to run for at most 50%
+  of the time. One should note that this represents aggregate time over all CPUs
+  in the system. Therefore, in order to allow full usage of two CPUs, for
+  instance, one should set this value to twice the value of cfs_period_us.
+
+- cpu.stat: statistics about the bandwidth controls. No data will be presented
+  if cpu.cfs_quota_us is not set. The file presents three
+  numbers:
+   nr_periods: how many full periods have elapsed.
+   nr_throttled: number of times we exhausted the full allowed bandwidth
+   throttled_time: total time the tasks were not run due to being over quota
+
+ - cpu.rt_runtime_us and cpu.rt_period_us: Those files are the RT-task
+   analogues of the CFS files cfs_quota_us and cfs_period_us. One important
+   difference, though, is that while the cfs quotas are upper bounds that
+   won't necessarily be met, the rt runtimes form a stricter guarantee.
+   Therefore, no overlap is allowed. Implications of that are that given a
+   hierarchy with multiple children, the sum of all rt_runtime_us may not exceed
+   the runtime of the parent. Also, an rt_runtime_us of 0 means that no rt tasks
+   can ever be run in this cgroup. For more information about rt tasks runtime
+   assignments, see scheduler/sched-rt-group.txt
+
+ - cpuacct.usage: The aggregate CPU time, in nanoseconds, consumed by all tasks
+   in this group.
+
+ - cpuacct.usage_percpu: The CPU time, in nanoseconds, consumed by all tasks in
+   this group, separated by CPU. The format is a space-separated array of time
+   values, one for each present CPU.
+
+ - cpuacct.stat: aggregate user and system time consumed by tasks in this group.
+   The format is
+   user: x
+   system: y
+
-- 
1.8.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v6 05/12] sched: adjust exec_clock to use it as cpu usage metric

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Glauber Costa 

exec_clock already provides per-group cpu usage metrics, and can be
reused by cpuacct in case cpu and cpuacct are co-mounted.

However, it is only provided by tasks in fair class. Doing the same for
rt is easy, and can be done in an already existing hierarchy loop. This
is an improvement over the independent hierarchy walk executed by
cpuacct.
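
[ Illustration only: with exec_clock maintained on both cfs_rq and rt_rq, a
  co-mounted cpuacct.usage reading can be derived roughly as below.  The
  helper name is hypothetical and assumes FAIR_GROUP_SCHED and RT_GROUP_SCHED;
  the prev_exec_clock fields added here appear to exist so that a usage reset
  can be implemented as a snapshot rather than by clearing the schedstat. ]

static u64 tg_usage_ns_sketch(struct task_group *tg, int cpu)
{
	u64 usage = 0;

	usage += tg->cfs_rq[cpu]->exec_clock - tg->cfs_rq[cpu]->prev_exec_clock;
	usage += tg->rt_rq[cpu]->exec_clock - tg->rt_rq[cpu]->prev_exec_clock;
	return usage;
}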

Signed-off-by: Glauber Costa 
CC: Dave Jones 
CC: Ben Hutchings 
CC: Peter Zijlstra 
CC: Paul Turner 
CC: Lennart Poettering 
CC: Kay Sievers 
CC: Tejun Heo 
---
 kernel/sched/rt.c| 1 +
 kernel/sched/sched.h | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f7e05d87..7f6f6c6 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -945,6 +945,7 @@ static void update_curr_rt(struct rq *rq)
 
for_each_sched_rt_entity(rt_se) {
rt_rq = rt_rq_of_se(rt_se);
+   schedstat_add(rt_rq, exec_clock, delta_exec);
 
if (sched_rt_runtime(rt_rq) != RUNTIME_INF) {
raw_spin_lock(&rt_rq->rt_runtime_lock);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 84a339d..01ca8a4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -210,6 +210,7 @@ struct cfs_rq {
unsigned int nr_running, h_nr_running;
 
u64 exec_clock;
+   u64 prev_exec_clock;
u64 min_vruntime;
 #ifndef CONFIG_64BIT
u64 min_vruntime_copy;
@@ -312,6 +313,8 @@ struct rt_rq {
struct plist_head pushable_tasks;
 #endif
int rt_throttled;
+   u64 exec_clock;
+   u64 prev_exec_clock;
u64 rt_time;
u64 rt_runtime;
/* Nests inside the rq lock: */
-- 
1.8.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v6 12/12] sched: introduce cgroup file stat_percpu

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Glauber Costa 

The file cpu.stat_percpu will show various scheduler-related
information that is usually available to the top level through other
files.

For instance, most of the meaningful data in /proc/stat is presented
here. Given this file, a container can easily construct a local copy of
/proc/stat for internal consumption.

The data we export is comprised of:
* all the tick information, previously available only through cpuacct,
  like user time, system time, etc.

* wait time, which can be used to construct analogous information to
  steal time in hypervisors,

* nr_switches and nr_running, which are cgroup-local versions of
  their global counterparts.

The file format consists of a one-line header that describes the fields
being listed.  No guarantee is given that the fields will be kept the
same between kernel releases, and readers should always check the header
in order to introspect it.

Each of the following lines will show the respective field value for
each of the possible cpus in the system. All time values are shown in
nanoseconds.

One example output for this file is:

cpu user nice system irq softirq guest guest_nice wait nr_switches nr_running
cpu0 47100 0 1500 0 0 0 0 1996534 7205 1
cpu1 58800 0 1700 0 0 0 0 2848680 6510 1
cpu2 50500 0 1400 0 0 0 0 2350771 6183 1
cpu3 47200 0 1600 0 0 0 0 19766345 6277 2
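
[ Illustration only: parsing the format described above from userspace.  The
  relative path is an assumption; a full reader would map every column from
  the header line instead of just echoing it, as the changelog recommends. ]

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("cpu.stat_percpu", "r");
	char header[512], line[512];

	if (!f)
		return 1;
	if (!fgets(header, sizeof(header), f)) {
		fclose(f);
		return 1;
	}
	printf("fields: %s", header);

	while (fgets(line, sizeof(line), f)) {
		char cpu[16];
		unsigned long long first;

		/* first column is the cpu name; the next one is user time
		 * (in nanoseconds) in the example output above */
		if (sscanf(line, "%15s %llu", cpu, &first) == 2)
			printf("%s user=%llu\n", cpu, first);
	}
	fclose(f);
	return 0;
}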

Signed-off-by: Glauber Costa 
CC: Peter Zijlstra 
CC: Paul Turner 
---
 Documentation/cgroups/cpu.txt |  18 +++
 kernel/sched/core.c   | 109 ++
 kernel/sched/fair.c   |  14 ++
 kernel/sched/sched.h  |  11 -
 4 files changed, 150 insertions(+), 2 deletions(-)

diff --git a/Documentation/cgroups/cpu.txt b/Documentation/cgroups/cpu.txt
index e0ea075..2124320 100644
--- a/Documentation/cgroups/cpu.txt
+++ b/Documentation/cgroups/cpu.txt
@@ -68,6 +68,24 @@ The CPU controller exposes the following files to the user:
can ever be run in this cgroup. For more information about rt tasks runtime
assignments, see scheduler/sched-rt-group.txt
 
+ - cpu.stat_percpu: Various scheduler statistics for the current group. The
+   information provided in this file is akin to the one displayed in /proc/stat,
+   except for the fact that it is cgroup-aware. The file format consists of a
+   one-line header that describes the fields being listed.  No guarantee is
+   given that the fields will be kept the same between kernel releases, and
+   readers should always check the header in order to introspect it.
+
+   Each of the following lines will show the respective field value for
+   each of the possible cpus in the system. All time values are shown in
+   nanoseconds. One example output for this file is:
+
+   cpu user nice system irq softirq guest guest_nice wait nr_switches nr_running
+   cpu0 47100 0 1500 0 0 0 0 1996534 7205 1
+   cpu1 58800 0 1700 0 0 0 0 2848680 6510 1
+   cpu2 50500 0 1400 0 0 0 0 2350771 6183 1
+   cpu3 47200 0 1600 0 0 0 0 19766345 6277 2
+
+
  - cpuacct.usage: The aggregate CPU time, in nanoseconds, consumed by all tasks
in this group.
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6bb56f0..87437af 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7680,6 +7680,7 @@ static inline void cfs_exec_clock_reset(struct task_group *tg, int cpu)
 #else
 static inline u64 cfs_exec_clock(struct task_group *tg, int cpu)
 {
+   return 0;
 }
 
 static inline void cfs_exec_clock_reset(struct task_group *tg, int cpu)
@@ -8111,6 +8112,108 @@ static u64 cpu_rt_period_read_uint(struct cgroup *cgrp, struct cftype *cft)
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_SCHEDSTATS
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+#define fair_rq(field, tg, i)  (tg)->cfs_rq[i]->field
+#else
+#define fair_rq(field, tg, i)  0
+#endif
+
+#ifdef CONFIG_RT_GROUP_SCHED
+#define rt_rq(field, tg, i)  (tg)->rt_rq[i]->field
+#else
+#define rt_rq(field, tg, i)  0
+#endif
+
+static u64 tg_nr_switches(struct task_group *tg, int cpu)
+{
+   /* nr_switches, which counts idle and stop task, is added to all tgs */
+   return cpu_rq(cpu)->nr_switches +
+   cfs_nr_switches(tg, cpu) + rt_nr_switches(tg, cpu);
+}
+
+static u64 tg_nr_running(struct task_group *tg, int cpu)
+{
+   /*
+* because of autogrouped groups in root_task_group, the
+* following does not hold.
+*/
+   if (tg != &root_task_group)
+   return rt_rq(rt_nr_running, tg, cpu) + fair_rq(nr_running, tg, cpu);
+
+   return cpu_rq(cpu)->nr_running;
+}
+
+static u64 tg_wait(struct task_group *tg, int cpu)
+{
+   u64 val;
+
+   if (tg != &root_task_group)
+   val = cfs_read_wait(tg, cpu);
+   else
+   /*
+* There are many errors here that we are accumulating.
+* However, we only provide this in the interest of having
+* a consistent interface for 

[PATCH v6 09/12] sched: Push put_prev_task() into pick_next_task()

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Peter Zijlstra 

In order to avoid having to do put/set on a whole cgroup hierarchy
when we context switch, push the put into pick_next_task() so that
both operations are in the same function. Further changes then allow
us to possibly optimize away redundant work.

[ glom...@parallels.com: incorporated mailing list feedback ]
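
[ Illustration only: the shape of the contract after this change, for a made-up
  class.  has_runnable_tasks() and choose_next() are hypothetical stand-ins;
  the point is that whoever picks the next task is now responsible for putting
  @prev. ]

static struct task_struct *
pick_next_task_example(struct rq *rq, struct task_struct *prev)
{
	struct task_struct *next;

	if (!has_runnable_tasks(rq))		/* hypothetical helper */
		return NULL;

	if (prev)
		prev->sched_class->put_prev_task(rq, prev);

	next = choose_next(rq);			/* hypothetical helper */
	return next;
}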

Signed-off-by: Peter Zijlstra 
Signed-off-by: Glauber Costa 
---
 include/linux/sched.h|  8 +++-
 kernel/sched/core.c  | 20 +++-
 kernel/sched/fair.c  |  6 +-
 kernel/sched/idle_task.c |  6 +-
 kernel/sched/rt.c| 27 ---
 kernel/sched/stop_task.c |  5 -
 6 files changed, 44 insertions(+), 28 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 206bb08..31d86e5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1082,7 +1082,13 @@ struct sched_class {
 
void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
 
-   struct task_struct * (*pick_next_task) (struct rq *rq);
+   /*
+* It is the responsibility of the pick_next_task() method that will
+* return the next task to call put_prev_task() on the @prev task or
+* something equivalent.
+*/
+   struct task_struct * (*pick_next_task) (struct rq *rq,
+   struct task_struct *prev);
void (*put_prev_task) (struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f8a9acf..c36df03 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2807,18 +2807,11 @@ static inline void schedule_debug(struct task_struct *prev)
schedstat_inc(this_rq(), sched_count);
 }
 
-static void put_prev_task(struct rq *rq, struct task_struct *prev)
-{
-   if (prev->on_rq || rq->skip_clock_update < 0)
-   update_rq_clock(rq);
-   prev->sched_class->put_prev_task(rq, prev);
-}
-
 /*
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-pick_next_task(struct rq *rq)
+pick_next_task(struct rq *rq, struct task_struct *prev)
 {
const struct sched_class *class;
struct task_struct *p;
@@ -2828,13 +2821,13 @@ pick_next_task(struct rq *rq)
 * the fair class we can call that function directly:
 */
if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
-   p = fair_sched_class.pick_next_task(rq);
+   p = fair_sched_class.pick_next_task(rq, prev);
if (likely(p))
return p;
}
 
for_each_class(class) {
-   p = class->pick_next_task(rq);
+   p = class->pick_next_task(rq, prev);
if (p)
return p;
}
@@ -2929,8 +2922,9 @@ need_resched:
if (unlikely(!rq->nr_running))
idle_balance(cpu, rq);
 
-   put_prev_task(rq, prev);
-   next = pick_next_task(rq);
+   if (prev->on_rq || rq->skip_clock_update < 0)
+   update_rq_clock(rq);
+   next = pick_next_task(rq, prev);
clear_tsk_need_resched(prev);
rq->skip_clock_update = 0;
 
@@ -4880,7 +4874,7 @@ static void migrate_tasks(unsigned int dead_cpu)
if (rq->nr_running == 1)
break;
 
-   next = pick_next_task(rq);
+   next = pick_next_task(rq, NULL);
BUG_ON(!next);
next->sched_class->put_prev_task(rq, next);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c15bc92..d59a106 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3595,7 +3595,8 @@ preempt:
set_last_buddy(se);
 }
 
-static struct task_struct *pick_next_task_fair(struct rq *rq)
+static struct task_struct *
+pick_next_task_fair(struct rq *rq, struct task_struct *prev)
 {
struct task_struct *p;
struct cfs_rq *cfs_rq = &rq->cfs;
@@ -3604,6 +3605,9 @@ static struct task_struct *pick_next_task_fair(struct rq *rq)
if (!cfs_rq->nr_running)
return NULL;
 
+   if (prev)
+   prev->sched_class->put_prev_task(rq, prev);
+
do {
se = pick_next_entity(cfs_rq);
set_next_entity(cfs_rq, se);
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index b6baf37..07e6027 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -22,8 +22,12 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int flags)
resched_task(rq->idle);
 }
 
-static struct task_struct *pick_next_task_idle(struct rq *rq)
+static struct task_struct *
+pick_next_task_idle(struct rq *rq, struct task_struct *prev)
 {
+   if (prev)
+   prev->sched_class->put_prev_task(rq, prev);
+
schedstat_inc(rq, sched_goidle);
return rq->idle;
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7f6f6c6..80c58fe 100644
--- 

[PATCH v6 11/12] sched: change nr_context_switches calculation.

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Glauber Costa 

This patch changes the calculation of nr_context_switches. The variable
"nr_switches" is now used to account for the number of transitions to the
idle task or the stop task. It is removed from the schedule() path.

The total calculation can be made using the fact that the transitions to
fair and rt classes are recorded in the root_task_group. One can easily
derive the total figure by adding those quantities together.

Signed-off-by: Glauber Costa 
CC: Peter Zijlstra 
CC: Paul Turner 
---
 kernel/sched/core.c  | 17 +++--
 kernel/sched/idle_task.c |  3 +++
 kernel/sched/stop_task.c |  2 ++
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c36df03..6bb56f0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2001,13 +2001,27 @@ unsigned long nr_uninterruptible(void)
return sum;
 }
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+#define cfs_nr_switches(tg, cpu) (tg)->cfs_rq[cpu]->nr_switches
+#else
+#define cfs_nr_switches(tg, cpu) cpu_rq(cpu)->cfs.nr_switches
+#endif
+#ifdef CONFIG_RT_GROUP_SCHED
+#define rt_nr_switches(tg, cpu) (tg)->rt_rq[cpu]->rt_nr_switches
+#else
+#define rt_nr_switches(tg, cpu) cpu_rq(cpu)->rt.rt_nr_switches
+#endif
+
 unsigned long long nr_context_switches(void)
 {
int i;
unsigned long long sum = 0;
 
-   for_each_possible_cpu(i)
+   for_each_possible_cpu(i) {
+   sum += cfs_nr_switches(&root_task_group, i);
+   sum += rt_nr_switches(&root_task_group, i);
sum += cpu_rq(i)->nr_switches;
+   }
 
return sum;
 }
@@ -2929,7 +2943,6 @@ need_resched:
rq->skip_clock_update = 0;
 
if (likely(prev != next)) {
-   rq->nr_switches++;
rq->curr = next;
++*switch_count;
 
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index 07e6027..652d98c 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -28,6 +28,9 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev)
if (prev)
prev->sched_class->put_prev_task(rq, prev);
 
+   if (prev != rq->idle)
+   rq->nr_switches++;
+
schedstat_inc(rq, sched_goidle);
return rq->idle;
 }
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 5f10918..d1e9b82 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -32,6 +32,8 @@ pick_next_task_stop(struct rq *rq, struct task_struct *prev)
stop->se.exec_start = rq->clock_task;
if (prev)
prev->sched_class->put_prev_task(rq, prev);
+   if (prev != rq->stop)
+   rq->nr_switches++;
return stop;
}
 
-- 
1.8.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v6 02/12] cgroup: implement CFTYPE_NO_PREFIX

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Tejun Heo 

When cgroup files are created, cgroup core automatically prepends the
name of the subsystem as prefix.  This patch adds CFTYPE_NO_PREFIX
which disables the automatic prefix.

This will be used to deprecate cpuacct which will make cpu create and
serve the cpuacct files.
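
[ Illustration only: how a cftype would use the new flag so a file keeps its
  literal name instead of gaining the owning subsystem's prefix.  The handler
  is hypothetical; the real users appear in the next patch, where cpu serves
  the cpuacct.* names. ]

static struct cftype cpu_files_sketch[] = {
	{
		.name = "cpuacct.usage",	/* keep the old name verbatim */
		.flags = CFTYPE_NO_PREFIX,
		.read_u64 = cpuusage_read,	/* hypothetical handler */
	},
	{ }	/* terminate */
};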

Signed-off-by: Tejun Heo 
Cc: Peter Zijlstra 
Cc: Glauber Costa 
---
 include/linux/cgroup.h | 1 +
 kernel/cgroup.c| 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 7d73905..7d193f9 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -277,6 +277,7 @@ struct cgroup_map_cb {
 /* cftype->flags */
#define CFTYPE_ONLY_ON_ROOT (1U << 0)   /* only create on root cg */
 #define CFTYPE_NOT_ON_ROOT (1U << 1)   /* don't create on root cg */
+#define CFTYPE_NO_PREFIX   (1U << 2)   /* skip subsys prefix */
 
#define MAX_CFTYPE_NAME 64
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 4855892..2f98398 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2723,7 +2723,8 @@ static int cgroup_add_file(struct cgroup *cgrp, struct cgroup_subsys *subsys,
 
simple_xattrs_init(&cft->xattrs);
 
-   if (subsys && !test_bit(ROOT_NOPREFIX, &cgrp->root->flags)) {
+   if (subsys && !(cft->flags & CFTYPE_NO_PREFIX) &&
+   !test_bit(ROOT_NOPREFIX, &cgrp->root->flags)) {
strcpy(name, subsys->name);
strcat(name, ".");
}
-- 
1.8.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v6 10/12] sched: record per-cgroup number of context switches

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Glauber Costa 

Context switches are, to this moment, a property of the runqueue. When
running containers, we would like to be able to present a separate
figure for each container (or cgroup, in this context).

The chosen way to accomplish this is to increment a per-cfs_rq or
per-rt_rq counter, depending on the task, for each of the sched entities
involved, up to the parent. It is trivial to note that for the parent
cgroup, we always add 1 by doing this. Also, we are not introducing any
hierarchy walks in here. An already existing walk is reused.
There are, however, two main issues:

 1. the traditional context switch code only increments nr_switches when
 a different task is being inserted in the rq. Eventually, albeit not
 likely, we will pick the same task as before. Since for cfs and rt we
 only know which task will be next after the walk, we need to do the walk
 again, decrementing by 1. Since this is far from likely, it seems a fair
 price to pay.

 2. Those figures do not include switches from and to the idle or stop
 task. Those need to be recorded separately, which will happen in a
 follow up patch.

Signed-off-by: Glauber Costa 
CC: Peter Zijlstra 
CC: Paul Turner 
---
 kernel/sched/fair.c  | 18 ++
 kernel/sched/rt.c| 15 +--
 kernel/sched/sched.h |  3 +++
 3 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d59a106..0dd9c50 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3609,6 +3609,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
prev->sched_class->put_prev_task(rq, prev);
 
do {
+   if (likely(prev))
+   cfs_rq->nr_switches++;
se = pick_next_entity(cfs_rq);
set_next_entity(cfs_rq, se);
cfs_rq = group_cfs_rq(se);
@@ -3618,6 +3620,22 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
if (hrtick_enabled(rq))
hrtick_start_fair(rq, p);
 
+   /*
+* This condition is extremely unlikely, and most of the time will just
+* consist of this unlikely branch, which is extremely cheap. But we
+* still need to have it, because when we first loop through cfs_rq's,
+* we can't possibly know which task we will pick. The call to
+* set_next_entity above is not meant to mess up the tree in this case,
+* so this should give us the same chain, in the same order.
+*/
+   if (unlikely(p == prev)) {
se = &p->se;
+   for_each_sched_entity(se) {
+   cfs_rq = cfs_rq_of(se);
+   cfs_rq->nr_switches--;
+   }
+   }
+
return p;
 }
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 80c58fe..19ceed9 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1364,13 +1364,16 @@ static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
return next;
 }
 
-static struct task_struct *_pick_next_task_rt(struct rq *rq)
+static struct task_struct *
+_pick_next_task_rt(struct rq *rq, struct task_struct *prev)
 {
struct sched_rt_entity *rt_se;
struct task_struct *p;
struct rt_rq *rt_rq  = &rq->rt;
 
do {
+   if (likely(prev))
+   rt_rq->rt_nr_switches++;
rt_se = pick_next_rt_entity(rq, rt_rq);
BUG_ON(!rt_se);
rt_rq = group_rt_rq(rt_se);
@@ -1379,6 +1382,14 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
p = rt_task_of(rt_se);
p->se.exec_start = rq->clock_task;
 
+   /* See fair.c for an explanation on this */
+   if (unlikely(p == prev)) {
+   for_each_sched_rt_entity(rt_se) {
+   rt_rq = rt_rq_of_se(rt_se);
+   rt_rq->rt_nr_switches--;
+   }
+   }
+
return p;
 }
 
@@ -1397,7 +1408,7 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev)
if (prev)
prev->sched_class->put_prev_task(rq, prev);
 
-   p = _pick_next_task_rt(rq);
+   p = _pick_next_task_rt(rq, prev);
 
/* The running task is never eligible for pushing */
if (p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 640aa14..a426abc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -229,6 +229,7 @@ struct cfs_rq {
unsigned int nr_spread_over;
 #endif
 
+   u64 nr_switches;
 #ifdef CONFIG_SMP
 /*
  * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
@@ -298,6 +299,8 @@ static inline int rt_bandwidth_enabled(void)
 struct rt_rq {
struct rt_prio_array active;
unsigned int rt_nr_running;
+   u64 rt_nr_switches;
+
 #if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED
struct {
int curr; /* highest queued rt task prio */
-- 
1.8.1

--
To unsubscribe from this list: send the 

[PATCH v6 01/12] don't call cpuacct_charge in stop_task.c

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Glauber Costa 

Commit 8f618968 changed stop_task to do the same bookkeeping as the
other classes. However, the call to cpuacct_charge() doesn't affect
the scheduler decisions at all, and doesn't need to be moved over.

Moreover, being a kthread, the migration thread won't belong to any
cgroup anyway, rendering this call quite useless.

Signed-off-by: Glauber Costa 
CC: Mike Galbraith 
CC: Peter Zijlstra 
CC: Thomas Gleixner 
---
 kernel/sched/stop_task.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index da5eb5b..fda1cbe 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -68,7 +68,6 @@ static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
account_group_exec_runtime(curr, delta_exec);
 
curr->se.exec_start = rq->clock_task;
-   cpuacct_charge(curr, delta_exec);
 }
 
 static void task_tick_stop(struct rq *rq, struct task_struct *curr, int queued)
-- 
1.8.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v6 01/12] don't call cpuacct_charge in stop_task.c

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Glauber Costa glom...@parallels.com

Commit 8f618968 changed stop_task to do the same bookkeping as the
other classes. However, the call to cpuacct_charge() doesn't affect
the scheduler decisions at all, and doesn't need to be moved over.

Moreover, being a kthread, the migration thread won't belong to any
cgroup anyway, rendering this call quite useless.

Signed-off-by: Glauber Costa glom...@parallels.com
CC: Mike Galbraith mgalbra...@suse.de
CC: Peter Zijlstra a.p.zijls...@chello.nl
CC: Thomas Gleixner t...@linutronix.de
---
 kernel/sched/stop_task.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index da5eb5b..fda1cbe 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -68,7 +68,6 @@ static void put_prev_task_stop(struct rq *rq, struct 
task_struct *prev)
account_group_exec_runtime(curr, delta_exec);
 
curr-se.exec_start = rq-clock_task;
-   cpuacct_charge(curr, delta_exec);
 }
 
 static void task_tick_stop(struct rq *rq, struct task_struct *curr, int queued)
-- 
1.8.1

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v6 02/12] cgroup: implement CFTYPE_NO_PREFIX

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Tejun Heo t...@kernel.org

When cgroup files are created, cgroup core automatically prepends the
name of the subsystem as prefix.  This patch adds CFTYPE_NO_PREFIX
which disables the automatic prefix.

This will be used to deprecate cpuacct which will make cpu create and
serve the cpuacct files.

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Glauber Costa glom...@parallels.com
---
 include/linux/cgroup.h | 1 +
 kernel/cgroup.c| 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 7d73905..7d193f9 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -277,6 +277,7 @@ struct cgroup_map_cb {
 /* cftype-flags */
 #define CFTYPE_ONLY_ON_ROOT(1U  0)   /* only create on root cg */
 #define CFTYPE_NOT_ON_ROOT (1U  1)   /* don't create on root cg */
+#define CFTYPE_NO_PREFIX   (1U  2)   /* skip subsys prefix */
 
 #define MAX_CFTYPE_NAME64
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 4855892..2f98398 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2723,7 +2723,8 @@ static int cgroup_add_file(struct cgroup *cgrp, struct 
cgroup_subsys *subsys,
 
simple_xattrs_init(cft-xattrs);
 
-   if (subsys  !test_bit(ROOT_NOPREFIX, cgrp-root-flags)) {
+   if (subsys  !(cft-flags  CFTYPE_NO_PREFIX) 
+   !test_bit(ROOT_NOPREFIX, cgrp-root-flags)) {
strcpy(name, subsys-name);
strcat(name, .);
}
-- 
1.8.1

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v6 10/12] sched: record per-cgroup number of context switches

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Glauber Costa glom...@parallels.com

Context switches are, to this moment, a property of the runqueue. When
running containers, we would like to be able to present a separate
figure for each container (or cgroup, in this context).

The chosen way to accomplish this is to increment a per cfs_rq or
rt_rq, depending on the task, for each of the sched entities involved,
up to the parent. It is trivial to note that for the parent cgroup, we
always add 1 by doing this. Also, we are not introducing any hierarchy
walks in here. An already existent walk is reused.
There are, however, two main issues:

 1. the traditional context switch code only increment nr_switches when
 a different task is being inserted in the rq. Eventually, albeit not
 likely, we will pick the same task as before. Since for cfq and rt we
 only now which task will be next after the walk, we need to do the walk
 again, decrementing 1. Since this is by far not likely, it seems a fair
 price to pay.

 2. Those figures do not include switches from and to the idle or stop
 task. Those need to be recorded separately, which will happen in a
 follow up patch.

Signed-off-by: Glauber Costa glom...@parallels.com
CC: Peter Zijlstra a.p.zijls...@chello.nl
CC: Paul Turner p...@google.com
---
 kernel/sched/fair.c  | 18 ++
 kernel/sched/rt.c| 15 +--
 kernel/sched/sched.h |  3 +++
 3 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d59a106..0dd9c50 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3609,6 +3609,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct 
*prev)
prev-sched_class-put_prev_task(rq, prev);
 
do {
+   if (likely(prev))
+   cfs_rq-nr_switches++;
se = pick_next_entity(cfs_rq);
set_next_entity(cfs_rq, se);
cfs_rq = group_cfs_rq(se);
@@ -3618,6 +3620,22 @@ pick_next_task_fair(struct rq *rq, struct task_struct 
*prev)
if (hrtick_enabled(rq))
hrtick_start_fair(rq, p);
 
+   /*
+* This condition is extremely unlikely, and most of the time will just
+* consist of this unlikely branch, which is extremely cheap. But we
+* still need to have it, because when we first loop through cfs_rq's,
+* we can't possibly know which task we will pick. The call to
+* set_next_entity above is not meant to mess up the tree in this case,
+* so this should give us the same chain, in the same order.
+*/
+   if (unlikely(p == prev)) {
+   se = p-se;
+   for_each_sched_entity(se) {
+   cfs_rq = cfs_rq_of(se);
+   cfs_rq-nr_switches--;
+   }
+   }
+
return p;
 }
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 80c58fe..19ceed9 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1364,13 +1364,16 @@ static struct sched_rt_entity 
*pick_next_rt_entity(struct rq *rq,
return next;
 }
 
-static struct task_struct *_pick_next_task_rt(struct rq *rq)
+static struct task_struct *
+_pick_next_task_rt(struct rq *rq, struct task_struct *prev)
 {
struct sched_rt_entity *rt_se;
struct task_struct *p;
struct rt_rq *rt_rq  = rq-rt;
 
do {
+   if (likely(prev))
+   rt_rq-rt_nr_switches++;
rt_se = pick_next_rt_entity(rq, rt_rq);
BUG_ON(!rt_se);
rt_rq = group_rt_rq(rt_se);
@@ -1379,6 +1382,14 @@ static struct task_struct *_pick_next_task_rt(struct rq 
*rq)
p = rt_task_of(rt_se);
p-se.exec_start = rq-clock_task;
 
+   /* See fair.c for an explanation on this */
+   if (unlikely(p == prev)) {
+   for_each_sched_rt_entity(rt_se) {
+   rt_rq = rt_rq_of_se(rt_se);
+   rt_rq-rt_nr_switches--;
+   }
+   }
+
return p;
 }
 
@@ -1397,7 +1408,7 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev)
if (prev)
prev-sched_class-put_prev_task(rq, prev);
 
-   p = _pick_next_task_rt(rq);
+   p = _pick_next_task_rt(rq, prev);
 
/* The running task is never eligible for pushing */
if (p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 640aa14..a426abc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -229,6 +229,7 @@ struct cfs_rq {
unsigned int nr_spread_over;
 #endif
 
+   u64 nr_switches;
 #ifdef CONFIG_SMP
 /*
  * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
@@ -298,6 +299,8 @@ static inline int rt_bandwidth_enabled(void)
 struct rt_rq {
struct rt_prio_array active;
unsigned int rt_nr_running;
+   u64 rt_nr_switches;
+
 #if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED
struct {
int curr; /* highest queued 

[PATCH v6 11/12] sched: change nr_context_switches calculation.

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Glauber Costa glom...@parallels.com

This patch changes the calculation of nr_context_switches. The variable
nr_switches is now used to account for the number of transition to the
idle task, or stop task. It is removed from the schedule() path.

The total calculation can be made using the fact that the transitions to
fair and rt classes are recorded in the root_task_group. One can easily
derive the total figure by adding those quantities together.

Signed-off-by: Glauber Costa glom...@parallels.com
CC: Peter Zijlstra a.p.zijls...@chello.nl
CC: Paul Turner p...@google.com
---
 kernel/sched/core.c  | 17 +++--
 kernel/sched/idle_task.c |  3 +++
 kernel/sched/stop_task.c |  2 ++
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c36df03..6bb56f0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2001,13 +2001,27 @@ unsigned long nr_uninterruptible(void)
return sum;
 }
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+#define cfs_nr_switches(tg, cpu) (tg)-cfs_rq[cpu]-nr_switches
+#else
+#define cfs_nr_switches(tg, cpu) cpu_rq(cpu)-cfs.nr_switches
+#endif
+#ifdef CONFIG_RT_GROUP_SCHED
+#define rt_nr_switches(tg, cpu) (tg)-rt_rq[cpu]-rt_nr_switches
+#else
+#define rt_nr_switches(tg, cpu) cpu_rq(cpu)-rt.rt_nr_switches
+#endif
+
 unsigned long long nr_context_switches(void)
 {
int i;
unsigned long long sum = 0;
 
-   for_each_possible_cpu(i)
+   for_each_possible_cpu(i) {
+   sum += cfs_nr_switches(root_task_group, i);
+   sum += rt_nr_switches(root_task_group, i);
sum += cpu_rq(i)-nr_switches;
+   }
 
return sum;
 }
@@ -2929,7 +2943,6 @@ need_resched:
rq-skip_clock_update = 0;
 
if (likely(prev != next)) {
-   rq-nr_switches++;
rq-curr = next;
++*switch_count;
 
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index 07e6027..652d98c 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -28,6 +28,9 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev)
if (prev)
prev-sched_class-put_prev_task(rq, prev);
 
+   if (prev != rq-idle)
+   rq-nr_switches++;
+
schedstat_inc(rq, sched_goidle);
return rq-idle;
 }
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 5f10918..d1e9b82 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -32,6 +32,8 @@ pick_next_task_stop(struct rq *rq, struct task_struct *prev)
stop-se.exec_start = rq-clock_task;
if (prev)
prev-sched_class-put_prev_task(rq, prev);
+   if (prev != rq-stop)
+   rq-nr_switches++;
return stop;
}
 
-- 
1.8.1

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v6 12/12] sched: introduce cgroup file stat_percpu

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Glauber Costa glom...@parallels.com

The file cpu.stat_percpu will show various scheduler related
information, that are usually available to the top level through other
files.

For instance, most of the meaningful data in /proc/stat is presented
here. Given this file, a container can easily construct a local copy of
/proc/stat for internal consumption.

The data we export is comprised of:
* all the tick information, previously available only through cpuacct,
  like user time, system time, etc.

* wait time, which can be used to construct analogous information to
  steal time in hypervisors,

* nr_switches and nr_running, which are cgroup-local versions of
  their global counterparts.

The file format consists of a one-line header that describes the fields
being listed.  No guarantee is given that the fields will be kept the
same between kernel releases, and readers should always check the header
in order to introspect it.

Each of the following lines will show the respective field value for
each of the possible cpus in the system. All values are show in
nanoseconds.

One example output for this file is:

cpu user nice system irq softirq guest guest_nice wait nr_switches nr_running
cpu0 47100 0 1500 0 0 0 0 1996534 7205 1
cpu1 58800 0 1700 0 0 0 0 2848680 6510 1
cpu2 50500 0 1400 0 0 0 0 2350771 6183 1
cpu3 47200 0 1600 0 0 0 0 19766345 6277 2

Signed-off-by: Glauber Costa glom...@parallels.com
CC: Peter Zijlstra a.p.zijls...@chello.nl
CC: Paul Turner p...@google.com
---
 Documentation/cgroups/cpu.txt |  18 +++
 kernel/sched/core.c   | 109 ++
 kernel/sched/fair.c   |  14 ++
 kernel/sched/sched.h  |  11 -
 4 files changed, 150 insertions(+), 2 deletions(-)

diff --git a/Documentation/cgroups/cpu.txt b/Documentation/cgroups/cpu.txt
index e0ea075..2124320 100644
--- a/Documentation/cgroups/cpu.txt
+++ b/Documentation/cgroups/cpu.txt
@@ -68,6 +68,24 @@ The CPU controller exposes the following files to the user:
can ever be run in this cgroup. For more information about rt tasks runtime
assignments, see scheduler/sched-rt-group.txt
 
+ - cpu.stat_percpu: Various scheduler statistics for the current group. The
+   information provided in this file is akin to the one displayed in 
/proc/stat,
+   except for the fact that it is cgroup-aware. The file format consists of a
+   one-line header that describes the fields being listed.  No guarantee is
+   given that the fields will be kept the same between kernel releases, and
+   readers should always check the header in order to introspect it.
+
+   Each of the following lines will show the respective field value for
+   each of the possible cpus in the system. All values are show in
+   nanoseconds. One example output for this file is:
+
+   cpu user nice system irq softirq guest guest_nice wait nr_switches 
nr_running
+   cpu0 47100 0 1500 0 0 0 0 1996534 7205 1
+   cpu1 58800 0 1700 0 0 0 0 2848680 6510 1
+   cpu2 50500 0 1400 0 0 0 0 2350771 6183 1
+   cpu3 47200 0 1600 0 0 0 0 19766345 6277 2
+
+
  - cpuacct.usage: The aggregate CPU time, in nanoseconds, consumed by all tasks
in this group.
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6bb56f0..87437af 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7680,6 +7680,7 @@ static inline void cfs_exec_clock_reset(struct task_group 
*tg, int cpu)
 #else
 static inline u64 cfs_exec_clock(struct task_group *tg, int cpu)
 {
+   return 0;
 }
 
 static inline void cfs_exec_clock_reset(struct task_group *tg, int cpu)
@@ -8111,6 +8112,108 @@ static u64 cpu_rt_period_read_uint(struct cgroup *cgrp, 
struct cftype *cft)
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_SCHEDSTATS
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+#define fair_rq(field, tg, i)  (tg)-cfs_rq[i]-field
+#else
+#define fair_rq(field, tg, i)  0
+#endif
+
+#ifdef CONFIG_RT_GROUP_SCHED
+#define rt_rq(field, tg, i)  (tg)-rt_rq[i]-field
+#else
+#define rt_rq(field, tg, i)  0
+#endif
+
+static u64 tg_nr_switches(struct task_group *tg, int cpu)
+{
+   /* nr_switches, which counts idle and stop task, is added to all tgs */
+   return cpu_rq(cpu)-nr_switches +
+   cfs_nr_switches(tg, cpu) + rt_nr_switches(tg, cpu);
+}
+
+static u64 tg_nr_running(struct task_group *tg, int cpu)
+{
+   /*
+* because of autogrouped groups in root_task_group, the
+* following does not hold.
+*/
+   if (tg != root_task_group)
+   return rt_rq(rt_nr_running, tg, cpu) + fair_rq(nr_running, tg, 
cpu);
+
+   return cpu_rq(cpu)-nr_running;
+}
+
+static u64 tg_wait(struct task_group *tg, int cpu)
+{
+   u64 val;
+
+   if (tg != root_task_group)
+   val = cfs_read_wait(tg, cpu);
+   else
+   /*
+* There are many errors here that we are accumulating.
+* However, we only 

[PATCH v6 09/12] sched: Push put_prev_task() into pick_next_task()

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Peter Zijlstra a.p.zijls...@chello.nl

In order to avoid having to do put/set on a whole cgroup hierarchy
when we context switch, push the put into pick_next_task() so that
both operations are in the same function. Further changes then allow
us to possibly optimize away redundant work.

[ glom...@parallels.com: incorporated mailing list feedback ]

Signed-off-by: Peter Zijlstra a.p.zijls...@chello.nl
Signed-off-by: Glauber Costa glom...@parallels.com
---
 include/linux/sched.h|  8 +++-
 kernel/sched/core.c  | 20 +++-
 kernel/sched/fair.c  |  6 +-
 kernel/sched/idle_task.c |  6 +-
 kernel/sched/rt.c| 27 ---
 kernel/sched/stop_task.c |  5 -
 6 files changed, 44 insertions(+), 28 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 206bb08..31d86e5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1082,7 +1082,13 @@ struct sched_class {
 
void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int 
flags);
 
-   struct task_struct * (*pick_next_task) (struct rq *rq);
+   /*
+* It is the responsibility of the pick_next_task() method that will
+* return the next task to call put_prev_task() on the @prev task or
+* something equivalent.
+*/
+   struct task_struct * (*pick_next_task) (struct rq *rq,
+   struct task_struct *prev);
void (*put_prev_task) (struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f8a9acf..c36df03 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2807,18 +2807,11 @@ static inline void schedule_debug(struct task_struct 
*prev)
schedstat_inc(this_rq(), sched_count);
 }
 
-static void put_prev_task(struct rq *rq, struct task_struct *prev)
-{
-   if (prev-on_rq || rq-skip_clock_update  0)
-   update_rq_clock(rq);
-   prev-sched_class-put_prev_task(rq, prev);
-}
-
 /*
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-pick_next_task(struct rq *rq)
+pick_next_task(struct rq *rq, struct task_struct *prev)
 {
const struct sched_class *class;
struct task_struct *p;
@@ -2828,13 +2821,13 @@ pick_next_task(struct rq *rq)
 * the fair class we can call that function directly:
 */
if (likely(rq-nr_running == rq-cfs.h_nr_running)) {
-   p = fair_sched_class.pick_next_task(rq);
+   p = fair_sched_class.pick_next_task(rq, prev);
if (likely(p))
return p;
}
 
for_each_class(class) {
-   p = class-pick_next_task(rq);
+   p = class-pick_next_task(rq, prev);
if (p)
return p;
}
@@ -2929,8 +2922,9 @@ need_resched:
if (unlikely(!rq-nr_running))
idle_balance(cpu, rq);
 
-   put_prev_task(rq, prev);
-   next = pick_next_task(rq);
+   if (prev-on_rq || rq-skip_clock_update  0)
+   update_rq_clock(rq);
+   next = pick_next_task(rq, prev);
clear_tsk_need_resched(prev);
rq-skip_clock_update = 0;
 
@@ -4880,7 +4874,7 @@ static void migrate_tasks(unsigned int dead_cpu)
if (rq-nr_running == 1)
break;
 
-   next = pick_next_task(rq);
+   next = pick_next_task(rq, NULL);
BUG_ON(!next);
next-sched_class-put_prev_task(rq, next);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c15bc92..d59a106 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3595,7 +3595,8 @@ preempt:
set_last_buddy(se);
 }
 
-static struct task_struct *pick_next_task_fair(struct rq *rq)
+static struct task_struct *
+pick_next_task_fair(struct rq *rq, struct task_struct *prev)
 {
struct task_struct *p;
struct cfs_rq *cfs_rq = rq-cfs;
@@ -3604,6 +3605,9 @@ static struct task_struct *pick_next_task_fair(struct rq 
*rq)
if (!cfs_rq-nr_running)
return NULL;
 
+   if (prev)
+   prev-sched_class-put_prev_task(rq, prev);
+
do {
se = pick_next_entity(cfs_rq);
set_next_entity(cfs_rq, se);
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index b6baf37..07e6027 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -22,8 +22,12 @@ static void check_preempt_curr_idle(struct rq *rq, struct 
task_struct *p, int fl
resched_task(rq-idle);
 }
 
-static struct task_struct *pick_next_task_idle(struct rq *rq)
+static struct task_struct *
+pick_next_task_idle(struct rq *rq, struct task_struct *prev)
 {
+   if (prev)
+   prev-sched_class-put_prev_task(rq, prev);
+
schedstat_inc(rq, sched_goidle);
return rq-idle;
 }
diff --git a/kernel/sched/rt.c 

[PATCH v6 05/12] sched: adjust exec_clock to use it as cpu usage metric

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Glauber Costa glom...@parallels.com

exec_clock already provides per-group cpu usage metrics, and can be
reused by cpuacct in case cpu and cpuacct are comounted.

However, it is only provided by tasks in fair class. Doing the same for
rt is easy, and can be done in an already existing hierarchy loop. This
is an improvement over the independent hierarchy walk executed by
cpuacct.

Signed-off-by: Glauber Costa glom...@parallels.com
CC: Dave Jones da...@redhat.com
CC: Ben Hutchings b...@decadent.org.uk
CC: Peter Zijlstra a.p.zijls...@chello.nl
CC: Paul Turner p...@google.com
CC: Lennart Poettering lenn...@poettering.net
CC: Kay Sievers kay.siev...@vrfy.org
CC: Tejun Heo t...@kernel.org
---
 kernel/sched/rt.c| 1 +
 kernel/sched/sched.h | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f7e05d87..7f6f6c6 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -945,6 +945,7 @@ static void update_curr_rt(struct rq *rq)
 
for_each_sched_rt_entity(rt_se) {
rt_rq = rt_rq_of_se(rt_se);
+   schedstat_add(rt_rq, exec_clock, delta_exec);
 
if (sched_rt_runtime(rt_rq) != RUNTIME_INF) {
raw_spin_lock(rt_rq-rt_runtime_lock);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 84a339d..01ca8a4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -210,6 +210,7 @@ struct cfs_rq {
unsigned int nr_running, h_nr_running;
 
u64 exec_clock;
+   u64 prev_exec_clock;
u64 min_vruntime;
 #ifndef CONFIG_64BIT
u64 min_vruntime_copy;
@@ -312,6 +313,8 @@ struct rt_rq {
struct plist_head pushable_tasks;
 #endif
int rt_throttled;
+   u64 exec_clock;
+   u64 prev_exec_clock;
u64 rt_time;
u64 rt_runtime;
/* Nests inside the rq lock: */
-- 
1.8.1

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v6 07/12] sched: document the cpu cgroup.

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Glauber Costa glom...@parallels.com

The CPU cgroup is so far, undocumented. Although data exists in the
Documentation directory about its functioning, it is usually spread,
and/or presented in the context of something else. This file
consolidates all cgroup-related information about it.

Signed-off-by: Glauber Costa glom...@parallels.com
---
 Documentation/cgroups/cpu.txt | 82 +++
 1 file changed, 82 insertions(+)
 create mode 100644 Documentation/cgroups/cpu.txt

diff --git a/Documentation/cgroups/cpu.txt b/Documentation/cgroups/cpu.txt
new file mode 100644
index 000..e0ea075
--- /dev/null
+++ b/Documentation/cgroups/cpu.txt
@@ -0,0 +1,82 @@
+CPU Controller
+--
+
+The CPU controller is responsible for grouping tasks together that will be
+viewed by the scheduler as a single unit. The CFS scheduler will first divide
+CPU time equally between all entities in the same level, and then proceed by
+doing the same in the next level. Basic use cases for that are described in the
+main cgroup documentation file, cgroups.txt.
+
+Users of this functionality should be aware that deep hierarchies will of
+course impose scheduler overhead, since the scheduler will have to take extra
+steps and look up additional data structures to make its final decision.
+
+Through the CPU controller, the scheduler is also able to cap the CPU
+utilization of a particular group. This is particularly useful in environments
+in which CPU is paid for by the hour, and one values predictability over
+performance.
+
+CPU Accounting
+--
+
+The CPU cgroup will also provide additional files under the prefix cpuacct.
+Those files provide accounting statistics and were previously provided by the
+separate cpuacct controller. Although the cpuacct controller will still be kept
+around for compatibility reasons, its usage is discouraged. If both the CPU and
+cpuacct controllers are present in the system, distributors are encouraged to
+always mount them together.
+
+Files
+-
+
+The CPU controller exposes the following files to the user:
+
+ - cpu.shares: The weight of each group living in the same hierarchy, which
+ translates into the amount of CPU it is expected to get. Upon cgroup creation,
+ each group gets assigned a default of 1024. The percentage of CPU assigned to
+ the cgroup is the value of shares divided by the sum of all shares in all
+ cgroups in the same level.
+
+ - cpu.cfs_period_us: The duration in microseconds of each scheduler period, for
+ bandwidth decisions. This defaults to 100000us or 100ms. Larger periods will
+ improve throughput at the expense of latency, since the scheduler will be able
+ to sustain a cpu-bound workload for longer. The opposite is true for smaller
+ periods. Note that this only affects non-RT tasks that are scheduled by the
+ CFS scheduler.
+
+- cpu.cfs_quota_us: The maximum time in microseconds during each cfs_period_us
+  for which the current group will be allowed to run. For instance, if it is set
+  to half of cfs_period_us, the cgroup will only be able to run for at most 50 %
+  of the time. One should note that this represents aggregate time over all CPUs
+  in the system. Therefore, in order to allow full usage of two CPUs, for
+  instance, one should set this value to twice the value of cfs_period_us.
+
+- cpu.stat: statistics about the bandwidth controls. No data will be presented
+  if cpu.cfs_quota_us is not set. The file presents three
+  numbers:
+   nr_periods: how many full periods have elapsed.
+   nr_throttled: number of times we exhausted the full allowed bandwidth
+   throttled_time: total time the tasks were not run due to being over quota
+
+ - cpu.rt_runtime_us and cpu.rt_period_us: Those files are the RT-task
+   analogues of the CFS files cfs_quota_us and cfs_period_us. One important
+   difference, though, is that while the cfs quotas are upper bounds that
+   won't necessarily be met, the rt runtimes form a stricter guarantee.
+   Therefore, no overlap is allowed. An implication of that is that, given a
+   hierarchy with multiple children, the sum of all rt_runtime_us may not exceed
+   the runtime of the parent. Also, an rt_runtime_us of 0 means that no rt tasks
+   can ever be run in this cgroup. For more information about rt task runtime
+   assignments, see scheduler/sched-rt-group.txt
+
+ - cpuacct.usage: The aggregate CPU time, in nanoseconds, consumed by all tasks
+   in this group.
+
+ - cpuacct.usage_percpu: The CPU time, in nanoseconds, consumed by all tasks in
+   this group, separated by CPU. The format is a space-separated array of time
+   values, one for each present CPU.
+
+ - cpuacct.stat: aggregate user and system time consumed by tasks in this group.
+   The format is
+   user: x
+   system: y
+
-- 
1.8.1
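
To make the bandwidth semantics documented above concrete, here is a minimal
userspace sketch in C. The mount point /sys/fs/cgroup/cpu and the group name
"mygroup" are assumptions for illustration, not something defined by this
patch; the sketch simply writes a quota equal to twice the period, which the
text above describes as allowing full usage of two CPUs.

  /*
   * Hedged sketch: cap a cgroup at two CPUs worth of bandwidth.
   * Assumes the cpu controller is mounted at /sys/fs/cgroup/cpu and that
   * "mygroup" already exists; error handling is minimal.
   */
  #include <stdio.h>
  #include <stdlib.h>

  static int write_val(const char *path, long long val)
  {
          FILE *f = fopen(path, "w");

          if (!f)
                  return -1;
          fprintf(f, "%lld\n", val);
          return fclose(f);
  }

  int main(void)
  {
          long long period = 100000;       /* 100ms, the documented default */
          long long quota  = 2 * period;   /* aggregate time across all CPUs */

          if (write_val("/sys/fs/cgroup/cpu/mygroup/cpu.cfs_period_us", period) ||
              write_val("/sys/fs/cgroup/cpu/mygroup/cpu.cfs_quota_us", quota)) {
                  perror("write");
                  return EXIT_FAILURE;
          }
          return EXIT_SUCCESS;
  }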


[PATCH v6 08/12] sched: account guest time per-cgroup as well.

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Glauber Costa glom...@parallels.com

We already track multiple tick statistics per-cgroup, using
the task_group_account_field facility. This patch accounts
guest_time in that manner as well.

Signed-off-by: Glauber Costa glom...@parallels.com
CC: Peter Zijlstra a.p.zijls...@chello.nl
CC: Paul Turner p...@google.com
---
 kernel/sched/cputime.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index a4332f9..0685e71 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -190,8 +190,6 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
 static void account_guest_time(struct task_struct *p, cputime_t cputime,
   cputime_t cputime_scaled)
 {
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
-
/* Add guest time to process. */
	p->utime += cputime;
	p->utimescaled += cputime_scaled;
@@ -200,11 +198,11 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
 
/* Add guest time to cpustat. */
 	if (TASK_NICE(p) > 0) {
-		cpustat[CPUTIME_NICE] += (__force u64) cputime;
-		cpustat[CPUTIME_GUEST_NICE] += (__force u64) cputime;
+		task_group_account_field(p, CPUTIME_NICE, (__force u64) cputime);
+		task_group_account_field(p, CPUTIME_GUEST_NICE, (__force u64) cputime);
 	} else {
-		cpustat[CPUTIME_USER] += (__force u64) cputime;
-		cpustat[CPUTIME_GUEST] += (__force u64) cputime;
+		task_group_account_field(p, CPUTIME_USER, (__force u64) cputime);
+		task_group_account_field(p, CPUTIME_GUEST, (__force u64) cputime);
 	}
 }
 
-- 
1.8.1
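
The task_group_account_field() facility mentioned in the changelog charges a
tick's worth of time to every group in the task's hierarchy. As a rough,
hedged sketch of that idea (the name carries a _sketch suffix on purpose:
this is not the literal helper from the series, only an illustration that
assumes the per-cpu kernel_cpustat added to struct task_group earlier in the
set):

  /* Walk the task's group chain and charge "tmp" to the given cpustat
   * index on the local CPU, so every ancestor cgroup sees the charge. */
  static void task_group_account_field_sketch(struct task_struct *p,
                                              int index, u64 tmp)
  {
          struct task_group *tg;
          int cpu = task_cpu(p);

          rcu_read_lock();
          for (tg = task_group(p); tg; tg = tg->parent) {
                  struct kernel_cpustat *kstat = per_cpu_ptr(tg->cpustat, cpu);

                  kstat->cpustat[index] += tmp;
          }
          rcu_read_unlock();
  }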



[PATCH v6 00/12] per-cgroup cpu-stat

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Glauber Costa glom...@parallels.com

Hi all,

This is an attempt to provide userspace with enough information to reconstruct
a per-container version of files like /proc/stat. In particular, we are
interested in knowing the per-cgroup slices of user time, system time, wait
time, number of processes, and a variety of statistics.

This task is made more complicated by the fact that multiple controllers are
involved in collecting those statistics: cpu and cpuacct. So the first thing I
am doing here is resurrecting Tejun's patches that aim at deprecating cpuacct.

This is one of the major differences from earlier attempts: all data is provided
by the cpu controller, resulting in greater simplicity. Please note, however,
that this patchset only goes as far as deprecating it: cpuacct can still be
mounted separately from the cpu cgroup if the user so wishes.

This also tries to hook into the existing scheduler hierarchy walks instead of
providing new ones.
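
To make the "reconstruct /proc/stat" goal concrete, a hedged userspace sketch
follows: it samples a group's cpuacct.usage file (aggregate CPU time in
nanoseconds, as documented in the cpu.txt patch of this series) twice and
derives the group's average CPU utilization over one second. The mount path
and group name are assumptions for illustration only.

  #include <stdio.h>
  #include <unistd.h>

  static unsigned long long read_usage(const char *path)
  {
          unsigned long long ns = 0;
          FILE *f = fopen(path, "r");

          if (f) {
                  if (fscanf(f, "%llu", &ns) != 1)
                          ns = 0;
                  fclose(f);
          }
          return ns;
  }

  int main(void)
  {
          const char *path = "/sys/fs/cgroup/cpu/mygroup/cpuacct.usage";
          unsigned long long before, after;

          before = read_usage(path);
          sleep(1);
          after = read_usage(path);

          /* 1e9 ns per second of wall time; >100% means more than one CPU used */
          printf("utilization: %.1f%%\n", (after - before) / 1e9 * 100.0);
          return 0;
  }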


Glauber Costa (8):
  don't call cpuacct_charge in stop_task.c
  sched: adjust exec_clock to use it as cpu usage metric
  cpuacct: don't actually do anything.
  sched: document the cpu cgroup.
  sched: account guest time per-cgroup as well.
  sched: record per-cgroup number of context switches
  sched: change nr_context_switches calculation.
  sched: introduce cgroup file stat_percpu

Peter Zijlstra (1):
  sched: Push put_prev_task() into pick_next_task()

Tejun Heo (3):
  cgroup: implement CFTYPE_NO_PREFIX
  cgroup, sched: let cpu serve the same files as cpuacct
  cgroup, sched: deprecate cpuacct

 Documentation/cgroups/cpu.txt | 100 +++
 include/linux/cgroup.h|   1 +
 include/linux/sched.h |   8 +-
 init/Kconfig  |  11 +-
 kernel/cgroup.c   |  57 ++-
 kernel/sched/core.c   | 387 --
 kernel/sched/cputime.c|  29 +++-
 kernel/sched/fair.c   |  39 -
 kernel/sched/idle_task.c  |   9 +-
 kernel/sched/rt.c |  42 +++--
 kernel/sched/sched.h  |  28 ++-
 kernel/sched/stop_task.c  |   8 +-
 12 files changed, 672 insertions(+), 47 deletions(-)
 create mode 100644 Documentation/cgroups/cpu.txt

-- 
1.8.1



[PATCH v6 03/12] cgroup, sched: let cpu serve the same files as cpuacct

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Tejun Heo t...@kernel.org

cpuacct being on a separate hierarchy is one of the main cgroup
related complaints from scheduler side and the consensus seems to be

* Allowing cpuacct to be a separate controller was a mistake.  In
  general multiple controllers on the same type of resource should be
  avoided, especially accounting-only ones.

* Statistics provided by cpuacct are useful and should instead be
  served by cpu.

This patch makes cpu maintain and serve all cpuacct.* files and makes
cgroup core ignore cpuacct if it's co-mounted with cpu.  This is a
step in deprecating cpuacct.  The next patch will allow disabling or
dropping cpuacct without affecting userland too much.

Note that this creates some discrepancies in /proc/cgroups and
/proc/PID/cgroup.  The co-mounted cpuacct won't be reflected correctly
there.  cpuacct will eventually be removed completely, probably except
for the statistics filenames, and I'd like to keep the amount of
compatibility hackery to a minimum as much as possible.

The cpu statistics implementation isn't optimized in any way.  It's
mostly a verbatim copy from cpuacct.  The goal is allowing quick
disabling and removal of CONFIG_CGROUP_CPUACCT and creating a base on
top of which cpu can implement proper optimization.

[ glommer: don't call *_charge in stop_task.c ]

Signed-off-by: Tejun Heo t...@kernel.org
Signed-off-by: Glauber Costa glom...@parallels.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Michal Hocko mho...@suse.cz
Cc: Kay Sievers kay.siev...@vrfy.org
Cc: Lennart Poettering mzxre...@0pointer.de
Cc: Dave Jones da...@redhat.com
Cc: Ben Hutchings b...@decadent.org.uk
Cc: Paul Turner p...@google.com
---
 kernel/cgroup.c|  13 
 kernel/sched/core.c| 173 +
 kernel/sched/cputime.c |  19 +-
 kernel/sched/fair.c|   1 +
 kernel/sched/rt.c  |   1 +
 kernel/sched/sched.h   |   7 ++
 6 files changed, 212 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2f98398..0750669d 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1248,6 +1248,19 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
/* Consistency checks */
 
/*
+* cpuacct is deprecated and cpu will serve the same stat files.
+* If co-mount with cpu is requested, ignore cpuacct.  Note that
+* this creates some discrepancies in /proc/cgroups and
+* /proc/PID/cgroup.
+*
+* https://lkml.org/lkml/2012/9/13/542
+*/
+#if IS_ENABLED(CONFIG_CGROUP_SCHED) && IS_ENABLED(CONFIG_CGROUP_CPUACCT)
+	if ((opts->subsys_bits & (1 << cpu_cgroup_subsys_id)) &&
+	    (opts->subsys_bits & (1 << cpuacct_subsys_id)))
+		opts->subsys_bits &= ~(1 << cpuacct_subsys_id);
+#endif
+   /*
 * Option noprefix was introduced just for backward compatibility
 * with the old cpuset, so we allow noprefix only if mounting just
 * the cpuset subsystem.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 257002c..6516694 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6811,6 +6811,7 @@ int in_sched_functions(unsigned long addr)
 #ifdef CONFIG_CGROUP_SCHED
 struct task_group root_task_group;
 LIST_HEAD(task_groups);
+static DEFINE_PER_CPU(u64, root_tg_cpuusage);
 #endif
 
 DECLARE_PER_CPU(cpumask_var_t, load_balance_tmpmask);
@@ -6869,6 +6870,8 @@ void __init sched_init(void)
 #endif /* CONFIG_RT_GROUP_SCHED */
 
 #ifdef CONFIG_CGROUP_SCHED
+	root_task_group.cpustat = &kernel_cpustat;
+	root_task_group.cpuusage = &root_tg_cpuusage;
 	list_add(&root_task_group.list, &task_groups);
 	INIT_LIST_HEAD(&root_task_group.children);
 	INIT_LIST_HEAD(&root_task_group.siblings);
@@ -7152,6 +7155,8 @@ static void free_sched_group(struct task_group *tg)
free_fair_sched_group(tg);
free_rt_sched_group(tg);
autogroup_free(tg);
+	free_percpu(tg->cpuusage);
+	free_percpu(tg->cpustat);
kfree(tg);
 }
 
@@ -7165,6 +7170,11 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!tg)
return ERR_PTR(-ENOMEM);
 
+	tg->cpuusage = alloc_percpu(u64);
+	tg->cpustat = alloc_percpu(struct kernel_cpustat);
+	if (!tg->cpuusage || !tg->cpustat)
+		goto err;
+
if (!alloc_fair_sched_group(tg, parent))
goto err;
 
@@ -7256,6 +7266,24 @@ void sched_move_task(struct task_struct *tsk)
 
task_rq_unlock(rq, tsk, flags);
 }
+
+void task_group_charge(struct task_struct *tsk, u64 cputime)
+{
+   struct task_group *tg;
+   int cpu = task_cpu(tsk);
+
+   rcu_read_lock();
+
+   tg = container_of(task_subsys_state(tsk, cpu_cgroup_subsys_id),
+ struct task_group, css);
+
+	for (; tg; tg = tg->parent) {
+		u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
+   *cpuusage += cputime;
+   }
+
+   

[PATCH v6 06/12] cpuacct: don't actually do anything.

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Glauber Costa glom...@parallels.com

All the information we have that is needed for cpuusage (and
cpuusage_percpu) is present in schedstats. It is already recorded
in a sane hierarchical way.

If we have CONFIG_SCHEDSTATS, we don't really need to do any extra
work. All former functions become empty inlines.

Signed-off-by: Glauber Costa glom...@parallels.com
Cc: Peter Zijlstra pet...@infradead.org
Cc: Michal Hocko mho...@suse.cz
Cc: Kay Sievers kay.siev...@vrfy.org
Cc: Lennart Poettering mzxre...@0pointer.de
Cc: Dave Jones da...@redhat.com
Cc: Ben Hutchings b...@decadent.org.uk
Cc: Paul Turner p...@google.com
---
 kernel/sched/core.c  | 102 ++-
 kernel/sched/sched.h |  10 +++--
 2 files changed, 90 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a62b771..f8a9acf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7267,6 +7267,7 @@ void sched_move_task(struct task_struct *tsk)
task_rq_unlock(rq, tsk, flags);
 }
 
+#ifndef CONFIG_SCHEDSTATS
 void task_group_charge(struct task_struct *tsk, u64 cputime)
 {
struct task_group *tg;
@@ -7284,6 +7285,7 @@ void task_group_charge(struct task_struct *tsk, u64 cputime)
 
rcu_read_unlock();
 }
+#endif
 #endif /* CONFIG_CGROUP_SCHED */
 
 #if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_CFS_BANDWIDTH)
@@ -7640,22 +7642,92 @@ cpu_cgroup_exit(struct cgroup *cgrp, struct cgroup *old_cgrp,
sched_move_task(task);
 }
 
-static u64 task_group_cpuusage_read(struct task_group *tg, int cpu)
+/*
+ * Take rq->lock to make 64-bit write safe on 32-bit platforms.
+ */
+static inline void lock_rq_dword(int cpu)
 {
-	u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
-   u64 data;
-
 #ifndef CONFIG_64BIT
-   /*
-	 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
-*/
 	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
-   data = *cpuusage;
+#endif
+}
+
+static inline void unlock_rq_dword(int cpu)
+{
+#ifndef CONFIG_64BIT
 	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+#endif
+}
+
+#ifdef CONFIG_SCHEDSTATS
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline u64 cfs_exec_clock(struct task_group *tg, int cpu)
+{
+	return tg->cfs_rq[cpu]->exec_clock - tg->cfs_rq[cpu]->prev_exec_clock;
+}
+
+static inline void cfs_exec_clock_reset(struct task_group *tg, int cpu)
+{
+	tg->cfs_rq[cpu]->prev_exec_clock = tg->cfs_rq[cpu]->exec_clock;
+}
 #else
-   data = *cpuusage;
+static inline u64 cfs_exec_clock(struct task_group *tg, int cpu)
+{
+	return 0;
+}
+
+static inline void cfs_exec_clock_reset(struct task_group *tg, int cpu)
+{
+}
+#endif
+#ifdef CONFIG_RT_GROUP_SCHED
+static inline u64 rt_exec_clock(struct task_group *tg, int cpu)
+{
+	return tg->rt_rq[cpu]->exec_clock - tg->rt_rq[cpu]->prev_exec_clock;
+}
+
+static inline void rt_exec_clock_reset(struct task_group *tg, int cpu)
+{
+	tg->rt_rq[cpu]->prev_exec_clock = tg->rt_rq[cpu]->exec_clock;
+}
+#else
+static inline u64 rt_exec_clock(struct task_group *tg, int cpu)
+{
+   return 0;
+}
+
+static inline void rt_exec_clock_reset(struct task_group *tg, int cpu)
+{
+}
 #endif
 
+static u64 task_group_cpuusage_read(struct task_group *tg, int cpu)
+{
+   u64 ret = 0;
+
+   lock_rq_dword(cpu);
+   ret = cfs_exec_clock(tg, cpu) + rt_exec_clock(tg, cpu);
+   unlock_rq_dword(cpu);
+
+   return ret;
+}
+
+static void task_group_cpuusage_write(struct task_group *tg, int cpu, u64 val)
+{
+   lock_rq_dword(cpu);
+   cfs_exec_clock_reset(tg, cpu);
+   rt_exec_clock_reset(tg, cpu);
+   unlock_rq_dword(cpu);
+}
+#else
+static u64 task_group_cpuusage_read(struct task_group *tg, int cpu)
+{
+	u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
+   u64 data;
+
+   lock_rq_dword(cpu);
+   data = *cpuusage;
+   unlock_rq_dword(cpu);
+
return data;
 }
 
@@ -7663,17 +7735,11 @@ static void task_group_cpuusage_write(struct task_group *tg, int cpu, u64 val)
 {
 	u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
 
-#ifndef CONFIG_64BIT
-   /*
-	 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
-*/
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+   lock_rq_dword(cpu);
*cpuusage = val;
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
-   *cpuusage = val;
-#endif
+   unlock_rq_dword(cpu);
 }
+#endif
 
 /* return total cpu usage (in nanoseconds) of a group */
 static u64 cpucg_cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 01ca8a4..640aa14 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -597,8 +597,6 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
 #endif
 }
 
-extern void task_group_charge(struct task_struct *tsk, u64 cputime);
-
 #else /* CONFIG_CGROUP_SCHED */
 
 static inline void set_task_rq(struct task_struct *p, unsigned int cpu) { }
@@ -606,10 +604,14 @@ static 

[PATCH v6 04/12] cgroup, sched: deprecate cpuacct

2013-01-24 Thread Lord Glauber Costa of Sealand
From: Tejun Heo t...@kernel.org

Now that cpu serves the same files as cpuacct and using cpuacct
separately from cpu is deprecated, we can deprecate cpuacct.  To avoid
disturbing userland which has been co-mounting cpu and cpuacct,
implement some hackery in cgroup core so that cpuacct co-mounting
still works even if cpuacct is disabled.

The goal of this patch is to accelerate disabling and removal of
cpuacct by decoupling kernel-side deprecation from userland changes.
Userland is recommended to do the following.

* If /proc/cgroups lists cpuacct, always co-mount it with cpu under
  e.g. /sys/fs/cgroup/cpu.

* Optionally create symlinks for compatibility -
  e.g. /sys/fs/cgroup/cpuacct and /sys/fs/cgroup/cpu,cpuacct both
  pointing to /sys/fs/cgroup/cpu - whether cpuacct exists or not.

This compatibility hack will eventually go away.
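
As a hedged illustration of the recommendation above (not part of the patch),
here is a minimal C sketch using mount(2) and symlink(2); it assumes
/sys/fs/cgroup is already a tmpfs where directories can be created and omits
most error handling:

  #include <sys/mount.h>
  #include <sys/stat.h>
  #include <unistd.h>
  #include <stdio.h>

  int main(void)
  {
          mkdir("/sys/fs/cgroup/cpu", 0755);

          /* Co-mount cpu and cpuacct; with this series the cpuacct option is
           * accepted (and ignored) even if the controller is compiled out. */
          if (mount("cgroup", "/sys/fs/cgroup/cpu", "cgroup", 0, "cpu,cpuacct")) {
                  perror("mount");
                  return 1;
          }

          /* Optional compatibility symlinks, whether or not cpuacct exists. */
          symlink("cpu", "/sys/fs/cgroup/cpuacct");
          symlink("cpu", "/sys/fs/cgroup/cpu,cpuacct");
          return 0;
  }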

[ glom...@parallels.com: subsys_bits = subsys_mask ]

Signed-off-by: Tejun Heo t...@kernel.org
Cc: Peter Zijlstra pet...@infradead.org
Cc: Glauber Costa glom...@parallels.com
Cc: Michal Hocko mho...@suse.cz
Cc: Kay Sievers kay.siev...@vrfy.org
Cc: Lennart Poettering mzxre...@0pointer.de
Cc: Dave Jones da...@redhat.com
Cc: Ben Hutchings b...@decadent.org.uk
Cc: Paul Turner p...@google.com
---
 init/Kconfig| 11 ++-
 kernel/cgroup.c | 47 ++-
 kernel/sched/core.c |  2 ++
 3 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 7d30240..4e411ac 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -815,11 +815,20 @@ config PROC_PID_CPUSET
default y
 
 config CGROUP_CPUACCT
-	bool "Simple CPU accounting cgroup subsystem"
+	bool "DEPRECATED: Simple CPU accounting cgroup subsystem"
+   default n
help
  Provides a simple Resource Controller for monitoring the
  total CPU consumed by the tasks in a cgroup.
 
+ This cgroup subsystem is deprecated.  The CPU cgroup
+ subsystem serves the same accounting files and cpuacct
+ mount option is ignored if specified with cpu.  As long as
+ userland co-mounts cpu and cpuacct, disabling this
+ controller should be mostly unnoticeable - one notable
+ difference is that /proc/PID/cgroup won't list cpuacct
+ anymore.
+
 config RESOURCE_COUNTERS
bool Resource counters
help
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 0750669d..4ddb335 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1136,6 +1136,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
unsigned long mask = (unsigned long)-1;
int i;
bool module_pin_failed = false;
+   bool cpuacct_requested = false;
 
BUG_ON(!mutex_is_locked(cgroup_mutex));
 
@@ -1225,8 +1226,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 
break;
}
-   if (i == CGROUP_SUBSYS_COUNT)
+   /* handle deprecated cpuacct specially, see below */
+		if (!strcmp(token, "cpuacct")) {
+   cpuacct_requested = true;
+   one_ss = true;
+   } else if (i == CGROUP_SUBSYS_COUNT) {
return -ENOENT;
+   }
}
 
/*
@@ -1253,12 +1259,29 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 * this creates some discrepancies in /proc/cgroups and
 * /proc/PID/cgroup.
 *
+* Accept and ignore cpuacct option if comounted with cpu even
+* when cpuacct itself is disabled to allow quick disabling and
+* removal of cpuacct.  This will be removed eventually.
+*
 * https://lkml.org/lkml/2012/9/13/542
 */
+   if (cpuacct_requested) {
+   bool comounted = false;
+
+#if IS_ENABLED(CONFIG_CGROUP_SCHED)
+		comounted = opts->subsys_mask & (1 << cpu_cgroup_subsys_id);
+#endif
+   if (!comounted) {
+			pr_warning("cgroup: mounting cpuacct separately from cpu is deprecated\n");
+#if !IS_ENABLED(CONFIG_CGROUP_CPUACCT)
+   return -EINVAL;
+#endif
+   }
+   }
 #if IS_ENABLED(CONFIG_CGROUP_SCHED) && IS_ENABLED(CONFIG_CGROUP_CPUACCT)
-	if ((opts->subsys_bits & (1 << cpu_cgroup_subsys_id)) &&
-	    (opts->subsys_bits & (1 << cpuacct_subsys_id)))
-		opts->subsys_bits &= ~(1 << cpuacct_subsys_id);
+	if ((opts->subsys_mask & (1 << cpu_cgroup_subsys_id)) &&
+	    (opts->subsys_mask & (1 << cpuacct_subsys_id)))
+		opts->subsys_mask &= ~(1 << cpuacct_subsys_id);
 #endif
/*
 * Option noprefix was introduced just for backward compatibility
@@ -4806,6 +4829,7 @@ const struct file_operations proc_cgroup_operations = {
 /* Display information about each subsystem and each hierarchy */
 static int proc_cgroupstats_show(struct seq_file *m, void *v)
 {
+   struct