Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

2015-12-04 Thread Waiman Long

On 12/03/2015 09:07 PM, Mike Galbraith wrote:

On Thu, 2015-12-03 at 14:34 -0500, Waiman Long wrote:

On 12/02/2015 11:32 PM, Mike Galbraith wrote:

Is that with the box booted skew_tick=1?

I haven't tried that kernel parameter. I will try it to see if it can
improve the situation. BTW, will there be other undesirable side effects
of using this other than the increased power consumption as said in the
kernel-parameters.txt file?

Not that are known.  I kinda doubt you'd notice the power, but you
should see a notable performance boost.  Who knows, with a big enough
farm of busy big boxen, it may save power by needing fewer of them.

-Mike



You are right. Using skew_tick=1 did improve performance and reduced the 
overhead of clock tick processing rather significantly. I think it 
should be the default for large SMP boxes. Thanks for the tip.


Cheers,
Longman


Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

2015-12-03 Thread Mike Galbraith
On Thu, 2015-12-03 at 14:34 -0500, Waiman Long wrote:
> On 12/02/2015 11:32 PM, Mike Galbraith wrote:

> > Is that with the box booted skew_tick=1?

> I haven't tried that kernel parameter. I will try it to see if it can 
> improve the situation. BTW, will there be other undesirable side effects 
> of using this other than the increased power consumption as said in the 
> kernel-parameters.txt file?

Not that are known.  I kinda doubt you'd notice the power, but you
should see a notable performance boost.  Who knows, with a big enough
farm of busy big boxen, it may save power by needing fewer of them.

-Mike 



Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

2015-12-03 Thread Peter Zijlstra
On Thu, Dec 03, 2015 at 02:56:37PM -0500, Waiman Long wrote:
> >  #ifdef CONFIG_CGROUP_SCHED
> >+task_group_cache = KMEM_CACHE(task_group, 0);
> >+
> Thanks for making that change.
> 
> Do we need to add the flag SLAB_HWCACHE_ALIGN? Or we could make a helper
> flag that defines SLAB_HWCACHE_ALIGN if CONFIG_FAIR_GROUP_SCHED is defined.
> Other than that, I am fine with the change.

I don't think we need that, see my reply earlier to Ben.


Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

2015-12-03 Thread Waiman Long

On 12/03/2015 06:12 AM, Peter Zijlstra wrote:


I made this:

---
Subject: sched/fair: Move hot load_avg into its own cacheline
From: Waiman Long
Date: Wed, 2 Dec 2015 13:41:49 -0500

If a system with a large number of sockets was driven to full
utilization, it was found that the clock tick handling occupied a
rather significant proportion of CPU time when fair group scheduling
and autogroup were enabled.

Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
profile looked like:

   10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
    9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
    8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
    8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
    8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
    6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
    5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares

In particular, the high CPU time consumed by update_cfs_shares()
was mostly due to contention on the cacheline that contained the
task_group's load_avg statistical counter. This cacheline may also
contain variables like shares, cfs_rq & se which are accessed rather 
frequently during clock tick processing.

This patch moves the load_avg variable into another cacheline
separated from the other frequently accessed variables. It also
creates a cacheline aligned kmemcache for task_group to make sure
that all the allocated task_group's are cacheline aligned.

By doing so, the perf profile became:

    9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
    8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
    7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
    7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
    7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
    5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
    4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares

The %cpu time is still pretty high, but it is better than before. The
benchmark results before and after the patch were as follows:

   Before patch - Max-jOPs: 907533    Critical-jOps: 134877
   After patch  - Max-jOPs: 916011    Critical-jOps: 142366

Cc: Scott J Norton
Cc: Douglas Hatch
Cc: Ingo Molnar
Cc: Yuyang Du
Cc: Paul Turner
Cc: Ben Segall
Cc: Morten Rasmussen
Signed-off-by: Waiman Long
Signed-off-by: Peter Zijlstra (Intel)
Link: 
http://lkml.kernel.org/r/1449081710-20185-3-git-send-email-waiman.l...@hpe.com
---
  kernel/sched/core.c  |   10 +++---
  kernel/sched/sched.h |7 ++-
  2 files changed, 13 insertions(+), 4 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7345,6 +7345,9 @@ int in_sched_functions(unsigned long add
   */
  struct task_group root_task_group;
  LIST_HEAD(task_groups);
+
+/* Cacheline aligned slab cache for task_group */
+static struct kmem_cache *task_group_cache __read_mostly;
  #endif

  DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
@@ -7402,11 +7405,12 @@ void __init sched_init(void)
  #endif /* CONFIG_RT_GROUP_SCHED */

  #ifdef CONFIG_CGROUP_SCHED
+   task_group_cache = KMEM_CACHE(task_group, 0);
+

Thanks for making that change.

Do we need to add the flag SLAB_HWCACHE_ALIGN? Or we could make a helper 
flag that defines SLAB_HWCACHE_ALIGN if CONFIG_FAIR_GROUP_SCHED is 
defined. Other than that, I am fine with the change.


Cheers,
Longman


Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

2015-12-03 Thread bsegall
Waiman Long  writes:

> On 12/02/2015 03:02 PM, bseg...@google.com wrote:
>> Waiman Long  writes:
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index efd3bfc..e679895 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -248,7 +248,12 @@ struct task_group {
>>> unsigned long shares;
>>>
>>>   #ifdef CONFIG_SMP
>>> -   atomic_long_t load_avg;
>>> +   /*
>>> +* load_avg can be heavily contended at clock tick time, so put
>>> +* it in its own cacheline separated from the fields above which
>>> +* will also be accessed at each tick.
>>> +*/
>>> +   atomic_long_t load_avg ____cacheline_aligned;
>>>   #endif
>>>   #endif
>> I suppose the question is if it would be better to just move this to
>> wind up on a separate cacheline without the extra empty space, though it
>> would likely be more fragile and unclear.
>
> I have been thinking about that too. The problem is anything that will be
> in the same cacheline as load_avg and has to be accessed at clock tick
> time will cause the same contention problem. In the current layout, the
> fields after load_avg are the rt stuff as well as some list head structures
> and pointers. The rt stuff should be kind of mutually exclusive of the CFS
> load_avg in terms of usage. The list head structures and pointers don't
> seem to be that frequently accessed. So it is the right place to start a
> new cacheline boundary.
>
> Cheers,
> Longman

Yeah, this is a good place to start a new boundary, I was just saying
you could probably remove the empty space by reordering fields, but that
would be a less logical ordering in terms of programmer clarity.
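
For reference, a minimal user-space sketch of that trade-off (the toy structs
and the hard-coded 64-byte line size are illustrative assumptions, not the
real task_group layout). Printing the offsets shows the padding hole the
aligned member creates and that reordering would avoid:

#include <stdio.h>
#include <stddef.h>

#define CACHE_LINE 64	/* assumed L1 cacheline size */

struct tg_padded {
	unsigned long shares;					/* read-mostly  */
	long load_avg __attribute__((aligned(CACHE_LINE)));	/* hot counter  */
	long rt_runtime;					/* later fields */
};

struct tg_reordered {			/* same fields, hot counter moved last */
	unsigned long shares;
	long rt_runtime;
	long load_avg;
};

int main(void)
{
	printf("padded:    load_avg at %zu, sizeof %zu\n",
	       offsetof(struct tg_padded, load_avg), sizeof(struct tg_padded));
	printf("reordered: load_avg at %zu, sizeof %zu\n",
	       offsetof(struct tg_reordered, load_avg), sizeof(struct tg_reordered));
	return 0;
}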


Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

2015-12-03 Thread Waiman Long

On 12/03/2015 05:56 AM, Peter Zijlstra wrote:

On Wed, Dec 02, 2015 at 01:41:49PM -0500, Waiman Long wrote:

+/*
+ * Make sure that the task_group structure is cacheline aligned when
+ * fair group scheduling is enabled.
+ */
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline struct task_group *alloc_task_group(void)
+{
+   return kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
+}
+
+static inline void free_task_group(struct task_group *tg)
+{
+   kmem_cache_free(task_group_cache, tg);
+}
+#else /* CONFIG_FAIR_GROUP_SCHED */
+static inline struct task_group *alloc_task_group(void)
+{
+   return kzalloc(sizeof(struct task_group), GFP_KERNEL);
+}
+
+static inline void free_task_group(struct task_group *tg)
+{
+   kfree(tg);
+}
+#endif /* CONFIG_FAIR_GROUP_SCHED */

I think we can simply always use the kmem_cache, both slab and slub
merge slabcaches where appropriate.


I did that as I was not sure how much overhead the introduction of 
a new kmem_cache would bring. It seems like it is not really an issue, so I am 
fine with making the kmem_cache change permanent.


Cheers,
Longman


Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

2015-12-03 Thread Waiman Long

On 12/02/2015 11:32 PM, Mike Galbraith wrote:

On Wed, 2015-12-02 at 13:41 -0500, Waiman Long wrote:


By doing so, the perf profile became:

9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares

The %cpu time is still pretty high, but it is better than before.

Is that with the box booted skew_tick=1?

-Mike



I haven't tried that kernel parameter. I will try it to see if it can 
improve the situation. BTW, will there be other undesirable side effects 
of using this other than the increased power consumption as said in the 
kernel-parameters.txt file?


Cheers,
Longman


Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

2015-12-03 Thread Waiman Long

On 12/02/2015 03:02 PM, bseg...@google.com wrote:

Waiman Long  writes:


If a system with a large number of sockets was driven to full
utilization, it was found that the clock tick handling occupied a
rather significant proportion of CPU time when fair group scheduling
and autogroup were enabled.

Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
profile looked like:

   10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
    9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
    8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
    8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
    8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
    6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
    5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares

In particular, the high CPU time consumed by update_cfs_shares()
was mostly due to contention on the cacheline that contained the
task_group's load_avg statistical counter. This cacheline may also
contain variables like shares, cfs_rq & se which are accessed rather 
frequently during clock tick processing.

This patch moves the load_avg variable into another cacheline
separated from the other frequently accessed variables. It also
creates a cacheline aligned kmemcache for task_group to make sure
that all the allocated task_group's are cacheline aligned.

By doing so, the perf profile became:

    9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
    8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
    7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
    7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
    7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
    5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
    4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares

The %cpu time is still pretty high, but it is better than before. The
benchmark results before and after the patch were as follows:

   Before patch - Max-jOPs: 907533    Critical-jOps: 134877
   After patch  - Max-jOPs: 916011    Critical-jOps: 142366

Signed-off-by: Waiman Long
---
  kernel/sched/core.c  |   36 ++--
  kernel/sched/sched.h |7 ++-
  2 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4d568ac..e39204f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7331,6 +7331,11 @@ int in_sched_functions(unsigned long addr)
   */
  struct task_group root_task_group;
  LIST_HEAD(task_groups);
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+/* Cacheline aligned slab cache for task_group */
+static struct kmem_cache *task_group_cache __read_mostly;
+#endif
  #endif

  DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
@@ -7356,6 +7361,7 @@ void __init sched_init(void)
root_task_group.cfs_rq = (struct cfs_rq **)ptr;
ptr += nr_cpu_ids * sizeof(void **);

+   task_group_cache = KMEM_CACHE(task_group, SLAB_HWCACHE_ALIGN);

The KMEM_CACHE macro suggests instead adding
____cacheline_aligned_in_smp to the struct definition.


The main goal is to have load_avg placed in a new cacheline 
separated from the read-only fields above. That is why I placed 
____cacheline_aligned after load_avg. I omitted the _in_smp part because 
it is in the SMP block already. Putting ____cacheline_aligned_in_smp on 
the struct itself won't guarantee the alignment of any field within the 
structure.


I have done some tests, and having ____cacheline_aligned inside the 
structure has the same effect of forcing the whole structure onto a 
cacheline-aligned boundary.
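
A quick way to double-check that behaviour in user space (toy type only;
alignas(64) stands in for ____cacheline_aligned and 64 bytes is an assumed
line size):

#include <stdio.h>
#include <stdalign.h>

struct toy_tg {
	unsigned long shares;
	alignas(64) long load_avg;	/* stand-in for ____cacheline_aligned */
};

/* an aligned member propagates to the composite type ... */
_Static_assert(alignof(struct toy_tg) == 64, "struct alignment follows member");
/* ... and the size is rounded up, so consecutive objects stay aligned */
_Static_assert(sizeof(struct toy_tg) % 64 == 0, "size rounded to alignment");

int main(void)
{
	printf("alignof %zu, sizeof %zu\n",
	       alignof(struct toy_tg), sizeof(struct toy_tg));
	return 0;
}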



  #endif /* CONFIG_FAIR_GROUP_SCHED */
  #ifdef CONFIG_RT_GROUP_SCHED
root_task_group.rt_se = (struct sched_rt_entity **)ptr;
@@ -7668,12 +7674,38 @@ void set_curr_task(int cpu, struct task_struct *p)
  /* task_group_lock serializes the addition/removal of task groups */
  static DEFINE_SPINLOCK(task_group_lock);

+/*
+ * Make sure that the task_group structure is cacheline aligned when
+ * fair group scheduling is enabled.
+ */
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline struct task_group *alloc_task_group(void)
+{
+   return kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
+}
+
+static inline void free_task_group(struct task_group *tg)
+{
+   kmem_cache_free(task_group_cache, tg);
+}
+#else /* CONFIG_FAIR_GROUP_SCHED */
+static inline struct task_group *alloc_task_group(void)
+{
+   return kzalloc(sizeof(struct task_group), GFP_KERNEL);
+}
+
+static inline void free_task_group(struct task_group *tg)
+{
+   kfree(tg);
+}
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
  static void free_sched_group(struct task_group *tg)
  {
free_fair_sched_group(tg);
free_rt_sched_group(tg);
autogroup_free(tg);
-   kfree(tg);
+   free_task_group(tg);
  }

  /* allocate runqueue etc for a new task group */
@@ 

Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

2015-12-03 Thread bsegall
Peter Zijlstra  writes:

> On Thu, Dec 03, 2015 at 09:56:02AM -0800, bseg...@google.com wrote:
>> Peter Zijlstra  writes:
>
>> > @@ -7402,11 +7405,12 @@ void __init sched_init(void)
>> >  #endif /* CONFIG_RT_GROUP_SCHED */
>> >  
>> >  #ifdef CONFIG_CGROUP_SCHED
>> > +  task_group_cache = KMEM_CACHE(task_group, 0);
>> > +
>> >list_add(&root_task_group.list, &task_groups);
>> >INIT_LIST_HEAD(&root_task_group.children);
>> >INIT_LIST_HEAD(&root_task_group.siblings);
>> >autogroup_init(&init_task);
>> > -
>> >  #endif /* CONFIG_CGROUP_SCHED */
>> >  
>> >for_each_possible_cpu(i) {
>> > --- a/kernel/sched/sched.h
>> > +++ b/kernel/sched/sched.h
>> > @@ -248,7 +248,12 @@ struct task_group {
>> >unsigned long shares;
>> >  
>> >  #ifdef CONFIG_SMP
>> > -  atomic_long_t load_avg;
>> > +  /*
>> > +   * load_avg can be heavily contended at clock tick time, so put
>> > +   * it in its own cacheline separated from the fields above which
>> > +   * will also be accessed at each tick.
>> > +   */
>> > +  atomic_long_t load_avg ____cacheline_aligned;
>> >  #endif
>> >  #endif
>> >  
>> 
>> This loses the cacheline-alignment for task_group, is that ok?
>
> I'm a bit dense (it's late) can you spell that out? Did you mean me
> killing SLAB_HWCACHE_ALIGN? That should not matter because:
>
> #define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\
>   sizeof(struct __struct), __alignof__(struct __struct),\
>   (__flags), NULL)
>
> picks up the alignment explicitly.
>
> And struct task_group having one cacheline aligned member, means that
> the alignment of the composite object (the struct proper) must be an
> integer multiple of this (typically 1).

Ah, yeah, I forgot about this, my fault.


Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

2015-12-03 Thread Peter Zijlstra
On Thu, Dec 03, 2015 at 09:56:02AM -0800, bseg...@google.com wrote:
> Peter Zijlstra  writes:

> > @@ -7402,11 +7405,12 @@ void __init sched_init(void)
> >  #endif /* CONFIG_RT_GROUP_SCHED */
> >  
> >  #ifdef CONFIG_CGROUP_SCHED
> > +   task_group_cache = KMEM_CACHE(task_group, 0);
> > +
> > list_add(&root_task_group.list, &task_groups);
> > INIT_LIST_HEAD(&root_task_group.children);
> > INIT_LIST_HEAD(&root_task_group.siblings);
> > autogroup_init(&init_task);
> > -
> >  #endif /* CONFIG_CGROUP_SCHED */
> >  
> > for_each_possible_cpu(i) {
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -248,7 +248,12 @@ struct task_group {
> > unsigned long shares;
> >  
> >  #ifdef CONFIG_SMP
> > -   atomic_long_t load_avg;
> > +   /*
> > +* load_avg can be heavily contended at clock tick time, so put
> > +* it in its own cacheline separated from the fields above which
> > +* will also be accessed at each tick.
> > +*/
> > +   atomic_long_t load_avg ____cacheline_aligned;
> >  #endif
> >  #endif
> >  
> 
> This loses the cacheline-alignment for task_group, is that ok?

I'm a bit dense (it's late) can you spell that out? Did you mean me
killing SLAB_HWCACHE_ALIGN? That should not matter because:

#define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\
sizeof(struct __struct), __alignof__(struct __struct),\
(__flags), NULL)

picks up the alignment explicitly.

And struct task_group having one cacheline aligned member, means that
the alignment of the composite object (the struct proper) must be an
integer multiple of this (typically 1).
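
The same thing can be seen in user space with a toy type; aligned_alloc()
below merely stands in for the slab allocator being handed the
__alignof__(struct ...) value the way KMEM_CACHE passes it to
kmem_cache_create(), it is not kernel code:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

struct toy_tg {
	unsigned long shares;
	long load_avg __attribute__((aligned(64)));	/* cacheline aligned member */
};

int main(void)
{
	/* the two values KMEM_CACHE(toy_tg, 0) would hand to the allocator */
	size_t size  = sizeof(struct toy_tg);		/* rounded up to a 64-byte multiple */
	size_t align = __alignof__(struct toy_tg);	/* 64, inherited from load_avg      */

	struct toy_tg *tg = aligned_alloc(align, size);
	if (!tg)
		return 1;

	printf("align %zu, object at %p, cacheline aligned: %s\n",
	       align, (void *)tg, (uintptr_t)tg % 64 == 0 ? "yes" : "no");
	free(tg);
	return 0;
}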




Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

2015-12-03 Thread bsegall
Peter Zijlstra  writes:

> I made this:
>
> ---
> Subject: sched/fair: Move hot load_avg into its own cacheline
> From: Waiman Long 
> Date: Wed, 2 Dec 2015 13:41:49 -0500
>
[...]
> @@ -7402,11 +7405,12 @@ void __init sched_init(void)
>  #endif /* CONFIG_RT_GROUP_SCHED */
>  
>  #ifdef CONFIG_CGROUP_SCHED
> + task_group_cache = KMEM_CACHE(task_group, 0);
> +
>   list_add(&root_task_group.list, &task_groups);
>   INIT_LIST_HEAD(&root_task_group.children);
>   INIT_LIST_HEAD(&root_task_group.siblings);
>   autogroup_init(&init_task);
> -
>  #endif /* CONFIG_CGROUP_SCHED */
>  
>   for_each_possible_cpu(i) {
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -248,7 +248,12 @@ struct task_group {
>   unsigned long shares;
>  
>  #ifdef   CONFIG_SMP
> - atomic_long_t load_avg;
> + /*
> +  * load_avg can be heavily contended at clock tick time, so put
> +  * it in its own cacheline separated from the fields above which
> +  * will also be accessed at each tick.
> +  */
> + atomic_long_t load_avg ____cacheline_aligned;
>  #endif
>  #endif
>  

This loses the cacheline-alignment for task_group, is that ok?


Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

2015-12-03 Thread Peter Zijlstra


I made this:

---
Subject: sched/fair: Move hot load_avg into its own cacheline
From: Waiman Long 
Date: Wed, 2 Dec 2015 13:41:49 -0500

If a system with a large number of sockets was driven to full
utilization, it was found that the clock tick handling occupied a
rather significant proportion of CPU time when fair group scheduling
and autogroup were enabled.

Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
profile looked like:

  10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
   8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
   8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
   8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
   6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
   5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares

In particular, the high CPU time consumed by update_cfs_shares()
was mostly due to contention on the cacheline that contained the
task_group's load_avg statistical counter. This cacheline may also
contain variables like shares, cfs_rq & se which are accessed rather
frequently during clock tick processing.

This patch moves the load_avg variable into another cacheline
separated from the other frequently accessed variables. It also
creates a cacheline aligned kmemcache for task_group to make sure
that all the allocated task_group's are cacheline aligned.

By doing so, the perf profile became:

   9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
   7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
   7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
   7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
   5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
   4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares

The %cpu time is still pretty high, but it is better than before. The
benchmark results before and after the patch were as follows:

  Before patch - Max-jOPs: 907533    Critical-jOps: 134877
  After patch  - Max-jOPs: 916011    Critical-jOps: 142366

Cc: Scott J Norton 
Cc: Douglas Hatch 
Cc: Ingo Molnar 
Cc: Yuyang Du 
Cc: Paul Turner 
Cc: Ben Segall 
Cc: Morten Rasmussen 
Signed-off-by: Waiman Long 
Signed-off-by: Peter Zijlstra (Intel) 
Link: 
http://lkml.kernel.org/r/1449081710-20185-3-git-send-email-waiman.l...@hpe.com
---
 kernel/sched/core.c  |   10 +++---
 kernel/sched/sched.h |7 ++-
 2 files changed, 13 insertions(+), 4 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7345,6 +7345,9 @@ int in_sched_functions(unsigned long add
  */
 struct task_group root_task_group;
 LIST_HEAD(task_groups);
+
+/* Cacheline aligned slab cache for task_group */
+static struct kmem_cache *task_group_cache __read_mostly;
 #endif
 
 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
@@ -7402,11 +7405,12 @@ void __init sched_init(void)
 #endif /* CONFIG_RT_GROUP_SCHED */
 
 #ifdef CONFIG_CGROUP_SCHED
+   task_group_cache = KMEM_CACHE(task_group, 0);
+
list_add(&root_task_group.list, &task_groups);
INIT_LIST_HEAD(&root_task_group.children);
INIT_LIST_HEAD(&root_task_group.siblings);
autogroup_init(&init_task);
-
 #endif /* CONFIG_CGROUP_SCHED */
 
for_each_possible_cpu(i) {
@@ -7687,7 +7691,7 @@ static void free_sched_group(struct task
free_fair_sched_group(tg);
free_rt_sched_group(tg);
autogroup_free(tg);
-   kfree(tg);
+   kmem_cache_free(task_group_cache, tg);
 }
 
 /* allocate runqueue etc for a new task group */
@@ -7695,7 +7699,7 @@ struct task_group *sched_create_group(st
 {
struct task_group *tg;
 
-   tg = kzalloc(sizeof(*tg), GFP_KERNEL);
+   tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
if (!tg)
return ERR_PTR(-ENOMEM);
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -248,7 +248,12 @@ struct task_group {
unsigned long shares;
 
 #ifdef CONFIG_SMP
-   atomic_long_t load_avg;
+   /*
+* load_avg can be heavily contended at clock tick time, so put
+* it in its own cacheline separated from the fields above which
+* will also be accessed at each tick.
+*/
+   atomic_long_t load_avg ____cacheline_aligned;
 #endif
 #endif
 
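
To see the effect the patch is chasing outside the kernel, here is a
self-contained pthreads sketch; the struct, the writer/reader split, the
thread counts and the 64-byte line size are illustrative assumptions, not
scheduler code. Built with "cc -O2 -pthread", the readers finish noticeably
faster with -DPAD, which pushes the hot counter onto its own cacheline much
as the patch does for load_avg:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define READ_ITERS (1 << 25)

struct shared {
	atomic_long load_avg;		/* hot: written continuously         */
#ifdef PAD				/* build with -DPAD to apply the fix */
	char pad[64 - sizeof(atomic_long)];
#endif
	atomic_long shares;		/* read-mostly neighbour             */
};

static struct shared s __attribute__((aligned(64)));
static atomic_bool stop;
static volatile long sink;

static void *writer(void *arg)		/* stands in for the ticking CPUs */
{
	(void)arg;
	while (!atomic_load_explicit(&stop, memory_order_relaxed))
		atomic_fetch_add_explicit(&s.load_avg, 1, memory_order_relaxed);
	return NULL;
}

static void *reader(void *arg)		/* only ever reads the neighbour */
{
	long sum = 0;

	(void)arg;
	for (long i = 0; i < READ_ITERS; i++)
		sum += atomic_load_explicit(&s.shares, memory_order_relaxed);
	sink = sum;
	return NULL;
}

int main(void)
{
	pthread_t w, r[3];
	struct timespec a, b;

	pthread_create(&w, NULL, writer, NULL);
	clock_gettime(CLOCK_MONOTONIC, &a);
	for (int i = 0; i < 3; i++)
		pthread_create(&r[i], NULL, reader, NULL);
	for (int i = 0; i < 3; i++)
		pthread_join(r[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &b);

	atomic_store(&stop, 1);
	pthread_join(w, NULL);

	printf("readers took %.3fs (load_avg reached %ld)\n",
	       (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9,
	       atomic_load(&s.load_avg));
	return 0;
}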


Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

2015-12-03 Thread Peter Zijlstra
On Wed, Dec 02, 2015 at 01:41:49PM -0500, Waiman Long wrote:
> +/*
> + * Make sure that the task_group structure is cacheline aligned when
> + * fair group scheduling is enabled.
> + */
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +static inline struct task_group *alloc_task_group(void)
> +{
> + return kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
> +}
> +
> +static inline void free_task_group(struct task_group *tg)
> +{
> + kmem_cache_free(task_group_cache, tg);
> +}
> +#else /* CONFIG_FAIR_GROUP_SCHED */
> +static inline struct task_group *alloc_task_group(void)
> +{
> + return kzalloc(sizeof(struct task_group), GFP_KERNEL);
> +}
> +
> +static inline void free_task_group(struct task_group *tg)
> +{
> + kfree(tg);
> +}
> +#endif /* CONFIG_FAIR_GROUP_SCHED */

I think we can simply always use the kmem_cache, both slab and slub
merge slabcaches where appropriate.



Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

2015-12-02 Thread Mike Galbraith
On Wed, 2015-12-02 at 13:41 -0500, Waiman Long wrote:

> By doing so, the perf profile became:
> 
>9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
>4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares
> 
> The %cpu time is still pretty high, but it is better than before.

Is that with the box booted skew_tick=1?

-Mike



Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline

2015-12-02 Thread bsegall
Waiman Long  writes:

> If a system with a large number of sockets was driven to full
> utilization, it was found that the clock tick handling occupied a
> rather significant proportion of CPU time when fair group scheduling
> and autogroup were enabled.
>
> Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
> profile looked like:
>
>   10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>    9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>    8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>    8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>    8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>    6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
>    5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares
>
> In particular, the high CPU time consumed by update_cfs_shares()
> was mostly due to contention on the cacheline that contained the
> task_group's load_avg statistical counter. This cacheline may also
> contain variables like shares, cfs_rq & se, which are accessed rather
> frequently during clock tick processing.
>
> This patch moves the load_avg variable into another cacheline
> separated from the other frequently accessed variables. It also
> creates a cacheline-aligned kmemcache for task_group to make sure
> that all allocated task_group structures are cacheline aligned.
>
> By doing so, the perf profile became:
>
>    9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>    8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>    7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>    7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>    7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>    5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
>    4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares
>
> The %cpu time is still pretty high, but it is better than before. The
> benchmark results before and after the patch were as follows:
>
>   Before patch - Max-jOPs: 907533    Critical-jOps: 134877
>   After patch  - Max-jOPs: 916011    Critical-jOps: 142366
>
> Signed-off-by: Waiman Long 
> ---
>  kernel/sched/core.c  |   36 ++--
>  kernel/sched/sched.h |7 ++-
>  2 files changed, 40 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4d568ac..e39204f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7331,6 +7331,11 @@ int in_sched_functions(unsigned long addr)
>   */
>  struct task_group root_task_group;
>  LIST_HEAD(task_groups);
> +
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +/* Cacheline aligned slab cache for task_group */
> +static struct kmem_cache *task_group_cache __read_mostly;
> +#endif
>  #endif
>  
>  DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
> @@ -7356,6 +7361,7 @@ void __init sched_init(void)
>   root_task_group.cfs_rq = (struct cfs_rq **)ptr;
>   ptr += nr_cpu_ids * sizeof(void **);
>  
> + task_group_cache = KMEM_CACHE(task_group, SLAB_HWCACHE_ALIGN);

The KMEM_CACHE macro suggests adding cacheline_aligned_in_smp to the
struct definition instead (sketched below, after the quoted diff).

>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>  #ifdef CONFIG_RT_GROUP_SCHED
>   root_task_group.rt_se = (struct sched_rt_entity **)ptr;
> @@ -7668,12 +7674,38 @@ void set_curr_task(int cpu, struct task_struct *p)
>  /* task_group_lock serializes the addition/removal of task groups */
>  static DEFINE_SPINLOCK(task_group_lock);
>  
> +/*
> + * Make sure that the task_group structure is cacheline aligned when
> + * fair group scheduling is enabled.
> + */
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +static inline struct task_group *alloc_task_group(void)
> +{
> + return kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
> +}
> +
> +static inline void free_task_group(struct task_group *tg)
> +{
> + kmem_cache_free(task_group_cache, tg);
> +}
> +#else /* CONFIG_FAIR_GROUP_SCHED */
> +static inline struct task_group *alloc_task_group(void)
> +{
> + return kzalloc(sizeof(struct task_group), GFP_KERNEL);
> +}
> +
> +static inline void free_task_group(struct task_group *tg)
> +{
> + kfree(tg);
> +}
> +#endif /* CONFIG_FAIR_GROUP_SCHED */
> +
>  static void free_sched_group(struct task_group *tg)
>  {
>   free_fair_sched_group(tg);
>   free_rt_sched_group(tg);
>   autogroup_free(tg);
> - kfree(tg);
> + free_task_group(tg);
>  }
>  
>  /* allocate runqueue etc for a new task group */
> @@ -7681,7 +7713,7 @@ struct task_group *sched_create_group(struct task_group *parent)
>  {
>   struct task_group *tg;
>  
> - tg = kzalloc(sizeof(*tg), GFP_KERNEL);
> + tg = alloc_task_group();
>   if (!tg)
>   return ERR_PTR(-ENOMEM);
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index efd3bfc..e679895 100644
> --- a/kernel/sched/sched.h
> 
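
Since the sched.h hunk is cut off above, here is an illustration-only,
standalone sketch of the layout change the commit message describes and of
the annotation approach suggested in the review; the member names, the
plain long standing in for the kernel's counter type, and the 64-byte line
size are assumptions, not the actual kernel code:

#include <stdio.h>
#include <stddef.h>

#define CACHELINE 64
/* Userspace stand-in for the kernel's cacheline-alignment annotation. */
#define ____cacheline_aligned __attribute__((aligned(CACHELINE)))

/* Heavily simplified, assumed layouts; not the real struct task_group. */
struct tg_before {
        long shares;            /* read on every tick */
        long load_avg;          /* written from many CPUs */
        void *cfs_rq;
        void *se;
};

struct tg_after {
        long shares;
        void *cfs_rq;
        void *se;
        /* The contended counter now starts a cacheline of its own. */
        long load_avg ____cacheline_aligned;
};

int main(void)
{
        printf("before: shares at %zu, load_avg at %zu (same line once the\n"
               "        object is %d-byte aligned)\n",
               offsetof(struct tg_before, shares),
               offsetof(struct tg_before, load_avg), CACHELINE);
        printf("after:  shares at %zu, load_avg at %zu (separate lines)\n",
               offsetof(struct tg_after, shares),
               offsetof(struct tg_after, load_avg));
        return 0;
}

On a typical 64-bit build the "after" layout reports load_avg at offset 64,
so the tick-time readers of shares/cfs_rq/se no longer share a line with the
load_avg writers; the annotated member also raises __alignof__ of the
struct, which is what lets a plain KMEM_CACHE() keep the objects aligned
without an extra flag.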
