Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
On 12/03/2015 09:07 PM, Mike Galbraith wrote:
> On Thu, 2015-12-03 at 14:34 -0500, Waiman Long wrote:
>> On 12/02/2015 11:32 PM, Mike Galbraith wrote:
>>> Is that with the box booted skew_tick=1?
>> I haven't tried that kernel parameter. I will try it to see if it can
>> improve the situation. BTW, will there be other undesirable side effects
>> of using this other than the increased power consumption as said in the
>> kernel-parameters.txt file?
> Not that are known. I kinda doubt you'd notice the power, but you
> should see a notable performance boost. Who knows, with a big enough
> farm of busy big boxen, it may save power by needing fewer of them.
>
> -Mike

You are right. Using skew_tick=1 did improve performance and reduce the overhead of clock tick processing rather significantly. I think it should be the default for large SMP boxes. Thanks for the tip.

Cheers,
Longman
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
On Thu, 2015-12-03 at 14:34 -0500, Waiman Long wrote:
> On 12/02/2015 11:32 PM, Mike Galbraith wrote:
> > Is that with the box booted skew_tick=1?
> I haven't tried that kernel parameter. I will try it to see if it can
> improve the situation. BTW, will there be other undesirable side effects
> of using this other than the increased power consumption as said in the
> kernel-parameters.txt file?

Not that are known. I kinda doubt you'd notice the power, but you should see a notable performance boost. Who knows, with a big enough farm of busy big boxen, it may save power by needing fewer of them.

-Mike
Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
On Thu, Dec 03, 2015 at 02:56:37PM -0500, Waiman Long wrote:
> >  #ifdef CONFIG_CGROUP_SCHED
> >+	task_group_cache = KMEM_CACHE(task_group, 0);
> >+
> Thanks for making that change.
>
> Do we need to add the flag SLAB_HWCACHE_ALIGN? Or we could make a helper
> flag that defines SLAB_HWCACHE_ALIGN if CONFIG_FAIR_GROUP_SCHED is defined.
> Other than that, I am fine with the change.

I don't think we need that, see my reply earlier to Ben.
Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
On 12/03/2015 06:12 AM, Peter Zijlstra wrote:
> I made this:
>
> ---
> Subject: sched/fair: Move hot load_avg into its own cacheline
> From: Waiman Long
> Date: Wed, 2 Dec 2015 13:41:49 -0500
>
> If a system with large number of sockets was driven to full
> utilization, it was found that the clock tick handling occupied a
> rather significant proportion of CPU time when fair group scheduling
> and autogroup were enabled.
>
> Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
> profile looked like:
>
>   10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>    9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>    8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>    8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>    8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>    6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
>    5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares
>
> In particular, the high CPU time consumed by update_cfs_shares()
> was mostly due to contention on the cacheline that contained the
> task_group's load_avg statistical counter. This cacheline may also
> contain variables like shares, cfs_rq & se which are accessed rather
> frequently during clock tick processing.
>
> This patch moves the load_avg variable into another cacheline
> separated from the other frequently accessed variables. It also
> creates a cacheline aligned kmemcache for task_group to make sure
> that all the allocated task_group's are cacheline aligned.
>
> By doing so, the perf profile became:
>
>    9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>    8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>    7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>    7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>    7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>    5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
>    4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares
>
> The %cpu time is still pretty high, but it is better than before. The
> benchmark results before and after the patch were as follows:
>
>   Before patch - Max-jOPs: 907533    Critical-jOps: 134877
>   After patch  - Max-jOPs: 916011    Critical-jOps: 142366
>
> Cc: Scott J Norton
> Cc: Douglas Hatch
> Cc: Ingo Molnar
> Cc: Yuyang Du
> Cc: Paul Turner
> Cc: Ben Segall
> Cc: Morten Rasmussen
> Signed-off-by: Waiman Long
> Signed-off-by: Peter Zijlstra (Intel)
> Link: http://lkml.kernel.org/r/1449081710-20185-3-git-send-email-waiman.l...@hpe.com
> ---
>  kernel/sched/core.c  |   10 +++++++---
>  kernel/sched/sched.h |    7 ++++++-
>  2 files changed, 13 insertions(+), 4 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7345,6 +7345,9 @@ int in_sched_functions(unsigned long add
>   */
>  struct task_group root_task_group;
>  LIST_HEAD(task_groups);
> +
> +/* Cacheline aligned slab cache for task_group */
> +static struct kmem_cache *task_group_cache __read_mostly;
>  #endif
>
>  DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
> @@ -7402,11 +7405,12 @@ void __init sched_init(void)
>  #endif /* CONFIG_RT_GROUP_SCHED */
>
>  #ifdef CONFIG_CGROUP_SCHED
> +	task_group_cache = KMEM_CACHE(task_group, 0);
> +

Thanks for making that change.

Do we need to add the flag SLAB_HWCACHE_ALIGN? Or we could make a helper flag that defines SLAB_HWCACHE_ALIGN if CONFIG_FAIR_GROUP_SCHED is defined. Other than that, I am fine with the change.

Cheers,
Longman
Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
Waiman Long writes:

> On 12/02/2015 03:02 PM, bseg...@google.com wrote:
>> Waiman Long writes:
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index efd3bfc..e679895 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -248,7 +248,12 @@ struct task_group {
>>>  	unsigned long shares;
>>>
>>>  #ifdef CONFIG_SMP
>>> -	atomic_long_t load_avg;
>>> +	/*
>>> +	 * load_avg can be heavily contended at clock tick time, so put
>>> +	 * it in its own cacheline separated from the fields above which
>>> +	 * will also be accessed at each tick.
>>> +	 */
>>> +	atomic_long_t load_avg ____cacheline_aligned;
>>>  #endif
>>>  #endif
>> I suppose the question is if it would be better to just move this to
>> wind up on a separate cacheline without the extra empty space, though it
>> would likely be more fragile and unclear.
>
> I have been thinking about that too. The problem is anything that will be
> in the same cacheline as load_avg and have to be accessed at clock tick
> time will cause the same contention problem. In the current layout, the
> fields after load_avg are the rt stuff as well as some list head structure
> and pointers. The rt stuff should be kind of mutually exclusive of the CFS
> load_avg in terms of usage. The list head structure and pointers don't seem
> to be that frequently accessed. So it is the right place to start a new
> cacheline boundary.
>
> Cheers,
> Longman

Yeah, this is a good place to start a new boundary, I was just saying you could probably remove the empty space by reordering fields, but that would be a less logical ordering in terms of programmer clarity.
Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
On 12/03/2015 05:56 AM, Peter Zijlstra wrote:
> On Wed, Dec 02, 2015 at 01:41:49PM -0500, Waiman Long wrote:
>> +/*
>> + * Make sure that the task_group structure is cacheline aligned when
>> + * fair group scheduling is enabled.
>> + */
>> +#ifdef CONFIG_FAIR_GROUP_SCHED
>> +static inline struct task_group *alloc_task_group(void)
>> +{
>> +	return kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
>> +}
>> +
>> +static inline void free_task_group(struct task_group *tg)
>> +{
>> +	kmem_cache_free(task_group_cache, tg);
>> +}
>> +#else /* CONFIG_FAIR_GROUP_SCHED */
>> +static inline struct task_group *alloc_task_group(void)
>> +{
>> +	return kzalloc(sizeof(struct task_group), GFP_KERNEL);
>> +}
>> +
>> +static inline void free_task_group(struct task_group *tg)
>> +{
>> +	kfree(tg);
>> +}
>> +#endif /* CONFIG_FAIR_GROUP_SCHED */
> I think we can simply always use the kmem_cache, both slab and slub
> merge slabcaches where appropriate.

I did that as I was not sure how much overhead the introduction of a new kmem_cache would bring. It seems like it is not really an issue. So I am fine with making the kmem_cache change permanent.

Cheers,
Longman
Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
On 12/02/2015 11:32 PM, Mike Galbraith wrote:
> On Wed, 2015-12-02 at 13:41 -0500, Waiman Long wrote:
>> By doing so, the perf profile became:
>>
>>    9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>>    8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>>    7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>>    7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>>    7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>>    5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
>>    4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares
>>
>> The %cpu time is still pretty high, but it is better than before.
> Is that with the box booted skew_tick=1?
>
> -Mike

I haven't tried that kernel parameter. I will try it to see if it can improve the situation. BTW, will there be other undesirable side effects of using this other than the increased power consumption as said in the kernel-parameters.txt file?

Cheers,
Longman
Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
On 12/02/2015 03:02 PM, bseg...@google.com wrote:
> Waiman Long writes:
>> If a system with large number of sockets was driven to full
>> utilization, it was found that the clock tick handling occupied a
>> rather significant proportion of CPU time when fair group scheduling
>> and autogroup were enabled.
>>
>> Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
>> profile looked like:
>>
>>   10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>>    9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>>    8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>>    8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>>    8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>>    6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
>>    5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares
>>
>> In particular, the high CPU time consumed by update_cfs_shares()
>> was mostly due to contention on the cacheline that contained the
>> task_group's load_avg statistical counter. This cacheline may also
>> contain variables like shares, cfs_rq & se which are accessed rather
>> frequently during clock tick processing.
>>
>> This patch moves the load_avg variable into another cacheline
>> separated from the other frequently accessed variables. It also
>> creates a cacheline aligned kmemcache for task_group to make sure
>> that all the allocated task_group's are cacheline aligned.
>>
>> By doing so, the perf profile became:
>>
>>    9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>>    8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>>    7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>>    7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>>    7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>>    5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
>>    4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares
>>
>> The %cpu time is still pretty high, but it is better than before. The
>> benchmark results before and after the patch were as follows:
>>
>>   Before patch - Max-jOPs: 907533    Critical-jOps: 134877
>>   After patch  - Max-jOPs: 916011    Critical-jOps: 142366
>>
>> Signed-off-by: Waiman Long
>> ---
>>  kernel/sched/core.c  |   36 ++++++++++++++++++++++++++++++++++--
>>  kernel/sched/sched.h |    7 ++++++-
>>  2 files changed, 40 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 4d568ac..e39204f 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7331,6 +7331,11 @@ int in_sched_functions(unsigned long addr)
>>   */
>>  struct task_group root_task_group;
>>  LIST_HEAD(task_groups);
>> +
>> +#ifdef CONFIG_FAIR_GROUP_SCHED
>> +/* Cacheline aligned slab cache for task_group */
>> +static struct kmem_cache *task_group_cache __read_mostly;
>> +#endif
>>  #endif
>>
>>  DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
>> @@ -7356,6 +7361,7 @@ void __init sched_init(void)
>>  		root_task_group.cfs_rq = (struct cfs_rq **)ptr;
>>  		ptr += nr_cpu_ids * sizeof(void **);
>>
>> +		task_group_cache = KMEM_CACHE(task_group, SLAB_HWCACHE_ALIGN);
> The KMEM_CACHE macro suggests adding
> ____cacheline_aligned_in_smp to the struct definition instead.

The main goal is to have the load_avg placed in a new cacheline separated from the read-only fields above. That is why I placed ____cacheline_aligned after load_avg. I omitted the in_smp part because it is in the SMP block already. Putting ____cacheline_aligned_in_smp on the struct definition won't guarantee alignment of any field within the structure.

I have done some tests, and having ____cacheline_aligned inside the structure has the same effect of forcing the whole structure onto a cacheline-aligned boundary.

>>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>>  #ifdef CONFIG_RT_GROUP_SCHED
>>  		root_task_group.rt_se = (struct sched_rt_entity **)ptr;
>> @@ -7668,12 +7674,38 @@ void set_curr_task(int cpu, struct task_struct *p)
>>  /* task_group_lock serializes the addition/removal of task groups */
>>  static DEFINE_SPINLOCK(task_group_lock);
>>
>> +/*
>> + * Make sure that the task_group structure is cacheline aligned when
>> + * fair group scheduling is enabled.
>> + */
>> +#ifdef CONFIG_FAIR_GROUP_SCHED
>> +static inline struct task_group *alloc_task_group(void)
>> +{
>> +	return kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
>> +}
>> +
>> +static inline void free_task_group(struct task_group *tg)
>> +{
>> +	kmem_cache_free(task_group_cache, tg);
>> +}
>> +#else /* CONFIG_FAIR_GROUP_SCHED */
>> +static inline struct task_group *alloc_task_group(void)
>> +{
>> +	return kzalloc(sizeof(struct task_group), GFP_KERNEL);
>> +}
>> +
>> +static inline void free_task_group(struct task_group *tg)
>> +{
>> +	kfree(tg);
>> +}
>> +#endif /* CONFIG_FAIR_GROUP_SCHED */
>> +
>>  static void free_sched_group(struct task_group *tg)
>>  {
>>  	free_fair_sched_group(tg);
>>  	free_rt_sched_group(tg);
>>  	autogroup_free(tg);
>> -	kfree(tg);
>> +	free_task_group(tg);
>>  }
>>
>>  /* allocate runqueue etc for a new task group */
>> @@
Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
Peter Zijlstra writes:

> On Thu, Dec 03, 2015 at 09:56:02AM -0800, bseg...@google.com wrote:
>> Peter Zijlstra writes:
>
>> > @@ -7402,11 +7405,12 @@ void __init sched_init(void)
>> >  #endif /* CONFIG_RT_GROUP_SCHED */
>> >
>> >  #ifdef CONFIG_CGROUP_SCHED
>> > +	task_group_cache = KMEM_CACHE(task_group, 0);
>> > +
>> >  	list_add(&root_task_group.list, &task_groups);
>> >  	INIT_LIST_HEAD(&root_task_group.children);
>> >  	INIT_LIST_HEAD(&root_task_group.siblings);
>> >  	autogroup_init(&init_task);
>> > -
>> >  #endif /* CONFIG_CGROUP_SCHED */
>> >
>> >  	for_each_possible_cpu(i) {
>> > --- a/kernel/sched/sched.h
>> > +++ b/kernel/sched/sched.h
>> > @@ -248,7 +248,12 @@ struct task_group {
>> >  	unsigned long shares;
>> >
>> >  #ifdef CONFIG_SMP
>> > -	atomic_long_t load_avg;
>> > +	/*
>> > +	 * load_avg can be heavily contended at clock tick time, so put
>> > +	 * it in its own cacheline separated from the fields above which
>> > +	 * will also be accessed at each tick.
>> > +	 */
>> > +	atomic_long_t load_avg ____cacheline_aligned;
>> >  #endif
>> >  #endif
>> >
>>
>> This loses the cacheline-alignment for task_group, is that ok?
>
> I'm a bit dense (it's late) can you spell that out? Did you mean me
> killing SLAB_HWCACHE_ALIGN? That should not matter because:
>
> #define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\
> 	sizeof(struct __struct), __alignof__(struct __struct),\
> 	(__flags), NULL)
>
> picks up the alignment explicitly.
>
> And struct task_group having one cacheline aligned member, means that
> the alignment of the composite object (the struct proper) must be an
> integer multiple of this (typically 1).

Ah, yeah, I forgot about this, my fault.
Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
On Thu, Dec 03, 2015 at 09:56:02AM -0800, bseg...@google.com wrote:
> Peter Zijlstra writes:
> > @@ -7402,11 +7405,12 @@ void __init sched_init(void)
> >  #endif /* CONFIG_RT_GROUP_SCHED */
> >
> >  #ifdef CONFIG_CGROUP_SCHED
> > +	task_group_cache = KMEM_CACHE(task_group, 0);
> > +
> >  	list_add(&root_task_group.list, &task_groups);
> >  	INIT_LIST_HEAD(&root_task_group.children);
> >  	INIT_LIST_HEAD(&root_task_group.siblings);
> >  	autogroup_init(&init_task);
> > -
> >  #endif /* CONFIG_CGROUP_SCHED */
> >
> >  	for_each_possible_cpu(i) {
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -248,7 +248,12 @@ struct task_group {
> >  	unsigned long shares;
> >
> >  #ifdef CONFIG_SMP
> > -	atomic_long_t load_avg;
> > +	/*
> > +	 * load_avg can be heavily contended at clock tick time, so put
> > +	 * it in its own cacheline separated from the fields above which
> > +	 * will also be accessed at each tick.
> > +	 */
> > +	atomic_long_t load_avg ____cacheline_aligned;
> >  #endif
> >  #endif
> >
>
> This loses the cacheline-alignment for task_group, is that ok?

I'm a bit dense (it's late) can you spell that out? Did you mean me killing SLAB_HWCACHE_ALIGN? That should not matter because:

#define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\
	sizeof(struct __struct), __alignof__(struct __struct),\
	(__flags), NULL)

picks up the alignment explicitly.

And struct task_group having one cacheline aligned member, means that the alignment of the composite object (the struct proper) must be an integer multiple of this (typically 1).
Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
Peter Zijlstra writes:

> I made this:
>
> ---
> Subject: sched/fair: Move hot load_avg into its own cacheline
> From: Waiman Long
> Date: Wed, 2 Dec 2015 13:41:49 -0500
>
[...]
> @@ -7402,11 +7405,12 @@ void __init sched_init(void)
>  #endif /* CONFIG_RT_GROUP_SCHED */
>
>  #ifdef CONFIG_CGROUP_SCHED
> +	task_group_cache = KMEM_CACHE(task_group, 0);
> +
>  	list_add(&root_task_group.list, &task_groups);
>  	INIT_LIST_HEAD(&root_task_group.children);
>  	INIT_LIST_HEAD(&root_task_group.siblings);
>  	autogroup_init(&init_task);
> -
>  #endif /* CONFIG_CGROUP_SCHED */
>
>  	for_each_possible_cpu(i) {
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -248,7 +248,12 @@ struct task_group {
>  	unsigned long shares;
>
>  #ifdef CONFIG_SMP
> -	atomic_long_t load_avg;
> +	/*
> +	 * load_avg can be heavily contended at clock tick time, so put
> +	 * it in its own cacheline separated from the fields above which
> +	 * will also be accessed at each tick.
> +	 */
> +	atomic_long_t load_avg ____cacheline_aligned;
>  #endif
>  #endif
>

This loses the cacheline-alignment for task_group, is that ok?
Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
I made this:

---
Subject: sched/fair: Move hot load_avg into its own cacheline
From: Waiman Long
Date: Wed, 2 Dec 2015 13:41:49 -0500

If a system with large number of sockets was driven to full
utilization, it was found that the clock tick handling occupied a
rather significant proportion of CPU time when fair group scheduling
and autogroup were enabled.

Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
profile looked like:

  10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
   8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
   8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
   8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
   6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
   5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares

In particular, the high CPU time consumed by update_cfs_shares()
was mostly due to contention on the cacheline that contained the
task_group's load_avg statistical counter. This cacheline may also
contain variables like shares, cfs_rq & se which are accessed rather
frequently during clock tick processing.

This patch moves the load_avg variable into another cacheline
separated from the other frequently accessed variables. It also
creates a cacheline aligned kmemcache for task_group to make sure
that all the allocated task_group's are cacheline aligned.

By doing so, the perf profile became:

   9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
   7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
   7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
   7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
   5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
   4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares

The %cpu time is still pretty high, but it is better than before. The
benchmark results before and after the patch were as follows:

  Before patch - Max-jOPs: 907533    Critical-jOps: 134877
  After patch  - Max-jOPs: 916011    Critical-jOps: 142366

Cc: Scott J Norton
Cc: Douglas Hatch
Cc: Ingo Molnar
Cc: Yuyang Du
Cc: Paul Turner
Cc: Ben Segall
Cc: Morten Rasmussen
Signed-off-by: Waiman Long
Signed-off-by: Peter Zijlstra (Intel)
Link: http://lkml.kernel.org/r/1449081710-20185-3-git-send-email-waiman.l...@hpe.com
---
 kernel/sched/core.c  |   10 +++++++---
 kernel/sched/sched.h |    7 ++++++-
 2 files changed, 13 insertions(+), 4 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7345,6 +7345,9 @@ int in_sched_functions(unsigned long add
  */
 struct task_group root_task_group;
 LIST_HEAD(task_groups);
+
+/* Cacheline aligned slab cache for task_group */
+static struct kmem_cache *task_group_cache __read_mostly;
 #endif

 DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
@@ -7402,11 +7405,12 @@ void __init sched_init(void)
 #endif /* CONFIG_RT_GROUP_SCHED */

 #ifdef CONFIG_CGROUP_SCHED
+	task_group_cache = KMEM_CACHE(task_group, 0);
+
 	list_add(&root_task_group.list, &task_groups);
 	INIT_LIST_HEAD(&root_task_group.children);
 	INIT_LIST_HEAD(&root_task_group.siblings);
 	autogroup_init(&init_task);
-
 #endif /* CONFIG_CGROUP_SCHED */

 	for_each_possible_cpu(i) {
@@ -7687,7 +7691,7 @@ static void free_sched_group(struct task
 	free_fair_sched_group(tg);
 	free_rt_sched_group(tg);
 	autogroup_free(tg);
-	kfree(tg);
+	kmem_cache_free(task_group_cache, tg);
 }

 /* allocate runqueue etc for a new task group */
@@ -7695,7 +7699,7 @@ struct task_group *sched_create_group(st
 {
 	struct task_group *tg;

-	tg = kzalloc(sizeof(*tg), GFP_KERNEL);
+	tg = kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
 	if (!tg)
 		return ERR_PTR(-ENOMEM);

--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -248,7 +248,12 @@ struct task_group {
 	unsigned long shares;

 #ifdef CONFIG_SMP
-	atomic_long_t load_avg;
+	/*
+	 * load_avg can be heavily contended at clock tick time, so put
+	 * it in its own cacheline separated from the fields above which
+	 * will also be accessed at each tick.
+	 */
+	atomic_long_t load_avg ____cacheline_aligned;
 #endif
 #endif

Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
On Wed, Dec 02, 2015 at 01:41:49PM -0500, Waiman Long wrote:
> +/*
> + * Make sure that the task_group structure is cacheline aligned when
> + * fair group scheduling is enabled.
> + */
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +static inline struct task_group *alloc_task_group(void)
> +{
> +	return kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
> +}
> +
> +static inline void free_task_group(struct task_group *tg)
> +{
> +	kmem_cache_free(task_group_cache, tg);
> +}
> +#else /* CONFIG_FAIR_GROUP_SCHED */
> +static inline struct task_group *alloc_task_group(void)
> +{
> +	return kzalloc(sizeof(struct task_group), GFP_KERNEL);
> +}
> +
> +static inline void free_task_group(struct task_group *tg)
> +{
> +	kfree(tg);
> +}
> +#endif /* CONFIG_FAIR_GROUP_SCHED */

I think we can simply always use the kmem_cache, both slab and slub merge slabcaches where appropriate.
Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
Waiman Long writes:

> On 12/02/2015 03:02 PM, bseg...@google.com wrote:
>> Waiman Long writes:
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index efd3bfc..e679895 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -248,7 +248,12 @@ struct task_group {
>>>  	unsigned long shares;
>>>
>>>  #ifdef CONFIG_SMP
>>> -	atomic_long_t load_avg;
>>> +	/*
>>> +	 * load_avg can be heavily contended at clock tick time, so put
>>> +	 * it in its own cacheline separated from the fields above which
>>> +	 * will also be accessed at each tick.
>>> +	 */
>>> +	atomic_long_t load_avg ____cacheline_aligned;
>>>  #endif
>>>  #endif
>> I suppose the question is if it would be better to just move this to
>> wind up on a separate cacheline without the extra empty space, though it
>> would likely be more fragile and unclear.
>
> I have been thinking about that too. The problem is that anything that
> is in the same cacheline as load_avg and has to be accessed at clock
> tick time will cause the same contention problem. In the current layout,
> the fields after load_avg are the rt stuff as well as some list head
> structure and pointers. The rt stuff should be kind of mutually
> exclusive of the CFS load_avg in terms of usage. The list head structure
> and pointers don't seem to be that frequently accessed. So it is the
> right place to start a new cacheline boundary.
>
> Cheers,
> Longman

Yeah, this is a good place to start a new boundary. I was just saying you
could probably remove the empty space by reordering fields, but that would
be a less logical ordering in terms of programmer clarity.
Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
On 12/03/2015 06:12 AM, Peter Zijlstra wrote:
> I made this:
>
> ---
> Subject: sched/fair: Move hot load_avg into its own cacheline
> From: Waiman Long
> Date: Wed, 2 Dec 2015 13:41:49 -0500
>
> If a system with large number of sockets was driven to full
> utilization, it was found that the clock tick handling occupied a
> rather significant proportion of CPU time when fair group scheduling
> and autogroup were enabled.
>
> Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
> profile looked like:
>
>   10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>    9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>    8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>    8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>    8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>    6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
>    5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares
>
> In particular, the high CPU time consumed by update_cfs_shares()
> was mostly due to contention on the cacheline that contained the
> task_group's load_avg statistical counter. This cacheline may also
> contain variables like shares, cfs_rq & se which are accessed rather
> frequently during clock tick processing.
>
> This patch moves the load_avg variable into another cacheline
> separated from the other frequently accessed variables. It also
> creates a cacheline aligned kmemcache for task_group to make sure
> that all the allocated task_group's are cacheline aligned.
>
> By doing so, the perf profile became:
>
>    9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>    8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>    7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>    7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>    7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>    5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
>    4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares
>
> The %cpu time is still pretty high, but it is better than before. The
> benchmark results before and after the patch were as follows:
>
>   Before patch - Max-jOPs: 907533    Critical-jOps: 134877
>   After patch  - Max-jOPs: 916011    Critical-jOps: 142366
>
> Cc: Scott J Norton
> Cc: Douglas Hatch
> Cc: Ingo Molnar
> Cc: Yuyang Du
> Cc: Paul Turner
> Cc: Ben Segall
> Cc: Morten Rasmussen
> Signed-off-by: Waiman Long
> Signed-off-by: Peter Zijlstra (Intel)
> Link: http://lkml.kernel.org/r/1449081710-20185-3-git-send-email-waiman.l...@hpe.com
> ---
>  kernel/sched/core.c  | 10 +++++++---
>  kernel/sched/sched.h |  7 ++++++-
>  2 files changed, 13 insertions(+), 4 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7345,6 +7345,9 @@ int in_sched_functions(unsigned long add
>   */
>  struct task_group root_task_group;
>  LIST_HEAD(task_groups);
> +
> +/* Cacheline aligned slab cache for task_group */
> +static struct kmem_cache *task_group_cache __read_mostly;
>  #endif
>
>  DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
> @@ -7402,11 +7405,12 @@ void __init sched_init(void)
>  #endif /* CONFIG_RT_GROUP_SCHED */
>
>  #ifdef CONFIG_CGROUP_SCHED
> +	task_group_cache = KMEM_CACHE(task_group, 0);
> +

Thanks for making that change.

Do we need to add the flag SLAB_HWCACHE_ALIGN? Or we could make a helper
flag that defines SLAB_HWCACHE_ALIGN if CONFIG_FAIR_GROUP_SCHED is defined.
Other than that, I am fine with the change.

Cheers,
Longman
Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
On 12/03/2015 05:56 AM, Peter Zijlstra wrote:
> On Wed, Dec 02, 2015 at 01:41:49PM -0500, Waiman Long wrote:
>> +/*
>> + * Make sure that the task_group structure is cacheline aligned when
>> + * fair group scheduling is enabled.
>> + */
>> +#ifdef CONFIG_FAIR_GROUP_SCHED
>> +static inline struct task_group *alloc_task_group(void)
>> +{
>> +	return kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
>> +}
>> +
>> +static inline void free_task_group(struct task_group *tg)
>> +{
>> +	kmem_cache_free(task_group_cache, tg);
>> +}
>> +#else /* CONFIG_FAIR_GROUP_SCHED */
>> +static inline struct task_group *alloc_task_group(void)
>> +{
>> +	return kzalloc(sizeof(struct task_group), GFP_KERNEL);
>> +}
>> +
>> +static inline void free_task_group(struct task_group *tg)
>> +{
>> +	kfree(tg);
>> +}
>> +#endif /* CONFIG_FAIR_GROUP_SCHED */
> I think we can simply always use the kmem_cache, both slab and slub merge
> slabcaches where appropriate.

I did that because I was not sure how much overhead the introduction of a
new kmem_cache would bring. It seems like it is not really an issue, so I
am fine with making the kmem_cache change permanent.

Cheers,
Longman
Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
On 12/02/2015 11:32 PM, Mike Galbraith wrote:
> On Wed, 2015-12-02 at 13:41 -0500, Waiman Long wrote:
>
>> By doing so, the perf profile became:
>>
>>    9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>>    8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>>    7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>>    7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>>    7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>>    5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
>>    4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares
>>
>> The %cpu time is still pretty high, but it is better than before.
> Is that with the box booted skew_tick=1?
>
> 	-Mike

I haven't tried that kernel parameter. I will try it to see if it can
improve the situation. BTW, will there be other undesirable side effects
of using this other than the increased power consumption as said in the
kernel-parameters.txt file?

Cheers,
Longman
Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
On Thu, Dec 03, 2015 at 02:56:37PM -0500, Waiman Long wrote:
> >  #ifdef CONFIG_CGROUP_SCHED
> >+	task_group_cache = KMEM_CACHE(task_group, 0);
> >+
> Thanks for making that change.
>
> Do we need to add the flag SLAB_HWCACHE_ALIGN? Or we could make a helper
> flag that defines SLAB_HWCACHE_ALIGN if CONFIG_FAIR_GROUP_SCHED is defined.
> Other than that, I am fine with the change.

I don't think we need that, see my reply earlier to Ben.
Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
On 12/02/2015 03:02 PM, bseg...@google.com wrote:
> Waiman Long writes:
>
>> If a system with large number of sockets was driven to full
>> utilization, it was found that the clock tick handling occupied a
>> rather significant proportion of CPU time when fair group scheduling
>> and autogroup were enabled.
>>
>> Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
>> profile looked like:
>>
>>   10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>>    9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>>    8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>>    8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>>    8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>>    6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
>>    5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares
>>
>> In particular, the high CPU time consumed by update_cfs_shares()
>> was mostly due to contention on the cacheline that contained the
>> task_group's load_avg statistical counter. This cacheline may also
>> contain variables like shares, cfs_rq & se which are accessed rather
>> frequently during clock tick processing.
>>
>> This patch moves the load_avg variable into another cacheline
>> separated from the other frequently accessed variables. It also
>> creates a cacheline aligned kmemcache for task_group to make sure
>> that all the allocated task_group's are cacheline aligned.
>>
>> By doing so, the perf profile became:
>>
>>    9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>>    8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>>    7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>>    7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>>    7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>>    5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
>>    4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares
>>
>> The %cpu time is still pretty high, but it is better than before. The
>> benchmark results before and after the patch were as follows:
>>
>>   Before patch - Max-jOPs: 907533    Critical-jOps: 134877
>>   After patch  - Max-jOPs: 916011    Critical-jOps: 142366
>>
>> Signed-off-by: Waiman Long
>> ---
>>  kernel/sched/core.c  | 36 ++++++++++++++++++++++++++++++++++--
>>  kernel/sched/sched.h |  7 ++++++-
>>  2 files changed, 40 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 4d568ac..e39204f 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7331,6 +7331,11 @@ int in_sched_functions(unsigned long addr)
>>   */
>>  struct task_group root_task_group;
>>  LIST_HEAD(task_groups);
>> +
>> +#ifdef CONFIG_FAIR_GROUP_SCHED
>> +/* Cacheline aligned slab cache for task_group */
>> +static struct kmem_cache *task_group_cache __read_mostly;
>> +#endif
>>  #endif
>>
>>  DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
>> @@ -7356,6 +7361,7 @@ void __init sched_init(void)
>>  	root_task_group.cfs_rq = (struct cfs_rq **)ptr;
>>  	ptr += nr_cpu_ids * sizeof(void **);
>>
>> +	task_group_cache = KMEM_CACHE(task_group, SLAB_HWCACHE_ALIGN);
> The KMEM_CACHE macro suggests adding ____cacheline_aligned_in_smp to the
> struct definition instead.

The main goal is to have the load_avg placed in a new cacheline separated
from the read-only fields above. That is why I placed ____cacheline_aligned
after load_avg. I omitted the in_smp part because it is in the SMP block
already. Putting ____cacheline_aligned_in_smp on the struct definition
won't guarantee alignment of any field within the structure. I have done
some tests, and having ____cacheline_aligned inside the structure has the
same effect of forcing the whole structure onto a cacheline aligned
boundary.

>>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>>  #ifdef CONFIG_RT_GROUP_SCHED
>>  	root_task_group.rt_se = (struct sched_rt_entity **)ptr;
>> @@ -7668,12 +7674,38 @@ void set_curr_task(int cpu, struct task_struct *p)
>>  /* task_group_lock serializes the addition/removal of task groups */
>>  static DEFINE_SPINLOCK(task_group_lock);
>>
>> +/*
>> + * Make sure that the task_group structure is cacheline aligned when
>> + * fair group scheduling is enabled.
>> + */
>> +#ifdef CONFIG_FAIR_GROUP_SCHED
>> +static inline struct task_group *alloc_task_group(void)
>> +{
>> +	return kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
>> +}
>> +
>> +static inline void free_task_group(struct task_group *tg)
>> +{
>> +	kmem_cache_free(task_group_cache, tg);
>> +}
>> +#else /* CONFIG_FAIR_GROUP_SCHED */
>> +static inline struct task_group *alloc_task_group(void)
>> +{
>> +	return kzalloc(sizeof(struct task_group), GFP_KERNEL);
>> +}
>> +
>> +static inline void free_task_group(struct task_group *tg)
>> +{
>> +	kfree(tg);
>> +}
>> +#endif /* CONFIG_FAIR_GROUP_SCHED */
>> +
>>  static void free_sched_group(struct task_group *tg)
>>  {
>>  	free_fair_sched_group(tg);
>>  	free_rt_sched_group(tg);
>>  	autogroup_free(tg);
>> -	kfree(tg);
>> +	free_task_group(tg);
>>  }
>>
>>  /*
Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
Peter Zijlstra writes:

> I made this:
>
> ---
> Subject: sched/fair: Move hot load_avg into its own cacheline
> From: Waiman Long
> Date: Wed, 2 Dec 2015 13:41:49 -0500
>
[...]
> @@ -7402,11 +7405,12 @@ void __init sched_init(void)
>  #endif /* CONFIG_RT_GROUP_SCHED */
>
>  #ifdef CONFIG_CGROUP_SCHED
> +	task_group_cache = KMEM_CACHE(task_group, 0);
> +
>  	list_add(&root_task_group.list, &task_groups);
>  	INIT_LIST_HEAD(&root_task_group.children);
>  	INIT_LIST_HEAD(&root_task_group.siblings);
>  	autogroup_init(&init_task);
> -
>  #endif /* CONFIG_CGROUP_SCHED */
>
>  	for_each_possible_cpu(i) {
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -248,7 +248,12 @@ struct task_group {
>  	unsigned long shares;
>
>  #ifdef CONFIG_SMP
> -	atomic_long_t load_avg;
> +	/*
> +	 * load_avg can be heavily contended at clock tick time, so put
> +	 * it in its own cacheline separated from the fields above which
> +	 * will also be accessed at each tick.
> +	 */
> +	atomic_long_t load_avg ____cacheline_aligned;
>  #endif
>  #endif
>

This loses the cacheline-alignment for task_group, is that ok?
Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
Peter Zijlstra writes:

> On Thu, Dec 03, 2015 at 09:56:02AM -0800, bseg...@google.com wrote:
>> Peter Zijlstra writes:
>
>> > @@ -7402,11 +7405,12 @@ void __init sched_init(void)
>> >  #endif /* CONFIG_RT_GROUP_SCHED */
>> >
>> >  #ifdef CONFIG_CGROUP_SCHED
>> > +	task_group_cache = KMEM_CACHE(task_group, 0);
>> > +
>> >  	list_add(&root_task_group.list, &task_groups);
>> >  	INIT_LIST_HEAD(&root_task_group.children);
>> >  	INIT_LIST_HEAD(&root_task_group.siblings);
>> >  	autogroup_init(&init_task);
>> > -
>> >  #endif /* CONFIG_CGROUP_SCHED */
>> >
>> >  	for_each_possible_cpu(i) {
>> > --- a/kernel/sched/sched.h
>> > +++ b/kernel/sched/sched.h
>> > @@ -248,7 +248,12 @@ struct task_group {
>> >  	unsigned long shares;
>> >
>> >  #ifdef CONFIG_SMP
>> > -	atomic_long_t load_avg;
>> > +	/*
>> > +	 * load_avg can be heavily contended at clock tick time, so put
>> > +	 * it in its own cacheline separated from the fields above which
>> > +	 * will also be accessed at each tick.
>> > +	 */
>> > +	atomic_long_t load_avg ____cacheline_aligned;
>> >  #endif
>> >  #endif
>> >
>>
>> This loses the cacheline-alignment for task_group, is that ok?
>
> I'm a bit dense (it's late), can you spell that out? Did you mean me
> killing SLAB_HWCACHE_ALIGN? That should not matter because:
>
> #define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\
> 		sizeof(struct __struct), __alignof__(struct __struct),\
> 		(__flags), NULL)
>
> picks up the alignment explicitly.
>
> And struct task_group having one cacheline aligned member, means that
> the alignment of the composite object (the struct proper) must be an
> integer multiple of this (typically 1).

Ah, yeah, I forgot about this, my fault.
Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
On Thu, Dec 03, 2015 at 09:56:02AM -0800, bseg...@google.com wrote:
> Peter Zijlstra writes:

> > @@ -7402,11 +7405,12 @@ void __init sched_init(void)
> >  #endif /* CONFIG_RT_GROUP_SCHED */
> >
> >  #ifdef CONFIG_CGROUP_SCHED
> > +	task_group_cache = KMEM_CACHE(task_group, 0);
> > +
> >  	list_add(&root_task_group.list, &task_groups);
> >  	INIT_LIST_HEAD(&root_task_group.children);
> >  	INIT_LIST_HEAD(&root_task_group.siblings);
> >  	autogroup_init(&init_task);
> > -
> >  #endif /* CONFIG_CGROUP_SCHED */
> >
> >  	for_each_possible_cpu(i) {
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -248,7 +248,12 @@ struct task_group {
> >  	unsigned long shares;
> >
> >  #ifdef CONFIG_SMP
> > -	atomic_long_t load_avg;
> > +	/*
> > +	 * load_avg can be heavily contended at clock tick time, so put
> > +	 * it in its own cacheline separated from the fields above which
> > +	 * will also be accessed at each tick.
> > +	 */
> > +	atomic_long_t load_avg ____cacheline_aligned;
> >  #endif
> >  #endif
> >
>
> This loses the cacheline-alignment for task_group, is that ok?

I'm a bit dense (it's late), can you spell that out? Did you mean me
killing SLAB_HWCACHE_ALIGN? That should not matter because:

#define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\
		sizeof(struct __struct), __alignof__(struct __struct),\
		(__flags), NULL)

picks up the alignment explicitly.

And struct task_group having one cacheline aligned member, means that
the alignment of the composite object (the struct proper) must be an
integer multiple of this (typically 1).
Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
On Thu, 2015-12-03 at 14:34 -0500, Waiman Long wrote:
> On 12/02/2015 11:32 PM, Mike Galbraith wrote:

> > Is that with the box booted skew_tick=1?

> I haven't tried that kernel parameter. I will try it to see if it can
> improve the situation. BTW, will there be other undesirable side effects
> of using this other than the increased power consumption as said in the
> kernel-parameters.txt file?

Not that are known. I kinda doubt you'd notice the power, but you
should see a notable performance boost. Who knows, with a big enough
farm of busy big boxen, it may save power by needing fewer of them.

	-Mike
Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
On Wed, 2015-12-02 at 13:41 -0500, Waiman Long wrote:

> By doing so, the perf profile became:
>
>    9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>    8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>    7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>    7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>    7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>    5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
>    4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares
>
> The %cpu time is still pretty high, but it is better than before.

Is that with the box booted skew_tick=1?

	-Mike
Re: [PATCH v2 2/3] sched/fair: Move hot load_avg into its own cacheline
Waiman Long writes:

> If a system with large number of sockets was driven to full
> utilization, it was found that the clock tick handling occupied a
> rather significant proportion of CPU time when fair group scheduling
> and autogroup were enabled.
>
> Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
> profile looked like:
>
>   10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>    9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>    8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>    8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>    8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>    6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
>    5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares
>
> In particular, the high CPU time consumed by update_cfs_shares()
> was mostly due to contention on the cacheline that contained the
> task_group's load_avg statistical counter. This cacheline may also
> contain variables like shares, cfs_rq & se which are accessed rather
> frequently during clock tick processing.
>
> This patch moves the load_avg variable into another cacheline
> separated from the other frequently accessed variables. It also
> creates a cacheline aligned kmemcache for task_group to make sure
> that all the allocated task_group's are cacheline aligned.
>
> By doing so, the perf profile became:
>
>    9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
>    8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
>    7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
>    7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
>    7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
>    5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
>    4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares
>
> The %cpu time is still pretty high, but it is better than before. The
> benchmark results before and after the patch were as follows:
>
>   Before patch - Max-jOPs: 907533    Critical-jOps: 134877
>   After patch  - Max-jOPs: 916011    Critical-jOps: 142366
>
> Signed-off-by: Waiman Long
> ---
>  kernel/sched/core.c  | 36 ++++++++++++++++++++++++++++++++++--
>  kernel/sched/sched.h |  7 ++++++-
>  2 files changed, 40 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4d568ac..e39204f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7331,6 +7331,11 @@ int in_sched_functions(unsigned long addr)
>   */
>  struct task_group root_task_group;
>  LIST_HEAD(task_groups);
> +
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +/* Cacheline aligned slab cache for task_group */
> +static struct kmem_cache *task_group_cache __read_mostly;
> +#endif
>  #endif
>
>  DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
> @@ -7356,6 +7361,7 @@ void __init sched_init(void)
>  	root_task_group.cfs_rq = (struct cfs_rq **)ptr;
>  	ptr += nr_cpu_ids * sizeof(void **);
>
> +	task_group_cache = KMEM_CACHE(task_group, SLAB_HWCACHE_ALIGN);

The KMEM_CACHE macro suggests adding ____cacheline_aligned_in_smp to the
struct definition instead.

>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>  #ifdef CONFIG_RT_GROUP_SCHED
>  	root_task_group.rt_se = (struct sched_rt_entity **)ptr;
> @@ -7668,12 +7674,38 @@ void set_curr_task(int cpu, struct task_struct *p)
>  /* task_group_lock serializes the addition/removal of task groups */
>  static DEFINE_SPINLOCK(task_group_lock);
>
> +/*
> + * Make sure that the task_group structure is cacheline aligned when
> + * fair group scheduling is enabled.
> + */
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +static inline struct task_group *alloc_task_group(void)
> +{
> +	return kmem_cache_alloc(task_group_cache, GFP_KERNEL | __GFP_ZERO);
> +}
> +
> +static inline void free_task_group(struct task_group *tg)
> +{
> +	kmem_cache_free(task_group_cache, tg);
> +}
> +#else /* CONFIG_FAIR_GROUP_SCHED */
> +static inline struct task_group *alloc_task_group(void)
> +{
> +	return kzalloc(sizeof(struct task_group), GFP_KERNEL);
> +}
> +
> +static inline void free_task_group(struct task_group *tg)
> +{
> +	kfree(tg);
> +}
> +#endif /* CONFIG_FAIR_GROUP_SCHED */
> +
>  static void free_sched_group(struct task_group *tg)
>  {
>  	free_fair_sched_group(tg);
>  	free_rt_sched_group(tg);
>  	autogroup_free(tg);
> -	kfree(tg);
> +	free_task_group(tg);
>  }
>
>  /* allocate runqueue etc for a new task group */
> @@ -7681,7 +7713,7 @@ struct task_group *sched_create_group(struct task_group *parent)
>  {
>  	struct task_group *tg;
>
> -	tg = kzalloc(sizeof(*tg), GFP_KERNEL);
> +	tg = alloc_task_group();
>  	if (!tg)
>  		return ERR_PTR(-ENOMEM);
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index efd3bfc..e679895 100644
> --- a/kernel/sched/sched.h