On 12/03/2015 06:12 AM, Peter Zijlstra wrote:

I made this:

---
Subject: sched/fair: Move hot load_avg into its own cacheline
From: Waiman Long <waiman.l...@hpe.com>
Date: Wed, 2 Dec 2015 13:41:49 -0500

If a system with a large number of sockets was driven to full
utilization, it was found that the clock tick handling occupied a
rather significant proportion of CPU time when fair group scheduling
and autogroup were enabled.

Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
profile looked like:

   10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
    9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
    8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
    8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
    8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
    6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
    5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares

In particular, the high CPU time consumed by update_cfs_shares()
was mostly due to contention on the cacheline that contained the
task_group's load_avg statistical counter. This cacheline may also
contain variables like shares, cfs_rq and se which are accessed rather
frequently during clock tick processing.

This patch moves the load_avg variable into another cacheline
separated from the other frequently accessed variables. It also
creates a cacheline-aligned kmem_cache for task_group to make sure
that all the allocated task_groups are cacheline aligned.
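
As a rough illustration of the idea (a sketch only, not the actual
sched.h hunk, which the diffstat below only summarizes):

/*
 * Sketch only: keep load_avg on its own cacheline so that the
 * cross-CPU writes to it at tick time do not bounce the line
 * holding the other frequently read task_group fields.
 */
struct task_group {
	struct cgroup_subsys_state css;

#ifdef CONFIG_FAIR_GROUP_SCHED
	/* read often during tick processing */
	struct sched_entity **se;
	struct cfs_rq **cfs_rq;
	unsigned long shares;

	/*
	 * load_avg is written by every CPU whenever the group's load
	 * changes; aligning it to a cacheline boundary keeps that
	 * write traffic away from se/cfs_rq/shares above.
	 */
	atomic_long_t load_avg ____cacheline_aligned;
#endif
	/* ... remaining fields elided ... */
};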

By doing so, the perf profile became:

    9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
    8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
    7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
    7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
    7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
    5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
    4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares

The %CPU time is still pretty high, but it is better than before. The
benchmark results before and after the patch were as follows:

   Before patch - Max-jOPs: 907533    Critical-jOps: 134877
   After patch  - Max-jOPs: 916011    Critical-jOps: 142366

Cc: Scott J Norton <scott.nor...@hpe.com>
Cc: Douglas Hatch <doug.ha...@hpe.com>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Yuyang Du <yuyang...@intel.com>
Cc: Paul Turner <p...@google.com>
Cc: Ben Segall <bseg...@google.com>
Cc: Morten Rasmussen <morten.rasmus...@arm.com>
Signed-off-by: Waiman Long <waiman.l...@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Link: http://lkml.kernel.org/r/1449081710-20185-3-git-send-email-waiman.l...@hpe.com
---
  kernel/sched/core.c  |   10 +++++++---
  kernel/sched/sched.h |    7 ++++++-
  2 files changed, 13 insertions(+), 4 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7345,6 +7345,9 @@ int in_sched_functions(unsigned long add
   */
  struct task_group root_task_group;
  LIST_HEAD(task_groups);
+
+/* Cacheline aligned slab cache for task_group */
+static struct kmem_cache *task_group_cache __read_mostly;
  #endif

  DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
@@ -7402,11 +7405,12 @@ void __init sched_init(void)
  #endif /* CONFIG_RT_GROUP_SCHED */

  #ifdef CONFIG_CGROUP_SCHED
+       task_group_cache = KMEM_CACHE(task_group, 0);
+
Thanks for making that change.

Do we need to add the SLAB_HWCACHE_ALIGN flag? Or we could define a helper flag that expands to SLAB_HWCACHE_ALIGN when CONFIG_FAIR_GROUP_SCHED is defined. Other than that, I am fine with the change.
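
For illustration only (a sketch of the suggestion above, not the posted
patch; TG_SLAB_FLAGS is a made-up helper name):

#ifdef CONFIG_FAIR_GROUP_SCHED
/* alignment matters here because load_avg sits on its own cacheline */
#define TG_SLAB_FLAGS	SLAB_HWCACHE_ALIGN
#else
/* no fair-group fields to isolate, so no need to pad the cache */
#define TG_SLAB_FLAGS	0
#endif

	task_group_cache = KMEM_CACHE(task_group, TG_SLAB_FLAGS);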

Cheers,
Longman