Commit-ID:  a35b6466aabb051568b844e8c63f87a356d3d129
Gitweb:     http://git.kernel.org/tip/a35b6466aabb051568b844e8c63f87a356d3d129
Author:     Peter Zijlstra <a.p.zijls...@chello.nl>
AuthorDate: Wed, 8 Aug 2012 21:46:40 +0200
Committer:  Thomas Gleixner <t...@linutronix.de>
CommitDate: Mon, 13 Aug 2012 18:41:54 +0200

sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies

Peter Portante reported that for large cgroup hierarchies (and/or on
large CPU counts) we get immense lock contention on rq->lock and stuff
stops working properly.

His workload was a ton of processes, each in their own cgroup,
everybody idling except for a sporadic wakeup once every so often.

It was found that:

  schedule()
    idle_balance()
      load_balance()
        local_irq_save()
        double_rq_lock()
        update_h_load()
          walk_tg_tree(tg_load_down)
            tg_load_down()

results in an entire cgroup hierarchy walk under rq->lock for every
new-idle balance, and since new-idle balance isn't throttled, this
adds up to a lot of work while holding the rq->lock.

This patch does two things. First, it removes the work from under
rq->lock, based on the good principle of race and pray, which is widely
employed in the load-balancer as a whole. Second, it throttles the
update_h_load() calculation to at most once per jiffy.
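
As a rough illustration of the throttle only, here is a minimal userspace
sketch. The names demo_rq, demo_walk_hierarchy() and demo_update_h_load()
are hypothetical stand-ins, not kernel symbols, and the tick value is passed
in by the caller instead of being read from jiffies:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rq: only the throttle stamp is modelled. */
struct demo_rq {
	unsigned long h_load_throttle;	/* tick at which the last full walk ran */
	unsigned long walks;		/* number of expensive walks performed */
};

/* Stand-in for the expensive hierarchy walk; here it only counts calls. */
static void demo_walk_hierarchy(struct demo_rq *rq)
{
	rq->walks++;
}

/*
 * Run the walk at most once per tick value.
 * Returns 1 if the walk ran, 0 if the call was throttled.
 */
static int demo_update_h_load(struct demo_rq *rq, unsigned long now)
{
	assert(rq != NULL);

	if (rq->h_load_throttle == now)
		return 0;	/* already walked during this tick */

	rq->h_load_throttle = now;
	demo_walk_hierarchy(rq);
	return 1;
}
```

Calling demo_update_h_load() twice with the same tick performs one walk;
the next tick allows another. Like the patch, the sketch compares for
equality only, so a zero-initialised stamp suppresses a first call made at
tick 0.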

I considered excluding update_h_load() for new-idle balance
altogether, but purely relying on regular balance passes to update
this data might not work out under some rare circumstances where the
new-idle busiest isn't the regular busiest for a while (unlikely, but
a nightmare to debug if someone hits it and suffers).

Cc: p...@google.com
Cc: Larry Woodman <lwood...@redhat.com>
Cc: Mike Galbraith <efa...@gmx.de>
Reported-by: Peter Portante <pport...@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijls...@chello.nl>
Link: http://lkml.kernel.org/n/tip-aaarrzfpnaam7pqrekofu...@git.kernel.org
Signed-off-by: Thomas Gleixner <t...@linutronix.de>
---
 kernel/sched/fair.c  |   11 +++++++++--
 kernel/sched/sched.h |    6 +++++-
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0cc03b..c219bf8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3387,6 +3387,14 @@ static int tg_load_down(struct task_group *tg, void *data)
 
 static void update_h_load(long cpu)
 {
+       struct rq *rq = cpu_rq(cpu);
+       unsigned long now = jiffies;
+
+       if (rq->h_load_throttle == now)
+               return;
+
+       rq->h_load_throttle = now;
+
        rcu_read_lock();
        walk_tg_tree(tg_load_down, tg_nop, (void *)cpu);
        rcu_read_unlock();
@@ -4293,11 +4301,10 @@ redo:
                env.src_rq    = busiest;
                env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
 
+               update_h_load(env.src_cpu);
 more_balance:
                local_irq_save(flags);
                double_rq_lock(this_rq, busiest);
-               if (!env.loop)
-                       update_h_load(env.src_cpu);
 
                /*
                 * cur_ld_moved - load moved in current iteration
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c35a1a7..531411b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -374,7 +374,11 @@ struct rq {
 #ifdef CONFIG_FAIR_GROUP_SCHED
        /* list of leaf cfs_rq on this cpu: */
        struct list_head leaf_cfs_rq_list;
-#endif
+#ifdef CONFIG_SMP
+       unsigned long h_load_throttle;
+#endif /* CONFIG_SMP */
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
 #ifdef CONFIG_RT_GROUP_SCHED
        struct list_head leaf_rt_rq_list;
 #endif
