Hi Tejun,

On Tuesday 02 May 2017 at 09:18:53 (+0200), Vincent Guittot wrote:
> On 28 April 2017 at 22:33, Tejun Heo <t...@kernel.org> wrote:
> > Hello, Vincent.
> >
> > On Thu, Apr 27, 2017 at 10:29:10AM +0200, Vincent Guittot wrote:
> >> On 27 April 2017 at 00:52, Tejun Heo <t...@kernel.org> wrote:
> >> > Hello,
> >> >
> >> > On Wed, Apr 26, 2017 at 08:12:09PM +0200, Vincent Guittot wrote:
> >> >> On 24 April 2017 at 22:14, Tejun Heo <t...@kernel.org> wrote:
> >> >> Can the problem be on the load balance side instead ?  and more
> >> >> precisely in the wakeup path ?
> >> >> After looking at the trace, it seems that task placement happens at
> >> >> wake up path and if it fails to select the right idle cpu at wake up,
> >> >> you will have to wait for a load balance which is already too late
> >> >
> >> > Oh, I was tracing most of scheduler activities and the ratios of
> >> > wakeups picking idle CPUs were about the same regardless of cgroup
> >> > membership.  I can confidently say that the latency issue that I'm
> >> > seeing is from load balancer picking the wrong busiest CPU, which is
> >> > not to say that there can't be other problems.
> >>
> >> ok. Is there any trace that you can share? Your behavior seems
> >> different from mine
> >
> >

[ snip]

> > You can notice that B's pertask weight is 4.409 which is way higher
> > than A's 2.779, and this is because Q014-asdf's contribution to Q014-/ is
> > twice as high as it should be.  The root queue's runnable avg should
> 
> Are you sure that this is because of blocked load in group A? It can
> be that Q014-asdf has already had to wait before running and its load
> still increases while it is runnable but not running.
> IIUC your trace, group A has 2 running tasks and group B only one, but
> load_balance selects B because its sgs->avg_load is higher. But
> this can also happen even if the runnable_load_avg of the child cfs_rq
> was propagated correctly into the group entity, because we can have a
> situation where group A has only 1 task with a higher load than the
> 2 tasks in group B, and even if blocked load is not taken into
> account, load_balance will select A.
> 
> IMHO, we should rather improve load balance selection. I'm going to
> add smarter group selection in load_balance. That's something we
> should have already done, but it was difficult without load/util_avg
> propagation. It should be doable now.

Could you test the patch in load_balance below?
If a group is not overloaded, which means that its threads get all the
runtime they want, we select the cfs_rq according to the number of running
threads instead of the load.
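
To make the selection rule concrete, here is a minimal userspace sketch of
the two criteria. It is not the kernel code: struct sg_stats is a
hypothetical, simplified stand-in for the scheduler's per-group statistics,
and both helper names are made up for illustration. The tasks-per-CPU
comparison uses the same cross-multiplication as the patch, which avoids a
division.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the kernel's per-group stats. */
struct sg_stats {
	unsigned long avg_load;       /* load average of the group */
	unsigned int sum_nr_running;  /* number of running tasks in the group */
	unsigned int group_weight;    /* number of CPUs in the group */
};

/* Load-based criterion: candidate wins only with a strictly higher avg_load. */
static bool busier_by_load(const struct sg_stats *sgs,
			   const struct sg_stats *busiest)
{
	return sgs->avg_load > busiest->avg_load;
}

/*
 * Task-count criterion: candidate wins only with strictly more tasks per CPU.
 * nr_a / weight_a > nr_b / weight_b is rewritten as
 * nr_a * weight_b > nr_b * weight_a to stay in integer arithmetic.
 */
static bool busier_by_nr_running(const struct sg_stats *sgs,
				 const struct sg_stats *busiest)
{
	return (unsigned long)sgs->sum_nr_running * busiest->group_weight >
	       (unsigned long)busiest->sum_nr_running * sgs->group_weight;
}
```

As an assumed example (the numbers are invented, not taken from the trace),
take two single-CPU groups: A with 2 tasks and avg_load 600, and B with
1 task and avg_load 900. busier_by_load() prefers B, whereas
busier_by_nr_running() prefers A, which is the group where a task is
actually waiting for a CPU.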

---
 kernel/sched/fair.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a903276..87e3b77 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7069,7 +7069,8 @@ static unsigned long task_h_load(struct task_struct *p)
 /********** Helpers for find_busiest_group ************************/
 
 enum group_type {
-       group_other = 0,
+       group_idle = 0,
+       group_other,
        group_imbalanced,
        group_overloaded,
 };
@@ -7383,6 +7384,9 @@ group_type group_classify(struct sched_group *group,
        if (sgs->group_no_capacity)
                return group_overloaded;
 
+       if (!sgs->sum_nr_running)
+               return group_idle;
+
        if (sg_imbalanced(group))
                return group_imbalanced;
 
@@ -7476,8 +7480,19 @@ static bool update_sd_pick_busiest(struct lb_env *env,
        if (sgs->group_type < busiest->group_type)
                return false;
 
-       if (sgs->avg_load <= busiest->avg_load)
+       if (sgs->group_type == group_other) {
+               /*
+                * The groups are not overloaded so there is enough cpu time
+                * for all threads. In this case, take the group with the
+                * highest number of tasks per CPU in order to improve
+                * scheduling latency
+                */
+               if ((sgs->sum_nr_running * busiest->group_weight) <=
+                               (busiest->sum_nr_running * sgs->group_weight))
+                       return false;
+       } else if (sgs->avg_load <= busiest->avg_load) {
                return false;
+       }
 
        if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
                goto asym_packing;
@@ -7969,6 +7984,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
                    !check_cpu_capacity(rq, env->sd))
                        continue;
 
+               if (!rq->cfs.h_nr_running)
+                       continue;
+
                /*
                 * For the load comparisons with the other cpu's, consider
                 * the weighted_cpuload() scaled with the cpu capacity, so
-- 
2.7.4


> 
> > only contain what's currently active but because we're scaling load
> > avg which includes both active and blocked, we're ending up picking
> > group B over A.
> >
