Hi Tejun,

On Tuesday 02 May 2017 at 09:18:53 (+0200), Vincent Guittot wrote:
> On 28 April 2017 at 22:33, Tejun Heo <t...@kernel.org> wrote:
> > Hello, Vincent.
> >
> > On Thu, Apr 27, 2017 at 10:29:10AM +0200, Vincent Guittot wrote:
> >> On 27 April 2017 at 00:52, Tejun Heo <t...@kernel.org> wrote:
> >> > Hello,
> >> >
> >> > On Wed, Apr 26, 2017 at 08:12:09PM +0200, Vincent Guittot wrote:
> >> >> On 24 April 2017 at 22:14, Tejun Heo <t...@kernel.org> wrote:
> >> >> Can the problem be on the load balance side instead ? and more
> >> >> precisely in the wakeup path ?
> >> >> After looking at the trace, it seems that task placement happens at
> >> >> wake up path and if it fails to select the right idle cpu at wake up,
> >> >> you will have to wait for a load balance which is already too late
> >> >
> >> > Oh, I was tracing most of scheduler activities and the ratios of
> >> > wakeups picking idle CPUs were about the same regardless of cgroup
> >> > membership.  I can confidently say that the latency issue that I'm
> >> > seeing is from load balancer picking the wrong busiest CPU, which is
> >> > not to say that there can't be other problems.
> >>
> >> ok. Is there any trace that you can share ? your behavior seems
> >> different from mine
> >
> >
[ snip]

> > You can notice that B's pertask weight is 4.409 which is way higher
> > than A's 2.779, and this is because Q014-asdf's contribution to Q014-/
> > is twice as high as it should be.  The root queue's runnable avg should
>
> Are you sure that this is because of blocked load in group A ? It can
> be that Q014-asdf has already had to wait before running and its load
> still increases while it is runnable but not running.
> IIUC your trace, group A has 2 running tasks and group B only one but
> load_balance selects B because of its sgs->avg_load being higher. But
> this can also happen even if runnable_load_avg of the child cfs_rq was
> propagated correctly in the group entity, because we can have a
> situation where group A has only 1 task with a higher load than the
> 2 tasks on group B; even if blocked load is not taken into account,
> load_balance will then select A.
>
> IMHO, we should better improve load balance selection. I'm going to
> add smarter group selection in load_balance. That's something we
> should have already done but it was difficult without load/util_avg
> propagation. It should be doable now.

Could you test the patch in load_balance below ?
If a group is not overloaded, which means that its threads get all the
runtime they want, we select the cfs_rq according to the number of running
threads instead.

---
 kernel/sched/fair.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a903276..87e3b77 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7069,7 +7069,8 @@ static unsigned long task_h_load(struct task_struct *p)
 /********** Helpers for find_busiest_group ************************/
 
 enum group_type {
-	group_other = 0,
+	group_idle = 0,
+	group_other,
 	group_imbalanced,
 	group_overloaded,
 };
@@ -7383,6 +7384,9 @@ group_type group_classify(struct sched_group *group,
 	if (sgs->group_no_capacity)
 		return group_overloaded;
 
+	if (!sgs->sum_nr_running)
+		return group_idle;
+
 	if (sg_imbalanced(group))
 		return group_imbalanced;
 
@@ -7476,8 +7480,19 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	if (sgs->group_type < busiest->group_type)
 		return false;
 
-	if (sgs->avg_load <= busiest->avg_load)
+	if (sgs->group_type == group_other) {
+		/*
+		 * The groups are not overloaded so there is enough cpu time
+		 * for all threads. In this case, take the group with the
+		 * highest number of tasks per CPU in order to improve
+		 * scheduling latency
+		 */
+		if ((sgs->sum_nr_running * busiest->group_weight) <=
+				(busiest->sum_nr_running * sgs->group_weight))
+			return false;
+	} else if (sgs->avg_load <= busiest->avg_load) {
 		return false;
+	}
 
 	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
 		goto asym_packing;
@@ -7969,6 +7984,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		     !check_cpu_capacity(rq, env->sd))
 			continue;
 
+		if (!rq->cfs.h_nr_running)
+			continue;
+
 		/*
 		 * For the load comparisons with the other cpu's, consider
 		 * the weighted_cpuload() scaled with the cpu capacity, so
-- 
2.7.4

>
> > only contain what's currently active but because we're scaling load
> > avg which includes both active and blocked, we're ending up picking
> > group B over A.
> >
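
For anyone who wants to try the tasks-per-CPU comparison outside the kernel, here is a rough userspace sketch of the check added to update_sd_pick_busiest(). It is not part of the patch: struct group_stats and more_tasks_per_cpu() are made-up names for illustration, and only the two fields the comparison reads are modelled. The cross-multiplication compares nr_running / group_weight for both groups without an integer division.

```c
#include <stdbool.h>

/*
 * Hypothetical userspace stand-in for the two sg_lb_stats fields that
 * the comparison reads; this struct does not exist in the kernel.
 */
struct group_stats {
	unsigned int sum_nr_running;	/* runnable tasks in the group */
	unsigned int group_weight;	/* number of CPUs in the group */
};

/*
 * Return true if @sgs has strictly more tasks per CPU than @busiest.
 *
 *   sgs_nr / sgs_weight > busiest_nr / busiest_weight
 *   <=>  sgs_nr * busiest_weight > busiest_nr * sgs_weight
 *
 * The patch tests the negated form (<=) and returns false in that case.
 */
static bool more_tasks_per_cpu(const struct group_stats *sgs,
			       const struct group_stats *busiest)
{
	return (sgs->sum_nr_running * busiest->group_weight) >
	       (busiest->sum_nr_running * sgs->group_weight);
}
```

With the case discussed above (group A running 2 tasks, group B running 1) and assuming both groups span 2 CPUs, the check prefers A over B regardless of avg_load, and a group never replaces an equally loaded one because the comparison is strict.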