On 28/09/16 14:13, Vincent Guittot wrote: > Le Wednesday 28 Sep 2016 à 05:27:54 (-0700), Vincent Guittot a écrit : >> On 28 September 2016 at 04:31, Dietmar Eggemann >> <dietmar.eggem...@arm.com> wrote: >>> On 28/09/16 12:19, Peter Zijlstra wrote: >>>> On Wed, Sep 28, 2016 at 12:06:43PM +0100, Dietmar Eggemann wrote: >>>>> On 28/09/16 11:14, Peter Zijlstra wrote: >>>>>> On Fri, Sep 23, 2016 at 12:58:08PM +0100, Matt Fleming wrote:
[...] > IIUC the problem raised by Matt, he see a regression because we now remove > during the dequeue the exact same load as during the enqueue so > cfs_rq->runnable_load_avg is null so we select a cfs_rq that might already > have > a lot of hackbench blocked thread. This is my understanding as well. > The fact that runnable_load_avg is null, when the cfs_rq doesn't have runnable > task, is quite correct and we should keep it. But when we look for the idlest > group, we have to take into account the blocked thread. > > That's what i have tried to do below [...] > + /* > + * In case that we have same runnable load (especially null > + * runnable load), we select the group with smallest blocked > + * load > + */ > + min_avg_load = avg_load; > + min_runnable_load = runnable_load; Setting 'min_runnable_load' wouldn't be necessary here. > idlest = group; > } > + > } while (group = group->next, group != sd->groups); > > - if (!idlest || 100*this_load < imbalance*min_load) > + if (!idlest || 100*this_load < imbalance*min_runnable_load) > return NULL; > return idlest; On the Hikey board (ARM64) (2 cluster, each 4 cpu's, so MC and DIE), the first f_i_g (on DIE) is still based on rbl_load. So if the first hackbench task (spawning all the worker task) runs on cluster1, and the former worker p_X already blocks f_i_g returns cluster2, if p_X still runs, it returns idlest=NULL and we continue with cluster1 for second f_i_g on MC. The additional 'else if' condition doesn't seem to help much because of occurrences where an idle cpu (which never took a worker) still has a small value of rbl_load (shouldn't actually happen, weighted_cpuload() should be 0) so it is never chosen or it has even a negative impact in the case where an idle cpu (which never took a worker) is not chosen because its load (cfs->avg.load_avg) hasn't been updated for a long time so another cpu with rbl_load = 0 and a smaller load is used (even though a lot of worker where already placed on it). There are also episodes where we 'pack' workers onto the cpu which is initially picked in f_i_c (on DIE) because (100*this_load < imbalance*min_load) is true in f_i_g on MC. Maybe we can get rid of this for !sd->child ? [...]