On Mon, Oct 09, 2017 at 09:08:57AM +0100, Morten Rasmussen wrote:
> > --- a/kernel/sched/debug.c
> > +++ b/kernel/sched/debug.c
> > @@ -565,6 +565,8 @@ void print_cfs_rq(struct seq_file *m, in
> >  			cfs_rq->removed.load_avg);
> >  	SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
> >  			cfs_rq->removed.util_avg);
> > +	SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_sum",
> > +			cfs_rq->removed.runnable_sum);
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
> >  	SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
> >  			cfs_rq->tg_load_avg_contrib);
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3330,11 +3330,77 @@ void set_task_rq_fair(struct sched_entit
> >  	se->avg.last_update_time = n_last_update_time;
> >  }
> >  
> > -/* Take into account change of utilization of a child task group */
> > +
> > +/*
> > + * When on migration a sched_entity joins/leaves the PELT hierarchy, we need to
> > + * propagate its contribution. The key to this propagation is the invariant
> > + * that for each group:
> > + *
> > + *   ge->avg == grq->avg						(1)
> > + *
> > + * _IFF_ we look at the pure running and runnable sums. Because they
> > + * represent the very same entity, just at different points in the hierarchy.
> > + *
> > + *
> > + * Per the above update_tg_cfs_util() is trivial (and still 'wrong') and
> > + * simply copies the running sum over.
> > + *
> > + * However, update_tg_cfs_runnable() is more complex. So we have:
> > + *
> > + *   ge->avg.load_avg = ge->load.weight * ge->avg.runnable_avg		(2)
> > + *
> > + * And since, like util, the runnable part should be directly transferable,
> > + * the following would _appear_ to be the straight forward approach:
> > + *
> > + *   grq->avg.load_avg = grq->load.weight * grq->avg.running_avg	(3)
> 
> Should it be grq->avg.runnable_avg instead of running_avg?

Yes very much so. Typing hard. Otherwise (3) would not follow from (2)
either.

> cfs_rq->avg.load_avg has been defined previous (in patch 2 I think) to
> be:
> 
>   load_avg = \Sum se->avg.load_avg
>            = \Sum se->load.weight * se->avg.runnable_avg
> 
> That sum will increase when ge is runnable regardless of whether it is
> running or not. So, I think it has to be runnable_avg to make sense?

Ack.

> > + *
> > + * And per (1) we have:
> > + *
> > + *   ge->avg.running_avg == grq->avg.running_avg
> 
> You just said further up that (1) only applies to running and runnable
> sums? These are averages, so I think this is invalid use of (1). But
> maybe that is part of your point about (4) being wrong?
> 
> I'm still trying to get my head around the remaining bits, but it sort
> of depends if I understood the above bits correctly :)

So while true, the thing we're looking for is indeed runnable_avg.

> > + *
> > + * Which gives:
> > + *
> > + *                      ge->load.weight * grq->avg.load_avg
> > + *   ge->avg.load_avg = -----------------------------------	(4)
> > + *                               grq->load.weight
> > + *
> > + * Except that is wrong!
> > + *
> > + * Because while for entities historical weight is not important and we
> > + * really only care about our future and therefore can consider a pure
> > + * runnable sum, runqueues can NOT do this.
> > + *
> > + * We specifically want runqueues to have a load_avg that includes
> > + * historical weights. Those represent the blocked load, the load we expect
> > + * to (shortly) return to us. This only works by keeping the weights as
> > + * integral part of the sum. We therefore cannot decompose as per (3).
> > + *
> > + * OK, so what then?

And as the text above suggests, we cannot decompose because it contains
the blocked weight, which is not included in grq->load.weight and thus
things come apart.
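To see that with made-up numbers: suppose grq holds a single queued task
of weight 1024 that is runnable all the time, plus the still-decaying
remains of a task that just blocked, contributing 512 to
grq->avg.load_avg. Then:

  grq->load.weight  = 1024		(the blocked weight is gone)
  grq->avg.load_avg = 1024 + 512	(but the blocked load is not)

and (4) computes:

  ge->avg.load_avg = ge->load.weight * 1536 / 1024
                   = 1.5 * ge->load.weight

which is more than ge could ever contribute per (2), even if it were
runnable 100% of the time. The blocked load sits in the numerator while
its weight is missing from the denominator.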
> > + *
> > + * Another way to look at things is:
> > + *
> > + *   grq->avg.load_avg = \Sum se->avg.load_avg
> > + *
> > + * Therefore, per (2):
> > + *
> > + *   grq->avg.load_avg = \Sum se->load.weight * se->avg.runnable_avg
> > + *
> > + * And the very thing we're propagating is a change in that sum (someone
> > + * joined/left). So we can easily know the runnable change, which would be, per
> > + * (2) the already tracked se->load_avg divided by the corresponding
> > + * se->weight.
> > + *
> > + * Basically (4) but in differential form:
> > + *
> > + *   d(runnable_avg) += se->avg.load_avg / se->load.weight	(5)
> > + *   ge->avg.load_avg += ge->load.weight * d(runnable_avg)

And this all has runnable again, and so should make sense.
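FWIW, spelled out in code, (5) amounts to something like the below.
Purely illustrative -- made-up names (there is no propagate_load_delta()
as such), and it ignores the fixed point scaling and the sum/period
split the real PELT code has to deal with:

/* Illustrative sketch only; not the actual kernel implementation. */
struct pelt_avg {
	long load_avg;
	long runnable_avg;
};

struct pelt_entity {
	long weight;			/* se->load.weight */
	struct pelt_avg avg;
};

/*
 * A child entity @se joined (@sign = 1) or left (@sign = -1) the group
 * runqueue below group entity @ge. Per (2) the runnable change is the
 * already tracked se->avg.load_avg divided by se->weight; per (5) that
 * delta, scaled by ge's own weight, is ge's load_avg change.
 */
static void propagate_load_delta(struct pelt_entity *ge,
				 const struct pelt_entity *se, int sign)
{
	long d_runnable;

	if (!se->weight)
		return;

	d_runnable = sign * (se->avg.load_avg / se->weight);

	ge->avg.runnable_avg += d_runnable;
	ge->avg.load_avg += ge->weight * d_runnable;
}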
Combined with an earlier bit, noted by Dietmar, I now have the below
delta.

---
 kernel/sched/fair.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b9e520b6923e..ba879c42bddd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3333,7 +3333,7 @@ __update_load_avg_cfs_rq(u64 now, int cpu, struct cfs_rq *cfs_rq)
  * differential update where we store the last value we propagated. This in
  * turn allows skipping updates if the differential is 'small'.
  *
- * Updating tg's load_avg is necessary before update_cfs_share().
+ * Updating tg's load_avg is necessary before update_cfs_group().
  */
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
 {
@@ -3422,11 +3422,11 @@ void set_task_rq_fair(struct sched_entity *se,
  * And since, like util, the runnable part should be directly transferable,
  * the following would _appear_ to be the straight forward approach:
  *
- *   grq->avg.load_avg = grq->load.weight * grq->avg.running_avg	(3)
+ *   grq->avg.load_avg = grq->load.weight * grq->avg.runnable_avg	(3)
  *
  * And per (1) we have:
  *
- *   ge->avg.running_avg == grq->avg.running_avg
+ *   ge->avg.runnable_avg == grq->avg.runnable_avg
  *
  * Which gives:
  *
@@ -3601,7 +3601,7 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
  * avg. The immediate corollary is that all (fair) tasks must be attached, see
  * post_init_entity_util_avg().
  *
- * cfs_rq->avg is used for task_h_load() and update_cfs_share() for example.
+ * cfs_rq->avg is used for task_h_load() and update_cfs_group() for example.
  *
  * Returns true if the load decayed or we removed load.
  *