Yuyang Du <[email protected]> writes: > Currently, load_avg = scale_load_down(load) * runnable%. The extra scaling > down of load does not make much sense, because load_avg is primarily THE > load and on top of that, we take runnable time into account. > > We therefore remove scale_load_down() for load_avg. But we need to > carefully consider the overflow risk if load has higher range > (2*SCHED_FIXEDPOINT_SHIFT). The only case an overflow may occur due > to us is on 64bit kernel with increased load range. In that case, > the 64bit load_sum can afford 4251057 (=2^64/47742/88761/1024) > entities with the highest load (=88761*1024) always runnable on one
This is true (ignoring that the cgroup max is a bit higher at 262144, but that's only ~3x, hardly an issue), but I wondered when the uses of load_avg in h_load and effective_load hit issues, under the assumption that maybe the old old code used scale_load_down originally, so here's the math: effective_load multiplies load_avg by tg->shares, allowing a maximum of (2^18)*load_avg=(2^18)*((2^18)*1024)=2^46 per child cgroup, allowing 131072 child cgroups, or ~3x that in tasks. (Half of normal because it is signed) h_load however is cfs_rq_load_avg * se->avg.load_avg, so we have an extra factor of 1024, allowing only 256 max weight cgroups or 9x that in tasks. That said, when I checked the old code, it didn't use scale_load_down back h_load involved cfs_rq->load.weight * se->load.weight, so this is only new in a recent sense. TLDR: you're correct that load_avg hits any issues first, and it did so before any of the new average computations were added. > single cfs_rq, which may be an issue, but should be fine. Even if this > occurs at the end of day, on the condition where it occurs, the > load average will not be useful anyway. > > Signed-off-by: Yuyang Du <[email protected]> > [update calculate_imbalance] > Signed-off-by: Vincent Guittot <[email protected]> > --- > include/linux/sched.h | 19 ++++++++++++++----- > kernel/sched/fair.c | 19 +++++++++---------- > 2 files changed, 23 insertions(+), 15 deletions(-) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index db3c6e1..8df6d69 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1213,7 +1213,7 @@ struct load_weight { > * > * [load_avg definition] > * > - * load_avg = runnable% * scale_load_down(load) > + * load_avg = runnable% * load > * > * where runnable% is the time ratio that a sched_entity is runnable. > * For cfs_rq, it is the aggregated such load_avg of all runnable and > @@ -1221,7 +1221,7 @@ struct load_weight { > * > * load_avg may also take frequency scaling into account: > * > - * load_avg = runnable% * scale_load_down(load) * freq% > + * load_avg = runnable% * load * freq% > * > * where freq% is the CPU frequency normalize to the highest frequency > * > @@ -1247,9 +1247,18 @@ struct load_weight { > * > * [Overflow issue] > * > - * The 64bit load_sum can have 4353082796 (=2^64/47742/88761) entities > - * with the highest load (=88761) always runnable on a single cfs_rq, we > - * should not overflow as the number already hits PID_MAX_LIMIT. > + * On 64bit kernel: > + * > + * When load has small fixed point range (SCHED_FIXEDPOINT_SHIFT), the > + * 64bit load_sum can have 4353082796 (=2^64/47742/88761) tasks with > + * the highest load (=88761) always runnable on a cfs_rq, we should > + * not overflow as the number already hits PID_MAX_LIMIT. > + * > + * When load has large fixed point range (2*SCHED_FIXEDPOINT_SHIFT), > + * the 64bit load_sum can have 4251057 (=2^64/47742/88761/1024) tasks > + * with the highest load (=88761*1024) always runnable on ONE cfs_rq, > + * we should be fine. Even if the overflow occurs at the end of day, > + * at the time the load_avg won't be useful anyway in that situation. > * > * For all other cases (including 32bit kernel), struct load_weight's > * weight will overflow first before we do, because: > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index bf835b5..da6642f 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -680,7 +680,7 @@ void init_entity_runnable_average(struct sched_entity *se) > * will definitely be update (after enqueue). > */ > sa->period_contrib = 1023; > - sa->load_avg = scale_load_down(se->load.weight); > + sa->load_avg = se->load.weight; > sa->load_sum = sa->load_avg * LOAD_AVG_MAX; > sa->util_avg = SCHED_CAPACITY_SCALE; > sa->util_sum = sa->util_avg * LOAD_AVG_MAX; > @@ -2837,7 +2837,7 @@ static inline int update_cfs_rq_load_avg(u64 now, > struct cfs_rq *cfs_rq) > } > > decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa, > - scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, > cfs_rq); > + cfs_rq->load.weight, cfs_rq->curr != NULL, cfs_rq); > > #ifndef CONFIG_64BIT > smp_wmb(); > @@ -2858,8 +2858,7 @@ static inline void update_load_avg(struct sched_entity > *se, int update_tg) > * Track task load average for carrying it to new CPU after migrated, > and > * track group sched_entity load average for task_h_load calc in > migration > */ > - __update_load_avg(now, cpu, &se->avg, > - se->on_rq * scale_load_down(se->load.weight), > + __update_load_avg(now, cpu, &se->avg, se->on_rq * se->load.weight, > cfs_rq->curr == se, NULL); > > if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg) > @@ -2896,7 +2895,7 @@ skip_aging: > static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct > sched_entity *se) > { > __update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)), > - &se->avg, se->on_rq * > scale_load_down(se->load.weight), > + &se->avg, se->on_rq * se->load.weight, > cfs_rq->curr == se, NULL); > > cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - > se->avg.load_avg, 0); > @@ -2916,7 +2915,7 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct > sched_entity *se) > migrated = !sa->last_update_time; > if (!migrated) { > __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa, > - se->on_rq * scale_load_down(se->load.weight), > + se->on_rq * se->load.weight, > cfs_rq->curr == se, NULL); > } > > @@ -6876,10 +6875,10 @@ static inline void calculate_imbalance(struct lb_env > *env, struct sd_lb_stats *s > */ > if (busiest->group_type == group_overloaded && > local->group_type == group_overloaded) { > - load_above_capacity = busiest->sum_nr_running * > - scale_load_down(NICE_0_LOAD); > - if (load_above_capacity > busiest->group_capacity) > - load_above_capacity -= busiest->group_capacity; > + load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD; > + if (load_above_capacity > scale_load(busiest->group_capacity)) > + load_above_capacity -= > + scale_load(busiest->group_capacity); > else > load_above_capacity = ~0UL; > }

