Hello, Vincent.
On Thu, Apr 27, 2017 at 10:28:01AM +0200, Vincent Guittot wrote:
> On 27 April 2017 at 02:30, Tejun Heo <[email protected]> wrote:
> > Hello, Vincent.
> >
> > On Wed, Apr 26, 2017 at 12:21:52PM +0200, Vincent Guittot wrote:
> >> > This is from the follow-up patch. I was confused. Because we don't
> >> > propagate decays, we still should decay the runnable_load_avg;
> >> > otherwise, we end up accumulating errors in the counter. I'll drop
> >> > the last patch.
> >>
> >> Ok, the runnable_load_avg goes back to 0 when I drop patch 3. But I
> >> see runnable_load_avg sometimes significantly higher than load_avg,
> >> which is normally not possible since load_avg = runnable_load_avg +
> >> the sleeping tasks' load_avg.
> >
> > So, while load_avg would eventually converge on runnable_load_avg +
> > blocked load_avg given a stable enough workload for long enough,
> > runnable_load_avg temporarily jumping above load_avg is expected,
>
> No, it's not. Look at load_avg/runnable_load_avg at the root domain
> when only tasks are involved: runnable_load_avg will never be higher
> than load_avg because
>   load_avg = \sum load_avg of tasks attached to the cfs_rq
>   runnable_load_avg = \sum load_avg of tasks attached and enqueued
>                       to the cfs_rq
> So load_avg = runnable_load_avg + the blocked tasks' load_avg, and as
> a result runnable_load_avg is always lower than load_avg.
We don't actually calculate it that way, but yeah, given the
calculations we do, runnable_load_avg shouldn't be significantly higher
than load_avg. I see runnable_load_avg going above load_avg fairly
often but the margins are small. I wonder whether you're seeing the
errors accumulating from the incorrect min() capping.
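Just to make the invariant concrete, here's a minimal user-space sketch
(not the kernel code; the per-task numbers are made up) of how the two
sums relate when only tasks are attached to a cfs_rq:

#include <stdio.h>
#include <stdbool.h>

struct task { long load_avg; bool enqueued; };

int main(void)
{
	/* hypothetical per-task PELT averages */
	struct task tasks[] = {
		{ 512, true  },		/* running */
		{ 256, true  },		/* runnable, waiting for a CPU */
		{ 128, false },		/* sleeping: blocked contribution */
	};
	long load_avg = 0, runnable_load_avg = 0;
	unsigned int i;

	for (i = 0; i < sizeof(tasks) / sizeof(tasks[0]); i++) {
		/* load_avg sums every attached task ... */
		load_avg += tasks[i].load_avg;
		/* ... runnable_load_avg only the enqueued subset */
		if (tasks[i].enqueued)
			runnable_load_avg += tasks[i].load_avg;
	}

	/* prints: load_avg=896 runnable_load_avg=768 */
	printf("load_avg=%ld runnable_load_avg=%ld\n",
	       load_avg, runnable_load_avg);
	return 0;
}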
> And with the propagate load/util_avg patchset, we can even reflect
> task migration directly at the root domain, whereas before we had to
> wait for util/load_avg and runnable_load_avg to converge to the new
> value.
>
> Just to confirm one of my assumptions: the latency regression was
> already there in previous kernel versions and is not a result of the
> propagate load/util_avg patchset, is it?
Yeah, the latency problem is independent of the propagation logic.
It's really the meaning of the top level runnable_load_avg being
different w/ and w/o cgroups.
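As a toy illustration of what I mean (a user-space sketch with made-up
numbers, not the actual PELT code): without cgroups, a task that goes
to sleep is dequeued and its contribution leaves the root
runnable_load_avg immediately, while with a group in between the root
only sees the group sched_entity, whose load_avg decays gradually.

#include <stdio.h>

int main(void)
{
	double contrib = 1024.0;	/* the lone task's load_avg */
	int ms;

	/* without cgroups: dequeueing removes the whole contribution */
	printf("no cgroup: root runnable_load_avg right after sleep = 0\n");

	/*
	 * with a cgroup: the group se's load_avg only decays over time,
	 * a PELT sum roughly halving every 32ms of idleness
	 */
	for (ms = 0; ms <= 96; ms += 32) {
		printf("cgroup   : root contribution after %2dms asleep ~= %4.0f\n",
		       ms, contrib);
		contrib *= 0.5;
	}
	return 0;
}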
> > AFAICS. That's the whole point of it, a sum closely tracking what's
> > currently on the cpu so that we can pick the cpu which has the most on
> > it now. It doesn't make sense to try to pick threads off of a cpu
> > which is generally loaded but doesn't have much going on right now,
> > after all.
>
> The only point of runnable_load_avg is to be null when a cfs_rq is
> idle whereas load_avg is not, not to be higher than load_avg. The
> root cause is that load_balance only looks at "load" but not at the
> number of tasks currently running, and that's probably the main
> problem: runnable_load_avg has been added because load_balance fails
> to filter out idle groups and idle rqs. We would be better off adding
> a new type in group_classify to tag groups that are idle, and doing
> the same in find_busiest_queue.
I'll follow up in the other subthread but there really is a fundamental
difference in how we calculate runnable_load_avg w/ and w/o cgroups.
Independent of whether we can improve the load balancer further, it is
an existing bug.
> > * Can you please double check that the higher latencies w/ the patch
> > are reliably reproducible? The test machines that I use have a
> > variable management load. It never dominates the machine but is
> > enough to disturb the results, so drawing out a reliable pattern
> > takes a lot of repeated runs. I'd really appreciate it if you could
> > double check that the pattern is reliable with different run
> > patterns (i.e. interleaved rather than 10 consecutive runs one after
> > another).
>
> I always leave some time between 2 consecutive runs, and the 10
> consecutive runs take around 2 minutes to execute.
>
> I have then run these 10 consecutive tests several times and the
> results stay the same.
Can you please try the patch at the end of this message? I'm
wondering whether you're seeing the errors accumulating from the wrong
min().
> > ones. If there's something I can easily buy, please point me to it.
> > If there's something I can loan, that'd be great too.
>
> It looks like most of the web sites are currently out of stock.
:(
> > * If not, I'll try to clean up the debug patches I have and send them
> > your way to get more visibility, but given that these things tend to
> > be very iterative, it might take quite a few back-and-forths.
>
> Yes, that could be useful. Even a trace of the regression could be
> useful.
Yeah, will clean it up a bit and post. Will also post some traces.
And here's the updated patch. I didn't add the suggested
scale_down()'s. I'll follow up on that in the other subthread.
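To illustrate the arithmetic the new hunk below ends up doing, here's a
stand-alone user-space sketch with made-up values (scale_load_down() is
treated as a no-op here):

#include <stdio.h>

static long min_l(long a, long b) { return a < b ? a : b; }

int main(void)
{
	/* hypothetical values for one group cfs_rq */
	long shares = 1024;		/* freshly calculated group shares */
	long runnable_load_avg = 300;	/* gcfs_rq->runnable_load_avg */
	long weight = 2048;		/* gcfs_rq->load.weight */

	/*
	 * Scale the child's runnable load to the group entity's weight
	 * and cap it at the shares themselves, mirroring
	 * min_t(long, scale_load_down(shares), runnable * shares / weight)
	 * with scale_load_down() treated as a no-op.
	 */
	long load = min_l(shares, runnable_load_avg * shares / weight);

	printf("propagated load = %ld\n", load);	/* prints 150 */
	return 0;
}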
Thanks.
kernel/sched/fair.c | 47 ++++++++++++++++++++---------------------------
1 file changed, 20 insertions(+), 27 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3078,37 +3078,30 @@ static inline void
update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
struct cfs_rq *gcfs_rq = group_cfs_rq(se);
- long delta, load = gcfs_rq->avg.load_avg;
+ long load = 0, delta;
/*
- * If the load of group cfs_rq is null, the load of the
- * sched_entity will also be null so we can skip the formula
+ * A cfs_rq's load avg contribution to the parent should be scaled
+ * to the sched_entity's weight. Use freshly calculated shares
+ * instead of @se->load.weight as the latter may not reflect
+ * changes from the current scheduling operation.
+ *
+ * Note that the propagation source is runnable_load_avg instead of
+ * load_avg. This keeps every cfs_rq's runnable_load_avg true to
+ * the sum of the scaled loads of all tasks queued under it, which
+ * is important for the correct operation of the load balancer.
+ *
+ * This can make the sched_entity's load_avg jumpier but that
+ * correctly reflects what would happen without cgroups if each
+ * task's load is scaled across nesting - the load is being
+ * averaged at the task and each cfs_rq.
*/
- if (load) {
- long tg_load;
+ if (gcfs_rq->load.weight) {
+ long shares = calc_cfs_shares(gcfs_rq, gcfs_rq->tg);
- /* Get tg's load and ensure tg_load > 0 */
- tg_load = atomic_long_read(&gcfs_rq->tg->load_avg) + 1;
-
- /* Ensure tg_load >= load and updated with current load*/
- tg_load -= gcfs_rq->tg_load_avg_contrib;
- tg_load += load;
-
- /*
- * We need to compute a correction term in the case that the
- * task group is consuming more CPU than a task of equal
- * weight. A task with a weight equals to tg->shares will have
- * a load less or equal to scale_load_down(tg->shares).
- * Similarly, the sched_entities that represent the task group
- * at parent level, can't have a load higher than
- * scale_load_down(tg->shares). And the Sum of sched_entities'
- * load must be <= scale_load_down(tg->shares).
- */
- if (tg_load > scale_load_down(gcfs_rq->tg->shares)) {
- /* scale gcfs_rq's load into tg's shares*/
- load *= scale_load_down(gcfs_rq->tg->shares);
- load /= tg_load;
- }
+ load = min_t(long, scale_load_down(shares),
+ gcfs_rq->runnable_load_avg *
+ shares / gcfs_rq->load.weight);
}
delta = load - se->avg.load_avg;