Sorry for being verbose; I've been procrastinating replying, and in doing so the things I wanted to say kept growing.
On Fri, Apr 24, 2020 at 10:24:43PM +0800, Aaron Lu wrote: > To make this work, the root level sched entities' vruntime of the two > threads must be directly comparable. So one of the hyperthread's root > cfs_rq's min_vruntime is chosen as the core wide one and all root level > sched entities' vruntime is normalized against it. > +/* > + * This is called in stop machine context so no need to take the rq lock. > + * > + * Core scheduling is going to be enabled and the root level sched entities > + * of both siblings will use cfs_rq->min_vruntime as the common cfs_rq > + * min_vruntime, so it's necessary to normalize vruntime of existing root > + * level sched entities in sibling_cfs_rq. > + * > + * Update of sibling_cfs_rq's min_vruntime isn't necessary as we will be > + * only using cfs_rq->min_vruntime during the entire run of core scheduling. > + */ > +void sched_core_normalize_se_vruntime(int cpu) > +{ > + struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs; > + int i; > + > + for_each_cpu(i, cpu_smt_mask(cpu)) { > + struct sched_entity *se, *next; > + struct cfs_rq *sibling_cfs_rq; > + s64 delta; > + > + if (i == cpu) > + continue; > + > + sibling_cfs_rq = &cpu_rq(i)->cfs; > + if (!sibling_cfs_rq->nr_running) > + continue; > + > + delta = cfs_rq->min_vruntime - sibling_cfs_rq->min_vruntime; > + rbtree_postorder_for_each_entry_safe(se, next, > + &sibling_cfs_rq->tasks_timeline.rb_root, > + run_node) { > + se->vruntime += delta; > + } > + } > +} Aside from this being way to complicated for what it does -- you could've saved the min_vruntime for each rq and compared them with subtraction -- it is also terminally broken afaict. Consider any infeasible weight scenario. Take for instance two tasks, each bound to their respective sibling, one with weight 1 and one with weight 2. Then the lower weight task will run ahead of the higher weight task without bound. This utterly destroys the concept of a shared time base. Remember; all this is about a proportionally fair scheduling, where each tasks receives: w_i dt_i = ---------- dt (1) \Sum_j w_j which we do by tracking a virtual time, s_i: 1 s_i = --- d[t]_i (2) w_i Where d[t] is a delta of discrete time, while dt is an infinitesimal. The immediate corrolary is that the ideal schedule S, where (2) to use an infnitesimal delta, is: 1 S = ---------- dt (3) \Sum_i w_i >From which we can define the lag, or deviation from the ideal, as: lag(i) = S - s_i (4) And since the one and only purpose is to approximate S, we get that: \Sum_i w_i lag(i) := 0 (5) If this were not so, we no longer converge to S, and we can no longer claim our scheduler has any of the properties we derive from S. This is exactly what you did above, you broke it! Let's continue for a while though; to see if there is anything useful to be learned. We can combine (1)-(3) or (4)-(5) and express S in s_i: \Sum_i w_i s_i S = -------------- (6) \Sum_i w_i Which gives us a way to compute S, given our s_i. Now, if you've read our code, you know that we do not in fact do this, the reason for this is two-fold. Firstly, computing S in that way requires a 64bit division for every time we'd use it (see 12), and secondly, this only describes the steady-state, it doesn't handle dynamics. Anyway, in (6): s_i -> x + (s_i - x), to get: \Sum_i w_i (s_i - x) S - x = -------------------- (7) \Sum_i w_i Which shows that S and s_i transform alike (which makes perfect sense given that S is basically the (weighted) average of s_i). Then: x -> s_min := min{s_i} (8) to obtain: \Sum_i w_i (s_i - s_min) S = s_min + ------------------------ (9) \Sum_i w_i Which already looks familiar, and is the basis for our current approximation: S ~= s_min (10) Now, obviously, (10) is absolute crap :-), but it sorta works. So the thing to remember is that the above is strictly UP. It is possible to generalize to multiple runqueues -- however it gets really yuck when you have to add affinity support, as illustrated by our very first counter-example. XXX segue into the load-balance issues related to this: - how a negative lag task on a 'heavy' runqueue should not remain a negative lag task when migrated to a 'light' runqueue. - how we can compute and use the combined S in load-balancing to better handle infeasible weight scenarios. Luckily I think we can avoid needing a full multi-queue variant for core-scheduling (or load-balancing). The crucial observation is that we only actually need this comparison in the presence of forced-idle; only then do we need to tell if the stalled rq has higher priority over the other. [XXX assumes SMT2; better consider the more general case, I suspect it'll work out because our comparison is always between 2 rqs and the answer is only interesting if one of them is forced-idle] And (under assumption of SMT2) when there is forced-idle, there is only a single queue, so everything works like normal. Let, for our runqueue 'k': T_k = \Sum_i w_i s_i W_k = \Sum_i w_i ; for all i of k (11) Then we can write (6) like: T_k S_k = --- (12) W_k >From which immediately follows that: T_k + T_l S_k+l = --------- (13) W_k + W_l On which we can define a combined lag: lag_k+l(i) := S_k+l - s_i (14) And that gives us the tools to compare tasks across a combined runqueue. Combined this gives the following: a) when a runqueue enters force-idle, sync it against it's sibling rq(s) using (7); this only requires storing single 'time'-stamps. b) when comparing tasks between 2 runqueues of which one is forced-idle, compare the combined lag, per (14). Now, of course cgroups (I so hate them) make this more interesting in that a) seems to suggest we need to iterate all cgroup on a CPU at such boundaries, but I think we can avoid that. The force-idle is for the whole CPU, all it's rqs. So we can mark it in the root and lazily propagate downward on demand.