On Sat, May 16, 2020 at 11:42:30AM +0800, Aaron Lu wrote: > On Thu, May 14, 2020 at 03:02:48PM +0200, Peter Zijlstra wrote: > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -4476,6 +4473,16 @@ next_class:; > > WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick)); > > } > > > > + /* XXX SMT2 only */ > > + if (new_active == 1 && old_active > 1) { > > There is a case when incompatible task appears but we failed to 'drop > into single-rq mode' per the above condition check. The TLDR is: when > there is a task that sits on the sibling rq with the same cookie as > 'max', new_active will be 2 instead of 1 and that would cause us missing > the chance to do a sync of core min_vruntime.
FWIW: when I disable the feature of running cookie_pick task on sibling and thus enforce a strict single-rq mode, Peter's patch works well for the scenario described below. > This is how it happens: > 1) 2 tasks of the same cgroup with different weight running on 2 siblings, > say cg0_A with weight 1024 bound at cpu0 and cg0_B with weight 2 bound > at cpu1(assume cpu0 and cpu1 are siblings); > 2) Since new_active == 2, we didn't trigger min_vruntime sync. For > simplicity, let's assume both siblings' root cfs_rq's min_vruntime and > core_vruntime are all at 0 now; > 3) let the two tasks run a while; > 4) a new task cg1_C of another cgroup gets queued on cpu1. Since cpu1's > existing task has a very small weight, its cfs_rq's min_vruntime can > be much larger than cpu0's cfs_rq min_vruntime. So cg1_C's vruntime is > much larger than cg0_A's and the 'max' of the core wide task > selection goes to cg0_A; > 5) Now I suppose we should drop into single-rq mode and by doing a sync > of core min_vruntime, cg1_C's turn shall come. But the problem is, our > current selection logic prefer not to waste CPU time so after decides > cg0_A as the 'max', the sibling will also do a cookie_pick() and > get cg0_B to run. This is where problem asises: new_active is 2 > instead of the expected 1. > 6) Due to we didn't do the sync of core min_vruntime, the newly queued > cg1_C shall wait a long time before cg0_A's vruntime catches up. P.S. this is what I did to enforce a strict single-rq mode: diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 1fa5b48b742a..0f5580bc7e96 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4411,7 +4411,7 @@ pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *ma (!max || prio_less(max, class_pick))) return class_pick; - return cookie_pick; + return NULL; } static struct task_struct *