We found the source of the major performance regression we discussed previously. It turns out there was a pattern where a task (a kworker in this case) could be woken up, but the core could still end up idle before that task had a chance to run.
Example sequence, cpu0 and cpu1 and siblings on the same core, task1 and task2 are in the same cgroup with the tag enabled (each following line happens in the increasing order of time): - task1 running on cpu0, task2 running on cpu1 - sched_waking(kworker/0, target_cpu=cpu0) - task1 scheduled out of cpu0 - kworker/0 cannot run on cpu0 because of task2 is still running on cpu1 cpu0 is idle - task2 scheduled out of cpu1 - cpu1 doesn’t select kworker/0 for cpu0, because the optimization path ends the task selection if core_cookie is NULL for currently selected process and the cpu1’s runqueue. - cpu1 is idle --> both siblings are idle but kworker/0 is still in the run queue of cpu0. Cpu0 may stay idle for longer if it goes deep idle. With the fix below, we ensure to send an IPI to the sibling if it is idle and has tasks waiting in its runqueue. This fixes the performance issue we were seeing. Now here is what we can measure with a disk write-intensive benchmark: - no performance impact with enabling core scheduling without any tagged task, - 5% overhead if one tagged task is competing with an untagged task, - 10% overhead if 2 tasks tagged with a different tag are competing against each other. We are starting more scaling tests, but this is very encouraging ! diff --git a/kernel/sched/core.c b/kernel/sched/core.c index e1fa10561279..02c862a5e973 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3779,7 +3779,22 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) trace_printk("unconstrained pick: %s/%d %lx\n", next->comm, next->pid, next->core_cookie); + rq->core_pick = NULL; + /* + * If the sibling is idling, we might want to wake it + * so that it can check for any runnable but blocked tasks + * due to previous task matching. + */ + for_each_cpu(j, smt_mask) { + struct rq *rq_j = cpu_rq(j); + rq_j->core_pick = NULL; + if (j != cpu && is_idle_task(rq_j->curr) && rq_j->nr_running) { + resched_curr(rq_j); + trace_printk("IPI(%d->%d[%d]) idle preempt\n", + cpu, j, rq_j->nr_running); + } + } goto done; }