(+Cc VincentD, who's been looking at some of this)
On 11/09/20 09:17, Peter Zijlstra wrote:
> In preparation for migrate_disable(), make sure only per-cpu kthreads
> are allowed to run on !active CPUs.
>
> This is ran (as one of the very first steps) from the cpu-hotplug
> task which is a per-cpu kthread and completion of the hotplug
> operation only requires such tasks.
>
> This contraint enables the migrate_disable() implementation to wait

s/contraint/constraint/

> for completion of all migrate_disable regions on this CPU at hotplug
> time without fear of any new ones starting.
>
> This replaces the unlikely(rq->balance_callbacks) test at the tail of
> context_switch with an unlikely(rq->balance_work), the fast path is
> not affected.
>
> Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
> ---
>  kernel/sched/core.c  | 103 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>  kernel/sched/sched.h |   5 ++
>  2 files changed, 106 insertions(+), 2 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6836,6 +6849,87 @@ static void migrate_tasks(struct rq *dea
>
>         rq->stop = stop;
>  }
> +
> +static int __balance_push_stop(void *arg)

__balance_push_cpu_stop()? _cpu_stop() seems to be the usual suffix.

> +{
> +       struct task_struct *p = arg;
> +       struct rq *rq = this_rq();
> +       struct rq_flags rf;
> +       int cpu;
> +
> +       raw_spin_lock_irq(&p->pi_lock);
> +       rq_lock(rq, &rf);
> +
> +       if (task_rq(p) == rq && task_on_rq_queued(p)) {
> +               cpu = select_fallback_rq(rq->cpu, p);
> +               rq = __migrate_task(rq, &rf, p, cpu);
> +       }
> +
> +       rq_unlock(rq, &rf);
> +       raw_spin_unlock_irq(&p->pi_lock);
> +
> +       put_task_struct(p);
> +
> +       return 0;
> +}
> +
> +static DEFINE_PER_CPU(struct cpu_stop_work, push_work);
> +
> +/*
> + * Ensure we only run per-cpu kthreads once the CPU goes !active.
> + */
> +static bool balance_push(struct rq *rq)
> +{
> +       struct task_struct *push_task = rq->curr;
> +
> +       lockdep_assert_held(&rq->lock);
> +       SCHED_WARN_ON(rq->cpu != smp_processor_id());
> +
> +       /*
> +        * Both the cpu-hotplug and stop task are in this class and are
> +        * required to complete the hotplug process.

Nit: s/class/case/? I can't not read "class" as "sched_class", and those
two are *not* in the same sched_class, and confusion ensues.

> +        */
> +       if (is_per_cpu_kthread(push_task))
> +               return false;
> +
> +       get_task_struct(push_task);
> +       /*
> +        * Temporarily drop rq->lock such that we can wake-up the stop task.
> +        * Both preemption and IRQs are still disabled.
> +        */
> +       raw_spin_unlock(&rq->lock);
> +       stop_one_cpu_nowait(rq->cpu, __balance_push_stop, push_task,
> +                           this_cpu_ptr(&push_work));
> +       /*
> +        * At this point need_resched() is true and we'll take the loop in
> +        * schedule(). The next pick is obviously going to be the stop task
> +        * which is_per_cpu_kthread() and will push this task away.
> +        */
> +       raw_spin_lock(&rq->lock);
> +
> +       return true;
> +}
> +
> @@ -6968,6 +7064,8 @@ int sched_cpu_deactivate(unsigned int cp
>          */
>         synchronize_rcu();
>
> +       balance_push_set(cpu, true);
> +

IIUC this is going to make every subsequent finish_lock_switch() migrate
the switched-to task if it isn't a pcpu kthread. So this is going to lead
to a dance of

  switch_to(<task0>) -> switch_to(<stopper>) -> switch_to(<task1>) -> switch_to(<stopper>) ...

Wouldn't it be better to batch all those in a migrate_tasks() sibling
that skips pcpu kthreads? Rough sketch of what I mean at the bottom of
this mail.

>  #ifdef CONFIG_SCHED_SMT
>         /*
>          * When going down, decrement the number of cores with SMT present.
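
Something along the lines of the below is what I had in mind. Wholly
untested, the name balance_push_tasks() is made up, it leans on the
__pick_migrate_task() / rq_relock() / select_fallback_rq() /
__migrate_task() helpers migrate_tasks() already uses (IIRC), and it
hand-waves the case where the pick hands us a per-CPU kthread while
other tasks are still queued (a real version would have to skip over it
rather than bail):

/*
 * Untested sketch: push away all tasks queued on @src_rq that aren't
 * per-CPU kthreads in one pass, a la migrate_tasks(), rather than
 * bouncing through the stopper once per task via balance_push().
 * Caller holds src_rq->lock via @rf, IRQs disabled.
 */
static void balance_push_tasks(struct rq *src_rq, struct rq_flags *rf)
{
        struct task_struct *next, *stop = src_rq->stop;
        struct rq *rq = src_rq;
        struct rq_flags orf = *rf;
        int dest_cpu;

        /* Keep the pick below from latching onto the stop task. */
        rq->stop = NULL;
        update_rq_clock(rq);

        for (;;) {
                /* Only this context left, nothing more to push. */
                if (rq->nr_running == 1)
                        break;

                next = __pick_migrate_task(rq);

                /*
                 * Per-CPU kthreads get to stay. Simplification: stop at
                 * the first one picked, see caveat above.
                 */
                if (is_per_cpu_kthread(next))
                        break;

                /* As in migrate_tasks(): pi_lock stabilizes ->cpus_mask. */
                rq_unlock(rq, rf);
                raw_spin_lock(&next->pi_lock);
                rq_relock(rq, rf);

                /*
                 * @next may have moved or been dequeued while rq->lock
                 * was dropped; if so, go pick again.
                 */
                if (task_rq(next) != rq || !task_on_rq_queued(next)) {
                        raw_spin_unlock(&next->pi_lock);
                        continue;
                }

                dest_cpu = select_fallback_rq(src_rq->cpu, next);
                rq = __migrate_task(rq, rf, next, dest_cpu);

                if (rq != src_rq) {
                        rq_unlock(rq, rf);
                        rq = src_rq;
                        *rf = orf;
                        rq_relock(rq, rf);
                }
                raw_spin_unlock(&next->pi_lock);
        }

        rq->stop = stop;
}

That would still leave balance_push() to deal with whatever wakes up on
this CPU afterwards, but the tasks already queued at deactivation time
wouldn't each need their own trip through the stopper.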