On Tue, 2020-11-17 at 19:28 +0000, Valentin Schneider wrote:
> We did have some breakage in that area, but all the holes I was aware of
> have been plugged. What would help here is to see which tasks are still
> queued on that outgoing CPU, and their recent activity.
>
> Something like
> - ftrace_dump_on_oops on your kernel cmdline
> - trace-cmd start -e 'sched:*'
> <start the test here>
>
> ought to do it. Then you can paste the (tail of the) ftrace dump.
>
> I also had this laying around, which may or may not be of some help:

Okay, your patch did not help, since the problem can still be reproduced using this,

https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/hotplug/cpu_hotplug/functional/cpuhotplug04.sh

# while :; do cpuhotplug04.sh -l 1; done

The ftrace dump produces too much output on this 256-CPU system, so I did
not have the patience to wait for it to finish after 15 minutes. But here is
the log captured so far (search for "kernel BUG" there).

http://people.redhat.com/qcai/console.log

> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a6aaf9fb3400..c4a4cb8b47a2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7534,7 +7534,25 @@ int sched_cpu_dying(unsigned int cpu)
>  	sched_tick_stop(cpu);
>
>  	rq_lock_irqsave(rq, &rf);
> -	BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq));
> +
> +	if (rq->nr_running != 1 || rq_has_pinned_tasks(rq)) {
> +		struct task_struct *g, *p;
> +
> +		pr_crit("CPU%d nr_running=%d\n", cpu, rq->nr_running);
> +		rcu_read_lock();
> +		for_each_process_thread(g, p) {
> +			if (task_cpu(p) != cpu)
> +				continue;
> +
> +			if (!task_on_rq_queued(p))
> +				continue;
> +
> +			pr_crit("\tp=%s\n", p->comm);
> +		}
> +		rcu_read_unlock();
> +		BUG();
> +	}
> +
>  	rq_unlock_irqrestore(rq, &rf);
>
>  	calc_load_migrate(rq);
>