On 04/12/20 21:19, Qian Cai wrote:
> On Tue, 2020-11-17 at 19:28 +0000, Valentin Schneider wrote:
>> We did have some breakage in that area, but all the holes I was aware of
>> have been plugged. What would help here is to see which tasks are still
>> queued on that outgoing CPU, and their recent activity.
>>
>> Something like
>> - ftrace_dump_on_oops on your kernel cmdline
>> - trace-cmd start -e 'sched:*'
>> <start the test here>
>>
>> ought to do it. Then you can paste the (tail of the) ftrace dump.
>>
>> I also had this laying around, which may or may not be of some help:
>
> Okay, your patch did not help, since it can still be reproduced using this,
>

It wasn't meant to fix this, only add some more debug prints :)

> https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/hotplug/cpu_hotplug/functional/cpuhotplug04.sh
>
> # while :; do cpuhotplug04.sh -l 1; done
>
> The ftrace dump has too much output on this 256-CPU system, so I have not had
> the patient to wait for it to finish after 15-min. But here is the log
> capturing so far (search for "kernel BUG" there).
>
> http://people.redhat.com/qcai/console.log
>

From there I see:

[20798.166987][ T650] CPU127 nr_running=2
[20798.171185][ T650] p=migration/127
[20798.175161][ T650] p=kworker/127:1

so this might be another workqueue hurdle. This should be prevented by:

  06249738a41a ("workqueue: Manually break affinity on hotplug")

In any case, I'll give this a try on a TX2 next week and see where it gets me.

Note that much earlier in your log, you have a softlockup on CPU127:

[ 74.278367][ C127] watchdog: BUG: soft lockup - CPU#127 stuck for 23s! [swapper/0:1]
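
For anyone digging through a console log of that size, the debug prints quoted above (the "CPU<N> nr_running=<M>" line followed by "p=<comm>" lines) can be pulled out with a few lines of script. This is only a rough sketch: it assumes the prints keep exactly the format shown in the excerpt, and the helper name queued_tasks is made up for illustration, it is not part of any existing tool.

```python
import re

# Assumed formats, copied from the debug prints in the log excerpt:
#   [20798.166987][ T650] CPU127 nr_running=2
#   [20798.171185][ T650] p=migration/127
CPU_RE = re.compile(r"\]\s*CPU(\d+) nr_running=(\d+)")
TASK_RE = re.compile(r"\]\s*p=(\S+)\s*$")


def queued_tasks(lines):
    """Map each reported CPU to (nr_running, [task names]).

    Hypothetical helper: "p=" lines are attributed to the most recent
    "CPU<N> nr_running=<M>" line seen before them.
    """
    result = {}
    current = None
    for line in lines:
        m = CPU_RE.search(line)
        if m:
            current = int(m.group(1))
            result[current] = (int(m.group(2)), [])
            continue
        m = TASK_RE.search(line)
        if m and current is not None:
            result[current][1].append(m.group(1))
    return result
```

Passing an open file object for the saved console log as `lines` should work, since the function only iterates over lines and tolerates trailing newlines.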