On Wed, May 20, 2020 at 02:50:56PM +0200, Peter Zijlstra wrote:
> On Tue, May 19, 2020 at 11:58:17PM -0400, Qian Cai wrote:
> > Just a heads up. Repeatedly compiling kernels for a while would trigger
> > endless soft-lockups since next-20200519 on both x86_64 and powerpc.
> > .config are in,
>
> Could be 90b5363acd47 ("sched: Clean up scheduler_ipi()"), although I've
> not seen anything like that myself. Let me go have a look.
>
> In as far as the logs are readable (they're a wrapped mess, please don't
> do that!), they contain very little useful, as is typical with IPIs :/
>
> > [ 1167.993773][   C1] WARNING: CPU: 1 PID: 0 at kernel/smp.c:127
> > flush_smp_call_function_queue+0x1fa/0x2e0
So I've tried to think of a race that could produce that and here is
the only thing I could come up with. It's a bit complicated,
unfortunately:

    CPU 0                                      CPU 1
    -----                                      -----

    tick {
        trigger_load_balance() {
            raise_softirq(SCHED_SOFTIRQ);
            //but nohz_flags(0) = 0
        }
                                               kick_ilb() {
                                                   atomic_fetch_or(...., nohz_flags(0))
        softirq() {                                #VMEXIT or anything that could stop a CPU for a while
            run_rebalance_domains() {
                nohz_idle_balance() {
                    atomic_andnot(NOHZ_KICK_MASK, nohz_flags(0))
                }
            }
        }
    }

    // schedule
    nohz_newidle_balance() {
        kick_ilb() { // pick current CPU
            atomic_fetch_or(...., nohz_flags(0))
                                                   #VMENTER
            smp_call_function_single_async() {     smp_call_function_single_async() {
                // verified csd->flags != CSD_LOCK     // verified csd->flags != CSD_LOCK
                csd->flags = CSD_LOCK                  csd->flags = CSD_LOCK
                //execute in place                     //queue and send IPI
                csd->flags = 0
                nohz_csd_func()
            }                                      }
        }
    }

    IPI {
        flush_smp_call_function_queue() {
            csd_unlock() {
                WARN_ON(csd->flags != CSD_LOCK) <---------!!!!!

The root cause here would be that trigger_load_balance() unconditionally
raises the softirq. And I have to confess I'm not clear why, since the
softirq is essentially a no-op when nohz_flags() is 0.
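To expand on the double lock at the bottom: smp_call_function_single_async()
only does a plain check-then-set on csd->flags and leaves it to the callers
to serialize on a given csd. Roughly this (paraphrased from my reading of
kernel/smp.c, not the verbatim code):

    int smp_call_function_single_async(int cpu, call_single_data_t *csd)
    {
        int err = 0;

        preempt_disable();

        /*
         * Plain, non-atomic check: in the scenario above both
         * kick_ilb() instances target CPU 0's nohz csd, so both
         * can observe the flag clear here and both pass...
         */
        if (csd->flags & CSD_FLAG_LOCK) {
            err = -EBUSY;
            goto out;
        }

        /* ...and then both "lock" the same csd. */
        csd->flags = CSD_FLAG_LOCK;
        smp_wmb();

        err = generic_exec_single(cpu, csd, csd->func, csd->info);
    out:
        preempt_enable();

        return err;
    }

So once the softirq has eaten the NOHZ_KICK_MASK bits that were supposed
to serialize the kicks, nothing else protects CPU 0's nohz csd from being
locked twice.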
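And for reference, the raise in question; trigger_load_balance() looks
more or less like this (again paraphrased, the exact guards in -next may
differ):

    void trigger_load_balance(struct rq *rq)
    {
        /* Don't need to rebalance while attached to NULL domain */
        if (unlikely(on_null_domain(rq)))
            return;

        /*
         * Raised purely on the balance interval, without looking at
         * nohz_flags(), so the softirq (and the atomic_andnot of
         * NOHZ_KICK_MASK in nohz_idle_balance()) can run while a
         * remote kick_ilb() is still in flight.
         */
        if (time_after_eq(jiffies, rq->next_balance))
            raise_softirq(SCHED_SOFTIRQ);

        /* This is the path that leads to kick_ilb() */
        nohz_balancer_kick(rq);
    }

Thanks.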