On (01/28/16 17:13), Byungchul Park wrote:
[..]
> > > > int down_trylock(struct semaphore *sem)
> > > > {
> > > >         unsigned long flags;
> > > >         int count;
> > > >
> > > >         raw_spin_lock_irqsave(&sem->lock, flags);    <<<<<< um...
> > >
> > > I also think it's hard, but a backtrace said the lockup happened here.
> >
> > what was the state of `struct semaphore *sem' and especially of `sem->lock'?
> > what was the lock->owner_cpu doing? (addr2line of its pc register, for
> > example).
>
> Unfortunately, it's not reproduced anymore.
>
> If it's clearly a spinlock caller's bug as you said, modifying the
> spinlock debug code does not help it at all. But I found there's a
> possibility in the debug code *itself* to cause a lockup. So I tried
> to fix it. What do you think about it?

ah... silly me... you mean the first CPU that triggers the spin_dump() will
deadlock itself, so the rest of CPUs will see endless recursive
spin_lock()->spin_dump()->spin_lock()->spin_dump() calls?

like the one below?

CPUZ is doing vprintk_emit()->spin_lock(), CPUA is the spin_lock's owner

CPUZ -> vprintk_emit()
    __spin_lock_debug()
        for (i = 0; i < `loops_per_jiffy * HZ'; i++) {    << wait for the lock
            if (arch_spin_trylock())
                return;
            __delay(1);
        }

        spin_dump()                                       << lock is still owned by CPUA
        { -> vprintk_emit()
            __spin_lock_debug()
                for (...) {
                    if (arch_spin_trylock())
                        return;
                    __delay(1);
                }
                                                          << CPUA unlocked the lock
                spin_dump()
                { -> vprintk_emit()
                    __spin_lock_debug()
                        for (...) {
                            if (arch_spin_trylock())      << success!!
                                /* CPUZ now owns the lock */
                                return;
                        }
                }

                << we return here with the spin_lock being owned by this CPUZ

                trigger_all_cpu_backtrace()

                << and... now it does the arch_spin_lock()
                /*
                 * The trylock above was causing a livelock. Give the lower level arch
                 * specific lock code a chance to acquire the lock. We have already
                 * printed a warning/backtrace at this point. The non-debug arch
                 * specific code might actually succeed in acquiring the lock. If it is
                 * not successful, the end-result is the same - there is no forward
                 * progress.
                 */
                arch_spin_lock(&lock->raw_lock);

                << which obviously deadlocks this CPU...
        }

        trigger_all_cpu_backtrace()

        arch_spin_lock()


so
 "the CPUZ is now keeping the lock forever, and not going to release it"
and
 "CPUA-CPUX will do vprintk_emit()->spin_lock()->spin_dump()->vprintk_emit()->..."


My apologies for not getting it right the first time. Sorry!

Can you please update your bug description in the commit message?
It's the deadlock that is causing the recursion on other CPUs in the
first place.

    -ss