On Thu, Jan 28, 2016 at 03:05:30PM +0900, Sergey Senozhatsky wrote: > On (01/28/16 13:36), byungchul.park wrote: > [..] > > > the thing is, it's really-really hard to lockup in console_trylock()... > > > > > > int down_trylock(struct semaphore *sem) > > > { > > > unsigned long flags; > > > int count; > > > > > > raw_spin_lock_irqsave(&sem->lock, flags); <<<<<< um... > > > > I also think it's hard, but a backtrace said the lockup happened here. > > what was the state of `struct semaphore *sem' and especially of `sem->lock'? > what was the lock->owner_cpu doing? (addr2line of its pc registe, for > example).
Unfortunately, it's not reproduced anymore. If it's clearly a spinlock caller's bug as you said, modifying the spinlock debug code does not help it at all. But I found there's a possiblity in the debug code *itself* to cause a lockup. So I tried to fix it. What do you think about it? > > > > count = sem->count - 1; > > > if (likely(count >= 0)) > > > sem->count = count; > > > raw_spin_unlock_irqrestore(&sem->lock, flags); > > > > > > return (count < 0); > > > } > > > > > > was not able to dereference sem->count? `*sem' was corrupted? or because > > > sem->lock was corrupted? in any of those cases you solve the problem from > > > the wrong side. if you have a memory corruption then it can corrupt > > > > What I solved here is to make it possible to print what the problem is, by > > the spinlock debug code instead of system lockup while printing a debug > > message. I think it's not wrong. > > none of the CPUs will print anything anymore. it's done. You are right if it's a real lockup situation. But it will print proper debug messages if it's just a suspect case, which works with this patch. I will also try to reproduce it and check if there's a bug on use of spinlock. In this case, we should fix the root cause. But since the possiblity I mentioned can *also* cause the lockup problem, I think it must be fixed at first. > > > your CPUa - CPUx are spinning in down_trylock() > > vprintk_emit() > down_trylock() > raw_spin_lock_irqsave() << spin here > > and they are spinnig because CPUz is keeping the sem->lock and is > _not_ going to release it. and this is the root cause, not spin_dump(). If it's not going to release it then it's a problem. But IMHO, arch_spin_trylock() in __spin_lock_debug() can fail even though the owner of the spinlock releases it properly, if there's heavy use on printk() at the moment. Is there something I missed here? If what I mention can happen, then it's not a spinlock user's problem. It's just a debug code's problem. > > > CPUz was expected to do a quick thing in down_trylock() > > raw_spin_lock_irqsave(&sem->lock, flags); > count = sem->count - 1; > if (likely(count >= 0)) > sem->count = count; > raw_spin_unlock_irqrestore(&sem->lock, flags); This may be done quickly, but if the use of printk() on the system is heavy? > > > or down()/down_common() > > > raw_spin_lock_irqsave(&sem->lock, flags); > if (likely(sem->count > 0)) > sem->count--; > else > down_common() > { > ... > for (;;) { > if (signal_pending_state(state, task)) > goto interrupted; > if (unlikely(timeout <= 0)) > goto timed_out; > __set_task_state(task, state); > raw_spin_unlock_irq(&sem->lock); > timeout = schedule_timeout(timeout); > raw_spin_lock_irq(&sem->lock); > if (waiter.up) > return 0; > } > ... > } > raw_spin_unlock_irqrestore(&sem->lock, flags); > > > I can't see how it's even possible to keep that spin_lock locked > longer than `loops_per_jiffy * HZ'. No need to keep that spinlock longer than it to cause the problem.. > > and the fact that you saw N recursive > printk()->down_trylock()->spin_lock()->spin_dump()->printk()->... > > sounds like a good prove of the fact the your CPUz was either dead, > or gone crazy, and took the spin_lock with it. but this is not My system esp. qemu might have been crazy because printk() business. > spinlock_debug's business. I think it could be exactly a spinlock debug's business. > > console_flush_on_panic() _probably_ would help in this particular > case -- it does not care about console_sem state and goes to > console_unlock() directly -- but it still locks the logbuf_lock. > so if CPUz died with logbuf_lock being locked, then nothing will > save your day. > > > do you have any reproducer or you've seen it once? Just once. thanks, byungchul > > -ss