On (01/28/16 17:13), Byungchul Park wrote:
[..]
> > > > int down_trylock(struct semaphore *sem)
> > > > {
> > > >         unsigned long flags;
> > > >         int count;
> > > > 
> > > >         raw_spin_lock_irqsave(&sem->lock, flags);   <<<<<<  um...
> > > 
> > > I also think it's hard, but a backtrace said the lockup happened here.
> > 
> > what was the state of `struct semaphore *sem' and especially of `sem->lock'?
> > what was the lock->owner_cpu doing? (addr2line of its pc registe, for 
> > example).
> 
> Unfortunately, it's not reproduced anymore.
> 
> If it's clearly a spinlock caller's bug as you said, modifying the
> spinlock debug code does not help it at all. But I found there's a
> possiblity in the debug code *itself* to cause a lockup. So I tried
> to fix it. What do you think about it?

ah... silly me... you mean the first CPU that triggers the spin_dump() will
deadlock itself, so the rest of CPUs will see endless recursive
spin_lock()->spin_dump()->spin_lock()->spin_dump() calls?

like the one below?


CPUZ is doing vprintk_emit()->spin_lock(), CPUA is the spin_lock's owner

CPUZ -> vprintk_emit()
        __spin_lock_debug()
                for (i = 0; i < `loops_per_jiffy * HZ'; i++) {   << wait for 
the lock
                        if (arch_spin_trylock())
                                return;
                        __delay(1);
                        }
                spin_dump()       << lock is still owned by CPUA
                {   -> vprintk_emit()
                        __spin_lock_debug()
                                for (...) {
                                        if (arch_spin_trylock())
                                                return;
                                        __delay(1);
                                }
                << CPUA unlocked the lock
                                spin_dump()
                                {   -> vprintk_emit()
                                        __spin_lock_debug()
                                                for (...) {
                                                        if 
(arch_spin_trylock())   << success!!
                                                        /* CPUZ now owns the 
lock */
                                                                return;
                                                }
                                }

                        << we return here with the spin_lock being owned by 
this CPUZ

                                trigger_all_cpu_backtrace()

                        << and... now it does the arch_spin_lock()
                                /*
                                 * The trylock above was causing a livelock.  
Give the lower level arch
                                 * specific lock code a chance to acquire the 
lock. We have already
                                 * printed a warning/backtrace at this point. 
The non-debug arch
                                 * specific code might actually succeed in 
acquiring the lock.  If it is
                                 * not successful, the end-result is the same - 
there is no forward
                                 * progress.
                                 */
                                arch_spin_lock(&lock->raw_lock);

                        << which obviously dealocks this CPU...
                }

                trigger_all_cpu_backtrace()

                arch_spin_lock()




so
        "the CPUZ is now keeping the lock forever, and not going to release it"
and
        "CPUA-CPUX will do 
vprintk_emit()->spin_lock()->spin_dump()->vprintk_emit()->..."



My apologies for not getting it right the first time. Sorry!

Can you please update your bug description in the commit message?
It's the deadlock that is causing the recursion on other CPUs in the
first place.


        -ss

Reply via email to