On Sun, Oct 8, 2017 at 5:02 PM, Boqun Feng <boqun.f...@gmail.com> wrote: > Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by > lockdep: > > | [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected > | [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted > | [ 1270.473240] ----------------------------------------------------- > | [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to > acquire: > | [ 1270.474239] (&(&lock->wait_lock)->rlock){+.+.}, at: > [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280 > | [ 1270.474994] > | [ 1270.474994] and this task is already holding: > | [ 1270.475440] (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] > worker_thread+0x366/0x3c0 > | [ 1270.476046] which would create a new lock dependency: > | [ 1270.476436] (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.} > | [ 1270.476949] > | [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock: > | [ 1270.477553] (&pool->lock/1){-.-.} > ... > | [ 1270.488900] to a HARDIRQ-irq-unsafe lock: > | [ 1270.489327] (&(&lock->wait_lock)->rlock){+.+.} > ... > | [ 1270.494735] Possible interrupt unsafe locking scenario: > | [ 1270.494735] > | [ 1270.495250] CPU0 CPU1 > | [ 1270.495600] ---- ---- > | [ 1270.495947] lock(&(&lock->wait_lock)->rlock); > | [ 1270.496295] local_irq_disable(); > | [ 1270.496753] lock(&pool->lock/1); > | [ 1270.497205] > lock(&(&lock->wait_lock)->rlock); > | [ 1270.497744] <Interrupt> > | [ 1270.497948] lock(&pool->lock/1); > > , which will cause a irq inversion deadlock if the above lock scenario > happens. > > The root cause of this safe -> unsafe lock order is the > mutex_unlock(pool::manager_arb) in manage_workers() with pool::lock > held.
I didn't thought this kind of pattern is very seldom. I remember I saw several. mutex_lock(); do_something(); spin_lock_irq(); record_the_state_for_ do_something(). // keep the spin lock held to hold the state for do_more_things(). mutex_unlock(); // unlock() is suggested to be called when just exiting C.S. do_more_things(); spin_unlock_irq(); Was all code of this pattern removed? Could it be possible that mutex will be changed to allow this? (If the mutex can't be changed...) And I think the little more proper fix is to move the 'mutex_unlock();' down. In the case for manager_arb, 'mutex_unlock();' can be called in process_one_work() and before the worker sleeps. However, a variable might be needed to indicate whether it should be called. It doesn't means I don't like this fix. 'spin_unlock_irq(&pool->lock);' before 'mutex_unlock();' makes the things more complicated. More time is needed for understanding it. And I leave one concern about the worker_leave_idle() below. > An obvious fix is dropping the pool::lock before mutex_unlock() > and re-grabing afterwards, which however will introduce a race condition > between worker_thread() and put_unbound_pool(): > > put_unbound_pool() will grab both pool::manager_arb and pool::lock to > set all current IDLE workers to DIE, and may wait on the > pool::detach_completion for the last worker to detach from the pool. > > And when manage_workers() is called, the caller worker_thread is in > non-ILDE state, so if the worker dropped both pool::{manager_arb, lock} > and got delayed for a while long enough for a put_unbound_pool(), the > put_unbound_pool() would not switch that worker to DIE. As a result, the > worker will not detach from the pool as it's not DIE and the > put_unbound_pool() will not proceed as it's waiting for the last worker > to detach, therefore deadlock. > > To overcome this, put the worker back to IDLE state before it drops > pool::lock in manage_workers(), and make the worker check again whether > it's DIE after it re-grabs the pool::lock. In this way, we fix the > potential deadlock reported by lockdep without introducing another. > > Reported-by: Josef Bacik <jo...@toxicpanda.com> > Signed-off-by: Boqun Feng <boqun.f...@gmail.com> > Cc: Peter Zijlstra <pet...@infradead.org> > --- > kernel/workqueue.c | 35 ++++++++++++++++++++++++++++++++++- > 1 file changed, 34 insertions(+), 1 deletion(-) > > diff --git a/kernel/workqueue.c b/kernel/workqueue.c > index 64d0edf428f8..2ea7b04cc48b 100644 > --- a/kernel/workqueue.c > +++ b/kernel/workqueue.c > @@ -1997,7 +1997,40 @@ static bool manage_workers(struct worker *worker) > maybe_create_worker(pool); > > pool->manager = NULL; > + > + /* > + * Put the manager back to ->idle_list, this allows us to drop the > + * pool->lock safely without racing with put_unbound_pool() > + * > + * <in "manager worker" > thread> > + * worker_thread(): > + * > spin_lock_irq(&pool->lock); > + * worker_leave_idle(); > + * manage_workers(): > // return true > + * > mutex_trylock(&pool->manager_arb); > + * <without entering > idle here> > + * > spin_unlock_irq(&pool->lock); > + * > mutex_unlock(&pool->manager_arb); > + * > + * put_unbound_pool(): > + * mutex_lock(&pool->manager_arb); > + * spin_lock_irq(&pool->lock); > + * <set ILDE worker to DIE> > + * <the manager worker is not set to be DIE, because it's not > IDLE> > + * ... > + * wait_for_completion(&pool->detach_completion); > + * <no one will complete() because pool->workers is not empty> > + * > + * > spin_lock_irq(&pool->lock); > + * <pool->worklist is > empty, go to sleep> > + * > + * No one is going to wake up the manager worker, even so, it won't > + * complete(->detach_completion), since it's not a DIE worker. > + */ > + worker_enter_idle(worker); > + spin_unlock_irq(&pool->lock); > mutex_unlock(&pool->manager_arb); > + spin_lock_irq(&pool->lock); > return true; > } > > @@ -2202,6 +2235,7 @@ static int worker_thread(void *__worker) > woke_up: > spin_lock_irq(&pool->lock); > > +recheck: > /* am I supposed to die? */ > if (unlikely(worker->flags & WORKER_DIE)) { > spin_unlock_irq(&pool->lock); > @@ -2216,7 +2250,6 @@ static int worker_thread(void *__worker) > } > > worker_leave_idle(worker); I think worker_leave_idle() might be called multiple times, which might cause bugs, since recheck is moved up. > -recheck: > /* no more worker necessary? */ > if (!need_more_worker(pool)) > goto sleep; > -- > 2.14.1 >