On Mon, Oct 9, 2017 at 9:21 PM, Tejun Heo <t...@kernel.org> wrote:
> Josef reported a HARDIRQ-safe -> HARDIRQ-unsafe lock order detected by
> lockdep:
>
> [ 1270.472259] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected
> [ 1270.472783] 4.14.0-rc1-xfstests-12888-g76833e8 #110 Not tainted
> [ 1270.473240] -----------------------------------------------------
> [ 1270.473710] kworker/u5:2/5157 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> [ 1270.474239] (&(&lock->wait_lock)->rlock){+.+.}, at: [<ffffffff8da253d2>] __mutex_unlock_slowpath+0xa2/0x280
> [ 1270.474994]
> [ 1270.474994] and this task is already holding:
> [ 1270.475440] (&pool->lock/1){-.-.}, at: [<ffffffff8d2992f6>] worker_thread+0x366/0x3c0
> [ 1270.476046] which would create a new lock dependency:
> [ 1270.476436] (&pool->lock/1){-.-.} -> (&(&lock->wait_lock)->rlock){+.+.}
> [ 1270.476949]
> [ 1270.476949] but this new dependency connects a HARDIRQ-irq-safe lock:
> [ 1270.477553] (&pool->lock/1){-.-.}
> ...
> [ 1270.488900] to a HARDIRQ-irq-unsafe lock:
> [ 1270.489327] (&(&lock->wait_lock)->rlock){+.+.}
> ...
> [ 1270.494735] Possible interrupt unsafe locking scenario:
> [ 1270.494735]
> [ 1270.495250]        CPU0                    CPU1
> [ 1270.495600]        ----                    ----
> [ 1270.495947]   lock(&(&lock->wait_lock)->rlock);
> [ 1270.496295]                                local_irq_disable();
> [ 1270.496753]                                lock(&pool->lock/1);
> [ 1270.497205]                                lock(&(&lock->wait_lock)->rlock);
> [ 1270.497744]   <Interrupt>
> [ 1270.497948]     lock(&pool->lock/1);
>
> , which will cause an irq inversion deadlock if the above lock scenario
> happens.
>
> The root cause of this safe -> unsafe lock order is the
> mutex_unlock(pool->manager_arb) in manage_workers() with pool->lock
> held.
>
> Unlocking a mutex while holding an irq spinlock was never safe and this
> problem has been around forever, but it never got noticed because the
> mutex is usually only trylocked while holding the irq lock, making
> actual failures very unlikely, and the lockdep annotation missed the
> condition until the recent b9c16a0e1f73 ("locking/mutex: Fix
> lockdep_assert_held() fail").
>
> Using a mutex for pool->manager_arb has always been a bit of a stretch.
> It primarily is a mechanism to arbitrate managership between workers,
> which can easily be done with a pool flag.  The only reason it became
> a mutex is that the pool destruction path wants to exclude parallel
> managing operations.
>
> This patch replaces the mutex with a new pool flag POOL_MANAGER_ACTIVE
> and makes the destruction path wait for the current manager on a wait
> queue.
>
> Signed-off-by: Tejun Heo <t...@kernel.org>
> Reported-by: Josef Bacik <jo...@toxicpanda.com>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Boqun Feng <boqun.f...@gmail.com>
> Cc: sta...@vger.kernel.org
> ---
> Hello,
>
> Boqun, thanks for the patch and explanation; I shamelessly lifted
> parts of your patch description.  I took an alternative approach
> because there's no reason to make this any more complex when the issue
> is essentially caused by abusing the mutex.
>
> The patch seems to work fine but, Lai, can you please review the
> patch?
>
> Thanks.
>
>  kernel/workqueue.c | 39 ++++++++++++++++++---------------------
>  1 file changed, 18 insertions(+), 21 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 64d0edf..8739b6de 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -68,6 +68,7 @@ enum {
>          * attach_mutex to avoid changing binding state while
>          * worker_attach_to_pool() is in progress.
>          */
> +       POOL_MANAGER_ACTIVE     = 1 << 0,       /* being managed */
>         POOL_DISASSOCIATED      = 1 << 2,       /* cpu can't serve workers */
>
>         /* worker flags */
> @@ -165,7 +166,6 @@ struct worker_pool {
>                                                 /* L: hash of busy workers */
>
>         /* see manage_workers() for details on the two manager mutexes */
> -       struct mutex            manager_arb;    /* manager arbitration */
>         struct worker           *manager;       /* L: purely informational */
>         struct mutex            attach_mutex;   /* attach/detach exclusion */
>         struct list_head        workers;        /* A: attached workers */
> @@ -299,6 +299,7 @@ static struct workqueue_attrs *wq_update_unbound_numa_attrs_buf;
>
> static DEFINE_MUTEX(wq_pool_mutex);     /* protects pools and workqueues list */
> static DEFINE_SPINLOCK(wq_mayday_lock); /* protects wq->maydays list */
> +static DECLARE_WAIT_QUEUE_HEAD(wq_manager_wait); /* wait for manager to go away */
>
> static LIST_HEAD(workqueues);           /* PR: list of all workqueues */
> static bool workqueue_freezing;         /* PL: have wqs started freezing? */
> @@ -801,7 +802,7 @@ static bool need_to_create_worker(struct worker_pool *pool)
>  /* Do we have too many workers and should some go away? */
>  static bool too_many_workers(struct worker_pool *pool)
>  {
> -       bool managing = mutex_is_locked(&pool->manager_arb);
> +       bool managing = pool->flags & POOL_MANAGER_ACTIVE;
>         int nr_idle = pool->nr_idle + managing; /* manager is considered idle */
>         int nr_busy = pool->nr_workers - nr_idle;
>
> @@ -1980,24 +1981,17 @@ static bool manage_workers(struct worker *worker)
>  {
>         struct worker_pool *pool = worker->pool;
>
> -       /*
> -        * Anyone who successfully grabs manager_arb wins the arbitration
> -        * and becomes the manager.  mutex_trylock() on pool->manager_arb
> -        * failure while holding pool->lock reliably indicates that someone
> -        * else is managing the pool and the worker which failed trylock
> -        * can proceed to executing work items.  This means that anyone
> -        * grabbing manager_arb is responsible for actually performing
> -        * manager duties.  If manager_arb is grabbed and released without
> -        * actual management, the pool may stall indefinitely.
> -        */
> -       if (!mutex_trylock(&pool->manager_arb))
> +       if (pool->flags & POOL_MANAGER_ACTIVE)
>                 return false;
> +
> +       pool->flags |= POOL_MANAGER_ACTIVE;
>         pool->manager = worker;
Hello, tj

I don't find anything wrong with this patch so far. It looks good,
though I haven't tested it. I was also thinking about alternative code
while reviewing.

The first idea is quite obvious: testing POOL_MANAGER_ACTIVE can be
replaced by testing pool->manager, so POOL_MANAGER_ACTIVE isn't
needed, is it?

The second is how to make manage_workers() and put_unbound_pool()
mutually exclusive. Waiting for an event on POOL_MANAGER_ACTIVE (or
pool->manager) is one way. However, the pool's refcnt cannot drop to
zero at this point, since the caller still holds pool->lock and the
works on the worklist still hold their pwqs. So the other way to
enforce the exclusion could be to do get_pwq() on the first pwq of the
worklist and put_pwq() when manage_workers() is done; then the code
handling pool->manager_arb in put_unbound_pool() could simply be
removed.

I just wanted to send my thoughts early, and I haven't thought about
it deeply enough yet. What do you think?

thx
Lai

>
>         maybe_create_worker(pool);
>
>         pool->manager = NULL;
> -       mutex_unlock(&pool->manager_arb);
> +       pool->flags &= ~POOL_MANAGER_ACTIVE;
> +       wake_up(&wq_manager_wait);
>         return true;
>  }
>
> @@ -3248,7 +3242,6 @@ static int init_worker_pool(struct worker_pool *pool)
>         setup_timer(&pool->mayday_timer, pool_mayday_timeout,
>                     (unsigned long)pool);
>
> -       mutex_init(&pool->manager_arb);
>         mutex_init(&pool->attach_mutex);
>         INIT_LIST_HEAD(&pool->workers);
>
> @@ -3318,13 +3311,14 @@ static void put_unbound_pool(struct worker_pool *pool)
>         hash_del(&pool->hash_node);
>
>         /*
> -        * Become the manager and destroy all workers.  Grabbing
> -        * manager_arb prevents @pool's workers from blocking on
> -        * attach_mutex.
> +        * Become the manager and destroy all workers.  Becoming manager
> +        * prevents @pool's workers from blocking on attach_mutex.
>          */
> -       mutex_lock(&pool->manager_arb);
> -
>         spin_lock_irq(&pool->lock);
> +       wait_event_lock_irq(wq_manager_wait,
> +                           !(pool->flags & POOL_MANAGER_ACTIVE), pool->lock);
> +       pool->flags |= POOL_MANAGER_ACTIVE;
> +
>         while ((worker = first_idle_worker(pool)))
>                 destroy_worker(worker);
>         WARN_ON(pool->nr_workers || pool->nr_idle);
> @@ -3338,7 +3332,10 @@ static void put_unbound_pool(struct worker_pool *pool)
>         if (pool->detach_completion)
>                 wait_for_completion(pool->detach_completion);
>
> -       mutex_unlock(&pool->manager_arb);
> +       spin_lock_irq(&pool->lock);
> +       pool->flags &= ~POOL_MANAGER_ACTIVE;
> +       wake_up(&wq_manager_wait);
> +       spin_unlock_irq(&pool->lock);
>
>        /* shut down the timers */
>        del_timer_sync(&pool->idle_timer);
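To make the arbitration scheme concrete, here is a minimal userspace sketch
of the same pattern the patch introduces: a flag tested and set only under
the lock replaces the mutex, and the destruction path sleeps until the flag
clears. A pthread mutex stands in for pool->lock and a condition variable
for wq_manager_wait; struct pool, manage(), and destroy() are hypothetical
names for this analogy, not kernel code.

/*
 * Userspace analogy of the POOL_MANAGER_ACTIVE arbitration.
 * Build with: cc -Wall sketch.c -lpthread
 */
#include <pthread.h>
#include <stdbool.h>

struct pool {
	pthread_mutex_t lock;		/* plays the role of pool->lock */
	pthread_cond_t manager_wait;	/* plays the role of wq_manager_wait */
	bool manager_active;		/* plays the role of POOL_MANAGER_ACTIVE */
};

/* Mirrors manage_workers(): must be called with pool->lock held. */
static bool manage(struct pool *pool)
{
	if (pool->manager_active)
		return false;		/* someone else is already managing */
	pool->manager_active = true;

	/* ... manager duties, i.e. the maybe_create_worker() step ... */

	pool->manager_active = false;
	pthread_cond_broadcast(&pool->manager_wait);
	return true;
}

/* Mirrors the destruction path in put_unbound_pool(). */
static void destroy(struct pool *pool)
{
	pthread_mutex_lock(&pool->lock);
	/* wait_event_lock_irq() analogue: sleep until no manager is active */
	while (pool->manager_active)
		pthread_cond_wait(&pool->manager_wait, &pool->lock);
	pool->manager_active = true;	/* become the manager ourselves */

	/* ... destroy all idle workers here ... */

	pool->manager_active = false;
	pthread_cond_broadcast(&pool->manager_wait);
	pthread_mutex_unlock(&pool->lock);
}

int main(void)
{
	struct pool p = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.manager_wait = PTHREAD_COND_INITIALIZER,
		.manager_active = false,
	};

	pthread_mutex_lock(&p.lock);	/* manage() expects the lock held */
	manage(&p);
	pthread_mutex_unlock(&p.lock);

	destroy(&p);
	return 0;
}

The key property mirrors the patch: managership is claimed and released
purely by flipping a flag under the lock that is already held, so no
sleeping lock is ever locked or unlocked while that lock is held, which is
exactly the dependency lockdep complained about.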