On Wed, 3 Feb 2016, Tejun Heo wrote:

> When looking up the pool_workqueue to use for an unbound workqueue,
> workqueue assumes that the target CPU is always bound to a valid NUMA
> node. However, currently, when a CPU goes offline, the mapping is
> destroyed and cpu_to_node() returns NUMA_NO_NODE. This has always
> been broken but hasn't triggered until recently.
>
> After 874bbfe600a6 ("workqueue: make sure delayed work run in local
> cpu"), workqueue forcifully assigns the local CPU for delayed work
> items without explicit target CPU to fix a different issue. This
> widens the window where CPU can go offline while a delayed work item
> is pending causing delayed work items dispatched with target CPU set
> to an already offlined CPU. The resulting NUMA_NO_NODE mapping makes
> workqueue try to queue the work item on a NULL pool_workqueue and thus
> crash.
>
> Fix it by mapping NUMA_NO_NODE to the default pool_workqueue from
> unbound_pwq_by_node(). This is a temporary workaround. The long term
> solution is keeping CPU -> NODE mapping stable across CPU off/online
> cycles which is in the works.
>
> Signed-off-by: Tejun Heo <t...@kernel.org>
> Reported-by: Mike Galbraith <umgwanakikb...@gmail.com>
> Cc: Tang Chen <tangc...@cn.fujitsu.com>
> Cc: Rafael J. Wysocki <raf...@kernel.org>
> Cc: Len Brown <len.br...@intel.com>
> Cc: sta...@vger.kernel.org # v4.3+

4.3+ ? Hasn't 874bbfe600a6 been backported to older stable kernels?

Adding a 'Fixes: 874bbfe600a6 ...' tag is what you really want here.

> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 61a0264..f748eab 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -570,6 +570,16 @@ static struct pool_workqueue *unbound_pwq_by_node(struct workqueue_struct *wq,
>  						  int node)
>  {
>  	assert_rcu_or_wq_mutex_or_pool_mutex(wq);
> +
> +	/*
> +	 * XXX: @node can be NUMA_NO_NODE if CPU goes offline while a
> +	 * delayed item is pending. The plan is to keep CPU -> NODE
> +	 * mapping valid and stable across CPU on/offlines. Once that
> +	 * happens, this workaround can be removed.

So what happens if the complete node is offline?

> +	 */
> +	if (unlikely(node == NUMA_NO_NODE))
> +		return wq->dfl_pwq;
> +

Thanks,

	tglx
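
For anyone following the thread without the kernel tree at hand, the fallback in the quoted hunk can be modelled in userspace roughly as below. This is only a sketch under stated assumptions: the two structures are cut down to the fields the lookup touches, MAX_NODES, the id field and pwq_lookup_selftest() are made up for illustration, and the RCU dereference and locking assertion of the real function are left out. It does not address the offline-node question raised above.

```c
#include <assert.h>
#include <stddef.h>

#define NUMA_NO_NODE	(-1)	/* value used by the kernel for "no node" */
#define MAX_NODES	2	/* hypothetical: tiny fixed node count */

/* Hypothetical stand-ins for the kernel structures, reduced to the
 * fields this lookup needs. */
struct pool_workqueue {
	int id;			/* illustration only */
};

struct workqueue_struct {
	struct pool_workqueue *numa_pwq_tbl[MAX_NODES];	/* per-node pwqs */
	struct pool_workqueue *dfl_pwq;			/* default pwq */
};

/* Userspace model of the patched lookup: no RCU, no locking checks. */
static struct pool_workqueue *
unbound_pwq_by_node(struct workqueue_struct *wq, int node)
{
	/* The workaround: a CPU that went offline maps to NUMA_NO_NODE,
	 * so hand back the default pwq instead of indexing the table. */
	if (node == NUMA_NO_NODE)
		return wq->dfl_pwq;

	return wq->numa_pwq_tbl[node];
}

/* Hypothetical self-check; returns 0 when both lookups behave as
 * described in the commit message. */
static int pwq_lookup_selftest(void)
{
	struct pool_workqueue node0 = { 0 }, node1 = { 1 }, dfl = { 99 };
	struct workqueue_struct wq = {
		.numa_pwq_tbl = { &node0, &node1 },
		.dfl_pwq = &dfl,
	};

	assert(unbound_pwq_by_node(&wq, 1) == &node1);
	assert(unbound_pwq_by_node(&wq, NUMA_NO_NODE) == &dfl);
	return 0;
}
```

Without the NUMA_NO_NODE check, the second lookup would index the table with -1, which is the out-of-bounds access the patch is papering over.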