On 12/12/2014 06:19 PM, Lai Jiangshan wrote:
> Yasuaki Ishimatsu hit an allocation failure bug when the NUMA mapping
> between CPU and node is changed. This was the last scene:
>  SLUB: Unable to allocate memory on node 2 (gfp=0x80d0)
>   cache: kmalloc-192, object size: 192, buffer size: 192, default order: 1, min order: 0
>   node 0: slabs: 6172, objs: 259224, free: 245741
>   node 1: slabs: 3261, objs: 136962, free: 127656
>
> Yasuaki Ishimatsu investigated that it happened in the following situation:
>
> 1) System Node/CPU before offline/online:
> 	       | CPU
> 	------------------------
> 	node 0 |  0-14, 60-74
> 	node 1 | 15-29, 75-89
> 	node 2 | 30-44, 90-104
> 	node 3 | 45-59, 105-119
>
> 2) A system board (containing node 2 and node 3) is taken offline:
> 	       | CPU
> 	------------------------
> 	node 0 |  0-14, 60-74
> 	node 1 | 15-29, 75-89
>
> 3) A new system board is brought online. Two new node IDs are allocated
>    for the two nodes of the board, but the old CPU IDs are reused for it,
>    so the NUMA mapping between node and CPU has changed (the node of
>    CPU#30 is changed from node#2 to node#4, for example):
> 	       | CPU
> 	------------------------
> 	node 0 |  0-14, 60-74
> 	node 1 | 15-29, 75-89
> 	node 4 | 30-59
> 	node 5 | 90-119
>
> 4) Now the NUMA mapping has changed, but wq_numa_possible_cpumask,
>    the cached NUMA mapping in workqueue.c, is still outdated; thus
>    pool->node calculated by get_unbound_pool() is incorrect.
>
> 5) When create_worker() is called with the incorrect, offlined
>    pool->node, it fails and the pool can't make any progress.
>
> To fix this bug, we need to fix up both wq_numa_possible_cpumask and
> pool->node. The fix is complicated enough that we split it into two
> patches: this patch fixes wq_numa_possible_cpumask and the next one
> fixes pool->node.
>
> To fix wq_numa_possible_cpumask, we only update the cpumasks of
> the orig_node and the new_node of the onlining @cpu.
> We don't touch
> other unrelated nodes, since the wq subsystem hasn't seen the change.
>
> After this fix, the pool->node of new pools is correct,
> and existing wqs' affinity is fixed up by wq_update_unbound_numa()
> after wq_update_numa_mapping().
>
> Reported-by: Yasuaki Ishimatsu <isimatu.yasu...@jp.fujitsu.com>
> Cc: Tejun Heo <t...@kernel.org>
> Cc: Yasuaki Ishimatsu <isimatu.yasu...@jp.fujitsu.com>
> Cc: "Gu, Zheng" <guz.f...@cn.fujitsu.com>
> Cc: tangchen <tangc...@cn.fujitsu.com>
> Cc: Hiroyuki KAMEZAWA <kamezawa.hir...@jp.fujitsu.com>
> Signed-off-by: Lai Jiangshan <la...@cn.fujitsu.com>
> ---
>  kernel/workqueue.c | 42 +++++++++++++++++++++++++++++++++++++++++-
>  1 files changed, 41 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index a6fd2b8..4c88b61 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -266,7 +266,7 @@ struct workqueue_struct {
>  static struct kmem_cache *pwq_cache;
>
>  static cpumask_var_t *wq_numa_possible_cpumask;
> -					/* possible CPUs of each node */
> +					/* PL: possible CPUs of each node */
>
>  static bool wq_disable_numa;
>  module_param_named(disable_numa, wq_disable_numa, bool, 0444);
> @@ -3949,6 +3949,44 @@ out_unlock:
>  	put_pwq_unlocked(old_pwq);
>  }
>
> +static void wq_update_numa_mapping(int cpu)
> +{
> +	int node, orig_node = NUMA_NO_NODE, new_node = cpu_to_node(cpu);
> +
> +	lockdep_assert_held(&wq_pool_mutex);
> +
> +	if (!wq_numa_enabled)
> +		return;
> +
> +	/* the node of an onlining CPU is not NUMA_NO_NODE */
> +	if (WARN_ON(new_node == NUMA_NO_NODE))
> +		return;
> +
> +	/* test whether the NUMA node mapping is changed */
> +	if (cpumask_test_cpu(cpu, wq_numa_possible_cpumask[new_node]))
> +		return;
> +
> +	/* find the original node */
> +	for_each_node(node) {
> +		if (cpumask_test_cpu(cpu, wq_numa_possible_cpumask[node])) {
> +			orig_node = node;
> +			break;
> +		}
> +	}
> +
> +	/* multiple mappings may have changed; re-initialize.
> +	 */
> +	cpumask_clear(wq_numa_possible_cpumask[new_node]);
> +	if (orig_node != NUMA_NO_NODE)
> +		cpumask_clear(wq_numa_possible_cpumask[orig_node]);
> +	for_each_possible_cpu(cpu) {
> +		node = cpu_to_node(node);
Hi, Yasuaki Ishimatsu,

The bug is here. It should be:

	node = cpu_to_node(cpu);

> +		if (node == new_node)
> +			cpumask_set_cpu(cpu, wq_numa_possible_cpumask[new_node]);
> +		else if (orig_node != NUMA_NO_NODE && node == orig_node)
> +			cpumask_set_cpu(cpu, wq_numa_possible_cpumask[orig_node]);
> +	}
> +}
> +
>  static int alloc_and_link_pwqs(struct workqueue_struct *wq)
>  {
>  	bool highpri = wq->flags & WQ_HIGHPRI;
> @@ -4584,6 +4622,8 @@ static int workqueue_cpu_up_callback(struct notifier_block *nfb,
>  		mutex_unlock(&pool->attach_mutex);
>  	}
>
> +	wq_update_numa_mapping(cpu);
> +
>  	/* update NUMA affinity of unbound workqueues */
>  	list_for_each_entry(wq, &workqueues, list)
>  		wq_update_unbound_numa(wq, cpu, true);