On Mon, Mar 04, 2019 at 08:59:52PM +0100, Laurent Vivier wrote:
> This happens because initially powerpc code computes
> sched_domains_numa_masks of offline nodes as if they were merged with
> node 0 (because firmware doesn't provide the distance information for
> memoryless/cpuless nodes):
>
>   node   0   1   2   3
>     0:  10  40  10  10
>     1:  40  10  40  40
>     2:  10  40  10  10
>     3:  10  40  10  10
*groan*... what does it do for things like percpu memory? ISTR the
per-cpu chunks are all allocated early too. Having them all use memory
out of node-0 would seem sub-optimal.

> We should have:
>
>   node   0   1   2   3
>     0:  10  40  40  40
>     1:  40  10  40  40
>     2:  40  40  10  40
>     3:  40  40  40  10

Can it happen that it introduces a new distance in the table? One that
hasn't been seen before? This example only has 10 and 40, but suppose
the new node lands at distance 20 (or 80); can such a thing happen?

If not; why not?

> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 3f35ba1d8fde..24831b86533b 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1622,8 +1622,10 @@ void sched_init_numa(void)
>  				return;
>
>  			sched_domains_numa_masks[i][j] = mask;
> +			if (!node_state(j, N_ONLINE))
> +				continue;
>
> -			for_each_node(k) {
> +			for_each_online_node(k) {
>  				if (node_distance(j, k) >
>  					sched_domains_numa_distance[i])
>  					continue;
>

So you're relying on sched_domain_numa_masks_set/clear() to fix this up,
but that in turn relies on the sched_domain_numa_levels thing to stay
accurate.

This all seems very fragile and unfortunate.