On 11/16/2017 11:24 AM, Michael Bringmann wrote: > On powerpc systems which allow 'hot-add' of CPU or memory resources, > it may occur that the new resources are to be inserted into nodes > that were not used for these resources at bootup. In the kernel, > any node that is used must be defined and initialized. These empty > nodes may occur when, > > * Dedicated vs. shared resources. Shared resources require > information such as the VPHN hcall for CPU assignment to nodes. > Associativity decisions made based on dedicated resource rules, > such as associativity properties in the device tree, may vary > from decisions made using the values returned by the VPHN hcall. > * memoryless nodes at boot. Nodes need to be defined as 'possible' > at boot for operation with other code modules. Previously, the > powerpc code would limit the set of possible nodes to those which > have memory assigned at boot, and were thus online. Subsequent > add/remove of CPUs or memory would only work with this subset of > possible nodes. > * memoryless nodes with CPUs at boot. Due to the previous restriction > on nodes, nodes that had CPUs but no memory were being collapsed > into other nodes that did have memory at boot. In practice this > meant that the node assignment presented by the runtime kernel > differed from the affinity and associativity attributes presented > by the device tree or VPHN hcalls. Nodes that might be known to > the pHyp were not 'possible' in the runtime kernel because they did > not have memory at boot. > > This patch ensures that sufficient nodes are defined to support > configuration requirements after boot, as well as at boot. This > patch set fixes a couple of problems. > > * Nodes known to powerpc to be memoryless at boot, but to have > CPUs in them are allowed to be 'possible' and 'online'. Memory > allocations for those nodes are taken from another node that does > have memory until and if memory is hot-added to the node. > * Nodes which have no resources assigned at boot, but which may still > be referenced subsequently by affinity or associativity attributes, > are kept in the list of 'possible' nodes for powerpc. Hot-add of > memory or CPUs to the system can reference these nodes and bring > them online instead of redirecting to one of the set of nodes that > were known to have memory at boot. > > This patch extracts the value of the lowest domain level (number of > allocable resources) from the device tree property > "ibm,max-associativity-domains" to use as the maximum number of nodes > to setup as possibly available in the system. This new setting will > override the instruction, > > nodes_and(node_possible_map, node_possible_map, node_online_map); > > presently seen in the function arch/powerpc/mm/numa.c:initmem_init(). > > If the "ibm,max-associativity-domains" property is not present at boot, > no operation will be performed to define or enable additional nodes, or > enable the above 'nodes_and()'. > > Signed-off-by: Michael Bringmann <m...@linux.vnet.ibm.com> > --- > Changes in V6: > -- Remove some node initialization/allocation from boot setup > to later in runtime to try to limit memory needs early on > -- Augment descriptive documentation for patch > --- > arch/powerpc/mm/numa.c | 40 +++++++++++++++++++++++++++++++++++++--- > 1 file changed, 37 insertions(+), 3 deletions(-) > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > index eb604b3..334a1ff 100644 > --- a/arch/powerpc/mm/numa.c > +++ b/arch/powerpc/mm/numa.c > @@ -892,6 +892,37 @@ static void __init setup_node_data(int nid, u64 > start_pfn, u64 end_pfn) > NODE_DATA(nid)->node_spanned_pages = spanned_pages; > } > > +static void __init find_possible_nodes(void) > +{ > + struct device_node *rtas; > + u32 numnodes, i; > + > + if (min_common_depth <= 0) > + return; > + > + rtas = of_find_node_by_path("/rtas"); > + if (!rtas) > + return; > + > + if (of_property_read_u32_index(rtas, > + "ibm,max-associativity-domains", > + min_common_depth, &numnodes)) > + goto out; > + > + pr_info("numa: Nodes = %d (mcd = %d)\n", numnodes, > + min_common_depth);
numa.c already has a pr_fmt define, no need to pre-pend "numa:" to the information message. -Nathan > + > + for (i = 0; i < numnodes; i++) { > + if (!node_possible(i)) { > + setup_node_data(i, 0, 0); > + node_set(i, node_possible_map); > + } > + } > + > +out: > + of_node_put(rtas); > +} > + > void __init initmem_init(void) > { > int nid, cpu; > @@ -905,12 +936,15 @@ void __init initmem_init(void) > memblock_dump_all(); > > /* > - * Reduce the possible NUMA nodes to the online NUMA nodes, > - * since we do not support node hotplug. This ensures that we > - * lower the maximum NUMA node ID to what is actually present. > + * Modify the set of possible NUMA nodes to reflect information > + * available about the set of online nodes, and the set of nodes > + * that we expect to make use of for this platform's affinity > + * calculations. > */ > nodes_and(node_possible_map, node_possible_map, node_online_map); > > + find_possible_nodes(); > + > for_each_online_node(nid) { > unsigned long start_pfn, end_pfn; >