Daniel Henrique Barboza <danielhb...@gmail.com> writes:

> On 6/17/21 1:51 PM, Aneesh Kumar K.V wrote:
>> PAPR interface currently supports two different ways of communicating resource
>> grouping details to the OS. These are referred to as Form 0 and Form 1
>> associativity grouping. Form 0 is the older format and is now considered
>> deprecated. This patch adds another resource grouping named FORM2.
>>
>> Signed-off-by: Daniel Henrique Barboza <danielhb...@gmail.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.ku...@linux.ibm.com>
>> ---
>>  Documentation/powerpc/associativity.rst   | 135 ++++++++++++++++++++
>>  arch/powerpc/include/asm/firmware.h       |   3 +-
>>  arch/powerpc/include/asm/prom.h           |   1 +
>>  arch/powerpc/kernel/prom_init.c           |   3 +-
>>  arch/powerpc/mm/numa.c                    | 149 +++++++++++++++++++++-
>>  arch/powerpc/platforms/pseries/firmware.c |   1 +
>>  6 files changed, 286 insertions(+), 6 deletions(-)
>>  create mode 100644 Documentation/powerpc/associativity.rst
>>
>> diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
>> new file mode 100644
>> index 000000000000..93be604ac54d
>> --- /dev/null
>> +++ b/Documentation/powerpc/associativity.rst
>> @@ -0,0 +1,135 @@
>> +============================
>> +NUMA resource associativity
>> +============================
>> +
>> +Associativity represents the groupings of the various platform resources into
>> +domains of substantially similar mean performance relative to resources outside
>> +of that domain. Resource subsets of a given domain that exhibit better
>> +performance relative to each other than relative to other resource subsets
>> +are represented as being members of a sub-grouping domain. This performance
>> +characteristic is presented in terms of NUMA node distance within the Linux kernel.
>> +From the platform view, these groups are also referred to as domains.
>> +
>> +PAPR interface currently supports different ways of communicating these resource
>> +grouping details to the OS. These are referred to as Form 0, Form 1 and Form 2
>> +associativity grouping. Form 0 is the older format and is now considered deprecated.
>> +
>> +Hypervisor indicates the type/form of associativity used via the "ibm,architecture-vec-5" property.
>> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
>> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
>> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
>> +
>> +Form 0
>> +------
>> +Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
>> +
>> +Form 1
>> +------
>> +With Form 1 a combination of ibm,associativity-reference-points and ibm,associativity
>> +device tree properties are used to determine the NUMA distance between resource groups/domains.
>> +
>> +The “ibm,associativity” property contains one or more lists of numbers (domainID)
>> +representing the resource’s platform grouping domains.
>> +
>> +The “ibm,associativity-reference-points” property contains one or more lists of numbers
>> +(domainID index) that represent the 1 based ordinal in the associativity lists.
>> +The list of domainID indexes represents an increasing hierarchy of resource grouping.
>> +
>> +ex:
>> +{ primary domainID index, secondary domainID index, tertiary domainID index.. }
>> +
>> +Linux kernel uses the domainID at the primary domainID index as the NUMA node id.
>> +Linux kernel computes NUMA distance between two domains by recursively comparing
>> +if they belong to the same higher-level domains. For mismatch at every higher
>> +level of the resource group, the kernel doubles the NUMA distance between the
>> +comparing domains.
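
For reference, the Form 1 doubling rule described above can be sketched roughly as follows. This is only an illustration and not the kernel code: the function name and the sample associativity arrays are made up, and it assumes a starting distance of 10 (the kernel's LOCAL_DISTANCE) and reference points ordered primary first.

```python
LOCAL_DISTANCE = 10  # Linux's value for a node's distance to itself


def form1_distance(assoc_a, assoc_b, reference_points):
    """Sketch of the Form 1 rule: double the distance per mismatching level.

    assoc_a / assoc_b are "ibm,associativity" arrays whose element 0 is the
    entry count, so a 1-based reference point indexes the array directly.
    reference_points is ordered primary first.
    """
    distance = LOCAL_DISTANCE
    for ref in reference_points:
        if assoc_a[ref] == assoc_b[ref]:
            break  # same domain at this level: stop doubling
        distance *= 2
    return distance


# Hypothetical associativity arrays, not taken from the patch.
REFERENCE_POINTS = [4, 3]   # primary index, then secondary index
NODE_5 = [4, 0, 0, 1, 5]
NODE_6 = [4, 0, 0, 1, 6]    # shares the secondary domain (1) with NODE_5
NODE_7 = [4, 0, 0, 2, 7]    # shares neither level with NODE_5
```

With these made-up arrays, NODE_5 to NODE_6 mismatches only at the primary level and NODE_5 to NODE_7 mismatches at both levels, so the second pair ends up twice as far apart as the first.
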
>> +
>> +Form 2
>> +------
>> +Form 2 associativity format adds separate device tree properties representing NUMA node distance
>> +thereby making the node distance computation flexible. Form 2 also allows flexible primary
>> +domain numbering. With NUMA distance computation now detached from the index value of
>> +"ibm,associativity" property, Form 2 allows a large number of primary domain ids at the
>> +same domainID index representing resource groups of different performance/latency characteristics.
>> +
>> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
>> +"ibm,architecture-vec-5" property.
>> +
>> +"ibm,numa-lookup-index-table" property contains one or more lists of numbers representing
>> +the domainIDs present in the system. The offset of the domainID in this property is considered
>> +the domainID index.
>> +
>> +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
>> +N domainID encoded as with encode-int
>> +
>> +For ex:
>> +ibm,numa-lookup-index-table = {4, 0, 8, 250, 252}, domainID index for domainID 8 is 1.
>> +
>> +"ibm,numa-distance-table" property contains one or more lists of numbers representing the NUMA
>> +distance between resource groups/domains present in the system.
>> +
>> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
>> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
>> +
>> +For ex:
>> +ibm,numa-lookup-index-table = {3, 0, 8, 40}
>> +ibm,numa-distance-table     = {9, 10, 20, 80, 20, 10, 160, 80, 160, 10}
>> +
>> +  | 0    8    40
>> +--|-------------
>> +  |
>> +0 | 10   20   80
>> +  |
>> +8 | 20   10   160
>> +  |
>> +40| 80   160  10
>> +
>> +
>> +"ibm,associativity" property for resources in node 0, 8 and 40
>> +
>> +{ 3, 6, 7, 0 }
>> +{ 3, 6, 9, 8 }
>> +{ 3, 6, 7, 40}
>> +
>> +With "ibm,associativity-reference-points" { 0x3 }
>
> With this configuration, would the following ibm,associativity arrays
> also be valid?
>
>
> { 3, 0, 0, 0 }
> { 3, 0, 0, 8 }
> { 3, 0, 0, 40}
>

Yes

> If yes, then we need a way to tell that the associativity domain assignments
> are optional, and FORM2 relies solely on finding out the domainID of the
> resource (0, 8 and 40) to retrieve the domainID index, and with this
> index all performance metrics can be retrieved from the numa-* properties
> (numa-distance-table, numa-bandwidth-table ...).
>

Where do you suggest we clarify that? I agree that it is not explicitly
mentioned, but the document does describe how we find the NUMA distance,
with an example.

> Retrieving the resource domainID is done by using
> ibm,associativity-reference-points.
>
> This will allow the platform to implement FORM2 such as:
>
> { 1, 0 }
> { 1, 8 }
> { 1, 40 }
>
> - ref-points: { 0x1 }
>
> If the platform chooses to do so.
>

That is correct.

>
>> +
>> +Each resource (drcIndex) now also supports additional optional device tree properties.
>> +These properties are marked optional because the platform can choose not to export
>> +them and provide the system topology details using the earlier defined device tree
>> +properties alone. The optional device tree properties are used when adding new resources
>> +(DLPAR) and when the platform didn't provide the topology details of the domain which
>> +contains the newly added resource during boot.
>> +
>> +"ibm,numa-lookup-index" property contains a number representing the domainID index to be used
>> +when building the NUMA distance of the NUMA node to which this resource belongs. This can
>> +be looked at as the index at which this new domainID would have appeared in
>> +"ibm,numa-lookup-index-table" if the domain was present during boot. The domainID
>> +of the new resource can be obtained from the existing "ibm,associativity" property. This
>> +can be used to build distance information of a newly onlined NUMA node via DLPAR operation.
>> +The value is a 1 based array index value.
>> +
>> +prop-encoded-array: An integer encoded as with encode-int specifying the domainID index
>> +
>> +"ibm,numa-distance" property contains one or more lists of numbers representing the NUMA distance
>> +from this resource domain to other resources.
>> +
>> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
>> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
>> +
>> +For ex:
>> +ibm,associativity     = { 4, 5, 10, 50}
>> +ibm,numa-lookup-index = { 4 }
>> +ibm,numa-distance     = {8, 160, 255, 80, 10, 160, 255, 80, 10}
>> +
>> +resulting in a new topology as below.
>> +  | 0    8    40   50
>> +--|------------------
>> +  |
>> +0 | 10   20   80   160
>> +  |
>> +8 | 20   10   160  255
>> +  |
>> +40| 80   160  10   80
>> +  |
>> +50| 160  255  80   10
>> +
>
> I see there is no mention of the special PAPR SCM handling. I saw in
> one of your replies of v1:
>
> "Another option is to make sure that numa-distance-value is populated
> such that PMEMB distance indicates it is closer to node0 when compared
> to node1. ie, node_distance[40][0] < node_distance[40][1]. One could
> possibly infer the grouping based on the distance value and not depend
> on ibm,associativity for that purpose."
>
>
> Is that what we're supposed to do with PAPR SCM? I'm not sure how that
> affects NVDIMM support in QEMU with FORM2.
>
>

Yes, that is what we are doing with this version (v4) of the patchset. We
can drop the NVDIMM-specific changes from QEMU.

-aneesh
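
To make the lookup discussed earlier in this thread concrete, here is a rough sketch of how a Form 2 distance could be resolved from the two table properties, using the {3, 0, 8, 40} example from the document. It is an illustration only: the function names are hypothetical, and it assumes the distance values are laid out row by row in the order the domainIDs appear in "ibm,numa-lookup-index-table". It also shows why { 3, 6, 7, 40 }, { 3, 0, 0, 40 } and { 1, 40 } are equivalent for this purpose: only the entry at the reference point is read.

```python
def primary_domain_id(associativity, ref_point):
    """Return the domainID found at a 1-based reference point.

    associativity[0] holds the entry count, so the 1-based ordinal indexes
    the array directly.
    """
    count = associativity[0]
    if not 1 <= ref_point <= count:
        raise ValueError("reference point outside the associativity list")
    return associativity[ref_point]


def make_distance_lookup(lookup_index_table, distance_table):
    """Build distance(a, b) from the two Form 2 table properties.

    Assumes (not stated in the patch) that the distance values are stored
    row by row, in the order the domainIDs appear in the lookup index table.
    """
    n = lookup_index_table[0]
    domain_ids = lookup_index_table[1:1 + n]
    values = distance_table[1:1 + distance_table[0]]
    if len(domain_ids) != n or len(values) != n * n:
        raise ValueError("tables do not describe an N x N matrix")
    # The offset of a domainID in the lookup table is its domainID index.
    index_of = {domain: i for i, domain in enumerate(domain_ids)}

    def distance(a, b):
        return values[index_of[a] * n + index_of[b]]

    return distance


# Values from the example in the quoted document.
distance = make_distance_lookup([3, 0, 8, 40],
                                [9, 10, 20, 80, 20, 10, 160, 80, 160, 10])
```

A caller would first resolve each resource's domainID with primary_domain_id() and then pass the two domainIDs to distance(); the other entries of "ibm,associativity" never enter the computation.
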