Re: [PATCH] acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node
On Fri, Nov 9, 2018 at 3:24 PM Dan Williams wrote:
>
> Persistent memory, as described by the ACPI NFIT (NVDIMM Firmware
> Interface Table), is the first known instance of a memory range
> described by a unique "target" proximity domain. The "initiator" and
> "target" proximity domains are the approach that the ACPI HMAT
> (Heterogeneous Memory Attributes Table) uses to describe the unique
> performance properties of a memory range relative to a given initiator
> (e.g. CPU or DMA device).
>
> Currently the numa-node for a /dev/pmemX block-device or /dev/daxX.Y
> char-device follows the traditional notion of 'numa-node' where the
> attribute conveys the closest online numa-node. That numa-node attribute
> is useful for cpu-binding and memory-binding processes *near* the
> device. However, when the memory range backing a 'pmem', or 'dax' device
> is onlined (memory hot-add) the memory-only-numa-node representing that
> address needs to be differentiated from the set of online nodes. In
> other words, the numa-node association of the device depends on whether
> you can bind processes *near* the cpu-numa-node in the offline
> device-case, or bind processes *on* the memory-range directly after the
> backing address range is onlined.
>
> Allow for the case that platform firmware describes persistent memory
> with a unique proximity domain, i.e. when it is distinct from the
> proximity of DRAM and CPUs that are on the same socket. Plumb the Linux
> numa-node translation of that proximity through the libnvdimm region
> device to namespaces that are in device-dax mode. With this in place the
> proposed kmem driver [1] can optionally discover a unique numa-node
> number for the address range as it transitions the memory from an
> offline state managed by a device-driver to an online memory range
> managed by the core-mm.
>
> [1]: https://lkml.org/lkml/2018/10/23/9
>
> Reported-by: Fan Du

Thanks for coming up with the patch quickly. We reported the problem to
Fan Du, and did a very similar, but preliminary, in-house implementation.

This implementation looks good to me.

Reviewed-by: Yang Shi

> Cc: Michael Ellerman
> Cc: "Oliver O'Halloran"
> Cc: Dave Hansen
> Cc: Jérôme Glisse
> Signed-off-by: Dan Williams
> ---
>  arch/powerpc/platforms/pseries/papr_scm.c |    1 +
>  drivers/acpi/nfit/core.c                  |    8 ++--
>  drivers/acpi/numa.c                       |    1 +
>  drivers/dax/bus.c                         |    4 +++-
>  drivers/dax/bus.h                         |    3 ++-
>  drivers/dax/dax-private.h                 |    4
>  drivers/dax/pmem/core.c                   |    4 +++-
>  drivers/nvdimm/e820.c                     |    1 +
>  drivers/nvdimm/nd.h                       |    2 +-
>  drivers/nvdimm/of_pmem.c                  |    1 +
>  drivers/nvdimm/region_devs.c              |    1 +
>  include/linux/libnvdimm.h                 |    1 +
>  12 files changed, 25 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index ee9372b65ca5..6a0a35b872d1 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -233,6 +233,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>  	memset(&ndr_desc, 0, sizeof(ndr_desc));
>  	ndr_desc.attr_groups = region_attr_groups;
>  	ndr_desc.numa_node = dev_to_node(&p->pdev->dev);
> +	ndr_desc.target_node = ndr_desc.numa_node;
>  	ndr_desc.res = &p->res;
>  	ndr_desc.of_node = p->dn;
>  	ndr_desc.provider_data = p;
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index f8c638f3c946..2225e3de33ac 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -2825,11 +2825,15 @@ static int acpi_nfit_register_region(struct
> acpi_nfit_desc *acpi_desc,
>  	ndr_desc->res = &res;
>  	ndr_desc->provider_data = nfit_spa;
>  	ndr_desc->attr_groups = acpi_nfit_region_attribute_groups;
> -	if (spa->flags & ACPI_NFIT_PROXIMITY_VALID)
> +	if (spa->flags & ACPI_NFIT_PROXIMITY_VALID) {
>  		ndr_desc->numa_node = acpi_map_pxm_to_online_node(
>  						spa->proximity_domain);
> -	else
> +		ndr_desc->target_node = acpi_map_pxm_to_node(
> +				spa->proximity_domain);
> +	} else {
>  		ndr_desc->numa_node = NUMA_NO_NODE;
> +		ndr_desc->target_node = NUMA_NO_NODE;
> +	}
>
>  	/*
>  	 * Persistence domain bits are hierarchical, if
> diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
> index 274699463b4f..b9d86babb13a 100644
> --- a/drivers/acpi/numa.c
> +++ b/drivers/acpi/numa.c
> @@ -84,6 +84,7 @@ int acpi_map_pxm_to_node(int pxm)
>
>  	return node;
>  }
> +EXPORT_SYMBOL(acpi_map_pxm_to_node);
>
>  /**
>   * acpi_map_pxm_to_online_node - Map proximity ID to online node
> diff
[PATCH] acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node
Persistent memory, as described by the ACPI NFIT (NVDIMM Firmware
Interface Table), is the first known instance of a memory range
described by a unique "target" proximity domain. The "initiator" and
"target" proximity domains are the approach that the ACPI HMAT
(Heterogeneous Memory Attributes Table) uses to describe the unique
performance properties of a memory range relative to a given initiator
(e.g. CPU or DMA device).

Currently the numa-node for a /dev/pmemX block-device or /dev/daxX.Y
char-device follows the traditional notion of 'numa-node' where the
attribute conveys the closest online numa-node. That numa-node attribute
is useful for cpu-binding and memory-binding processes *near* the
device. However, when the memory range backing a 'pmem', or 'dax' device
is onlined (memory hot-add) the memory-only-numa-node representing that
address needs to be differentiated from the set of online nodes. In
other words, the numa-node association of the device depends on whether
you can bind processes *near* the cpu-numa-node in the offline
device-case, or bind processes *on* the memory-range directly after the
backing address range is onlined.

Allow for the case that platform firmware describes persistent memory
with a unique proximity domain, i.e. when it is distinct from the
proximity of DRAM and CPUs that are on the same socket. Plumb the Linux
numa-node translation of that proximity through the libnvdimm region
device to namespaces that are in device-dax mode. With this in place the
proposed kmem driver [1] can optionally discover a unique numa-node
number for the address range as it transitions the memory from an
offline state managed by a device-driver to an online memory range
managed by the core-mm.
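
The difference between the two attributes described above can be
illustrated with a small userspace model. This is only a sketch under
stated assumptions, not the kernel code: the toy_* names, the four-node
topology, and the distance table are all hypothetical, and unlike the
kernel helpers the model simply returns NUMA_NO_NODE for an unknown
proximity domain. It shows 'numa_node' resolving to the closest online
node while 'target_node' keeps the memory range's own, possibly
offline, node.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define NUMA_NO_NODE	(-1)
#define MAX_NODES	4

/*
 * Hypothetical platform: nodes 0 and 1 have CPUs and DRAM and are
 * online; nodes 2 and 3 are pmem-only and offline until hot-added.
 */
struct toy_topology {
	int pxm_to_node[MAX_NODES];	/* proximity domain -> node id */
	bool online[MAX_NODES];
	int distance[MAX_NODES][MAX_NODES];
};

static const struct toy_topology toy = {
	.pxm_to_node = { 0, 1, 2, 3 },
	.online = { true, true, false, false },
	.distance = {
		{ 10, 21, 17, 28 },
		{ 21, 10, 28, 17 },
		{ 17, 28, 10, 28 },
		{ 28, 17, 28, 10 },
	},
};

/* Translate a proximity domain to its node id, online or not. */
static int toy_pxm_to_node(const struct toy_topology *t, int pxm)
{
	if (pxm < 0 || pxm >= MAX_NODES)
		return NUMA_NO_NODE;
	return t->pxm_to_node[pxm];
}

/* Translate a proximity domain to the closest *online* node. */
static int toy_pxm_to_online_node(const struct toy_topology *t, int pxm)
{
	int node = toy_pxm_to_node(t, pxm);
	int best = NUMA_NO_NODE, best_dist = INT_MAX;

	if (node == NUMA_NO_NODE)
		return NUMA_NO_NODE;
	assert(node < MAX_NODES);
	if (t->online[node])
		return node;

	/* The node itself is offline: pick the nearest online one. */
	for (int n = 0; n < MAX_NODES; n++) {
		if (t->online[n] && t->distance[node][n] < best_dist) {
			best_dist = t->distance[node][n];
			best = n;
		}
	}
	return best;
}

/* Toy stand-in for the two fields the patch fills in per region. */
struct toy_region_desc {
	int numa_node;		/* where to bind *near* the device */
	int target_node;	/* node of the memory range itself */
};

static struct toy_region_desc
toy_describe_region(const struct toy_topology *t, bool pxm_valid, int pxm)
{
	struct toy_region_desc d = { NUMA_NO_NODE, NUMA_NO_NODE };

	if (pxm_valid) {
		d.numa_node = toy_pxm_to_online_node(t, pxm);
		d.target_node = toy_pxm_to_node(t, pxm);
	}
	return d;
}
```

In this model, toy_describe_region(&toy, true, 2) gives numa_node 0 and
target_node 2: processes can be bound near node 0 while the range is
still driver-managed, and node 2 is the identity the range would take
once onlined.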

[1]: https://lkml.org/lkml/2018/10/23/9

Reported-by: Fan Du
Cc: Michael Ellerman
Cc: "Oliver O'Halloran"
Cc: Dave Hansen
Cc: Jérôme Glisse
Signed-off-by: Dan Williams
---
 arch/powerpc/platforms/pseries/papr_scm.c |    1 +
 drivers/acpi/nfit/core.c                  |    8 ++--
 drivers/acpi/numa.c                       |    1 +
 drivers/dax/bus.c                         |    4 +++-
 drivers/dax/bus.h                         |    3 ++-
 drivers/dax/dax-private.h                 |    4
 drivers/dax/pmem/core.c                   |    4 +++-
 drivers/nvdimm/e820.c                     |    1 +
 drivers/nvdimm/nd.h                       |    2 +-
 drivers/nvdimm/of_pmem.c                  |    1 +
 drivers/nvdimm/region_devs.c              |    1 +
 include/linux/libnvdimm.h                 |    1 +
 12 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index ee9372b65ca5..6a0a35b872d1 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -233,6 +233,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
 	memset(&ndr_desc, 0, sizeof(ndr_desc));
 	ndr_desc.attr_groups = region_attr_groups;
 	ndr_desc.numa_node = dev_to_node(&p->pdev->dev);
+	ndr_desc.target_node = ndr_desc.numa_node;
 	ndr_desc.res = &p->res;
 	ndr_desc.of_node = p->dn;
 	ndr_desc.provider_data = p;
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index f8c638f3c946..2225e3de33ac 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2825,11 +2825,15 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 	ndr_desc->res = &res;
 	ndr_desc->provider_data = nfit_spa;
 	ndr_desc->attr_groups = acpi_nfit_region_attribute_groups;
-	if (spa->flags & ACPI_NFIT_PROXIMITY_VALID)
+	if (spa->flags & ACPI_NFIT_PROXIMITY_VALID) {
 		ndr_desc->numa_node = acpi_map_pxm_to_online_node(
 						spa->proximity_domain);
-	else
+		ndr_desc->target_node = acpi_map_pxm_to_node(
+				spa->proximity_domain);
+	} else {
 		ndr_desc->numa_node = NUMA_NO_NODE;
+		ndr_desc->target_node = NUMA_NO_NODE;
+	}
 
 	/*
 	 * Persistence domain bits are hierarchical, if
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 274699463b4f..b9d86babb13a 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -84,6 +84,7 @@ int acpi_map_pxm_to_node(int pxm)
 
 	return node;
 }
+EXPORT_SYMBOL(acpi_map_pxm_to_node);
 
 /**
  * acpi_map_pxm_to_online_node - Map proximity ID to online node
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 568168500217..c620ad52d7e5 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -214,7 +214,7 @@ static void dax_region_unregister(void *region)
 }
 
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
-		struct resource *res, unsigned int align,
+		struct resource *res, int target_node, unsigned int align,
 		unsigned long pfn_flags)
 {