Re: [PATCH] acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node

2018-11-12 Thread Yang Shi
On Fri, Nov 9, 2018 at 3:24 PM Dan Williams  wrote:
>
> Persistent memory, as described by the ACPI NFIT (NVDIMM Firmware
> Interface Table), is the first known instance of a memory range
> described by a unique "target" proximity domain. "Initiator" and
> "target" proximity domains are the approach that the ACPI HMAT
> (Heterogeneous Memory Attributes Table) uses to describe the unique
> performance properties of a memory range relative to a given initiator
> (e.g. CPU or DMA device).
>
> Currently the numa-node for a /dev/pmemX block-device or /dev/daxX.Y
> char-device follows the traditional notion of 'numa-node' where the
> attribute conveys the closest online numa-node. That numa-node attribute
> is useful for cpu-binding and memory-binding processes *near* the
> device. However, when the memory range backing a 'pmem', or 'dax' device
> is onlined (memory hot-add) the memory-only-numa-node representing that
> address needs to be differentiated from the set of online nodes. In
> other words, the numa-node association of the device depends on whether
> you can bind processes *near* the cpu-numa-node in the offline
> device-case, or bind processes *on* the memory-range directly after the
> backing address range is onlined.
>
> Allow for the case that platform firmware describes persistent memory
> with a unique proximity domain, i.e. when it is distinct from the
> proximity of DRAM and CPUs that are on the same socket. Plumb the Linux
> numa-node translation of that proximity through the libnvdimm region
> device to namespaces that are in device-dax mode. With this in place the
> proposed kmem driver [1] can optionally discover a unique numa-node
> number for the address range as it transitions the memory from an
> offline state managed by a device-driver to an online memory range
> managed by the core-mm.
>
> [1]: https://lkml.org/lkml/2018/10/23/9
>
> Reported-by: Fan Du 

Thanks for coming up with the patch quickly. We reported the problem
to Fan Du and did a very similar, but preliminary, in-house
implementation.

This implementation looks good to me. Reviewed-by: Yang Shi


> Cc: Michael Ellerman 
> Cc: "Oliver O'Halloran" 
> Cc: Dave Hansen 
> Cc: Jérôme Glisse 
> Signed-off-by: Dan Williams 
> ---
>  arch/powerpc/platforms/pseries/papr_scm.c |1 +
>  drivers/acpi/nfit/core.c  |8 ++--
>  drivers/acpi/numa.c   |1 +
>  drivers/dax/bus.c |4 +++-
>  drivers/dax/bus.h |3 ++-
>  drivers/dax/dax-private.h |4 
>  drivers/dax/pmem/core.c   |4 +++-
>  drivers/nvdimm/e820.c |1 +
>  drivers/nvdimm/nd.h   |2 +-
>  drivers/nvdimm/of_pmem.c  |1 +
>  drivers/nvdimm/region_devs.c  |1 +
>  include/linux/libnvdimm.h |1 +
>  12 files changed, 25 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
> index ee9372b65ca5..6a0a35b872d1 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -233,6 +233,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
> memset(&ndr_desc, 0, sizeof(ndr_desc));
> ndr_desc.attr_groups = region_attr_groups;
> ndr_desc.numa_node = dev_to_node(&p->pdev->dev);
> +   ndr_desc.target_node = ndr_desc.numa_node;
> ndr_desc.res = &p->res;
> ndr_desc.of_node = p->dn;
> ndr_desc.provider_data = p;
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index f8c638f3c946..2225e3de33ac 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -2825,11 +2825,15 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
> ndr_desc->res = &res;
> ndr_desc->provider_data = nfit_spa;
> ndr_desc->attr_groups = acpi_nfit_region_attribute_groups;
> -   if (spa->flags & ACPI_NFIT_PROXIMITY_VALID)
> +   if (spa->flags & ACPI_NFIT_PROXIMITY_VALID) {
> ndr_desc->numa_node = acpi_map_pxm_to_online_node(
> spa->proximity_domain);
> -   else
> +   ndr_desc->target_node = acpi_map_pxm_to_node(
> +   spa->proximity_domain);
> +   } else {
> ndr_desc->numa_node = NUMA_NO_NODE;
> +   ndr_desc->target_node = NUMA_NO_NODE;
> +   }
>
> /*
>  * Persistence domain bits are hierarchical, if
> diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
> index 274699463b4f..b9d86babb13a 100644
> --- a/drivers/acpi/numa.c
> +++ b/drivers/acpi/numa.c
> @@ -84,6 +84,7 @@ int acpi_map_pxm_to_node(int pxm)
>
> return node;
>  }
> +EXPORT_SYMBOL(acpi_map_pxm_to_node);
>
>  /**
>   * acpi_map_pxm_to_online_node - Map proximity ID to online node
> diff 

[PATCH] acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node

2018-11-09 Thread Dan Williams
Persistent memory, as described by the ACPI NFIT (NVDIMM Firmware
Interface Table), is the first known instance of a memory range
described by a unique "target" proximity domain. "Initiator" and
"target" proximity domains are the approach that the ACPI HMAT
(Heterogeneous Memory Attributes Table) uses to describe the unique
performance properties of a memory range relative to a given initiator
(e.g. CPU or DMA device).

Currently the numa-node for a /dev/pmemX block-device or /dev/daxX.Y
char-device follows the traditional notion of 'numa-node' where the
attribute conveys the closest online numa-node. That numa-node attribute
is useful for cpu-binding and memory-binding processes *near* the
device. However, when the memory range backing a 'pmem', or 'dax' device
is onlined (memory hot-add) the memory-only-numa-node representing that
address needs to be differentiated from the set of online nodes. In
other words, the numa-node association of the device depends on whether
you can bind processes *near* the cpu-numa-node in the offline
device-case, or bind processes *on* the memory-range directly after the
backing address range is onlined.

Allow for the case that platform firmware describes persistent memory
with a unique proximity domain, i.e. when it is distinct from the
proximity of DRAM and CPUs that are on the same socket. Plumb the Linux
numa-node translation of that proximity through the libnvdimm region
device to namespaces that are in device-dax mode. With this in place the
proposed kmem driver [1] can optionally discover a unique numa-node
number for the address range as it transitions the memory from an
offline state managed by a device-driver to an online memory range
managed by the core-mm.

[1]: https://lkml.org/lkml/2018/10/23/9
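
For illustration only, the distinction between the two attributes can be
modeled in a few lines of userspace C. This is a toy sketch, not kernel
code: the four-node topology, the distance table, and the helpers
map_pxm_to_node(), map_pxm_to_online_node() and describe_region() are
hypothetical stand-ins. They assume an identity pxm-to-node mapping and
only loosely mirror how acpi_map_pxm_to_node() and
acpi_map_pxm_to_online_node() are used in the nfit hunk of this patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define NUMA_NO_NODE (-1)
#define MAX_NODES 4

/* Hypothetical topology: nodes 0-1 are online (CPU + DRAM),
 * nodes 2-3 are memory-only pmem nodes that are not yet online. */
static const bool node_online[MAX_NODES] = { true, true, false, false };

/* Hypothetical distance table, in the style of an ACPI SLIT. */
static const int node_dist[MAX_NODES][MAX_NODES] = {
	{ 10, 21, 17, 28 },
	{ 21, 10, 28, 17 },
	{ 17, 28, 10, 28 },
	{ 28, 17, 28, 10 },
};

/* Toy stand-in: identity mapping from proximity domain to node id. */
static int map_pxm_to_node(int pxm)
{
	return (pxm >= 0 && pxm < MAX_NODES) ? pxm : NUMA_NO_NODE;
}

/* Toy stand-in: if the mapped node is offline, fall back to the
 * nearest online node according to the distance table. */
static int map_pxm_to_online_node(int pxm)
{
	int node = map_pxm_to_node(pxm);
	int best = NUMA_NO_NODE, min_dist = INT_MAX;

	if (node == NUMA_NO_NODE)
		return NUMA_NO_NODE;
	/* map_pxm_to_node() only returns in-range node ids */
	assert(node >= 0 && node < MAX_NODES);
	if (node_online[node])
		return node;

	for (int n = 0; n < MAX_NODES; n++) {
		if (node_online[n] && node_dist[node][n] < min_dist) {
			min_dist = node_dist[node][n];
			best = n;
		}
	}
	return best;
}

/* Hypothetical, reduced region descriptor carrying both attributes. */
struct region_desc {
	int numa_node;   /* closest online node: bind *near* the device */
	int target_node; /* node of the range itself once it is onlined */
};

static struct region_desc describe_region(bool pxm_valid, int pxm)
{
	struct region_desc d = { NUMA_NO_NODE, NUMA_NO_NODE };

	if (pxm_valid) {
		d.numa_node = map_pxm_to_online_node(pxm);
		d.target_node = map_pxm_to_node(pxm);
	}
	return d;
}
```

Under these assumptions describe_region(true, 2) yields numa_node 0 (the
nearest online node) and target_node 2 (the node the range would occupy
once onlined), while an invalid proximity leaves both at NUMA_NO_NODE.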

Reported-by: Fan Du 
Cc: Michael Ellerman 
Cc: "Oliver O'Halloran" 
Cc: Dave Hansen 
Cc: Jérôme Glisse 
Signed-off-by: Dan Williams 
---
 arch/powerpc/platforms/pseries/papr_scm.c |1 +
 drivers/acpi/nfit/core.c  |8 ++--
 drivers/acpi/numa.c   |1 +
 drivers/dax/bus.c |4 +++-
 drivers/dax/bus.h |3 ++-
 drivers/dax/dax-private.h |4 
 drivers/dax/pmem/core.c   |4 +++-
 drivers/nvdimm/e820.c |1 +
 drivers/nvdimm/nd.h   |2 +-
 drivers/nvdimm/of_pmem.c  |1 +
 drivers/nvdimm/region_devs.c  |1 +
 include/linux/libnvdimm.h |1 +
 12 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index ee9372b65ca5..6a0a35b872d1 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -233,6 +233,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
memset(&ndr_desc, 0, sizeof(ndr_desc));
ndr_desc.attr_groups = region_attr_groups;
ndr_desc.numa_node = dev_to_node(&p->pdev->dev);
+   ndr_desc.target_node = ndr_desc.numa_node;
ndr_desc.res = &p->res;
ndr_desc.of_node = p->dn;
ndr_desc.provider_data = p;
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index f8c638f3c946..2225e3de33ac 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2825,11 +2825,15 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
ndr_desc->res = &res;
ndr_desc->provider_data = nfit_spa;
ndr_desc->attr_groups = acpi_nfit_region_attribute_groups;
-   if (spa->flags & ACPI_NFIT_PROXIMITY_VALID)
+   if (spa->flags & ACPI_NFIT_PROXIMITY_VALID) {
ndr_desc->numa_node = acpi_map_pxm_to_online_node(
spa->proximity_domain);
-   else
+   ndr_desc->target_node = acpi_map_pxm_to_node(
+   spa->proximity_domain);
+   } else {
ndr_desc->numa_node = NUMA_NO_NODE;
+   ndr_desc->target_node = NUMA_NO_NODE;
+   }
 
/*
 * Persistence domain bits are hierarchical, if
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 274699463b4f..b9d86babb13a 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -84,6 +84,7 @@ int acpi_map_pxm_to_node(int pxm)
 
return node;
 }
+EXPORT_SYMBOL(acpi_map_pxm_to_node);
 
 /**
  * acpi_map_pxm_to_online_node - Map proximity ID to online node
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 568168500217..c620ad52d7e5 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -214,7 +214,7 @@ static void dax_region_unregister(void *region)
 }
 
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
-   struct resource *res, unsigned int align,
+   struct resource *res, int target_node, unsigned int align,
unsigned long pfn_flags)
 {