Re: [nvdimm PATCH 0/6] Label initialization time optimizations

2018-10-10 Thread Dan Williams
On Wed, Oct 10, 2018 at 4:36 PM Alexander Duyck
 wrote:
>
> This patch set is intended to improve NVDIMM label read times by first
> increasing the upper limit on the label read/write size, and then
> reducing the number of reads by making use of the free label bitmap in
> the index to determine what labels are actually populated and only read
> those labels. In my testing on a system populated with 24 NVDIMM modules
> I see the total label init time drop from about 24 seconds down to 2 to
> 3 seconds.
>
> In the process of coding this up I came across a few minor issues that
> I felt should be addressed so I have added a few patches for those fixes
> along the way.

Thanks, this all looks good to me; the split is sensible, and the
commentary in patch 6 made it easy to review.

I quibble with "Remove empty if statement" because I find positive
logic easier to read than negative logic, but since it removes more
lines than it adds I'll reserve that quibble for another time.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v9 00/13] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-10-10 Thread Bjorn Helgaas
On Wed, Oct 10, 2018 at 05:03:33PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2018-10-10 2:19 p.m., Bjorn Helgaas wrote:
> > I added the reviewed-by tags from Christoph, Jens' ack on the blkdev.h
> > change, and applied these to pci/peer-to-peer with the intent of
> > merging these for v4.20.
> > 
> > I gave up on waiting for an ack for the memremap.h and mm.h changes.
> > 
> > I dropped the "nvme-pci: Add a quirk for a pseudo CMB" quirk because
> > of Christoph's objection.  After this is all merged, I won't need to
> > be involved, and you and the NVMe folks can hash that out.
> > 
> > If there are updates to "nvmet: Optionally use PCI P2P memory" based
> > on Sagi's comments, send an incremental patch and I'll fold them in.
> 
> Thanks for picking this up. However, I hate to throw a wrench in the
> works, but I had a v10[1] queued up because kbuild found some problems
> with the series over the weekend. I can send v10 off right away if you
> want to just replace it in your branch or, if you'd like, I can generate
> some incremental patches. Let me know which you'd prefer.

I applied the updates from your v10 to my pci/peer-to-peer branch.

> [1] https://github.com/sbates130272/linux-p2pmem pci-p2p-v10


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-10 Thread Yi Zhang
On 2018-10-10 at 11:18:49 -0700, Dan Williams wrote:
> On Wed, Oct 10, 2018 at 10:30 AM Michal Hocko  wrote:
> >
> > On Wed 10-10-18 09:39:08, Alexander Duyck wrote:
> > > On 10/10/2018 2:58 AM, Michal Hocko wrote:
> > > > On Tue 09-10-18 13:26:41, Alexander Duyck wrote:
> > > > [...]
> > > > > I would think with that being the case we still probably need the
> > > > > call to __SetPageReserved to set the bit with the expectation that
> > > > > it will not be cleared for device-pages since the pages are not
> > > > > onlined. Removing the call to __SetPageReserved would probably
> > > > > introduce a number of regressions as there are multiple spots that
> > > > > use the reserved bit to determine if a page can be swapped out to
> > > > > disk, mapped as system memory, or migrated.
> > > >
> > > > PageReserved is meant to tell any potential pfn walkers that might get
> > > > to this struct page to back off and not touch it. Even though
> > > > ZONE_DEVICE doesn't online pages in traditional sense it makes those
> > > > pages available for further use so the page reserved bit should be
> > > > cleared.
> > >
> > > So from what I can tell that isn't necessarily the case. Specifically
> > > if the pagemap type is MEMORY_DEVICE_PRIVATE or MEMORY_DEVICE_PUBLIC,
> > > both are special cases where the memory may not be accessible to the
> > > CPU or cannot be pinned in order to allow for eviction.
> >
> > Could you give me an example please?
> >
> > > The specific case that Dan and Yi are referring to is for the type
> > > MEMORY_DEVICE_FS_DAX. For that type I could probably look at not
> > > setting the reserved bit. Part of me wants to say that we should wait
> > > and clear the bit later, but that would end up just adding time back
> > > to initialization. At this point I would consider the change more of
> > > a follow-up optimization rather than a fix though, since this is
> > > tailoring things specifically for DAX versus the other ZONE_DEVICE
> > > types.
> >
> > I thought I have already made it clear that these zone device hacks are
> > not acceptable to the generic hotplug code. If the current reserve bit
> > handling is not correct then give us a specific reason for that and we
> > can start thinking about the proper fix.
> >
> 
> Right, so we're in a situation where a hack is needed for KVM's
> current interpretation of the Reserved flag relative to dax mapped
> pages. I'm arguing to push that knowledge / handling as deep as
> possible into the core rather than hack the leaf implementations like
> KVM, i.e. disable the Reserved flag for all non-MEMORY_DEVICE_*
> ZONE_DEVICE types.
> 
> Here is the KVM thread about why they need a change:
> 
> https://lkml.org/lkml/2018/9/7/552
> 
> ...and where I pushed back on a KVM-local hack:
> 
> https://lkml.org/lkml/2018/9/19/154
Yeah, thanks Dan. I think I can go on with something like this:

@@ -5589,6 +5589,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
struct page *page = pfn_to_page(pfn);
 
__init_single_page(page, pfn, zone_idx, nid);
+   /* Could we move this a little earlier, so that
+    * is_dax_page(page) could be used directly, or something else?
+    */
+   page->pgmap = pgmap;
 
/*
 * Mark page reserved as it will need to wait for onlining
@@ -5597,14 +5598,14 @@ void __ref memmap_init_zone_device(struct zone *zone,
 * We can use the non-atomic __set_bit operation for setting
 * the flag as we are still initializing the pages.
 */
-   __SetPageReserved(page);
+   if (!is_dax_page(page))
+   __SetPageReserved(page);
 
/*
 * ZONE_DEVICE pages union ->lru with a ->pgmap back
 * pointer and hmm_data.  It is a bug if a ZONE_DEVICE
 * page is ever freed or placed on a driver-private list.
 */
-   page->pgmap = pgmap;
page->hmm_data = 0;


After Alex's patch merged.


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-10 Thread Yi Zhang
On 2018-10-10 at 08:27:58 -0700, Alexander Duyck wrote:
> 
> 
> On 10/10/2018 5:52 AM, Yi Zhang wrote:
> >On 2018-10-09 at 14:19:32 -0700, Dan Williams wrote:
> >>On Tue, Oct 9, 2018 at 1:34 PM Alexander Duyck
> >> wrote:
> >>>
> >>>On 10/9/2018 11:04 AM, Dan Williams wrote:
> On Tue, Oct 9, 2018 at 3:21 AM Yi Zhang  
> wrote:
> >>[..]
> That comment is incorrect, device-pages are never onlined. So I think
> we can just skip that call to __SetPageReserved() unless the memory
> range is MEMORY_DEVICE_{PRIVATE,PUBLIC}.
> 
> >>>
> >>>When pages are "onlined" via __free_pages_boot_core they clear the
> >>>reserved bit, that is the reason for the comment. The reserved bit is
> >>>meant to indicate that the page cannot be swapped out or moved based on
> >>>the description of the bit.
> >>
> >>...but ZONE_DEVICE pages are never onlined so I would expect
> >>memmap_init_zone_device() to know that detail.
> >>
> >>>I would think with that being the case we still probably need the call
> >>>to __SetPageReserved to set the bit with the expectation that it will
> >>>not be cleared for device-pages since the pages are not onlined.
> >>>Removing the call to __SetPageReserved would probably introduce a number
> >>>of regressions as there are multiple spots that use the reserved bit to
> >>>determine if a page can be swapped out to disk, mapped as system memory,
> >>>or migrated.
> >
> >Another thing: it seems the page init and SetPageReserved have already
> >been done in move_pfn_range_to_zone:
> > |-->memmap_init_zone
> >     |-->for_each_page_in_pfn
> >         |-->__init_single_page
> >         |-->SetPageReserved
> >
> >Why haven't we removed this redundant initialization in memmap_init_zone?
> >
> >Correct me if I missed something.
> 
> In this case it isn't redundant as only the vmmemmap pages are initialized
> in memmap_init_zone now. So all of the pages that are going to be used as
> device pages are not initialized until the call to memmap_init_zone_device.
> What I did is split the initialization of the pages into two parts in order
> to allow us to initialize the pages outside of the hotplug lock.
Ah, I see. Thanks for the explanation; so we only need to care about the
reserved flag on the device pages, and plan to remove that.
> 
> >>
> >>Right, this is what Yi is working on... the PageReserved flag is
> >>problematic for KVM. Auditing those locations it seems as long as we
> >>teach hibernation to avoid ZONE_DEVICE ranges we can safely not set
> >>the reserved flag for DAX pages. What I'm trying to avoid is a local
> >>KVM hack to check for DAX pages when the Reserved flag is not
> >>otherwise needed.
> >Thanks Dan. Provide the patch link.
> >
> >https://lore.kernel.org/lkml/cover.1536342881.git.yi.z.zh...@linux.intel.com
> 
> So it looks like your current logic is just working around the bit then
> since it just allows for reserved DAX pages.
> 
> 


Reply: Why, as a salesperson, do my efforts never get the expected results?

2018-10-10 Thread 孙经理
91469082481
Please read the attachment for the detailed course outline
8:40:48


Re: [PATCH 2/2] powerpc/pseries: Add driver for PAPR SCM regions

2018-10-10 Thread Michael Ellerman
Oliver O'Halloran  writes:

> Adds a driver that implements support for enabling and accessing PAPR
> SCM regions. Unfortunately due to how the PAPR interface works we can't
> use the existing of_pmem driver (yet) because:
>
>  a) The guest is required to use the H_SCM_BIND_MEM h-call to add
> the SCM region to its physical address space, and
>  b) There is currently no mechanism for relating a bare of_pmem region
> to the backing DIMM (or not-a-DIMM for our case).
>
> Both of these are easily handled by rolling the functionality into a
> separate driver so here we are...
>
> Signed-off-by: Oliver O'Halloran 
> ---
> The alternative implementation here is that we have the pseries code
> do the h-calls and craft a pmem-region@ node based on that.
> However, that doesn't solve b) and mpe has expressed his dislike of
> adding new stuff to the DT at runtime so i'd say that's a non-starter.

Hmm, from memory you yelled something at me across the office about
that, so my response may not have been entirely well considered.

I'm not quite sure what the state of the OF overlays support is, but
that would be The Right Way to do that sort of modification to the
device tree at runtime.

If we merged this and then later got the of_pmem driver to work for us
would there be any user-visible change?

cheers


[nvdimm PATCH 4/6] nvdimm: Remove empty if statement

2018-10-10 Thread Alexander Duyck
This patch removes an empty statement from an if expression and promotes
the else branch to the if expression, with the condition inverted.

I feel this is more readable, as the empty statement could lead to issues
if any additional logic were ever added.

Signed-off-by: Alexander Duyck 
---
 drivers/nvdimm/label.c |5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/label.c b/drivers/nvdimm/label.c
index bb813b8e8ace..43bad0d5bdb6 100644
--- a/drivers/nvdimm/label.c
+++ b/drivers/nvdimm/label.c
@@ -261,9 +261,8 @@ int nd_label_validate(struct nvdimm_drvdata *ndd)
 void nd_label_copy(struct nvdimm_drvdata *ndd, struct nd_namespace_index *dst,
struct nd_namespace_index *src)
 {
-   if (dst && src)
-   /* pass */;
-   else
+   /* just exit if either destination or source is NULL */
+   if (!dst || !src)
return;
 
memcpy(dst, src, sizeof_namespace_index(ndd));



[nvdimm PATCH 5/6] nvdimm: Split label init out from the logic for getting config data

2018-10-10 Thread Alexander Duyck
This patch splits the initialization of the label data into two functions.
One for doing the init, and another for reading the actual configuration
data. The idea behind this is that by doing this we create a symmetry
between the getting and setting of config data in that we have a function
for both. In addition it will make it easier for us to identify the bits
that are related to init versus the pieces that are a wrapper for reading
data from the ACPI interface.

For example, splitting things out like this makes it much more obvious
that we were performing checks that weren't really related to the set/get
operations, such as relying on ndd->data being present, when the set and
get ops should not care about a locally cached copy of the label area.

Signed-off-by: Alexander Duyck 
---
 drivers/nvdimm/dimm.c  |2 +-
 drivers/nvdimm/dimm_devs.c |   49 +---
 drivers/nvdimm/label.c |   38 ++
 drivers/nvdimm/label.h |1 +
 drivers/nvdimm/nd.h|2 ++
 5 files changed, 61 insertions(+), 31 deletions(-)

diff --git a/drivers/nvdimm/dimm.c b/drivers/nvdimm/dimm.c
index 6c8fb7590838..07bf96948553 100644
--- a/drivers/nvdimm/dimm.c
+++ b/drivers/nvdimm/dimm.c
@@ -75,7 +75,7 @@ static int nvdimm_probe(struct device *dev)
 * DIMM capacity. We fail the dimm probe to prevent regions from
 * attempting to parse the label area.
 */
-   rc = nvdimm_init_config_data(ndd);
+   rc = nd_label_data_init(ndd);
if (rc == -EACCES)
nvdimm_set_locked(dev);
if (rc)
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 75ac78017b15..6c3de2317390 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -85,55 +85,47 @@ int nvdimm_init_nsarea(struct nvdimm_drvdata *ndd)
return cmd_rc;
 }
 
-int nvdimm_init_config_data(struct nvdimm_drvdata *ndd)
+int nvdimm_get_config_data(struct nvdimm_drvdata *ndd, void *buf,
+  size_t offset, size_t len)
 {
struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(ndd->dev);
+   struct nvdimm_bus_descriptor *nd_desc = nvdimm_bus->nd_desc;
int rc = validate_dimm(ndd), cmd_rc = 0;
struct nd_cmd_get_config_data_hdr *cmd;
-   struct nvdimm_bus_descriptor *nd_desc;
-   u32 max_cmd_size, config_size;
-   size_t offset;
+   size_t max_cmd_size, buf_offset;
 
if (rc)
return rc;
 
-   if (ndd->data)
-   return 0;
-
-   if (ndd->nsarea.status || ndd->nsarea.max_xfer == 0
-   || ndd->nsarea.config_size < ND_LABEL_MIN_SIZE) {
-   dev_dbg(ndd->dev, "failed to init config data area: (%d:%d)\n",
-   ndd->nsarea.max_xfer, ndd->nsarea.config_size);
+   if (offset + len > ndd->nsarea.config_size)
return -ENXIO;
-   }
 
-   ndd->data = kvmalloc(ndd->nsarea.config_size, GFP_KERNEL);
-   if (!ndd->data)
-   return -ENOMEM;
-
-   max_cmd_size = min_t(u32, ndd->nsarea.config_size, ndd->nsarea.max_xfer);
+   max_cmd_size = min_t(u32, len, ndd->nsarea.max_xfer);
cmd = kvzalloc(max_cmd_size + sizeof(*cmd), GFP_KERNEL);
if (!cmd)
return -ENOMEM;
 
-   nd_desc = nvdimm_bus->nd_desc;
-   for (config_size = ndd->nsarea.config_size, offset = 0;
-   config_size; config_size -= cmd->in_length,
-   offset += cmd->in_length) {
-   cmd->in_length = min(config_size, max_cmd_size);
-   cmd->in_offset = offset;
+   for (buf_offset = 0; len;
+len -= cmd->in_length, buf_offset += cmd->in_length) {
+   size_t cmd_size;
+
+   cmd->in_offset = offset + buf_offset;
+   cmd->in_length = min(max_cmd_size, len);
+
+   cmd_size = sizeof(*cmd) + cmd->in_length;
+
rc = nd_desc->ndctl(nd_desc, to_nvdimm(ndd->dev),
-   ND_CMD_GET_CONFIG_DATA, cmd,
-   cmd->in_length + sizeof(*cmd), &cmd_rc);
+   ND_CMD_GET_CONFIG_DATA, cmd, cmd_size, &cmd_rc);
if (rc < 0)
break;
if (cmd_rc < 0) {
rc = cmd_rc;
break;
}
-   memcpy(ndd->data + offset, cmd->out_buf, cmd->in_length);
+
+   /* out_buf should be valid, copy it into our output buffer */
+   memcpy(buf + buf_offset, cmd->out_buf, cmd->in_length);
}
-   dev_dbg(ndd->dev, "len: %zu rc: %d\n", offset, rc);
kvfree(cmd);
 
return rc;
@@ -151,9 +143,6 @@ int nvdimm_set_config_data(struct nvdimm_drvdata *ndd, size_t offset,
if (rc)
return rc;
 
-   if (!ndd->data)
-   return -ENXIO;
-
if (offset + len > 

[nvdimm PATCH 6/6] nvdimm: Use namespace index data to reduce number of label reads needed

2018-10-10 Thread Alexander Duyck
This patch adds logic that is meant to make use of the namespace index data
to reduce the number of reads that are needed to initialize a given
namespace. The general idea is that once we have enough data to validate
the namespace index we do so and then proceed to fetch only those labels
that are not listed as being "free". By doing this I am seeing a total time
reduction from about 4-5 seconds to 2-3 seconds for 24 NVDIMM modules each
with 128K of label config area.

Signed-off-by: Alexander Duyck 
---
 drivers/nvdimm/dimm.c  |4 --
 drivers/nvdimm/label.c |   93 +---
 drivers/nvdimm/label.h |3 --
 3 files changed, 88 insertions(+), 12 deletions(-)

diff --git a/drivers/nvdimm/dimm.c b/drivers/nvdimm/dimm.c
index 07bf96948553..9899c97138a3 100644
--- a/drivers/nvdimm/dimm.c
+++ b/drivers/nvdimm/dimm.c
@@ -84,10 +84,6 @@ static int nvdimm_probe(struct device *dev)
dev_dbg(dev, "config data size: %d\n", ndd->nsarea.config_size);
 
nvdimm_bus_lock(dev);
-   ndd->ns_current = nd_label_validate(ndd);
-   ndd->ns_next = nd_label_next_nsindex(ndd->ns_current);
-   nd_label_copy(ndd, to_next_namespace_index(ndd),
-   to_current_namespace_index(ndd));
if (ndd->ns_current >= 0) {
rc = nd_label_reserve_dpa(ndd);
if (rc == 0)
diff --git a/drivers/nvdimm/label.c b/drivers/nvdimm/label.c
index 563f24af01b5..7f03d117824f 100644
--- a/drivers/nvdimm/label.c
+++ b/drivers/nvdimm/label.c
@@ -235,7 +235,7 @@ static int __nd_label_validate(struct nvdimm_drvdata *ndd)
return -1;
 }
 
-int nd_label_validate(struct nvdimm_drvdata *ndd)
+static int nd_label_validate(struct nvdimm_drvdata *ndd)
 {
/*
 * In order to probe for and validate namespace index blocks we
@@ -258,8 +258,9 @@ int nd_label_validate(struct nvdimm_drvdata *ndd)
return -1;
 }
 
-void nd_label_copy(struct nvdimm_drvdata *ndd, struct nd_namespace_index *dst,
-   struct nd_namespace_index *src)
+static void nd_label_copy(struct nvdimm_drvdata *ndd,
+ struct nd_namespace_index *dst,
+ struct nd_namespace_index *src)
 {
/* just exit if either destination or source is NULL */
if (!dst || !src)
@@ -419,7 +420,9 @@ int nd_label_reserve_dpa(struct nvdimm_drvdata *ndd)
 
 int nd_label_data_init(struct nvdimm_drvdata *ndd)
 {
-   size_t config_size, read_size;
+   size_t config_size, read_size, max_xfer, offset;
+   struct nd_namespace_index *nsindex;
+   unsigned int i;
int rc = 0;
 
if (ndd->data)
@@ -452,7 +455,87 @@ int nd_label_data_init(struct nvdimm_drvdata *ndd)
if (!ndd->data)
return -ENOMEM;
 
-   return nvdimm_get_config_data(ndd, ndd->data, 0, config_size);
+   /*
+* We want to guarantee as few reads as possible while conserving
+* memory. To do that we figure out how much unused space will be left
+* in the last read, divide that by the total number of reads it is
+* going to take given our maximum transfer size, and then reduce our
+* maximum transfer size based on that result.
+*/
+   max_xfer = min_t(size_t, ndd->nsarea.max_xfer, config_size);
+   if (read_size < max_xfer) {
+   /* trim waste */
+   max_xfer -= ((max_xfer - 1) - (config_size - 1) % max_xfer) /
+   DIV_ROUND_UP(config_size, max_xfer);
+   /* make certain we read indexes in exactly 1 read */
+   if (max_xfer < read_size)
+   max_xfer = read_size;
+   }
+
+   /* Make our initial read size a multiple of max_xfer size */
+   read_size = min(DIV_ROUND_UP(read_size, max_xfer) * max_xfer,
+   config_size);
+
+   /* Read the index data */
+   rc = nvdimm_get_config_data(ndd, ndd->data, 0, read_size);
+   if (rc)
+   goto out_err;
+
+   /* Validate index data, if not valid assume all labels are invalid */
+   ndd->ns_current = nd_label_validate(ndd);
+   if (ndd->ns_current < 0)
+   return 0;
+
+   /* Record our index values */
+   ndd->ns_next = nd_label_next_nsindex(ndd->ns_current);
+
+   /* Copy "current" index on top of the "next" index */
+   nsindex = to_current_namespace_index(ndd);
+   nd_label_copy(ndd, to_next_namespace_index(ndd), nsindex);
+
+   /* Determine starting offset for label data */
+   offset = __le64_to_cpu(nsindex->labeloff);
+
+   /* Loop through the free list pulling in any active labels */
+   for (i = 0; i < nsindex->nslot; i++, offset += ndd->nslabel_size) {
+   size_t label_read_size;
+
+   /* zero out the unused labels */
+   if (test_bit_le(i, nsindex->free)) {
+   memset(ndd->data + offset, 0, ndd->nslabel_size);
+   

[nvdimm PATCH 3/6] nvdimm: Clarify comment in sizeof_namespace_index

2018-10-10 Thread Alexander Duyck
When working on the label code I found it rather confusing to see several
spots that reference a minimum label size of 256 while working with labels
that are 128 bytes in size.

This patch is meant to clarify one of the comments that was at the heart
of the issue. Specifically, for version 1.2 and later of the namespace
specification the minimum label size is 256 bytes; prior to that the
minimum label size was 128 bytes. We should state that as such to avoid
confusion.

Signed-off-by: Alexander Duyck 
---
 drivers/nvdimm/label.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/label.c b/drivers/nvdimm/label.c
index 1f5842509dbc..bb813b8e8ace 100644
--- a/drivers/nvdimm/label.c
+++ b/drivers/nvdimm/label.c
@@ -75,7 +75,8 @@ size_t sizeof_namespace_index(struct nvdimm_drvdata *ndd)
/*
 * Per UEFI 2.7, the minimum size of the Label Storage Area is large
 * enough to hold 2 index blocks and 2 labels.  The minimum index
-* block size is 256 bytes, and the minimum label size is 256 bytes.
+* block size is 256 bytes. The label size is 128 for namespaces
+* prior to version 1.2 and at minimum 256 for version 1.2 and later.
 */
nslot = nvdimm_num_label_slots(ndd);
space = ndd->nsarea.config_size - nslot * sizeof_namespace_label(ndd);



[nvdimm PATCH 2/6] nvdimm: Sanity check labeloff

2018-10-10 Thread Alexander Duyck
This patch adds validation for the labeloff field in the indexes.

Signed-off-by: Alexander Duyck 
---
 drivers/nvdimm/label.c |7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/nvdimm/label.c b/drivers/nvdimm/label.c
index 1d28cd656536..1f5842509dbc 100644
--- a/drivers/nvdimm/label.c
+++ b/drivers/nvdimm/label.c
@@ -183,6 +183,13 @@ static int __nd_label_validate(struct nvdimm_drvdata *ndd)
__le64_to_cpu(nsindex[i]->otheroff));
continue;
}
+   if (__le64_to_cpu(nsindex[i]->labeloff)
+   != 2 * sizeof_namespace_index(ndd)) {
+   dev_dbg(dev, "nsindex%d labeloff: %#llx invalid\n",
+   i, (unsigned long long)
+   __le64_to_cpu(nsindex[i]->labeloff));
+   continue;
+   }
 
size = __le64_to_cpu(nsindex[i]->mysize);
if (size > sizeof_namespace_index(ndd)



[nvdimm PATCH 1/6] libnvdimm, dimm: Maximize label transfer size

2018-10-10 Thread Alexander Duyck
From: Dan Williams 

Use kvzalloc() to bypass the arbitrary PAGE_SIZE limit of label transfer
operations. Given the expense of calling into firmware, maximize the
amount of label data we transfer per call to be up to the total label
space if allowed by the firmware.

Instead of limiting based on PAGE_SIZE, we can simply limit the maximum
size based on either the config_size in the case of the get operation, or
the length of the write in the case of the set operation.

On a system with 24 NVDIMM modules each with a config_size of 128K and a
maximum transfer size of 64K - 4, this patch reduces the init time for the
label data from around 24 seconds down to between 4-5 seconds.

Signed-off-by: Dan Williams 
Signed-off-by: Alexander Duyck 
---

I think I only did a few minor tweaks on this so I thought I would leave
Dan as the author and just add my Signed-off-by.

 drivers/nvdimm/dimm_devs.c |   13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 863cabc35215..75ac78017b15 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -111,8 +111,8 @@ int nvdimm_init_config_data(struct nvdimm_drvdata *ndd)
if (!ndd->data)
return -ENOMEM;
 
-   max_cmd_size = min_t(u32, PAGE_SIZE, ndd->nsarea.max_xfer);
-   cmd = kzalloc(max_cmd_size + sizeof(*cmd), GFP_KERNEL);
-   max_cmd_size = min_t(u32, ndd->nsarea.config_size, ndd->nsarea.max_xfer);
+   cmd = kvzalloc(max_cmd_size + sizeof(*cmd), GFP_KERNEL);
if (!cmd)
return -ENOMEM;
 
@@ -134,7 +134,7 @@ int nvdimm_init_config_data(struct nvdimm_drvdata *ndd)
memcpy(ndd->data + offset, cmd->out_buf, cmd->in_length);
}
dev_dbg(ndd->dev, "len: %zu rc: %d\n", offset, rc);
-   kfree(cmd);
+   kvfree(cmd);
 
return rc;
 }
@@ -157,9 +157,8 @@ int nvdimm_set_config_data(struct nvdimm_drvdata *ndd, size_t offset,
if (offset + len > ndd->nsarea.config_size)
return -ENXIO;
 
-   max_cmd_size = min_t(u32, PAGE_SIZE, len);
-   max_cmd_size = min_t(u32, max_cmd_size, ndd->nsarea.max_xfer);
-   cmd = kzalloc(max_cmd_size + sizeof(*cmd) + sizeof(u32), GFP_KERNEL);
+   max_cmd_size = min_t(u32, len, ndd->nsarea.max_xfer);
+   cmd = kvzalloc(max_cmd_size + sizeof(*cmd) + sizeof(u32), GFP_KERNEL);
if (!cmd)
return -ENOMEM;
 
@@ -183,7 +182,7 @@ int nvdimm_set_config_data(struct nvdimm_drvdata *ndd, size_t offset,
break;
}
}
-   kfree(cmd);
+   kvfree(cmd);
 
return rc;
 }



[nvdimm PATCH 0/6] Label initialization time optimizations

2018-10-10 Thread Alexander Duyck
This patch set is intended to improve NVDIMM label read times by first
increasing the upper limit on the label read/write size, and then
reducing the number of reads by making use of the free label bitmap in
the index to determine what labels are actually populated and only read
those labels. In my testing on a system populated with 24 NVDIMM modules
I see the total label init time drop from about 24 seconds down to 2 to
3 seconds. 

In the process of coding this up I came across a few minor issues that
I felt should be addressed so I have added a few patches for those fixes
along the way.

---

Alexander Duyck (5):
  nvdimm: Sanity check labeloff
  nvdimm: Clarify comment in sizeof_namespace_index
  nvdimm: Remove empty if statement
  nvdimm: Split label init out from the logic for getting config data
  nvdimm: Use namespace index data to reduce number of label reads needed

Dan Williams (1):
  libnvdimm, dimm: Maximize label transfer size


 drivers/nvdimm/dimm.c  |6 --
 drivers/nvdimm/dimm_devs.c |   60 +++
 drivers/nvdimm/label.c |  142 ++--
 drivers/nvdimm/label.h |4 -
 drivers/nvdimm/nd.h|2 +
 5 files changed, 163 insertions(+), 51 deletions(-)

--


Re: [PATCH v9 00/13] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-10-10 Thread Logan Gunthorpe



On 2018-10-10 2:19 p.m., Bjorn Helgaas wrote:
> I added the reviewed-by tags from Christoph, Jens' ack on the blkdev.h
> change, and applied these to pci/peer-to-peer with the intent of
> merging these for v4.20.
> 
> I gave up on waiting for an ack for the memremap.h and mm.h changes.
> 
> I dropped the "nvme-pci: Add a quirk for a pseudo CMB" quirk because
> of Christoph's objection.  After this is all merged, I won't need to
> be involved, and you and the NVMe folks can hash that out.
> 
> If there are updates to "nvmet: Optionally use PCI P2P memory" based
> on Sagi's comments, send an incremental patch and I'll fold them in.

Thanks for picking this up. However, I hate to throw a wrench in the
works, but I had a v10[1] queued up because kbuild found some problems
with the series over the weekend. I can send v10 off right away if you
want to just replace it in your branch or, if you'd like, I can generate
some incremental patches. Let me know which you'd prefer.

Thanks,

Logan

[1] https://github.com/sbates130272/linux-p2pmem pci-p2p-v10


Re: [PATCH 1/2] powerpc/pseries: PAPR persistent memory support

2018-10-10 Thread Oliver
On Thu, Oct 11, 2018 at 3:36 AM Nathan Fontenot
 wrote:
>
> On 10/10/2018 01:08 AM, Oliver O'Halloran wrote:
> > This patch implements support for discovering storage class memory
> > devices at boot and for handling hotplug of new regions via RTAS
> > hotplug events.
> >
> > Signed-off-by: Oliver O'Halloran 
> > ---
> >  arch/powerpc/include/asm/firmware.h   |  3 ++-
> >  arch/powerpc/include/asm/hvcall.h | 10 +-
> >  arch/powerpc/include/asm/rtas.h   |  2 ++
> >  arch/powerpc/kernel/rtasd.c   |  2 ++
> >  arch/powerpc/platforms/pseries/Makefile   |  2 +-
> >  arch/powerpc/platforms/pseries/dlpar.c|  4 
> >  arch/powerpc/platforms/pseries/firmware.c |  1 +
> >  arch/powerpc/platforms/pseries/pseries.h  |  5 +
> >  arch/powerpc/platforms/pseries/ras.c  |  3 ++-
> >  9 files changed, 28 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/firmware.h 
> > b/arch/powerpc/include/asm/firmware.h
> > index 7a051bd21f87..113c64d5d394 100644
> > --- a/arch/powerpc/include/asm/firmware.h
> > +++ b/arch/powerpc/include/asm/firmware.h
> > @@ -52,6 +52,7 @@
> >  #define FW_FEATURE_PRRN  ASM_CONST(0x0002)
> >  #define FW_FEATURE_DRMEM_V2  ASM_CONST(0x0004)
> >  #define FW_FEATURE_DRC_INFO  ASM_CONST(0x0008)
> > +#define FW_FEATURE_PAPR_SCM  ASM_CONST(0x0010)
> >
> >  #ifndef __ASSEMBLY__
> >
> > @@ -69,7 +70,7 @@ enum {
> >   FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY |
> >   FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
> >   FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
> > - FW_FEATURE_DRC_INFO,
> > + FW_FEATURE_DRC_INFO | FW_FEATURE_PAPR_SCM,
> >   FW_FEATURE_PSERIES_ALWAYS = 0,
> >   FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL,
> >   FW_FEATURE_POWERNV_ALWAYS = 0,
> > diff --git a/arch/powerpc/include/asm/hvcall.h 
> > b/arch/powerpc/include/asm/hvcall.h
> > index a0b17f9f1ea4..0e81ef83b35a 100644
> > --- a/arch/powerpc/include/asm/hvcall.h
> > +++ b/arch/powerpc/include/asm/hvcall.h
> > @@ -295,7 +295,15 @@
> >  #define H_INT_ESB   0x3C8
> >  #define H_INT_SYNC  0x3CC
> >  #define H_INT_RESET 0x3D0
> > -#define MAX_HCALL_OPCODE H_INT_RESET
> > +#define H_SCM_READ_METADATA 0x3E4
> > +#define H_SCM_WRITE_METADATA0x3E8
> > +#define H_SCM_BIND_MEM  0x3EC
> > +#define H_SCM_UNBIND_MEM0x3F0
> > +#define H_SCM_QUERY_BLOCK_MEM_BINDING 0x3F4
> > +#define H_SCM_QUERY_LOGICAL_MEM_BINDING 0x3F8
> > +#define H_SCM_MEM_QUERY  0x3FC
> > +#define H_SCM_BLOCK_CLEAR   0x400
> > +#define MAX_HCALL_OPCODE H_SCM_BLOCK_CLEAR
> >
> >  /* H_VIOCTL functions */
> >  #define H_GET_VIOA_DUMP_SIZE 0x01
> > diff --git a/arch/powerpc/include/asm/rtas.h 
> > b/arch/powerpc/include/asm/rtas.h
> > index 71e393c46a49..1e81f3d55457 100644
> > --- a/arch/powerpc/include/asm/rtas.h
> > +++ b/arch/powerpc/include/asm/rtas.h
> > @@ -125,6 +125,7 @@ struct rtas_suspend_me_data {
> >  #define RTAS_TYPE_INFO   0xE2
> >  #define RTAS_TYPE_DEALLOC0xE3
> >  #define RTAS_TYPE_DUMP   0xE4
> > +#define RTAS_TYPE_HOTPLUG0xE5
> >  /* I don't add PowerMGM events right now, this is a different topic */
> >  #define RTAS_TYPE_PMGM_POWER_SW_ON   0x60
> >  #define RTAS_TYPE_PMGM_POWER_SW_OFF  0x61
> > @@ -316,6 +317,7 @@ struct pseries_hp_errorlog {
> >  #define PSERIES_HP_ELOG_RESOURCE_MEM 2
> >  #define PSERIES_HP_ELOG_RESOURCE_SLOT3
> >  #define PSERIES_HP_ELOG_RESOURCE_PHB 4
> > +#define PSERIES_HP_ELOG_RESOURCE_PMEM   6
> >
> >  #define PSERIES_HP_ELOG_ACTION_ADD   1
> >  #define PSERIES_HP_ELOG_ACTION_REMOVE2
> > diff --git a/arch/powerpc/kernel/rtasd.c b/arch/powerpc/kernel/rtasd.c
> > index 6fafc82c04b0..fad0baddfcba 100644
> > --- a/arch/powerpc/kernel/rtasd.c
> > +++ b/arch/powerpc/kernel/rtasd.c
> > @@ -91,6 +91,8 @@ static char *rtas_event_type(int type)
> >   return "Dump Notification Event";
> >   case RTAS_TYPE_PRRN:
> >   return "Platform Resource Reassignment Event";
> > + case RTAS_TYPE_HOTPLUG:
> > + return "Hotplug Event";
> >   }
> >
> >   return rtas_type[0];
> > diff --git a/arch/powerpc/platforms/pseries/Makefile 
> > b/arch/powerpc/platforms/pseries/Makefile
> > index 7e89d5c47068..892b27ced973 100644
> > --- a/arch/powerpc/platforms/pseries/Makefile
> > +++ b/arch/powerpc/platforms/pseries/Makefile
> > @@ -13,7 +13,7 @@ obj-$(CONFIG_KEXEC_CORE)+= kexec.o
> >  obj-$(CONFIG_PSERIES_ENERGY) += pseries_energy.o
> >
> >  obj-$(CONFIG_HOTPLUG_CPU)+= hotplug-cpu.o
> > -obj-$(CONFIG_MEMORY_HOTPLUG) += hotplug-memory.o
> > +obj-$(CONFIG_MEMORY_HOTPLUG) += hotplug-memory.o pmem.o
> >
> >  obj-$(CONFIG_HVC_CONSOLE)+= hvconsole.o
> >  obj-$(CONFIG_HVCS)

Re: [PATCH v9 00/13] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-10-10 Thread Bjorn Helgaas
On Thu, Oct 04, 2018 at 03:27:34PM -0600, Logan Gunthorpe wrote:
> This is v9 for the p2pdma patch set. It has some substantial changes
> from the previous version. Essentially, the last patch has changed
> based on information from Sagi that the P2P memory can be assigned
> per namespace instead of globally. To that end, I've added a
> radix tree that stores the p2p devices to use for each namespace and
> a lookup will be done on that before allocating the memory.
> 
> This has some knock on effects of simplifying the requirements
> from the p2pdma code and so we drop all the client list code seeing
> we don't need to worry about selecting a P2P device that works
> with every namespace at once; only each namespace independently.
> Thus, patches 1, 6 and 13 have significant changes. However,
> I think the removal of the client list code will be generally
> appreciated based on the feedback I've gotten from Christian.
> 
> As usual, a git repo of the branch is available here:
> 
> https://github.com/sbates130272/linux-p2pmem pci-p2p-v9
> 
> Thanks,
> 
> Logan
> 
> --
> 
> Changes in v9:
> * Rebased on v4.19-rc6
> * Dropped pci_p2pdma_add_client(), pci_p2pdma_remove_client() and
>   pci_p2pdma_client_list_free(), and reworked pci_p2pdma_distance() and
>   pci_p2pmem_find() to take simple arrays instead of list_heads.
> * Updated the documentation to match
> * Reworked the nvmet code to have a configfs attribute per namespace
>   instead of per port. Then each namespace has its own P2P device
>   stored in a radix-tree in the controller (it can't be stored in
>   the nvme_ns structure seeing it needs to be unique per controller).
> * Dropped the 'sq' argument in nvmet_req_alloc_sgl() seeing I've noticed
>   it is now available in the nvmet_req structure.
> * Collected Keith's reviewed-by tags (however, I have not applied his
>   tag to the last patch seeing I think it has changed too much since
>   his review).
> 
> Changes in v8:
> 
> * Added a couple of comments to address Bart's feedback and
>   removed the bogus call to percpu_ref_switch_to_atomic_sync()
> 
> Changes in v7:
> 
> * Rebased on v4.19-rc5
> 
> * Fixed up commit message of patch 7 that was no longer accurate. (as
>   pointed out by Jens)
> * Change the configfs to not use "auto" or "none" and instead just
>   use a 0/1/ (or boolean). This matches the existing
>   nvme-target configfs booleans. (Per Bjorn)
> * A handful of other minor changes and edits that were noticed by Bjorn
> * Collected Acks from Bjorn
> 
> Changes in v6:
> 
> * Rebased on v4.19-rc3
> 
> * Remove the changes to the common submit_bio() path and instead
>   set REQ_NOMERGE in the NVME target driver, when appropriate.
>   Per discussions with Jens and Christoph.
> 
> * Some minor grammar changes in the documentation as spotted by Randy.
> 
> Changes in v5:
> 
> * Rebased on v4.19-rc1
> 
> * Drop changing ACS settings in this patchset. Now, the code
>   will only allow P2P transactions between devices whose
>   downstream ports do not restrict P2P TLPs.
> 
> * Drop the REQ_PCI_P2PDMA block flag and instead use
>   is_pci_p2pdma_page() to tell if a request is P2P or not. In that
>   case we check for queue support and enforce using REQ_NOMERGE.
>   Per feedback from Christoph.
> 
> * Drop the pci_p2pdma_unmap_sg() function as it was empty and only
>   there for symmetry and compatibility with dma_unmap_sg. Per feedback
>   from Christoph.
> 
> * Split off the logic to handle enabling P2P in NVMe fabrics' configfs
>   into specific helpers in the p2pdma code. Per feedback from Christoph.
> 
> * A number of other minor cleanups and fixes as pointed out by
>   Christoph and others.
> 
> Changes in v4:
> 
> * Change the original upstream_bridges_match() function to
>   upstream_bridge_distance() which calculates the distance between two
>   devices as long as they are behind the same root port. This should
>   address Bjorn's concerns that the code was too focused on
>   being behind a single switch.
> 
> * The disable ACS function now disables ACS for all bridge ports instead
>   of switch ports (ie. those that had two upstream_bridge ports).
> 
> * Change the pci_p2pmem_alloc_sgl() and pci_p2pmem_free_sgl()
>   API to be more like sgl_alloc() in that the alloc function returns
>   the allocated scatterlist and nents is not required by the free
>   function.
> 
> * Moved the new documentation into the driver-api tree as requested
>   by Jonathan
> 
> * Add SGL alloc and free helpers in the nvmet code so that the
>   individual drivers can share the code that allocates P2P memory.
>   As requested by Christoph.
> 
> * Cleanup the nvmet_p2pmem_store() function as Christoph
>   thought my first attempt was ugly.
> 
> * Numerous commit message and comment fix-ups
> 
> Changes in v3:
> 
> * Many more fixes and minor cleanups that were spotted by Bjorn
> 
> * Additional explanation of the ACS change in both the commit message
>   and Kconfig doc. Also, the code that disables the ACS bits 

Re: [PATCH v9 07/13] block: Add PCI P2P flag for request queue and check support for requests

2018-10-10 Thread Bjorn Helgaas
On Fri, Oct 05, 2018 at 07:16:04PM -0600, Jens Axboe wrote:
> On 10/4/18 3:27 PM, Logan Gunthorpe wrote:
> > QUEUE_FLAG_PCI_P2P is introduced meaning a driver's request queue
> > supports targeting P2P memory. This will be used by P2P providers and
> > orchestrators (in subsequent patches) to ensure block devices can
> > support P2P memory before submitting P2P backed pages to submit_bio().
> 
> Nit pick, but the subject line still says that it checks support
> for requests. This patch just adds the ability to flag support
> in the queue.

What about the following?  MAINTAINERS doesn't list a specific
maintainer for include/linux/blkdev.h (maybe the "BLOCK LAYER" entry
should have an F: pattern for it?), but I'd really like your ack
before merging this.

commit 7e647ae1eda290786851c3dff4f38189b982386d
Author: Logan Gunthorpe 
Date:   Thu Oct 4 15:27:41 2018 -0600

block: Add PCI P2P flag for request queue

Add QUEUE_FLAG_PCI_P2P, meaning a driver's request queue supports targeting
P2P memory.  This will be used by P2P providers and orchestrators (in
subsequent patches) to ensure block devices can support P2P memory before
submitting P2P-backed pages to submit_bio().

Signed-off-by: Logan Gunthorpe 
Signed-off-by: Bjorn Helgaas 
Reviewed-by: Christoph Hellwig 

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6980014357d4..c32f7171899b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -699,6 +699,7 @@ struct request_queue {
 #define QUEUE_FLAG_SCSI_PASSTHROUGH 27 /* queue supports SCSI commands */
 #define QUEUE_FLAG_QUIESCED28  /* queue has been quiesced */
 #define QUEUE_FLAG_PREEMPT_ONLY29  /* only process REQ_PREEMPT 
requests */
+#define QUEUE_FLAG_PCI_P2PDMA  30  /* device supports PCI p2p requests */
 
 #define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) |\
 (1 << QUEUE_FLAG_SAME_COMP)|   \
@@ -731,6 +732,8 @@ bool blk_queue_flag_test_and_clear(unsigned int flag, 
struct request_queue *q);
 #define blk_queue_dax(q)   test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
 #define blk_queue_scsi_passthrough(q)  \
test_bit(QUEUE_FLAG_SCSI_PASSTHROUGH, &(q)->queue_flags)
+#define blk_queue_pci_p2pdma(q)\
+   test_bit(QUEUE_FLAG_PCI_P2PDMA, &(q)->queue_flags)
 
 #define blk_noretry_request(rq) \
((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \


Re: [PATCH v9 07/13] block: Add PCI P2P flag for request queue and check support for requests

2018-10-10 Thread Jens Axboe
On 10/10/18 1:59 PM, Bjorn Helgaas wrote:
> On Fri, Oct 05, 2018 at 07:16:04PM -0600, Jens Axboe wrote:
>> On 10/4/18 3:27 PM, Logan Gunthorpe wrote:
>>> QUEUE_FLAG_PCI_P2P is introduced meaning a driver's request queue
>>> supports targeting P2P memory. This will be used by P2P providers and
>>> orchestrators (in subsequent patches) to ensure block devices can
>>> support P2P memory before submitting P2P backed pages to submit_bio().
>>
>> Nit pick, but the subject line still says that it checks support
>> for requests. This patch just adds the ability to flag support
>> in the queue.
> 
> What about the following?  MAINTAINERS doesn't list a specific
> maintainer for include/linux/blkdev.h (maybe the "BLOCK LAYER" entry
> should have an F: pattern for it?), but I'd really like your ack
> before merging this.
> 
> commit 7e647ae1eda290786851c3dff4f38189b982386d
> Author: Logan Gunthorpe 
> Date:   Thu Oct 4 15:27:41 2018 -0600
> 
> block: Add PCI P2P flag for request queue
> 
> Add QUEUE_FLAG_PCI_P2P, meaning a driver's request queue supports 
> targeting
> P2P memory.  This will be used by P2P providers and orchestrators (in
> subsequent patches) to ensure block devices can support P2P memory before
> submitting P2P-backed pages to submit_bio().
> 
> Signed-off-by: Logan Gunthorpe 
> Signed-off-by: Bjorn Helgaas 
> Reviewed-by: Christoph Hellwig 

You can add my acked-by to this one.

-- 
Jens Axboe



Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-10 Thread Michal Hocko
On Wed 10-10-18 10:39:01, Alexander Duyck wrote:
> On 10/10/2018 10:24 AM, Michal Hocko wrote:
[...]
> > I thought I have already made it clear that these zone device hacks are
> > not acceptable to the generic hotplug code. If the current reserve bit
> > handling is not correct then give us a specific reason for that and we
> > can start thinking about the proper fix.
> 
> I might have misunderstood your earlier comment then. I thought you were
> saying that we shouldn't bother with setting the reserved bit. Now it sounds
> like you were thinking more along the lines of what I was here in my comment
> where I thought the bit should be cleared later in some code specifically
> related to DAX when it is exposing it for use to userspace or KVM.

It seems I managed to confuse myself completely. Sorry, it's been a long
day and I am sick so the brain doesn't work all that well. I will get
back to this tomorrow or on Friday with a fresh brain.

My recollection was that we do clear the reserved bit in
move_pfn_range_to_zone and we indeed do in __init_single_page. But then
we set the bit back right afterwards. This seems to be the case since
d0dc12e86b319 which reorganized the code. I have to study this some more
obviously.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-10 Thread Dan Williams
On Wed, Oct 10, 2018 at 10:30 AM Michal Hocko  wrote:
>
> On Wed 10-10-18 09:39:08, Alexander Duyck wrote:
> > On 10/10/2018 2:58 AM, Michal Hocko wrote:
> > > On Tue 09-10-18 13:26:41, Alexander Duyck wrote:
> > > [...]
> > > > I would think with that being the case we still probably need the call 
> > > > to
> > > > __SetPageReserved to set the bit with the expectation that it will not 
> > > > be
> > > > cleared for device-pages since the pages are not onlined. Removing the 
> > > > call
> > > > to __SetPageReserved would probably introduce a number of regressions as
> > > > there are multiple spots that use the reserved bit to determine if a 
> > > > page
> > > > can be swapped out to disk, mapped as system memory, or migrated.
> > >
> > > PageReserved is meant to tell any potential pfn walkers that might get
> > > to this struct page to back off and not touch it. Even though
> > > ZONE_DEVICE doesn't online pages in traditional sense it makes those
> > > pages available for further use so the page reserved bit should be
> > > cleared.
> >
> > So from what I can tell that isn't necessarily the case. Specifically if the
> > pagemap type is MEMORY_DEVICE_PRIVATE or MEMORY_DEVICE_PUBLIC both are
> > special cases where the memory may not be accessible to the CPU or cannot be
> > pinned in order to allow for eviction.
>
> Could you give me an example please?
>
> > The specific case that Dan and Yi are referring to is for the type
> > MEMORY_DEVICE_FS_DAX. For that type I could probably look at not setting the
> > reserved bit. Part of me wants to say that we should wait and clear the bit
> > later, but that would end up just adding time back to initialization. At
> > this point I would consider the change more of a follow-up optimization
> > rather than a fix though since this is tailoring things specifically for DAX
> > versus the other ZONE_DEVICE types.
>
> I thought I have already made it clear that these zone device hacks are
> not acceptable to the generic hotplug code. If the current reserve bit
> handling is not correct then give us a specific reason for that and we
> can start thinking about the proper fix.
>

Right, so we're in a situation where a hack is needed for KVM's
current interpretation of the Reserved flag relative to dax mapped
pages. I'm arguing to push that knowledge / handling as deep as
possible into the core rather than hack the leaf implementations like
KVM, i.e. disable the Reserved flag for all non-MEMORY_DEVICE_*
ZONE_DEVICE types.

Here is the KVM thread about why they need a change:

https://lkml.org/lkml/2018/9/7/552

...and where I pushed back on a KVM-local hack:

https://lkml.org/lkml/2018/9/19/154


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-10 Thread Alexander Duyck

On 10/10/2018 10:53 AM, Michal Hocko wrote:

On Wed 10-10-18 10:39:01, Alexander Duyck wrote:



On 10/10/2018 10:24 AM, Michal Hocko wrote:

On Wed 10-10-18 09:39:08, Alexander Duyck wrote:

On 10/10/2018 2:58 AM, Michal Hocko wrote:

On Tue 09-10-18 13:26:41, Alexander Duyck wrote:
[...]

I would think with that being the case we still probably need the call to
__SetPageReserved to set the bit with the expectation that it will not be
cleared for device-pages since the pages are not onlined. Removing the call
to __SetPageReserved would probably introduce a number of regressions as
there are multiple spots that use the reserved bit to determine if a page
can be swapped out to disk, mapped as system memory, or migrated.


PageReserved is meant to tell any potential pfn walkers that might get
to this struct page to back off and not touch it. Even though
ZONE_DEVICE doesn't online pages in traditional sense it makes those
pages available for further use so the page reserved bit should be
cleared.


So from what I can tell that isn't necessarily the case. Specifically if the
pagemap type is MEMORY_DEVICE_PRIVATE or MEMORY_DEVICE_PUBLIC both are
special cases where the memory may not be accessible to the CPU or cannot be
pinned in order to allow for eviction.


Could you give me an example please?


Honestly I am getting a bit beyond my depth here so maybe Dan could explain
better. I am basing the above comment on Dan's earlier comment in this
thread combined with the comment that explains the "memory_type" field for
the pgmap:
https://elixir.bootlin.com/linux/v4.19-rc7/source/include/linux/memremap.h#L28


The specific case that Dan and Yi are referring to is for the type
MEMORY_DEVICE_FS_DAX. For that type I could probably look at not setting the
reserved bit. Part of me wants to say that we should wait and clear the bit
later, but that would end up just adding time back to initialization. At
this point I would consider the change more of a follow-up optimization
rather than a fix though since this is tailoring things specifically for DAX
versus the other ZONE_DEVICE types.


I thought I have already made it clear that these zone device hacks are
not acceptable to the generic hotplug code. If the current reserve bit
handling is not correct then give us a specific reason for that and we
can start thinking about the proper fix.


I might have misunderstood your earlier comment then. I thought you were
saying that we shouldn't bother with setting the reserved bit. Now it sounds
like you were thinking more along the lines of what I was here in my comment
where I thought the bit should be cleared later in some code specifically
related to DAX when it is exposing it for use to userspace or KVM.


I was referring to my earlier comment that if you need to do something
about struct page initialization (move_pfn_range_to_zone) outside of the
lock (with the appropriate ground work that is needed) rather than
pulling more zone device hacks into the generic hotplug code [1]

[1] http://lkml.kernel.org/r/20180926075540.gd6...@dhcp22.suse.cz


The only issue is if we don't pull the code together we are looking at a 
massive increase in initialization times. So for example just the loop 
that was originally there that was setting the pgmap and resetting the 
LRU prev pointer was adding an additional 20+ seconds per node with 3TB 
allocated per node. That essentially doubles the initialization time.


How would you recommend I address that? Currently it is a few extra 
lines in the memmap_init_zone_device function. Eventually I was planning 
to combine the memmap_init_zone hoplug functionality and 
memmap_init_zone_device core functionality into a single function called 
__init_pageblock[1] and then reuse that functionality for deferred page 
init as well.


[1] 
http://lkml.kernel.org/r/20181005151224.17473.53398.stgit@localhost.localdomain




Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-10 Thread Michal Hocko
On Wed 10-10-18 10:39:01, Alexander Duyck wrote:
> 
> 
> On 10/10/2018 10:24 AM, Michal Hocko wrote:
> > On Wed 10-10-18 09:39:08, Alexander Duyck wrote:
> > > On 10/10/2018 2:58 AM, Michal Hocko wrote:
> > > > On Tue 09-10-18 13:26:41, Alexander Duyck wrote:
> > > > [...]
> > > > > I would think with that being the case we still probably need the 
> > > > > call to
> > > > > __SetPageReserved to set the bit with the expectation that it will 
> > > > > not be
> > > > > cleared for device-pages since the pages are not onlined. Removing 
> > > > > the call
> > > > > to __SetPageReserved would probably introduce a number of regressions 
> > > > > as
> > > > > there are multiple spots that use the reserved bit to determine if a 
> > > > > page
> > > > > can be swapped out to disk, mapped as system memory, or migrated.
> > > > 
> > > > PageReserved is meant to tell any potential pfn walkers that might get
> > > > to this struct page to back off and not touch it. Even though
> > > > ZONE_DEVICE doesn't online pages in traditional sense it makes those
> > > > pages available for further use so the page reserved bit should be
> > > > cleared.
> > > 
> > > So from what I can tell that isn't necessarily the case. Specifically if 
> > > the
> > > pagemap type is MEMORY_DEVICE_PRIVATE or MEMORY_DEVICE_PUBLIC both are
> > > special cases where the memory may not be accessible to the CPU or cannot 
> > > be
> > > pinned in order to allow for eviction.
> > 
> > Could you give me an example please?
> 
> Honestly I am getting a bit beyond my depth here so maybe Dan could explain
> better. I am basing the above comment on Dan's earlier comment in this
> thread combined with the comment that explains the "memory_type" field for
> the pgmap:
> https://elixir.bootlin.com/linux/v4.19-rc7/source/include/linux/memremap.h#L28
> 
> > > The specific case that Dan and Yi are referring to is for the type
> > > MEMORY_DEVICE_FS_DAX. For that type I could probably look at not setting 
> > > the
> > > reserved bit. Part of me wants to say that we should wait and clear the 
> > > bit
> > > later, but that would end up just adding time back to initialization. At
> > > this point I would consider the change more of a follow-up optimization
> > > rather than a fix though since this is tailoring things specifically for 
> > > DAX
> > > versus the other ZONE_DEVICE types.
> > 
> > I thought I have already made it clear that these zone device hacks are
> > not acceptable to the generic hotplug code. If the current reserve bit
> > handling is not correct then give us a specific reason for that and we
> > can start thinking about the proper fix.
> 
> I might have misunderstood your earlier comment then. I thought you were
> saying that we shouldn't bother with setting the reserved bit. Now it sounds
> like you were thinking more along the lines of what I was here in my comment
> where I thought the bit should be cleared later in some code specifically
> related to DAX when it is exposing it for use to userspace or KVM.

I was referring to my earlier comment that if you need to do something
about struct page initialization (move_pfn_range_to_zone) outside of the
lock (with the appropriate ground work that is needed) rather than
pulling more zone device hacks into the generic hotplug code [1]

[1] http://lkml.kernel.org/r/20180926075540.gd6...@dhcp22.suse.cz

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-10 Thread Alexander Duyck




On 10/10/2018 10:24 AM, Michal Hocko wrote:

On Wed 10-10-18 09:39:08, Alexander Duyck wrote:

On 10/10/2018 2:58 AM, Michal Hocko wrote:

On Tue 09-10-18 13:26:41, Alexander Duyck wrote:
[...]

I would think with that being the case we still probably need the call to
__SetPageReserved to set the bit with the expectation that it will not be
cleared for device-pages since the pages are not onlined. Removing the call
to __SetPageReserved would probably introduce a number of regressions as
there are multiple spots that use the reserved bit to determine if a page
can be swapped out to disk, mapped as system memory, or migrated.


PageReserved is meant to tell any potential pfn walkers that might get
to this struct page to back off and not touch it. Even though
ZONE_DEVICE doesn't online pages in traditional sense it makes those
pages available for further use so the page reserved bit should be
cleared.


So from what I can tell that isn't necessarily the case. Specifically if the
pagemap type is MEMORY_DEVICE_PRIVATE or MEMORY_DEVICE_PUBLIC both are
special cases where the memory may not be accessible to the CPU or cannot be
pinned in order to allow for eviction.


Could you give me an example please?


Honestly I am getting a bit beyond my depth here so maybe Dan could 
explain better. I am basing the above comment on Dan's earlier comment 
in this thread combined with the comment that explains the "memory_type" 
field for the pgmap:

https://elixir.bootlin.com/linux/v4.19-rc7/source/include/linux/memremap.h#L28


The specific case that Dan and Yi are referring to is for the type
MEMORY_DEVICE_FS_DAX. For that type I could probably look at not setting the
reserved bit. Part of me wants to say that we should wait and clear the bit
later, but that would end up just adding time back to initialization. At
this point I would consider the change more of a follow-up optimization
rather than a fix though since this is tailoring things specifically for DAX
versus the other ZONE_DEVICE types.


I thought I have already made it clear that these zone device hacks are
not acceptable to the generic hotplug code. If the current reserve bit
handling is not correct then give us a specific reason for that and we
can start thinking about the proper fix.


I might have misunderstood your earlier comment then. I thought you were 
saying that we shouldn't bother with setting the reserved bit. Now it 
sounds like you were thinking more along the lines of what I was here in 
my comment where I thought the bit should be cleared later in some code 
specifically related to DAX when it is exposing it for use to userspace 
or KVM.



Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-10 Thread Michal Hocko
On Wed 10-10-18 09:39:08, Alexander Duyck wrote:
> On 10/10/2018 2:58 AM, Michal Hocko wrote:
> > On Tue 09-10-18 13:26:41, Alexander Duyck wrote:
> > [...]
> > > I would think with that being the case we still probably need the call to
> > > __SetPageReserved to set the bit with the expectation that it will not be
> > > cleared for device-pages since the pages are not onlined. Removing the 
> > > call
> > > to __SetPageReserved would probably introduce a number of regressions as
> > > there are multiple spots that use the reserved bit to determine if a page
> > > can be swapped out to disk, mapped as system memory, or migrated.
> > 
> > PageReserved is meant to tell any potential pfn walkers that might get
> > to this struct page to back off and not touch it. Even though
> > ZONE_DEVICE doesn't online pages in traditional sense it makes those
> > pages available for further use so the page reserved bit should be
> > cleared.
> 
> So from what I can tell that isn't necessarily the case. Specifically if the
> pagemap type is MEMORY_DEVICE_PRIVATE or MEMORY_DEVICE_PUBLIC both are
> special cases where the memory may not be accessible to the CPU or cannot be
> pinned in order to allow for eviction.

Could you give me an example please?

> The specific case that Dan and Yi are referring to is for the type
> MEMORY_DEVICE_FS_DAX. For that type I could probably look at not setting the
> reserved bit. Part of me wants to say that we should wait and clear the bit
> later, but that would end up just adding time back to initialization. At
> this point I would consider the change more of a follow-up optimization
> rather than a fix though since this is tailoring things specifically for DAX
> versus the other ZONE_DEVICE types.

I thought I have already made it clear that these zone device hacks are
not acceptable to the generic hotplug code. If the current reserve bit
handling is not correct then give us a specific reason for that and we
can start thinking about the proper fix.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-10 Thread Alexander Duyck

On 10/10/2018 2:58 AM, Michal Hocko wrote:

On Tue 09-10-18 13:26:41, Alexander Duyck wrote:
[...]

I would think with that being the case we still probably need the call to
__SetPageReserved to set the bit with the expectation that it will not be
cleared for device-pages since the pages are not onlined. Removing the call
to __SetPageReserved would probably introduce a number of regressions as
there are multiple spots that use the reserved bit to determine if a page
can be swapped out to disk, mapped as system memory, or migrated.


PageReserved is meant to tell any potential pfn walkers that might get
to this struct page to back off and not touch it. Even though
ZONE_DEVICE doesn't online pages in traditional sense it makes those
pages available for further use so the page reserved bit should be
cleared.


So from what I can tell that isn't necessarily the case. Specifically if 
the pagemap type is MEMORY_DEVICE_PRIVATE or MEMORY_DEVICE_PUBLIC both 
are special cases where the memory may not be accessible to the CPU or 
cannot be pinned in order to allow for eviction.


The specific case that Dan and Yi are referring to is for the type 
MEMORY_DEVICE_FS_DAX. For that type I could probably look at not setting 
the reserved bit. Part of me wants to say that we should wait and clear 
the bit later, but that would end up just adding time back to 
initialization. At this point I would consider the change more of a 
follow-up optimization rather than a fix though since this is tailoring 
things specifically for DAX versus the other ZONE_DEVICE types.




Re: [PATCH 1/2] powerpc/pseries: PAPR persistent memory support

2018-10-10 Thread Nathan Fontenot
On 10/10/2018 01:08 AM, Oliver O'Halloran wrote:
> This patch implements support for discovering storage class memory
> devices at boot and for handling hotplug of new regions via RTAS
> hotplug events.
> 
> Signed-off-by: Oliver O'Halloran 
> ---
>  arch/powerpc/include/asm/firmware.h   |  3 ++-
>  arch/powerpc/include/asm/hvcall.h | 10 +-
>  arch/powerpc/include/asm/rtas.h   |  2 ++
>  arch/powerpc/kernel/rtasd.c   |  2 ++
>  arch/powerpc/platforms/pseries/Makefile   |  2 +-
>  arch/powerpc/platforms/pseries/dlpar.c|  4 
>  arch/powerpc/platforms/pseries/firmware.c |  1 +
>  arch/powerpc/platforms/pseries/pseries.h  |  5 +
>  arch/powerpc/platforms/pseries/ras.c  |  3 ++-
>  9 files changed, 28 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/firmware.h 
> b/arch/powerpc/include/asm/firmware.h
> index 7a051bd21f87..113c64d5d394 100644
> --- a/arch/powerpc/include/asm/firmware.h
> +++ b/arch/powerpc/include/asm/firmware.h
> @@ -52,6 +52,7 @@
>  #define FW_FEATURE_PRRN  ASM_CONST(0x0002)
>  #define FW_FEATURE_DRMEM_V2  ASM_CONST(0x0004)
>  #define FW_FEATURE_DRC_INFO  ASM_CONST(0x0008)
> +#define FW_FEATURE_PAPR_SCM  ASM_CONST(0x0010)
> 
>  #ifndef __ASSEMBLY__
> 
> @@ -69,7 +70,7 @@ enum {
>   FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY |
>   FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
>   FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
> - FW_FEATURE_DRC_INFO,
> + FW_FEATURE_DRC_INFO | FW_FEATURE_PAPR_SCM,
>   FW_FEATURE_PSERIES_ALWAYS = 0,
>   FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL,
>   FW_FEATURE_POWERNV_ALWAYS = 0,
> diff --git a/arch/powerpc/include/asm/hvcall.h 
> b/arch/powerpc/include/asm/hvcall.h
> index a0b17f9f1ea4..0e81ef83b35a 100644
> --- a/arch/powerpc/include/asm/hvcall.h
> +++ b/arch/powerpc/include/asm/hvcall.h
> @@ -295,7 +295,15 @@
>  #define H_INT_ESB   0x3C8
>  #define H_INT_SYNC  0x3CC
>  #define H_INT_RESET 0x3D0
> -#define MAX_HCALL_OPCODE H_INT_RESET
> +#define H_SCM_READ_METADATA 0x3E4
> +#define H_SCM_WRITE_METADATA0x3E8
> +#define H_SCM_BIND_MEM  0x3EC
> +#define H_SCM_UNBIND_MEM0x3F0
> +#define H_SCM_QUERY_BLOCK_MEM_BINDING 0x3F4
> +#define H_SCM_QUERY_LOGICAL_MEM_BINDING 0x3F8
> +#define H_SCM_MEM_QUERY  0x3FC
> +#define H_SCM_BLOCK_CLEAR   0x400
> +#define MAX_HCALL_OPCODE H_SCM_BLOCK_CLEAR
> 
>  /* H_VIOCTL functions */
>  #define H_GET_VIOA_DUMP_SIZE 0x01
> diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
> index 71e393c46a49..1e81f3d55457 100644
> --- a/arch/powerpc/include/asm/rtas.h
> +++ b/arch/powerpc/include/asm/rtas.h
> @@ -125,6 +125,7 @@ struct rtas_suspend_me_data {
>  #define RTAS_TYPE_INFO   0xE2
>  #define RTAS_TYPE_DEALLOC0xE3
>  #define RTAS_TYPE_DUMP   0xE4
> +#define RTAS_TYPE_HOTPLUG0xE5
>  /* I don't add PowerMGM events right now, this is a different topic */ 
>  #define RTAS_TYPE_PMGM_POWER_SW_ON   0x60
>  #define RTAS_TYPE_PMGM_POWER_SW_OFF  0x61
> @@ -316,6 +317,7 @@ struct pseries_hp_errorlog {
>  #define PSERIES_HP_ELOG_RESOURCE_MEM 2
>  #define PSERIES_HP_ELOG_RESOURCE_SLOT3
>  #define PSERIES_HP_ELOG_RESOURCE_PHB 4
> +#define PSERIES_HP_ELOG_RESOURCE_PMEM   6
> 
>  #define PSERIES_HP_ELOG_ACTION_ADD   1
>  #define PSERIES_HP_ELOG_ACTION_REMOVE2
> diff --git a/arch/powerpc/kernel/rtasd.c b/arch/powerpc/kernel/rtasd.c
> index 6fafc82c04b0..fad0baddfcba 100644
> --- a/arch/powerpc/kernel/rtasd.c
> +++ b/arch/powerpc/kernel/rtasd.c
> @@ -91,6 +91,8 @@ static char *rtas_event_type(int type)
>   return "Dump Notification Event";
>   case RTAS_TYPE_PRRN:
>   return "Platform Resource Reassignment Event";
> + case RTAS_TYPE_HOTPLUG:
> + return "Hotplug Event";
>   }
> 
>   return rtas_type[0];
> diff --git a/arch/powerpc/platforms/pseries/Makefile 
> b/arch/powerpc/platforms/pseries/Makefile
> index 7e89d5c47068..892b27ced973 100644
> --- a/arch/powerpc/platforms/pseries/Makefile
> +++ b/arch/powerpc/platforms/pseries/Makefile
> @@ -13,7 +13,7 @@ obj-$(CONFIG_KEXEC_CORE)+= kexec.o
>  obj-$(CONFIG_PSERIES_ENERGY) += pseries_energy.o
> 
>  obj-$(CONFIG_HOTPLUG_CPU)+= hotplug-cpu.o
> -obj-$(CONFIG_MEMORY_HOTPLUG) += hotplug-memory.o
> +obj-$(CONFIG_MEMORY_HOTPLUG) += hotplug-memory.o pmem.o
> 
>  obj-$(CONFIG_HVC_CONSOLE)+= hvconsole.o
>  obj-$(CONFIG_HVCS)   += hvcserver.o
> diff --git a/arch/powerpc/platforms/pseries/dlpar.c 
> b/arch/powerpc/platforms/pseries/dlpar.c
> index a0b20c03f078..795996fefdb9 100644
> --- a/arch/powerpc/platforms/pseries/dlpar.c
> +++ b/arch/powerpc/platforms/pseries/dlpar.c
> @@ -357,6 +357,10 @@ static int 

Re: [PATCH v12 00/12] Adding security support for nvdimm

2018-10-10 Thread Dave Jiang



On 10/09/2018 06:35 PM, Williams, Dan J wrote:
> On Mon, 2018-10-08 at 16:55 -0700, Dave Jiang wrote:
>> The following series implements security support for nvdimm. Mostly
>> adding
>> new security DSM support from the Intel NVDIMM DSM spec v1.7, but
>> also
>> adding generic support libnvdimm for other vendors. The most
>> important
>> security features are unlocking locked nvdimms, and updating/setting
>> security
>> passphrase to nvdimms.
>>
>> v12:
>> - Add a mutex for the cached key and remove key_get/key_put messiness
>> (Dan)
>> - Move security code to its own C file and wrap under
>> CONFIG_NVDIMM_SECURITY
>>   in order to fix issue reported by 0-day build without CONFIG_KEYS.
> 
> Going over this a bit more closely here is some proposed cleanups /
> fixes:
> 
> * remove nvdimm_get_key() and just rely on nvdimm->key being not NULL
> under the key_mutex
> 
> * open code nvdimm_check_key_len() checks
> 
> * make all the security op function signatures take an nvdimm rather
> than a dev
> 
> * move the release of nvdimm->key to nvdimm_remove() rather than
> nvdimm_release(), i.e. driver unload vs nvdimm_bus de-registration.
> 
> * rename nvdimm_replace_key to make_kernel_key
> 
> * clean up nvdimm_security_change_key() to a be bit more readable and
> fix a missing key_put()
> 
> 
> Let me know if these look acceptable and I can fold them into the patch
> set.

Looks good. Thanks!

> 
> diff --git a/drivers/nvdimm/dimm.c b/drivers/nvdimm/dimm.c
> index b6381ddbd6c1..429c544e7ac5 100644
> --- a/drivers/nvdimm/dimm.c
> +++ b/drivers/nvdimm/dimm.c
> @@ -23,6 +23,7 @@
>  
>  static int nvdimm_probe(struct device *dev)
>  {
> + struct nvdimm *nvdimm = to_nvdimm(dev);
>   struct nvdimm_drvdata *ndd;
>   int rc;
>  
> @@ -51,10 +52,10 @@ static int nvdimm_probe(struct device *dev)
>   get_device(dev);
>   kref_init(&ndd->kref);
>  
> - nvdimm_security_get_state(dev);
> + nvdimm_security_get_state(nvdimm);
>  
>   /* unlock DIMM here before touch label */
> - rc = nvdimm_security_unlock_dimm(dev);
> + rc = nvdimm_security_unlock_dimm(nvdimm);
>   if (rc < 0)
>   dev_warn(dev, "failed to unlock dimm %s\n", dev_name(dev));
>  
> @@ -115,6 +116,9 @@ static int nvdimm_probe(struct device *dev)
>  static int nvdimm_remove(struct device *dev)
>  {
>   struct nvdimm_drvdata *ndd = dev_get_drvdata(dev);
> + struct nvdimm *nvdimm = to_nvdimm(dev);
> +
> + nvdimm_security_release(nvdimm);
>  
>   if (!ndd)
>   return 0;
> diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
> index 2d9915378bbc..84bec3bb025e 100644
> --- a/drivers/nvdimm/dimm_devs.c
> +++ b/drivers/nvdimm/dimm_devs.c
> @@ -213,13 +213,7 @@ void nvdimm_clear_locked(struct device *dev)
>  static void nvdimm_release(struct device *dev)
>  {
>   struct nvdimm *nvdimm = to_nvdimm(dev);
> - struct key *key;
>  
> - mutex_lock(&nvdimm->key_mutex);
> - key = nvdimm_get_key(dev);
> - if (key)
> - key_put(key);
> - mutex_unlock(&nvdimm->key_mutex);
> ida_simple_remove(&dimm_ida, nvdimm->id);
>   kfree(nvdimm);
>  }
> @@ -436,20 +430,20 @@ static ssize_t security_store(struct device *dev,
>   if (rc != 3)
>   return -EINVAL;
>   dev_dbg(dev, "update %#x %#x\n", old_key, new_key);
> - rc = nvdimm_security_change_key(dev, old_key, new_key);
> + rc = nvdimm_security_change_key(nvdimm, old_key, new_key);
>   } else if (sysfs_streq(cmd, "disable")) {
>   if (rc != 2)
>   return -EINVAL;
>   dev_dbg(dev, "disable %#x\n", old_key);
> - rc = nvdimm_security_disable(dev, old_key);
> + rc = nvdimm_security_disable(nvdimm, old_key);
>   } else if (sysfs_streq(buf, "freeze")) {
>   dev_dbg(dev, "freeze\n");
> - rc = nvdimm_security_freeze_lock(dev);
> + rc = nvdimm_security_freeze_lock(nvdimm);
>   } else if (sysfs_streq(cmd, "erase")) {
>   if (rc != 2)
>   return -EINVAL;
>   dev_dbg(dev, "erase %#x\n", old_key);
> - rc = nvdimm_security_erase(dev, old_key);
> + rc = nvdimm_security_erase(nvdimm, old_key);
>   } else
>   return -EINVAL;
>  
> diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
> index 08e442632a2d..45fcaf45ba2c 100644
> --- a/drivers/nvdimm/nd.h
> +++ b/drivers/nvdimm/nd.h
> @@ -425,48 +425,47 @@ const u8 *nd_dev_to_uuid(struct device *dev);
>  bool pmem_should_map_pages(struct device *dev);
>  
>  #ifdef CONFIG_NVDIMM_SECURITY
> -struct key *nvdimm_get_key(struct device *dev);
> -int nvdimm_security_unlock_dimm(struct device *dev);
> -int nvdimm_security_get_state(struct device *dev);
> -int nvdimm_security_change_key(struct device *dev, unsigned int old_keyid,
> +int nvdimm_security_unlock_dimm(struct nvdimm *nvdimm);
> +void nvdimm_security_release(struct nvdimm 

Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-10 Thread Alexander Duyck




On 10/10/2018 5:52 AM, Yi Zhang wrote:

On 2018-10-09 at 14:19:32 -0700, Dan Williams wrote:

On Tue, Oct 9, 2018 at 1:34 PM Alexander Duyck
 wrote:


On 10/9/2018 11:04 AM, Dan Williams wrote:

On Tue, Oct 9, 2018 at 3:21 AM Yi Zhang  wrote:

[..]

That comment is incorrect, device-pages are never onlined. So I think
we can just skip that call to __SetPageReserved() unless the memory
range is MEMORY_DEVICE_{PRIVATE,PUBLIC}.



When pages are "onlined" via __free_pages_boot_core they clear the
reserved bit, that is the reason for the comment. The reserved bit is
meant to indicate that the page cannot be swapped out or moved based on
the description of the bit.


...but ZONE_DEVICE pages are never onlined so I would expect
memmap_init_zone_device() to know that detail.


I would think with that being the case we still probably need the call
to __SetPageReserved to set the bit with the expectation that it will
not be cleared for device-pages since the pages are not onlined.
Removing the call to __SetPageReserved would probably introduce a number
of regressions as there are multiple spots that use the reserved bit to
determine if a page can be swapped out to disk, mapped as system memory,
or migrated.


Another thing: it seems the page init / SetPageReserved has already been
done in move_pfn_range_to_zone:
 |-->memmap_init_zone
|-->for_each_page_in_pfn
|-->__init_single_page
|-->SetPageReserved

Why haven't we removed this redundant initialization from memmap_init_zone?

Correct me if I missed something.


In this case it isn't redundant, as only the vmemmap pages are 
initialized in memmap_init_zone now. So all of the pages that are going 
to be used as device pages are not initialized until the call to 
memmap_init_zone_device. What I did is split the initialization of the 
pages into two parts in order to allow us to initialize the pages 
outside of the hotplug lock.




Right, this is what Yi is working on... the PageReserved flag is
problematic for KVM. Auditing those locations it seems as long as we
teach hibernation to avoid ZONE_DEVICE ranges we can safely not set
the reserved flag for DAX pages. What I'm trying to avoid is a local
KVM hack to check for DAX pages when the Reserved flag is not
otherwise needed.

Thanks Dan. Here is the patch link:

https://lore.kernel.org/lkml/cover.1536342881.git.yi.z.zh...@linux.intel.com


So it looks like your current logic is just working around the bit, then, 
since it allows for reserved DAX pages.





Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-10 Thread Michal Hocko
On Tue 09-10-18 13:26:41, Alexander Duyck wrote:
[...]
> I would think with that being the case we still probably need the call to
> __SetPageReserved to set the bit with the expectation that it will not be
> cleared for device-pages since the pages are not onlined. Removing the call
> to __SetPageReserved would probably introduce a number of regressions as
> there are multiple spots that use the reserved bit to determine if a page
> can be swapped out to disk, mapped as system memory, or migrated.

PageReserved is meant to tell any potential pfn walkers that might get
to this struct page to back off and not touch it. Even though
ZONE_DEVICE doesn't online pages in traditional sense it makes those
pages available for further use so the page reserved bit should be
cleared.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: Preserve _PAGE_DEVMAP across mprotect() calls

2018-10-10 Thread Jan Kara
On Tue 09-10-18 10:55:14, Dan Williams wrote:
> On Tue, Oct 9, 2018 at 3:19 AM Jan Kara  wrote:
> >
> > Currently _PAGE_DEVMAP bit is not preserved in mprotect(2) calls. As a
> > result we will see warnings such as:
> >
> > BUG: Bad page map in process JobWrk0013  pte:81803875ea25 pmd:7624381067
> > addr:7f093072 vm_flags:28f9 anon_vma:  (null) 
> > mapping:97f2384056f0 index:0
> > file:457-00fe0030-0009-00ca-0001_2001.fileblock 
> > fault:xfs_filemap_fault [xfs] mmap:xfs_file_mmap [xfs] readpage:  
> > (null)
> > CPU: 3 PID: 15848 Comm: JobWrk0013 Tainted: GW  
> > 4.12.14-2.g7573215-default #1 SLE12-SP4 (unreleased)
> > Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS 
> > SE5C620.86B.01.00.0833.051120182255 05/11/2018
> > Call Trace:
> >  dump_stack+0x5a/0x75
> >  print_bad_pte+0x217/0x2c0
> >  ? enqueue_task_fair+0x76/0x9f0
> >  _vm_normal_page+0xe5/0x100
> >  zap_pte_range+0x148/0x740
> >  unmap_page_range+0x39a/0x4b0
> >  unmap_vmas+0x42/0x90
> >  unmap_region+0x99/0xf0
> >  ? vma_gap_callbacks_rotate+0x1a/0x20
> >  do_munmap+0x255/0x3a0
> >  vm_munmap+0x54/0x80
> >  SyS_munmap+0x1d/0x30
> >  do_syscall_64+0x74/0x150
> >  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > ...
> >
> > when mprotect(2) gets used on DAX mappings. Also there is a wide variety
> > of other failures that can result from the missing _PAGE_DEVMAP flag
> > when the area gets used by get_user_pages() later.
> >
> > Fix the problem by including _PAGE_DEVMAP in a set of flags that get
> > preserved by mprotect(2).
> >
> > Fixes: 69660fd797c3 ("x86, mm: introduce _PAGE_DEVMAP")
> > Fixes: ebd31197931d ("powerpc/mm: Add devmap support for ppc64")
> > CC: sta...@vger.kernel.org
> > Signed-off-by: Jan Kara 
> 
> Looks good, do you want me to take this upstream along with the livelock fix?

Yes, I think that would be best. Thanks!

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap

2018-10-10 Thread Yi Zhang
On 2018-10-09 at 14:19:32 -0700, Dan Williams wrote:
> On Tue, Oct 9, 2018 at 1:34 PM Alexander Duyck
>  wrote:
> >
> > On 10/9/2018 11:04 AM, Dan Williams wrote:
> > > On Tue, Oct 9, 2018 at 3:21 AM Yi Zhang  
> > > wrote:
> [..]
> > > That comment is incorrect, device-pages are never onlined. So I think
> > > we can just skip that call to __SetPageReserved() unless the memory
> > > range is MEMORY_DEVICE_{PRIVATE,PUBLIC}.
> > >
> >
> > When pages are "onlined" via __free_pages_boot_core they clear the
> > reserved bit, that is the reason for the comment. The reserved bit is
> > meant to indicate that the page cannot be swapped out or moved based on
> > the description of the bit.
> 
> ...but ZONE_DEVICE pages are never onlined so I would expect
> memmap_init_zone_device() to know that detail.
> 
> > I would think with that being the case we still probably need the call
> > to __SetPageReserved to set the bit with the expectation that it will
> > not be cleared for device-pages since the pages are not onlined.
> > Removing the call to __SetPageReserved would probably introduce a number
> > of regressions as there are multiple spots that use the reserved bit to
> > determine if a page can be swapped out to disk, mapped as system memory,
> > or migrated.

Another thing: it seems the page init / SetPageReserved has already been
done in move_pfn_range_to_zone:
|-->memmap_init_zone
|-->for_each_page_in_pfn
|-->__init_single_page
|-->SetPageReserved

Why haven't we removed this redundant initialization from memmap_init_zone?

Correct me if I missed something.

> 
> Right, this is what Yi is working on... the PageReserved flag is
> problematic for KVM. Auditing those locations it seems as long as we
> teach hibernation to avoid ZONE_DEVICE ranges we can safely not set
> the reserved flag for DAX pages. What I'm trying to avoid is a local
> KVM hack to check for DAX pages when the Reserved flag is not
> otherwise needed.
Thanks Dan. Here is the patch link:

https://lore.kernel.org/lkml/cover.1536342881.git.yi.z.zh...@linux.intel.com





PAPR SCM support

2018-10-10 Thread Oliver O'Halloran
This series adds support for the para-virtualised storage class memory
interface defined by the Power Architecture Platform Reference.

Patch 1 implements the pseries device discovery (via DT) and hotplug
support (via RTAS hotplug interrupt).

Patch 2 implements a driver that binds to the platform devices, does the
necessary H-calls to activate the region and configures the libnvdimm
interface. It also adds an NDCTL implementation to allow the "metadata"
space associated with an SCM region to be used as label space.

This should go in via the ppc tree, but an ack from Dan for patch two
would be appreciated.



[PATCH 2/2] powerpc/pseries: Add driver for PAPR SCM regions

2018-10-10 Thread Oliver O'Halloran
Adds a driver that implements support for enabling and accessing PAPR
SCM regions. Unfortunately due to how the PAPR interface works we can't
use the existing of_pmem driver (yet) because:

 a) The guest is required to use the H_SCM_BIND_MEM h-call to add
the SCM region to its physical address space, and
 b) There is currently no mechanism for relating a bare of_pmem region
to the backing DIMM (or not-a-DIMM for our case).

Both of these are easily handled by rolling the functionality into a
separate driver, so here we are...

Signed-off-by: Oliver O'Halloran 
---
The alternative implementation here is that we have the pseries code
do the h-calls and craft a pmem-region@ node based on that.
However, that doesn't solve b) and mpe has expressed his dislike of
adding new stuff to the DT at runtime, so I'd say that's a non-starter.
---
 arch/powerpc/platforms/pseries/Kconfig|   7 +
 arch/powerpc/platforms/pseries/Makefile   |   1 +
 arch/powerpc/platforms/pseries/papr_scm.c | 340 ++
 3 files changed, 348 insertions(+)
 create mode 100644 arch/powerpc/platforms/pseries/papr_scm.c

diff --git a/arch/powerpc/platforms/pseries/Kconfig 
b/arch/powerpc/platforms/pseries/Kconfig
index 0c698fd6d491..4b0fcb80efe5 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -140,3 +140,10 @@ config IBMEBUS
bool "Support for GX bus based adapters"
help
  Bus device driver for GX bus based adapters.
+
+config PAPR_SCM
+   depends on PPC_PSERIES && MEMORY_HOTPLUG
+   select LIBNVDIMM
+   tristate "Support for the PAPR Storage Class Memory interface"
+   help
+ Enable access to hypervisor provided storage class memory.
diff --git a/arch/powerpc/platforms/pseries/Makefile 
b/arch/powerpc/platforms/pseries/Makefile
index 892b27ced973..a43ec843c8e2 100644
--- a/arch/powerpc/platforms/pseries/Makefile
+++ b/arch/powerpc/platforms/pseries/Makefile
@@ -24,6 +24,7 @@ obj-$(CONFIG_IO_EVENT_IRQ)+= io_event_irq.o
 obj-$(CONFIG_LPARCFG)  += lparcfg.o
 obj-$(CONFIG_IBMVIO)   += vio.o
 obj-$(CONFIG_IBMEBUS)  += ibmebus.o
+obj-$(CONFIG_PAPR_SCM) += papr_scm.o
 
 ifdef CONFIG_PPC_PSERIES
 obj-$(CONFIG_SUSPEND)  += suspend.o
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
new file mode 100644
index ..87d4dbc18845
--- /dev/null
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -0,0 +1,340 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define pr_fmt(fmt)"papr-scm: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#define BIND_ANY_ADDR (~0ul)
+
+#define PAPR_SCM_DIMM_CMD_MASK \
+   ((1ul << ND_CMD_GET_CONFIG_SIZE) | \
+(1ul << ND_CMD_GET_CONFIG_DATA) | \
+(1ul << ND_CMD_SET_CONFIG_DATA))
+
+struct papr_scm_priv {
+   struct platform_device *pdev;
+   struct device_node *dn;
+   uint32_t drc_index;
+   uint64_t blocks;
+   uint64_t block_size;
+   int metadata_size;
+
+   uint64_t bound_addr;
+
+   struct nvdimm_bus_descriptor bus_desc;
+   struct nvdimm_bus *bus;
+   struct nvdimm *nvdimm;
+   struct resource res;
+   struct nd_region *region;
+   struct nd_interleave_set nd_set;
+};
+
+static int drc_pmem_bind(struct papr_scm_priv *p)
+{
+   unsigned long ret[PLPAR_HCALL_BUFSIZE];
+   uint64_t rc, token;
+
+   /*
+* When the hypervisor cannot map all the requested memory in a single
+* hcall it returns H_BUSY and we call again with the token until
+* we get H_SUCCESS. Aborting the retry loop before getting H_SUCCESS
+* leaves the system in an undefined state, so we wait.
+*/
+   token = 0;
+
+   do {
+   rc = plpar_hcall(H_SCM_BIND_MEM, ret, p->drc_index, 0,
+   p->blocks, BIND_ANY_ADDR, token);
+   token = be64_to_cpu(ret[0]);
+   } while (rc == H_BUSY);
+
+   if (rc) {
+   dev_err(>pdev->dev, "bind err: %lld\n", rc);
+   return -ENXIO;
+   }
+
+   p->bound_addr = be64_to_cpu(ret[1]);
+
+   dev_dbg(>pdev->dev, "bound drc %x to %pR\n", p->drc_index, >res);
+
+   return 0;
+}
+
+static int drc_pmem_unbind(struct papr_scm_priv *p)
+{
+   unsigned long ret[PLPAR_HCALL_BUFSIZE];
+   uint64_t rc, token;
+
+   token = 0;
+
+   /* NB: unbind has the same retry requirements mentioned above */
+   do {
+   rc = plpar_hcall(H_SCM_UNBIND_MEM, ret, p->drc_index,
+   p->bound_addr, p->blocks, token);
+   token = be64_to_cpu(ret[0]);
+   } while (rc == H_BUSY);
+
+   if (rc)
+   dev_err(>pdev->dev, "unbind error: %lld\n", rc);
+
+   return !!rc;
+}
+
+static int papr_scm_meta_get(struct papr_scm_priv *p,
+   struct 

[PATCH 1/2] powerpc/pseries: PAPR persistent memory support

2018-10-10 Thread Oliver O'Halloran
This patch implements support for discovering storage class memory
devices at boot and for handling hotplug of new regions via RTAS
hotplug events.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/include/asm/firmware.h   |  3 ++-
 arch/powerpc/include/asm/hvcall.h | 10 +-
 arch/powerpc/include/asm/rtas.h   |  2 ++
 arch/powerpc/kernel/rtasd.c   |  2 ++
 arch/powerpc/platforms/pseries/Makefile   |  2 +-
 arch/powerpc/platforms/pseries/dlpar.c|  4 
 arch/powerpc/platforms/pseries/firmware.c |  1 +
 arch/powerpc/platforms/pseries/pseries.h  |  5 +
 arch/powerpc/platforms/pseries/ras.c  |  3 ++-
 9 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/firmware.h 
b/arch/powerpc/include/asm/firmware.h
index 7a051bd21f87..113c64d5d394 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -52,6 +52,7 @@
 #define FW_FEATURE_PRRNASM_CONST(0x0002)
 #define FW_FEATURE_DRMEM_V2ASM_CONST(0x0004)
 #define FW_FEATURE_DRC_INFOASM_CONST(0x0008)
+#define FW_FEATURE_PAPR_SCMASM_CONST(0x0010)
 
 #ifndef __ASSEMBLY__
 
@@ -69,7 +70,7 @@ enum {
FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY |
FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 |
-   FW_FEATURE_DRC_INFO,
+   FW_FEATURE_DRC_INFO | FW_FEATURE_PAPR_SCM,
FW_FEATURE_PSERIES_ALWAYS = 0,
FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL,
FW_FEATURE_POWERNV_ALWAYS = 0,
diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index a0b17f9f1ea4..0e81ef83b35a 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -295,7 +295,15 @@
 #define H_INT_ESB   0x3C8
 #define H_INT_SYNC  0x3CC
 #define H_INT_RESET 0x3D0
-#define MAX_HCALL_OPCODE   H_INT_RESET
+#define H_SCM_READ_METADATA 0x3E4
+#define H_SCM_WRITE_METADATA0x3E8
+#define H_SCM_BIND_MEM  0x3EC
+#define H_SCM_UNBIND_MEM0x3F0
+#define H_SCM_QUERY_BLOCK_MEM_BINDING 0x3F4
+#define H_SCM_QUERY_LOGICAL_MEM_BINDING 0x3F8
+#define H_SCM_MEM_QUERY0x3FC
+#define H_SCM_BLOCK_CLEAR   0x400
+#define MAX_HCALL_OPCODE   H_SCM_BLOCK_CLEAR
 
 /* H_VIOCTL functions */
 #define H_GET_VIOA_DUMP_SIZE   0x01
diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index 71e393c46a49..1e81f3d55457 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -125,6 +125,7 @@ struct rtas_suspend_me_data {
 #define RTAS_TYPE_INFO 0xE2
 #define RTAS_TYPE_DEALLOC  0xE3
 #define RTAS_TYPE_DUMP 0xE4
+#define RTAS_TYPE_HOTPLUG  0xE5
 /* I don't add PowerMGM events right now, this is a different topic */ 
 #define RTAS_TYPE_PMGM_POWER_SW_ON 0x60
 #define RTAS_TYPE_PMGM_POWER_SW_OFF0x61
@@ -316,6 +317,7 @@ struct pseries_hp_errorlog {
 #define PSERIES_HP_ELOG_RESOURCE_MEM   2
 #define PSERIES_HP_ELOG_RESOURCE_SLOT  3
 #define PSERIES_HP_ELOG_RESOURCE_PHB   4
+#define PSERIES_HP_ELOG_RESOURCE_PMEM   6
 
 #define PSERIES_HP_ELOG_ACTION_ADD 1
 #define PSERIES_HP_ELOG_ACTION_REMOVE  2
diff --git a/arch/powerpc/kernel/rtasd.c b/arch/powerpc/kernel/rtasd.c
index 6fafc82c04b0..fad0baddfcba 100644
--- a/arch/powerpc/kernel/rtasd.c
+++ b/arch/powerpc/kernel/rtasd.c
@@ -91,6 +91,8 @@ static char *rtas_event_type(int type)
return "Dump Notification Event";
case RTAS_TYPE_PRRN:
return "Platform Resource Reassignment Event";
+   case RTAS_TYPE_HOTPLUG:
+   return "Hotplug Event";
}
 
return rtas_type[0];
diff --git a/arch/powerpc/platforms/pseries/Makefile 
b/arch/powerpc/platforms/pseries/Makefile
index 7e89d5c47068..892b27ced973 100644
--- a/arch/powerpc/platforms/pseries/Makefile
+++ b/arch/powerpc/platforms/pseries/Makefile
@@ -13,7 +13,7 @@ obj-$(CONFIG_KEXEC_CORE)  += kexec.o
 obj-$(CONFIG_PSERIES_ENERGY)   += pseries_energy.o
 
 obj-$(CONFIG_HOTPLUG_CPU)  += hotplug-cpu.o
-obj-$(CONFIG_MEMORY_HOTPLUG)   += hotplug-memory.o
+obj-$(CONFIG_MEMORY_HOTPLUG)   += hotplug-memory.o pmem.o
 
 obj-$(CONFIG_HVC_CONSOLE)  += hvconsole.o
 obj-$(CONFIG_HVCS) += hvcserver.o
diff --git a/arch/powerpc/platforms/pseries/dlpar.c 
b/arch/powerpc/platforms/pseries/dlpar.c
index a0b20c03f078..795996fefdb9 100644
--- a/arch/powerpc/platforms/pseries/dlpar.c
+++ b/arch/powerpc/platforms/pseries/dlpar.c
@@ -357,6 +357,10 @@ static int handle_dlpar_errorlog(struct 
pseries_hp_errorlog *hp_elog)
case PSERIES_HP_ELOG_RESOURCE_CPU:
rc = dlpar_cpu(hp_elog);
break;
+   case PSERIES_HP_ELOG_RESOURCE_PMEM:
+