Re: [nvdimm PATCH 0/6] Label initialization time optimizations
On Wed, Oct 10, 2018 at 4:36 PM Alexander Duyck wrote:
>
> This patch set is intended to improve NVDIMM label read times by first
> increasing the upper limit on the label read/write size, and then
> reducing the number of reads by making use of the free label bitmap in
> the index to determine what labels are actually populated and only read
> those labels. In my testing on a system populated with 24 NVDIMM modules
> I see the total label init time drop from about 24 seconds down to 2 to
> 3 seconds.
>
> In the process of coding this up I came across a few minor issues that
> I felt should be addressed so I have added a few patches for those fixes
> along the way.

Thanks, this all looks good to me, the split looks good, and the
commentary in patch 6 made it easy to review. I quibble with "Remove
empty if statement" because I find positive logic easier to read than
negative logic, but since the patch removes more lines than it adds I
will reserve that quibble for another time.

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v9 00/13] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Wed, Oct 10, 2018 at 05:03:33PM -0600, Logan Gunthorpe wrote:
>
> On 2018-10-10 2:19 p.m., Bjorn Helgaas wrote:
> > I added the reviewed-by tags from Christoph, Jens' ack on the blkdev.h
> > change, and applied these to pci/peer-to-peer with the intent of
> > merging these for v4.20.
> >
> > I gave up on waiting for an ack for the memremap.h and mm.h changes.
> >
> > I dropped the "nvme-pci: Add a quirk for a pseudo CMB" quirk because
> > of Christoph's objection. After this is all merged, I won't need to
> > be involved, and you and the NVMe folks can hash that out.
> >
> > If there are updates to "nvmet: Optionally use PCI P2P memory" based
> > on Sagi's comments, send an incremental patch and I'll fold them in.
>
> Thanks for picking this up. However, I hate to throw a wrench in the
> works, but I had a v10[1] queued up because kbuild found some problems
> with the series over the weekend. I can send v10 off right away if you
> want to just replace it in your branch or, if you'd like, I can generate
> some incremental patches. Let me know which you'd prefer.

I applied the updates from your v10 to my pci/peer-to-peer branch.

> [1] https://github.com/sbates130272/linux-p2pmem pci-p2p-v10
Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
On 2018-10-10 at 11:18:49 -0700, Dan Williams wrote: > On Wed, Oct 10, 2018 at 10:30 AM Michal Hocko wrote: > > > > On Wed 10-10-18 09:39:08, Alexander Duyck wrote: > > > On 10/10/2018 2:58 AM, Michal Hocko wrote: > > > > On Tue 09-10-18 13:26:41, Alexander Duyck wrote: > > > > [...] > > > > > I would think with that being the case we still probably need the > > > > > call to > > > > > __SetPageReserved to set the bit with the expectation that it will > > > > > not be > > > > > cleared for device-pages since the pages are not onlined. Removing > > > > > the call > > > > > to __SetPageReserved would probably introduce a number of regressions > > > > > as > > > > > there are multiple spots that use the reserved bit to determine if a > > > > > page > > > > > can be swapped out to disk, mapped as system memory, or migrated. > > > > > > > > PageReserved is meant to tell any potential pfn walkers that might get > > > > to this struct page to back off and not touch it. Even though > > > > ZONE_DEVICE doesn't online pages in traditional sense it makes those > > > > pages available for further use so the page reserved bit should be > > > > cleared. > > > > > > So from what I can tell that isn't necessarily the case. Specifically if > > > the > > > pagemap type is MEMORY_DEVICE_PRIVATE or MEMORY_DEVICE_PUBLIC both are > > > special cases where the memory may not be accessible to the CPU or cannot > > > be > > > pinned in order to allow for eviction. > > > > Could you give me an example please? > > > > > The specific case that Dan and Yi are referring to is for the type > > > MEMORY_DEVICE_FS_DAX. For that type I could probably look at not setting > > > the > > > reserved bit. Part of me wants to say that we should wait and clear the > > > bit > > > later, but that would end up just adding time back to initialization. 
> > > At this point I would consider the change more of a follow-up
> > > optimization rather than a fix though, since this is tailoring
> > > things specifically for DAX versus the other ZONE_DEVICE types.
> >
> > I thought I had already made it clear that these zone device hacks are
> > not acceptable to the generic hotplug code. If the current reserve bit
> > handling is not correct then give us a specific reason for that and we
> > can start thinking about the proper fix.
>
> Right, so we're in a situation where a hack is needed for KVM's
> current interpretation of the Reserved flag relative to dax mapped
> pages. I'm arguing to push that knowledge / handling as deep as
> possible into the core rather than hack the leaf implementations like
> KVM, i.e. disable the Reserved flag for all non-MEMORY_DEVICE_*
> ZONE_DEVICE types.
>
> Here is the KVM thread about why they need a change:
>
> https://lkml.org/lkml/2018/9/7/552
>
> ...and where I pushed back on a KVM-local hack:
>
> https://lkml.org/lkml/2018/9/19/154

Yeah, thanks Dan. I think I can go on with something like this:

@@ -5589,6 +5589,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		struct page *page = pfn_to_page(pfn);

 		__init_single_page(page, pfn, zone_idx, nid);
+		/* Could we move this a little bit earlier so I can
+		 * directly use is_dax_page(page), or something else?
+		 */
+		page->pgmap = pgmap;

 		/*
 		 * Mark page reserved as it will need to wait for onlining
@@ -5597,14 +5598,14 @@ void __ref memmap_init_zone_device(struct zone *zone,
 		 * We can use the non-atomic __set_bit operation for setting
 		 * the flag as we are still initializing the pages.
 		 */
-		__SetPageReserved(page);
+		if (!is_dax_page(page))
+			__SetPageReserved(page);

 		/*
 		 * ZONE_DEVICE pages union ->lru with a ->pgmap back
 		 * pointer and hmm_data. It is a bug if a ZONE_DEVICE
 		 * page is ever freed or placed on a driver-private list.
 		 */
-		page->pgmap = pgmap;
 		page->hmm_data = 0;

after Alex's patch is merged.
Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
On 2018-10-10 at 08:27:58 -0700, Alexander Duyck wrote:
>
> On 10/10/2018 5:52 AM, Yi Zhang wrote:
> > On 2018-10-09 at 14:19:32 -0700, Dan Williams wrote:
> > > On Tue, Oct 9, 2018 at 1:34 PM Alexander Duyck wrote:
> > > >
> > > > On 10/9/2018 11:04 AM, Dan Williams wrote:
> > > > > On Tue, Oct 9, 2018 at 3:21 AM Yi Zhang wrote:
> > > > > [..]
> > > > > That comment is incorrect, device-pages are never onlined. So I
> > > > > think we can just skip that call to __SetPageReserved() unless
> > > > > the memory range is MEMORY_DEVICE_{PRIVATE,PUBLIC}.
> > > >
> > > > When pages are "onlined" via __free_pages_boot_core they clear the
> > > > reserved bit, that is the reason for the comment. The reserved bit
> > > > is meant to indicate that the page cannot be swapped out or moved
> > > > based on the description of the bit.
> > >
> > > ...but ZONE_DEVICE pages are never onlined so I would expect
> > > memmap_init_zone_device() to know that detail.
> > >
> > > > I would think with that being the case we still probably need the
> > > > call to __SetPageReserved to set the bit with the expectation that
> > > > it will not be cleared for device-pages since the pages are not
> > > > onlined. Removing the call to __SetPageReserved would probably
> > > > introduce a number of regressions as there are multiple spots that
> > > > use the reserved bit to determine if a page can be swapped out to
> > > > disk, mapped as system memory, or migrated.
> >
> > Another thing: it seems the page init / SetPageReserved has already
> > been done in move_pfn_range_to_zone:
> >   |--> memmap_init_zone
> >        |--> for_each_page_in_pfn
> >             |--> __init_single_page
> >             |--> SetPageReserved
> >
> > Why haven't we removed this redundant initialization in
> > memmap_init_zone?
> >
> > Correct me if I missed something.
>
> In this case it isn't redundant, as only the vmemmap pages are
> initialized in memmap_init_zone now. So all of the pages that are going
> to be used as device pages are not initialized until the call to
> memmap_init_zone_device.
> What I did is split the initialization of the pages into two parts in
> order to allow us to initialize the pages outside of the hotplug lock.

Ah, I see. Thanks for the explanation. So we only need to care about the
reserved flag on the device pages, and plan to remove that.

> > > Right, this is what Yi is working on... the PageReserved flag is
> > > problematic for KVM. Auditing those locations it seems as long as we
> > > teach hibernation to avoid ZONE_DEVICE ranges we can safely not set
> > > the reserved flag for DAX pages. What I'm trying to avoid is a local
> > > KVM hack to check for DAX pages when the Reserved flag is not
> > > otherwise needed.
> >
> > Thanks Dan. Here is the patch link:
> >
> > https://lore.kernel.org/lkml/cover.1536342881.git.yi.z.zh...@linux.intel.com
>
> So it looks like your current logic is just working around the bit then,
> since it just allows for reserved DAX pages.
Re: [PATCH 2/2] powerpc/pseries: Add driver for PAPR SCM regions
Oliver O'Halloran writes:
> Adds a driver that implements support for enabling and accessing PAPR
> SCM regions. Unfortunately, due to how the PAPR interface works, we
> can't use the existing of_pmem driver (yet) because:
>
> a) The guest is required to use the H_SCM_BIND_MEM h-call to add the
>    SCM region to its physical address space, and
> b) There is currently no mechanism for relating a bare of_pmem region
>    to the backing DIMM (or not-a-DIMM for our case).
>
> Both of these are easily handled by rolling the functionality into a
> separate driver, so here we are...
>
> Signed-off-by: Oliver O'Halloran
> ---
> The alternative implementation here is that we have the pseries code
> do the h-calls and craft a pmem-region@ node based on that.
> However, that doesn't solve b), and mpe has expressed his dislike of
> adding new stuff to the DT at runtime, so I'd say that's a non-starter.

Hmm, from memory you yelled something at me across the office about
that, so my response may not have been entirely well considered.

I'm not quite sure what the state of the OF overlays support is, but
that would be The Right Way to do that sort of modification to the
device tree at runtime.

If we merged this and then later got the of_pmem driver to work for us,
would there be any user-visible change?

cheers
[nvdimm PATCH 4/6] nvdimm: Remove empty if statement
This patch removes an empty statement from an if expression and promotes
the else statement to the if expression with the expression logic
reversed. I feel this is more readable, as the empty statement can lead
to issues if any additional logic was ever added.

Signed-off-by: Alexander Duyck
---
 drivers/nvdimm/label.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/nvdimm/label.c b/drivers/nvdimm/label.c
index bb813b8e8ace..43bad0d5bdb6 100644
--- a/drivers/nvdimm/label.c
+++ b/drivers/nvdimm/label.c
@@ -261,9 +261,8 @@ int nd_label_validate(struct nvdimm_drvdata *ndd)
 void nd_label_copy(struct nvdimm_drvdata *ndd, struct nd_namespace_index *dst,
 		struct nd_namespace_index *src)
 {
-	if (dst && src)
-		/* pass */;
-	else
+	/* just exit if either destination or source is NULL */
+	if (!dst || !src)
 		return;
 
 	memcpy(dst, src, sizeof_namespace_index(ndd));
[nvdimm PATCH 5/6] nvdimm: Split label init out from the logic for getting config data
This patch splits the initialization of the label data into two
functions: one for doing the init, and another for reading the actual
configuration data. The idea behind this is that by doing this we create
a symmetry between the getting and setting of config data in that we
have a function for both. In addition it will make it easier for us to
identify the bits that are related to init versus the pieces that are a
wrapper for reading data from the ACPI interface.

So, for example, by splitting things out like this it becomes much more
obvious that we were performing checks that weren't necessarily related
to the set/get operations, such as relying on ndd->data being present
when the set and get ops should not care about a locally cached copy of
the label area.

Signed-off-by: Alexander Duyck
---
 drivers/nvdimm/dimm.c      |  2 +-
 drivers/nvdimm/dimm_devs.c | 49 +---
 drivers/nvdimm/label.c     | 38 ++
 drivers/nvdimm/label.h     |  1 +
 drivers/nvdimm/nd.h        |  2 ++
 5 files changed, 61 insertions(+), 31 deletions(-)

diff --git a/drivers/nvdimm/dimm.c b/drivers/nvdimm/dimm.c
index 6c8fb7590838..07bf96948553 100644
--- a/drivers/nvdimm/dimm.c
+++ b/drivers/nvdimm/dimm.c
@@ -75,7 +75,7 @@ static int nvdimm_probe(struct device *dev)
 	 * DIMM capacity. We fail the dimm probe to prevent regions from
 	 * attempting to parse the label area.
 	 */
-	rc = nvdimm_init_config_data(ndd);
+	rc = nd_label_data_init(ndd);
 	if (rc == -EACCES)
 		nvdimm_set_locked(dev);
 	if (rc)
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 75ac78017b15..6c3de2317390 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -85,55 +85,47 @@ int nvdimm_init_nsarea(struct nvdimm_drvdata *ndd)
 	return cmd_rc;
 }
 
-int nvdimm_init_config_data(struct nvdimm_drvdata *ndd)
+int nvdimm_get_config_data(struct nvdimm_drvdata *ndd, void *buf,
+		size_t offset, size_t len)
 {
 	struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(ndd->dev);
+	struct nvdimm_bus_descriptor *nd_desc = nvdimm_bus->nd_desc;
 	int rc = validate_dimm(ndd), cmd_rc = 0;
 	struct nd_cmd_get_config_data_hdr *cmd;
-	struct nvdimm_bus_descriptor *nd_desc;
-	u32 max_cmd_size, config_size;
-	size_t offset;
+	size_t max_cmd_size, buf_offset;
 
 	if (rc)
 		return rc;
 
-	if (ndd->data)
-		return 0;
-
-	if (ndd->nsarea.status || ndd->nsarea.max_xfer == 0
-			|| ndd->nsarea.config_size < ND_LABEL_MIN_SIZE) {
-		dev_dbg(ndd->dev, "failed to init config data area: (%d:%d)\n",
-				ndd->nsarea.max_xfer, ndd->nsarea.config_size);
+	if (offset + len > ndd->nsarea.config_size)
 		return -ENXIO;
-	}
 
-	ndd->data = kvmalloc(ndd->nsarea.config_size, GFP_KERNEL);
-	if (!ndd->data)
-		return -ENOMEM;
-
-	max_cmd_size = min_t(u32, ndd->nsarea.config_size, ndd->nsarea.max_xfer);
+	max_cmd_size = min_t(u32, len, ndd->nsarea.max_xfer);
 	cmd = kvzalloc(max_cmd_size + sizeof(*cmd), GFP_KERNEL);
 	if (!cmd)
 		return -ENOMEM;
 
-	nd_desc = nvdimm_bus->nd_desc;
-	for (config_size = ndd->nsarea.config_size, offset = 0;
-			config_size; config_size -= cmd->in_length,
-			offset += cmd->in_length) {
-		cmd->in_length = min(config_size, max_cmd_size);
-		cmd->in_offset = offset;
+	for (buf_offset = 0; len;
+	     len -= cmd->in_length, buf_offset += cmd->in_length) {
+		size_t cmd_size;
+
+		cmd->in_offset = offset + buf_offset;
+		cmd->in_length = min(max_cmd_size, len);
+
+		cmd_size = sizeof(*cmd) + cmd->in_length;
+
 		rc = nd_desc->ndctl(nd_desc, to_nvdimm(ndd->dev),
-				ND_CMD_GET_CONFIG_DATA, cmd,
-				cmd->in_length + sizeof(*cmd), &cmd_rc);
+				ND_CMD_GET_CONFIG_DATA, cmd, cmd_size, &cmd_rc);
 		if (rc < 0)
 			break;
 		if (cmd_rc < 0) {
 			rc = cmd_rc;
 			break;
 		}
-		memcpy(ndd->data + offset, cmd->out_buf, cmd->in_length);
+
+		/* out_buf should be valid, copy it into our output buffer */
+		memcpy(buf + buf_offset, cmd->out_buf, cmd->in_length);
 	}
-	dev_dbg(ndd->dev, "len: %zu rc: %d\n", offset, rc);
 	kvfree(cmd);
 
 	return rc;
@@ -151,9 +143,6 @@ int nvdimm_set_config_data(struct nvdimm_drvdata *ndd, size_t offset,
 	if (rc)
 		return rc;
 
-	if (!ndd->data)
-		return -ENXIO;
-
 	if (offset + len >
[nvdimm PATCH 6/6] nvdimm: Use namespace index data to reduce number of label reads needed
This patch adds logic that is meant to make use of the namespace index
data to reduce the number of reads that are needed to initialize a given
namespace. The general idea is that once we have enough data to validate
the namespace index we do so, and then proceed to fetch only those labels
that are not listed as being "free". By doing this I am seeing a total
time reduction from about 4-5 seconds to 2-3 seconds for 24 NVDIMM
modules, each with 128K of label config area.

Signed-off-by: Alexander Duyck
---
 drivers/nvdimm/dimm.c  |  4 --
 drivers/nvdimm/label.c | 93 +---
 drivers/nvdimm/label.h |  3 --
 3 files changed, 88 insertions(+), 12 deletions(-)

diff --git a/drivers/nvdimm/dimm.c b/drivers/nvdimm/dimm.c
index 07bf96948553..9899c97138a3 100644
--- a/drivers/nvdimm/dimm.c
+++ b/drivers/nvdimm/dimm.c
@@ -84,10 +84,6 @@ static int nvdimm_probe(struct device *dev)
 	dev_dbg(dev, "config data size: %d\n", ndd->nsarea.config_size);
 
 	nvdimm_bus_lock(dev);
-	ndd->ns_current = nd_label_validate(ndd);
-	ndd->ns_next = nd_label_next_nsindex(ndd->ns_current);
-	nd_label_copy(ndd, to_next_namespace_index(ndd),
-			to_current_namespace_index(ndd));
 	if (ndd->ns_current >= 0) {
 		rc = nd_label_reserve_dpa(ndd);
 		if (rc == 0)
diff --git a/drivers/nvdimm/label.c b/drivers/nvdimm/label.c
index 563f24af01b5..7f03d117824f 100644
--- a/drivers/nvdimm/label.c
+++ b/drivers/nvdimm/label.c
@@ -235,7 +235,7 @@ static int __nd_label_validate(struct nvdimm_drvdata *ndd)
 	return -1;
 }
 
-int nd_label_validate(struct nvdimm_drvdata *ndd)
+static int nd_label_validate(struct nvdimm_drvdata *ndd)
 {
 	/*
 	 * In order to probe for and validate namespace index blocks we
@@ -258,8 +258,9 @@ int nd_label_validate(struct nvdimm_drvdata *ndd)
 	return -1;
 }
 
-void nd_label_copy(struct nvdimm_drvdata *ndd, struct nd_namespace_index *dst,
-		struct nd_namespace_index *src)
+static void nd_label_copy(struct nvdimm_drvdata *ndd,
+			  struct nd_namespace_index *dst,
+			  struct nd_namespace_index *src)
 {
 	/* just exit if either destination or source is NULL */
 	if (!dst || !src)
@@ -419,7 +420,9 @@ int nd_label_reserve_dpa(struct nvdimm_drvdata *ndd)
 
 int nd_label_data_init(struct nvdimm_drvdata *ndd)
 {
-	size_t config_size, read_size;
+	size_t config_size, read_size, max_xfer, offset;
+	struct nd_namespace_index *nsindex;
+	unsigned int i;
 	int rc = 0;
 
 	if (ndd->data)
@@ -452,7 +455,87 @@ int nd_label_data_init(struct nvdimm_drvdata *ndd)
 	if (!ndd->data)
 		return -ENOMEM;
 
-	return nvdimm_get_config_data(ndd, ndd->data, 0, config_size);
+	/*
+	 * We want to guarantee as few reads as possible while conserving
+	 * memory. To do that we figure out how much unused space will be left
+	 * in the last read, divide that by the total number of reads it is
+	 * going to take given our maximum transfer size, and then reduce our
+	 * maximum transfer size based on that result.
+	 */
+	max_xfer = min_t(size_t, ndd->nsarea.max_xfer, config_size);
+	if (read_size < max_xfer) {
+		/* trim waste */
+		max_xfer -= ((max_xfer - 1) - (config_size - 1) % max_xfer) /
+			    DIV_ROUND_UP(config_size, max_xfer);
+		/* make certain we read indexes in exactly 1 read */
+		if (max_xfer < read_size)
+			max_xfer = read_size;
+	}
+
+	/* Make our initial read size a multiple of max_xfer size */
+	read_size = min(DIV_ROUND_UP(read_size, max_xfer) * max_xfer,
+			config_size);
+
+	/* Read the index data */
+	rc = nvdimm_get_config_data(ndd, ndd->data, 0, read_size);
+	if (rc)
+		goto out_err;
+
+	/* Validate index data, if not valid assume all labels are invalid */
+	ndd->ns_current = nd_label_validate(ndd);
+	if (ndd->ns_current < 0)
+		return 0;
+
+	/* Record our index values */
+	ndd->ns_next = nd_label_next_nsindex(ndd->ns_current);
+
+	/* Copy "current" index on top of the "next" index */
+	nsindex = to_current_namespace_index(ndd);
+	nd_label_copy(ndd, to_next_namespace_index(ndd), nsindex);
+
+	/* Determine starting offset for label data */
+	offset = __le64_to_cpu(nsindex->labeloff);
+
+	/* Loop through the free list pulling in any active labels */
+	for (i = 0; i < nsindex->nslot; i++, offset += ndd->nslabel_size) {
+		size_t label_read_size;
+
+		/* zero out the unused labels */
+		if (test_bit_le(i, nsindex->free)) {
+			memset(ndd->data + offset, 0, ndd->nslabel_size);
+
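The waste-trimming arithmetic in nd_label_data_init() above is dense, so here is a userspace sketch of the same computation. `trim_max_xfer` is an illustrative name, not a kernel function, and the example geometry (128K config area, 64K - 4 maximum transfer) is taken from the cover letter:

```c
#include <assert.h>
#include <stddef.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Sketch of the "trim waste" step: compute the unused space that would
 * be left in the final transfer, spread it evenly across the total
 * number of transfers, and shrink the per-transfer size by that amount
 * so the last read is not pathologically small.
 */
size_t trim_max_xfer(size_t config_size, size_t max_xfer)
{
	size_t waste = (max_xfer - 1) - (config_size - 1) % max_xfer;
	size_t nreads = DIV_ROUND_UP(config_size, max_xfer);

	return max_xfer - waste / nreads;
}
```

With config_size = 131072 and max_xfer = 65532, the untrimmed plan is two full 65532-byte reads plus a final 8-byte read; trimming reduces max_xfer to 43691 so the same three reads are roughly equal in size.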
[nvdimm PATCH 3/6] nvdimm: Clarify comment in sizeof_namespace_index
When working on the label code I found it rather confusing to see several
spots that reference a minimum label size of 256 while working with labels
that are 128 bytes in size. This patch is meant to provide a clarification
on one of the comments that was at the heart of the issue.

Specifically, for version 1.2 and later of the namespace specification the
minimum label size is 256; prior to that the minimum label size was 128.
So we should state that as such to avoid confusion.

Signed-off-by: Alexander Duyck
---
 drivers/nvdimm/label.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/label.c b/drivers/nvdimm/label.c
index 1f5842509dbc..bb813b8e8ace 100644
--- a/drivers/nvdimm/label.c
+++ b/drivers/nvdimm/label.c
@@ -75,7 +75,8 @@ size_t sizeof_namespace_index(struct nvdimm_drvdata *ndd)
 	/*
 	 * Per UEFI 2.7, the minimum size of the Label Storage Area is large
 	 * enough to hold 2 index blocks and 2 labels.  The minimum index
-	 * block size is 256 bytes, and the minimum label size is 256 bytes.
+	 * block size is 256 bytes.  The label size is 128 for namespaces
+	 * prior to version 1.2 and at minimum 256 for version 1.2 and later.
 	 */
 	nslot = nvdimm_num_label_slots(ndd);
 	space = ndd->nsarea.config_size - nslot * sizeof_namespace_label(ndd);
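To make the size arithmetic behind that comment concrete, here is a small sketch of the minimum Label Storage Area implied by "2 index blocks and 2 labels" for each label size. `min_lsa_size` is an illustrative name, not a libnvdimm function:

```c
#include <assert.h>
#include <stddef.h>

#define MIN_INDEX_BLOCK_SIZE 256	/* minimum index block size per UEFI */

/* Minimum Label Storage Area: 2 index blocks plus 2 labels. */
size_t min_lsa_size(size_t label_size)
{
	return 2 * MIN_INDEX_BLOCK_SIZE + 2 * label_size;
}
```

So 128-byte labels (pre-v1.2) imply a 768-byte minimum LSA, while 256-byte labels (v1.2 and later) imply 1024 bytes.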
[nvdimm PATCH 2/6] nvdimm: Sanity check labeloff
This patch adds validation for the labeloff field in the indexes.

Signed-off-by: Alexander Duyck
---
 drivers/nvdimm/label.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/nvdimm/label.c b/drivers/nvdimm/label.c
index 1d28cd656536..1f5842509dbc 100644
--- a/drivers/nvdimm/label.c
+++ b/drivers/nvdimm/label.c
@@ -183,6 +183,13 @@ static int __nd_label_validate(struct nvdimm_drvdata *ndd)
 					__le64_to_cpu(nsindex[i]->otheroff));
 			continue;
 		}
+		if (__le64_to_cpu(nsindex[i]->labeloff)
+				!= 2 * sizeof_namespace_index(ndd)) {
+			dev_dbg(dev, "nsindex%d labeloff: %#llx invalid\n",
+					i, (unsigned long long)
+					__le64_to_cpu(nsindex[i]->labeloff));
+			continue;
+		}
 
 		size = __le64_to_cpu(nsindex[i]->mysize);
 		if (size > sizeof_namespace_index(ndd)
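The check encodes the expected on-media layout: two namespace index blocks back to back, with the label slots starting immediately after, so a valid labeloff is exactly twice the index block size. A minimal sketch of that invariant (`labeloff_valid` is an illustrative helper, not part of the patch):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Label storage area layout assumed by the validation:
 *   [ index block 0 ][ index block 1 ][ label slot 0 ][ label slot 1 ] ...
 * so the label offset recorded in an index block must be exactly
 * 2 * sizeof_namespace_index().
 */
int labeloff_valid(uint64_t labeloff, uint64_t index_block_size)
{
	return labeloff == 2 * index_block_size;
}
```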
[nvdimm PATCH 1/6] libnvdimm, dimm: Maximize label transfer size
From: Dan Williams

Use kvzalloc() to bypass the arbitrary PAGE_SIZE limit of label transfer
operations. Given the expense of calling into firmware, maximize the
amount of label data we transfer per call to be up to the total label
space if allowed by the firmware. Instead of limiting based on PAGE_SIZE
we can simply limit the maximum size based on either the config_size in
the case of the get operation, or on the length of the write in the case
of the set operation.

On a system with 24 NVDIMM modules each with a config_size of 128K and a
maximum transfer size of 64K - 4, this patch reduces the init time for
the label data from around 24 seconds down to between 4-5 seconds.

Signed-off-by: Dan Williams
Signed-off-by: Alexander Duyck
---
I think I only did a few minor tweaks on this so I thought I would leave
Dan as the author and just add my Signed-off-by.

 drivers/nvdimm/dimm_devs.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 863cabc35215..75ac78017b15 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -111,8 +111,8 @@ int nvdimm_init_config_data(struct nvdimm_drvdata *ndd)
 	if (!ndd->data)
 		return -ENOMEM;
 
-	max_cmd_size = min_t(u32, PAGE_SIZE, ndd->nsarea.max_xfer);
-	cmd = kzalloc(max_cmd_size + sizeof(*cmd), GFP_KERNEL);
+	max_cmd_size = min_t(u32, ndd->nsarea.config_size, ndd->nsarea.max_xfer);
+	cmd = kvzalloc(max_cmd_size + sizeof(*cmd), GFP_KERNEL);
 	if (!cmd)
 		return -ENOMEM;
 
@@ -134,7 +134,7 @@ int nvdimm_init_config_data(struct nvdimm_drvdata *ndd)
 		memcpy(ndd->data + offset, cmd->out_buf, cmd->in_length);
 	}
 	dev_dbg(ndd->dev, "len: %zu rc: %d\n", offset, rc);
-	kfree(cmd);
+	kvfree(cmd);
 
 	return rc;
 }
@@ -157,9 +157,8 @@ int nvdimm_set_config_data(struct nvdimm_drvdata *ndd, size_t offset,
 	if (offset + len > ndd->nsarea.config_size)
 		return -ENXIO;
 
-	max_cmd_size = min_t(u32, PAGE_SIZE, len);
-	max_cmd_size = min_t(u32, max_cmd_size, ndd->nsarea.max_xfer);
-	cmd = kzalloc(max_cmd_size + sizeof(*cmd) + sizeof(u32), GFP_KERNEL);
+	max_cmd_size = min_t(u32, len, ndd->nsarea.max_xfer);
+	cmd = kvzalloc(max_cmd_size + sizeof(*cmd) + sizeof(u32), GFP_KERNEL);
 	if (!cmd)
 		return -ENOMEM;
 
@@ -183,7 +182,7 @@ int nvdimm_set_config_data(struct nvdimm_drvdata *ndd, size_t offset,
 			break;
 		}
 	}
-	kfree(cmd);
+	kvfree(cmd);
 
 	return rc;
 }
[nvdimm PATCH 0/6] Label initialization time optimizations
This patch set is intended to improve NVDIMM label read times by first
increasing the upper limit on the label read/write size, and then
reducing the number of reads by making use of the free label bitmap in
the index to determine what labels are actually populated and only read
those labels. In my testing on a system populated with 24 NVDIMM modules
I see the total label init time drop from about 24 seconds down to 2 to
3 seconds.

In the process of coding this up I came across a few minor issues that I
felt should be addressed, so I have added a few patches for those fixes
along the way.

---

Alexander Duyck (5):
      nvdimm: Sanity check labeloff
      nvdimm: Clarify comment in sizeof_namespace_index
      nvdimm: Remove empty if statement
      nvdimm: Split label init out from the logic for getting config data
      nvdimm: Use namespace index data to reduce number of label reads needed

Dan Williams (1):
      libnvdimm, dimm: Maximize label transfer size

 drivers/nvdimm/dimm.c      |   6 --
 drivers/nvdimm/dimm_devs.c |  60 +++
 drivers/nvdimm/label.c     | 142 ++--
 drivers/nvdimm/label.h     |   4 -
 drivers/nvdimm/nd.h        |   2 +
 5 files changed, 163 insertions(+), 51 deletions(-)

--
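The "free label bitmap" mentioned above is a little-endian bitmap in the namespace index where a set bit marks an unused label slot; the series skips reads for those slots. A userspace sketch of the lookup (`slot_is_free` is an illustrative name; the kernel side uses test_bit_le()):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Little-endian bitmap test, mirroring test_bit_le(): bit n lives in
 * byte n / 8 at bit position n % 8. A set bit means the label slot is
 * free and its label data never needs to be read from the DIMM.
 */
int slot_is_free(const uint8_t *free_bitmap, unsigned int slot)
{
	return (free_bitmap[slot >> 3] >> (slot & 7)) & 1;
}
```

For example, a first bitmap byte of 0x05 marks slots 0 and 2 free, so only slot 1 (and any other clear bits) would actually be fetched.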
Re: [PATCH v9 00/13] Copy Offload in NVMe Fabrics with P2P PCI Memory
On 2018-10-10 2:19 p.m., Bjorn Helgaas wrote:
> I added the reviewed-by tags from Christoph, Jens' ack on the blkdev.h
> change, and applied these to pci/peer-to-peer with the intent of
> merging these for v4.20.
>
> I gave up on waiting for an ack for the memremap.h and mm.h changes.
>
> I dropped the "nvme-pci: Add a quirk for a pseudo CMB" quirk because
> of Christoph's objection. After this is all merged, I won't need to
> be involved, and you and the NVMe folks can hash that out.
>
> If there are updates to "nvmet: Optionally use PCI P2P memory" based
> on Sagi's comments, send an incremental patch and I'll fold them in.

Thanks for picking this up. However, I hate to throw a wrench in the
works, but I had a v10[1] queued up because kbuild found some problems
with the series over the weekend. I can send v10 off right away if you
want to just replace it in your branch or, if you'd like, I can generate
some incremental patches. Let me know which you'd prefer.

Thanks,

Logan

[1] https://github.com/sbates130272/linux-p2pmem pci-p2p-v10
Re: [PATCH 1/2] powerpc/pseries: PAPR persistent memory support
On Thu, Oct 11, 2018 at 3:36 AM Nathan Fontenot wrote: > > On 10/10/2018 01:08 AM, Oliver O'Halloran wrote: > > This patch implements support for discovering storage class memory > > devices at boot and for handling hotplug of new regions via RTAS > > hotplug events. > > > > Signed-off-by: Oliver O'Halloran > > --- > > arch/powerpc/include/asm/firmware.h | 3 ++- > > arch/powerpc/include/asm/hvcall.h | 10 +- > > arch/powerpc/include/asm/rtas.h | 2 ++ > > arch/powerpc/kernel/rtasd.c | 2 ++ > > arch/powerpc/platforms/pseries/Makefile | 2 +- > > arch/powerpc/platforms/pseries/dlpar.c| 4 > > arch/powerpc/platforms/pseries/firmware.c | 1 + > > arch/powerpc/platforms/pseries/pseries.h | 5 + > > arch/powerpc/platforms/pseries/ras.c | 3 ++- > > 9 files changed, 28 insertions(+), 4 deletions(-) > > > > diff --git a/arch/powerpc/include/asm/firmware.h > > b/arch/powerpc/include/asm/firmware.h > > index 7a051bd21f87..113c64d5d394 100644 > > --- a/arch/powerpc/include/asm/firmware.h > > +++ b/arch/powerpc/include/asm/firmware.h > > @@ -52,6 +52,7 @@ > > #define FW_FEATURE_PRRN ASM_CONST(0x0002) > > #define FW_FEATURE_DRMEM_V2 ASM_CONST(0x0004) > > #define FW_FEATURE_DRC_INFO ASM_CONST(0x0008) > > +#define FW_FEATURE_PAPR_SCM ASM_CONST(0x0010) > > > > #ifndef __ASSEMBLY__ > > > > @@ -69,7 +70,7 @@ enum { > > FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY | > > FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN | > > FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 | > > - FW_FEATURE_DRC_INFO, > > + FW_FEATURE_DRC_INFO | FW_FEATURE_PAPR_SCM, > > FW_FEATURE_PSERIES_ALWAYS = 0, > > FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL, > > FW_FEATURE_POWERNV_ALWAYS = 0, > > diff --git a/arch/powerpc/include/asm/hvcall.h > > b/arch/powerpc/include/asm/hvcall.h > > index a0b17f9f1ea4..0e81ef83b35a 100644 > > --- a/arch/powerpc/include/asm/hvcall.h > > +++ b/arch/powerpc/include/asm/hvcall.h > > @@ -295,7 +295,15 @@ > > #define H_INT_ESB 0x3C8 > > #define H_INT_SYNC 0x3CC > > #define H_INT_RESET 0x3D0 > > 
-#define MAX_HCALL_OPCODE H_INT_RESET > > +#define H_SCM_READ_METADATA 0x3E4 > > +#define H_SCM_WRITE_METADATA 0x3E8 > > +#define H_SCM_BIND_MEM 0x3EC > > +#define H_SCM_UNBIND_MEM 0x3F0 > > +#define H_SCM_QUERY_BLOCK_MEM_BINDING 0x3F4 > > +#define H_SCM_QUERY_LOGICAL_MEM_BINDING 0x3F8 > > +#define H_SCM_MEM_QUERY 0x3FC > > +#define H_SCM_BLOCK_CLEAR 0x400 > > +#define MAX_HCALL_OPCODE H_SCM_BLOCK_CLEAR > > > > /* H_VIOCTL functions */ > > #define H_GET_VIOA_DUMP_SIZE 0x01 > > diff --git a/arch/powerpc/include/asm/rtas.h > > b/arch/powerpc/include/asm/rtas.h > > index 71e393c46a49..1e81f3d55457 100644 > > --- a/arch/powerpc/include/asm/rtas.h > > +++ b/arch/powerpc/include/asm/rtas.h > > @@ -125,6 +125,7 @@ struct rtas_suspend_me_data { > > #define RTAS_TYPE_INFO 0xE2 > > #define RTAS_TYPE_DEALLOC 0xE3 > > #define RTAS_TYPE_DUMP 0xE4 > > +#define RTAS_TYPE_HOTPLUG 0xE5 > > /* I don't add PowerMGM events right now, this is a different topic */ > > #define RTAS_TYPE_PMGM_POWER_SW_ON 0x60 > > #define RTAS_TYPE_PMGM_POWER_SW_OFF 0x61 > > @@ -316,6 +317,7 @@ struct pseries_hp_errorlog { > > #define PSERIES_HP_ELOG_RESOURCE_MEM 2 > > #define PSERIES_HP_ELOG_RESOURCE_SLOT 3 > > #define PSERIES_HP_ELOG_RESOURCE_PHB 4 > > +#define PSERIES_HP_ELOG_RESOURCE_PMEM 6 > > > > #define PSERIES_HP_ELOG_ACTION_ADD 1 > > #define PSERIES_HP_ELOG_ACTION_REMOVE 2 > > diff --git a/arch/powerpc/kernel/rtasd.c b/arch/powerpc/kernel/rtasd.c > > index 6fafc82c04b0..fad0baddfcba 100644 > > --- a/arch/powerpc/kernel/rtasd.c > > +++ b/arch/powerpc/kernel/rtasd.c > > @@ -91,6 +91,8 @@ static char *rtas_event_type(int type) > > return "Dump Notification Event"; > > case RTAS_TYPE_PRRN: > > return "Platform Resource Reassignment Event"; > > + case RTAS_TYPE_HOTPLUG: > > + return "Hotplug Event"; > > } > > > > return rtas_type[0]; > > diff --git a/arch/powerpc/platforms/pseries/Makefile > > b/arch/powerpc/platforms/pseries/Makefile > > index 7e89d5c47068..892b27ced973 100644 > > ---
a/arch/powerpc/platforms/pseries/Makefile > > +++ b/arch/powerpc/platforms/pseries/Makefile > > @@ -13,7 +13,7 @@ obj-$(CONFIG_KEXEC_CORE) += kexec.o > > obj-$(CONFIG_PSERIES_ENERGY) += pseries_energy.o > > > > obj-$(CONFIG_HOTPLUG_CPU) += hotplug-cpu.o > > -obj-$(CONFIG_MEMORY_HOTPLUG) += hotplug-memory.o > > +obj-$(CONFIG_MEMORY_HOTPLUG) += hotplug-memory.o pmem.o > > > > obj-$(CONFIG_HVC_CONSOLE) += hvconsole.o > > obj-$(CONFIG_HVCS)
Re: [PATCH v9 00/13] Copy Offload in NVMe Fabrics with P2P PCI Memory
On Thu, Oct 04, 2018 at 03:27:34PM -0600, Logan Gunthorpe wrote: > This is v9 for the p2pdma patch set. It has some substantial changes > from the previous version. Essentially, the last patch has changed > based on information from Sagi that the P2P memory can be assigned > per namespace instead of globally. To that end, I've added a > radix tree that stores the p2p devices to use for each namespace and > a lookup will be done on that before allocating the memory. > > This has some knock-on effects of simplifying the requirements > from the p2pdma code and so we drop all the client list code seeing > we don't need to worry about selecting a P2P device that works > with every namespace at once; only each namespace independently. > Thus, Patch 1, 6 and 13 have significant changes. However, > I think the removal of the client list code will be generally > appreciated based on the feedback I've gotten from Christian. > > As usual, a git repo of the branch is available here: > > https://github.com/sbates130272/linux-p2pmem pci-p2p-v9 > > Thanks, > > Logan > > -- > > Changes in v9: > * Rebased on v4.19-rc6 > * Dropped pci_p2pdma_add_client(), pci_p2pdma_remove_client() and > pci_p2pdma_client_list_free(), and reworked pci_p2pdma_distance() and > pci_p2pmem_find() to take simple arrays instead of list_heads. > * Updated the documentation to match > * Reworked the nvmet code to have a configfs attribute per namespace > instead of per port. Then each namespace has its own P2P device > stored in a radix-tree in the controller (it can't be stored in > the nvme_ns structure seeing it needs to be unique per controller). > * Dropped the 'sq' argument in nvmet_req_alloc_sgl() seeing I've noticed > it is now available in the nvmet_req structure. > * Collected Keith's reviewed-by tags (however, I have not applied his > tag to the last patch seeing I think it has changed too much since > his review). 
> > Changes in v8: > > * Added a couple of comments to address Bart's feedback and > removed the bogus call to percpu_ref_switch_to_atomic_sync() > > Changes in v7: > > * Rebased on v4.19-rc5 > > * Fixed up commit message of patch 7 that was no longer accurate. (as > pointed out by Jens) > * Change the configfs to not use "auto" or "none" and instead just > use a 0/1/ (or boolean). This matches the existing > nvme-target configfs booleans. (Per Bjorn) > * A handful of other minor changes and edits that were noticed by Bjorn > * Collected Acks from Bjorn > > Changes in v6: > > * Rebased on v4.19-rc3 > > * Remove the changes to the common submit_bio() path and instead > set REQ_NOMERGE in the NVME target driver, when appropriate. > Per discussions with Jens and Christoph. > > * Some minor grammar changes in the documentation as spotted by Randy. > > Changes in v5: > > * Rebased on v4.19-rc1 > > * Drop changing ACS settings in this patchset. Now, the code > will only allow P2P transactions between devices whose > downstream ports do not restrict P2P TLPs. > > * Drop the REQ_PCI_P2PDMA block flag and instead use > is_pci_p2pdma_page() to tell if a request is P2P or not. In that > case we check for queue support and enforce using REQ_NOMERGE. > Per feedback from Christoph. > > * Drop the pci_p2pdma_unmap_sg() function as it was empty and only > there for symmetry and compatibility with dma_unmap_sg. Per feedback > from Christoph. > > * Split off the logic to handle enabling P2P in NVMe fabrics' configfs > into specific helpers in the p2pdma code. Per feedback from Christoph. > > * A number of other minor cleanups and fixes as pointed out by > Christoph and others. > > Changes in v4: > > * Change the original upstream_bridges_match() function to > upstream_bridge_distance() which calculates the distance between two > devices as long as they are behind the same root port. This should > address Bjorn's concerns that the code was too focused on > being behind a single switch. 
> > * The disable ACS function now disables ACS for all bridge ports instead > of switch ports (ie. those that had two upstream_bridge ports). > > * Change the pci_p2pmem_alloc_sgl() and pci_p2pmem_free_sgl() > API to be more like sgl_alloc() in that the alloc function returns > the allocated scatterlist and nents is not required by the free > function. > > * Moved the new documentation into the driver-api tree as requested > by Jonathan > > * Add SGL alloc and free helpers in the nvmet code so that the > individual drivers can share the code that allocates P2P memory. > As requested by Christoph. > > * Cleanup the nvmet_p2pmem_store() function as Christoph > thought my first attempt was ugly. > > * Numerous commit message and comment fix-ups > > Changes in v3: > > * Many more fixes and minor cleanups that were spotted by Bjorn > > * Additional explanation of the ACS change in both the commit message > and Kconfig doc. Also, the code that disables the ACS bits
Re: [PATCH v9 07/13] block: Add PCI P2P flag for request queue and check support for requests
On Fri, Oct 05, 2018 at 07:16:04PM -0600, Jens Axboe wrote: > On 10/4/18 3:27 PM, Logan Gunthorpe wrote: > > QUEUE_FLAG_PCI_P2P is introduced meaning a driver's request queue > > supports targeting P2P memory. This will be used by P2P providers and > > orchestrators (in subsequent patches) to ensure block devices can > > support P2P memory before submitting P2P backed pages to submit_bio(). > > Nit pick, but the subject line still says that it checks support > for requests. This patch just adds the ability to flag support > in the queue. What about the following? MAINTAINERS doesn't list a specific maintainer for include/linux/blkdev.h (maybe the "BLOCK LAYER" entry should have an F: pattern for it?), but I'd really like your ack before merging this. commit 7e647ae1eda290786851c3dff4f38189b982386d Author: Logan Gunthorpe Date: Thu Oct 4 15:27:41 2018 -0600 block: Add PCI P2P flag for request queue Add QUEUE_FLAG_PCI_P2P, meaning a driver's request queue supports targeting P2P memory. This will be used by P2P providers and orchestrators (in subsequent patches) to ensure block devices can support P2P memory before submitting P2P-backed pages to submit_bio(). 
Signed-off-by: Logan Gunthorpe Signed-off-by: Bjorn Helgaas Reviewed-by: Christoph Hellwig diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 6980014357d4..c32f7171899b 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -699,6 +699,7 @@ struct request_queue { #define QUEUE_FLAG_SCSI_PASSTHROUGH 27 /* queue supports SCSI commands */ #define QUEUE_FLAG_QUIESCED 28 /* queue has been quiesced */ #define QUEUE_FLAG_PREEMPT_ONLY 29 /* only process REQ_PREEMPT requests */ +#define QUEUE_FLAG_PCI_P2PDMA 30 /* device supports PCI p2p requests */ #define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) |\ (1 << QUEUE_FLAG_SAME_COMP)| \ @@ -731,6 +732,8 @@ bool blk_queue_flag_test_and_clear(unsigned int flag, struct request_queue *q); #define blk_queue_dax(q) test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags) #define blk_queue_scsi_passthrough(q) \ test_bit(QUEUE_FLAG_SCSI_PASSTHROUGH, &(q)->queue_flags) +#define blk_queue_pci_p2pdma(q) \ + test_bit(QUEUE_FLAG_PCI_P2PDMA, &(q)->queue_flags) #define blk_noretry_request(rq) \ ((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \ ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v9 07/13] block: Add PCI P2P flag for request queue and check support for requests
On 10/10/18 1:59 PM, Bjorn Helgaas wrote: > On Fri, Oct 05, 2018 at 07:16:04PM -0600, Jens Axboe wrote: >> On 10/4/18 3:27 PM, Logan Gunthorpe wrote: >>> QUEUE_FLAG_PCI_P2P is introduced meaning a driver's request queue >>> supports targeting P2P memory. This will be used by P2P providers and >>> orchestrators (in subsequent patches) to ensure block devices can >>> support P2P memory before submitting P2P backed pages to submit_bio(). >> >> Nit pick, but the subject line still says that it checks support >> for requests. This patch just adds the ability to flag support >> in the queue. > > What about the following? MAINTAINERS doesn't list a specific > maintainer for include/linux/blkdev.h (maybe the "BLOCK LAYER" entry > should have an F: pattern for it?), but I'd really like your ack > before merging this. > > commit 7e647ae1eda290786851c3dff4f38189b982386d > Author: Logan Gunthorpe > Date: Thu Oct 4 15:27:41 2018 -0600 > > block: Add PCI P2P flag for request queue > > Add QUEUE_FLAG_PCI_P2P, meaning a driver's request queue supports > targeting > P2P memory. This will be used by P2P providers and orchestrators (in > subsequent patches) to ensure block devices can support P2P memory before > submitting P2P-backed pages to submit_bio(). > > Signed-off-by: Logan Gunthorpe > Signed-off-by: Bjorn Helgaas > Reviewed-by: Christoph Hellwig You can add my acked-by to this one. -- Jens Axboe ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
On Wed 10-10-18 10:39:01, Alexander Duyck wrote: > On 10/10/2018 10:24 AM, Michal Hocko wrote: [...] > > I thought I have already made it clear that these zone device hacks are > > not acceptable to the generic hotplug code. If the current reserve bit > > handling is not correct then give us a specific reason for that and we > > can start thinking about the proper fix. > > I might have misunderstood your earlier comment then. I thought you were > saying that we shouldn't bother with setting the reserved bit. Now it sounds > like you were thinking more along the lines of what I was here in my comment > where I thought the bit should be cleared later in some code specifically > related to DAX when it is exposing it for use to userspace or KVM. It seems I managed to confuse myself completely. Sorry, it's been a long day and I am sick so the brain doesn't work all that well. I will get back to this tomorrow or on Friday with a fresh brain. My recollection was that we do clear the reserved bit in move_pfn_range_to_zone and we indeed do in __init_single_page. But then we set the bit back right afterwards. This seems to be the case since d0dc12e86b319 which reorganized the code. I have to study this some more obviously. -- Michal Hocko SUSE Labs ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
On Wed, Oct 10, 2018 at 10:30 AM Michal Hocko wrote: > > On Wed 10-10-18 09:39:08, Alexander Duyck wrote: > > On 10/10/2018 2:58 AM, Michal Hocko wrote: > > > On Tue 09-10-18 13:26:41, Alexander Duyck wrote: > > > [...] > > > > I would think with that being the case we still probably need the call > > > > to > > > > __SetPageReserved to set the bit with the expectation that it will not > > > > be > > > > cleared for device-pages since the pages are not onlined. Removing the > > > > call > > > > to __SetPageReserved would probably introduce a number of regressions as > > > > there are multiple spots that use the reserved bit to determine if a > > > > page > > > > can be swapped out to disk, mapped as system memory, or migrated. > > > > > > PageReserved is meant to tell any potential pfn walkers that might get > > > to this struct page to back off and not touch it. Even though > > > ZONE_DEVICE doesn't online pages in traditional sense it makes those > > > pages available for further use so the page reserved bit should be > > > cleared. > > > > So from what I can tell that isn't necessarily the case. Specifically if the > > pagemap type is MEMORY_DEVICE_PRIVATE or MEMORY_DEVICE_PUBLIC both are > > special cases where the memory may not be accessible to the CPU or cannot be > > pinned in order to allow for eviction. > > Could you give me an example please? > > > The specific case that Dan and Yi are referring to is for the type > > MEMORY_DEVICE_FS_DAX. For that type I could probably look at not setting the > > reserved bit. Part of me wants to say that we should wait and clear the bit > > later, but that would end up just adding time back to initialization. At > > this point I would consider the change more of a follow-up optimization > > rather than a fix though since this is tailoring things specifically for DAX > > versus the other ZONE_DEVICE types. 
> > I thought I have already made it clear that these zone device hacks are > not acceptable to the generic hotplug code. If the current reserve bit > handling is not correct then give us a specific reason for that and we > can start thinking about the proper fix. > Right, so we're in a situation where a hack is needed for KVM's current interpretation of the Reserved flag relative to dax mapped pages. I'm arguing to push that knowledge / handling as deep as possible into the core rather than hack the leaf implementations like KVM, i.e. disable the Reserved flag for all non-MEMORY_DEVICE_* ZONE_DEVICE types. Here is the KVM thread about why they need a change: https://lkml.org/lkml/2018/9/7/552 ...and where I pushed back on a KVM-local hack: https://lkml.org/lkml/2018/9/19/154 ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
On 10/10/2018 10:53 AM, Michal Hocko wrote: On Wed 10-10-18 10:39:01, Alexander Duyck wrote: On 10/10/2018 10:24 AM, Michal Hocko wrote: On Wed 10-10-18 09:39:08, Alexander Duyck wrote: On 10/10/2018 2:58 AM, Michal Hocko wrote: On Tue 09-10-18 13:26:41, Alexander Duyck wrote: [...] I would think with that being the case we still probably need the call to __SetPageReserved to set the bit with the expectation that it will not be cleared for device-pages since the pages are not onlined. Removing the call to __SetPageReserved would probably introduce a number of regressions as there are multiple spots that use the reserved bit to determine if a page can be swapped out to disk, mapped as system memory, or migrated. PageReserved is meant to tell any potential pfn walkers that might get to this struct page to back off and not touch it. Even though ZONE_DEVICE doesn't online pages in traditional sense it makes those pages available for further use so the page reserved bit should be cleared. So from what I can tell that isn't necessarily the case. Specifically if the pagemap type is MEMORY_DEVICE_PRIVATE or MEMORY_DEVICE_PUBLIC both are special cases where the memory may not be accessible to the CPU or cannot be pinned in order to allow for eviction. Could you give me an example please? Honestly I am getting a bit beyond my depth here so maybe Dan could explain better. I am basing the above comment on Dan's earlier comment in this thread combined with the comment that explains the "memory_type" field for the pgmap: https://elixir.bootlin.com/linux/v4.19-rc7/source/include/linux/memremap.h#L28 The specific case that Dan and Yi are referring to is for the type MEMORY_DEVICE_FS_DAX. For that type I could probably look at not setting the reserved bit. Part of me wants to say that we should wait and clear the bit later, but that would end up just adding time back to initialization. 
At this point I would consider the change more of a follow-up optimization rather than a fix though since this is tailoring things specifically for DAX versus the other ZONE_DEVICE types. I thought I have already made it clear that these zone device hacks are not acceptable to the generic hotplug code. If the current reserve bit handling is not correct then give us a specific reason for that and we can start thinking about the proper fix. I might have misunderstood your earlier comment then. I thought you were saying that we shouldn't bother with setting the reserved bit. Now it sounds like you were thinking more along the lines of what I was here in my comment where I thought the bit should be cleared later in some code specifically related to DAX when it is exposing it for use to userspace or KVM. I was referring to my earlier comment that if you need to do something about struct page initialization (move_pfn_range_to_zone) outside of the lock (with the appropriate ground work that is needed) rather than pulling more zone device hacks into the generic hotplug code [1] [1] http://lkml.kernel.org/r/20180926075540.gd6...@dhcp22.suse.cz The only issue is if we don't pull the code together we are looking at a massive increase in initialization times. So for example just the loop that was originally there that was setting the pgmap and resetting the LRU prev pointer was adding an additional 20+ seconds per node with 3TB allocated per node. That essentially doubles the initialization time. How would you recommend I address that? Currently it is a few extra lines in the memmap_init_zone_device function. Eventually I was planning to combine the memmap_init_zone hotplug functionality and memmap_init_zone_device core functionality into a single function called __init_pageblock[1] and then reuse that functionality for deferred page init as well. 
[1] http://lkml.kernel.org/r/20181005151224.17473.53398.stgit@localhost.localdomain ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
On Wed 10-10-18 10:39:01, Alexander Duyck wrote: > > > On 10/10/2018 10:24 AM, Michal Hocko wrote: > > On Wed 10-10-18 09:39:08, Alexander Duyck wrote: > > > On 10/10/2018 2:58 AM, Michal Hocko wrote: > > > > On Tue 09-10-18 13:26:41, Alexander Duyck wrote: > > > > [...] > > > > > I would think with that being the case we still probably need the > > > > > call to > > > > > __SetPageReserved to set the bit with the expectation that it will > > > > > not be > > > > > cleared for device-pages since the pages are not onlined. Removing > > > > > the call > > > > > to __SetPageReserved would probably introduce a number of regressions > > > > > as > > > > > there are multiple spots that use the reserved bit to determine if a > > > > > page > > > > > can be swapped out to disk, mapped as system memory, or migrated. > > > > > > > > PageReserved is meant to tell any potential pfn walkers that might get > > > > to this struct page to back off and not touch it. Even though > > > > ZONE_DEVICE doesn't online pages in traditional sense it makes those > > > > pages available for further use so the page reserved bit should be > > > > cleared. > > > > > > So from what I can tell that isn't necessarily the case. Specifically if > > > the > > > pagemap type is MEMORY_DEVICE_PRIVATE or MEMORY_DEVICE_PUBLIC both are > > > special cases where the memory may not be accessible to the CPU or cannot > > > be > > > pinned in order to allow for eviction. > > > > Could you give me an example please? > > Honestly I am getting a bit beyond my depth here so maybe Dan could explain > better. I am basing the above comment on Dan's earlier comment in this > thread combined with the comment that explains the "memory_type" field for > the pgmap: > https://elixir.bootlin.com/linux/v4.19-rc7/source/include/linux/memremap.h#L28 > > > > The specific case that Dan and Yi are referring to is for the type > > > MEMORY_DEVICE_FS_DAX. 
For that type I could probably look at not setting > > > the > > > reserved bit. Part of me wants to say that we should wait and clear the > > > bit > > > later, but that would end up just adding time back to initialization. At > > > this point I would consider the change more of a follow-up optimization > > > rather than a fix though since this is tailoring things specifically for > > > DAX > > > versus the other ZONE_DEVICE types. > > > > I thought I have already made it clear that these zone device hacks are > > not acceptable to the generic hotplug code. If the current reserve bit > > handling is not correct then give us a specific reason for that and we > > can start thinking about the proper fix. > > I might have misunderstood your earlier comment then. I thought you were > saying that we shouldn't bother with setting the reserved bit. Now it sounds > like you were thinking more along the lines of what I was here in my comment > where I thought the bit should be cleared later in some code specifically > related to DAX when it is exposing it for use to userspace or KVM. I was referring to my earlier comment that if you need to do something about struct page initialization (move_pfn_range_to_zone) outside of the lock (with the appropriate ground work that is needed) rather than pulling more zone device hacks into the generic hotplug code [1] [1] http://lkml.kernel.org/r/20180926075540.gd6...@dhcp22.suse.cz -- Michal Hocko SUSE Labs ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
On 10/10/2018 10:24 AM, Michal Hocko wrote: On Wed 10-10-18 09:39:08, Alexander Duyck wrote: On 10/10/2018 2:58 AM, Michal Hocko wrote: On Tue 09-10-18 13:26:41, Alexander Duyck wrote: [...] I would think with that being the case we still probably need the call to __SetPageReserved to set the bit with the expectation that it will not be cleared for device-pages since the pages are not onlined. Removing the call to __SetPageReserved would probably introduce a number of regressions as there are multiple spots that use the reserved bit to determine if a page can be swapped out to disk, mapped as system memory, or migrated. PageReserved is meant to tell any potential pfn walkers that might get to this struct page to back off and not touch it. Even though ZONE_DEVICE doesn't online pages in traditional sense it makes those pages available for further use so the page reserved bit should be cleared. So from what I can tell that isn't necessarily the case. Specifically if the pagemap type is MEMORY_DEVICE_PRIVATE or MEMORY_DEVICE_PUBLIC both are special cases where the memory may not be accessible to the CPU or cannot be pinned in order to allow for eviction. Could you give me an example please? Honestly I am getting a bit beyond my depth here so maybe Dan could explain better. I am basing the above comment on Dan's earlier comment in this thread combined with the comment that explains the "memory_type" field for the pgmap: https://elixir.bootlin.com/linux/v4.19-rc7/source/include/linux/memremap.h#L28 The specific case that Dan and Yi are referring to is for the type MEMORY_DEVICE_FS_DAX. For that type I could probably look at not setting the reserved bit. Part of me wants to say that we should wait and clear the bit later, but that would end up just adding time back to initialization. At this point I would consider the change more of a follow-up optimization rather than a fix though since this is tailoring things specifically for DAX versus the other ZONE_DEVICE types. 
I thought I have already made it clear that these zone device hacks are not acceptable to the generic hotplug code. If the current reserve bit handling is not correct then give us a specific reason for that and we can start thinking about the proper fix. I might have misunderstood your earlier comment then. I thought you were saying that we shouldn't bother with setting the reserved bit. Now it sounds like you were thinking more along the lines of what I was here in my comment where I thought the bit should be cleared later in some code specifically related to DAX when it is exposing it for use to userspace or KVM. ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
On Wed 10-10-18 09:39:08, Alexander Duyck wrote: > On 10/10/2018 2:58 AM, Michal Hocko wrote: > > On Tue 09-10-18 13:26:41, Alexander Duyck wrote: > > [...] > > > I would think with that being the case we still probably need the call to > > > __SetPageReserved to set the bit with the expectation that it will not be > > > cleared for device-pages since the pages are not onlined. Removing the > > > call > > > to __SetPageReserved would probably introduce a number of regressions as > > > there are multiple spots that use the reserved bit to determine if a page > > > can be swapped out to disk, mapped as system memory, or migrated. > > > > PageReserved is meant to tell any potential pfn walkers that might get > > to this struct page to back off and not touch it. Even though > > ZONE_DEVICE doesn't online pages in traditional sense it makes those > > pages available for further use so the page reserved bit should be > > cleared. > > So from what I can tell that isn't necessarily the case. Specifically if the > pagemap type is MEMORY_DEVICE_PRIVATE or MEMORY_DEVICE_PUBLIC both are > special cases where the memory may not be accessible to the CPU or cannot be > pinned in order to allow for eviction. Could you give me an example please? > The specific case that Dan and Yi are referring to is for the type > MEMORY_DEVICE_FS_DAX. For that type I could probably look at not setting the > reserved bit. Part of me wants to say that we should wait and clear the bit > later, but that would end up just adding time back to initialization. At > this point I would consider the change more of a follow-up optimization > rather than a fix though since this is tailoring things specifically for DAX > versus the other ZONE_DEVICE types. I thought I have already made it clear that these zone device hacks are not acceptable to the generic hotplug code. If the current reserve bit handling is not correct then give us a specific reason for that and we can start thinking about the proper fix. 
-- Michal Hocko SUSE Labs ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
On 10/10/2018 2:58 AM, Michal Hocko wrote: On Tue 09-10-18 13:26:41, Alexander Duyck wrote: [...] I would think with that being the case we still probably need the call to __SetPageReserved to set the bit with the expectation that it will not be cleared for device-pages since the pages are not onlined. Removing the call to __SetPageReserved would probably introduce a number of regressions as there are multiple spots that use the reserved bit to determine if a page can be swapped out to disk, mapped as system memory, or migrated. PageReserved is meant to tell any potential pfn walkers that might get to this struct page to back off and not touch it. Even though ZONE_DEVICE doesn't online pages in traditional sense it makes those pages available for further use so the page reserved bit should be cleared. So from what I can tell that isn't necessarily the case. Specifically if the pagemap type is MEMORY_DEVICE_PRIVATE or MEMORY_DEVICE_PUBLIC both are special cases where the memory may not be accessible to the CPU or cannot be pinned in order to allow for eviction. The specific case that Dan and Yi are referring to is for the type MEMORY_DEVICE_FS_DAX. For that type I could probably look at not setting the reserved bit. Part of me wants to say that we should wait and clear the bit later, but that would end up just adding time back to initialization. At this point I would consider the change more of a follow-up optimization rather than a fix though since this is tailoring things specifically for DAX versus the other ZONE_DEVICE types. ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 1/2] powerpc/pseries: PAPR persistent memory support
On 10/10/2018 01:08 AM, Oliver O'Halloran wrote: > This patch implements support for discovering storage class memory > devices at boot and for handling hotplug of new regions via RTAS > hotplug events. > > Signed-off-by: Oliver O'Halloran > --- > arch/powerpc/include/asm/firmware.h | 3 ++- > arch/powerpc/include/asm/hvcall.h | 10 +- > arch/powerpc/include/asm/rtas.h | 2 ++ > arch/powerpc/kernel/rtasd.c | 2 ++ > arch/powerpc/platforms/pseries/Makefile | 2 +- > arch/powerpc/platforms/pseries/dlpar.c | 4 > arch/powerpc/platforms/pseries/firmware.c | 1 + > arch/powerpc/platforms/pseries/pseries.h | 5 + > arch/powerpc/platforms/pseries/ras.c | 3 ++- > 9 files changed, 28 insertions(+), 4 deletions(-) > > diff --git a/arch/powerpc/include/asm/firmware.h > b/arch/powerpc/include/asm/firmware.h > index 7a051bd21f87..113c64d5d394 100644 > --- a/arch/powerpc/include/asm/firmware.h > +++ b/arch/powerpc/include/asm/firmware.h > @@ -52,6 +52,7 @@ > #define FW_FEATURE_PRRN ASM_CONST(0x0002) > #define FW_FEATURE_DRMEM_V2 ASM_CONST(0x0004) > #define FW_FEATURE_DRC_INFO ASM_CONST(0x0008) > +#define FW_FEATURE_PAPR_SCM ASM_CONST(0x0010) > > #ifndef __ASSEMBLY__ > > @@ -69,7 +70,7 @@ enum { > FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY | > FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN | > FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 | > - FW_FEATURE_DRC_INFO, > + FW_FEATURE_DRC_INFO | FW_FEATURE_PAPR_SCM, > FW_FEATURE_PSERIES_ALWAYS = 0, > FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL, > FW_FEATURE_POWERNV_ALWAYS = 0, > diff --git a/arch/powerpc/include/asm/hvcall.h > b/arch/powerpc/include/asm/hvcall.h > index a0b17f9f1ea4..0e81ef83b35a 100644 > --- a/arch/powerpc/include/asm/hvcall.h > +++ b/arch/powerpc/include/asm/hvcall.h > @@ -295,7 +295,15 @@ > #define H_INT_ESB 0x3C8 > #define H_INT_SYNC 0x3CC > #define H_INT_RESET 0x3D0 > -#define MAX_HCALL_OPCODE H_INT_RESET > +#define H_SCM_READ_METADATA 0x3E4 > +#define H_SCM_WRITE_METADATA 0x3E8 > +#define H_SCM_BIND_MEM 0x3EC > +#define 
H_SCM_UNBIND_MEM 0x3F0 > +#define H_SCM_QUERY_BLOCK_MEM_BINDING 0x3F4 > +#define H_SCM_QUERY_LOGICAL_MEM_BINDING 0x3F8 > +#define H_SCM_MEM_QUERY 0x3FC > +#define H_SCM_BLOCK_CLEAR 0x400 > +#define MAX_HCALL_OPCODE H_SCM_BLOCK_CLEAR > > /* H_VIOCTL functions */ > #define H_GET_VIOA_DUMP_SIZE 0x01 > diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h > index 71e393c46a49..1e81f3d55457 100644 > --- a/arch/powerpc/include/asm/rtas.h > +++ b/arch/powerpc/include/asm/rtas.h > @@ -125,6 +125,7 @@ struct rtas_suspend_me_data { > #define RTAS_TYPE_INFO 0xE2 > #define RTAS_TYPE_DEALLOC 0xE3 > #define RTAS_TYPE_DUMP 0xE4 > +#define RTAS_TYPE_HOTPLUG 0xE5 > /* I don't add PowerMGM events right now, this is a different topic */ > #define RTAS_TYPE_PMGM_POWER_SW_ON 0x60 > #define RTAS_TYPE_PMGM_POWER_SW_OFF 0x61 > @@ -316,6 +317,7 @@ struct pseries_hp_errorlog { > #define PSERIES_HP_ELOG_RESOURCE_MEM 2 > #define PSERIES_HP_ELOG_RESOURCE_SLOT 3 > #define PSERIES_HP_ELOG_RESOURCE_PHB 4 > +#define PSERIES_HP_ELOG_RESOURCE_PMEM 6 > > #define PSERIES_HP_ELOG_ACTION_ADD 1 > #define PSERIES_HP_ELOG_ACTION_REMOVE 2 > diff --git a/arch/powerpc/kernel/rtasd.c b/arch/powerpc/kernel/rtasd.c > index 6fafc82c04b0..fad0baddfcba 100644 > --- a/arch/powerpc/kernel/rtasd.c > +++ b/arch/powerpc/kernel/rtasd.c > @@ -91,6 +91,8 @@ static char *rtas_event_type(int type) > return "Dump Notification Event"; > case RTAS_TYPE_PRRN: > return "Platform Resource Reassignment Event"; > + case RTAS_TYPE_HOTPLUG: > + return "Hotplug Event"; > } > > return rtas_type[0]; > diff --git a/arch/powerpc/platforms/pseries/Makefile > b/arch/powerpc/platforms/pseries/Makefile > index 7e89d5c47068..892b27ced973 100644 > --- a/arch/powerpc/platforms/pseries/Makefile > +++ b/arch/powerpc/platforms/pseries/Makefile > @@ -13,7 +13,7 @@ obj-$(CONFIG_KEXEC_CORE) += kexec.o > obj-$(CONFIG_PSERIES_ENERGY) += pseries_energy.o > > obj-$(CONFIG_HOTPLUG_CPU) += hotplug-cpu.o > -obj-$(CONFIG_MEMORY_HOTPLUG) += 
hotplug-memory.o > +obj-$(CONFIG_MEMORY_HOTPLUG) += hotplug-memory.o pmem.o > > obj-$(CONFIG_HVC_CONSOLE)+= hvconsole.o > obj-$(CONFIG_HVCS) += hvcserver.o > diff --git a/arch/powerpc/platforms/pseries/dlpar.c > b/arch/powerpc/platforms/pseries/dlpar.c > index a0b20c03f078..795996fefdb9 100644 > --- a/arch/powerpc/platforms/pseries/dlpar.c > +++ b/arch/powerpc/platforms/pseries/dlpar.c > @@ -357,6 +357,10 @@ static int
Re: [PATCH v12 00/12] Adding security support for nvdimm
On 10/09/2018 06:35 PM, Williams, Dan J wrote: > On Mon, 2018-10-08 at 16:55 -0700, Dave Jiang wrote: >> The following series implements security support for nvdimm. Mostly >> adding >> new security DSM support from the Intel NVDIMM DSM spec v1.7, but >> also >> adding generic support to libnvdimm for other vendors. The most >> important >> security features are unlocking locked nvdimms, and updating/setting the >> security >> passphrase on nvdimms. >> >> v12: >> - Add a mutex for the cached key and remove key_get/key_put messiness >> (Dan) >> - Move security code to its own C file and wrap under >> CONFIG_NVDIMM_SECURITY >> in order to fix an issue reported by the 0-day build without CONFIG_KEYS. > > Going over this a bit more closely, here are some proposed cleanups / > fixes: > > * remove nvdimm_get_key() and just rely on nvdimm->key being not NULL > under the key_mutex > > * open code nvdimm_check_key_len() checks > > * make all the security op function signatures take an nvdimm rather > than a dev > > * move the release of nvdimm->key to nvdimm_remove() rather than > nvdimm_release(), i.e. driver unload vs nvdimm_bus de-registration. > > * rename nvdimm_replace_key to make_kernel_key > > * clean up nvdimm_security_change_key() to be a bit more readable and > fix a missing key_put() > > > Let me know if these look acceptable and I can fold them into the patch > set. Looks good. Thanks!
> > diff --git a/drivers/nvdimm/dimm.c b/drivers/nvdimm/dimm.c > index b6381ddbd6c1..429c544e7ac5 100644 > --- a/drivers/nvdimm/dimm.c > +++ b/drivers/nvdimm/dimm.c > @@ -23,6 +23,7 @@ > > static int nvdimm_probe(struct device *dev) > { > + struct nvdimm *nvdimm = to_nvdimm(dev); > struct nvdimm_drvdata *ndd; > int rc; > > @@ -51,10 +52,10 @@ static int nvdimm_probe(struct device *dev) > get_device(dev); > kref_init(&ndd->kref); > > - nvdimm_security_get_state(dev); > + nvdimm_security_get_state(nvdimm); > > /* unlock DIMM here before touch label */ > - rc = nvdimm_security_unlock_dimm(dev); > + rc = nvdimm_security_unlock_dimm(nvdimm); > if (rc < 0) > dev_warn(dev, "failed to unlock dimm %s\n", dev_name(dev)); > > @@ -115,6 +116,9 @@ static int nvdimm_probe(struct device *dev) > static int nvdimm_remove(struct device *dev) > { > struct nvdimm_drvdata *ndd = dev_get_drvdata(dev); > + struct nvdimm *nvdimm = to_nvdimm(dev); > + > + nvdimm_security_release(nvdimm); > > if (!ndd) > return 0; > diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c > index 2d9915378bbc..84bec3bb025e 100644 > --- a/drivers/nvdimm/dimm_devs.c > +++ b/drivers/nvdimm/dimm_devs.c > @@ -213,13 +213,7 @@ void nvdimm_clear_locked(struct device *dev) > static void nvdimm_release(struct device *dev) > { > struct nvdimm *nvdimm = to_nvdimm(dev); > - struct key *key; > > - mutex_lock(&nvdimm->key_mutex); > - key = nvdimm_get_key(dev); > - if (key) > - key_put(key); > - mutex_unlock(&nvdimm->key_mutex); > ida_simple_remove(&dimm_ida, nvdimm->id); > kfree(nvdimm); > } > @@ -436,20 +430,20 @@ static ssize_t security_store(struct device *dev, > if (rc != 3) > return -EINVAL; > dev_dbg(dev, "update %#x %#x\n", old_key, new_key); > - rc = nvdimm_security_change_key(dev, old_key, new_key); > + rc = nvdimm_security_change_key(nvdimm, old_key, new_key); > } else if (sysfs_streq(cmd, "disable")) { > if (rc != 2) > return -EINVAL; > dev_dbg(dev, "disable %#x\n", old_key); > - rc = nvdimm_security_disable(dev, old_key); 
> + rc = nvdimm_security_disable(nvdimm, old_key); > } else if (sysfs_streq(buf, "freeze")) { > dev_dbg(dev, "freeze\n"); > - rc = nvdimm_security_freeze_lock(dev); > + rc = nvdimm_security_freeze_lock(nvdimm); > } else if (sysfs_streq(cmd, "erase")) { > if (rc != 2) > return -EINVAL; > dev_dbg(dev, "erase %#x\n", old_key); > - rc = nvdimm_security_erase(dev, old_key); > + rc = nvdimm_security_erase(nvdimm, old_key); > } else > return -EINVAL; > > diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h > index 08e442632a2d..45fcaf45ba2c 100644 > --- a/drivers/nvdimm/nd.h > +++ b/drivers/nvdimm/nd.h > @@ -425,48 +425,47 @@ const u8 *nd_dev_to_uuid(struct device *dev); > bool pmem_should_map_pages(struct device *dev); > > #ifdef CONFIG_NVDIMM_SECURITY > -struct key *nvdimm_get_key(struct device *dev); > -int nvdimm_security_unlock_dimm(struct device *dev); > -int nvdimm_security_get_state(struct device *dev); > -int nvdimm_security_change_key(struct device *dev, unsigned int old_keyid, > +int nvdimm_security_unlock_dimm(struct nvdimm *nvdimm); > +void nvdimm_security_release(struct nvdimm
Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
On 10/10/2018 5:52 AM, Yi Zhang wrote: On 2018-10-09 at 14:19:32 -0700, Dan Williams wrote: On Tue, Oct 9, 2018 at 1:34 PM Alexander Duyck wrote: On 10/9/2018 11:04 AM, Dan Williams wrote: On Tue, Oct 9, 2018 at 3:21 AM Yi Zhang wrote: [..] That comment is incorrect, device-pages are never onlined. So I think we can just skip that call to __SetPageReserved() unless the memory range is MEMORY_DEVICE_{PRIVATE,PUBLIC}. When pages are "onlined" via __free_pages_boot_core they clear the reserved bit, that is the reason for the comment. The reserved bit is meant to indicate that the page cannot be swapped out or moved based on the description of the bit. ...but ZONE_DEVICE pages are never onlined so I would expect memmap_init_zone_device() to know that detail. I would think with that being the case we still probably need the call to __SetPageReserved to set the bit with the expectation that it will not be cleared for device-pages since the pages are not onlined. Removing the call to __SetPageReserved would probably introduce a number of regressions as there are multiple spots that use the reserved bit to determine if a page can be swapped out to disk, mapped as system memory, or migrated. Another things, it seems page_init/set_reserved already been done in the move_pfn_range_to_zone |-->memmap_init_zone |-->for_each_page_in_pfn |-->__init_single_page |-->SetPageReserved Why we haven't remove these redundant initial in memmap_init_zone? Correct me if I missed something. In this case it isn't redundant as only the vmmemmap pages are initialized in memmap_init_zone now. So all of the pages that are going to be used as device pages are not initialized until the call to memmap_init_zone_device. What I did is split the initialization of the pages into two parts in order to allow us to initialize the pages outside of the hotplug lock. Right, this is what Yi is working on... the PageReserved flag is problematic for KVM. 
Auditing those locations it seems as long as we teach hibernation to avoid ZONE_DEVICE ranges we can safely not set the reserved flag for DAX pages. What I'm trying to avoid is a local KVM hack to check for DAX pages when the Reserved flag is not otherwise needed. Thanks Dan. Provide the patch link. https://lore.kernel.org/lkml/cover.1536342881.git.yi.z.zh...@linux.intel.com So it looks like your current logic is just working around the bit then since it just allows for reserved DAX pages.
Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
On Tue 09-10-18 13:26:41, Alexander Duyck wrote: [...] > I would think with that being the case we still probably need the call to > __SetPageReserved to set the bit with the expectation that it will not be > cleared for device-pages since the pages are not onlined. Removing the call > to __SetPageReserved would probably introduce a number of regressions as > there are multiple spots that use the reserved bit to determine if a page > can be swapped out to disk, mapped as system memory, or migrated. PageReserved is meant to tell any potential pfn walkers that might get to this struct page to back off and not touch it. Even though ZONE_DEVICE doesn't online pages in the traditional sense, it makes those pages available for further use, so the page reserved bit should be cleared. -- Michal Hocko SUSE Labs
Re: [PATCH] mm: Preserve _PAGE_DEVMAP across mprotect() calls
On Tue 09-10-18 10:55:14, Dan Williams wrote: > On Tue, Oct 9, 2018 at 3:19 AM Jan Kara wrote: > > > > Currently _PAGE_DEVMAP bit is not preserved in mprotect(2) calls. As a > > result we will see warnings such as: > > > > BUG: Bad page map in process JobWrk0013 pte:81803875ea25 pmd:7624381067 > > addr:7f093072 vm_flags:28f9 anon_vma: (null) > > mapping:97f2384056f0 index:0 > > file:457-00fe0030-0009-00ca-0001_2001.fileblock > > fault:xfs_filemap_fault [xfs] mmap:xfs_file_mmap [xfs] readpage: > > (null) > > CPU: 3 PID: 15848 Comm: JobWrk0013 Tainted: GW > > 4.12.14-2.g7573215-default #1 SLE12-SP4 (unreleased) > > Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS > > SE5C620.86B.01.00.0833.051120182255 05/11/2018 > > Call Trace: > > dump_stack+0x5a/0x75 > > print_bad_pte+0x217/0x2c0 > > ? enqueue_task_fair+0x76/0x9f0 > > _vm_normal_page+0xe5/0x100 > > zap_pte_range+0x148/0x740 > > unmap_page_range+0x39a/0x4b0 > > unmap_vmas+0x42/0x90 > > unmap_region+0x99/0xf0 > > ? vma_gap_callbacks_rotate+0x1a/0x20 > > do_munmap+0x255/0x3a0 > > vm_munmap+0x54/0x80 > > SyS_munmap+0x1d/0x30 > > do_syscall_64+0x74/0x150 > > entry_SYSCALL_64_after_hwframe+0x3d/0xa2 > > ... > > > > when mprotect(2) gets used on DAX mappings. Also there is a wide variety > > of other failures that can result from the missing _PAGE_DEVMAP flag > > when the area gets used by get_user_pages() later. > > > > Fix the problem by including _PAGE_DEVMAP in a set of flags that get > > preserved by mprotect(2). > > > > Fixes: 69660fd797c3 ("x86, mm: introduce _PAGE_DEVMAP") > > Fixes: ebd31197931d ("powerpc/mm: Add devmap support for ppc64") > > CC: sta...@vger.kernel.org > > Signed-off-by: Jan Kara > > Looks good, do you want me to take this upstream along with the livelock fix? Yes, I think that would be best. Thanks! Honza -- Jan Kara SUSE Labs, CR
Re: [PATCH v5 4/4] mm: Defer ZONE_DEVICE page initialization to the point where we init pgmap
On 2018-10-09 at 14:19:32 -0700, Dan Williams wrote: > On Tue, Oct 9, 2018 at 1:34 PM Alexander Duyck > wrote: > > > > On 10/9/2018 11:04 AM, Dan Williams wrote: > > > On Tue, Oct 9, 2018 at 3:21 AM Yi Zhang > > > wrote: > [..] > > > That comment is incorrect, device-pages are never onlined. So I think > > > we can just skip that call to __SetPageReserved() unless the memory > > > range is MEMORY_DEVICE_{PRIVATE,PUBLIC}. > > > > > > > When pages are "onlined" via __free_pages_boot_core they clear the > > reserved bit, that is the reason for the comment. The reserved bit is > > meant to indicate that the page cannot be swapped out or moved based on > > the description of the bit. > > ...but ZONE_DEVICE pages are never onlined so I would expect > memmap_init_zone_device() to know that detail. > > > I would think with that being the case we still probably need the call > > to __SetPageReserved to set the bit with the expectation that it will > > not be cleared for device-pages since the pages are not onlined. > > Removing the call to __SetPageReserved would probably introduce a number > > of regressions as there are multiple spots that use the reserved bit to > > determine if a page can be swapped out to disk, mapped as system memory, > > or migrated. Another things, it seems page_init/set_reserved already been done in the move_pfn_range_to_zone |-->memmap_init_zone |-->for_each_page_in_pfn |-->__init_single_page |-->SetPageReserved Why we haven't remove these redundant initial in memmap_init_zone? Correct me if I missed something. > > Right, this is what Yi is working on... the PageReserved flag is > problematic for KVM. Auditing those locations it seems as long as we > teach hibernation to avoid ZONE_DEVICE ranges we can safely not set > the reserved flag for DAX pages. What I'm trying to avoid is a local > KVM hack to check for DAX pages when the Reserved flag is not > otherwise needed. Thanks Dan. Provide the patch link. 
https://lore.kernel.org/lkml/cover.1536342881.git.yi.z.zh...@linux.intel.com
PAPR SCM support
This series adds support for the para-virtualised storage class memory interface defined by the Power Architecture Platform Reference. Patch 1 implements the pseries device discovery (via DT) and hotplug support (via RTAS hotplug interrupt). Patch 2 implements a driver that binds to the platform devices, does the necessary H-calls to activate the region and configures the libnvdimm interface. It also adds an NDCTL implementation to allow the "metadata" space associated with an SCM region to be used as label space. This should go in via the ppc tree, but an ack from Dan for patch two would be appreciated.
[PATCH 2/2] powerpc/pseries: Add driver for PAPR SCM regions
Adds a driver that implements support for enabling and accessing PAPR SCM regions. Unfortunately due to how the PAPR interface works we can't use the existing of_pmem driver (yet) because: a) The guest is required to use the H_SCM_BIND_MEM h-call to add the SCM region to its physical address space, and b) There is currently no mechanism for relating a bare of_pmem region to the backing DIMM (or not-a-DIMM for our case). Both of these are easily handled by rolling the functionality into a separate driver so here we are... Signed-off-by: Oliver O'Halloran --- The alternative implementation here is that we have the pseries code do the h-calls and craft a pmem-region@ node based on that. However, that doesn't solve b) and mpe has expressed his dislike of adding new stuff to the DT at runtime so I'd say that's a non-starter. --- arch/powerpc/platforms/pseries/Kconfig| 7 + arch/powerpc/platforms/pseries/Makefile | 1 + arch/powerpc/platforms/pseries/papr_scm.c | 340 ++ 3 files changed, 348 insertions(+) create mode 100644 arch/powerpc/platforms/pseries/papr_scm.c diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig index 0c698fd6d491..4b0fcb80efe5 100644 --- a/arch/powerpc/platforms/pseries/Kconfig +++ b/arch/powerpc/platforms/pseries/Kconfig @@ -140,3 +140,10 @@ config IBMEBUS bool "Support for GX bus based adapters" help Bus device driver for GX bus based adapters. + +config PAPR_SCM + depends on PPC_PSERIES && MEMORY_HOTPLUG + select LIBNVDIMM + tristate "Support for the PAPR Storage Class Memory interface" + help + Enable access to hypervisor provided storage class memory. 
diff --git a/arch/powerpc/platforms/pseries/Makefile b/arch/powerpc/platforms/pseries/Makefile index 892b27ced973..a43ec843c8e2 100644 --- a/arch/powerpc/platforms/pseries/Makefile +++ b/arch/powerpc/platforms/pseries/Makefile @@ -24,6 +24,7 @@ obj-$(CONFIG_IO_EVENT_IRQ)+= io_event_irq.o obj-$(CONFIG_LPARCFG) += lparcfg.o obj-$(CONFIG_IBMVIO) += vio.o obj-$(CONFIG_IBMEBUS) += ibmebus.o +obj-$(CONFIG_PAPR_SCM) += papr_scm.o ifdef CONFIG_PPC_PSERIES obj-$(CONFIG_SUSPEND) += suspend.o diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c new file mode 100644 index ..87d4dbc18845 --- /dev/null +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -0,0 +1,340 @@ +// SPDX-License-Identifier: GPL-2.0 + +#define pr_fmt(fmt)"papr-scm: " fmt + +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#define BIND_ANY_ADDR (~0ul) + +#define PAPR_SCM_DIMM_CMD_MASK \ + ((1ul << ND_CMD_GET_CONFIG_SIZE) | \ +(1ul << ND_CMD_GET_CONFIG_DATA) | \ +(1ul << ND_CMD_SET_CONFIG_DATA)) + +struct papr_scm_priv { + struct platform_device *pdev; + struct device_node *dn; + uint32_t drc_index; + uint64_t blocks; + uint64_t block_size; + int metadata_size; + + uint64_t bound_addr; + + struct nvdimm_bus_descriptor bus_desc; + struct nvdimm_bus *bus; + struct nvdimm *nvdimm; + struct resource res; + struct nd_region *region; + struct nd_interleave_set nd_set; +}; + +static int drc_pmem_bind(struct papr_scm_priv *p) +{ + unsigned long ret[PLPAR_HCALL_BUFSIZE]; + uint64_t rc, token; + + /* +* When the hypervisor cannot map all the requested memory in a single +* hcall it returns H_BUSY and we call again with the token until +* we get H_SUCCESS. Aborting the retry loop before getting H_SUCCESS +* leave the system in an undefined state, so we wait. 
+*/ + token = 0; + + do { + rc = plpar_hcall(H_SCM_BIND_MEM, ret, p->drc_index, 0, + p->blocks, BIND_ANY_ADDR, token); + token = be64_to_cpu(ret[0]); + } while (rc == H_BUSY); + + if (rc) { + dev_err(&p->pdev->dev, "bind err: %lld\n", rc); + return -ENXIO; + } + + p->bound_addr = be64_to_cpu(ret[1]); + + dev_dbg(&p->pdev->dev, "bound drc %x to %pR\n", p->drc_index, &p->res); + + return 0; +} + +static int drc_pmem_unbind(struct papr_scm_priv *p) +{ + unsigned long ret[PLPAR_HCALL_BUFSIZE]; + uint64_t rc, token; + + token = 0; + + /* NB: unbind has the same retry requirements mentioned above */ + do { + rc = plpar_hcall(H_SCM_UNBIND_MEM, ret, p->drc_index, + p->bound_addr, p->blocks, token); + token = be64_to_cpu(ret[0]); + } while (rc == H_BUSY); + + if (rc) + dev_err(&p->pdev->dev, "unbind error: %lld\n", rc); + + return !!rc; +} + +static int papr_scm_meta_get(struct papr_scm_priv *p, + struct
[PATCH 1/2] powerpc/pseries: PAPR persistent memory support
This patch implements support for discovering storage class memory devices at boot and for handling hotplug of new regions via RTAS hotplug events. Signed-off-by: Oliver O'Halloran --- arch/powerpc/include/asm/firmware.h | 3 ++- arch/powerpc/include/asm/hvcall.h | 10 +- arch/powerpc/include/asm/rtas.h | 2 ++ arch/powerpc/kernel/rtasd.c | 2 ++ arch/powerpc/platforms/pseries/Makefile | 2 +- arch/powerpc/platforms/pseries/dlpar.c| 4 arch/powerpc/platforms/pseries/firmware.c | 1 + arch/powerpc/platforms/pseries/pseries.h | 5 + arch/powerpc/platforms/pseries/ras.c | 3 ++- 9 files changed, 28 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h index 7a051bd21f87..113c64d5d394 100644 --- a/arch/powerpc/include/asm/firmware.h +++ b/arch/powerpc/include/asm/firmware.h @@ -52,6 +52,7 @@ #define FW_FEATURE_PRRNASM_CONST(0x0002) #define FW_FEATURE_DRMEM_V2ASM_CONST(0x0004) #define FW_FEATURE_DRC_INFOASM_CONST(0x0008) +#define FW_FEATURE_PAPR_SCMASM_CONST(0x0010) #ifndef __ASSEMBLY__ @@ -69,7 +70,7 @@ enum { FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY | FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN | FW_FEATURE_HPT_RESIZE | FW_FEATURE_DRMEM_V2 | - FW_FEATURE_DRC_INFO, + FW_FEATURE_DRC_INFO | FW_FEATURE_PAPR_SCM, FW_FEATURE_PSERIES_ALWAYS = 0, FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL, FW_FEATURE_POWERNV_ALWAYS = 0, diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h index a0b17f9f1ea4..0e81ef83b35a 100644 --- a/arch/powerpc/include/asm/hvcall.h +++ b/arch/powerpc/include/asm/hvcall.h @@ -295,7 +295,15 @@ #define H_INT_ESB 0x3C8 #define H_INT_SYNC 0x3CC #define H_INT_RESET 0x3D0 -#define MAX_HCALL_OPCODE H_INT_RESET +#define H_SCM_READ_METADATA 0x3E4 +#define H_SCM_WRITE_METADATA0x3E8 +#define H_SCM_BIND_MEM 0x3EC +#define H_SCM_UNBIND_MEM0x3F0 +#define H_SCM_QUERY_BLOCK_MEM_BINDING 0x3F4 +#define H_SCM_QUERY_LOGICAL_MEM_BINDING 0x3F8 +#define H_SCM_MEM_QUERY0x3FC +#define 
H_SCM_BLOCK_CLEAR 0x400 +#define MAX_HCALL_OPCODE H_SCM_BLOCK_CLEAR /* H_VIOCTL functions */ #define H_GET_VIOA_DUMP_SIZE 0x01 diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h index 71e393c46a49..1e81f3d55457 100644 --- a/arch/powerpc/include/asm/rtas.h +++ b/arch/powerpc/include/asm/rtas.h @@ -125,6 +125,7 @@ struct rtas_suspend_me_data { #define RTAS_TYPE_INFO 0xE2 #define RTAS_TYPE_DEALLOC 0xE3 #define RTAS_TYPE_DUMP 0xE4 +#define RTAS_TYPE_HOTPLUG 0xE5 /* I don't add PowerMGM events right now, this is a different topic */ #define RTAS_TYPE_PMGM_POWER_SW_ON 0x60 #define RTAS_TYPE_PMGM_POWER_SW_OFF0x61 @@ -316,6 +317,7 @@ struct pseries_hp_errorlog { #define PSERIES_HP_ELOG_RESOURCE_MEM 2 #define PSERIES_HP_ELOG_RESOURCE_SLOT 3 #define PSERIES_HP_ELOG_RESOURCE_PHB 4 +#define PSERIES_HP_ELOG_RESOURCE_PMEM 6 #define PSERIES_HP_ELOG_ACTION_ADD 1 #define PSERIES_HP_ELOG_ACTION_REMOVE 2 diff --git a/arch/powerpc/kernel/rtasd.c b/arch/powerpc/kernel/rtasd.c index 6fafc82c04b0..fad0baddfcba 100644 --- a/arch/powerpc/kernel/rtasd.c +++ b/arch/powerpc/kernel/rtasd.c @@ -91,6 +91,8 @@ static char *rtas_event_type(int type) return "Dump Notification Event"; case RTAS_TYPE_PRRN: return "Platform Resource Reassignment Event"; + case RTAS_TYPE_HOTPLUG: + return "Hotplug Event"; } return rtas_type[0]; diff --git a/arch/powerpc/platforms/pseries/Makefile b/arch/powerpc/platforms/pseries/Makefile index 7e89d5c47068..892b27ced973 100644 --- a/arch/powerpc/platforms/pseries/Makefile +++ b/arch/powerpc/platforms/pseries/Makefile @@ -13,7 +13,7 @@ obj-$(CONFIG_KEXEC_CORE) += kexec.o obj-$(CONFIG_PSERIES_ENERGY) += pseries_energy.o obj-$(CONFIG_HOTPLUG_CPU) += hotplug-cpu.o -obj-$(CONFIG_MEMORY_HOTPLUG) += hotplug-memory.o +obj-$(CONFIG_MEMORY_HOTPLUG) += hotplug-memory.o pmem.o obj-$(CONFIG_HVC_CONSOLE) += hvconsole.o obj-$(CONFIG_HVCS) += hvcserver.o diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c index 
a0b20c03f078..795996fefdb9 100644 --- a/arch/powerpc/platforms/pseries/dlpar.c +++ b/arch/powerpc/platforms/pseries/dlpar.c @@ -357,6 +357,10 @@ static int handle_dlpar_errorlog(struct pseries_hp_errorlog *hp_elog) case PSERIES_HP_ELOG_RESOURCE_CPU: rc = dlpar_cpu(hp_elog); break; + case PSERIES_HP_ELOG_RESOURCE_PMEM: +