Re: [GIT PULL] Compute Express Linux (CXL) for v5.12-rc1
As much as I'd love to be working on "Compute Express Linux" the subject should have read "Compute Express Link".

On Tue, Feb 23, 2021 at 8:05 PM Dan Williams wrote:
>
> Hi Linus, please pull from:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/cxl-for-5.12
>
> ...to receive an initial driver for CXL 2.0 Memory Devices. Technical
> details are in the tag message and Documentation/. I am taking this
> through nvdimm.git this first cycle until the cxl.git repository and
> maintainer team can be set up on git.kernel.org.
>
> In terms of why merge this initial driver now, it establishes just
> enough functionality to enumerate these devices and issue all
> administrative commands. It sets a v5.12 baseline to develop the more
> complicated higher order functionality like memory device
> interleaving, persistent memory support, and hotplug which entangle
> with ACPI, LIBNVDIMM, and PCI.
>
> The focus of this release is establishing the ioctl UAPI for the
> management commands. Similar to NVME there are a set of standard
> commands as well as the possibility for vendor specific commands.
> Unlike the NVME driver the CXL driver does not enable vendor specific
> command functionality by default. This conservatism is out of concern
> for the fact that CXL interleaves memory across devices and implements
> host memory. The system integrity implications of some commands are
> more severe than NVME and vendor specific functionality is mostly
> unauditable. This will be an ongoing topic of discussion with the
> wider CXL community for next few months.
>
> The driver has been developed in the open since November against a
> work-in-progress QEMU emulation of the CXL device model. That QEMU
> effort has recently attracted contributions from multiple hardware
> vendors.
>
> The driver has appeared in -next. It collected some initial static
> analysis fixes and build-robot reports, but all quiet in -next for the
> past week. A list of review tags that arrived after the branch for
> -next was cut is appended to the tag message below.
>
> ---
>
> The following changes since commit 1048ba83fb1c00cd24172e23e8263972f6b5d9ac:
>
>   Linux 5.11-rc6 (2021-01-31 13:50:09 -0800)
>
> are available in the Git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/cxl-for-5.12
>
> for you to fetch changes up to 88ff5d466c0250259818f3153dbdc4af1f8615dd:
>
>   cxl/mem: Fix potential memory leak (2021-02-22 14:44:39 -0800)
>
> cxl for 5.12
>
> Introduce an initial driver for CXL 2.0 Type-3 Memory Devices. CXL is
> Compute Express Link which released the 2.0 specification in November.
> The Linux relevant changes in CXL 2.0 are support for an OS to
> dynamically assign address space to memory devices, support for
> switches, persistent memory, and hotplug. A Type-3 Memory Device is a
> PCI enumerated device presenting the CXL Memory Device Class Code and
> implementing the CXL.mem protocol. CXL.mem allows a device to advertise
> CPU and I/O coherent memory to the system, i.e. typical "System RAM" and
> "Persistent Memory" in Linux /proc/iomem terms.
>
> In addition to the CXL.mem fast path there is an administrative command
> hardware mailbox interface for maintenance and provisioning. It is this
> command interface that is the focus of the initial driver. With this
> driver a CXL device that is mapped by the BIOS can be administered by
> Linux. Linux support for CXL PMEM and dynamic CXL address space
> management are to be implemented post v5.12.
>
> 4cdadfd5e0a7 cxl/mem: Introduce a driver for CXL-2.0-Type-3 endpoints
>   Reviewed-by: Konrad Rzeszutek Wilk
> 8adaf747c9f0 cxl/mem: Find device capabilities
>   Reviewed-by: Jonathan Cameron
> b39cb1052a5c cxl/mem: Register CXL memX devices
>   Reviewed-by: Jonathan Cameron
> 13237183c735 cxl/mem: Add a "RAW" send command
>   Reviewed-by: Konrad Rzeszutek Wilk
> 472b1ce6e9d6 cxl/mem: Enable commands via CEL
>   Reviewed-by: Konrad Rzeszutek Wilk
> 57ee605b976c cxl/mem: Add set of informational commands
>   Reviewed-by: Konrad Rzeszutek Wilk
>
> Ben Widawsky (7):
>       cxl/mem: Find device capabilities
>       cxl/mem: Add basic IOCTL interface
>       cxl/mem: Add a "RAW" send command
>       cxl/mem: Enable commands via CEL
>       cxl/mem: Add set of informational commands
>       MAINTAINERS: Add maintainers of the CXL driver
>       cxl/mem: Fix potential memory leak
>
> Dan Carpenter (1):
>       cxl/mem: Return -EFAULT if copy_to_user() fails
>
> Dan Williams (2):
>       cxl/mem: Introduce a driver for CXL-2.0-Type-3 endpoints
>       cxl/mem: Register CXL memX devices
>
>  .clang-format                                      |  1 +
>  Documentation/ABI/testing/sysfs-bus-cxl            | 26 +
>  Documentation/driver-api/cxl/index.rst             | 12 +
>
[GIT PULL] Compute Express Linux (CXL) for v5.12-rc1
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/cxl-for-5.12

...to receive an initial driver for CXL 2.0 Memory Devices. Technical
details are in the tag message and Documentation/. I am taking this
through nvdimm.git this first cycle until the cxl.git repository and
maintainer team can be set up on git.kernel.org.

In terms of why merge this initial driver now, it establishes just
enough functionality to enumerate these devices and issue all
administrative commands. It sets a v5.12 baseline to develop the more
complicated higher-order functionality like memory device interleaving,
persistent memory support, and hotplug, which entangle with ACPI,
LIBNVDIMM, and PCI.

The focus of this release is establishing the ioctl UAPI for the
management commands. Similar to NVME there is a set of standard commands
as well as the possibility of vendor-specific commands. Unlike the NVME
driver, the CXL driver does not enable vendor-specific command
functionality by default. This conservatism is out of concern for the
fact that CXL interleaves memory across devices and implements host
memory. The system-integrity implications of some commands are more
severe than NVME, and vendor-specific functionality is mostly
unauditable. This will be an ongoing topic of discussion with the wider
CXL community for the next few months.

The driver has been developed in the open since November against a
work-in-progress QEMU emulation of the CXL device model. That QEMU
effort has recently attracted contributions from multiple hardware
vendors.

The driver has appeared in -next. It collected some initial
static-analysis fixes and build-robot reports, but all has been quiet in
-next for the past week. A list of review tags that arrived after the
branch for -next was cut is appended to the tag message below.
---

The following changes since commit 1048ba83fb1c00cd24172e23e8263972f6b5d9ac:

  Linux 5.11-rc6 (2021-01-31 13:50:09 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/cxl-for-5.12

for you to fetch changes up to 88ff5d466c0250259818f3153dbdc4af1f8615dd:

  cxl/mem: Fix potential memory leak (2021-02-22 14:44:39 -0800)

cxl for 5.12

Introduce an initial driver for CXL 2.0 Type-3 Memory Devices. CXL is
Compute Express Link, which released the 2.0 specification in November.
The Linux-relevant changes in CXL 2.0 are support for an OS to
dynamically assign address space to memory devices, support for
switches, persistent memory, and hotplug. A Type-3 Memory Device is a
PCI-enumerated device presenting the CXL Memory Device Class Code and
implementing the CXL.mem protocol. CXL.mem allows a device to advertise
CPU- and I/O-coherent memory to the system, i.e. typical "System RAM"
and "Persistent Memory" in Linux /proc/iomem terms.

In addition to the CXL.mem fast path there is an administrative-command
hardware mailbox interface for maintenance and provisioning. It is this
command interface that is the focus of the initial driver. With this
driver a CXL device that is mapped by the BIOS can be administered by
Linux. Linux support for CXL PMEM and dynamic CXL address-space
management are to be implemented post v5.12.
4cdadfd5e0a7 cxl/mem: Introduce a driver for CXL-2.0-Type-3 endpoints
  Reviewed-by: Konrad Rzeszutek Wilk
8adaf747c9f0 cxl/mem: Find device capabilities
  Reviewed-by: Jonathan Cameron
b39cb1052a5c cxl/mem: Register CXL memX devices
  Reviewed-by: Jonathan Cameron
13237183c735 cxl/mem: Add a "RAW" send command
  Reviewed-by: Konrad Rzeszutek Wilk
472b1ce6e9d6 cxl/mem: Enable commands via CEL
  Reviewed-by: Konrad Rzeszutek Wilk
57ee605b976c cxl/mem: Add set of informational commands
  Reviewed-by: Konrad Rzeszutek Wilk

Ben Widawsky (7):
      cxl/mem: Find device capabilities
      cxl/mem: Add basic IOCTL interface
      cxl/mem: Add a "RAW" send command
      cxl/mem: Enable commands via CEL
      cxl/mem: Add set of informational commands
      MAINTAINERS: Add maintainers of the CXL driver
      cxl/mem: Fix potential memory leak

Dan Carpenter (1):
      cxl/mem: Return -EFAULT if copy_to_user() fails

Dan Williams (2):
      cxl/mem: Introduce a driver for CXL-2.0-Type-3 endpoints
      cxl/mem: Register CXL memX devices

 .clang-format                                      |  1 +
 Documentation/ABI/testing/sysfs-bus-cxl            | 26 +
 Documentation/driver-api/cxl/index.rst             | 12 +
 Documentation/driver-api/cxl/memory-devices.rst    | 46 +
 Documentation/driver-api/index.rst                 |  1 +
 Documentation/userspace-api/ioctl/ioctl-number.rst |  1 +
 MAINTAINERS                                        | 11 +
 drivers/Kconfig                                    |  1 +
 drivers/Makefile                                   |  1 +
 drivers/cxl/Kconfig
[GIT PULL] libnvdimm + device-dax for v5.12-rc1
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/libnvdimm-for-5.12

...to receive some miscellaneous cleanups and a fix for v5.12. This
mainly continues the kernel-wide effort to remove the return code from
the remove() callback in the driver model. The fix addresses a
return-code polarity typo in the new sysfs attribute to manually specify
a device-dax instance mapping range.

This has all appeared in -next with no reported issues.

---

The following changes since commit 1048ba83fb1c00cd24172e23e8263972f6b5d9ac:

  Linux 5.11-rc6 (2021-01-31 13:50:09 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/libnvdimm-for-5.12

for you to fetch changes up to 64ffe84320745ea836555ad207ebfb0e896b6167:

  Merge branch 'for-5.12/dax' into for-5.12/libnvdimm (2021-02-23 18:13:45 -0800)

libnvdimm + device-dax for 5.12

- Fix the error code polarity for the device-dax/mapping attribute

- For the device-dax and libnvdimm bus implementations, stop
  implementing a useless return code for the remove() callback.
- Miscellaneous cleanups

Dan Williams (1):
      Merge branch 'for-5.12/dax' into for-5.12/libnvdimm

Shiyang Ruan (1):
      device-dax: Fix default return code of range_parse()

Uwe Kleine-König (7):
      libnvdimm/dimm: Simplify nvdimm_remove()
      libnvdimm: Make remove callback return void
      device-dax: Prevent registering drivers without probe callback
      device-dax: Properly handle drivers without remove callback
      device-dax: Fix error path in dax_driver_register
      device-dax: Drop an empty .remove callback
      dax-device: Make remove callback return void

 drivers/dax/bus.c         | 24 +---
 drivers/dax/bus.h         |  2 +-
 drivers/dax/device.c      |  8 +---
 drivers/dax/kmem.c        |  7 ++-
 drivers/dax/pmem/compat.c |  3 +--
 drivers/nvdimm/blk.c      |  3 +--
 drivers/nvdimm/bus.c      | 13 +
 drivers/nvdimm/dimm.c     |  7 +--
 drivers/nvdimm/pmem.c     |  4 +---
 drivers/nvdimm/region.c   |  4 +---
 include/linux/nd.h        |  2 +-
 11 files changed, 36 insertions(+), 41 deletions(-)

___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
On Tue, Feb 23, 2021 at 5:00 PM Jason Gunthorpe wrote:
>
> On Tue, Feb 23, 2021 at 04:14:01PM -0800, Dan Williams wrote:
> > [ add Ralph ]
> >
> > On Tue, Feb 23, 2021 at 3:07 PM Jason Gunthorpe wrote:
> > >
> > > On Tue, Feb 23, 2021 at 02:48:20PM -0800, Dan Williams wrote:
> > > > On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe wrote:
> > > > >
> > > > > On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote:
> > > > > >
> > > > > > > The downside would be one extra lookup in dev_pagemap tree
> > > > > > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> > > > > > > per gup-fast() call.
> > > > > >
> > > > > > I'd guess a dev_pagemap lookup is faster than a get_user_pages
> > > > > > slow path. It should be measurable that this change is at least
> > > > > > as fast or faster than falling back to the slow path, but it
> > > > > > would be good to measure.
> > > > >
> > > > > What is the dev_pagemap thing doing in gup fast anyhow?
> > > > >
> > > > > I've been wondering for a while..
> > > >
> > > > It's there to synchronize against dax-device removal. The device
> > > > will suspend removal awaiting all page references to be dropped,
> > > > but gup-fast could be racing device removal. So gup-fast checks for
> > > > pte_devmap() to grab a live reference to the device before assuming
> > > > it can pin a page.
> > >
> > > From the perspective of CPU A it can't tell if CPU B is doing a HW
> > > page table walk or a GUP fast when it invalidates a page table. The
> > > design of gup-fast is supposed to be the same as the design of a HW
> > > page table walk, and the tlb invalidate CPU A does when removing a
> > > page from a page table is supposed to serialize against both a HW
> > > page table walk and gup-fast.
> > >
> > > Given that the HW page table walker does not do dev_pagemap stuff,
> > > why does gup-fast?
> >
> > gup-fast historically assumed that the 'struct page' and memory
> > backing the page-table walk could not physically be removed from the
> > system during its walk because those pages were allocated from the
> > page allocator before being mapped into userspace.
>
> No, I'd say gup-fast assumes that any non-special PTE it finds in a
> page table must have a struct page.
>
> If something wants to remove that struct page it must first remove all
> the PTEs pointing at it from the entire system and flush the TLBs,
> which directly prevents a future gup-fast from running and trying to
> access the struct page. No extra locking needed
>
> > implied elevated reference on any page that gup-fast would be asked to
> > walk, or pte_special() is there to "say wait, nevermind this isn't a
> > page allocator page fallback to gup-slow()".
>
> pte_special says there is no struct page, and some of those cases can
> be fixed up in gup-slow.
>
> > > Can you sketch the exact race this is protecting against?
> >
> > Thread1 mmaps /mnt/daxfile1 from a "mount -o dax" filesystem and
> > issues direct I/O with that mapping as the target buffer, Thread2 does
> > "echo "namespace0.0" > /sys/bus/nd/drivers/nd_pmem/unbind". Without
> > the dev_pagemap check reference gup-fast could execute
> > get_page(pte_page(pte)) on a page that doesn't even exist anymore
> > because the driver unbind has already performed remove_pages().
>
> Surely the unbind either waits for all the VMAs to be destroyed or
> zaps them before allowing things to progress to remove_pages()?

If we're talking about device-dax this is precisely what it does, zaps
and prevents new faults from resolving, but filesystem-dax...

> Having a situation where the CPU page tables still point at physical
> pages that have been removed sounds so crazy/insecure, that can't be
> what is happening, can it??

Hmm, that may be true and an original dax bug!
The unbind of a block-device from underneath the filesystem does trigger
the filesystem to emergency shutdown / go read-only, but unless that
process also includes a global zap of all dax mappings, not only is that
violating expectations of "page-tables to disappearing memory", but the
filesystem may also want to guarantee that no further dax writes can
happen after shutdown. Right now I believe it only assumes that mmap I/O
will come from page writeback so there's no need to bother applications
with mappings to page cache, but dax mappings need to be ripped away.

/me goes to look at what filesystems guarantee when the block-device is
surprise-removed out from under them.

In any event, this accelerates the effort to go implement
fs-global-dax-zap at the request of the device driver.
Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
[ add Ralph ]

On Tue, Feb 23, 2021 at 3:07 PM Jason Gunthorpe wrote:
>
> On Tue, Feb 23, 2021 at 02:48:20PM -0800, Dan Williams wrote:
> > On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe wrote:
> > >
> > > On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote:
> > > >
> > > > > The downside would be one extra lookup in dev_pagemap tree
> > > > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> > > > > per gup-fast() call.
> > > >
> > > > I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
> > > > path. It should be measurable that this change is at least as fast
> > > > or faster than falling back to the slow path, but it would be good
> > > > to measure.
> > >
> > > What is the dev_pagemap thing doing in gup fast anyhow?
> > >
> > > I've been wondering for a while..
> >
> > It's there to synchronize against dax-device removal. The device will
> > suspend removal awaiting all page references to be dropped, but
> > gup-fast could be racing device removal. So gup-fast checks for
> > pte_devmap() to grab a live reference to the device before assuming it
> > can pin a page.
>
> From the perspective of CPU A it can't tell if CPU B is doing a HW
> page table walk or a GUP fast when it invalidates a page table. The
> design of gup-fast is supposed to be the same as the design of a HW
> page table walk, and the tlb invalidate CPU A does when removing a
> page from a page table is supposed to serialize against both a HW page
> table walk and gup-fast.
>
> Given that the HW page table walker does not do dev_pagemap stuff, why
> does gup-fast?

gup-fast historically assumed that the 'struct page' and memory backing
the page-table walk could not physically be removed from the system
during its walk because those pages were allocated from the page
allocator before being mapped into userspace. So there is an implied
elevated reference on any page that gup-fast would be asked to walk, or
pte_special() is there to say "wait, nevermind, this isn't a page
allocator page, fall back to gup-slow()". pte_devmap() is there to say
"wait, there is no implied elevated reference for this page, check and
hold dev_pagemap alive until a page reference can be taken". So it
splits the difference between pte_special() and typical page allocator
pages.

> Can you sketch the exact race this is protecting against?

Thread1 mmaps /mnt/daxfile1 from a "mount -o dax" filesystem and issues
direct I/O with that mapping as the target buffer, Thread2 does
"echo "namespace0.0" > /sys/bus/nd/drivers/nd_pmem/unbind". Without the
dev_pagemap reference check, gup-fast could execute
get_page(pte_page(pte)) on a page that doesn't even exist anymore
because the driver unbind has already performed remove_pages().

Effectively the same percpu_ref that protects the pmem0 block device
from new command submissions while the device is dying also prevents new
dax page references being taken while the device is dying.

This could be solved with the traditional gup-fast rules if the device
driver could tell the filesystem to unmap all dax files and force them
to re-fault through the gup-slow path to see that the device is now
dying. I'll likely be working on that sooner rather than later given
some of the expectations of the CXL persistent memory "dirty shutdown"
detection.
Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe wrote:
>
> On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote:
> >
> > > The downside would be one extra lookup in dev_pagemap tree
> > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> > > per gup-fast() call.
> >
> > I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
> > path. It should be measurable that this change is at least as fast or
> > faster than falling back to the slow path, but it would be good to
> > measure.
>
> What is the dev_pagemap thing doing in gup fast anyhow?
>
> I've been wondering for a while..

It's there to synchronize against dax-device removal. The device will
suspend removal awaiting all page references to be dropped, but gup-fast
could be racing device removal. So gup-fast checks for pte_devmap() to
grab a live reference to the device before assuming it can pin a page.
Re: [ndctl PATCH v2 01/13] cxl: add a cxl utility and libcxl library
On Mon, 2021-02-22 at 13:36 -0800, Ben Widawsky wrote:
[..]
> > +SYNOPSIS
> > +
> > +[verse]
> > +'cxl list' [<options>]
> > +
> > +Walk the CXL capable device hierarchy in the system and list all device
> > +instances along with some of their major attributes.
>
> This doesn't seem to match the above. Here it's just devices and above
> you talk about bridges and switches as well.

Good catch - those can be added in later when we have a sysfs
representation for them. I'll change it to say just devices for now.

[..]
> > +static void *add_cxl_memdev(void *parent, int id, const char *cxlmem_base)
> > +{
> > +	const char *devname = devpath_to_devname(cxlmem_base);
> > +	char *path = calloc(1, strlen(cxlmem_base) + 100);
> > +	struct cxl_ctx *ctx = parent;
> > +	struct cxl_memdev *memdev, *memdev_dup;
> > +	char buf[SYSFS_ATTR_SIZE];
> > +	struct stat st;
> > +
> > +	if (!path)
> > +		return NULL;
> > +	dbg(ctx, "%s: base: \'%s\'\n", __func__, cxlmem_base);
> > +
> > +	memdev = calloc(1, sizeof(*memdev));
> > +	if (!memdev)
> > +		goto err_dev;
> > +	memdev->id = id;
> > +	memdev->ctx = ctx;
> > +
> > +	sprintf(path, "/dev/cxl/%s", devname);
> > +	if (stat(path, &st) < 0)
> > +		goto err_read;
> > +	memdev->major = major(st.st_rdev);
> > +	memdev->minor = minor(st.st_rdev);
> > +
> > +	sprintf(path, "%s/pmem/size", cxlmem_base);
> > +	if (sysfs_read_attr(ctx, path, buf) < 0)
> > +		goto err_read;
> > +	memdev->pmem_size = strtoull(buf, NULL, 0);
>
> For strtoull usage and below - it certainly doesn't matter much but
> maybe using 10 for base would be better since sysfs is ABI and
> therefore anything other than base 10 is incorrect.

Hm, I followed what libndctl does, but I think there is value in
accepting valid hex even if it is technically 'wrong' per the
robustness principle. How much do we want libcxl/libndctl to be a
kernel validation vehicle vs. just work if you can?

[..]
> > +static int cmd_help(int argc, const char **argv, struct cxl_ctx *ctx)
> > +{
> > +	const char * const builtin_help_subcommands[] = {
> > +		"list", NULL,
> > +	};
>
> Move NULL to newline.

Yep.

> > +int cmd_list(int argc, const char **argv, struct cxl_ctx *ctx)
> > +{
> > +	const struct option options[] = {
> > +		OPT_STRING('d', "memdev", , "memory device name",
> > +			   "filter by CXL memory device name"),
> > +		OPT_BOOLEAN('D', "memdevs", ,
> > +			    "include CXL memory device info"),
> > +		OPT_BOOLEAN('i', "idle", , "include idle devices"),
> > +		OPT_BOOLEAN('u', "human", ,
> > +			    "use human friendly number formats "),
> > +		OPT_END(),
> > +	};
> > +	const char * const u[] = {
> > +		"cxl list [<options>]",
> > +		NULL
> > +	};
> > +	struct json_object *jdevs = NULL;
> > +	unsigned long list_flags;
> > +	struct cxl_memdev *memdev;
> > +	int i;
> > +
> > +argc = parse_options(argc, argv, options, u, 0);
>
> Tab.
>
> /me looks for .clang-format

Thanks - let me see if I can quickly adapt the kernel's .clang-format
for this and add it in for the next revision.
Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
On Tue, Feb 23, 2021 at 9:19 AM Joao Martins wrote:
>
> On 2/23/21 4:50 PM, Dan Williams wrote:
> > On Tue, Feb 23, 2021 at 7:46 AM Joao Martins wrote:
> >> On 2/22/21 8:37 PM, Dan Williams wrote:
> >>> On Mon, Feb 22, 2021 at 3:24 AM Joao Martins wrote:
> >>>> On 2/20/21 1:43 AM, Dan Williams wrote:
> >>>>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard wrote:
> >>>>>> On 12/8/20 9:28 AM, Joao Martins wrote:
>
> [...]
>
> >>> Don't get me wrong the
> >>> capability is still needed for filesystem-dax, but the distinction is
> >>> that vmemmap_populate_compound_pages() need never worry about an
> >>> altmap.
> >>
> >> IMO there's not much added complexity strictly speaking about altmap.
> >> We still use the same vmemmap_{pmd,pte,pgd}_populate helpers which
> >> just pass an altmap. So whatever it is being maintained for fsdax or
> >> other altmap consumers (e.g. we seem to be working towards hotplug
> >> making use of it) we are using it in the exact same way.
> >>
> >> The complexity of the future vmemmap_populate_compound_pages() has
> >> more to do with reusing vmemmap blocks allocated in previous vmemmap
> >> pages, and preserving that across section onlining (for 1G pages).
> >
> > True, I'm less worried about the complexity as much as
> > opportunistically converting configurations to RAM backed pages. It's
> > already the case that poison handling is page mapping size aligned for
> > device-dax, and filesystem-dax needs to stick with non-compound-pages
> > for the foreseeable future.
>
> Hmm, I was sort of wondering that fsdax could move to compound pages
> too as opposed to base pages, albeit not necessarily using the vmemmap
> page reuse as it splits pages IIUC.

I'm not sure compound pages for fsdax would work long term because
there's no infrastructure to reassemble compound pages after a split. So
if you fracture a block and then coalesce it back to a 2MB or 1GB
aligned block there's nothing to go fixup the compound page... unless
the filesystem wants to get into mm metadata fixups.
Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
On Tue, Feb 23, 2021 at 9:16 AM Joao Martins wrote:
>
> On 2/23/21 4:44 PM, Dan Williams wrote:
> > On Tue, Feb 23, 2021 at 8:30 AM Joao Martins wrote:
> >>
> >> On 2/20/21 1:18 AM, Dan Williams wrote:
> >>> On Tue, Dec 8, 2020 at 9:32 AM Joao Martins wrote:
> >>>> Patch 6 - 8: Optimize grabbing/release a page refcount changes
> >>>> given that we are working with compound pages i.e. we do 1
> >>>> increment/decrement to the head page for a given set of N subpages
> >>>> as opposed to N individual writes.
> >>>> {get,pin}_user_pages_fast() for zone_device with compound pagemap
> >>>> consequently improves considerably, and unpin_user_pages() improves
> >>>> as well when passed a set of consecutive pages:
> >>>>
> >>>>                                          before      after
> >>>> (get_user_pages_fast 1G;2M page size)  ~75k us  -> ~3.2k ; ~5.2k us
> >>>> (pin_user_pages_fast 1G;2M page size)  ~125k us -> ~3.4k ; ~5.5k us
> >>>
> >>> Compelling!
> >>>
> >>
> >> BTW is there any reason why we don't support pin_user_pages_fast()
> >> with FOLL_LONGTERM for device-dax?
> >>
> >
> > Good catch.
> >
> > Must have been an oversight of the conversion. FOLL_LONGTERM collides
> > with filesystem operations, but not device-dax.
>
> Hmm, fwiw, it was unilaterally disabled for any devmap pmd/pud in
> commit 7af75561e171 ("mm/gup: add FOLL_LONGTERM capability to GUP
> fast") and I must only assume that by "DAX pages" the submitter was
> only referring to fs-dax pages.

Agree, that was an fsdax only assumption. Maybe this went unnoticed
because the primary gup-fast case for direct-I/O was not impacted.
Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
On 2/23/21 4:50 PM, Dan Williams wrote:
> On Tue, Feb 23, 2021 at 7:46 AM Joao Martins wrote:
>> On 2/22/21 8:37 PM, Dan Williams wrote:
>>> On Mon, Feb 22, 2021 at 3:24 AM Joao Martins wrote:
>>>> On 2/20/21 1:43 AM, Dan Williams wrote:
>>>>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard wrote:
>>>>>> On 12/8/20 9:28 AM, Joao Martins wrote:
>
> [...]
>
>>> Don't get me wrong the
>>> capability is still needed for filesystem-dax, but the distinction is
>>> that vmemmap_populate_compound_pages() need never worry about an
>>> altmap.
>>>
>> IMO there's not much added complexity strictly speaking about altmap.
>> We still use the same vmemmap_{pmd,pte,pgd}_populate helpers which just
>> pass an altmap. So whatever it is being maintained for fsdax or other
>> altmap consumers (e.g. we seem to be working towards hotplug making use
>> of it) we are using it in the exact same way.
>>
>> The complexity of the future vmemmap_populate_compound_pages() has more
>> to do with reusing vmemmap blocks allocated in previous vmemmap pages,
>> and preserving that across section onlining (for 1G pages).
>
> True, I'm less worried about the complexity as much as
> opportunistically converting configurations to RAM backed pages. It's
> already the case that poison handling is page mapping size aligned for
> device-dax, and filesystem-dax needs to stick with non-compound-pages
> for the foreseeable future.
>
Hmm, I was sort of wondering that fsdax could move to compound pages too
as opposed to base pages, albeit not necessarily using the vmemmap page
reuse as it splits pages IIUC.

> Ok, let's try to keep altmap in vmemmap_populate_compound_pages() and
> see how it looks.
>
OK, will do.
Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
On 2/23/21 4:44 PM, Dan Williams wrote:
> On Tue, Feb 23, 2021 at 8:30 AM Joao Martins wrote:
>>
>> On 2/20/21 1:18 AM, Dan Williams wrote:
>>> On Tue, Dec 8, 2020 at 9:32 AM Joao Martins wrote:
>>>> Patch 6 - 8: Optimize grabbing/release a page refcount changes given
>>>> that we are working with compound pages i.e. we do 1
>>>> increment/decrement to the head page for a given set of N subpages
>>>> as opposed to N individual writes.
>>>> {get,pin}_user_pages_fast() for zone_device with compound pagemap
>>>> consequently improves considerably, and unpin_user_pages() improves
>>>> as well when passed a set of consecutive pages:
>>>>
>>>>                                          before      after
>>>> (get_user_pages_fast 1G;2M page size)  ~75k us  -> ~3.2k ; ~5.2k us
>>>> (pin_user_pages_fast 1G;2M page size)  ~125k us -> ~3.4k ; ~5.5k us
>>>
>>> Compelling!
>>>
>>
>> BTW is there any reason why we don't support pin_user_pages_fast() with
>> FOLL_LONGTERM for device-dax?
>>
>
> Good catch.
>
> Must have been an oversight of the conversion. FOLL_LONGTERM collides
> with filesystem operations, but not device-dax.

Hmm, fwiw, it was unilaterally disabled for any devmap pmd/pud in commit
7af75561e171 ("mm/gup: add FOLL_LONGTERM capability to GUP fast") and I
must only assume that by "DAX pages" the submitter was only referring to
fs-dax pages.

> In fact that's the
> motivation for device-dax in the first instance, no need to coordinate
> runtime physical address layout changes because the device is
> statically allocated.
>
/me nods

>> Looking at the history, I understand that fsdax can't support it atm,
>> but I am not sure that the same holds for device-dax. I have this
>> small chunk (see below the scissors mark) which relaxes this for a
>> pgmap of type MEMORY_DEVICE_GENERIC, albeit not sure if there is a
>> fundamental issue for the other types that makes this an unwelcoming
>> change.
>>
>> Joao
>>
>> ->8-
>>
>> Subject: [PATCH] mm/gup: allow FOLL_LONGTERM pin-fast for MEMORY_DEVICE_GENERIC
>>
>> The downside would be one extra lookup in dev_pagemap tree
>> for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
>> per gup-fast() call.
>
> I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
> path. It should be measurable that this change is at least as fast or
> faster than falling back to the slow path, but it would be good to
> measure.
>
But with the changes I am/will-be making I hope gup-fast and gup-slow
will be as fast (for present pmd/puds ofc, as the fault makes it
slower). I'll formally submit the patch below once I've run the numbers.

> Code changes look good to me.
>
Cool! Will add in the suggested change below.

>>
>> Signed-off-by: Joao Martins
>> ---
>>  include/linux/mm.h |  5 +
>>  mm/gup.c           | 24 +---
>>  2 files changed, 18 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 32f0c3986d4f..c89a049bbd7a 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1171,6 +1171,11 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
>>  	page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
>>  }
>>
>> +static inline bool devmap_longterm_available(const struct dev_pagemap *pgmap)
>> +{
>
> I'd call this devmap_can_longterm().
>
Ack.
Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
On Tue, Feb 23, 2021 at 7:46 AM Joao Martins wrote:
>
> On 2/22/21 8:37 PM, Dan Williams wrote:
> > On Mon, Feb 22, 2021 at 3:24 AM Joao Martins wrote:
> >> On 2/20/21 1:43 AM, Dan Williams wrote:
> >>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard wrote:
> >>>> On 12/8/20 9:28 AM, Joao Martins wrote:
> >>>>> diff --git a/mm/memremap.c b/mm/memremap.c
> >>>>> index 16b2fb482da1..287a24b7a65a 100644
> >>>>> --- a/mm/memremap.c
> >>>>> +++ b/mm/memremap.c
> >>>>> @@ -277,8 +277,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
> >>>>>  	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> >>>>>  				PHYS_PFN(range->start),
> >>>>>  				PHYS_PFN(range_len(range)), pgmap);
> >>>>> -	percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> >>>>> -			- pfn_first(pgmap, range_id));
> >>>>> +	if (pgmap->flags & PGMAP_COMPOUND)
> >>>>> +		percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
> >>>>> +			- pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align));
> >>>>
> >>>> Is there some reason that we cannot use range_len(), instead of pfn_end() minus
> >>>> pfn_first()? (Yes, this more about the pre-existing code than about your change.)
> >>>>
> >>>> And if not, then why are the nearby range_len() uses OK? I realize that range_len()
> >>>> is simpler and skips a case, but it's not clear that it's required here. But I'm
> >>>> new to this area so be warned. :)
> >>>
> >>> There's a subtle distinction between the range that was passed in and
> >>> the pfns that are activated inside of it. See the offset trickery in
> >>> pfn_first().
> >>>
> >>>> Also, dividing by PHYS_PFN() feels quite misleading: that function does what you
> >>>> happen to want, but is not named accordingly. Can you use or create something
> >>>> more accurately named? Like "number of pages in this large page"?
> >>>
> >>> It's not the number of pages in a large page it's converting bytes to
> >>> pages. Other places in the kernel write it as (x >> PAGE_SHIFT), but my
> >>> thought process was if I'm going to add () might as well use a macro
> >>> that already does this.
> >>>
> >>> That said I think this calculation is broken precisely because
> >>> pfn_first() makes the result unaligned.
> >>>
> >>> Rather than fix the unaligned pfn_first() problem I would use this
> >>> support as an opportunity to revisit the option of storing pages in
> >>> the vmem_altmap reserve space. The altmap's whole reason for existence
> >>> was that 1.5% of large PMEM might completely swamp DRAM. However if
> >>> that overhead is reduced by an order (or orders) of magnitude the
> >>> primary need for vmem_altmap vanishes.
> >>>
> >>> Now, we'll still need to keep it around for the ->align == PAGE_SIZE
> >>> case, but for most part existing deployments that are specifying page
> >>> map on PMEM and an align > PAGE_SIZE can instead just transparently be
> >>> upgraded to page map on a smaller amount of DRAM.
> >>>
> >> I feel the altmap is still relevant. Even with the struct page reuse for
> >> tail pages, the overhead for 2M align is still non-negligible i.e. 4G per
> >> 1Tb (strictly speaking about what's stored in the altmap). Muchun and
> >> Matthew were thinking (in another thread) on compound_head() adjustments
> >> that probably can make this overhead go to 2G (if we learn to differentiate
> >> the reused head page from the real head page).
> >
> > I think that realization is more justification to make a new first
> > class vmemmap_populate_compound_pages() rather than try to reuse
> > vmemmap_populate_basepages() with new parameters.
> >
> I was already going to move this to vmemmap_populate_compound_pages() based
> on your earlier suggestion :)
>
> >> But even there it's still
> >> 2G per 1Tb. 1G pages, though, have a better story to remove altmap need.
> > > > The concern that led to altmap is that someone would build a system > > with a 96:1 (PMEM:RAM) ratio where that correlates to maximum PMEM and > > minimum RAM, and mapping all PMEM consumes all RAM. As far as I > > understand real world populations are rarely going past 8:1, that > > seems to make 'struct page' in RAM feasible even for the 2M compound > > page case. > > > > Let me ask you for a data point, since you're one of the people > > actively deploying such systems, would you still use the 'struct page' > > in PMEM capability after this set was merged? > > > We might be sticking to RAM stored 'struct page' yes, but hard to say atm > what the future holds. > > >> One thing to point out about altmap is that the degradation (in pinning and > >> unpining) we observed with struct page's in device memory, is no longer > >> observed > >> once 1) we batch ref count updates as we move to compound pages 2) reusing > >> tail pages seems to lead to these struct pages staying
Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
On Tue, Feb 23, 2021 at 8:30 AM Joao Martins wrote:
>
> On 2/20/21 1:18 AM, Dan Williams wrote:
> > On Tue, Dec 8, 2020 at 9:32 AM Joao Martins wrote:
> >> Patch 6 - 8: Optimize grabbing/releasing a page refcount, given that we
> >> are working with compound pages, i.e. we do 1 increment/decrement to the head
> >> page for a given set of N subpages, as opposed to N individual writes.
> >> {get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
> >> improves considerably, and unpin_user_pages() improves as well when passed a
> >> set of consecutive pages:
> >>
> >>                                        before      after
> >> (get_user_pages_fast 1G;2M page size) ~75k us  -> ~3.2k ; ~5.2k us
> >> (pin_user_pages_fast 1G;2M page size) ~125k us -> ~3.4k ; ~5.5k us
> >
> > Compelling!
> >
> BTW is there any reason why we don't support pin_user_pages_fast() with
> FOLL_LONGTERM for device-dax?

Good catch.

Must have been an oversight of the conversion. FOLL_LONGTERM collides
with filesystem operations, but not device-dax. In fact that's the
motivation for device-dax in the first instance, no need to coordinate
runtime physical address layout changes because the device is
statically allocated.

> Looking at the history, I understand that fsdax can't support it atm, but I am not sure
> that the same holds for device-dax. I have this small chunk (see below the scissors mark)
> which relaxes this for a pgmap of type MEMORY_DEVICE_GENERIC, albeit not sure if there is
> a fundamental issue for the other types that makes this an unwelcoming change.
>
> Joao
>
> ->8-
>
> Subject: [PATCH] mm/gup: allow FOLL_LONGTERM pin-fast for MEMORY_DEVICE_GENERIC
>
> The downside would be one extra lookup in dev_pagemap tree
> for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> per gup-fast() call.

I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
path.
It should be measurable that this change is at least as fast or faster
than falling back to the slow path, but it would be good to measure.

Code changes look good to me.

>
> Signed-off-by: Joao Martins
> ---
>  include/linux/mm.h | 5 +
>  mm/gup.c | 24 +---
>  2 files changed, 18 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 32f0c3986d4f..c89a049bbd7a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1171,6 +1171,11 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
>  		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
>  }
>
> +static inline bool devmap_longterm_available(const struct dev_pagemap *pgmap)
> +{

I'd call this devmap_can_longterm().

> +	return pgmap->type == MEMORY_DEVICE_GENERIC;
> +}
> +
>  /* 127: arbitrary random number, small enough to assemble well */
>  #define page_ref_zero_or_close_to_overflow(page)	\
>  	((unsigned int) page_ref_count(page) + 127u <= 127u)
> diff --git a/mm/gup.c b/mm/gup.c
> index 222d1fdc5cfa..03e370d360e6 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2092,14 +2092,18 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  			goto pte_unmap;
>
>  		if (pte_devmap(pte)) {
> -			if (unlikely(flags & FOLL_LONGTERM))
> -				goto pte_unmap;
> -
>  			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
>  			if (unlikely(!pgmap)) {
>  				undo_dev_pagemap(nr, nr_start, flags, pages);
>  				goto pte_unmap;
>  			}
> +
> +			if (unlikely(flags & FOLL_LONGTERM) &&
> +			    !devmap_longterm_available(pgmap)) {
> +				undo_dev_pagemap(nr, nr_start, flags, pages);
> +				goto pte_unmap;
> +			}
> +
>  		} else if (pte_special(pte))
>  			goto pte_unmap;
>
> @@ -2195,6 +2199,10 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>  			return 0;
>  		}
>
> +		if (unlikely(flags & FOLL_LONGTERM) &&
> +		    !devmap_longterm_available(pgmap))
> +			return 0;
> +
> @@ -2356,12 +2364,9 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>  	if (!pmd_access_permitted(orig, flags & FOLL_WRITE))
>  		return 0;
>
> -	if (pmd_devmap(orig)) {
> -		if (unlikely(flags & FOLL_LONGTERM))
> -			return 0;
> +	if (pmd_devmap(orig))
>  		return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
>  					     pages, nr);
> -	}
>
>  	page =
Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
On 2/20/21 1:18 AM, Dan Williams wrote:
> On Tue, Dec 8, 2020 at 9:32 AM Joao Martins wrote:
>> Patch 6 - 8: Optimize grabbing/releasing a page refcount, given that we
>> are working with compound pages, i.e. we do 1 increment/decrement to the head
>> page for a given set of N subpages, as opposed to N individual writes.
>> {get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
>> improves considerably, and unpin_user_pages() improves as well when passed a
>> set of consecutive pages:
>>
>>                                        before      after
>> (get_user_pages_fast 1G;2M page size) ~75k us  -> ~3.2k ; ~5.2k us
>> (pin_user_pages_fast 1G;2M page size) ~125k us -> ~3.4k ; ~5.5k us
>
> Compelling!
>

BTW is there any reason why we don't support pin_user_pages_fast() with
FOLL_LONGTERM for device-dax?

Looking at the history, I understand that fsdax can't support it atm, but I am not sure
that the same holds for device-dax. I have this small chunk (see below the scissors mark)
which relaxes this for a pgmap of type MEMORY_DEVICE_GENERIC, albeit not sure if there is
a fundamental issue for the other types that makes this an unwelcoming change.

Joao

->8-

Subject: [PATCH] mm/gup: allow FOLL_LONGTERM pin-fast for MEMORY_DEVICE_GENERIC

The downside would be one extra lookup in dev_pagemap tree
for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
per gup-fast() call.
Signed-off-by: Joao Martins
---
 include/linux/mm.h | 5 +
 mm/gup.c | 24 +---
 2 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 32f0c3986d4f..c89a049bbd7a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1171,6 +1171,11 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
 		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
 }

+static inline bool devmap_longterm_available(const struct dev_pagemap *pgmap)
+{
+	return pgmap->type == MEMORY_DEVICE_GENERIC;
+}
+
 /* 127: arbitrary random number, small enough to assemble well */
 #define page_ref_zero_or_close_to_overflow(page)	\
 	((unsigned int) page_ref_count(page) + 127u <= 127u)
diff --git a/mm/gup.c b/mm/gup.c
index 222d1fdc5cfa..03e370d360e6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2092,14 +2092,18 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 			goto pte_unmap;

 		if (pte_devmap(pte)) {
-			if (unlikely(flags & FOLL_LONGTERM))
-				goto pte_unmap;
-
 			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
 			if (unlikely(!pgmap)) {
 				undo_dev_pagemap(nr, nr_start, flags, pages);
 				goto pte_unmap;
 			}
+
+			if (unlikely(flags & FOLL_LONGTERM) &&
+			    !devmap_longterm_available(pgmap)) {
+				undo_dev_pagemap(nr, nr_start, flags, pages);
+				goto pte_unmap;
+			}
+
 		} else if (pte_special(pte))
 			goto pte_unmap;

@@ -2195,6 +2199,10 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 			return 0;
 		}

+		if (unlikely(flags & FOLL_LONGTERM) &&
+		    !devmap_longterm_available(pgmap))
+			return 0;
+
@@ -2356,12 +2364,9 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	if (!pmd_access_permitted(orig, flags & FOLL_WRITE))
 		return 0;

-	if (pmd_devmap(orig)) {
-		if (unlikely(flags & FOLL_LONGTERM))
-			return 0;
+	if (pmd_devmap(orig))
 		return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
 					     pages, nr);
-	}

 	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
@@ -2390,12 +2395,9 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 	if (!pud_access_permitted(orig, flags & FOLL_WRITE))
 		return 0;

-	if (pud_devmap(orig)) {
-		if (unlikely(flags & FOLL_LONGTERM))
-			return 0;
+	if (pud_devmap(orig))
 		return __gup_device_huge_pud(orig, pudp, addr, end, flags,
 					     pages, nr);
-	}

 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
--
2.17.1
Re: [PATCH RFC 3/9] sparse-vmemmap: Reuse vmemmap areas for a given page size
On 2/22/21 10:40 PM, Dan Williams wrote:
> On Mon, Feb 22, 2021 at 3:42 AM Joao Martins wrote:
>> On 2/20/21 3:34 AM, Dan Williams wrote:
>>> On Tue, Dec 8, 2020 at 9:32 AM Joao Martins wrote:
>>>> Sections are 128M (or bigger/smaller),
>>>
>>> Huh?
>>>
>>
>> Section size is arch-dependent if we are being holistic.
>> On x86 it's 64M, 128M or 512M right?
>>
>> #ifdef CONFIG_X86_32
>> # ifdef CONFIG_X86_PAE
>> #  define SECTION_SIZE_BITS	29
>> #  define MAX_PHYSMEM_BITS	36
>> # else
>> #  define SECTION_SIZE_BITS	26
>> #  define MAX_PHYSMEM_BITS	32
>> # endif
>> #else /* CONFIG_X86_32 */
>> # define SECTION_SIZE_BITS	27 /* matt - 128 is convenient right now */
>> # define MAX_PHYSMEM_BITS	(pgtable_l5_enabled() ? 52 : 46)
>> #endif
>>
>> Also, me pointing about section sizes, is because a 1GB+ page vmemmap population will
>> cross sections in how sparsemem populates the vmemmap. And on that case we gotta reuse
>> the PTE/PMD pages across multiple invocations of vmemmap_populate_basepages(). Either
>> that, or looking at the previous page PTE, but that might be inefficient.
>
> Ok, makes sense. I think this description of needing to handle
> section crossing is clearer than mentioning one of the section sizes.
>
I'll amend the commit message to have this.
>>>> @@ -229,38 +235,95 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>>>>  	for (; addr < end; addr += PAGE_SIZE) {
>>>>  		pgd = vmemmap_pgd_populate(addr, node);
>>>>  		if (!pgd)
>>>> -			return -ENOMEM;
>>>> +			return NULL;
>>>>  		p4d = vmemmap_p4d_populate(pgd, addr, node);
>>>>  		if (!p4d)
>>>> -			return -ENOMEM;
>>>> +			return NULL;
>>>>  		pud = vmemmap_pud_populate(p4d, addr, node);
>>>>  		if (!pud)
>>>> -			return -ENOMEM;
>>>> +			return NULL;
>>>>  		pmd = vmemmap_pmd_populate(pud, addr, node);
>>>>  		if (!pmd)
>>>> -			return -ENOMEM;
>>>> -		pte = vmemmap_pte_populate(pmd, addr, node, altmap);
>>>> +			return NULL;
>>>> +		pte = vmemmap_pte_populate(pmd, addr, node, altmap, block);
>>>>  		if (!pte)
>>>> -			return -ENOMEM;
>>>> +			return NULL;
>>>>  		vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
>>>>  	}
>>>> +	return __va(__pfn_to_phys(pte_pfn(*pte)));
>>>> +}
>>>> +
>>>> +int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>>>> +					 int node, struct vmem_altmap *altmap)
>>>> +{
>>>> +	if (!__vmemmap_populate_basepages(start, end, node, altmap, NULL))
>>>> +		return -ENOMEM;
>>>>  	return 0;
>>>>  }
>>>> +
>>>> +static struct page * __meminit vmemmap_populate_reuse(unsigned long start,
>>>> +					unsigned long end, int node,
>>>> +					struct vmem_context *ctx)
>>>> +{
>>>> +	unsigned long size, addr = start;
>>>> +	unsigned long psize = PHYS_PFN(ctx->align) * sizeof(struct page);
>>>> +
>>>> +	size = min(psize, end - start);
>>>> +
>>>> +	for (; addr < end; addr += size) {
>>>> +		unsigned long head = addr + PAGE_SIZE;
>>>> +		unsigned long tail = addr;
>>>> +		unsigned long last = addr + size;
>>>> +		void *area;
>>>> +
>>>> +		if (ctx->block_page &&
>>>> +		    IS_ALIGNED((addr - ctx->block_page), psize))
>>>> +			ctx->block = NULL;
>>>> +
>>>> +		area = ctx->block;
>>>> +		if (!area) {
>>>> +			if (!__vmemmap_populate_basepages(addr, head, node,
>>>> +							  ctx->altmap, NULL))
>>>> +				return NULL;
>>>> +
>>>> +			tail = head + PAGE_SIZE;
>>>> +			area = __vmemmap_populate_basepages(head, tail, node,
>>>> +							    ctx->altmap, NULL);
>>>> +			if (!area)
>>>> +				return NULL;
>>>> +
>>>> +			ctx->block = area;
>>>> +			ctx->block_page = addr;
>>>> +		}
>>>> +
>>>> +		if (!__vmemmap_populate_basepages(tail, last, node,
>>>> +						  ctx->altmap, area))
>>>> +			return NULL;
>>>> +	}
>>>
>>> I think that compound page accounting and combined altmap accounting
>>> makes this difficult to read, and I think the compound page case
>>> deserves
Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
On 2/22/21 8:37 PM, Dan Williams wrote:
> On Mon, Feb 22, 2021 at 3:24 AM Joao Martins wrote:
>> On 2/20/21 1:43 AM, Dan Williams wrote:
>>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard wrote:
>>>> On 12/8/20 9:28 AM, Joao Martins wrote:
>>>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>>>> index 16b2fb482da1..287a24b7a65a 100644
>>>>> --- a/mm/memremap.c
>>>>> +++ b/mm/memremap.c
>>>>> @@ -277,8 +277,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
>>>>>  	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
>>>>>  				PHYS_PFN(range->start),
>>>>>  				PHYS_PFN(range_len(range)), pgmap);
>>>>> -	percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
>>>>> -			- pfn_first(pgmap, range_id));
>>>>> +	if (pgmap->flags & PGMAP_COMPOUND)
>>>>> +		percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
>>>>> +			- pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align));
>>>>
>>>> Is there some reason that we cannot use range_len(), instead of pfn_end() minus
>>>> pfn_first()? (Yes, this more about the pre-existing code than about your change.)
>>>>
>>>> And if not, then why are the nearby range_len() uses OK? I realize that range_len()
>>>> is simpler and skips a case, but it's not clear that it's required here. But I'm
>>>> new to this area so be warned. :)
>>>
>>> There's a subtle distinction between the range that was passed in and
>>> the pfns that are activated inside of it. See the offset trickery in
>>> pfn_first().
>>>
>>>> Also, dividing by PHYS_PFN() feels quite misleading: that function does what you
>>>> happen to want, but is not named accordingly. Can you use or create something
>>>> more accurately named? Like "number of pages in this large page"?
>>>
>>> It's not the number of pages in a large page it's converting bytes to
>>> pages. Other places in the kernel write it as (x >> PAGE_SHIFT), but my
>>> thought process was if I'm going to add () might as well use a macro
>>> that already does this.
>>>
>>> That said I think this calculation is broken precisely because
>>> pfn_first() makes the result unaligned.
>>>
>>> Rather than fix the unaligned pfn_first() problem I would use this
>>> support as an opportunity to revisit the option of storing pages in
>>> the vmem_altmap reserve space. The altmap's whole reason for existence
>>> was that 1.5% of large PMEM might completely swamp DRAM. However if
>>> that overhead is reduced by an order (or orders) of magnitude the
>>> primary need for vmem_altmap vanishes.
>>>
>>> Now, we'll still need to keep it around for the ->align == PAGE_SIZE
>>> case, but for most part existing deployments that are specifying page
>>> map on PMEM and an align > PAGE_SIZE can instead just transparently be
>>> upgraded to page map on a smaller amount of DRAM.
>>>
>> I feel the altmap is still relevant. Even with the struct page reuse for
>> tail pages, the overhead for 2M align is still non-negligible i.e. 4G per
>> 1Tb (strictly speaking about what's stored in the altmap). Muchun and
>> Matthew were thinking (in another thread) on compound_head() adjustments
>> that probably can make this overhead go to 2G (if we learn to differentiate
>> the reused head page from the real head page).
>
> I think that realization is more justification to make a new first
> class vmemmap_populate_compound_pages() rather than try to reuse
> vmemmap_populate_basepages() with new parameters.
>
I was already going to move this to vmemmap_populate_compound_pages() based
on your earlier suggestion :)

>> But even there it's still
>> 2G per 1Tb. 1G pages, though, have a better story to remove altmap need.
>
> The concern that led to altmap is that someone would build a system
> with a 96:1 (PMEM:RAM) ratio where that correlates to maximum PMEM and
> minimum RAM, and mapping all PMEM consumes all RAM. As far as I
> understand real world populations are rarely going past 8:1, that
> seems to make 'struct page' in RAM feasible even for the 2M compound
> page case.
> > Let me ask you for a data point, since you're one of the people > actively deploying such systems, would you still use the 'struct page' > in PMEM capability after this set was merged? > We might be sticking to RAM stored 'struct page' yes, but hard to say atm what the future holds. >> One thing to point out about altmap is that the degradation (in pinning and >> unpining) we observed with struct page's in device memory, is no longer >> observed >> once 1) we batch ref count updates as we move to compound pages 2) reusing >> tail pages seems to lead to these struct pages staying more likely in cache >> which perhaps contributes to dirtying a lot less cachelines. > > True, it makes it more palatable to survive 'struct page' in PMEM, but > it's an ongoing maintenance burden that I'm not sure there are users > after putting