Re: [GIT PULL] Compute Express Linux (CXL) for v5.12-rc1
As much as I'd love to be working on "Compute Express Linux" the subject should have read "Compute Express Link".

On Tue, Feb 23, 2021 at 8:05 PM Dan Williams wrote:
>
> Hi Linus, please pull from:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/cxl-for-5.12
>
> ...to receive an initial driver for CXL 2.0 Memory Devices. Technical
> details are in the tag message and Documentation/. I am taking this
> through nvdimm.git this first cycle until the cxl.git repository and
> maintainer team can be set up on git.kernel.org.
>
> In terms of why merge this initial driver now, it establishes just
> enough functionality to enumerate these devices and issue all
> administrative commands. It sets a v5.12 baseline to develop the more
> complicated higher order functionality like memory device
> interleaving, persistent memory support, and hotplug which entangle
> with ACPI, LIBNVDIMM, and PCI.
>
> The focus of this release is establishing the ioctl UAPI for the
> management commands. Similar to NVME there are a set of standard
> commands as well as the possibility for vendor specific commands.
> Unlike the NVME driver the CXL driver does not enable vendor specific
> command functionality by default. This conservatism is out of concern
> for the fact that CXL interleaves memory across devices and implements
> host memory. The system integrity implications of some commands are
> more severe than NVME and vendor specific functionality is mostly
> unauditable. This will be an ongoing topic of discussion with the
> wider CXL community for next few months.
>
> The driver has been developed in the open since November against a
> work-in-progress QEMU emulation of the CXL device model. That QEMU
> effort has recently attracted contributions from multiple hardware
> vendors.
>
> The driver has appeared in -next. It collected some initial static
> analysis fixes and build-robot reports, but all quiet in -next for the
> past week. A list of review tags that arrived after the branch for
> -next was cut is appended to the tag message below.
>
> ---
>
> The following changes since commit 1048ba83fb1c00cd24172e23e8263972f6b5d9ac:
>
>   Linux 5.11-rc6 (2021-01-31 13:50:09 -0800)
>
> are available in the Git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/cxl-for-5.12
>
> for you to fetch changes up to 88ff5d466c0250259818f3153dbdc4af1f8615dd:
>
>   cxl/mem: Fix potential memory leak (2021-02-22 14:44:39 -0800)
>
> cxl for 5.12
>
> Introduce an initial driver for CXL 2.0 Type-3 Memory Devices. CXL is
> Compute Express Link which released the 2.0 specification in November.
> The Linux relevant changes in CXL 2.0 are support for an OS to
> dynamically assign address space to memory devices, support for
> switches, persistent memory, and hotplug. A Type-3 Memory Device is a
> PCI enumerated device presenting the CXL Memory Device Class Code and
> implementing the CXL.mem protocol. CXL.mem allows a device to advertise
> CPU and I/O coherent memory to the system, i.e. typical "System RAM" and
> "Persistent Memory" in Linux /proc/iomem terms.
>
> In addition to the CXL.mem fast path there is an administrative command
> hardware mailbox interface for maintenance and provisioning. It is this
> command interface that is the focus of the initial driver. With this
> driver a CXL device that is mapped by the BIOS can be administered by
> Linux. Linux support for CXL PMEM and dynamic CXL address space
> management are to be implemented post v5.12.
>
> 4cdadfd5e0a7 cxl/mem: Introduce a driver for CXL-2.0-Type-3 endpoints
>   Reviewed-by: Konrad Rzeszutek Wilk
> 8adaf747c9f0 cxl/mem: Find device capabilities
>   Reviewed-by: Jonathan Cameron
> b39cb1052a5c cxl/mem: Register CXL memX devices
>   Reviewed-by: Jonathan Cameron
> 13237183c735 cxl/mem: Add a "RAW" send command
>   Reviewed-by: Konrad Rzeszutek Wilk
> 472b1ce6e9d6 cxl/mem: Enable commands via CEL
>   Reviewed-by: Konrad Rzeszutek Wilk
> 57ee605b976c cxl/mem: Add set of informational commands
>   Reviewed-by: Konrad Rzeszutek Wilk
>
> Ben Widawsky (7):
>       cxl/mem: Find device capabilities
>       cxl/mem: Add basic IOCTL interface
>       cxl/mem: Add a "RAW" send command
>       cxl/mem: Enable commands via CEL
>       cxl/mem: Add set of informational commands
>       MAINTAINERS: Add maintainers of the CXL driver
>       cxl/mem: Fix potential memory leak
>
> Dan Carpenter (1):
>       cxl/mem: Return -EFAULT if copy_to_user() fails
>
> Dan Williams (2):
>       cxl/mem: Introduce a driver for CXL-2.0-Type-3 endpoints
>       cxl/mem: Register CXL memX devices
>
>  .clang-format                                      |  1 +
>  Documentation/ABI/testing/sysfs-bus-cxl            | 26 +
>  Documentation/driver-api/cxl/index.rst             | 12 +
>
[GIT PULL] Compute Express Linux (CXL) for v5.12-rc1
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/cxl-for-5.12

...to receive an initial driver for CXL 2.0 Memory Devices. Technical
details are in the tag message and Documentation/. I am taking this
through nvdimm.git this first cycle until the cxl.git repository and
maintainer team can be set up on git.kernel.org.

In terms of why merge this initial driver now, it establishes just
enough functionality to enumerate these devices and issue all
administrative commands. It sets a v5.12 baseline to develop the more
complicated higher-order functionality like memory device interleaving,
persistent memory support, and hotplug, which entangle with ACPI,
LIBNVDIMM, and PCI.

The focus of this release is establishing the ioctl UAPI for the
management commands. Similar to NVME there is a set of standard commands
as well as the possibility of vendor-specific commands. Unlike the NVME
driver, the CXL driver does not enable vendor-specific command
functionality by default. This conservatism is out of concern for the
fact that CXL interleaves memory across devices and implements host
memory. The system-integrity implications of some commands are more
severe than NVME, and vendor-specific functionality is mostly
unauditable. This will be an ongoing topic of discussion with the wider
CXL community for the next few months.

The driver has been developed in the open since November against a
work-in-progress QEMU emulation of the CXL device model. That QEMU
effort has recently attracted contributions from multiple hardware
vendors.

The driver has appeared in -next. It collected some initial
static-analysis fixes and build-robot reports, but all has been quiet in
-next for the past week. A list of review tags that arrived after the
branch for -next was cut is appended to the tag message below.
---

The following changes since commit 1048ba83fb1c00cd24172e23e8263972f6b5d9ac:

  Linux 5.11-rc6 (2021-01-31 13:50:09 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/cxl-for-5.12

for you to fetch changes up to 88ff5d466c0250259818f3153dbdc4af1f8615dd:

  cxl/mem: Fix potential memory leak (2021-02-22 14:44:39 -0800)

cxl for 5.12

Introduce an initial driver for CXL 2.0 Type-3 Memory Devices. CXL is
Compute Express Link, which released the 2.0 specification in November.
The Linux-relevant changes in CXL 2.0 are support for an OS to
dynamically assign address space to memory devices, support for
switches, persistent memory, and hotplug. A Type-3 Memory Device is a
PCI-enumerated device presenting the CXL Memory Device Class Code and
implementing the CXL.mem protocol. CXL.mem allows a device to advertise
CPU- and I/O-coherent memory to the system, i.e. typical "System RAM"
and "Persistent Memory" in Linux /proc/iomem terms.

In addition to the CXL.mem fast path there is an administrative-command
hardware mailbox interface for maintenance and provisioning. It is this
command interface that is the focus of the initial driver. With this
driver a CXL device that is mapped by the BIOS can be administered by
Linux. Linux support for CXL PMEM and dynamic CXL address-space
management are to be implemented post v5.12.
4cdadfd5e0a7 cxl/mem: Introduce a driver for CXL-2.0-Type-3 endpoints
  Reviewed-by: Konrad Rzeszutek Wilk
8adaf747c9f0 cxl/mem: Find device capabilities
  Reviewed-by: Jonathan Cameron
b39cb1052a5c cxl/mem: Register CXL memX devices
  Reviewed-by: Jonathan Cameron
13237183c735 cxl/mem: Add a "RAW" send command
  Reviewed-by: Konrad Rzeszutek Wilk
472b1ce6e9d6 cxl/mem: Enable commands via CEL
  Reviewed-by: Konrad Rzeszutek Wilk
57ee605b976c cxl/mem: Add set of informational commands
  Reviewed-by: Konrad Rzeszutek Wilk

Ben Widawsky (7):
      cxl/mem: Find device capabilities
      cxl/mem: Add basic IOCTL interface
      cxl/mem: Add a "RAW" send command
      cxl/mem: Enable commands via CEL
      cxl/mem: Add set of informational commands
      MAINTAINERS: Add maintainers of the CXL driver
      cxl/mem: Fix potential memory leak

Dan Carpenter (1):
      cxl/mem: Return -EFAULT if copy_to_user() fails

Dan Williams (2):
      cxl/mem: Introduce a driver for CXL-2.0-Type-3 endpoints
      cxl/mem: Register CXL memX devices

 .clang-format                                      |  1 +
 Documentation/ABI/testing/sysfs-bus-cxl            | 26 +
 Documentation/driver-api/cxl/index.rst             | 12 +
 Documentation/driver-api/cxl/memory-devices.rst    | 46 +
 Documentation/driver-api/index.rst                 |  1 +
 Documentation/userspace-api/ioctl/ioctl-number.rst |  1 +
 MAINTAINERS                                        | 11 +
 drivers/Kconfig                                    |  1 +
 drivers/Makefile                                   |  1 +
 drivers/cxl/Kconfig
[GIT PULL] libnvdimm + device-dax for v5.12-rc1
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/libnvdimm-for-5.12

...to receive some miscellaneous cleanups and a fix for v5.12. This
mainly continues the kernel-wide effort to remove the return code from
the remove() callback in the driver model. The fix addresses a
return-code polarity typo in the new sysfs attribute to manually specify
a device-dax instance mapping range.

This has all appeared in -next with no reported issues.

---

The following changes since commit 1048ba83fb1c00cd24172e23e8263972f6b5d9ac:

  Linux 5.11-rc6 (2021-01-31 13:50:09 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/libnvdimm-for-5.12

for you to fetch changes up to 64ffe84320745ea836555ad207ebfb0e896b6167:

  Merge branch 'for-5.12/dax' into for-5.12/libnvdimm (2021-02-23 18:13:45 -0800)

libnvdimm + device-dax for 5.12

- Fix the error code polarity for the device-dax/mapping attribute

- For the device-dax and libnvdimm bus implementations, stop
  implementing a useless return code for the remove() callback.
- Miscellaneous cleanups

Dan Williams (1):
      Merge branch 'for-5.12/dax' into for-5.12/libnvdimm

Shiyang Ruan (1):
      device-dax: Fix default return code of range_parse()

Uwe Kleine-König (7):
      libnvdimm/dimm: Simplify nvdimm_remove()
      libnvdimm: Make remove callback return void
      device-dax: Prevent registering drivers without probe callback
      device-dax: Properly handle drivers without remove callback
      device-dax: Fix error path in dax_driver_register
      device-dax: Drop an empty .remove callback
      dax-device: Make remove callback return void

 drivers/dax/bus.c         | 24 +---
 drivers/dax/bus.h         |  2 +-
 drivers/dax/device.c      |  8 +---
 drivers/dax/kmem.c        |  7 ++-
 drivers/dax/pmem/compat.c |  3 +--
 drivers/nvdimm/blk.c      |  3 +--
 drivers/nvdimm/bus.c      | 13 +
 drivers/nvdimm/dimm.c     |  7 +--
 drivers/nvdimm/pmem.c     |  4 +---
 drivers/nvdimm/region.c   |  4 +---
 include/linux/nd.h        |  2 +-
 11 files changed, 36 insertions(+), 41 deletions(-)

___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
On Tue, Feb 23, 2021 at 5:00 PM Jason Gunthorpe wrote:
>
> On Tue, Feb 23, 2021 at 04:14:01PM -0800, Dan Williams wrote:
> > [ add Ralph ]
> >
> > On Tue, Feb 23, 2021 at 3:07 PM Jason Gunthorpe wrote:
> > >
> > > On Tue, Feb 23, 2021 at 02:48:20PM -0800, Dan Williams wrote:
> > > > On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe wrote:
> > > > >
> > > > > On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote:
> > > > > >
> > > > > > > The downside would be one extra lookup in dev_pagemap tree
> > > > > > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> > > > > > > per gup-fast() call.
> > > > > >
> > > > > > I'd guess a dev_pagemap lookup is faster than a get_user_pages
> > > > > > slow path. It should be measurable that this change is at least
> > > > > > as fast or faster than falling back to the slow path, but it
> > > > > > would be good to measure.
> > > > >
> > > > > What is the dev_pagemap thing doing in gup fast anyhow?
> > > > >
> > > > > I've been wondering for a while..
> > > >
> > > > It's there to synchronize against dax-device removal. The device
> > > > will suspend removal awaiting all page references to be dropped,
> > > > but gup-fast could be racing device removal. So gup-fast checks for
> > > > pte_devmap() to grab a live reference to the device before assuming
> > > > it can pin a page.
> > >
> > > From the perspective of CPU A it can't tell if CPU B is doing a HW
> > > page table walk or a GUP fast when it invalidates a page table. The
> > > design of gup-fast is supposed to be the same as the design of a HW
> > > page table walk, and the tlb invalidate CPU A does when removing a
> > > page from a page table is supposed to serialize against both a HW
> > > page table walk and gup-fast.
> > >
> > > Given that the HW page table walker does not do dev_pagemap stuff,
> > > why does gup-fast?
> >
> > gup-fast historically assumed that the 'struct page' and memory
> > backing the page-table walk could not physically be removed from the
> > system during its walk because those pages were allocated from the
> > page allocator before being mapped into userspace.
>
> No, I'd say gup-fast assumes that any non-special PTE it finds in a
> page table must have a struct page.
>
> If something wants to remove that struct page it must first remove all
> the PTEs pointing at it from the entire system and flush the TLBs,
> which directly prevents a future gup-fast from running and trying to
> access the struct page. No extra locking needed
>
> > implied elevated reference on any page that gup-fast would be asked to
> > walk, or pte_special() is there to "say wait, nevermind this isn't a
> > page allocator page fallback to gup-slow()".
>
> pte_special says there is no struct page, and some of those cases can
> be fixed up in gup-slow.
>
> > > Can you sketch the exact race this is protecting against?
> >
> > Thread1 mmaps /mnt/daxfile1 from a "mount -o dax" filesystem and
> > issues direct I/O with that mapping as the target buffer, Thread2 does
> > "echo "namespace0.0" > /sys/bus/nd/drivers/nd_pmem/unbind". Without
> > the dev_pagemap check reference gup-fast could execute
> > get_page(pte_page(pte)) on a page that doesn't even exist anymore
> > because the driver unbind has already performed remove_pages().
>
> Surely the unbind either waits for all the VMAs to be destroyed or
> zaps them before allowing things to progress to remove_pages()?

If we're talking about device-dax this is precisely what it does, zaps
and prevents new faults from resolving, but filesystem-dax...

> Having a situation where the CPU page tables still point at physical
> pages that have been removed sounds so crazy/insecure, that can't be
> what is happening, can it??

Hmm, that may be true and an original dax bug!
The unbind of a block-device from underneath the filesystem does trigger
the filesystem to emergency shutdown / go read-only, but unless that
process also includes a global zap of all dax mappings, not only is that
violating expectations of "page-tables to disappearing memory", but the
filesystem may also want to guarantee that no further dax writes can
happen after shutdown. Right now I believe it only assumes that mmap I/O
will come from page writeback so there's no need to bother applications
with mappings to page cache, but dax mappings need to be ripped away.

/me goes to look at what filesystems guarantee when the block-device is
surprise-removed out from under them.

In any event, this accelerates the effort to go implement
fs-global-dax-zap at the request of the device driver.
Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
[ add Ralph ]

On Tue, Feb 23, 2021 at 3:07 PM Jason Gunthorpe wrote:
>
> On Tue, Feb 23, 2021 at 02:48:20PM -0800, Dan Williams wrote:
> > On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe wrote:
> > >
> > > On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote:
> > > >
> > > > > The downside would be one extra lookup in dev_pagemap tree
> > > > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> > > > > per gup-fast() call.
> > > >
> > > > I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
> > > > path. It should be measurable that this change is at least as fast
> > > > or faster than falling back to the slow path, but it would be good
> > > > to measure.
> > >
> > > What is the dev_pagemap thing doing in gup fast anyhow?
> > >
> > > I've been wondering for a while..
> >
> > It's there to synchronize against dax-device removal. The device will
> > suspend removal awaiting all page references to be dropped, but
> > gup-fast could be racing device removal. So gup-fast checks for
> > pte_devmap() to grab a live reference to the device before assuming it
> > can pin a page.
>
> From the perspective of CPU A it can't tell if CPU B is doing a HW
> page table walk or a GUP fast when it invalidates a page table. The
> design of gup-fast is supposed to be the same as the design of a HW
> page table walk, and the tlb invalidate CPU A does when removing a
> page from a page table is supposed to serialize against both a HW page
> table walk and gup-fast.
>
> Given that the HW page table walker does not do dev_pagemap stuff, why
> does gup-fast?

gup-fast historically assumed that the 'struct page' and memory backing
the page-table walk could not physically be removed from the system
during its walk because those pages were allocated from the page
allocator before being mapped into userspace. So there is an implied
elevated reference on any page that gup-fast would be asked to walk, or
pte_special() is there to say "wait, nevermind, this isn't a page
allocator page, fall back to gup-slow()". pte_devmap() is there to say
"wait, there is no implied elevated reference for this page, check and
hold dev_pagemap alive until a page reference can be taken". So it
splits the difference between pte_special() and typical page allocator
pages.

> Can you sketch the exact race this is protecting against?

Thread1 mmaps /mnt/daxfile1 from a "mount -o dax" filesystem and issues
direct I/O with that mapping as the target buffer, Thread2 does
"echo "namespace0.0" > /sys/bus/nd/drivers/nd_pmem/unbind". Without the
dev_pagemap reference check, gup-fast could execute
get_page(pte_page(pte)) on a page that doesn't even exist anymore
because the driver unbind has already performed remove_pages().

Effectively the same percpu_ref that protects the pmem0 block device
from new command submissions while the device is dying also prevents new
dax page references being taken while the device is dying.

This could be solved with the traditional gup-fast rules if the device
driver could tell the filesystem to unmap all dax files and force them
to re-fault through the gup-slow path to see that the device is now
dying. I'll likely be working on that sooner rather than later given
some of the expectations of the CXL persistent memory "dirty shutdown"
detection.
Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
On Tue, Feb 23, 2021 at 10:54 AM Jason Gunthorpe wrote:
>
> On Tue, Feb 23, 2021 at 08:44:52AM -0800, Dan Williams wrote:
> >
> > > The downside would be one extra lookup in dev_pagemap tree
> > > for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> > > per gup-fast() call.
> >
> > I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
> > path. It should be measurable that this change is at least as fast or
> > faster than falling back to the slow path, but it would be good to
> > measure.
>
> What is the dev_pagemap thing doing in gup fast anyhow?
>
> I've been wondering for a while..

It's there to synchronize against dax-device removal. The device will
suspend removal awaiting all page references to be dropped, but gup-fast
could be racing device removal. So gup-fast checks for pte_devmap() to
grab a live reference to the device before assuming it can pin a page.
Re: [ndctl PATCH v2 01/13] cxl: add a cxl utility and libcxl library
On Mon, 2021-02-22 at 13:36 -0800, Ben Widawsky wrote:
[..]
> > +SYNOPSIS
> > +
> > +[verse]
> > +'cxl list' [<options>]
> > +
> > +Walk the CXL capable device hierarchy in the system and list all device
> > +instances along with some of their major attributes.
>
> This doesn't seem to match the above. Here it's just devices and above
> you talk about bridges and switches as well.

Good catch - those can be added in later when we have a sysfs
representation for them. I'll change it to say just devices for now.

[..]
> > +static void *add_cxl_memdev(void *parent, int id, const char *cxlmem_base)
> > +{
> > +	const char *devname = devpath_to_devname(cxlmem_base);
> > +	char *path = calloc(1, strlen(cxlmem_base) + 100);
> > +	struct cxl_ctx *ctx = parent;
> > +	struct cxl_memdev *memdev, *memdev_dup;
> > +	char buf[SYSFS_ATTR_SIZE];
> > +	struct stat st;
> > +
> > +	if (!path)
> > +		return NULL;
> > +	dbg(ctx, "%s: base: \'%s\'\n", __func__, cxlmem_base);
> > +
> > +	memdev = calloc(1, sizeof(*memdev));
> > +	if (!memdev)
> > +		goto err_dev;
> > +	memdev->id = id;
> > +	memdev->ctx = ctx;
> > +
> > +	sprintf(path, "/dev/cxl/%s", devname);
> > +	if (stat(path, &st) < 0)
> > +		goto err_read;
> > +	memdev->major = major(st.st_rdev);
> > +	memdev->minor = minor(st.st_rdev);
> > +
> > +	sprintf(path, "%s/pmem/size", cxlmem_base);
> > +	if (sysfs_read_attr(ctx, path, buf) < 0)
> > +		goto err_read;
> > +	memdev->pmem_size = strtoull(buf, NULL, 0);
>
> For strtoull usage and below - it certainly doesn't matter much but
> maybe using 10 for base would be better since sysfs is ABI and
> therefore anything other than base 10 is incorrect.

Hm, I followed what libndctl does, but I think there is value in
accepting valid hex even if it is technically 'wrong' per the
robustness principle. How much do we want libcxl/libndctl to be a
kernel validation vehicle vs. just work if you can?

[..]
> > +static int cmd_help(int argc, const char **argv, struct cxl_ctx *ctx)
> > +{
> > +	const char * const builtin_help_subcommands[] = {
> > +		"list", NULL,
> > +	};
>
> Move NULL to newline.

Yep.

> > +int cmd_list(int argc, const char **argv, struct cxl_ctx *ctx)
> > +{
> > +	const struct option options[] = {
> > +		OPT_STRING('d', "memdev", , "memory device name",
> > +			   "filter by CXL memory device name"),
> > +		OPT_BOOLEAN('D', "memdevs", ,
> > +			    "include CXL memory device info"),
> > +		OPT_BOOLEAN('i', "idle", , "include idle devices"),
> > +		OPT_BOOLEAN('u', "human", ,
> > +			    "use human friendly number formats "),
> > +		OPT_END(),
> > +	};
> > +	const char * const u[] = {
> > +		"cxl list [<options>]",
> > +		NULL
> > +	};
> > +	struct json_object *jdevs = NULL;
> > +	unsigned long list_flags;
> > +	struct cxl_memdev *memdev;
> > +	int i;
> > +
> > +argc = parse_options(argc, argv, options, u, 0);
>
> Tab.
>
> /me looks for .clang-format

Thanks - let me see if I can quickly adapt the kernel's .clang-format
for this and add it in for the next revision.
Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
On Tue, Feb 23, 2021 at 9:19 AM Joao Martins wrote:
>
> On 2/23/21 4:50 PM, Dan Williams wrote:
> > On Tue, Feb 23, 2021 at 7:46 AM Joao Martins wrote:
> >> On 2/22/21 8:37 PM, Dan Williams wrote:
> >>> On Mon, Feb 22, 2021 at 3:24 AM Joao Martins wrote:
> >>>> On 2/20/21 1:43 AM, Dan Williams wrote:
> >>>>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard wrote:
> >>>>>> On 12/8/20 9:28 AM, Joao Martins wrote:
>
> [...]
>
> >>> Don't get me wrong the
> >>> capability is still needed for filesystem-dax, but the distinction is
> >>> that vmemmap_populate_compound_pages() need never worry about an
> >>> altmap.
> >>
> >> IMO there's not much added complexity strictly speaking about altmap.
> >> We still use the same vmemmap_{pmd,pte,pgd}_populate helpers which
> >> just pass an altmap. So whatever it is being maintained for fsdax or
> >> other altmap consumers (e.g. we seem to be working towards hotplug
> >> making use of it) we are using it in the exact same way.
> >>
> >> The complexity of the future vmemmap_populate_compound_pages() has
> >> more to do with reusing vmemmap blocks allocated in previous vmemmap
> >> pages, and preserving that across section onlining (for 1G pages).
> >
> > True, I'm less worried about the complexity as much as
> > opportunistically converting configurations to RAM backed pages. It's
> > already the case that poison handling is page mapping size aligned for
> > device-dax, and filesystem-dax needs to stick with non-compound-pages
> > for the foreseeable future.
>
> Hmm, I was sort of wondering that fsdax could move to compound pages
> too as opposed to base pages, albeit not necessarily using the vmemmap
> page reuse as it splits pages IIUC.

I'm not sure compound pages for fsdax would work long term because
there's no infrastructure to reassemble compound pages after a split. So
if you fracture a block and then coalesce it back to a 2MB or 1GB
aligned block there's nothing to go fixup the compound page... unless
the filesystem wants to get into mm metadata fixups.
Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
On Tue, Feb 23, 2021 at 9:16 AM Joao Martins wrote:
>
> On 2/23/21 4:44 PM, Dan Williams wrote:
> > On Tue, Feb 23, 2021 at 8:30 AM Joao Martins wrote:
> >>
> >> On 2/20/21 1:18 AM, Dan Williams wrote:
> >>> On Tue, Dec 8, 2020 at 9:32 AM Joao Martins wrote:
> >>>> Patch 6 - 8: Optimize grabbing/release a page refcount changes
> >>>> given that we are working with compound pages i.e. we do 1
> >>>> increment/decrement to the head page for a given set of N subpages
> >>>> as opposed to N individual writes.
> >>>> {get,pin}_user_pages_fast() for zone_device with compound pagemap
> >>>> consequently improves considerably, and unpin_user_pages() improves
> >>>> as well when passed a set of consecutive pages:
> >>>>
> >>>>                                          before      after
> >>>> (get_user_pages_fast 1G;2M page size)  ~75k us  -> ~3.2k ; ~5.2k us
> >>>> (pin_user_pages_fast 1G;2M page size)  ~125k us -> ~3.4k ; ~5.5k us
> >>>
> >>> Compelling!
> >>>
> >>
> >> BTW is there any reason why we don't support pin_user_pages_fast()
> >> with FOLL_LONGTERM for device-dax?
> >>
> >
> > Good catch.
> >
> > Must have been an oversight of the conversion. FOLL_LONGTERM collides
> > with filesystem operations, but not device-dax.
>
> Hmm, fwiw, it was unilaterally disabled for any devmap pmd/pud in
> commit 7af75561e171 ("mm/gup: add FOLL_LONGTERM capability to GUP
> fast") and I must only assume that by "DAX pages" the submitter was
> only referring to fs-dax pages.

Agree, that was an fsdax only assumption. Maybe this went unnoticed
because the primary gup-fast case for direct-I/O was not impacted.
Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
On 2/23/21 4:50 PM, Dan Williams wrote:
> On Tue, Feb 23, 2021 at 7:46 AM Joao Martins wrote:
>> On 2/22/21 8:37 PM, Dan Williams wrote:
>>> On Mon, Feb 22, 2021 at 3:24 AM Joao Martins wrote:
>>>> On 2/20/21 1:43 AM, Dan Williams wrote:
>>>>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard wrote:
>>>>>> On 12/8/20 9:28 AM, Joao Martins wrote:
>
> [...]
>
>>> Don't get me wrong the
>>> capability is still needed for filesystem-dax, but the distinction is
>>> that vmemmap_populate_compound_pages() need never worry about an
>>> altmap.
>>>
>> IMO there's not much added complexity strictly speaking about altmap.
>> We still use the same vmemmap_{pmd,pte,pgd}_populate helpers which just
>> pass an altmap. So whatever it is being maintained for fsdax or other
>> altmap consumers (e.g. we seem to be working towards hotplug making use
>> of it) we are using it in the exact same way.
>>
>> The complexity of the future vmemmap_populate_compound_pages() has more
>> to do with reusing vmemmap blocks allocated in previous vmemmap pages,
>> and preserving that across section onlining (for 1G pages).
>
> True, I'm less worried about the complexity as much as
> opportunistically converting configurations to RAM backed pages. It's
> already the case that poison handling is page mapping size aligned for
> device-dax, and filesystem-dax needs to stick with non-compound-pages
> for the foreseeable future.
>
Hmm, I was sort of wondering that fsdax could move to compound pages too
as opposed to base pages, albeit not necessarily using the vmemmap page
reuse as it splits pages IIUC.

> Ok, let's try to keep altmap in vmemmap_populate_compound_pages() and
> see how it looks.
>
OK, will do.
Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
On 2/23/21 4:44 PM, Dan Williams wrote:
> On Tue, Feb 23, 2021 at 8:30 AM Joao Martins wrote:
>>
>> On 2/20/21 1:18 AM, Dan Williams wrote:
>>> On Tue, Dec 8, 2020 at 9:32 AM Joao Martins wrote:
>>>> Patch 6 - 8: Optimize grabbing/release a page refcount changes given
>>>> that we are working with compound pages i.e. we do 1
>>>> increment/decrement to the head page for a given set of N subpages
>>>> as opposed to N individual writes.
>>>> {get,pin}_user_pages_fast() for zone_device with compound pagemap
>>>> consequently improves considerably, and unpin_user_pages() improves
>>>> as well when passed a set of consecutive pages:
>>>>
>>>>                                          before      after
>>>> (get_user_pages_fast 1G;2M page size)  ~75k us  -> ~3.2k ; ~5.2k us
>>>> (pin_user_pages_fast 1G;2M page size)  ~125k us -> ~3.4k ; ~5.5k us
>>>
>>> Compelling!
>>>
>>
>> BTW is there any reason why we don't support pin_user_pages_fast() with
>> FOLL_LONGTERM for device-dax?
>>
>
> Good catch.
>
> Must have been an oversight of the conversion. FOLL_LONGTERM collides
> with filesystem operations, but not device-dax.

Hmm, fwiw, it was unilaterally disabled for any devmap pmd/pud in commit
7af75561e171 ("mm/gup: add FOLL_LONGTERM capability to GUP fast") and I
must only assume that by "DAX pages" the submitter was only referring to
fs-dax pages.

> In fact that's the
> motivation for device-dax in the first instance, no need to coordinate
> runtime physical address layout changes because the device is
> statically allocated.
>
/me nods

>> Looking at the history, I understand that fsdax can't support it atm,
>> but I am not sure that the same holds for device-dax. I have this
>> small chunk (see below the scissors mark) which relaxes this for a
>> pgmap of type MEMORY_DEVICE_GENERIC, albeit not sure if there is a
>> fundamental issue for the other types that makes this an unwelcoming
>> change.
>>
>> Joao
>>
>> ->8-
>>
>> Subject: [PATCH] mm/gup: allow FOLL_LONGTERM pin-fast for MEMORY_DEVICE_GENERIC
>>
>> The downside would be one extra lookup in dev_pagemap tree
>> for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
>> per gup-fast() call.
>
> I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
> path. It should be measurable that this change is at least as fast or
> faster than falling back to the slow path, but it would be good to
> measure.
>
But with the changes I am/will-be making I hope gup-fast and gup-slow
will be as fast (for present pmd/puds ofc, as the fault makes it
slower). I'll formally submit the patch below once I've run the numbers.

> Code changes look good to me.
>
Cool! Will add in the suggested change below.

>>
>> Signed-off-by: Joao Martins
>> ---
>>  include/linux/mm.h |  5 +
>>  mm/gup.c           | 24 +---
>>  2 files changed, 18 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 32f0c3986d4f..c89a049bbd7a 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1171,6 +1171,11 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
>>  	page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
>>  }
>>
>> +static inline bool devmap_longterm_available(const struct dev_pagemap *pgmap)
>> +{
>
> I'd call this devmap_can_longterm().
>
Ack.
Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
On Tue, Feb 23, 2021 at 7:46 AM Joao Martins wrote:
>
> On 2/22/21 8:37 PM, Dan Williams wrote:
> > On Mon, Feb 22, 2021 at 3:24 AM Joao Martins wrote:
> >> On 2/20/21 1:43 AM, Dan Williams wrote:
> >>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard wrote:
> >>>> On 12/8/20 9:28 AM, Joao Martins wrote:
> >>>>> diff --git a/mm/memremap.c b/mm/memremap.c
> >>>>> index 16b2fb482da1..287a24b7a65a 100644
> >>>>> --- a/mm/memremap.c
> >>>>> +++ b/mm/memremap.c
> >>>>> @@ -277,8 +277,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
> >>>>>  	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> >>>>>  				PHYS_PFN(range->start),
> >>>>>  				PHYS_PFN(range_len(range)), pgmap);
> >>>>> -	percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
> >>>>> -			- pfn_first(pgmap, range_id));
> >>>>> +	if (pgmap->flags & PGMAP_COMPOUND)
> >>>>> +		percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
> >>>>> +			- pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align));
> >>>>
> >>>> Is there some reason that we cannot use range_len(), instead of pfn_end() minus
> >>>> pfn_first()? (Yes, this more about the pre-existing code than about your change.)
> >>>>
> >>>> And if not, then why are the nearby range_len() uses OK? I realize that range_len()
> >>>> is simpler and skips a case, but it's not clear that it's required here. But I'm
> >>>> new to this area so be warned. :)
> >>>
> >>> There's a subtle distinction between the range that was passed in and
> >>> the pfns that are activated inside of it. See the offset trickery in
> >>> pfn_first().
> >>>
> >>>> Also, dividing by PHYS_PFN() feels quite misleading: that function does what you
> >>>> happen to want, but is not named accordingly. Can you use or create something
> >>>> more accurately named? Like "number of pages in this large page"?
> >>>
> >>> It's not the number of pages in a large page it's converting bytes to
> >>> pages. Other places in the kernel write it as (x >> PAGE_SHIFT), but my
> >>> thought process was if I'm going to add () might as well use a macro
> >>> that already does this.
> >>>
> >>> That said I think this calculation is broken precisely because
> >>> pfn_first() makes the result unaligned.
> >>>
> >>> Rather than fix the unaligned pfn_first() problem I would use this
> >>> support as an opportunity to revisit the option of storing pages in
> >>> the vmem_altmap reserve space. The altmap's whole reason for existence
> >>> was that 1.5% of large PMEM might completely swamp DRAM. However if
> >>> that overhead is reduced by an order (or orders) of magnitude the
> >>> primary need for vmem_altmap vanishes.
> >>>
> >>> Now, we'll still need to keep it around for the ->align == PAGE_SIZE
> >>> case, but for most part existing deployments that are specifying page
> >>> map on PMEM and an align > PAGE_SIZE can instead just transparently be
> >>> upgraded to page map on a smaller amount of DRAM.
> >>>
> >> I feel the altmap is still relevant. Even with the struct page reuse for
> >> tail pages, the overhead for 2M align is still non-negligible i.e. 4G per
> >> 1Tb (strictly speaking about what's stored in the altmap). Muchun and
> >> Matthew were thinking (in another thread) on compound_head() adjustments
> >> that probably can make this overhead go to 2G (if we learn to differentiate
> >> the reused head page from the real head page).
> >
> > I think that realization is more justification to make a new first
> > class vmemmap_populate_compound_pages() rather than try to reuse
> > vmemmap_populate_basepages() with new parameters.
> >
> I was already going to move this to vmemmap_populate_compound_pages() based
> on your earlier suggestion :)
>
> >> But even there it's still
> >> 2G per 1Tb. 1G pages, though, have a better story to remove altmap need.
> > > > The concern that led to altmap is that someone would build a system > > with a 96:1 (PMEM:RAM) ratio where that correlates to maximum PMEM and > > minimum RAM, and mapping all PMEM consumes all RAM. As far as I > > understand real world populations are rarely going past 8:1, that > > seems to make 'struct page' in RAM feasible even for the 2M compound > > page case. > > > > Let me ask you for a data point, since you're one of the people > > actively deploying such systems, would you still use the 'struct page' > > in PMEM capability after this set was merged? > > > We might be sticking to RAM stored 'struct page' yes, but hard to say atm > what the future holds. > > >> One thing to point out about altmap is that the degradation (in pinning and > >> unpining) we observed with struct page's in device memory, is no longer > >> observed > >> once 1) we batch ref count updates as we move to compound pages 2) reusing > >> tail pages seems to lead to these struct pages staying
Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
On Tue, Feb 23, 2021 at 8:30 AM Joao Martins wrote:
>
> On 2/20/21 1:18 AM, Dan Williams wrote:
> > On Tue, Dec 8, 2020 at 9:32 AM Joao Martins wrote:
> >> Patch 6 - 8: Optimize grabbing/releasing a page refcount, given that we
> >> are working with compound pages, i.e. we do 1 increment/decrement to the head
> >> page for a given set of N subpages, as opposed to N individual writes.
> >> {get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
> >> improves considerably, and unpin_user_pages() improves as well when passed a
> >> set of consecutive pages:
> >>
> >>                                        before      after
> >> (get_user_pages_fast 1G;2M page size) ~75k us  -> ~3.2k ; ~5.2k us
> >> (pin_user_pages_fast 1G;2M page size) ~125k us -> ~3.4k ; ~5.5k us
> >
> > Compelling!
> >
> BTW is there any reason why we don't support pin_user_pages_fast() with
> FOLL_LONGTERM for device-dax?

Good catch.

Must have been an oversight of the conversion. FOLL_LONGTERM collides
with filesystem operations, but not device-dax. In fact that's the
motivation for device-dax in the first instance, no need to coordinate
runtime physical address layout changes because the device is
statically allocated.

> Looking at the history, I understand that fsdax can't support it atm, but I am not sure
> that the same holds for device-dax. I have this small chunk (see below the scissors mark)
> which relaxes this for a pgmap of type MEMORY_DEVICE_GENERIC, albeit not sure if there is
> a fundamental issue for the other types that makes this an unwelcoming change.
>
> Joao
>
> ->8-
>
> Subject: [PATCH] mm/gup: allow FOLL_LONGTERM pin-fast for MEMORY_DEVICE_GENERIC
>
> The downside would be one extra lookup in dev_pagemap tree
> for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
> per gup-fast() call.

I'd guess a dev_pagemap lookup is faster than a get_user_pages slow
path.
It should be measurable that this change is at least as fast or faster
than falling back to the slow path, but it would be good to measure.

Code changes look good to me.

>
> Signed-off-by: Joao Martins
> ---
>  include/linux/mm.h | 5 +
>  mm/gup.c | 24 +---
>  2 files changed, 18 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 32f0c3986d4f..c89a049bbd7a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1171,6 +1171,11 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
>  		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
>  }
>
> +static inline bool devmap_longterm_available(const struct dev_pagemap *pgmap)
> +{

I'd call this devmap_can_longterm().

> +	return pgmap->type == MEMORY_DEVICE_GENERIC;
> +}
> +
>  /* 127: arbitrary random number, small enough to assemble well */
>  #define page_ref_zero_or_close_to_overflow(page)	\
>  	((unsigned int) page_ref_count(page) + 127u <= 127u)
> diff --git a/mm/gup.c b/mm/gup.c
> index 222d1fdc5cfa..03e370d360e6 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2092,14 +2092,18 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  			goto pte_unmap;
>
>  		if (pte_devmap(pte)) {
> -			if (unlikely(flags & FOLL_LONGTERM))
> -				goto pte_unmap;
> -
>  			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
>  			if (unlikely(!pgmap)) {
>  				undo_dev_pagemap(nr, nr_start, flags, pages);
>  				goto pte_unmap;
>  			}
> +
> +			if (unlikely(flags & FOLL_LONGTERM) &&
> +			    !devmap_longterm_available(pgmap)) {
> +				undo_dev_pagemap(nr, nr_start, flags, pages);
> +				goto pte_unmap;
> +			}
> +
>  		} else if (pte_special(pte))
>  			goto pte_unmap;
>
> @@ -2195,6 +2199,10 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>  			return 0;
>  		}
>
> +		if (unlikely(flags & FOLL_LONGTERM) &&
> +		    !devmap_longterm_available(pgmap))
> +			return 0;
> +
> @@ -2356,12 +2364,9 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>  	if (!pmd_access_permitted(orig, flags & FOLL_WRITE))
>  		return 0;
>
> -	if (pmd_devmap(orig)) {
> -		if (unlikely(flags & FOLL_LONGTERM))
> -			return 0;
> +	if (pmd_devmap(orig))
>  		return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
>  					     pages, nr);
> -	}
>
>  	page =
Re: [PATCH RFC 0/9] mm, sparse-vmemmap: Introduce compound pagemaps
On 2/20/21 1:18 AM, Dan Williams wrote:
> On Tue, Dec 8, 2020 at 9:32 AM Joao Martins wrote:
>> Patch 6 - 8: Optimize grabbing/releasing a page refcount, given that we
>> are working with compound pages, i.e. we do 1 increment/decrement to the head
>> page for a given set of N subpages, as opposed to N individual writes.
>> {get,pin}_user_pages_fast() for zone_device with compound pagemap consequently
>> improves considerably, and unpin_user_pages() improves as well when passed a
>> set of consecutive pages:
>>
>>                                        before      after
>> (get_user_pages_fast 1G;2M page size) ~75k us  -> ~3.2k ; ~5.2k us
>> (pin_user_pages_fast 1G;2M page size) ~125k us -> ~3.4k ; ~5.5k us
>
> Compelling!
>

BTW is there any reason why we don't support pin_user_pages_fast() with
FOLL_LONGTERM for device-dax?

Looking at the history, I understand that fsdax can't support it atm, but I am not sure
that the same holds for device-dax. I have this small chunk (see below the scissors mark)
which relaxes this for a pgmap of type MEMORY_DEVICE_GENERIC, albeit not sure if there is
a fundamental issue for the other types that makes this an unwelcoming change.

Joao

->8-

Subject: [PATCH] mm/gup: allow FOLL_LONGTERM pin-fast for MEMORY_DEVICE_GENERIC

The downside would be one extra lookup in dev_pagemap tree
for other pgmap->types (P2P, FSDAX, PRIVATE). But just one
per gup-fast() call.
Signed-off-by: Joao Martins
---
 include/linux/mm.h | 5 +
 mm/gup.c | 24 +---
 2 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 32f0c3986d4f..c89a049bbd7a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1171,6 +1171,11 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
 		page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
 }

+static inline bool devmap_longterm_available(const struct dev_pagemap *pgmap)
+{
+	return pgmap->type == MEMORY_DEVICE_GENERIC;
+}
+
 /* 127: arbitrary random number, small enough to assemble well */
 #define page_ref_zero_or_close_to_overflow(page)	\
 	((unsigned int) page_ref_count(page) + 127u <= 127u)
diff --git a/mm/gup.c b/mm/gup.c
index 222d1fdc5cfa..03e370d360e6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2092,14 +2092,18 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 			goto pte_unmap;

 		if (pte_devmap(pte)) {
-			if (unlikely(flags & FOLL_LONGTERM))
-				goto pte_unmap;
-
 			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
 			if (unlikely(!pgmap)) {
 				undo_dev_pagemap(nr, nr_start, flags, pages);
 				goto pte_unmap;
 			}
+
+			if (unlikely(flags & FOLL_LONGTERM) &&
+			    !devmap_longterm_available(pgmap)) {
+				undo_dev_pagemap(nr, nr_start, flags, pages);
+				goto pte_unmap;
+			}
+
 		} else if (pte_special(pte))
 			goto pte_unmap;

@@ -2195,6 +2199,10 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 			return 0;
 		}

+		if (unlikely(flags & FOLL_LONGTERM) &&
+		    !devmap_longterm_available(pgmap))
+			return 0;
+
@@ -2356,12 +2364,9 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	if (!pmd_access_permitted(orig, flags & FOLL_WRITE))
 		return 0;

-	if (pmd_devmap(orig)) {
-		if (unlikely(flags & FOLL_LONGTERM))
-			return 0;
+	if (pmd_devmap(orig))
 		return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
 					     pages, nr);
-	}

 	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
@@ -2390,12 +2395,9 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 	if (!pud_access_permitted(orig, flags & FOLL_WRITE))
 		return 0;

-	if (pud_devmap(orig)) {
-		if (unlikely(flags & FOLL_LONGTERM))
-			return 0;
+	if (pud_devmap(orig))
 		return __gup_device_huge_pud(orig, pudp, addr, end, flags,
 					     pages, nr);
-	}

 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
--
2.17.1
Re: [PATCH RFC 3/9] sparse-vmemmap: Reuse vmemmap areas for a given page size
On 2/22/21 10:40 PM, Dan Williams wrote:
> On Mon, Feb 22, 2021 at 3:42 AM Joao Martins wrote:
>> On 2/20/21 3:34 AM, Dan Williams wrote:
>>> On Tue, Dec 8, 2020 at 9:32 AM Joao Martins wrote:
>>>> Sections are 128M (or bigger/smaller),
>>>
>>> Huh?
>>>
>>
>> Section size is arch-dependent if we are being holistic.
>> On x86 it's 64M, 128M or 512M right?
>>
>> #ifdef CONFIG_X86_32
>> # ifdef CONFIG_X86_PAE
>> #  define SECTION_SIZE_BITS	29
>> #  define MAX_PHYSMEM_BITS	36
>> # else
>> #  define SECTION_SIZE_BITS	26
>> #  define MAX_PHYSMEM_BITS	32
>> # endif
>> #else /* CONFIG_X86_32 */
>> # define SECTION_SIZE_BITS	27 /* matt - 128 is convenient right now */
>> # define MAX_PHYSMEM_BITS	(pgtable_l5_enabled() ? 52 : 46)
>> #endif
>>
>> Also, me pointing about section sizes, is because a 1GB+ page vmemmap population will
>> cross sections in how sparsemem populates the vmemmap. And on that case we gotta reuse
>> the PTE/PMD pages across multiple invocations of vmemmap_populate_basepages(). Either
>> that, or looking at the previous page PTE, but that might be inefficient.
>
> Ok, makes sense. I think this description of needing to handle
> section crossing is clearer than mentioning one of the section sizes.
>
I'll amend the commit message to have this.
>>>> @@ -229,38 +235,95 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>>>>  	for (; addr < end; addr += PAGE_SIZE) {
>>>>  		pgd = vmemmap_pgd_populate(addr, node);
>>>>  		if (!pgd)
>>>> -			return -ENOMEM;
>>>> +			return NULL;
>>>>  		p4d = vmemmap_p4d_populate(pgd, addr, node);
>>>>  		if (!p4d)
>>>> -			return -ENOMEM;
>>>> +			return NULL;
>>>>  		pud = vmemmap_pud_populate(p4d, addr, node);
>>>>  		if (!pud)
>>>> -			return -ENOMEM;
>>>> +			return NULL;
>>>>  		pmd = vmemmap_pmd_populate(pud, addr, node);
>>>>  		if (!pmd)
>>>> -			return -ENOMEM;
>>>> -		pte = vmemmap_pte_populate(pmd, addr, node, altmap);
>>>> +			return NULL;
>>>> +		pte = vmemmap_pte_populate(pmd, addr, node, altmap, block);
>>>>  		if (!pte)
>>>> -			return -ENOMEM;
>>>> +			return NULL;
>>>>  		vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
>>>>  	}
>>>> +	return __va(__pfn_to_phys(pte_pfn(*pte)));
>>>> +}
>>>> +
>>>> +int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>>>> +					 int node, struct vmem_altmap *altmap)
>>>> +{
>>>> +	if (!__vmemmap_populate_basepages(start, end, node, altmap, NULL))
>>>> +		return -ENOMEM;
>>>>  	return 0;
>>>>  }
>>>> +
>>>> +static struct page * __meminit vmemmap_populate_reuse(unsigned long start,
>>>> +					unsigned long end, int node,
>>>> +					struct vmem_context *ctx)
>>>> +{
>>>> +	unsigned long size, addr = start;
>>>> +	unsigned long psize = PHYS_PFN(ctx->align) * sizeof(struct page);
>>>> +
>>>> +	size = min(psize, end - start);
>>>> +
>>>> +	for (; addr < end; addr += size) {
>>>> +		unsigned long head = addr + PAGE_SIZE;
>>>> +		unsigned long tail = addr;
>>>> +		unsigned long last = addr + size;
>>>> +		void *area;
>>>> +
>>>> +		if (ctx->block_page &&
>>>> +		    IS_ALIGNED((addr - ctx->block_page), psize))
>>>> +			ctx->block = NULL;
>>>> +
>>>> +		area = ctx->block;
>>>> +		if (!area) {
>>>> +			if (!__vmemmap_populate_basepages(addr, head, node,
>>>> +							  ctx->altmap, NULL))
>>>> +				return NULL;
>>>> +
>>>> +			tail = head + PAGE_SIZE;
>>>> +			area = __vmemmap_populate_basepages(head, tail, node,
>>>> +							    ctx->altmap, NULL);
>>>> +			if (!area)
>>>> +				return NULL;
>>>> +
>>>> +			ctx->block = area;
>>>> +			ctx->block_page = addr;
>>>> +		}
>>>> +
>>>> +		if (!__vmemmap_populate_basepages(tail, last, node,
>>>> +						  ctx->altmap, area))
>>>> +			return NULL;
>>>> +	}
>>>
>>> I think that compound page accounting and combined altmap accounting
>>> makes this difficult to read, and I think the compound page case
>>> deserves
Re: [PATCH RFC 1/9] memremap: add ZONE_DEVICE support for compound pages
On 2/22/21 8:37 PM, Dan Williams wrote:
> On Mon, Feb 22, 2021 at 3:24 AM Joao Martins wrote:
>> On 2/20/21 1:43 AM, Dan Williams wrote:
>>> On Tue, Dec 8, 2020 at 9:59 PM John Hubbard wrote:
>>>> On 12/8/20 9:28 AM, Joao Martins wrote:
>>>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>>>> index 16b2fb482da1..287a24b7a65a 100644
>>>>> --- a/mm/memremap.c
>>>>> +++ b/mm/memremap.c
>>>>> @@ -277,8 +277,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
>>>>>  	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
>>>>>  				PHYS_PFN(range->start),
>>>>>  				PHYS_PFN(range_len(range)), pgmap);
>>>>> -	percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
>>>>> -			- pfn_first(pgmap, range_id));
>>>>> +	if (pgmap->flags & PGMAP_COMPOUND)
>>>>> +		percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
>>>>> +			- pfn_first(pgmap, range_id)) / PHYS_PFN(pgmap->align));
>>>>
>>>> Is there some reason that we cannot use range_len(), instead of pfn_end() minus
>>>> pfn_first()? (Yes, this more about the pre-existing code than about your change.)
>>>>
>>>> And if not, then why are the nearby range_len() uses OK? I realize that range_len()
>>>> is simpler and skips a case, but it's not clear that it's required here. But I'm
>>>> new to this area so be warned. :)
>>>
>>> There's a subtle distinction between the range that was passed in and
>>> the pfns that are activated inside of it. See the offset trickery in
>>> pfn_first().
>>>
>>>> Also, dividing by PHYS_PFN() feels quite misleading: that function does what you
>>>> happen to want, but is not named accordingly. Can you use or create something
>>>> more accurately named? Like "number of pages in this large page"?
>>>
>>> It's not the number of pages in a large page it's converting bytes to
>>> pages. Other places in the kernel write it as (x >> PAGE_SHIFT), but my
>>> thought process was if I'm going to add () might as well use a macro
>>> that already does this.
>>>
>>> That said I think this calculation is broken precisely because
>>> pfn_first() makes the result unaligned.
>>>
>>> Rather than fix the unaligned pfn_first() problem I would use this
>>> support as an opportunity to revisit the option of storing pages in
>>> the vmem_altmap reserve space. The altmap's whole reason for existence
>>> was that 1.5% of large PMEM might completely swamp DRAM. However if
>>> that overhead is reduced by an order (or orders) of magnitude the
>>> primary need for vmem_altmap vanishes.
>>>
>>> Now, we'll still need to keep it around for the ->align == PAGE_SIZE
>>> case, but for most part existing deployments that are specifying page
>>> map on PMEM and an align > PAGE_SIZE can instead just transparently be
>>> upgraded to page map on a smaller amount of DRAM.
>>>
>> I feel the altmap is still relevant. Even with the struct page reuse for
>> tail pages, the overhead for 2M align is still non-negligible i.e. 4G per
>> 1Tb (strictly speaking about what's stored in the altmap). Muchun and
>> Matthew were thinking (in another thread) on compound_head() adjustments
>> that probably can make this overhead go to 2G (if we learn to differentiate
>> the reused head page from the real head page).
>
> I think that realization is more justification to make a new first
> class vmemmap_populate_compound_pages() rather than try to reuse
> vmemmap_populate_basepages() with new parameters.
>
I was already going to move this to vmemmap_populate_compound_pages() based
on your earlier suggestion :)

>> But even there it's still
>> 2G per 1Tb. 1G pages, though, have a better story to remove altmap need.
>
> The concern that led to altmap is that someone would build a system
> with a 96:1 (PMEM:RAM) ratio where that correlates to maximum PMEM and
> minimum RAM, and mapping all PMEM consumes all RAM. As far as I
> understand real world populations are rarely going past 8:1, that
> seems to make 'struct page' in RAM feasible even for the 2M compound
> page case.
> > Let me ask you for a data point, since you're one of the people > actively deploying such systems, would you still use the 'struct page' > in PMEM capability after this set was merged? > We might be sticking to RAM stored 'struct page' yes, but hard to say atm what the future holds. >> One thing to point out about altmap is that the degradation (in pinning and >> unpining) we observed with struct page's in device memory, is no longer >> observed >> once 1) we batch ref count updates as we move to compound pages 2) reusing >> tail pages seems to lead to these struct pages staying more likely in cache >> which perhaps contributes to dirtying a lot less cachelines. > > True, it makes it more palatable to survive 'struct page' in PMEM, but > it's an ongoing maintenance burden that I'm not sure there are users > after putting