[PATCH] acpi, nfit: fix health event notification
Integration testing with a BIOS that generates injected health event
notifications fails to communicate those events to userspace. The nfit
driver neglects to link the ACPI DIMM device with the necessary driver
data so acpi_nvdimm_notify() fails this lookup:

	nfit_mem = dev_get_drvdata(dev);
	if (nfit_mem && nfit_mem->flags_attr)
		sysfs_notify_dirent(nfit_mem->flags_attr);

Add the necessary linkage when installing the notification handler and
clean it up when the nfit driver instance is torn down.

Cc: Toshi Kani
Cc: Vishal Verma
Fixes: ba9c8dd3c222 ("acpi, nfit: add dimm device notification support")
Reported-by: Daniel Osawa
Signed-off-by: Dan Williams
---
 drivers/acpi/nfit/core.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index ff2580e7611d..947ea8a92761 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -1670,6 +1670,11 @@ static int acpi_nfit_add_dimm(struct acpi_nfit_desc *acpi_desc,
 				dev_name(&adev_dimm->dev));
 		return -ENXIO;
 	}
+	/*
+	 * Record nfit_mem for the notification path to track back to
+	 * the nfit sysfs attributes for this dimm device object.
+	 */
+	dev_set_drvdata(&adev_dimm->dev, nfit_mem);
 
 	/*
 	 * Until standardization materializes we need to consider 4
@@ -1755,6 +1760,7 @@ static void shutdown_dimm_notify(void *data)
 		if (adev_dimm)
 			acpi_remove_notify_handler(adev_dimm->handle,
 					ACPI_DEVICE_NOTIFY, acpi_nvdimm_notify);
+		dev_set_drvdata(&adev_dimm->dev, NULL);
 	}
 	mutex_unlock(&acpi_desc->init_mutex);
 }

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
[RFC PATCH 2/4] firmware: dmi: Add function to look up a handle and return DIMM size
When we first scan the SMBIOS table, save the size of the DIMM.
Provide a function for other code (EDAC driver) to look up the size
of a DIMM from its SMBIOS handle.

Signed-off-by: Tony Luck
---
 drivers/firmware/dmi_scan.c | 29 +++++++++++++++++++++++++++++
 include/linux/dmi.h         |  2 ++
 2 files changed, 31 insertions(+)

diff --git a/drivers/firmware/dmi_scan.c b/drivers/firmware/dmi_scan.c
index 783041964439..946e86fb1ec6 100644
--- a/drivers/firmware/dmi_scan.c
+++ b/drivers/firmware/dmi_scan.c
@@ -37,6 +37,7 @@ static char dmi_ids_string[128] __initdata;
 static struct dmi_memdev_info {
 	const char *device;
 	const char *bank;
+	u64 size;
 	u16 handle;
 } *dmi_memdev;
 static int dmi_memdev_nr;
@@ -395,6 +396,8 @@ static void __init save_mem_devices(const struct dmi_header *dm, void *v)
 {
 	const char *d = (const char *)dm;
 	static int nr;
+	u64 bytes;
+	u16 size;
 
 	if (dm->type != DMI_ENTRY_MEM_DEVICE || dm->length < 0x12)
 		return;
@@ -405,6 +408,18 @@ static void __init save_mem_devices(const struct dmi_header *dm, void *v)
 	dmi_memdev[nr].handle = get_unaligned(&dm->handle);
 	dmi_memdev[nr].device = dmi_string(dm, d[0x10]);
 	dmi_memdev[nr].bank = dmi_string(dm, d[0x11]);
+	size = get_unaligned((u16 *)&d[0xC]);
+	if (size == 0)
+		bytes = 0;
+	else if (size == 0xffff)
+		bytes = ~0ul;
+	else if (size & 0x8000)
+		bytes = (u64)(size & 0x7fff) << 10;
+	else if (size != 0x7fff)
+		bytes = (u64)size << 20;
+	else
+		bytes = (u64)get_unaligned((u32 *)&d[0x1C]) << 20;
+	dmi_memdev[nr].size = bytes;
 	nr++;
 }
 
@@ -1073,3 +1088,17 @@ void dmi_memdev_name(u16 handle, const char **bank, const char **device)
 	}
 }
 EXPORT_SYMBOL_GPL(dmi_memdev_name);
+
+u64 dmi_memdev_size(u16 handle)
+{
+	int n;
+
+	if (dmi_memdev) {
+		for (n = 0; n < dmi_memdev_nr; n++) {
+			if (handle == dmi_memdev[n].handle)
+				return dmi_memdev[n].size;
+		}
+	}
+	return ~0ul;
+}
+EXPORT_SYMBOL_GPL(dmi_memdev_size);
diff --git a/include/linux/dmi.h b/include/linux/dmi.h
index 46e151172d95..7f5929123b69 100644
--- a/include/linux/dmi.h
+++ b/include/linux/dmi.h
@@ -113,6 +113,7 @@ extern int dmi_walk(void (*decode)(const struct dmi_header *, void *),
 	void *private_data);
 extern bool dmi_match(enum dmi_field f, const char *str);
 extern void dmi_memdev_name(u16 handle, const char **bank, const char **device);
+extern u64 dmi_memdev_size(u16 handle);
 
 #else
 
@@ -142,6 +143,7 @@ static inline bool dmi_match(enum dmi_field f, const char *str)
 	{ return false; }
 static inline void dmi_memdev_name(u16 handle, const char **bank,
 		const char **device) { }
+static inline u64 dmi_memdev_size(u16 handle) { return ~0ul; }
 static inline const struct dmi_system_id *
 	dmi_first_match(const struct dmi_system_id *list) { return NULL; }
-- 
2.14.1
[RFC PATCH 3/4] edac: Add new memory type for non-volatile DIMMs
There are now non-volatile versions of DIMMs. Add a new entry to
"enum mem_type" and update places that use it with new strings.

Signed-off-by: Tony Luck
---
 drivers/edac/edac_mc.c       | 1 +
 drivers/edac/edac_mc_sysfs.c | 3 ++-
 include/linux/edac.h         | 3 +++
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 480072139b7a..8178e74decbf 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -215,6 +215,7 @@ const char * const edac_mem_types[] = {
 	[MEM_LRDDR3]	= "Load-Reduced DDR3 RAM",
 	[MEM_DDR4]	= "Unbuffered DDR4 RAM",
 	[MEM_RDDR4]	= "Registered DDR4 RAM",
+	[MEM_NVDIMM]	= "Non-volatile RAM",
 };
 EXPORT_SYMBOL_GPL(edac_mem_types);
 
diff --git a/drivers/edac/edac_mc_sysfs.c b/drivers/edac/edac_mc_sysfs.c
index e4fcfa84fbd3..53cbb3518efc 100644
--- a/drivers/edac/edac_mc_sysfs.c
+++ b/drivers/edac/edac_mc_sysfs.c
@@ -110,7 +110,8 @@ static const char * const mem_types[] = {
 	[MEM_DDR3] = "Unbuffered-DDR3",
 	[MEM_RDDR3] = "Registered-DDR3",
 	[MEM_DDR4] = "Unbuffered-DDR4",
-	[MEM_RDDR4] = "Registered-DDR4"
+	[MEM_RDDR4] = "Registered-DDR4",
+	[MEM_NVDIMM] = "Non-volatile RAM",
 };
 
 static const char * const dev_types[] = {
diff --git a/include/linux/edac.h b/include/linux/edac.h
index cd75c173fd00..bffb97828ed6 100644
--- a/include/linux/edac.h
+++ b/include/linux/edac.h
@@ -186,6 +186,7 @@ static inline char *mc_event_error_type(const unsigned int err_type)
  * @MEM_RDDR4:		Registered DDR4 RAM
  *			This is a variant of the DDR4 memories.
  * @MEM_LRDDR4:		Load-Reduced DDR4 memory.
+ * @MEM_NVDIMM:		Non-volatile RAM
  */
 enum mem_type {
 	MEM_EMPTY = 0,
@@ -209,6 +210,7 @@ enum mem_type {
 	MEM_DDR4,
 	MEM_RDDR4,
 	MEM_LRDDR4,
+	MEM_NVDIMM,
 };
 
 #define MEM_FLAG_EMPTY		BIT(MEM_EMPTY)
@@ -231,6 +233,7 @@ enum mem_type {
 #define MEM_FLAG_DDR4		BIT(MEM_DDR4)
 #define MEM_FLAG_RDDR4		BIT(MEM_RDDR4)
 #define MEM_FLAG_LRDDR4		BIT(MEM_LRDDR4)
+#define MEM_FLAG_NVDIMM		BIT(MEM_NVDIMM)
 
 /**
  * enum edac-type - Error Detection and Correction capabilities and mode
-- 
2.14.1
[RFC PATCH 0/4] Teach EDAC driver about NVDIMMs
A Skylake server may have some DIMM slots filled with NVDIMMs instead
of normal DDR4 DIMMs. These are enumerated differently by the memory
controller.

Sadly there isn't an easy way to just peek at some memory controller
register to find the size of these DIMMs, so we have to rely on the
NFIT and SMBIOS tables to get that information.

This series only tackles the topology function of the EDAC driver. A
later series of patches will fix the address translation parts so that
errors in NVDIMMs will be reported correctly.

It's marked "RFC" because it depends on the new ACPICA version 20171110
which has only just made it to Rafael's tree.

Some of you may only care about some of the parts that touch code you
maintain, but I copied you on all four because you might like to see
the bigger picture.

Tony Luck (4):
  acpi, nfit: Add function to look up nvdimm device and provide SMBIOS
    handle
  firmware: dmi: Add function to look up a handle and return DIMM size
  edac: Add new memory type for non-volatile DIMMs
  EDAC, skx_edac: Detect non-volatile DIMMs

 drivers/acpi/nfit/core.c     | 27 ++++++++++++++++
 drivers/edac/Kconfig         |  2 ++
 drivers/edac/edac_mc.c       |  1 +
 drivers/edac/edac_mc_sysfs.c |  3 ++-
 drivers/edac/skx_edac.c      | 56 ++++++++++++++++++++++++++++-----
 drivers/firmware/dmi_scan.c  | 29 +++++++++++++++++
 include/acpi/nfit.h          | 19 ++++++++++++
 include/linux/dmi.h          |  2 ++
 include/linux/edac.h         |  3 +++
 9 files changed, 136 insertions(+), 6 deletions(-)
 create mode 100644 include/acpi/nfit.h

base-commit: 3fc70f8be59950ee2deecefdddb68be19b8cddd1
-- 
2.14.1
[RFC PATCH 4/4] EDAC, skx_edac: Detect non-volatile DIMMs
This just covers the topology function of the EDAC driver. We locate
which DIMM slots are populated with NVDIMMs and query the NFIT and
SMBIOS tables to get the size.

Signed-off-by: Tony Luck
---
 drivers/edac/Kconfig    |  2 ++
 drivers/edac/skx_edac.c | 56 ++++++++++++++++++++++++++++++-----
 2 files changed, 53 insertions(+), 5 deletions(-)

diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
index 96afb2aeed18..5c0c4a358f67 100644
--- a/drivers/edac/Kconfig
+++ b/drivers/edac/Kconfig
@@ -232,6 +232,8 @@ config EDAC_SBRIDGE
 config EDAC_SKX
 	tristate "Intel Skylake server Integrated MC"
 	depends on PCI && X86_64 && X86_MCE_INTEL && PCI_MMCONFIG
+	select DMI
+	select ACPI_NFIT
 	help
 	  Support for error detection and correction the Intel
 	  Skylake server Integrated Memory Controllers.
diff --git a/drivers/edac/skx_edac.c b/drivers/edac/skx_edac.c
index 16dea97568a1..814a5245029c 100644
--- a/drivers/edac/skx_edac.c
+++ b/drivers/edac/skx_edac.c
@@ -14,6 +14,8 @@
 
 #include <linux/module.h>
 #include <linux/init.h>
+#include <linux/acpi.h>
+#include <linux/dmi.h>
 #include <linux/pci.h>
 #include <linux/pci_ids.h>
 #include <linux/slab.h>
@@ -24,6 +26,7 @@
 #include <linux/bitmap.h>
 #include <linux/math64.h>
 #include <linux/mod_devicetable.h>
+#include <acpi/nfit.h>
 #include <asm/cpu_device_id.h>
 #include <asm/intel-family.h>
 #include <asm/processor.h>
@@ -298,6 +301,7 @@ static int get_dimm_attr(u32 reg, int lobit, int hibit, int add, int minval,
 }
 
 #define IS_DIMM_PRESENT(mtr)		GET_BITFIELD((mtr), 15, 15)
+#define IS_NVDIMM_PRESENT(mcddrtcfg, i)	GET_BITFIELD((mcddrtcfg), (i), (i))
 
 #define numrank(reg)	get_dimm_attr((reg), 12, 13, 0, 1, 2, "ranks")
 #define numrow(reg)	get_dimm_attr((reg), 2, 4, 12, 1, 6, "rows")
@@ -346,8 +350,6 @@ static int get_dimm_info(u32 mtr, u32 amap, struct dimm_info *dimm,
 	int  banks = 16, ranks, rows, cols, npages;
 	u64 size;
 
-	if (!IS_DIMM_PRESENT(mtr))
-		return 0;
-
 	ranks = numrank(mtr);
 	rows = numrow(mtr);
 	cols = numcol(mtr);
@@ -379,6 +381,46 @@ static int get_dimm_info(u32 mtr, u32 amap, struct dimm_info *dimm,
 	return 1;
 }
 
+static int get_nvdimm_info(struct dimm_info *dimm, struct skx_imc *imc,
+			   int chan, int dimmno)
+{
+	int smbios_handle;
+	u32 dev_handle;
+	u16 flags;
+	u64 size;
+
+	dev_handle = ACPI_NFIT_BUILD_DEVICE_HANDLE(dimmno, chan, imc->lmc,
+						   imc->src_id, 0);
+
+	smbios_handle = nfit_get_smbios_id(dev_handle, &flags);
+	if (smbios_handle < 0) {
+		skx_printk(KERN_ERR, "Can't find handle for NVDIMM ADR=%x\n", dev_handle);
+		return 0;
+	}
+	if (flags & ACPI_NFIT_MEM_MAP_FAILED) {
+		skx_printk(KERN_ERR, "NVDIMM ADR=%x is not mapped\n", dev_handle);
+		return 0;
+	}
+	size = dmi_memdev_size(smbios_handle);
+	if (size == ~0ul) {
+		skx_printk(KERN_ERR, "Can't find size for NVDIMM ADR=%x/SMBIOS=%x\n",
+			   dev_handle, smbios_handle);
+		return 0;
+	}
+	edac_dbg(0, "mc#%d: channel %d, dimm %d, %lld Mb (%lld pages)\n",
+		 imc->mc, chan, dimmno, size >> 20, size >> PAGE_SHIFT);
+
+	dimm->nr_pages = size >> PAGE_SHIFT;
+	dimm->grain = 32;
+	dimm->dtype = DEV_UNKNOWN;
+	dimm->mtype = MEM_NVDIMM;
+	dimm->edac_mode = EDAC_SECDED; /* likely better than this */
+
+	snprintf(dimm->label, sizeof(dimm->label), "CPU_SrcID#%u_MC#%u_Chan#%u_DIMM#%u",
+		 imc->src_id, imc->lmc, chan, dimmno);
+
+	return 1;
+}
+
 #define SKX_GET_MTMTR(dev, reg) \
 	pci_read_config_dword((dev), 0x87c, &(reg))
 
@@ -395,20 +437,24 @@ static int skx_get_dimm_config(struct mem_ctl_info *mci)
 {
 	struct skx_pvt *pvt = mci->pvt_info;
 	struct skx_imc *imc = pvt->imc;
+	u32 mtr, amap, mcddrtcfg;
 	struct dimm_info *dimm;
 	int i, j;
-	u32 mtr, amap;
 	int ndimms;
 
 	for (i = 0; i < NUM_CHANNELS; i++) {
 		ndimms = 0;
 		pci_read_config_dword(imc->chan[i].cdev, 0x8C, &amap);
+		pci_read_config_dword(imc->chan[i].cdev, 0x400, &mcddrtcfg);
 		for (j = 0; j < NUM_DIMMS; j++) {
 			dimm = EDAC_DIMM_PTR(mci->layers, mci->dimms,
 					     mci->n_layers, i, j, 0);
 			pci_read_config_dword(imc->chan[i].cdev,
 					      0x80 + 4*j, &mtr);
-			ndimms += get_dimm_info(mtr, amap, dimm, imc, i, j);
+			if (IS_DIMM_PRESENT(mtr))
+				ndimms += get_dimm_info(mtr, amap, dimm, imc, i, j);
+			else if (IS_NVDIMM_PRESENT(mcddrtcfg, j))
+				ndimms += get_nvdimm_info(dimm, imc, i, j);
 		}
 		if (ndimms && !skx_check_ecc(imc->chan[0].cdev)) {
 			skx_printk(KERN_ERR, "ECC is disabled on imc %d\n",
[RFC PATCH 1/4] acpi, nfit: Add function to look up nvdimm device and provide SMBIOS handle
EDAC driver needs to look up attributes of NVDIMMs provided in SMBIOS.
Provide a function that looks up an acpi_nfit_memory_map from a device
handle (node/socket/mc/channel/dimm) and returns the SMBIOS handle.
Also pass back the "flags" so we can see if the NVDIMM is OK.

Signed-off-by: Tony Luck
---
 drivers/acpi/nfit/core.c | 27 +++++++++++++++++++++++++++
 include/acpi/nfit.h      | 19 +++++++++++++++++++
 2 files changed, 46 insertions(+)
 create mode 100644 include/acpi/nfit.h

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 9c2c49b6a240..31c0dc30f88f 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -23,6 +23,7 @@
 #include <linux/io.h>
 #include <linux/nd.h>
 #include <asm/cacheflush.h>
+#include <acpi/nfit.h>
 #include "nfit.h"
 
 /*
@@ -478,6 +479,32 @@ static bool add_memdev(struct acpi_nfit_desc *acpi_desc,
 	return true;
 }
 
+int nfit_get_smbios_id(u32 device_handle, u16 *flags)
+{
+	struct acpi_nfit_memory_map *memdev;
+	struct acpi_nfit_desc *acpi_desc;
+	struct nfit_mem *nfit_mem;
+
+	mutex_lock(&acpi_desc_lock);
+	list_for_each_entry(acpi_desc, &acpi_descs, list) {
+		mutex_lock(&acpi_desc->init_mutex);
+		list_for_each_entry(nfit_mem, &acpi_desc->dimms, list) {
+			memdev = __to_nfit_memdev(nfit_mem);
+			if (memdev->device_handle == device_handle) {
+				mutex_unlock(&acpi_desc->init_mutex);
+				mutex_unlock(&acpi_desc_lock);
+				*flags = memdev->flags;
+				return memdev->physical_id;
+			}
+		}
+		mutex_unlock(&acpi_desc->init_mutex);
+	}
+	mutex_unlock(&acpi_desc_lock);
+
+	return -ENODEV;
+}
+EXPORT_SYMBOL_GPL(nfit_get_smbios_id);
+
 /*
  * An implementation may provide a truncated control region if no block windows
  * are defined.
diff --git a/include/acpi/nfit.h b/include/acpi/nfit.h
new file mode 100644
index 000000000000..1eee1e32e72e
--- /dev/null
+++ b/include/acpi/nfit.h
@@ -0,0 +1,19 @@
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#ifndef __ACPI_NFIT_H
+#define __ACPI_NFIT_H
+
+int nfit_get_smbios_id(u32 device_handle, u16 *flags);
+
+#endif /* __ACPI_NFIT_H */
-- 
2.14.1
Re: [PATCH v3 1/4] mm: introduce get_user_pages_longterm
[ adding linux-rdma ]

On Thu, Nov 30, 2017 at 10:17 AM, Michal Hocko wrote:
> On Thu 30-11-17 10:03:26, Dan Williams wrote:
> > On Thu, Nov 30, 2017 at 9:42 AM, Michal Hocko wrote:
> > > On Thu 30-11-17 08:39:51, Dan Williams wrote:
> > > > On Thu, Nov 30, 2017 at 1:53 AM, Michal Hocko wrote:
> > > > > On Wed 29-11-17 10:05:35, Dan Williams wrote:
> > > > > > Until there is a solution to the dma-to-dax vs truncate problem it is
> > > > > > not safe to allow long standing memory registrations against
> > > > > > filesytem-dax vmas. Device-dax vmas do not have this problem and are
> > > > > > explicitly allowed.
> > > > > >
> > > > > > This is temporary until a "memory registration with layout-lease"
> > > > > > mechanism can be implemented for the affected sub-systems (RDMA and
> > > > > > V4L2).
> > > > >
> > > > > One thing is not clear to me. Who is allowed to pin pages for ever?
> > > > > Is it possible to pin LRU pages that way as well? If yes then there
> > > > > absolutely has to be a limit for that. Sorry I could have studied the
> > > > > code much more but from a quick glance it seems to me that this is not
> > > > > limited to dax (or non-LRU in general) pages.
> > > >
> > > > I would turn this question around. "who can not tolerate a page being
> > > > pinned forever?".
> > >
> > > Any struct page on the movable zone or anything that is living on the
> > > LRU list because such a memory is unreclaimable.
> > >
> > > > In the case of filesytem-dax a page is
> > > > one-in-the-same object as a filesystem-block, and a filesystem expects
> > > > that its operations will not be blocked indefinitely. LRU pages can
> > > > continue to be pinned indefinitely because operations can continue
> > > > around the pinned page, i.e. every agent, save for the dma agent,
> > > > drops their reference to the page and its tolerable that the final
> > > > put_page() never arrives.
> > >
> > > I do not understand. Are you saying that a user triggered IO can pin LRU
> > > pages indefinitely. This would be _really_ wrong. It would be basically
> > > an mlock without any limit. So I must be misreading you here
> >
> > You're not misreading. See ib_umem_get() for example, it pins pages in
> > response to the userspace library call ibv_reg_mr() (memory
> > registration), and will not release those pages unless/until a call to
> > ibv_dereg_mr() is made.
>
> Who and how many LRU pages can pin that way and how do you prevent nasty
> users to DoS systems this way?

I assume this is something the RDMA community has had to contend with?
I'm not an RDMA person, I'm just here to fix dax.

> I remember PeterZ wanted to address a similar issue by vmpin syscall
> that would be a subject of a rlimit control. Sorry but I cannot find a
> reference here https://lwn.net/Articles/600502/
> but if this is at g-u-p level without any accounting then
> it smells quite broken to me.

It's certainly broken with respect to filesystem-dax and if there is
other breakage we should get it all on the table.
Re: [PATCH v3 1/4] mm: introduce get_user_pages_longterm
On Thu 30-11-17 10:03:26, Dan Williams wrote:
> On Thu, Nov 30, 2017 at 9:42 AM, Michal Hocko wrote:
> > On Thu 30-11-17 08:39:51, Dan Williams wrote:
> > > On Thu, Nov 30, 2017 at 1:53 AM, Michal Hocko wrote:
> > > > On Wed 29-11-17 10:05:35, Dan Williams wrote:
> > > > > Until there is a solution to the dma-to-dax vs truncate problem it is
> > > > > not safe to allow long standing memory registrations against
> > > > > filesytem-dax vmas. Device-dax vmas do not have this problem and are
> > > > > explicitly allowed.
> > > > >
> > > > > This is temporary until a "memory registration with layout-lease"
> > > > > mechanism can be implemented for the affected sub-systems (RDMA and
> > > > > V4L2).
> > > >
> > > > One thing is not clear to me. Who is allowed to pin pages for ever?
> > > > Is it possible to pin LRU pages that way as well? If yes then there
> > > > absolutely has to be a limit for that. Sorry I could have studied the
> > > > code much more but from a quick glance it seems to me that this is not
> > > > limited to dax (or non-LRU in general) pages.
> > >
> > > I would turn this question around. "who can not tolerate a page being
> > > pinned forever?".
> >
> > Any struct page on the movable zone or anything that is living on the
> > LRU list because such a memory is unreclaimable.
> >
> > > In the case of filesytem-dax a page is
> > > one-in-the-same object as a filesystem-block, and a filesystem expects
> > > that its operations will not be blocked indefinitely. LRU pages can
> > > continue to be pinned indefinitely because operations can continue
> > > around the pinned page, i.e. every agent, save for the dma agent,
> > > drops their reference to the page and its tolerable that the final
> > > put_page() never arrives.
> >
> > I do not understand. Are you saying that a user triggered IO can pin LRU
> > pages indefinitely. This would be _really_ wrong. It would be basically
> > an mlock without any limit. So I must be misreading you here
>
> You're not misreading. See ib_umem_get() for example, it pins pages in
> response to the userspace library call ibv_reg_mr() (memory
> registration), and will not release those pages unless/until a call to
> ibv_dereg_mr() is made.

Who and how many LRU pages can pin that way and how do you prevent nasty
users to DoS systems this way?

I remember PeterZ wanted to address a similar issue by vmpin syscall
that would be a subject of a rlimit control. Sorry but I cannot find a
reference here but if this is at g-u-p level without any accounting then
it smells quite broken to me.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH v3 1/4] mm: introduce get_user_pages_longterm
On Thu, Nov 30, 2017 at 9:42 AM, Michal Hocko wrote:
> On Thu 30-11-17 08:39:51, Dan Williams wrote:
> > On Thu, Nov 30, 2017 at 1:53 AM, Michal Hocko wrote:
> > > On Wed 29-11-17 10:05:35, Dan Williams wrote:
> > > > Until there is a solution to the dma-to-dax vs truncate problem it is
> > > > not safe to allow long standing memory registrations against
> > > > filesytem-dax vmas. Device-dax vmas do not have this problem and are
> > > > explicitly allowed.
> > > >
> > > > This is temporary until a "memory registration with layout-lease"
> > > > mechanism can be implemented for the affected sub-systems (RDMA and
> > > > V4L2).
> > >
> > > One thing is not clear to me. Who is allowed to pin pages for ever?
> > > Is it possible to pin LRU pages that way as well? If yes then there
> > > absolutely has to be a limit for that. Sorry I could have studied the
> > > code much more but from a quick glance it seems to me that this is not
> > > limited to dax (or non-LRU in general) pages.
> >
> > I would turn this question around. "who can not tolerate a page being
> > pinned forever?".
>
> Any struct page on the movable zone or anything that is living on the
> LRU list because such a memory is unreclaimable.
>
> > In the case of filesytem-dax a page is
> > one-in-the-same object as a filesystem-block, and a filesystem expects
> > that its operations will not be blocked indefinitely. LRU pages can
> > continue to be pinned indefinitely because operations can continue
> > around the pinned page, i.e. every agent, save for the dma agent,
> > drops their reference to the page and its tolerable that the final
> > put_page() never arrives.
>
> I do not understand. Are you saying that a user triggered IO can pin LRU
> pages indefinitely. This would be _really_ wrong. It would be basically
> an mlock without any limit. So I must be misreading you here

You're not misreading. See ib_umem_get() for example, it pins pages in
response to the userspace library call ibv_reg_mr() (memory
registration), and will not release those pages unless/until a call to
ibv_dereg_mr() is made.

The current plan to fix this is to create something like a
ibv_reg_mr_lease() call that registers the memory with an F_SETLEASE
semantic so that the kernel can notify userspace that a memory
registration is being forcibly revoked by the kernel.

A previous attempt at something like this was the proposed MAP_DIRECT
mmap flag [1].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-October/012815.html
Re: [PATCH v3 1/4] mm: introduce get_user_pages_longterm
On Thu 30-11-17 08:39:51, Dan Williams wrote:
> On Thu, Nov 30, 2017 at 1:53 AM, Michal Hocko wrote:
> > On Wed 29-11-17 10:05:35, Dan Williams wrote:
> > > Until there is a solution to the dma-to-dax vs truncate problem it is
> > > not safe to allow long standing memory registrations against
> > > filesytem-dax vmas. Device-dax vmas do not have this problem and are
> > > explicitly allowed.
> > >
> > > This is temporary until a "memory registration with layout-lease"
> > > mechanism can be implemented for the affected sub-systems (RDMA and
> > > V4L2).
> >
> > One thing is not clear to me. Who is allowed to pin pages for ever?
> > Is it possible to pin LRU pages that way as well? If yes then there
> > absolutely has to be a limit for that. Sorry I could have studied the
> > code much more but from a quick glance it seems to me that this is not
> > limited to dax (or non-LRU in general) pages.
>
> I would turn this question around. "who can not tolerate a page being
> pinned forever?".

Any struct page on the movable zone or anything that is living on the
LRU list because such a memory is unreclaimable.

> In the case of filesytem-dax a page is
> one-in-the-same object as a filesystem-block, and a filesystem expects
> that its operations will not be blocked indefinitely. LRU pages can
> continue to be pinned indefinitely because operations can continue
> around the pinned page, i.e. every agent, save for the dma agent,
> drops their reference to the page and its tolerable that the final
> put_page() never arrives.

I do not understand. Are you saying that a user triggered IO can pin LRU
pages indefinitely. This would be _really_ wrong. It would be basically
an mlock without any limit. So I must be misreading you here

> As far as I can tell it's only filesystems
> and dax that have this collision of wanting to revoke dma access to a
> page combined with not being able to wait indefinitely for dma to
> quiesce.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH v3 1/4] mm: introduce get_user_pages_longterm
On Thu, Nov 30, 2017 at 1:53 AM, Michal Hocko wrote:
> On Wed 29-11-17 10:05:35, Dan Williams wrote:
> > Until there is a solution to the dma-to-dax vs truncate problem it is
> > not safe to allow long standing memory registrations against
> > filesytem-dax vmas. Device-dax vmas do not have this problem and are
> > explicitly allowed.
> >
> > This is temporary until a "memory registration with layout-lease"
> > mechanism can be implemented for the affected sub-systems (RDMA and
> > V4L2).
>
> One thing is not clear to me. Who is allowed to pin pages for ever?
> Is it possible to pin LRU pages that way as well? If yes then there
> absolutely has to be a limit for that. Sorry I could have studied the
> code much more but from a quick glance it seems to me that this is not
> limited to dax (or non-LRU in general) pages.

I would turn this question around. "who can not tolerate a page being
pinned forever?". In the case of filesytem-dax a page is
one-in-the-same object as a filesystem-block, and a filesystem expects
that its operations will not be blocked indefinitely. LRU pages can
continue to be pinned indefinitely because operations can continue
around the pinned page, i.e. every agent, save for the dma agent,
drops their reference to the page and its tolerable that the final
put_page() never arrives. As far as I can tell it's only filesystems
and dax that have this collision of wanting to revoke dma access to a
page combined with not being able to wait indefinitely for dma to
quiesce.
Re: [PATCH v3 1/4] mm: introduce get_user_pages_longterm
On Wed 29-11-17 10:05:35, Dan Williams wrote:
> Until there is a solution to the dma-to-dax vs truncate problem it is
> not safe to allow long standing memory registrations against
> filesytem-dax vmas. Device-dax vmas do not have this problem and are
> explicitly allowed.
>
> This is temporary until a "memory registration with layout-lease"
> mechanism can be implemented for the affected sub-systems (RDMA and
> V4L2).

One thing is not clear to me. Who is allowed to pin pages for ever?
Is it possible to pin LRU pages that way as well? If yes then there
absolutely has to be a limit for that. Sorry I could have studied the
code much more but from a quick glance it seems to me that this is not
limited to dax (or non-LRU in general) pages.
-- 
Michal Hocko
SUSE Labs