Re: [PATCH] memremap: move from kernel/ to mm/
On 07/22/2019 03:11 PM, Christoph Hellwig wrote:
> memremap.c implements MM functionality for ZONE_DEVICE, so it really
> should be in the mm/ directory, not the kernel/ one.
>
> Signed-off-by: Christoph Hellwig

This always made sense. FWIW

Reviewed-by: Anshuman Khandual

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH] mm, memory-failure: clarify error message
On 05/17/2019 09:38 AM, Jane Chu wrote:
> Some user who install SIGBUS handler that does longjmp out

What is the longjmp about? Are you referring to the mechanism of
catching the signal which was registered?

> therefore keeping the process alive is confused by the error
> message
> "[188988.765862] Memory failure: 0x1840200: Killing
> cellsrv:33395 due to hardware memory corruption"

It's a valid point, because those are two distinct actions.

> Slightly modify the error message to improve clarity.
>
> Signed-off-by: Jane Chu
> ---
> mm/memory-failure.c | 7 ---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index fc8b517..14de5e2 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -216,10 +216,9 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
> 	short addr_lsb = tk->size_shift;
> 	int ret;
>
> -	pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory corruption\n",
> -		pfn, t->comm, t->pid);
> -
> 	if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
> +		pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory "
> +			"corruption\n", pfn, t->comm, t->pid);
> 		ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)tk->addr,
> 					addr_lsb, current);
> 	} else {
> @@ -229,6 +228,8 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
> 		 * This could cause a loop when the user sets SIGBUS
> 		 * to SIG_IGN, but hopefully no one will do that?
> 		 */
> +		pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to hardware "
> +			"memory corruption\n", pfn, t->comm, t->pid);
> 		ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)tk->addr,
> 					addr_lsb, t);	/* synchronous? */

As both the pr_err() messages are very similar, couldn't we just switch
between "Killing" and "Sending SIGBUS to" based on a variable, e.g.
action_[kill|sigbus], evaluated previously with
((flags & MF_ACTION_REQUIRED) && t->mm == current->mm)?
Re: [RFC PATCH] mm/nvdimm: Fix kernel crash on devm_memremap_pages_release
On 05/14/2019 08:23 AM, Aneesh Kumar K.V wrote:
> When we initialize the namespace, if we support altmap, we don't
> initialize all the backing struct page, whereas while releasing the
> namespace we look at some of these uninitialized struct page. This
> results in a kernel crash as below.

Yes, this has been problematic. I have also encountered it previously,
but in a bit different way (while searching memory resources).

> kernel BUG at include/linux/mm.h:1034!

What would that be? I did not see a corresponding BUG_ON() at that line
in the file.

> cpu 0x2: Vector: 700 (Program Check) at [c0024146b870]
> pc: c03788f8: devm_memremap_pages_release+0x258/0x3a0
> lr: c03788f4: devm_memremap_pages_release+0x254/0x3a0
> sp: c0024146bb00
> msr: 8282b033
> current = 0xc00241382f00
> paca = 0xc0003fffd680 irqmask: 0x03 irq_happened: 0x01
> pid = 4114, comm = ndctl
> c09bf8c0 devm_action_release+0x30/0x50
> c09c0938 release_nodes+0x268/0x2d0
> c09b95b4 device_release_driver_internal+0x164/0x230
> c09b638c unbind_store+0x13c/0x190
> c09b4f44 drv_attr_store+0x44/0x60
> c058ccc0 sysfs_kf_write+0x70/0xa0
> c058b52c kernfs_fop_write+0x1ac/0x290
> c04a415c __vfs_write+0x3c/0x70
> c04a85ac vfs_write+0xec/0x200
> c04a8920 ksys_write+0x80/0x130
> c000bee4 system_call+0x5c/0x70

I saw this as a memory hotplug problem with respect to ZONE_DEVICE based
device memory, hence a bit different explanation, which I never posted.
I guess parts of that commit message can be used here for a more
comprehensive explanation of the problem:

mm/hotplug: Initialize struct pages for vmem_altmap reserved areas

The following ZONE_DEVICE ranges (altmap) have valid struct pages
allocated from within the device memory memmap range.

A. Driver reserved area  [BASE -> BASE + RESV)
B. Device memmap area    [BASE + RESV -> BASE + RESV + FREE)
C. Device usable area    [BASE + RESV + FREE -> END]

BASE - pgmap->altmap.base_pfn (pgmap->res.start >> PAGE_SHIFT)
RESV - pgmap->altmap.reserve
FREE - pgmap->altmap.free
END  - pgmap->res.end >> PAGE_SHIFT

Struct page init for all areas happens in two phases, which detect the
altmap use case and init parts of the device range in each phase.

1. memmap_init_zone (Device memmap area)
2. memmap_init_zone_device (Device usable area)

memmap_init_zone() skips the driver reserved area and does not init its
struct pages. This is problematic primarily for two reasons.

Though NODE_DATA(device_node(dev))->node_zones[ZONE_DEVICE] contains the
device memory range in its entirety (in zone->spanned_pages), parts of
this range do not have their zone set to ZONE_DEVICE in their struct
page. __remove_pages(), called directly or from within
arch_remove_memory() during the ZONE_DEVICE tear down procedure
(devm_memremap_pages_release), hits an error (like below) if there are
reserved pages. This is because the first pfn of the device range
(invariably also the first pfn of the reserved area) cannot be
identified as belonging to ZONE_DEVICE. This erroneously leads to a
range search within the iomem_resource region, which never had this
device memory region, so it eventually ends up flagging the following
error:

Unable to release resource <0x00068000-0x0006bfff> (-22)

Initialize struct pages for the driver reserved range while still
staying clear of its contents.

> Signed-off-by: Aneesh Kumar K.V
> ---
> mm/page_alloc.c | 5 +
> 1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 59661106da16..892eabe1ec13 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5740,8 +5740,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>
> #ifdef CONFIG_ZONE_DEVICE
> 	/*
> -	 * Honor reservation requested by the driver for this ZONE_DEVICE
> -	 * memory. We limit the total number of pages to initialize to just
> +	 * We limit the total number of pages to initialize to just

The comment needs a bit of change to reflect the fact that both the
driver reserved area as well as the mapped area (containing the altmap
struct pages) need init here.
Re: [PATCH] arm64: configurable sparsemem section size
On 04/25/2019 01:18 AM, Pavel Tatashin wrote:
> On Wed, Apr 24, 2019 at 5:07 AM Anshuman Khandual wrote:
>>
>> On 04/24/2019 02:08 AM, Pavel Tatashin wrote:
>>> sparsemem section size determines the maximum size and alignment that
>>> is allowed to offline/online memory block. The bigger the size the less
>>> the clutter in /sys/devices/system/memory/*. On the other hand, however,
>>> there is less flexability in what granules of memory can be added and
>>> removed.
>>
>> Is there any scenario where less than 1GB needs to be added on arm64 ?
>
> Yes, DAX hotplug loses 1G of memory without allowing smaller sections.
> Machines on which we are going to be using this functionality have 8G
> of System RAM, therefore losing 1G is a big problem.
>
> For details about using scenario see this cover letter:
> https://lore.kernel.org/lkml/20190421014429.31206-1-pasha.tatas...@soleen.com/

Is it losing 1GB because devdax has a 2M alignment? IIRC from Dan's
subsection memory hot add series, 2M comes from persistent memory HW
controller limitations. Is that limitation applicable across all
platforms, including arm64, for all possible persistent memory vendors?
I mean, is it universal? IIUC the subsection memory hotplug series is
still getting reviewed. Hence shouldn't we wait for it to get merged
before enabling applicable platforms to accommodate these 2M
limitations?

>>> Recently, it was enabled in Linux to hotadd persistent memory that
>>> can be either real NV device, or reserved from regular System RAM
>>> and has identity of devdax.
>>
>> devdax (even ZONE_DEVICE) support has not been enabled on arm64 yet.
>
> Correct, I use your patches to enable ZONE_DEVICE, and thus devdax on ARM64:
> https://lore.kernel.org/lkml/1554265806-11501-1-git-send-email-anshuman.khand...@arm.com/
>
>>> The problem is that because ARM64's section size is 1G, and devdax must
>>> have 2M label section, the first 1G is always missed when device is
>>> attached, because it is not 1G aligned.
>>
>> devdax has to be 2M aligned ? Does Linux enforce that right now ?
>
> Unfortunately, there is no way around this. Part of the memory can be
> reserved as persistent memory via device tree.
>
> memory@4000 {
> 	device_type = "memory";
> 	reg = < 0x 0x4000
> 		0x0002 0x >;
> };
>
> pmem@1c000 {
> 	compatible = "pmem-region";
> 	reg = <0x0001 0xc000
> 		0x 0x8000>;
> 	volatile;
> 	numa-node-id = <0>;
> };
>
> So, while pmem is section aligned, as it should be, the dax device is
> going to be pmem start address + label size, which is 2M.

Forgive my ignorance here, but why is the dax device label size 2M
aligned? Again, is that because of some persistent memory HW controller
limitation?

> The actual DAX device starts at:
> 0x1c000 + 2M.
>
> Because section size is 1G, the hotplug will able to add only memory
> starting from
> 0x1c000 + 1G

Got it, but as mentioned before, we will have to make sure that the 2M
alignment requirement is universal, else we will be adjusting this
multiple times.

>> 27 and 28 do not even compile for ARM64_64K_PAGES because of MAX_ORDER
>> and SECTION_SIZE mismatch.

Even with 27 bits it is a 128 MB section size. How does that solve the
problem with 2M? Or did the patch just intend to reduce the memory
wastage?

> Can you please elaborate what configs are you using? I have no
> problems compiling with 27 and 28 bit.

After applying your patch [1] on the current mainline kernel [2]:
$make defconfig

CONFIG_ARM64_64K_PAGES=y
CONFIG_ARM64_VA_BITS_48=y
CONFIG_ARM64_VA_BITS=48
CONFIG_ARM64_PA_BITS_48=y
CONFIG_ARM64_PA_BITS=48
CONFIG_ARM64_SECTION_SIZE_BITS=27

[1] https://patchwork.kernel.org/patch/10913737/
[2] git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

It fails with:

  CC      arch/arm64/kernel/asm-offsets.s
In file included from ./include/linux/gfp.h:6,
                 from ./include/linux/slab.h:15,
                 from ./include/linux/resource_ext.h:19,
                 from ./include/linux/acpi.h:26,
                 from ./include/acpi/apei.h:9,
                 from ./include/acpi/ghes.h:5,
                 from ./include/linux/arm_sdei.h:14,
                 from arch/arm64/kernel/asm-offsets.c:21:
./include/linux/mmzone.h:1095:2: error: #error Allocator MAX_ORDER exceeds SECTION_SIZE
 #error Allocator MAX_ORDER exceeds SECTION_SIZE
Re: [PATCH] arm64: configurable sparsemem section size
On 04/24/2019 02:08 AM, Pavel Tatashin wrote:
> sparsemem section size determines the maximum size and alignment that
> is allowed to offline/online memory block. The bigger the size the less
> the clutter in /sys/devices/system/memory/*. On the other hand, however,
> there is less flexability in what granules of memory can be added and
> removed.

Is there any scenario where less than 1GB needs to be added on arm64 ?

> Recently, it was enabled in Linux to hotadd persistent memory that
> can be either real NV device, or reserved from regular System RAM
> and has identity of devdax.

devdax (even ZONE_DEVICE) support has not been enabled on arm64 yet.

> The problem is that because ARM64's section size is 1G, and devdax must
> have 2M label section, the first 1G is always missed when device is
> attached, because it is not 1G aligned.

devdax has to be 2M aligned ? Does Linux enforce that right now ?

> Allow, better flexibility by making section size configurable.

Unless 2M is being enforced by Linux, not sure why this is necessary at
the moment.

> Signed-off-by: Pavel Tatashin
> ---
> arch/arm64/Kconfig                 | 10 ++
> arch/arm64/include/asm/sparsemem.h |  2 +-
> 2 files changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index b5d8cf57e220..a0c5b9d13a7f 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -801,6 +801,16 @@ config ARM64_PA_BITS
> 	default 48 if ARM64_PA_BITS_48
> 	default 52 if ARM64_PA_BITS_52
>
> +config ARM64_SECTION_SIZE_BITS
> +	int "sparsemem section size shift"
> +	range 27 30

27 and 28 do not even compile for ARM64_64K_PAGES because of MAX_ORDER
and SECTION_SIZE mismatch.
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On 12/23/2017 03:43 AM, Ross Zwisler wrote:
> On Fri, Dec 22, 2017 at 08:39:41AM +0530, Anshuman Khandual wrote:
>> On 12/14/2017 07:40 AM, Ross Zwisler wrote:
>>> Quick Summary
>>>
>>> Platforms exist today which have multiple types of memory attached to a
>>> single CPU. These disparate memory ranges have some characteristics in
>>> common, such as CPU cache coherence, but they can have wide ranges of
>>> performance both in terms of latency and bandwidth.
>>
>> Right.
>>
>>> For example, consider a system that contains persistent memory, standard
>>> DDR memory and High Bandwidth Memory (HBM), all attached to the same CPU.
>>> There could potentially be an order of magnitude or more difference in
>>> performance between the slowest and fastest memory attached to that CPU.
>>
>> Right.
>>
>>> With the current Linux code NUMA nodes are CPU-centric, so all the memory
>>> attached to a given CPU will be lumped into the same NUMA node. This makes
>>> it very difficult for userspace applications to understand the performance
>>> of different memory ranges on a given CPU.
>>
>> Right, but that might require fundamental changes to the NUMA
>> representation. Plugging those memory ranges in as separate NUMA nodes,
>> identifying them through sysfs and trying to allocate from them through
>> mbind() seems like a short term solution.
>>
>> Though if we decide to go in this direction, a sysfs interface or
>> something similar is required to enumerate memory properties.
>
> Yep, and this patch series is trying to be the sysfs interface that is
> required to the memory properties. :) It's a certainty that we will have
> memory-only NUMA nodes, at least on platforms that support ACPI. Supporting
> memory-only proximity domains (which Linux turns in to memory-only NUMA nodes)
> is explicitly supported with the introduction of the HMAT in ACPI 6.2.

Yeah, even POWER platforms can have memory-only NUMA nodes.
> It also turns out that the existing memory management code already deals with
> them just fine - you see this with my hmat_examples setup:
>
> https://github.com/rzwisler/hmat_examples
>
> Both configurations created by this repo create memory-only NUMA nodes, even
> with upstream kernels. My patches don't change that, they just provide a
> sysfs representation of the HMAT so users can discover the memory that exists
> in the system.

Once it is a NUMA node, everything will work as is from the MM interface
point of view. But the point is how we export these properties to user
space. My only concern is, let's not do it in a way which will be locked
in without first going through a NUMA redesign for these new attribute
based memory nodes, that's all.

>>> We solve this issue by providing userspace with performance information on
>>> individual memory ranges. This performance information is exposed via
>>> sysfs:
>>>
>>> # grep . mem_tgt2/* mem_tgt2/local_init/* 2>/dev/null
>>> mem_tgt2/firmware_id:1
>>> mem_tgt2/is_cached:0
>>> mem_tgt2/local_init/read_bw_MBps:40960
>>> mem_tgt2/local_init/read_lat_nsec:50
>>> mem_tgt2/local_init/write_bw_MBps:40960
>>> mem_tgt2/local_init/write_lat_nsec:50
>>
>> I might have missed discussions from earlier versions: why do we have
>> this kind of a "source --> target" model ? Will we enlist properties
>> for all possible "source --> target" pairs on the system ? Right now
>> it shows only bandwidth and latency properties; can it accommodate
>> other properties as well in future ?
>
> The initiator/target model is useful in preventing us from needing a
> MAX_NUMA_NODES x MAX_NUMA_NODES sized table for each performance attribute. I
> talked about it a little more here:

That makes it even more complex. Not only do we have a memory attribute
like bandwidth specific to the range, we are also exporting its relative
values as seen from different CPU nodes. It's again kind of a NUMA
distance table being exported in the generic sysfs path like
/sys/devices/.
The problem is that possible future memory attributes like
'reliability', 'density' or 'power consumption' might not have a need
for a "source --> destination" kind of model, as they don't change based
on which CPU node is accessing them.

> https://lists.01.org/pipermail/linux-nvdimm/2017-December/013654.html
>
>>> This allows applications to easily find the memory that they want to use.
>>> We expect that the existing NUMA APIs will be enhanced to use this new
>>> information so that applications can continue to use them to
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On 12/22/2017 10:43 PM, Dave Hansen wrote:
> On 12/21/2017 07:09 PM, Anshuman Khandual wrote:
>> I had presented a proposal for NUMA redesign in the Plumbers Conference this
>> year where various memory devices with different kind of memory attributes
>> can be represented in the kernel and be used explicitly from the user space.
>> Here is the link to the proposal if you feel interested. The proposal is
>> very intrusive and also I dont have a RFC for it yet for discussion here.
>
> I think that's the best reason to "re-use NUMA" for this: it's _not_
> intrusive.
>
> Also, from an x86 perspective, these HMAT systems *will* be out there.
> Old versions of Linux *will* see different types of memory as separate
> NUMA nodes. So, if we are going to do something different, it's going
> to be interesting to un-teach those systems about using the NUMA APIs
> for this. That ship has sailed.

I understand the need to fetch these details from ACPI/DT for
applications to target these distinct memory-only NUMA nodes. This can
be done by parsing platform specific values from the /proc/acpi/ or
/proc/device-tree/ interfaces, which can be a short term solution before
the NUMA redesign can be figured out. But adding generic devices like
"hmat" in the /sys/devices/ path, which will be locked in for good,
seems problematic.
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On 12/22/2017 04:01 PM, Kogut, Jaroslaw wrote:
>> ... first thinking about redesigning the NUMA for
>> heterogeneous memory may not be a good idea. Will look into this further.
>
> I agree with comment that first a direction should be defined how to handle
> heterogeneous memory system.
>
>> https://linuxplumbersconf.org/2017/ocw//system/presentations/4656/original/Hierarchical_NUMA_Design_Plumbers_2017.pdf
>
> I miss in the presentation a user perspective of the new approach, e.g.
> - How does application developer see/understand the heterogeneous memory
>   system?

From the user perspective:

- Each memory node (with or without CPU) is a NUMA node with attributes
- The user should detect these NUMA nodes from sysfs (not part of the
  proposal)
- The user allocates/operates/destroys VMAs with new system calls
  (_mattr based)

> - How does app developer use the heterogeneous memory system?

- Through existing and new system calls

> - What are modification in API/sys interfaces?

- The presentation has a possible addition of new system calls with a
  'u64 _mattr' representation for memory attributes, which can be used
  while requesting different kinds of memory from the kernel

> In other hand, if we assume that separate memory NUMA node has different
> memory capabilities/attributes from stand point of particular CPU, it is easy
> to explain for user how to describe/handle heterogeneous memory.
>
> Of course, current numa design is not sufficient in kernel in following areas
> today:
> - Exposing memory attributes that describe heterogeneous memory system
> - Interfaces to use the heterogeneous memory system, e.g. more sophisticated
>   policies
> - Internal mechanism in memory management, e.g. automigration, maybe
>   something else.
Right, we would need:

- Representation of NUMA with attributes
- APIs/syscalls for accessing the intended memory from user space
- Memory management policies and algorithms navigating through all
  these new attributes in various situations

IMHO, we should not consider sysfs interfaces for heterogeneous memory
(which will be an ABI going forward and hence cannot be changed easily)
before we get the NUMA redesign right.
Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT
On 12/14/2017 07:40 AM, Ross Zwisler wrote:
> Quick Summary
>
> Platforms exist today which have multiple types of memory attached to a
> single CPU. These disparate memory ranges have some characteristics in
> common, such as CPU cache coherence, but they can have wide ranges of
> performance both in terms of latency and bandwidth.

Right.

> For example, consider a system that contains persistent memory, standard
> DDR memory and High Bandwidth Memory (HBM), all attached to the same CPU.
> There could potentially be an order of magnitude or more difference in
> performance between the slowest and fastest memory attached to that CPU.

Right.

> With the current Linux code NUMA nodes are CPU-centric, so all the memory
> attached to a given CPU will be lumped into the same NUMA node. This makes
> it very difficult for userspace applications to understand the performance
> of different memory ranges on a given CPU.

Right, but that might require fundamental changes to the NUMA
representation. Plugging those memory ranges in as separate NUMA nodes,
identifying them through sysfs and trying to allocate from them through
mbind() seems like a short term solution.

Though if we decide to go in this direction, a sysfs interface or
something similar is required to enumerate memory properties.

> We solve this issue by providing userspace with performance information on
> individual memory ranges. This performance information is exposed via
> sysfs:
>
> # grep . mem_tgt2/* mem_tgt2/local_init/* 2>/dev/null
> mem_tgt2/firmware_id:1
> mem_tgt2/is_cached:0
> mem_tgt2/local_init/read_bw_MBps:40960
> mem_tgt2/local_init/read_lat_nsec:50
> mem_tgt2/local_init/write_bw_MBps:40960
> mem_tgt2/local_init/write_lat_nsec:50

I might have missed discussions from earlier versions: why do we have
this kind of a "source --> target" model ? Will we enlist properties for
all possible "source --> target" pairs on the system ?
Right now it shows only bandwidth and latency properties; can it
accommodate other properties as well in future ?

> This allows applications to easily find the memory that they want to use.
> We expect that the existing NUMA APIs will be enhanced to use this new
> information so that applications can continue to use them to select their
> desired memory.

I had presented a proposal for NUMA redesign at the Plumbers Conference
this year, where various memory devices with different kinds of memory
attributes can be represented in the kernel and be used explicitly from
user space. Here is the link to the proposal if you are interested. The
proposal is very intrusive and I don't have an RFC for it yet for
discussion here.

https://linuxplumbersconf.org/2017/ocw//system/presentations/4656/original/Hierarchical_NUMA_Design_Plumbers_2017.pdf

The problem is, designing the sysfs interface for memory attribute
detection from user space without first thinking about redesigning NUMA
for heterogeneous memory may not be a good idea. Will look into this
further.
Re: [PATCH] mm: add ZONE_DEVICE statistics to smaps
On 11/11/2016 03:41 AM, Dan Williams wrote:
> ZONE_DEVICE pages are mapped into a process via the filesystem-dax and
> device-dax mechanisms. There are also proposals to use ZONE_DEVICE
> pages for other usages outside of dax. Add statistics to smaps so
> applications can debug that they are obtaining the mappings they expect,
> or otherwise accounting them.

This might also help when we have a ZONE_DEVICE based solution for HMM
based device memory.