Re: [Qemu-devel] [PATCH v9 2/7] virtio-pmem: Add virtio pmem driver
On 5/16/19 10:35 PM, Pankaj Gupta wrote:
> Can I take it your reviewed/acked-by? or tested-by tag? for the virtio patch :)

I don't feel that I have enough expertise to give the reviewed-by tag, but you can take my acked-by + tested-by.

Acked-by: Jakub Staron
Tested-by: Jakub Staron

No kernel panics/stalls were encountered during testing these patches (v9) with QEMU + xfstests. Some CPU stalls were encountered while testing with crosvm instead of QEMU with xfstests (test generic/464), but there was no repro for QEMU, so the fault may be on the side of crosvm.

The dump for the crosvm/xfstests stall:

[ 2504.175276] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 2504.176681] rcu: 0-...!: (1 GPs behind) idle=9b2/1/0x4000 softirq=1089198/1089202 fqs=0
[ 2504.178270] rcu: 2-...!: (1 ticks this GP) idle=cfe/1/0x4002 softirq=1055108/1055110 fqs=0
[ 2504.179802] rcu: 3-...!: (1 GPs behind) idle=1d6/1/0x4002 softirq=1046798/1046802 fqs=0
[ 2504.181215] rcu: 4-...!: (2 ticks this GP) idle=522/1/0x4002 softirq=1249063/1249064 fqs=0
[ 2504.182625] rcu: 5-...!: (1 GPs behind) idle=6da/1/0x4000 softirq=1131036/1131047 fqs=0
[ 2504.183955] (detected by 3, t=0 jiffies, g=1232529, q=1370)
[ 2504.184762] Sending NMI from CPU 3 to CPUs 0:
[ 2504.186400] NMI backtrace for cpu 0
[ 2504.186401] CPU: 0 PID: 6670 Comm: 464 Not tainted 5.1.0+ #1
[ 2504.186401] Hardware name: ChromiumOS crosvm, BIOS 0
[ 2504.186402] RIP: 0010:queued_spin_lock_slowpath+0x1c/0x1e0
[ 2504.186402] Code: e7 89 c8 f0 44 0f b1 07 39 c1 75 dc f3 c3 0f 1f 44 00 00 ba 01 00 00 00 8b 07 85 c0 75 0a f0 0f b1 17 85 c0 75 f2 f3 c3 f3 90 ec 81 fe 00 01 00 00 0f 84 ab 00 00 00 81 e6 00 ff ff ff 75 44
[ 2504.186403] RSP: 0018:c9003ee8 EFLAGS: 0002
[ 2504.186404] RAX: 0001 RBX: 0246 RCX: 00404044
[ 2504.186404] RDX: 0001 RSI: 0001 RDI: 8244a280
[ 2504.186405] RBP: 8244a280 R08: 000f4200 R09: 024709ed6c32
[ 2504.186405] R10: R11: 0001 R12: 8244a280
[ 2504.186405] R13: 0009 R14: 0009 R15:
[ 2504.186406] FS: () GS:8880cc60() knlGS:
[ 2504.186406] CS: 0010 DS: ES: CR0: 80050033
[ 2504.186406] CR2: 7efd6b0f15d8 CR3: 0260a006 CR4: 00360ef0
[ 2504.186407] DR0: DR1: DR2:
[ 2504.186407] DR3: DR6: fffe0ff0 DR7: 0400
[ 2504.186407] Call Trace:
[ 2504.186408]
[ 2504.186408] _raw_spin_lock_irqsave+0x1d/0x30
[ 2504.186408] rcu_core+0x3b6/0x740
[ 2504.186408] ? __hrtimer_run_queues+0x133/0x280
[ 2504.186409] ? recalibrate_cpu_khz+0x10/0x10
[ 2504.186409] __do_softirq+0xd8/0x2e4
[ 2504.186409] irq_exit+0xa3/0xb0
[ 2504.186410] smp_apic_timer_interrupt+0x67/0x120
[ 2504.186410] apic_timer_interrupt+0xf/0x20
[ 2504.186410]
[ 2504.186410] RIP: 0010:unmap_page_range+0x47a/0x9b0
[ 2504.186411] Code: 0f 46 46 10 49 39 6e 18 49 89 46 10 48 89 e8 49 0f 43 46 18 41 80 4e 20 08 4d 85 c9 49 89 46 18 0f 84 68 ff ff ff 49 8b 51 08 <48> 8d 42 ff 83 e2 01 49 0f 44 c1 f6 40 18 01 75 38 48 ba ff 0f 00
[ 2504.186411] RSP: 0018:c900036cbcc8 EFLAGS: 0282 ORIG_RAX: ff13
[ 2504.186412] RAX: RBX: 80003751d045 RCX: 0001
[ 2504.186413] RDX: ea0002e09288 RSI: 0269b000 RDI: 8880b6525e40
[ 2504.186413] RBP: 0269c000 R08: R09: eadd4740
[ 2504.186413] R10: ea0001755700 R11: 8880cc62d120 R12: 02794000
[ 2504.186414] R13: 0269b000 R14: c900036cbdf0 R15: 8880572434d8
[ 2504.186414] ? unmap_page_range+0x420/0x9b0
[ 2504.186414] ? release_pages+0x175/0x390
[ 2504.186414] unmap_vmas+0x7c/0xe0
[ 2504.186415] exit_mmap+0xa4/0x190
[ 2504.186415] mmput+0x3b/0x100
[ 2504.186415] do_exit+0x276/0xc10
[ 2504.186415] do_group_exit+0x35/0xa0
[ 2504.186415] __x64_sys_exit_group+0xf/0x10
[ 2504.186416] do_syscall_64+0x43/0x120
[ 2504.186416] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 2504.186416] RIP: 0033:0x7efd6ae10618
[ 2504.186416] Code: Bad RIP value.
[ 2504.186417] RSP: 002b:7ffcac9bde38 EFLAGS: 0246 ORIG_RAX: 00e7
[ 2504.186417] RAX: ffda RBX: RCX: 7efd6ae10618
[ 2504.186418] RDX: RSI: 003c RDI:
[ 2504.186418] RBP: 7efd6b0ed8e0 R08: 00e7 R09: ff98
[ 2504.186418] R10: 7ffcac9bddb8 R11: 0246 R12: 7efd6b0ed8e0
[ 2504.186419] R13: 7efd6b0f2c20 R14: 0060 R15: 0070e705
[ 2504.186421] NMI backtrace for cpu 3
[ 2504.226980] CPU: 3 PID: 6596 Comm: xfs_io Not tainted 5.1.0+ #1
[ 2504.227661] Hardware name: ChromiumOS crosvm, BIOS 0
[ 2504.228261] Call Trace:
[ 2504.228552
[v6 0/3] "Hotremove" persistent memory
Changelog:

v6
- A few minor changes and added reviewed-by's.
- Spent time studying the lock ordering issue that was reported by Vishal Verma, but that issue already exists in Linux, and can be reproduced with exactly the same steps with ACPI memory hotplugging.

v5
- Addressed comments from Dan Williams: made remove_memory() return an error code, and use this function from dax.

v4
- Addressed comments from Dave Hansen.

v3
- Addressed comments from David Hildenbrand. Don't release lock_device_hotplug after checking memory status, and rename memblock_offlined_cb() to check_memblock_offlined_cb().

v2
- Dan Williams mentioned that drv->remove() return is ignored by unbind. Unbind always succeeds. Because we cannot guarantee that memory can be offlined from the driver, don't even attempt to do so. Simply check that every section is offlined beforehand and only then proceed with removing dax memory.

---

Recently, support for adding persistent memory to be used like regular RAM was added to Linux. This work extends that functionality to also allow hot removing persistent memory.

We (Microsoft) have an important use case for this functionality. The requirement is for physical machines with a small amount of RAM (~8G) to be able to reboot in a very short period of time (<1s). Yet, there is userland state that is expensive to recreate (~2G). The solution is to boot the machines with 2G reserved for persistent memory. Copy the state, and hotadd the persistent memory so the machine still has all 8G available for runtime. Before reboot, offline and hotremove the 2G of device-dax, copy the memory that needs to be preserved to the pmem0 device, and reboot.

The series of operations looks like this:

1. After boot, restore /dev/pmem0 to a ramdisk to be consumed by apps, and free the ramdisk.
2. Convert raw pmem0 to devdax:
   ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
3. Hotadd to System RAM:
   echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
   echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
   echo online_movable > /sys/devices/system/memoryXXX/state
4. Before reboot, hotremove device-dax memory from System RAM:
   echo offline > /sys/devices/system/memoryXXX/state
   echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
5. Create a raw pmem0 device:
   ndctl create-namespace --mode raw -e namespace0.0 -f
6. Copy the state that was stored by apps to the ramdisk to the pmem device.
7. Do a kexec reboot, or reboot through firmware if firmware does not zero memory in the pmem0 region (these machines have only regular volatile memory). So, to have the pmem0 device, either the memmap kernel parameter is used, or device nodes in the dtb are specified.

Pavel Tatashin (3):
  device-dax: fix memory and resource leak if hotplug fails
  mm/hotplug: make remove_memory() interface useable
  device-dax: "Hotremove" persistent memory that is used like normal RAM

 drivers/dax/dax-private.h      |  2 ++
 drivers/dax/kmem.c             | 46 +---
 include/linux/memory_hotplug.h |  8 +++--
 mm/memory_hotplug.c            | 64 +++---
 4 files changed, 92 insertions(+), 28 deletions(-)

--
2.21.0
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
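The hotadd/hotremove sequence from the cover letter can be sketched as a small script. This is a dry-run sketch, not something from the patches themselves: the `run` wrapper, `DEVDAX_DEV`, and `MEMBLOCK` names are assumptions, and each command is printed instead of executed, since the real sysfs/ndctl steps need devdax hardware and root.

```shell
#!/bin/sh
# Dry-run sketch of the hotadd/hotremove flow described above.
DEVDAX_DEV="dax0.0"      # hypothetical device name
MEMBLOCK="memoryXXX"     # hypothetical memory block for the dax range

run() { echo "+ $*"; }   # swap the echo for eval "$@" on a real machine

# 2. Convert raw pmem0 to devdax
run ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f

# 3. Hotadd to System RAM
run "echo $DEVDAX_DEV > /sys/bus/dax/drivers/device_dax/unbind"
run "echo $DEVDAX_DEV > /sys/bus/dax/drivers/kmem/new_id"
run "echo online_movable > /sys/devices/system/$MEMBLOCK/state"

# 4. Before reboot: offline first, then unbind from kmem
run "echo offline > /sys/devices/system/$MEMBLOCK/state"
run "echo $DEVDAX_DEV > /sys/bus/dax/drivers/kmem/unbind"

# 5. Back to a raw pmem0 device
run ndctl create-namespace --mode raw -e namespace0.0 -f
```

Note that the offline in step 4 must come before the unbind; as patch 3 of the series points out, unbinding first makes the memory unremovable until reboot.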
[v6 1/3] device-dax: fix memory and resource leak if hotplug fails
When the add_memory() function fails, the resource and the memory should be freed.

Fixes: c221c0b0308f ("device-dax: "Hotplug" persistent memory for use like normal RAM")
Signed-off-by: Pavel Tatashin
Reviewed-by: Dave Hansen
---
 drivers/dax/kmem.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a02318c6d28a..4c0131857133 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -66,8 +66,11 @@ int dev_dax_kmem_probe(struct device *dev)
 	new_res->name = dev_name(dev);
 	rc = add_memory(numa_node, new_res->start, resource_size(new_res));
-	if (rc)
+	if (rc) {
+		release_resource(new_res);
+		kfree(new_res);
 		return rc;
+	}
 	return 0;
 }
--
2.21.0
[v6 3/3] device-dax: "Hotremove" persistent memory that is used like normal RAM
Persistent memory can now be used like regular RAM, but currently there is no way to remove this memory until the machine is rebooted. This work expands the functionality to also allow hotremoving previously hotplugged persistent memory, and to recover the device for use for other purposes.

To hotremove persistent memory, the management software must first offline all memory blocks of the dax region, and then unbind it from the device-dax/kmem driver. So, the operations should look like this:

echo offline > /sys/devices/system/memory/memoryN/state
...
echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind

Note: if unbind is done without offlining the memory beforehand, it won't be possible to hotremove dax0.0, and the dax memory is going to be part of System RAM until reboot.

Signed-off-by: Pavel Tatashin
Reviewed-by: David Hildenbrand
---
 drivers/dax/dax-private.h |  2 ++
 drivers/dax/kmem.c        | 41 +++
 2 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index a45612148ca0..999aaf3a29b3 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -53,6 +53,7 @@ struct dax_region {
  * @pgmap - pgmap for memmap setup / lifetime (driver owned)
  * @ref: pgmap reference count (driver owned)
  * @cmp: @ref final put completion (driver owned)
+ * @dax_mem_res: physical address range of hotadded DAX memory
  */
 struct dev_dax {
 	struct dax_region *region;
@@ -62,6 +63,7 @@ struct dev_dax {
 	struct dev_pagemap pgmap;
 	struct percpu_ref ref;
 	struct completion cmp;
+	struct resource *dax_kmem_res;
 };

 static inline struct dev_dax *to_dev_dax(struct device *dev)
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 4c0131857133..3d0a7e702c94 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -71,21 +71,54 @@ int dev_dax_kmem_probe(struct device *dev)
 		kfree(new_res);
 		return rc;
 	}
+	dev_dax->dax_kmem_res = new_res;

 	return 0;
 }

+#ifdef CONFIG_MEMORY_HOTREMOVE
+static int dev_dax_kmem_remove(struct device *dev)
+{
+	struct dev_dax *dev_dax = to_dev_dax(dev);
+	struct resource *res = dev_dax->dax_kmem_res;
+	resource_size_t kmem_start = res->start;
+	resource_size_t kmem_size = resource_size(res);
+	int rc;
+
+	/*
+	 * We have one shot for removing memory, if some memory blocks were not
+	 * offline prior to calling this function remove_memory() will fail, and
+	 * there is no way to hotremove this memory until reboot because device
+	 * unbind will succeed even if we return failure.
+	 */
+	rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size);
+	if (rc) {
+		dev_err(dev,
+			"DAX region %pR cannot be hotremoved until the next reboot\n",
+			res);
+		return rc;
+	}
+
+	/* Release and free dax resources */
+	release_resource(res);
+	kfree(res);
+	dev_dax->dax_kmem_res = NULL;
+
+	return 0;
+}
+#else
 static int dev_dax_kmem_remove(struct device *dev)
 {
 	/*
-	 * Purposely leak the request_mem_region() for the device-dax
-	 * range and return '0' to ->remove() attempts. The removal of
-	 * the device from the driver always succeeds, but the region
-	 * is permanently pinned as reserved by the unreleased
+	 * Without hotremove purposely leak the request_mem_region() for the
+	 * device-dax range and return '0' to ->remove() attempts. The removal
+	 * of the device from the driver always succeeds, but the region is
+	 * permanently pinned as reserved by the unreleased
 	 * request_mem_region().
 	 */
 	return 0;
 }
+#endif /* CONFIG_MEMORY_HOTREMOVE */

 static struct dax_device_driver device_dax_kmem_driver = {
 	.drv = {
--
2.21.0
[v6 2/3] mm/hotplug: make remove_memory() interface useable
As of right now, the remove_memory() interface is inherently broken. It tries to remove memory but panics if some memory is not offline. The problem is that it is impossible to ensure that all memory blocks are offline, as this function also takes lock_device_hotplug, which is required to change memory state via sysfs. So, between calling this function and offlining all memory blocks there is always a window when lock_device_hotplug is released, and therefore there is always a chance for a panic during this window.

Make this interface return an error if memory removal fails. This way it is safe to call this function without panicking the machine, and it also makes it symmetric to add_memory(), which already returns an error.

Signed-off-by: Pavel Tatashin
Reviewed-by: David Hildenbrand
---
 include/linux/memory_hotplug.h |  8 +++--
 mm/memory_hotplug.c            | 64 +++---
 2 files changed, 49 insertions(+), 23 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index ae892eef8b82..988fde33cd7f 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -324,7 +324,7 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
 extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
 extern void try_offline_node(int nid);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
-extern void remove_memory(int nid, u64 start, u64 size);
+extern int remove_memory(int nid, u64 start, u64 size);
 extern void __remove_memory(int nid, u64 start, u64 size);

 #else
@@ -341,7 +341,11 @@ static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 	return -EINVAL;
 }

-static inline void remove_memory(int nid, u64 start, u64 size) {}
+static inline int remove_memory(int nid, u64 start, u64 size)
+{
+	return -EBUSY;
+}
+
 static inline void __remove_memory(int nid, u64 start, u64 size) {}
 #endif /* CONFIG_MEMORY_HOTREMOVE */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 328878b6799d..ace2cc614da4 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1735,9 +1735,10 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
 		endpa = PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1))-1;
 		pr_warn("removing memory fails, because memory [%pa-%pa] is onlined\n",
 			&beginpa, &endpa);
-	}

-	return ret;
+		return -EBUSY;
+	}
+	return 0;
 }

 static int check_cpu_on_node(pg_data_t *pgdat)
@@ -1820,19 +1821,9 @@ static void __release_memory_resource(resource_size_t start,
 	}
 }

-/**
- * remove_memory
- * @nid: the node ID
- * @start: physical address of the region to remove
- * @size: size of the region to remove
- *
- * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
- * and online/offline operations before this call, as required by
- * try_offline_node().
- */
-void __ref __remove_memory(int nid, u64 start, u64 size)
+static int __ref try_remove_memory(int nid, u64 start, u64 size)
 {
-	int ret;
+	int rc = 0;

 	BUG_ON(check_hotplug_memory_range(start, size));

@@ -1840,13 +1831,13 @@ void __ref __remove_memory(int nid, u64 start, u64 size)

 	/*
 	 * All memory blocks must be offlined before removing memory. Check
-	 * whether all memory blocks in question are offline and trigger a BUG()
+	 * whether all memory blocks in question are offline and return error
 	 * if this is not the case.
 	 */
-	ret = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
-				check_memblock_offlined_cb);
-	if (ret)
-		BUG();
+	rc = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
+			       check_memblock_offlined_cb);
+	if (rc)
+		goto done;

 	/* remove memmap entry */
 	firmware_map_remove(start, start + size, "System RAM");
@@ -1858,14 +1849,45 @@ void __ref __remove_memory(int nid, u64 start, u64 size)

 	try_offline_node(nid);

+done:
 	mem_hotplug_done();
+	return rc;
 }

-void remove_memory(int nid, u64 start, u64 size)
+/**
+ * remove_memory
+ * @nid: the node ID
+ * @start: physical address of the region to remove
+ * @size: size of the region to remove
+ *
+ * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
+ * and online/offline operations before this call, as required by
+ * try_offline_node().
+ */
+void __remove_memory(int nid, u64 start, u64 size)
+{
+
+	/*
+	 * trigger BUG() is some memory is not offlined prior to calling this
+	 * function
+	 */
+	if (try_remove_memory(nid, start, size))
+		BUG();
+}
+
+/*
+ * Remove memory if every memory block is offline, otherwise return -EBUSY is
+ * some memory i
Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
On Fri, May 17, 2019 at 10:28:48AM -0700, Dan Williams wrote:
> On Fri, May 17, 2019 at 8:57 AM Kees Cook wrote:
> >
> > On Fri, May 17, 2019 at 08:08:27AM -0700, Dan Williams wrote:
> > > As far as I can see it's mostly check_heap_object() that is the
> > > problem, so I'm open to finding a way to just bypass that sub-routine.
> > > However, as far as I can see none of the other block / filesystem user
> > > copy implementations submit to the hardened checks, like
> > > bio_copy_from_iter(), and iov_iter_copy_from_user_atomic(). So,
> > > either those need to grow additional checks, or the hardened copy
> > > implementation is targeting single object copy use cases, not
> > > necessarily block-I/O. Yes, Kees, please advise.
> >
> > The intention is mainly for copies that haven't had explicit bounds
> > checking already performed on them, yes. Is there something getting
> > checked out of the slab, or is it literally just the overhead of doing
> > the "is this slab?" check that you're seeing?
>
> It's literally the overhead of "is this slab?" since it needs to go
> retrieve the struct page and read that potentially cold cacheline. In
> the case where that page is on memory media that is higher latency
> than DRAM we get the ~37% performance loss that Jeff measured.

Ah-ha! Okay, I understand now; thanks!

> The path is via the filesystem ->write_iter() file operation. In the
> DAX case the filesystem traps that path early, before submitting block
> I/O, and routes it to the dax_iomap_actor() routine. That routine
> validates that the logical file offset is within bounds of the file,
> then it does a sector-to-pfn translation which validates that the
> physical mapping is within bounds of the block device.
>
> It seems dax_iomap_actor() is not a path where we'd be worried about
> needing hardened user copy checks.

I would agree: I think the proposed patch makes sense. :)

--
Kees Cook
Re: [PATCH v4 01/18] kunit: test: add KUnit test runner core
Quoting Brendan Higgins (2019-05-14 15:16:54)
> diff --git a/include/kunit/test.h b/include/kunit/test.h
> new file mode 100644
> index 0..e682ea0e1f9a5
> --- /dev/null
> +++ b/include/kunit/test.h
> @@ -0,0 +1,162 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Base unit test (KUnit) API.
> + *
> + * Copyright (C) 2019, Google LLC.
> + * Author: Brendan Higgins
> + */
> +
> +#ifndef _KUNIT_TEST_H
> +#define _KUNIT_TEST_H
> +
> +#include
> +#include

Is this include used here?

> +
> +struct kunit;
> +
> +/**
> + * struct kunit_case - represents an individual test case.
> + * @run_case: the function representing the actual test case.
> + * @name: the name of the test case.
> + *
> + * A test case is a function with the signature, ``void (*)(struct kunit *)``
> + * that makes expectations (see KUNIT_EXPECT_TRUE()) about code under test. Each
> + * test case is associated with a &struct kunit_module and will be run after the
> + * module's init function and followed by the module's exit function.
> + *
> + * A test case should be static and should only be created with the KUNIT_CASE()
> + * macro; additionally, every array of test cases should be terminated with an
> + * empty test case.
> + *
> + * Example:
> + *
> + * .. code-block:: c
> + *
> + * void add_test_basic(struct kunit *test)
> + * {
> + *     KUNIT_EXPECT_EQ(test, 1, add(1, 0));
> + *     KUNIT_EXPECT_EQ(test, 2, add(1, 1));
> + *     KUNIT_EXPECT_EQ(test, 0, add(-1, 1));
> + *     KUNIT_EXPECT_EQ(test, INT_MAX, add(0, INT_MAX));
> + *     KUNIT_EXPECT_EQ(test, -1, add(INT_MAX, INT_MIN));
> + * }
> + *
> + * static struct kunit_case example_test_cases[] = {
> + *     KUNIT_CASE(add_test_basic),
> + *     {},

Nitpick: Please drop the comma on the sentinel so nobody gets ideas to add another entry after it.

> + * };
> + *
> + */
> +struct kunit_case {
> +     void (*run_case)(struct kunit *test);
> +     const char name[256];

Maybe 256 can be a #define KUNIT_NAME_MAX_LEN? Or it could just be a const char pointer to a literal pool? Are unit tests making up names at runtime?

> +
> +     /* private: internal use only. */
> +     bool success;
> +};
> +
> +/**
> + * KUNIT_CASE - A helper for creating a &struct kunit_case
> + * @test_name: a reference to a test case function.
> + *
> + * Takes a symbol for a function representing a test case and creates a
> + * &struct kunit_case object from it. See the documentation for
> + * &struct kunit_case for an example on how to use it.
> + */
> +#define KUNIT_CASE(test_name) { .run_case = test_name, .name = #test_name }
> +
> +/**
> + * struct kunit_module - describes a related collection of &struct kunit_case s.
> + * @name: the name of the test. Purely informational.
> + * @init: called before every test case.
> + * @exit: called after every test case.
> + * @test_cases: a null terminated array of test cases.
> + *
> + * A kunit_module is a collection of related &struct kunit_case s, such that
> + * @init is called before every test case and @exit is called after every test
> + * case, similar to the notion of a *test fixture* or a *test class* in other
> + * unit testing frameworks like JUnit or Googletest.
> + *
> + * Every &struct kunit_case must be associated with a kunit_module for KUnit to
> + * run it.
> + */
> +struct kunit_module {
> +     const char name[256];
> +     int (*init)(struct kunit *test);
> +     void (*exit)(struct kunit *test);
> +     struct kunit_case *test_cases;

Can this variable be const? Or do we expect test modules to adjust test_cases after the fact?

> +};
> +
> +/**
> + * struct kunit - represents a running instance of a test.
> + * @priv: for user to store arbitrary data. Commonly used to pass data created
> + * in the init function (see &struct kunit_module).
> + *
> + * Used to store information about the current context under which the test is
> + * running. Most of this data is private and should only be accessed indirectly
> + * via public functions; the one exception is @priv which can be used by the
> + * test writer to store arbitrary data.
> + */
> +struct kunit {
> +     void *priv;
> +
> +     /* private: internal use only. */
> +     const char *name; /* Read only after initialization! */
> +     spinlock_t lock; /* Gaurds all mutable test state. */
> +     bool success; /* Protected by lock. */
> +};
> +
> +void kunit_init_test(struct kunit *test, const char *name);
> +
> +int kunit_run_tests(struct kunit_module *module);
> +
> +/**
> + * module_test() - used to register a &struct kunit_module with KUnit.
> + * @module: a statically allocated &struct kunit_module.
> + *
> + * Registers @module with the test framework. See &struct kunit_module for more
> + * information.
> + */
> +#define module_test(module) \
> +     static int module_kunit_init##module(void) \
> +     { \
> +
Re: [PATCH v4 17/18] kernel/sysctl-test: Add null pointer test for sysctl.c:proc_dointvec()
Quoting Brendan Higgins (2019-05-14 15:17:10)
> diff --git a/kernel/sysctl-test.c b/kernel/sysctl-test.c
> new file mode 100644
> index 0..fe0f2bae66085
> --- /dev/null
> +++ b/kernel/sysctl-test.c
> @@ -0,0 +1,293 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * KUnit test of proc sysctl.
> + */
> +
> +#include
> +#include

Is this include used?

> +#include
> +#include

Is this include used?

> +
> +
> +static void sysctl_test_dointvec_happy_single_negative(struct kunit *test)
> +{
> +     struct ctl_table table = {
> +             .procname = "foo",
> +             .data = &test_data.int_0001,
> +             .maxlen = sizeof(int),
> +             .mode = 0644,
> +             .proc_handler = proc_dointvec,
> +             .extra1 = &i_zero,
> +             .extra2 = &i_one_hundred,
> +     };
> +     char input[] = "-9";
> +     size_t len = sizeof(input) - 1;
> +     loff_t pos = 0;
> +
> +     table.data = kunit_kzalloc(test, sizeof(int), GFP_USER);
> +     KUNIT_EXPECT_EQ(test, 0, proc_dointvec(&table, 1, input, &len, &pos));
> +     KUNIT_EXPECT_EQ(test, sizeof(input) - 1, len);
> +     KUNIT_EXPECT_EQ(test, sizeof(input) - 1, pos);
> +     KUNIT_EXPECT_EQ(test, -9, *(int *)table.data);

Is the casting necessary? Or can the macro do a type coercion of the second parameter based on the first type?

> +}
> +
> +static void sysctl_test_dointvec_single_less_int_min(struct kunit *test)
> +{
> +     struct ctl_table table = {
> +             .procname = "foo",
> +             .data = &test_data.int_0001,
> +             .maxlen = sizeof(int),
> +             .mode = 0644,
> +             .proc_handler = proc_dointvec,
> +             .extra1 = &i_zero,
> +             .extra2 = &i_one_hundred,
> +     };
> +     char input[32];
> +     size_t len = sizeof(input) - 1;
> +     loff_t pos = 0;
> +     unsigned long abs_of_less_than_min = (unsigned long)INT_MAX
> +                                          - (INT_MAX + INT_MIN) + 1;
> +
> +     KUNIT_EXPECT_LT(test,
> +                     snprintf(input, sizeof(input), "-%lu",
> +                              abs_of_less_than_min),
> +                     sizeof(input));
> +
> +     table.data = kunit_kzalloc(test, sizeof(int), GFP_USER);
> +     KUNIT_EXPECT_EQ(test, -EINVAL,
> +                     proc_dointvec(&table, 1, input, &len, &pos));
> +     KUNIT_EXPECT_EQ(test, sizeof(input) - 1, len);
> +     KUNIT_EXPECT_EQ(test, 0, *(int *)table.data);
> +}
> +
> +static void sysctl_test_dointvec_single_greater_int_max(struct kunit *test)
> +{
> +     struct ctl_table table = {
> +             .procname = "foo",
> +             .data = &test_data.int_0001,
> +             .maxlen = sizeof(int),
> +             .mode = 0644,
> +             .proc_handler = proc_dointvec,
> +             .extra1 = &i_zero,
> +             .extra2 = &i_one_hundred,
> +     };
> +     char input[32];
> +     size_t len = sizeof(input) - 1;
> +     loff_t pos = 0;
> +     unsigned long greater_than_max = (unsigned long)INT_MAX + 1;
> +
> +     KUNIT_EXPECT_GT(test, greater_than_max, INT_MAX);
> +     KUNIT_EXPECT_LT(test, snprintf(input, sizeof(input), "%lu",
> +                                    greater_than_max),
> +                     sizeof(input));
> +     table.data = kunit_kzalloc(test, sizeof(int), GFP_USER);
> +     KUNIT_EXPECT_EQ(test, -EINVAL,
> +                     proc_dointvec(&table, 1, input, &len, &pos));
> +     KUNIT_EXPECT_EQ(test, sizeof(input) - 1, len);
> +     KUNIT_EXPECT_EQ(test, 0, *(int *)table.data);
> +}
> +
> +static int sysctl_test_init(struct kunit *test)
> +{
> +     return 0;
> +}
> +
> +/*
> + * This is run once after each test case, see the comment on example_test_module
> + * for more information.
> + */
> +static void sysctl_test_exit(struct kunit *test)
> +{
> +}

Can the above two be omitted? If they can be empty sometimes it would be nice to avoid the extra symbols and code by letting them be assigned to NULL in the kunit_module.

> +
> +/*
> + * Here we make a list of all the test cases we want to add to the test module
> + * below.
> + */
> +static struct kunit_case sysctl_test_cases[] = {
> +     /*
> +      * This is a helper to create a test case object from a test case
> +      * function; its exact function is not important to understand how to
> +      * use KUnit, just know that this is how you associate test cases with a
> +      * test module.
> +      */
> +     KUNIT_CASE(sysctl_test_dointvec_null_tbl_data),
> +     KUNIT_CASE(sysctl_test_dointvec_table_maxlen_unset),
> +     KUNIT_CASE(sysctl_test_dointvec_table_len_is_zero),
> +     KUNIT_CASE(sysctl_test_dointvec_table_read_but_position_set),
> +     KUNIT_CA
Re: [PATCH] mm, memory-failure: clarify error message
On Thu, 2019-05-16 at 22:08 -0600, Jane Chu wrote:
> Some user who install SIGBUS handler that does longjmp out
> therefore keeping the process alive is confused by the error
> message
> "[188988.765862] Memory failure: 0x1840200: Killing
> cellsrv:33395 due to hardware memory corruption"
> Slightly modify the error message to improve clarity.
>
> Signed-off-by: Jane Chu
> ---
>  mm/memory-failure.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index fc8b517..14de5e2 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -216,10 +216,9 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
>  	short addr_lsb = tk->size_shift;
>  	int ret;
>
> -	pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory corruption\n",
> -		pfn, t->comm, t->pid);
> -
>  	if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
> +		pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory "
> +			"corruption\n", pfn, t->comm, t->pid);

Minor nit, but the string shouldn't be split over multiple lines to preserve grep-ability. In such a case it is usually considered OK to exceed 80 characters for the line if needed.

>  		ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)tk->addr,
>  				       addr_lsb, current);
>  	} else {
> @@ -229,6 +228,8 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
>  		 * This could cause a loop when the user sets SIGBUS
>  		 * to SIG_IGN, but hopefully no one will do that?
>  		 */
> +		pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to hardware "
> +			"memory corruption\n", pfn, t->comm, t->pid);
>  		ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)tk->addr,
>  				      addr_lsb, t); /* synchronous? */
>  	}
Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable
Hi Dan,

Thank you very much for your review; my comments are below:

On Mon, May 6, 2019 at 2:01 PM Dan Williams wrote:
>
> On Mon, May 6, 2019 at 10:57 AM Dave Hansen wrote:
> >
> > > -static inline void remove_memory(int nid, u64 start, u64 size) {}
> > > +static inline bool remove_memory(int nid, u64 start, u64 size)
> > > +{
> > > +	return -EBUSY;
> > > +}
> >
> > This seems like an appropriate place for a WARN_ONCE(), if someone
> > manages to call remove_memory() with hotplug disabled.

I decided not to do WARN_ONCE(); in all likelihood the compiler will simply optimize this function out, but with WARN_ONCE() some traces of it would remain.

> >
> > BTW, I looked and can't think of a better errno, but -EBUSY probably
> > isn't the best error code, right?

-EBUSY is the only error that is returned in case of error by the real remove_memory(), so I think it is OK to keep it here.

> >
> > > -void remove_memory(int nid, u64 start, u64 size)
> > > +/**
> > > + * remove_memory
> > > + * @nid: the node ID
> > > + * @start: physical address of the region to remove
> > > + * @size: size of the region to remove
> > > + *
> > > + * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> > > + * and online/offline operations before this call, as required by
> > > + * try_offline_node().
> > > + */
> > > +void __remove_memory(int nid, u64 start, u64 size)
> > > {
> > > +
> > > +	/*
> > > +	 * trigger BUG() is some memory is not offlined prior to calling this
> > > +	 * function
> > > +	 */
> > > +	if (try_remove_memory(nid, start, size))
> > > +		BUG();
> > > +}
> >
> > Could we call this remove_offline_memory()? That way, it makes _some_
> > sense why we would BUG() if the memory isn't offline.

It is the second code path, remove_memory(), that actually tries to remove memory and returns failure if it can't. So, I think the current name is OK.

> Please WARN() instead of BUG() because failing to remove memory should
> not be system fatal.

As mentioned earlier, I will keep BUG(), because the existing code does that, and there is no good way for this code path to return on error.

Thank you,
Pavel
Re: [PATCH v4 04/18] kunit: test: add kunit_stream a std::stream like logger
Quoting Brendan Higgins (2019-05-14 15:16:57)
> diff --git a/kunit/kunit-stream.c b/kunit/kunit-stream.c
> new file mode 100644
> index 0..1884f1b550888
> --- /dev/null
> +++ b/kunit/kunit-stream.c
> @@ -0,0 +1,152 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * C++ stream style string formatter and printer used in KUnit for outputting
> + * KUnit messages.
> + *
> + * Copyright (C) 2019, Google LLC.
> + * Author: Brendan Higgins
> + */
> +
> +#include
> +#include
> +#include
> +
> +static const char *kunit_stream_get_level(struct kunit_stream *this)
> +{
> +     unsigned long flags;
> +     const char *level;
> +
> +     spin_lock_irqsave(&this->lock, flags);
> +     level = this->level;
> +     spin_unlock_irqrestore(&this->lock, flags);
> +
> +     return level;

Please remove this whole function and inline it to the one call-site.

> +}
> +
> +void kunit_stream_set_level(struct kunit_stream *this, const char *level)
> +{
> +     unsigned long flags;
> +
> +     spin_lock_irqsave(&this->lock, flags);
> +     this->level = level;
> +     spin_unlock_irqrestore(&this->lock, flags);

I don't get the locking here. What are we protecting against? Are tests running in parallel using the same kunit_stream? If so, why is the level changeable in one call and then adding strings is done in a different function call? It would make sense to combine the level setting and string adding so that it's one atomic operation if it's truly a parallel operation, or remove the locking entirely.

> +}
> +
> +void kunit_stream_add(struct kunit_stream *this, const char *fmt, ...)
> +{
> +     va_list args;
> +     struct string_stream *stream = this->internal_stream;
> +
> +     va_start(args, fmt);
> +
> +     if (string_stream_vadd(stream, fmt, args) < 0)
> +             kunit_err(this->test, "Failed to allocate fragment: %s\n", fmt);
> +
> +     va_end(args);
> +}
> +
> +void kunit_stream_append(struct kunit_stream *this,
> +                        struct kunit_stream *other)
> +{
> +     struct string_stream *other_stream = other->internal_stream;
> +     const char *other_content;
> +
> +     other_content = string_stream_get_string(other_stream);
> +
> +     if (!other_content) {
> +             kunit_err(this->test,
> +                       "Failed to get string from second argument for appending.\n");
> +             return;
> +     }
> +
> +     kunit_stream_add(this, other_content);
> +}
> +
> +void kunit_stream_clear(struct kunit_stream *this)
> +{
> +     string_stream_clear(this->internal_stream);
> +}
> +
> +void kunit_stream_commit(struct kunit_stream *this)

Should this rather be called kunit_stream_flush()?

> +{
> +     struct string_stream *stream = this->internal_stream;
> +     struct string_stream_fragment *fragment;
> +     const char *level;
> +     char *buf;
> +
> +     level = kunit_stream_get_level(this);
> +     if (!level) {
> +             kunit_err(this->test,
> +                       "Stream was committed without a specified log level.\n");

Drop the full-stop?

> +             level = KERN_ERR;
> +             kunit_stream_set_level(this, level);
> +     }
> +
> +     buf = string_stream_get_string(stream);
> +     if (!buf) {
> +             kunit_err(this->test,

Can you grow a local variable for 'this->test'? It's used many times. Also, 'this' is not very kernel idiomatic. We usually name variables by their type instead of 'this', which is a keyword in other languages. Perhaps it could be named 'kstream'?

> +                       "Could not allocate buffer, dumping stream:\n");
> +             list_for_each_entry(fragment, &stream->fragments, node) {
> +                     kunit_err(this->test, fragment->fragment);
> +             }
> +             kunit_err(this->test, "\n");
> +             goto cleanup;
> +     }
> +
> +     kunit_printk(level, this->test, buf);
> +     kfree(buf);
> +
> +cleanup:
> +     kunit_stream_clear(this);
> +}
> +
> +static int kunit_stream_init(struct kunit_resource *res, void *context)
> +{
> +     struct kunit *test = context;
> +     struct kunit_stream *stream;
> +
> +     stream = kzalloc(sizeof(*stream), GFP_KERNEL);

Of course, here it's called 'stream', so maybe it should be 'kstream' here too.

> +     if (!stream)
> +             return -ENOMEM;
> +
> +     res->allocation = stream;
> +     stream->test = test;
> +     spin_lock_init(&stream->lock);
> +     stream->internal_stream = new_string_stream();

Can new_string_stream() be renamed to alloc_string_stream()? Sorry, I just see so many C++isms in here that it's hard to read from the kernel developer perspective.

> +
> +     if (!stream->internal_stream) {

Nitpick: Please join this to the "allocation" event above instead of keeping it separated.

> +             kfree(stream);
> +             return -ENOMEM;
> +     }
> +
> +     retur
Re: [PATCH v4 03/18] kunit: test: add string_stream a std::stream like string builder
Quoting Brendan Higgins (2019-05-14 15:16:56) > A number of test features need to do pretty complicated string printing > where it may not be possible to rely on a single preallocated string > with parameters. > > So provide a library for constructing the string as you go similar to > C++'s std::string. > > Signed-off-by: Brendan Higgins > Reviewed-by: Greg Kroah-Hartman > Reviewed-by: Logan Gunthorpe Is there any reason why we can't use the seq_file API for this? These both share a similar goal: formatting strings into a buffer to be read later. Maybe some new APIs would be needed to extract the buffer differently, but I hope we could share the code. If it can't be used, can you please add the reasoning to the commit text here?
Re: [ndctl PATCH v3 00/10] daxctl: add a new reconfigure-device command
> Hi Pavel, > > I've still not been able to hit this in my testing, is it something you > hit only after applying these patches? i.e. does plain v65 work? Yes, plain v65 works, but with these patches I see this error. I use buildroot to build an initramfs with ndctl. Here is what ndctl.mk looks like: NDCTL_VERSION = ed17fd1 NDCTL_SITE = $(call github,pmem,ndctl,$(NDCTL_VERSION)) NDCTL_LICENSE = GNU Lesser General Public License v2.1 NDCTL_LICENSE_FILES = COPYING #NDCTL_AUTORECONF = YES define NDCTL_RUN_AUTOGEN cd $(@D) && PATH=$(BR_PATH) ./autogen.sh endef NDCTL_PRE_CONFIGURE_HOOKS += NDCTL_RUN_AUTOGEN NDCTL_LDFLAGS = $(TARGET_LDFLAGS) NDCTL_CONF_OPTS = \ --without-bash \ --without-keyutils \ --without-systemd $(eval $(autotools-package)) This version works, but when I change: NDCTL_VERSION = 9ee82c8 I get the add_dev error. Pasha
Re: [ndctl PATCH v3 00/10] daxctl: add a new reconfigure-device command
On Fri, 2019-05-17 at 11:30 -0400, Pavel Tatashin wrote: > On Thu, May 16, 2019 at 6:40 PM Vishal Verma > wrote: > > Changes in v3: > > - In daxctl_dev_get_mode(), remove the subsystem warning, detect > > dax-class > >and simply make it return devdax > > Hi Vishal, > > I am still getting the same error as before: > > # ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f > [ 141.525873] dax0.0 initialised, 524288 pages in 7ms > libdaxctl: __sysfs_device_parse: dax0.0: add_dev() failed > ... > > I am building it via buildroot, and can share the initramfs file or > anything that can help you with fixing this issue. Hi Pavel, I've still not been able to hit this in my testing, is it something you hit only after applying these patches? i.e. does plain v65 work?
Re: NULL pointer dereference during memory hotremove
On Fri, May 17, 2019 at 1:24 PM Pavel Tatashin wrote: > > On Fri, May 17, 2019 at 1:22 PM Pavel Tatashin > wrote: > > > > On Fri, May 17, 2019 at 10:38 AM Michal Hocko wrote: > > > > > > On Fri 17-05-19 10:20:38, Pavel Tatashin wrote: > > > > This panic is unrelated to circular lock issue that I reported in a > > > > separate thread, that also happens during memory hotremove. > > > > > > > > xakep ~/x/linux$ git describe > > > > v5.1-12317-ga6a4b66bd8f4 > > > > > > Does this happen on 5.0 as well? > > > > Yes, just reproduced it on 5.0 as well. Unfortunately, I do not have a > > script, and have to do it manually, also it does not happen every > > time, it happened on 3rd time for me. > > Actually, sorry, I have not tested 5.0, I compiled 5.0, but my script > still tested v5.1-12317-ga6a4b66bd8f4 build. I will report later if I > am able to reproduce it on 5.0. OK, confirmed on 5.0 as well, took 4 tries to reproduce: (qemu) [ 17.104486] Offlined Pages 32768 [ 17.105543] Built 1 zonelists, mobility grouping on. 
Total pages: 1515892 [ 17.106475] Policy zone: Normal [ 17.107029] BUG: unable to handle kernel NULL pointer dereference at 0698 [ 17.107645] #PF error: [normal kernel read fault] [ 17.108038] PGD 0 P4D 0 [ 17.108287] Oops: [#1] SMP PTI [ 17.108557] CPU: 5 PID: 313 Comm: kworker/u16:5 Not tainted 5.0.0_pt_pmem1 #2 [ 17.109128] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014 [ 17.109910] Workqueue: kacpi_hotplug acpi_hotplug_work_fn [ 17.110323] RIP: 0010:__remove_pages+0x2f/0x520 [ 17.110674] Code: 41 56 41 55 49 89 fd 41 54 55 53 48 89 d3 48 83 ec 68 48 89 4c 24 08 65 48 8b 04 25 28 00 00 00 48 89 44 24 60 31 c0 48 89 f8 <48> 2b 47 58 48 3d 00 19 00 00 0f 85 7f 03 00 00 48 85 c9 0f 84 df [ 17.112114] RSP: 0018:b43b815f3ca8 EFLAGS: 00010246 [ 17.112518] RAX: 0640 RBX: 0004 RCX: [ 17.113073] RDX: 0004 RSI: 0024 RDI: 0640 [ 17.113615] RBP: 00024000 R08: R09: 4000 [ 17.114186] R10: 4000 R11: 00024000 R12: e382c900 [ 17.114743] R13: 0640 R14: 0004 R15: 0024 [ 17.115288] FS: () GS:979539b4() knlGS: [ 17.115911] CS: 0010 DS: ES: CR0: 80050033 [ 17.116356] CR2: 0698 CR3: 000133c22004 CR4: 00360ee0 [ 17.116913] DR0: DR1: DR2: [ 17.117467] DR3: DR6: fffe0ff0 DR7: 0400 [ 17.118016] Call Trace: [ 17.118214] ? memblock_isolate_range+0xc4/0x139 [ 17.118570] ? firmware_map_remove+0x48/0x90 [ 17.118908] arch_remove_memory+0x7b/0xc0 [ 17.119216] __remove_memory+0x93/0xc0 [ 17.119528] acpi_memory_device_remove+0x67/0xe0 [ 17.119890] acpi_bus_trim+0x50/0x90 [ 17.120167] acpi_device_hotplug+0x2fc/0x460 [ 17.120498] acpi_hotplug_work_fn+0x15/0x20 [ 17.120834] process_one_work+0x2a0/0x650 [ 17.121146] worker_thread+0x34/0x3d0 [ 17.121432] ? process_one_work+0x650/0x650 [ 17.121772] kthread+0x118/0x130 [ 17.122032] ? 
kthread_create_on_node+0x60/0x60 [ 17.122413] ret_from_fork+0x3a/0x50 [ 17.122727] Modules linked in: [ 17.122983] CR2: 0698 [ 17.123250] ---[ end trace 389c4034f6d42e6f ]--- [ 17.123618] RIP: 0010:__remove_pages+0x2f/0x520 [ 17.123979] Code: 41 56 41 55 49 89 fd 41 54 55 53 48 89 d3 48 83 ec 68 48 89 4c 24 08 65 48 8b 04 25 28 00 00 00 48 89 44 24 60 31 c0 48 89 f8 <48> 2b 47 58 48 3d 00 19 00 00 0f 85 7f 03 00 00 48 85 c9 0f 84 df [ 17.125410] RSP: 0018:b43b815f3ca8 EFLAGS: 00010246 [ 17.125818] RAX: 0640 RBX: 0004 RCX: [ 17.126359] RDX: 0004 RSI: 0024 RDI: 0640 [ 17.126906] RBP: 00024000 R08: R09: 4000 [ 17.127453] R10: 4000 R11: 00024000 R12: e382c900 [ 17.128008] R13: 0640 R14: 0004 R15: 0024 [ 17.128555] FS: () GS:979539b4() knlGS: [ 17.129182] CS: 0010 DS: ES: CR0: 80050033 [ 17.129627] CR2: 0698 CR3: 000133c22004 CR4: 00360ee0 [ 17.130182] DR0: DR1: DR2: [ 17.130744] DR3: DR6: fffe0ff0 DR7: 0400 [ 17.131293] BUG: sleeping function called from invalid context at include/linux/percpu-rwsem.h:34 [ 17.132050] in_atomic(): 0, irqs_disabled(): 1, pid: 313, name: kworker/u16:5 [ 17.132596] INFO: lockdep is turned off. [ 17.132908] irq event stamp: 14046 [ 17.133175] hardirqs last enabled at (14045): [] kfree+0xba/0x230 [ 17.133777] hardirqs last disabled at (14046): [] trace_hardi
Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
On Fri, May 17, 2019 at 8:57 AM Kees Cook wrote: > > On Fri, May 17, 2019 at 08:08:27AM -0700, Dan Williams wrote: > > As far as I can see it's mostly check_heap_object() that is the > > problem, so I'm open to finding a way to just bypass that sub-routine. > > However, as far as I can see none of the other block / filesystem user > > copy implementations submit to the hardened checks, like > > bio_copy_from_iter(), and iov_iter_copy_from_user_atomic() . So, > > either those need to grow additional checks, or the hardened copy > > implementation is targeting single object copy use cases, not > > necessarily block-I/O. Yes, Kees, please advise. > > The intention is mainly for copies that haven't had explicit bounds > checking already performed on them, yes. Is there something getting > checked out of the slab, or is it literally just the overhead of doing > the "is this slab?" check that you're seeing? It's literally the overhead of "is this slab?" since it needs to go retrieve the struct page and read that potentially cold cacheline. In the case where that page is on memory media that is higher latency than DRAM we get the ~37% performance loss that Jeff measured. The path is via the filesystem ->write_iter() file operation. In the DAX case the filesystem traps that path early, before submitting block I/O, and routes it to the dax_iomap_actor() routine. That routine validates that the logical file offset is within bounds of the file, then it does a sector-to-pfn translation which validates that the physical mapping is within bounds of the block device. It seems dax_iomap_actor() is not a path where we'd be worried about needing hardened user copy checks.
Re: NULL pointer dereference during memory hotremove
On Fri, May 17, 2019 at 1:22 PM Pavel Tatashin wrote: > > On Fri, May 17, 2019 at 10:38 AM Michal Hocko wrote: > > > > On Fri 17-05-19 10:20:38, Pavel Tatashin wrote: > > > This panic is unrelated to circular lock issue that I reported in a > > > separate thread, that also happens during memory hotremove. > > > > > > xakep ~/x/linux$ git describe > > > v5.1-12317-ga6a4b66bd8f4 > > > > Does this happen on 5.0 as well? > > Yes, just reproduced it on 5.0 as well. Unfortunately, I do not have a > script, and have to do it manually, also it does not happen every > time, it happened on 3rd time for me. Actually, sorry, I have not tested 5.0, I compiled 5.0, but my script still tested v5.1-12317-ga6a4b66bd8f4 build. I will report later if I am able to reproduce it on 5.0. Pasha
Re: NULL pointer dereference during memory hotremove
On Fri, May 17, 2019 at 10:38 AM Michal Hocko wrote: > > On Fri 17-05-19 10:20:38, Pavel Tatashin wrote: > > This panic is unrelated to circular lock issue that I reported in a > > separate thread, that also happens during memory hotremove. > > > > xakep ~/x/linux$ git describe > > v5.1-12317-ga6a4b66bd8f4 > > Does this happen on 5.0 as well? Yes, just reproduced it on 5.0 as well. Unfortunately, I do not have a script, and have to do it manually, also it does not happen every time, it happened on 3rd time for me. Pasha
Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
On Fri, May 17, 2019 at 04:14:03PM +, David Laight wrote: > From: Kees Cook > > Sent: 17 May 2019 16:54 > ... > > > I've changed some of our code to use __get_user() to avoid > > > these stupid overheads. > > > > __get_user() skips even access_ok() checking too, so that doesn't seem > > like a good idea. Did you run access_ok() checks separately? (This > > generally isn't recommended.) > > Of course, I'm not THAT stupid :-) Right, yes, I know. :) I just wanted to double-check since accidents can happen. The number of underscores on these functions is not really a great way to indicate what they're doing. ;) -- Kees Cook
RE: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
From: Kees Cook > Sent: 17 May 2019 16:54 ... > > I've changed some of our code to use __get_user() to avoid > > these stupid overheads. > > __get_user() skips even access_ok() checking too, so that doesn't seem > like a good idea. Did you run access_ok() checks separately? (This > generally isn't recommended.) Of course, I'm not THAT stupid :-) David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
On Fri, May 17, 2019 at 08:08:27AM -0700, Dan Williams wrote: > As far as I can see it's mostly check_heap_object() that is the > problem, so I'm open to finding a way to just bypass that sub-routine. > However, as far as I can see none of the other block / filesystem user > copy implementations submit to the hardened checks, like > bio_copy_from_iter(), and iov_iter_copy_from_user_atomic() . So, > either those need to grow additional checks, or the hardened copy > implementation is targeting single object copy use cases, not > necessarily block-I/O. Yes, Kees, please advise. The intention is mainly for copies that haven't had explicit bounds checking already performed on them, yes. Is there something getting checked out of the slab, or is it literally just the overhead of doing the "is this slab?" check that you're seeing? -- Kees Cook
Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
On Fri, May 17, 2019 at 09:06:26AM +, David Laight wrote: > From: Jan Kara > > Sent: 17 May 2019 09:48 > ... > > So this last paragraph is not obvious to me as check_copy_size() does a lot > > of various checks in CONFIG_HARDENED_USERCOPY case. I agree that some of > > those checks don't make sense for PMEM pages but I'd rather handle that by > > refining check_copy_size() and check_object_size() functions to detect and > > appropriately handle pmem pages rather that generally skip all the checks > > in pmem_copy_from/to_iter(). And yes, every check in such hot path is going > > to cost performance but that's what user asked for with > > CONFIG_HARDENED_USERCOPY... Kees? > > Except that all the distros enable it by default. > So you get the performance cost whether you (as a user) want it or not. Note that it can be disabled on the kernel command line, but that seems like a last resort. :) > > I've changed some of our code to use __get_user() to avoid > these stupid overheads. __get_user() skips even access_ok() checking too, so that doesn't seem like a good idea. Did you run access_ok() checks separately? (This generally isn't recommended.) The usercopy hardening is intended to only kick in when the copy size isn't a builtin constant -- it's attempting to do a bounds check on the size, with the pointer used to figure out what bounds checking is possible (basically "is this stack? check stack location/frame size" or "is this kmem cache? check the allocation size") and then do bounds checks from there. It tries to bail out early to avoid needless checking, so if there is some additional logic to be added to check_object_size() that is globally applicable, sure, let's do it. I'm still not clear from this thread about the case that is getting tripped/slowed? If you're already doing bounds checks somewhere else and there isn't a chance for the pointer or size to change, then yeah, it seems safe to skip the usercopy size checks. 
But what's the actual code that is having a problem? -- Kees Cook
Re: [ndctl PATCH v3 00/10] daxctl: add a new reconfigure-device command
On Thu, May 16, 2019 at 6:40 PM Vishal Verma wrote: > > Changes in v3: > - In daxctl_dev_get_mode(), remove the subsystem warning, detect dax-class >and simply make it return devdax Hi Vishal, I am still getting the same error as before: # ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f [ 141.525873] dax0.0 initialised, 524288 pages in 7ms libdaxctl: __sysfs_device_parse: dax0.0: add_dev() failed ... I am building it via buildroot, and can share the initramfs file or anything that can help you with fixing this issue. Thank you, Pavel
Re: [PATCH] mm/nvdimm: Pick the right alignment default when creating dax devices
On 5/17/19 8:19 PM, Vaibhav Jain wrote: Hi Aneesh, Apart from a minor review comment for changes to nd_pfn_validate() the patch looks good to me. Also, I Tested this patch on a PPC64 qemu guest with virtual nvdimm and verified that default alignment of newly created devdax namespace was 64KiB instead of 16MiB. Below are the test results: * Without the patch creating a devdax namespace results in namespace with 16MiB default alignment. Using daxio to zero out the dax device results in a SIGBUS and a hashing failure. $ sudo ndctl create-namespace --mode=devdax | grep align "align":16777216, "align":16777216 $ sudo cat /sys/devices/ndbus0/region0/dax0.0/supported_alignments 65536 16777216 $ sudo daxio.static-debug -z -o /dev/dax0.0 Bus error (core dumped) $ dmesg | tail [ 438.738958] lpar: Failed hash pte insert with error -4 [ 438.739412] hash-mmu: mm: Hashing failure ! EA=0x7fff1700 access=0x8006 current=daxio [ 438.739760] hash-mmu: trap=0x300 vsid=0x22cb7a3 ssize=1 base psize=2 psize 10 pte=0xc00501002b86 [ 438.740143] daxio[3860]: bus error (7) at 7fff1700 nip 7fff973c007c lr 7fff973bff34 code 2 in libpmem.so.1.0.0[7fff973b+2] [ 438.740634] daxio[3860]: code: 792945e4 7d494b78 e95f0098 7d494b78 f93f00a0 4800012c e93f0088 f93f0120 [ 438.741015] daxio[3860]: code: e93f00a0 f93f0128 e93f0120 e95f0128 e93f0088 39290008 f93f0110 * With the patch creating a devdax namespace results in namespace with 64KiB default alignment. Using daxio to zero out the dax device succeeds: $ sudo ndctl create-namespace --mode=devdax | grep align "align":65536, "align":65536 $ sudo cat /sys/devices/ndbus0/region0/dax0.0/supported_alignments 65536 $ daxio -z -o /dev/dax0.0 daxio: copied 2130706432 bytes to device "/dev/dax0.0" Hence, Tested-by: Vaibhav Jain "Aneesh Kumar K.V" writes: Allow arch to provide the supported alignments and use hugepage alignment only if we support hugepage. Right now we depend on compile time configs whereas this patch switch this to runtime discovery. 
Architectures like ppc64 can have THP enabled in code, but then can have hugepage size disabled by the hypervisor. This allows us to create dax devices with PAGE_SIZE alignment in this case. Existing dax namespace with alignment larger than PAGE_SIZE will fail to initialize in this specific case. We still allow fsdax namespace initialization. With respect to identifying whether to enable hugepage fault for a dax device, if THP is enabled during compile, we default to taking hugepage fault and in dax fault handler if we find the fault size > alignment we retry with PAGE_SIZE fault size. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/libnvdimm.h | 9 arch/powerpc/mm/Makefile | 1 + arch/powerpc/mm/nvdimm.c | 34 arch/x86/include/asm/libnvdimm.h | 19 drivers/nvdimm/nd.h | 6 - drivers/nvdimm/pfn_devs.c| 32 +- include/linux/huge_mm.h | 7 +- 7 files changed, 100 insertions(+), 8 deletions(-) create mode 100644 arch/powerpc/include/asm/libnvdimm.h create mode 100644 arch/powerpc/mm/nvdimm.c create mode 100644 arch/x86/include/asm/libnvdimm.h diff --git a/arch/powerpc/include/asm/libnvdimm.h b/arch/powerpc/include/asm/libnvdimm.h new file mode 100644 index ..d35fd7f48603 --- /dev/null +++ b/arch/powerpc/include/asm/libnvdimm.h @@ -0,0 +1,9 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_POWERPC_LIBNVDIMM_H +#define _ASM_POWERPC_LIBNVDIMM_H + +#define nd_pfn_supported_alignments nd_pfn_supported_alignments +extern unsigned long *nd_pfn_supported_alignments(void); +extern unsigned long nd_pfn_default_alignment(void); + +#endif diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile index 0f499db315d6..42e4a399ba5d 100644 --- a/arch/powerpc/mm/Makefile +++ b/arch/powerpc/mm/Makefile @@ -20,3 +20,4 @@ obj-$(CONFIG_HIGHMEM) += highmem.o obj-$(CONFIG_PPC_COPRO_BASE) += copro_fault.o obj-$(CONFIG_PPC_PTDUMP) += ptdump/ obj-$(CONFIG_KASAN) += kasan/ +obj-$(CONFIG_NVDIMM_PFN) += nvdimm.o diff --git a/arch/powerpc/mm/nvdimm.c b/arch/powerpc/mm/nvdimm.c 
new file mode 100644 index ..a29a4510715e --- /dev/null +++ b/arch/powerpc/mm/nvdimm.c @@ -0,0 +1,34 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include + +#include +/* + * We support only pte and pmd mappings for now. + */ +const unsigned long *nd_pfn_supported_alignments(void) +{ + static unsigned long supported_alignments[3]; + + supported_alignments[0] = PAGE_SIZE; + + if (has_transparent_hugepage()) + supported_alignments[1] = HPAGE_PMD_SIZE; + else + supported_alignments[1] = 0; + + supported_alignments[2] = 0; +
Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
On Fri, May 17, 2019 at 1:47 AM Jan Kara wrote: > > Let's add Kees to CC for usercopy expertise... > > On Thu 16-05-19 17:33:38, Dan Williams wrote: > > Jeff discovered that performance improves from ~375K iops to ~519K iops > > on a simple psync-write fio workload when moving the location of 'struct > > page' from the default PMEM location to DRAM. This result is surprising > > because the expectation is that 'struct page' for dax is only needed for > > third party references to dax mappings. For example, a dax-mapped buffer > > passed to another system call for direct-I/O requires 'struct page' for > > sending the request down the driver stack and pinning the page. There is > > no usage of 'struct page' for first party access to a file via > > read(2)/write(2) and friends. > > > > However, this "no page needed" expectation is violated by > > CONFIG_HARDENED_USERCOPY and the check_copy_size() performed in > > copy_from_iter_full_nocache() and copy_to_iter_mcsafe(). The > > check_heap_object() helper routine assumes the buffer is backed by a > > page-allocator DRAM page and applies some checks. Those checks are > > invalid, dax pages are not from the heap, and redundant, > > dax_iomap_actor() has already validated that the I/O is within bounds. > > So this last paragraph is not obvious to me as check_copy_size() does a lot > of various checks in CONFIG_HARDENED_USERCOPY case. I agree that some of > those checks don't make sense for PMEM pages but I'd rather handle that by > refining check_copy_size() and check_object_size() functions to detect and > appropriately handle pmem pages rather that generally skip all the checks > in pmem_copy_from/to_iter(). And yes, every check in such hot path is going > to cost performance but that's what user asked for with > CONFIG_HARDENED_USERCOPY... Kees? As far as I can see it's mostly check_heap_object() that is the problem, so I'm open to finding a way to just bypass that sub-routine. 
However, as far as I can see none of the other block / filesystem user copy implementations submit to the hardened checks, like bio_copy_from_iter(), and iov_iter_copy_from_user_atomic() . So, either those need to grow additional checks, or the hardened copy implementation is targeting single object copy use cases, not necessarily block-I/O. Yes, Kees, please advise.
Re: NULL pointer dereference during memory hotremove
On 17.05.19 16:38, Michal Hocko wrote: > On Fri 17-05-19 10:20:38, Pavel Tatashin wrote: >> This panic is unrelated to circular lock issue that I reported in a >> separate thread, that also happens during memory hotremove. >> >> xakep ~/x/linux$ git describe >> v5.1-12317-ga6a4b66bd8f4 > > Does this happen on 5.0 as well? > We have on the list [PATCH V3 1/4] mm/hotplug: Reorder arch_remove_memory() call in __remove_memory() Can that help? -- Thanks, David / dhildenb
Re: [PATCH] mm/nvdimm: Pick the right alignment default when creating dax devices
Hi Aneesh, Apart from a minor review comment for changes to nd_pfn_validate() the patch looks good to me. Also, I Tested this patch on a PPC64 qemu guest with virtual nvdimm and verified that default alignment of newly created devdax namespace was 64KiB instead of 16MiB. Below are the test results: * Without the patch creating a devdax namespace results in namespace with 16MiB default alignment. Using daxio to zero out the dax device results in a SIGBUS and a hashing failure. $ sudo ndctl create-namespace --mode=devdax | grep align "align":16777216, "align":16777216 $ sudo cat /sys/devices/ndbus0/region0/dax0.0/supported_alignments 65536 16777216 $ sudo daxio.static-debug -z -o /dev/dax0.0 Bus error (core dumped) $ dmesg | tail [ 438.738958] lpar: Failed hash pte insert with error -4 [ 438.739412] hash-mmu: mm: Hashing failure ! EA=0x7fff1700 access=0x8006 current=daxio [ 438.739760] hash-mmu: trap=0x300 vsid=0x22cb7a3 ssize=1 base psize=2 psize 10 pte=0xc00501002b86 [ 438.740143] daxio[3860]: bus error (7) at 7fff1700 nip 7fff973c007c lr 7fff973bff34 code 2 in libpmem.so.1.0.0[7fff973b+2] [ 438.740634] daxio[3860]: code: 792945e4 7d494b78 e95f0098 7d494b78 f93f00a0 4800012c e93f0088 f93f0120 [ 438.741015] daxio[3860]: code: e93f00a0 f93f0128 e93f0120 e95f0128 e93f0088 39290008 f93f0110 * With the patch creating a devdax namespace results in namespace with 64KiB default alignment. Using daxio to zero out the dax device succeeds: $ sudo ndctl create-namespace --mode=devdax | grep align "align":65536, "align":65536 $ sudo cat /sys/devices/ndbus0/region0/dax0.0/supported_alignments 65536 $ daxio -z -o /dev/dax0.0 daxio: copied 2130706432 bytes to device "/dev/dax0.0" Hence, Tested-by: Vaibhav Jain "Aneesh Kumar K.V" writes: > Allow arch to provide the supported alignments and use hugepage alignment only > if we support hugepage. Right now we depend on compile time configs whereas > this > patch switch this to runtime discovery. 
> > Architectures like ppc64 can have THP enabled in code, but then can have > hugepage size disabled by the hypervisor. This allows us to create dax devices > with PAGE_SIZE alignment in this case. > > Existing dax namespace with alignment larger than PAGE_SIZE will fail to > initialize in this specific case. We still allow fsdax namespace > initialization. > > With respect to identifying whether to enable hugepage fault for a dax device, > if THP is enabled during compile, we default to taking hugepage fault and in > dax > fault handler if we find the fault size > alignment we retry with PAGE_SIZE > fault size. > > Signed-off-by: Aneesh Kumar K.V > --- > arch/powerpc/include/asm/libnvdimm.h | 9 > arch/powerpc/mm/Makefile | 1 + > arch/powerpc/mm/nvdimm.c | 34 > arch/x86/include/asm/libnvdimm.h | 19 > drivers/nvdimm/nd.h | 6 - > drivers/nvdimm/pfn_devs.c| 32 +- > include/linux/huge_mm.h | 7 +- > 7 files changed, 100 insertions(+), 8 deletions(-) > create mode 100644 arch/powerpc/include/asm/libnvdimm.h > create mode 100644 arch/powerpc/mm/nvdimm.c > create mode 100644 arch/x86/include/asm/libnvdimm.h > > diff --git a/arch/powerpc/include/asm/libnvdimm.h > b/arch/powerpc/include/asm/libnvdimm.h > new file mode 100644 > index ..d35fd7f48603 > --- /dev/null > +++ b/arch/powerpc/include/asm/libnvdimm.h > @@ -0,0 +1,9 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +#ifndef _ASM_POWERPC_LIBNVDIMM_H > +#define _ASM_POWERPC_LIBNVDIMM_H > + > +#define nd_pfn_supported_alignments nd_pfn_supported_alignments > +extern unsigned long *nd_pfn_supported_alignments(void); > +extern unsigned long nd_pfn_default_alignment(void); > + > +#endif > diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile > index 0f499db315d6..42e4a399ba5d 100644 > --- a/arch/powerpc/mm/Makefile > +++ b/arch/powerpc/mm/Makefile > @@ -20,3 +20,4 @@ obj-$(CONFIG_HIGHMEM) += highmem.o > obj-$(CONFIG_PPC_COPRO_BASE) += copro_fault.o > obj-$(CONFIG_PPC_PTDUMP) += ptdump/ > obj-$(CONFIG_KASAN) += 
kasan/ > +obj-$(CONFIG_NVDIMM_PFN) += nvdimm.o > diff --git a/arch/powerpc/mm/nvdimm.c b/arch/powerpc/mm/nvdimm.c > new file mode 100644 > index ..a29a4510715e > --- /dev/null > +++ b/arch/powerpc/mm/nvdimm.c > @@ -0,0 +1,34 @@ > +// SPDX-License-Identifier: GPL-2.0 > + > +#include > +#include > + > +#include > +/* > + * We support only pte and pmd mappings for now. > + */ > +const unsigned long *nd_pfn_supported_alignments(void) > +{ > + static unsigned long supported_alignments[3]; > + > + supported_alignments[0] = PAGE_SIZE; > + > + if (has_transparent_hugepage()) > + supported_alignments[1] = HPAGE_PMD_SIZE; > + else > + s
Re: NULL pointer dereference during memory hotremove
On Fri 17-05-19 10:20:38, Pavel Tatashin wrote: > This panic is unrelated to circular lock issue that I reported in a > separate thread, that also happens during memory hotremove. > > xakep ~/x/linux$ git describe > v5.1-12317-ga6a4b66bd8f4 Does this happen on 5.0 as well? -- Michal Hocko SUSE Labs
NULL pointer dereference during memory hotremove
This panic is unrelated to the circular lock issue that I reported in a separate thread, which also happens during memory hotremove. xakep ~/x/linux$ git describe v5.1-12317-ga6a4b66bd8f4 The config is attached; the qemu script is as follows: qemu-system-x86_64 \ -enable-kvm \ -cpu host \ -parallel none \ -echr 1 \ -serial none\ -chardev stdio,id=console,signal=off,mux=on \ -serial chardev:console \ -mon chardev=console\ -vga none \ -display none \ -kernel pmem/native/arch/x86/boot/bzImage \ -m 8G,slots=1,maxmem=16G\ -smp 8 \ -fsdev local,id=virtfs1,path=/,security_model=none \ -device virtio-9p-pci,fsdev=virtfs1,mount_tag=hostfs\ -append 'earlyprintk=serial,ttyS0,115200 console=ttyS0 TERM=xterm ip=dhcp memmap=2G!6G loglevel=7' The unusual thing about this script is that 2G is reserved for a pmem device: memmap=2G!6G. Otherwise, it is a normal layout. Unfortunately, it does not happen every time; I have hit it a couple of times. # QEMU 4.0.0 monitor - type 'help' for more information (qemu) object_add memory-backend-ram,id=mem1,size=1G (qemu) device_add pc-dimm,id=dimm1,memdev=mem1 # echo online_movable > /sys/devices/system/memory/memory79/state [ 40.219090] Built 1 zonelists, mobility grouping on. Total pages: 1529279 [ 40.223258] Policy zone: Normal # (qemu) device_del dimm1 (qemu) [ 49.624600] Offlined Pages 32768 [ 49.625796] Built 1 zonelists, mobility grouping on. 
Total pages: 1516352 [ 49.627841] Policy zone: Normal [ 49.630932] BUG: kernel NULL pointer dereference, address: 0698 [ 49.633704] #PF: supervisor read access in kernel mode [ 49.635689] #PF: error_code(0x) - not-present page [ 49.637620] PGD 800236b59067 P4D 800236b59067 PUD 2358fe067 PMD 0 [ 49.640163] Oops: [#1] SMP PTI [ 49.641223] CPU: 0 PID: 7 Comm: kworker/u16:0 Not tainted 5.1.0_pt_pmem #38 [ 49.643183] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-20181126_142135-anatol 04/01/2014 [ 49.645858] Workqueue: kacpi_hotplug acpi_hotplug_work_fn [ 49.647101] RIP: 0010:__remove_pages+0x1a/0x460 [ 49.648165] Code: e9 bb a9 fd ff 0f 0b 66 0f 1f 84 00 00 00 00 00 41 57 48 89 f8 49 89 ff 41 56 49 89 f6 41 55 41 54 55 53 48 89 d3 48 83 ec 50 <48> 2b 47 58 48 89 4c 24 48 48 3d 00 19 00 00 75 09 48 85 c9 0f 85 [ 49.651925] RSP: 0018:bd1000c8fcb8 EFLAGS: 00010286 [ 49.652857] RAX: 0640 RBX: 0004 RCX: [ 49.654139] RDX: 0004 RSI: 0024 RDI: 0640 [ 49.655393] RBP: R08: R09: 4000 [ 49.656523] R10: 4000 R11: 00024000 R12: 4000 [ 49.657654] R13: 00024000 R14: 0024 R15: 0640 [ 49.658828] FS: () GS:9b4bf980() knlGS: [ 49.660178] CS: 0010 DS: ES: CR0: 80050033 [ 49.661033] CR2: 0698 CR3: 0002382e0006 CR4: 00360ef0 [ 49.662114] DR0: DR1: DR2: [ 49.663172] DR3: DR6: fffe0ff0 DR7: 0400 [ 49.664243] Call Trace: [ 49.664622] ? memblock_isolate_range+0xc4/0x139 [ 49.665290] ? firmware_map_add_hotplug+0x7e/0xde [ 49.665908] ? memblock_remove_region+0x30/0x74 [ 49.666498] arch_remove_memory+0x6f/0xa0 [ 49.667012] __remove_memory+0xab/0x130 [ 49.667492] ? walk_memory_range+0xa1/0xe0 [ 49.668008] acpi_memory_device_remove+0x67/0xe0 [ 49.668595] acpi_bus_trim+0x50/0x90 [ 49.669051] acpi_device_hotplug+0x2fa/0x3e0 [ 49.669590] acpi_hotplug_work_fn+0x15/0x20 [ 49.670116] process_one_work+0x2a0/0x650 [ 49.670577] worker_thread+0x34/0x3d0 [ 49.670997] ? process_one_work+0x650/0x650 [ 49.671503] kthread+0x118/0x130 [ 49.671879] ? 
kthread_create_on_node+0x60/0x60 [ 49.672411] ret_from_fork+0x3a/0x50 [ 49.672836] Modules linked in: [ 49.673190] CR2: 0698 [ 49.673583] ---[ end trace 6b727d3a8ce48aa1 ]--- [ 49.674120] RIP: 0010:__remove_pages+0x1a/0x460 [ 49.674624] Code: e9 bb a9 fd ff 0f 0b 66 0f 1f 84 00 00 00 00 00 41 57 48 89 f8 49 89 ff 41 56 49 89 f6 41 55 41 54 55
Re: [v5 0/3] "Hotremove" persistent memory
> I would think that ACPI hotplug would have a similar problem, but it does
> this:
>
> acpi_unbind_memory_blocks(info);
> __remove_memory(nid, info->start_addr, info->length);

ACPI does have exactly the same problem, so this is not a bug for this series. I will submit a new version of my series with the comments addressed, but without a fix for this issue.

I was able to reproduce this issue on the current mainline kernel. Also, I have been thinking more about how to fix it, and there is no easy fix without a major hotplug redesign. Basically, we have to remove sysfs memory entries before or after memory is hotplugged/hotremoved. But we also have to guarantee that hotplug/hotremove will succeed, or else reinstate the sysfs entries.

Qemu script:

qemu-system-x86_64 \
    -enable-kvm \
    -cpu host \
    -parallel none \
    -echr 1 \
    -serial none \
    -chardev stdio,id=console,signal=off,mux=on \
    -serial chardev:console \
    -mon chardev=console \
    -vga none \
    -display none \
    -kernel pmem/native/arch/x86/boot/bzImage \
    -m 8G,slots=1,maxmem=16G \
    -smp 8 \
    -fsdev local,id=virtfs1,path=/,security_model=none \
    -device virtio-9p-pci,fsdev=virtfs1,mount_tag=hostfs \
    -append 'earlyprintk=serial,ttyS0,115200 console=ttyS0 TERM=xterm ip=dhcp loglevel=7'

Config is attached.

Steps to reproduce:
#
# QEMU 4.0.0 monitor - type 'help' for more information
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
(qemu)
# echo online_movable > /sys/devices/system/memory/memory79/state
[   23.029552] Built 1 zonelists, mobility grouping on.  Total pages: 2045370
[   23.032591] Policy zone: Normal
#
(qemu) device_del dimm1
(qemu)
[   32.013950] Offlined Pages 32768
[   32.014307] Built 1 zonelists, mobility grouping on.  Total pages: 2031022
[   32.014843] Policy zone: Normal
[   32.015733]
[   32.015881] ======================================================
[   32.016390] WARNING: possible circular locking dependency detected
[   32.016881] 5.1.0_pt_pmem #38 Not tainted
[   32.017202] ------------------------------------------------------
[   32.017680] kworker/u16:4/380 is trying to acquire lock:
[   32.018096] 675cc7e1 (kn->count#18){}, at: kernfs_remove_by_name_ns+0x3b/0x80
[   32.018745]
[   32.018745] but task is already holding lock:
[   32.019201] 53e50a99 (mem_sysfs_mutex){+.+.}, at: unregister_memory_section+0x1d/0xa0
[   32.019859]
[   32.019859] which lock already depends on the new lock.
[   32.019859]
[   32.020499]
[   32.020499] the existing dependency chain (in reverse order) is:
[   32.021080]
[   32.021080] -> #4 (mem_sysfs_mutex){+.+.}:
[   32.021522]        __mutex_lock+0x8b/0x900
[   32.021843]        hotplug_memory_register+0x26/0xa0
[   32.022231]        __add_pages+0xe7/0x160
[   32.022545]        add_pages+0xd/0x60
[   32.022835]        add_memory_resource+0xc3/0x1d0
[   32.023207]        __add_memory+0x57/0x80
[   32.023530]        acpi_memory_device_add+0x13a/0x2d0
[   32.023928]        acpi_bus_attach+0xf1/0x200
[   32.024272]        acpi_bus_scan+0x3e/0x90
[   32.024597]        acpi_device_hotplug+0x284/0x3e0
[   32.024972]        acpi_hotplug_work_fn+0x15/0x20
[   32.025342]        process_one_work+0x2a0/0x650
[   32.025755]        worker_thread+0x34/0x3d0
[   32.026077]        kthread+0x118/0x130
[   32.026442]        ret_from_fork+0x3a/0x50
[   32.026766]
[   32.026766] -> #3 (mem_hotplug_lock.rw_sem){}:
[   32.027261]        get_online_mems+0x39/0x80
[   32.027600]        kmem_cache_create_usercopy+0x29/0x2c0
[   32.028019]        kmem_cache_create+0xd/0x10
[   32.028367]        ptlock_cache_init+0x1b/0x23
[   32.028724]        start_kernel+0x1d2/0x4b8
[   32.029060]        secondary_startup_64+0xa4/0xb0
[   32.029447]
[   32.029447] -> #2 (cpu_hotplug_lock.rw_sem){}:
[   32.030007]        cpus_read_lock+0x39/0x80
[   32.030360]        __offline_pages+0x32/0x790
[   32.030709]        memory_subsys_offline+0x3a/0x60
[   32.031089]        device_offline+0x7e/0xb0
[   32.031425]        acpi_bus_offline+0xd8/0x140
[   32.031821]        acpi_device_hotplug+0x1b2/0x3e0
[   32.032202]        acpi_hotplug_work_fn+0x15/0x20
[   32.032576]        process_o
RE: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
From: Jan Kara
> Sent: 17 May 2019 09:48
...
> So this last paragraph is not obvious to me, as check_copy_size() does a lot
> of various checks in the CONFIG_HARDENED_USERCOPY case. I agree that some of
> those checks don't make sense for PMEM pages, but I'd rather handle that by
> refining the check_copy_size() and check_object_size() functions to detect and
> appropriately handle pmem pages rather than generally skip all the checks
> in pmem_copy_from/to_iter(). And yes, every check in such a hot path is going
> to cost performance, but that's what the user asked for with
> CONFIG_HARDENED_USERCOPY... Kees?

Except that all the distros enable it by default, so you get the performance cost whether you (as a user) want it or not.

I've changed some of our code to use __get_user() to avoid these stupid overheads.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead
Let's add Kees to CC for usercopy expertise...

On Thu 16-05-19 17:33:38, Dan Williams wrote:
> Jeff discovered that performance improves from ~375K iops to ~519K iops
> on a simple psync-write fio workload when moving the location of 'struct
> page' from the default PMEM location to DRAM. This result is surprising
> because the expectation is that 'struct page' for dax is only needed for
> third party references to dax mappings. For example, a dax-mapped buffer
> passed to another system call for direct-I/O requires 'struct page' for
> sending the request down the driver stack and pinning the page. There is
> no usage of 'struct page' for first party access to a file via
> read(2)/write(2) and friends.
>
> However, this "no page needed" expectation is violated by
> CONFIG_HARDENED_USERCOPY and the check_copy_size() performed in
> copy_from_iter_full_nocache() and copy_to_iter_mcsafe(). The
> check_heap_object() helper routine assumes the buffer is backed by a
> page-allocator DRAM page and applies some checks. Those checks are
> invalid (dax pages are not from the heap) and redundant
> (dax_iomap_actor() has already validated that the I/O is within bounds).

So this last paragraph is not obvious to me, as check_copy_size() does a lot of various checks in the CONFIG_HARDENED_USERCOPY case. I agree that some of those checks don't make sense for PMEM pages, but I'd rather handle that by refining the check_copy_size() and check_object_size() functions to detect and appropriately handle pmem pages rather than generally skip all the checks in pmem_copy_from/to_iter(). And yes, every check in such a hot path is going to cost performance, but that's what the user asked for with CONFIG_HARDENED_USERCOPY... Kees?

								Honza

> Bypass this overhead and call the 'no check' versions of the
> copy_{to,from}_iter operations directly.
>
> Fixes: 0aed55af8834 ("x86, uaccess: introduce copy_from_iter_flushcache...")
> Cc: Jan Kara
> Cc:
> Cc: Jeff Moyer
> Cc: Ingo Molnar
> Cc: Christoph Hellwig
> Cc: Al Viro
> Cc: Thomas Gleixner
> Cc: Matthew Wilcox
> Reported-and-tested-by: Jeff Smits
> Signed-off-by: Dan Williams
> ---
>  drivers/nvdimm/pmem.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 845c5b430cdd..c894f45e5077 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -281,16 +281,21 @@ static long pmem_dax_direct_access(struct dax_device *dax_dev,
>  	return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
>  }
>  
> +/*
> + * Use the 'no check' versions of copy_from_iter_flushcache() and
> + * copy_to_iter_mcsafe() to bypass HARDENED_USERCOPY overhead. Bounds
> + * checking is handled by dax_iomap_actor()
> + */
>  static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
>  		void *addr, size_t bytes, struct iov_iter *i)
>  {
> -	return copy_from_iter_flushcache(addr, bytes, i);
> +	return _copy_from_iter_flushcache(addr, bytes, i);
>  }
>  
>  static size_t pmem_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
>  		void *addr, size_t bytes, struct iov_iter *i)
>  {
> -	return copy_to_iter_mcsafe(addr, bytes, i);
> +	return _copy_to_iter_mcsafe(addr, bytes, i);
>  }
>  
>  static const struct dax_operations pmem_dax_ops = {

-- 
Jan Kara
SUSE Labs, CR