Re: [Qemu-devel] [PATCH v9 2/7] virtio-pmem: Add virtio pmem driver

2019-05-17 Thread Jakub Staroń
On 5/16/19 10:35 PM, Pankaj Gupta wrote:
> Can I take it your reviewed/acked-by? or tested-by tag? for the virtio patch
> :)

I don't feel that I have enough expertise to give the reviewed-by tag, but
you can take my acked-by + tested-by.

Acked-by: Jakub Staron 
Tested-by: Jakub Staron 

No kernel panics/stalls were encountered while testing these patches (v9) with
QEMU + xfstests.
Some CPU stalls were encountered while testing with crosvm instead of QEMU with
xfstests (test generic/464), but there was no reproduction with QEMU, so the
fault may be on the crosvm side.


The dump for the crosvm/xfstests stall:
[ 2504.175276] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 2504.176681] rcu: 0-...!: (1 GPs behind) idle=9b2/1/0x4000 
softirq=1089198/1089202 fqs=0 
[ 2504.178270] rcu: 2-...!: (1 ticks this GP) idle=cfe/1/0x4002 
softirq=1055108/1055110 fqs=0 
[ 2504.179802] rcu: 3-...!: (1 GPs behind) idle=1d6/1/0x4002 
softirq=1046798/1046802 fqs=0 
[ 2504.181215] rcu: 4-...!: (2 ticks this GP) idle=522/1/0x4002 
softirq=1249063/1249064 fqs=0 
[ 2504.182625] rcu: 5-...!: (1 GPs behind) idle=6da/1/0x4000 
softirq=1131036/1131047 fqs=0 
[ 2504.183955]  (detected by 3, t=0 jiffies, g=1232529, q=1370)
[ 2504.184762] Sending NMI from CPU 3 to CPUs 0:
[ 2504.186400] NMI backtrace for cpu 0
[ 2504.186401] CPU: 0 PID: 6670 Comm: 464 Not tainted 5.1.0+ #1
[ 2504.186401] Hardware name: ChromiumOS crosvm, BIOS 0 
[ 2504.186402] RIP: 0010:queued_spin_lock_slowpath+0x1c/0x1e0
[ 2504.186402] Code: e7 89 c8 f0 44 0f b1 07 39 c1 75 dc f3 c3 0f 1f 44 00 00 
ba 01 00 00 00 8b 07 85 c0 75 0a f0 0f b1 17 85 c0 75 f2 f3 c3 f3 90  ec 81 
fe 00 01 00 00 0f 84 ab 00 00 00 81 e6 00 ff ff ff 75 44
[ 2504.186403] RSP: 0018:c9003ee8 EFLAGS: 0002
[ 2504.186404] RAX: 0001 RBX: 0246 RCX: 00404044
[ 2504.186404] RDX: 0001 RSI: 0001 RDI: 8244a280
[ 2504.186405] RBP: 8244a280 R08: 000f4200 R09: 024709ed6c32
[ 2504.186405] R10:  R11: 0001 R12: 8244a280
[ 2504.186405] R13: 0009 R14: 0009 R15: 
[ 2504.186406] FS:  () GS:8880cc60() 
knlGS:
[ 2504.186406] CS:  0010 DS:  ES:  CR0: 80050033
[ 2504.186406] CR2: 7efd6b0f15d8 CR3: 0260a006 CR4: 00360ef0
[ 2504.186407] DR0:  DR1:  DR2: 
[ 2504.186407] DR3:  DR6: fffe0ff0 DR7: 0400
[ 2504.186407] Call Trace:
[ 2504.186408]  
[ 2504.186408]  _raw_spin_lock_irqsave+0x1d/0x30
[ 2504.186408]  rcu_core+0x3b6/0x740
[ 2504.186408]  ? __hrtimer_run_queues+0x133/0x280
[ 2504.186409]  ? recalibrate_cpu_khz+0x10/0x10
[ 2504.186409]  __do_softirq+0xd8/0x2e4
[ 2504.186409]  irq_exit+0xa3/0xb0
[ 2504.186410]  smp_apic_timer_interrupt+0x67/0x120
[ 2504.186410]  apic_timer_interrupt+0xf/0x20
[ 2504.186410]  
[ 2504.186410] RIP: 0010:unmap_page_range+0x47a/0x9b0
[ 2504.186411] Code: 0f 46 46 10 49 39 6e 18 49 89 46 10 48 89 e8 49 0f 43 46 
18 41 80 4e 20 08 4d 85 c9 49 89 46 18 0f 84 68 ff ff ff 49 8b 51 08 <48> 8d 42 
ff 83 e2 01 49 0f 44 c1 f6 40 18 01 75 38 48 ba ff 0f 00
[ 2504.186411] RSP: 0018:c900036cbcc8 EFLAGS: 0282 ORIG_RAX: 
ff13
[ 2504.186412] RAX:  RBX: 80003751d045 RCX: 0001
[ 2504.186413] RDX: ea0002e09288 RSI: 0269b000 RDI: 8880b6525e40
[ 2504.186413] RBP: 0269c000 R08:  R09: eadd4740
[ 2504.186413] R10: ea0001755700 R11: 8880cc62d120 R12: 02794000
[ 2504.186414] R13: 0269b000 R14: c900036cbdf0 R15: 8880572434d8
[ 2504.186414]  ? unmap_page_range+0x420/0x9b0
[ 2504.186414]  ? release_pages+0x175/0x390
[ 2504.186414]  unmap_vmas+0x7c/0xe0
[ 2504.186415]  exit_mmap+0xa4/0x190
[ 2504.186415]  mmput+0x3b/0x100
[ 2504.186415]  do_exit+0x276/0xc10
[ 2504.186415]  do_group_exit+0x35/0xa0
[ 2504.186415]  __x64_sys_exit_group+0xf/0x10
[ 2504.186416]  do_syscall_64+0x43/0x120
[ 2504.186416]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 2504.186416] RIP: 0033:0x7efd6ae10618
[ 2504.186416] Code: Bad RIP value.
[ 2504.186417] RSP: 002b:7ffcac9bde38 EFLAGS: 0246 ORIG_RAX: 
00e7
[ 2504.186417] RAX: ffda RBX:  RCX: 7efd6ae10618
[ 2504.186418] RDX:  RSI: 003c RDI: 
[ 2504.186418] RBP: 7efd6b0ed8e0 R08: 00e7 R09: ff98
[ 2504.186418] R10: 7ffcac9bddb8 R11: 0246 R12: 7efd6b0ed8e0
[ 2504.186419] R13: 7efd6b0f2c20 R14: 0060 R15: 0070e705
[ 2504.186421] NMI backtrace for cpu 3
[ 2504.226980] CPU: 3 PID: 6596 Comm: xfs_io Not tainted 5.1.0+ #1
[ 2504.227661] Hardware name: ChromiumOS crosvm, BIOS 0 
[ 2504.228261] Call Trace:
[ 2504.228552

[v6 0/3] "Hotremove" persistent memory

2019-05-17 Thread Pavel Tatashin
Changelog:

v6
- A few minor changes and added reviewed-by's.
- Spent time studying lock ordering issue that was reported by Vishal
  Verma, but that issue already exists in Linux, and can be reproduced
  with exactly the same steps with ACPI memory hotplugging.

v5
- Addressed comments from Dan Williams: made remove_memory() to return
  an error code, and use this function from dax.

v4
- Addressed comments from Dave Hansen

v3
- Addressed comments from David Hildenbrand. Don't release
  lock_device_hotplug after checking memory status, and rename
  memblock_offlined_cb() to check_memblock_offlined_cb()

v2
- Dan Williams mentioned that drv->remove() return is ignored
  by unbind. Unbind always succeeds. Because we cannot guarantee
  that memory can be offlined from the driver, don't even
  attempt to do so. Simply check that every section is offlined
  beforehand and only then proceed with removing dax memory.

---

Recently, support for using persistent memory like regular RAM was added to
Linux. This work extends that functionality to also allow hot removing
persistent memory.

We (Microsoft) have an important use case for this functionality.

The requirement is for physical machines with a small amount of RAM (~8G)
to be able to reboot in a very short period of time (<1s). Yet, there is
userland state that is expensive to recreate (~2G).

The solution is to boot the machines with 2G reserved for persistent
memory.

Copy the state out, and hotadd the persistent memory so the machine still
has all 8G available at runtime. Before reboot, offline and hotremove the
2G device-dax region, copy the memory that needs to be preserved to the
pmem0 device, and reboot.

The series of operations look like this:

1. After boot, restore /dev/pmem0 to a ramdisk to be consumed by apps,
   and free the ramdisk.
2. Convert raw pmem0 to devdax
   ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
3. Hotadd to System RAM
   echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
   echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
   echo online_movable > /sys/devices/system/memoryXXX/state
4. Before reboot hotremove device-dax memory from System RAM
   echo offline > /sys/devices/system/memoryXXX/state
   echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind
5. Create raw pmem0 device
   ndctl create-namespace --mode raw  -e namespace0.0 -f
6. Copy the state that was stored by apps to ramdisk to pmem device
7. Do a kexec reboot, or reboot through firmware if the firmware does not
   zero the memory in the pmem0 region (these machines have only regular
   volatile memory). To get the pmem0 device, either the memmap kernel
   parameter is used, or device nodes in the dtb are specified.

Pavel Tatashin (3):
  device-dax: fix memory and resource leak if hotplug fails
  mm/hotplug: make remove_memory() interface useable
  device-dax: "Hotremove" persistent memory that is used like normal RAM

 drivers/dax/dax-private.h  |  2 ++
 drivers/dax/kmem.c | 46 +---
 include/linux/memory_hotplug.h |  8 +++--
 mm/memory_hotplug.c| 64 +++---
 4 files changed, 92 insertions(+), 28 deletions(-)

-- 
2.21.0



[v6 1/3] device-dax: fix memory and resource leak if hotplug fails

2019-05-17 Thread Pavel Tatashin
When the add_memory() function fails, the resource and the memory should be
freed.

Fixes: c221c0b0308f ("device-dax: "Hotplug" persistent memory for use like normal RAM")

Signed-off-by: Pavel Tatashin 
Reviewed-by: Dave Hansen 
---
 drivers/dax/kmem.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a02318c6d28a..4c0131857133 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -66,8 +66,11 @@ int dev_dax_kmem_probe(struct device *dev)
new_res->name = dev_name(dev);
 
rc = add_memory(numa_node, new_res->start, resource_size(new_res));
-   if (rc)
+   if (rc) {
+   release_resource(new_res);
+   kfree(new_res);
return rc;
+   }
 
return 0;
 }
-- 
2.21.0



[v6 3/3] device-dax: "Hotremove" persistent memory that is used like normal RAM

2019-05-17 Thread Pavel Tatashin
It is now possible to use persistent memory like regular RAM, but
currently there is no way to remove this memory until the machine is
rebooted.

This work expands the functionality to also allow hotremoving
previously hotplugged persistent memory, and to recover the device for
other uses.

To hotremove persistent memory, the management software must first
offline all memory blocks of the dax region, and then unbind it from the
device-dax/kmem driver. So, the operations should look like this:

echo offline > /sys/devices/system/memory/memoryN/state
...
echo dax0.0 > /sys/bus/dax/drivers/kmem/unbind

Note: if unbind is done without offlining the memory beforehand, it won't be
possible to hotremove dax0.0, and its memory is going to remain part of
System RAM until reboot.

Signed-off-by: Pavel Tatashin 
Reviewed-by: David Hildenbrand 
---
 drivers/dax/dax-private.h |  2 ++
 drivers/dax/kmem.c| 41 +++
 2 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index a45612148ca0..999aaf3a29b3 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -53,6 +53,7 @@ struct dax_region {
  * @pgmap - pgmap for memmap setup / lifetime (driver owned)
  * @ref: pgmap reference count (driver owned)
  * @cmp: @ref final put completion (driver owned)
+ * @dax_kmem_res: physical address range of hotadded DAX memory
  */
 struct dev_dax {
struct dax_region *region;
@@ -62,6 +63,7 @@ struct dev_dax {
struct dev_pagemap pgmap;
struct percpu_ref ref;
struct completion cmp;
+   struct resource *dax_kmem_res;
 };
 
 static inline struct dev_dax *to_dev_dax(struct device *dev)
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 4c0131857133..3d0a7e702c94 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -71,21 +71,54 @@ int dev_dax_kmem_probe(struct device *dev)
kfree(new_res);
return rc;
}
+   dev_dax->dax_kmem_res = new_res;
 
return 0;
 }
 
+#ifdef CONFIG_MEMORY_HOTREMOVE
+static int dev_dax_kmem_remove(struct device *dev)
+{
+   struct dev_dax *dev_dax = to_dev_dax(dev);
+   struct resource *res = dev_dax->dax_kmem_res;
+   resource_size_t kmem_start = res->start;
+   resource_size_t kmem_size = resource_size(res);
+   int rc;
+
+   /*
+    * We have one shot for removing memory. If some memory blocks were not
+    * offlined prior to calling this function, remove_memory() will fail, and
+    * there is no way to hotremove this memory until reboot, because device
+    * unbind will succeed even if we return failure.
+    */
+   rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size);
+   if (rc) {
+   dev_err(dev,
+   "DAX region %pR cannot be hotremoved until the next reboot\n",
+   res);
+   return rc;
+   }
+
+   /* Release and free dax resources */
+   release_resource(res);
+   kfree(res);
+   dev_dax->dax_kmem_res = NULL;
+
+   return 0;
+}
+#else
 static int dev_dax_kmem_remove(struct device *dev)
 {
/*
-* Purposely leak the request_mem_region() for the device-dax
-* range and return '0' to ->remove() attempts. The removal of
-* the device from the driver always succeeds, but the region
-* is permanently pinned as reserved by the unreleased
+* Without hotremove purposely leak the request_mem_region() for the
+* device-dax range and return '0' to ->remove() attempts. The removal
+* of the device from the driver always succeeds, but the region is
+* permanently pinned as reserved by the unreleased
 * request_mem_region().
 */
return 0;
 }
+#endif /* CONFIG_MEMORY_HOTREMOVE */
 
 static struct dax_device_driver device_dax_kmem_driver = {
.drv = {
-- 
2.21.0



[v6 2/3] mm/hotplug: make remove_memory() interface useable

2019-05-17 Thread Pavel Tatashin
As of right now the remove_memory() interface is inherently broken. It tries
to remove memory but panics if some memory is not offline. The problem
is that it is impossible to ensure that all memory blocks are offline, as
this function also takes lock_device_hotplug, which is required to
change memory state via sysfs.

So, between offlining all memory blocks and calling this function there
is always a window when lock_device_hotplug is released, and therefore
there is always a chance of a panic during this window.

Make this interface return an error if memory removal fails. This way
it is safe to call this function without panicking the machine, and it also
becomes symmetric to add_memory(), which already returns an error.
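
For illustration, here is a minimal sketch of the calling pattern this change
enables (patch 3/3 in this series does the same thing in dev_dax_kmem_remove();
the function and message below are made up for the example):

static int example_dev_remove(struct device *dev, int nid, u64 start, u64 size)
{
        int rc = remove_memory(nid, start, size);

        if (rc) {
                /* refuse gracefully instead of the old BUG() behavior */
                dev_err(dev, "memory cannot be hotremoved until reboot\n");
                return rc;
        }
        return 0;
}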

Signed-off-by: Pavel Tatashin 
Reviewed-by: David Hildenbrand 
---
 include/linux/memory_hotplug.h |  8 +++--
 mm/memory_hotplug.c| 64 +++---
 2 files changed, 49 insertions(+), 23 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index ae892eef8b82..988fde33cd7f 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -324,7 +324,7 @@ static inline void pgdat_resize_init(struct pglist_data *pgdat) {}
 extern bool is_mem_section_removable(unsigned long pfn, unsigned long nr_pages);
 extern void try_offline_node(int nid);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
-extern void remove_memory(int nid, u64 start, u64 size);
+extern int remove_memory(int nid, u64 start, u64 size);
 extern void __remove_memory(int nid, u64 start, u64 size);
 
 #else
@@ -341,7 +341,11 @@ static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
return -EINVAL;
 }
 
-static inline void remove_memory(int nid, u64 start, u64 size) {}
+static inline int remove_memory(int nid, u64 start, u64 size)
+{
+   return -EBUSY;
+}
+
 static inline void __remove_memory(int nid, u64 start, u64 size) {}
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 328878b6799d..ace2cc614da4 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1735,9 +1735,10 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
endpa = PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1))-1;
pr_warn("removing memory fails, because memory [%pa-%pa] is 
onlined\n",
&beginpa, &endpa);
-   }
 
-   return ret;
+   return -EBUSY;
+   }
+   return 0;
 }
 
 static int check_cpu_on_node(pg_data_t *pgdat)
@@ -1820,19 +1821,9 @@ static void __release_memory_resource(resource_size_t start,
}
 }
 
-/**
- * remove_memory
- * @nid: the node ID
- * @start: physical address of the region to remove
- * @size: size of the region to remove
- *
- * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
- * and online/offline operations before this call, as required by
- * try_offline_node().
- */
-void __ref __remove_memory(int nid, u64 start, u64 size)
+static int __ref try_remove_memory(int nid, u64 start, u64 size)
 {
-   int ret;
+   int rc = 0;
 
BUG_ON(check_hotplug_memory_range(start, size));
 
@@ -1840,13 +1831,13 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
 
/*
 * All memory blocks must be offlined before removing memory.  Check
-* whether all memory blocks in question are offline and trigger a BUG()
+* whether all memory blocks in question are offline and return error
 * if this is not the case.
 */
-   ret = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
-   check_memblock_offlined_cb);
-   if (ret)
-   BUG();
+   rc = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
+  check_memblock_offlined_cb);
+   if (rc)
+   goto done;
 
/* remove memmap entry */
firmware_map_remove(start, start + size, "System RAM");
@@ -1858,14 +1849,45 @@ void __ref __remove_memory(int nid, u64 start, u64 size)
 
try_offline_node(nid);
 
+done:
mem_hotplug_done();
+   return rc;
 }
 
-void remove_memory(int nid, u64 start, u64 size)
+/**
+ * remove_memory
+ * @nid: the node ID
+ * @start: physical address of the region to remove
+ * @size: size of the region to remove
+ *
+ * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
+ * and online/offline operations before this call, as required by
+ * try_offline_node().
+ */
+void __remove_memory(int nid, u64 start, u64 size)
+{
+
+   /*
+    * trigger BUG() if some memory is not offlined prior to calling this
+    * function
+    */
+   if (try_remove_memory(nid, start, size))
+   BUG();
+}
+
+/*
+ * Remove memory if every memory block is offline, otherwise return -EBUSY if
+ * some memory is not offline

Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

2019-05-17 Thread Kees Cook
On Fri, May 17, 2019 at 10:28:48AM -0700, Dan Williams wrote:
> On Fri, May 17, 2019 at 8:57 AM Kees Cook  wrote:
> >
> > On Fri, May 17, 2019 at 08:08:27AM -0700, Dan Williams wrote:
> > > As far as I can see it's mostly check_heap_object() that is the
> > > problem, so I'm open to finding a way to just bypass that sub-routine.
> > > However, as far as I can see none of the other block / filesystem user
> > > copy implementations submit to the hardened checks, like
> > > bio_copy_from_iter(), and iov_iter_copy_from_user_atomic() . So,
> > > either those need to grow additional checks, or the hardened copy
> > > implementation is targeting single object copy use cases, not
> > > necessarily block-I/O. Yes, Kees, please advise.
> >
> > The intention is mainly for copies that haven't had explicit bounds
> > checking already performed on them, yes. Is there something getting
> > checked out of the slab, or is it literally just the overhead of doing
> > the "is this slab?" check that you're seeing?
> 
> It's literally the overhead of "is this slab?" since it needs to go
> retrieve the struct page and read that potentially cold cacheline. In
> the case where that page is on memory media that is higher latency
> than DRAM we get the ~37% performance loss that Jeff measured.

Ah-ha! Okay, I understand now; thanks!

> The path is via the filesystem ->write_iter() file operation. In the
> DAX case the filesystem traps that path early, before submitting block
> I/O, and routes it to the dax_iomap_actor() routine. That routine
> validates that the logical file offset is within bounds of the file,
> then it does a sector-to-pfn translation which validates that the
> physical mapping is within bounds of the block device.
> 
> It seems dax_iomap_actor() is not a path where we'd be worried about
> needing hardened user copy checks.

I would agree: I think the proposed patch makes sense. :)

-- 
Kees Cook


Re: [PATCH v4 01/18] kunit: test: add KUnit test runner core

2019-05-17 Thread Stephen Boyd
Quoting Brendan Higgins (2019-05-14 15:16:54)
> diff --git a/include/kunit/test.h b/include/kunit/test.h
> new file mode 100644
> index 0..e682ea0e1f9a5
> --- /dev/null
> +++ b/include/kunit/test.h
> @@ -0,0 +1,162 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Base unit test (KUnit) API.
> + *
> + * Copyright (C) 2019, Google LLC.
> + * Author: Brendan Higgins 
> + */
> +
> +#ifndef _KUNIT_TEST_H
> +#define _KUNIT_TEST_H
> +
> +#include 
> +#include 

Is this include used here?

> +
> +struct kunit;
> +
> +/**
> + * struct kunit_case - represents an individual test case.
> + * @run_case: the function representing the actual test case.
> + * @name: the name of the test case.
> + *
> + * A test case is a function with the signature, ``void (*)(struct kunit *)``
> + * that makes expectations (see KUNIT_EXPECT_TRUE()) about code under test. 
> Each
> + * test case is associated with a &struct kunit_module and will be run after 
> the
> + * module's init function and followed by the module's exit function.
> + *
> + * A test case should be static and should only be created with the 
> KUNIT_CASE()
> + * macro; additionally, every array of test cases should be terminated with 
> an
> + * empty test case.
> + *
> + * Example:
> + *
> + * .. code-block:: c
> + *
> + * void add_test_basic(struct kunit *test)
> + * {
> + * KUNIT_EXPECT_EQ(test, 1, add(1, 0));
> + * KUNIT_EXPECT_EQ(test, 2, add(1, 1));
> + * KUNIT_EXPECT_EQ(test, 0, add(-1, 1));
> + * KUNIT_EXPECT_EQ(test, INT_MAX, add(0, INT_MAX));
> + * KUNIT_EXPECT_EQ(test, -1, add(INT_MAX, INT_MIN));
> + * }
> + *
> + * static struct kunit_case example_test_cases[] = {
> + * KUNIT_CASE(add_test_basic),
> + * {},

Nitpick: Please drop the comma on the sentinel so nobody gets ideas to
add another entry after it.

> + * };
> + *
> + */
> +struct kunit_case {
> +   void (*run_case)(struct kunit *test);
> +   const char name[256];

Maybe 256 can be a #define KUNIT_NAME_MAX_LEN? Or it could just be a
const char pointer to a literal pool? Are unit tests making up names at
runtime?
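
For reference, one possible shape of the alternatives suggested above (the
macro name is hypothetical, not part of the posted patch):

#define KUNIT_NAME_MAX_LEN 256

struct kunit_case {
        void (*run_case)(struct kunit *test);
        const char name[KUNIT_NAME_MAX_LEN];    /* or simply: const char *name; */

        /* private: internal use only. */
        bool success;
};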

> +
> +   /* private: internal use only. */
> +   bool success;
> +};
> +
> +/**
> + * KUNIT_CASE - A helper for creating a &struct kunit_case
> + * @test_name: a reference to a test case function.
> + *
> + * Takes a symbol for a function representing a test case and creates a
> + * &struct kunit_case object from it. See the documentation for
> + * &struct kunit_case for an example on how to use it.
> + */
> +#define KUNIT_CASE(test_name) { .run_case = test_name, .name = #test_name }
> +
> +/**
> + * struct kunit_module - describes a related collection of &struct 
> kunit_case s.
> + * @name: the name of the test. Purely informational.
> + * @init: called before every test case.
> + * @exit: called after every test case.
> + * @test_cases: a null terminated array of test cases.
> + *
> + * A kunit_module is a collection of related &struct kunit_case s, such that
> + * @init is called before every test case and @exit is called after every 
> test
> + * case, similar to the notion of a *test fixture* or a *test class* in other
> + * unit testing frameworks like JUnit or Googletest.
> + *
> + * Every &struct kunit_case must be associated with a kunit_module for KUnit 
> to
> + * run it.
> + */
> +struct kunit_module {
> +   const char name[256];
> +   int (*init)(struct kunit *test);
> +   void (*exit)(struct kunit *test);
> +   struct kunit_case *test_cases;

Can this variable be const? Or do we expect test modules to adjust test_cases
after the fact?

> +};
> +
> +/**
> + * struct kunit - represents a running instance of a test.
> + * @priv: for user to store arbitrary data. Commonly used to pass data 
> created
> + * in the init function (see &struct kunit_module).
> + *
> + * Used to store information about the current context under which the test 
> is
> + * running. Most of this data is private and should only be accessed 
> indirectly
> + * via public functions; the one exception is @priv which can be used by the
> + * test writer to store arbitrary data.
> + */
> +struct kunit {
> +   void *priv;
> +
> +   /* private: internal use only. */
> +   const char *name; /* Read only after initialization! */
> +   spinlock_t lock; /* Gaurds all mutable test state. */
> +   bool success; /* Protected by lock. */
> +};
> +
> +void kunit_init_test(struct kunit *test, const char *name);
> +
> +int kunit_run_tests(struct kunit_module *module);
> +
> +/**
> + * module_test() - used to register a &struct kunit_module with KUnit.
> + * @module: a statically allocated &struct kunit_module.
> + *
> + * Registers @module with the test framework. See &struct kunit_module for 
> more
> + * information.
> + */
> +#define module_test(module) \
> +   static int module_kunit_init##module(void) \
> +   { \
> +  

Re: [PATCH v4 17/18] kernel/sysctl-test: Add null pointer test for sysctl.c:proc_dointvec()

2019-05-17 Thread Stephen Boyd
Quoting Brendan Higgins (2019-05-14 15:17:10)
> diff --git a/kernel/sysctl-test.c b/kernel/sysctl-test.c
> new file mode 100644
> index 0..fe0f2bae66085
> --- /dev/null
> +++ b/kernel/sysctl-test.c
> @@ -0,0 +1,293 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * KUnit test of proc sysctl.
> + */
> +
> +#include 
> +#include 

Is this include used?

> +#include 
> +#include 

Is this include used?

> +
> +
> +static void sysctl_test_dointvec_happy_single_negative(struct kunit *test)
> +{
> +   struct ctl_table table = {
> +   .procname = "foo",
> +   .data   = &test_data.int_0001,
> +   .maxlen = sizeof(int),
> +   .mode   = 0644,
> +   .proc_handler   = proc_dointvec,
> +   .extra1 = &i_zero,
> +   .extra2 = &i_one_hundred,
> +   };
> +   char input[] = "-9";
> +   size_t len = sizeof(input) - 1;
> +   loff_t pos = 0;
> +
> +   table.data = kunit_kzalloc(test, sizeof(int), GFP_USER);
> +   KUNIT_EXPECT_EQ(test, 0, proc_dointvec(&table, 1, input, &len, &pos));
> +   KUNIT_EXPECT_EQ(test, sizeof(input) - 1, len);
> +   KUNIT_EXPECT_EQ(test, sizeof(input) - 1, pos);
> +   KUNIT_EXPECT_EQ(test, -9, *(int *)table.data);

Is the casting necessary? Or can the macro do a type coercion of the
second parameter based on the first type?

> +}
> +
> +static void sysctl_test_dointvec_single_less_int_min(struct kunit *test)
> +{
> +   struct ctl_table table = {
> +   .procname = "foo",
> +   .data   = &test_data.int_0001,
> +   .maxlen = sizeof(int),
> +   .mode   = 0644,
> +   .proc_handler   = proc_dointvec,
> +   .extra1 = &i_zero,
> +   .extra2 = &i_one_hundred,
> +   };
> +   char input[32];
> +   size_t len = sizeof(input) - 1;
> +   loff_t pos = 0;
> +   unsigned long abs_of_less_than_min = (unsigned long)INT_MAX
> +- (INT_MAX + INT_MIN) + 1;
> +
> +   KUNIT_EXPECT_LT(test,
> +   snprintf(input, sizeof(input), "-%lu",
> +abs_of_less_than_min),
> +   sizeof(input));
> +
> +   table.data = kunit_kzalloc(test, sizeof(int), GFP_USER);
> +   KUNIT_EXPECT_EQ(test, -EINVAL,
> +   proc_dointvec(&table, 1, input, &len, &pos));
> +   KUNIT_EXPECT_EQ(test, sizeof(input) - 1, len);
> +   KUNIT_EXPECT_EQ(test, 0, *(int *)table.data);
> +}
> +
> +static void sysctl_test_dointvec_single_greater_int_max(struct kunit *test)
> +{
> +   struct ctl_table table = {
> +   .procname = "foo",
> +   .data   = &test_data.int_0001,
> +   .maxlen = sizeof(int),
> +   .mode   = 0644,
> +   .proc_handler   = proc_dointvec,
> +   .extra1 = &i_zero,
> +   .extra2 = &i_one_hundred,
> +   };
> +   char input[32];
> +   size_t len = sizeof(input) - 1;
> +   loff_t pos = 0;
> +   unsigned long greater_than_max = (unsigned long)INT_MAX + 1;
> +
> +   KUNIT_EXPECT_GT(test, greater_than_max, INT_MAX);
> +   KUNIT_EXPECT_LT(test, snprintf(input, sizeof(input), "%lu",
> +  greater_than_max),
> +   sizeof(input));
> +   table.data = kunit_kzalloc(test, sizeof(int), GFP_USER);
> +   KUNIT_EXPECT_EQ(test, -EINVAL,
> +   proc_dointvec(&table, 1, input, &len, &pos));
> +   KUNIT_EXPECT_EQ(test, sizeof(input) - 1, len);
> +   KUNIT_EXPECT_EQ(test, 0, *(int *)table.data);
> +}
> +
> +static int sysctl_test_init(struct kunit *test)
> +{
> +   return 0;
> +}
> +
> +/*
> + * This is run once after each test case, see the comment on 
> example_test_module
> + * for more information.
> + */
> +static void sysctl_test_exit(struct kunit *test)
> +{
> +}

Can the above two be omitted? If they can be empty sometimes it would be
nice to avoid the extra symbols and code by letting them be assigned to
NULL in the kunit_module.
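
As a sketch of what that could look like (hypothetical, not what the posted
patch does): leave the hooks unset in the module definition and have the
runner tolerate NULL:

static struct kunit_module sysctl_test_module = {
        .name = "sysctl_test",
        /* .init and .exit left NULL: nothing to set up or tear down */
        .test_cases = sysctl_test_cases,
};

/* runner side */
static int kunit_module_init(struct kunit_module *module, struct kunit *test)
{
        return module->init ? module->init(test) : 0;
}

static void kunit_module_exit(struct kunit_module *module, struct kunit *test)
{
        if (module->exit)
                module->exit(test);
}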

> +
> +/*
> + * Here we make a list of all the test cases we want to add to the test 
> module
> + * below.
> + */
> +static struct kunit_case sysctl_test_cases[] = {
> +   /*
> +* This is a helper to create a test case object from a test case
> +* function; its exact function is not important to understand how to
> +* use KUnit, just know that this is how you associate test cases 
> with a
> +* test module.
> +*/
> +   KUNIT_CASE(sysctl_test_dointvec_null_tbl_data),
> +   KUNIT_CASE(sysctl_test_dointvec_table_maxlen_unset),
> +   KUNIT_CASE(sysctl_test_dointvec_table_len_is_zero),
> +   KUNIT_CASE(sysctl_test_dointvec_table_read_but_position_set),
> +   KUNIT_CA

Re: [PATCH] mm, memory-failure: clarify error message

2019-05-17 Thread Verma, Vishal L
On Thu, 2019-05-16 at 22:08 -0600, Jane Chu wrote:
> Some users who install a SIGBUS handler that does longjmp out,
> thereby keeping the process alive, are confused by the error
> message
>   "[188988.765862] Memory failure: 0x1840200: Killing
>cellsrv:33395 due to hardware memory corruption"
> Slightly modify the error message to improve clarity.
> 
> Signed-off-by: Jane Chu 
> ---
>  mm/memory-failure.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index fc8b517..14de5e2 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -216,10 +216,9 @@ static int kill_proc(struct to_kill *tk, unsigned long 
> pfn, int flags)
>   short addr_lsb = tk->size_shift;
>   int ret;
>  
> - pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory 
> corruption\n",
> - pfn, t->comm, t->pid);
> -
>   if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
> + pr_err("Memory failure: %#lx: Killing %s:%d due to hardware 
> memory "
> + "corruption\n", pfn, t->comm, t->pid);

Minor nit, but the string shouldn't be split over multiple lines to
preserve grep-ability. In such a case it is usually considered OK to
exceed 80 characters for the line if needed.
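
i.e., keeping the format string whole, something like (sketch of the suggested
form only):

        pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory corruption\n",
               pfn, t->comm, t->pid);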

>   ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)tk->addr,
>  addr_lsb, current);
>   } else {
> @@ -229,6 +228,8 @@ static int kill_proc(struct to_kill *tk, unsigned long 
> pfn, int flags)
>* This could cause a loop when the user sets SIGBUS
>* to SIG_IGN, but hopefully no one will do that?
>*/
> + pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to 
> hardware "
> + "memory corruption\n", pfn, t->comm, t->pid);
>   ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)tk->addr,
> addr_lsb, t);  /* synchronous? */
>   }


Re: [v5 2/3] mm/hotplug: make remove_memory() interface useable

2019-05-17 Thread Pavel Tatashin
Hi Dan,

Thank you very much for your review, my comments below:

On Mon, May 6, 2019 at 2:01 PM Dan Williams  wrote:
>
> On Mon, May 6, 2019 at 10:57 AM Dave Hansen  wrote:
> >
> > > -static inline void remove_memory(int nid, u64 start, u64 size) {}
> > > +static inline bool remove_memory(int nid, u64 start, u64 size)
> > > +{
> > > + return -EBUSY;
> > > +}
> >
> > This seems like an appropriate place for a WARN_ONCE(), if someone
> > manages to call remove_memory() with hotplug disabled.

I decided not to do WARN_ONCE(); in all likelihood the compiler will
simply optimize this function out, but with WARN_ONCE() some traces of
it would remain.

> >
> > BTW, I looked and can't think of a better errno, but -EBUSY probably
> > isn't the best error code, right?

-EBUSY is the only error that is returned in case of error by real
remove_memory(), so I think it is OK to keep it here.

> >
> > > -void remove_memory(int nid, u64 start, u64 size)
> > > +/**
> > > + * remove_memory
> > > + * @nid: the node ID
> > > + * @start: physical address of the region to remove
> > > + * @size: size of the region to remove
> > > + *
> > > + * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> > > + * and online/offline operations before this call, as required by
> > > + * try_offline_node().
> > > + */
> > > +void __remove_memory(int nid, u64 start, u64 size)
> > >  {
> > > +
> > > + /*
> > > +  * trigger BUG() is some memory is not offlined prior to calling 
> > > this
> > > +  * function
> > > +  */
> > > + if (try_remove_memory(nid, start, size))
> > > + BUG();
> > > +}
> >
> > Could we call this remove_offline_memory()?  That way, it makes _some_
> > sense why we would BUG() if the memory isn't offline.

In this particular code path, the second one, remove_memory() actually
tries to remove memory and returns failure if it can't. So, I think the
current name is OK.

>
> Please WARN() instead of BUG() because failing to remove memory should
> not be system fatal.

As mentioned earlier, I will keep BUG(), because the existing code does
that, and there is no good way to handle an error return in this path.

Thank you,
Pavel


Re: [PATCH v4 04/18] kunit: test: add kunit_stream a std::stream like logger

2019-05-17 Thread Stephen Boyd
Quoting Brendan Higgins (2019-05-14 15:16:57)
> diff --git a/kunit/kunit-stream.c b/kunit/kunit-stream.c
> new file mode 100644
> index 0..1884f1b550888
> --- /dev/null
> +++ b/kunit/kunit-stream.c
> @@ -0,0 +1,152 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * C++ stream style string formatter and printer used in KUnit for outputting
> + * KUnit messages.
> + *
> + * Copyright (C) 2019, Google LLC.
> + * Author: Brendan Higgins 
> + */
> +
> +#include 
> +#include 
> +#include 
> +
> +static const char *kunit_stream_get_level(struct kunit_stream *this)
> +{
> +   unsigned long flags;
> +   const char *level;
> +
> +   spin_lock_irqsave(&this->lock, flags);
> +   level = this->level;
> +   spin_unlock_irqrestore(&this->lock, flags);
> +
> +   return level;

Please remove this whole function and inline it to the one call-site.

> +}
> +
> +void kunit_stream_set_level(struct kunit_stream *this, const char *level)
> +{
> +   unsigned long flags;
> +
> +   spin_lock_irqsave(&this->lock, flags);
> +   this->level = level;
> +   spin_unlock_irqrestore(&this->lock, flags);

I don't get the locking here. What are we protecting against? Are tests
running in parallel using the same kunit_stream? If so, why is the level
changeable in one call and then adding strings is done in a different
function call? It would make sense to combine the level setting and
string adding so that it's one atomic operation if it's truly a parallel
operation, or remove the locking entirely.

> +}
> +
> +void kunit_stream_add(struct kunit_stream *this, const char *fmt, ...)
> +{
> +   va_list args;
> +   struct string_stream *stream = this->internal_stream;
> +
> +   va_start(args, fmt);
> +
> +   if (string_stream_vadd(stream, fmt, args) < 0)
> +   kunit_err(this->test, "Failed to allocate fragment: %s\n", 
> fmt);
> +
> +   va_end(args);
> +}
> +
> +void kunit_stream_append(struct kunit_stream *this,
> +   struct kunit_stream *other)
> +{
> +   struct string_stream *other_stream = other->internal_stream;
> +   const char *other_content;
> +
> +   other_content = string_stream_get_string(other_stream);
> +
> +   if (!other_content) {
> +   kunit_err(this->test,
> + "Failed to get string from second argument for 
> appending.\n");
> +   return;
> +   }
> +
> +   kunit_stream_add(this, other_content);
> +}
> +
> +void kunit_stream_clear(struct kunit_stream *this)
> +{
> +   string_stream_clear(this->internal_stream);
> +}
> +
> +void kunit_stream_commit(struct kunit_stream *this)

Should this be rather called kunit_stream_flush()?

> +{
> +   struct string_stream *stream = this->internal_stream;
> +   struct string_stream_fragment *fragment;
> +   const char *level;
> +   char *buf;
> +
> +   level = kunit_stream_get_level(this);
> +   if (!level) {
> +   kunit_err(this->test,
> + "Stream was committed without a specified log 
> level.\n");

Drop the full-stop?

> +   level = KERN_ERR;
> +   kunit_stream_set_level(this, level);
> +   }
> +
> +   buf = string_stream_get_string(stream);
> +   if (!buf) {
> +   kunit_err(this->test,

Can you grow a local variable for 'this->test'? It's used many times.

Also, 'this' is not very kernel idiomatic. We usually name variables by
their type instead of 'this' which is a keyword in other languages.
Perhaps it could be named 'kstream'?

> +"Could not allocate buffer, dumping stream:\n");
> +   list_for_each_entry(fragment, &stream->fragments, node) {
> +   kunit_err(this->test, fragment->fragment);
> +   }
> +   kunit_err(this->test, "\n");
> +   goto cleanup;
> +   }
> +
> +   kunit_printk(level, this->test, buf);
> +   kfree(buf);
> +
> +cleanup:
> +   kunit_stream_clear(this);
> +}
> +
> +static int kunit_stream_init(struct kunit_resource *res, void *context)
> +{
> +   struct kunit *test = context;
> +   struct kunit_stream *stream;
> +
> +   stream = kzalloc(sizeof(*stream), GFP_KERNEL);

Of course, here it's called 'stream', so maybe it should be 'kstream'
here too.

> +   if (!stream)
> +   return -ENOMEM;
> +
> +   res->allocation = stream;
> +   stream->test = test;
> +   spin_lock_init(&stream->lock);
> +   stream->internal_stream = new_string_stream();

Can new_string_stream() be renamed to alloc_string_stream()? Sorry, I
just see so much C++ isms in here it's hard to read from the kernel
developer perspective.

> +
> +   if (!stream->internal_stream) {

Nitpick: Please join this to the "allocation" event above instead of
keeping it separated.

> +   kfree(stream);
> +   return -ENOMEM;
> +   }
> +
> +   retur

Re: [PATCH v4 03/18] kunit: test: add string_stream a std::stream like string builder

2019-05-17 Thread Stephen Boyd
Quoting Brendan Higgins (2019-05-14 15:16:56)
> A number of test features need to do pretty complicated string printing
> where it may not be possible to rely on a single preallocated string
> with parameters.
> 
> So provide a library for constructing the string as you go similar to
> C++'s std::string.
> 
> Signed-off-by: Brendan Higgins 
> Reviewed-by: Greg Kroah-Hartman 
> Reviewed-by: Logan Gunthorpe 

Is there any reason why we can't use the seqfile API for this? These
both share a similar goal, formatting strings into a buffer to be read
later. Maybe some new APIs would be needed to extract the buffer
differently, but I hope we could share the code.

If it can't be used, can you please add the reasoning to the commit text
here?



Re: [ndctl PATCH v3 00/10] daxctl: add a new reconfigure-device command

2019-05-17 Thread Pavel Tatashin
> Hi Pavel,
>
> I've still not been able to hit this in my testing, is it something you
> hit only after applying these patches? i.e. does plain v65 work?

Yes, plain v65 works, but with these patches I see this error.

I use buildroot to build an initramfs with ndctl. Here is what ndctl.mk looks like:

NDCTL_VERSION = ed17fd1
NDCTL_SITE = $(call github,pmem,ndctl,$(NDCTL_VERSION))
NDCTL_LICENSE = GNU Lesser General Public License v2.1
NDCTL_LICENSE_FILES = COPYING

#NDCTL_AUTORECONF = YES
define NDCTL_RUN_AUTOGEN
cd $(@D) && PATH=$(BR_PATH) ./autogen.sh
endef
NDCTL_PRE_CONFIGURE_HOOKS += NDCTL_RUN_AUTOGEN
NDCTL_LDFLAGS = $(TARGET_LDFLAGS)
NDCTL_CONF_OPTS = \
--without-bash \
--without-keyutils \
--without-systemd

$(eval $(autotools-package))

This version works,  but when I change:
NDCTL_VERSION = 9ee82c8

I get add_dev error.

Pasha


Re: [ndctl PATCH v3 00/10] daxctl: add a new reconfigure-device command

2019-05-17 Thread Verma, Vishal L
On Fri, 2019-05-17 at 11:30 -0400, Pavel Tatashin wrote:
> On Thu, May 16, 2019 at 6:40 PM Vishal Verma wrote:
> > Changes in v3:
> >  - In daxctl_dev_get_mode(), remove the subsystem warning, detect
> > dax-class
> >and simply make it return devdax
> 
> Hi Vishal,
> 
> I am still getting the same error as before:
> 
> # ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
> [  141.525873] dax0.0 initialised, 524288 pages in 7ms
> libdaxctl: __sysfs_device_parse: dax0.0: add_dev() failed
> ...
> 
> I am building it via buildroot, and can share the initramfs file or
> anything that can help you with fixing this issue.

Hi Pavel,

I've still not been able to hit this in my testing, is it something you
hit only after applying these patches? i.e. does plain v65 work?


Re: NULL pointer dereference during memory hotremove

2019-05-17 Thread Pavel Tatashin
On Fri, May 17, 2019 at 1:24 PM Pavel Tatashin
 wrote:
>
> On Fri, May 17, 2019 at 1:22 PM Pavel Tatashin
>  wrote:
> >
> > On Fri, May 17, 2019 at 10:38 AM Michal Hocko  wrote:
> > >
> > > On Fri 17-05-19 10:20:38, Pavel Tatashin wrote:
> > > > This panic is unrelated to circular lock issue that I reported in a
> > > > separate thread, that also happens during memory hotremove.
> > > >
> > > > xakep ~/x/linux$ git describe
> > > > v5.1-12317-ga6a4b66bd8f4
> > >
> > > Does this happen on 5.0 as well?
> >
> > Yes, just reproduced it on 5.0 as well. Unfortunately, I do not have a
> > script, and have to do it manually, also it does not happen every
> > time, it happened on 3rd time for me.
>
> Actually, sorry, I have not tested 5.0, I compiled 5.0, but my script
> still tested v5.1-12317-ga6a4b66bd8f4 build. I will report later if I
> am able to reproduce it on 5.0.

OK, confirmed on 5.0 as well, took 4 tries to reproduce:
(qemu) [   17.104486] Offlined Pages 32768
[   17.105543] Built 1 zonelists, mobility grouping on.  Total pages: 1515892
[   17.106475] Policy zone: Normal
[   17.107029] BUG: unable to handle kernel NULL pointer dereference
at 0698
[   17.107645] #PF error: [normal kernel read fault]
[   17.108038] PGD 0 P4D 0
[   17.108287] Oops:  [#1] SMP PTI
[   17.108557] CPU: 5 PID: 313 Comm: kworker/u16:5 Not tainted 5.0.0_pt_pmem1 #2
[   17.109128] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.12.0-20181126_142135-anatol 04/01/2014
[   17.109910] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[   17.110323] RIP: 0010:__remove_pages+0x2f/0x520
[   17.110674] Code: 41 56 41 55 49 89 fd 41 54 55 53 48 89 d3 48 83
ec 68 48 89 4c 24 08 65 48 8b 04 25 28 00 00 00 48 89 44 24 60 31 c0
48 89 f8 <48> 2b 47 58 48 3d 00 19 00 00 0f 85 7f 03 00 00 48 85 c9 0f
84 df
[   17.112114] RSP: 0018:b43b815f3ca8 EFLAGS: 00010246
[   17.112518] RAX: 0640 RBX: 0004 RCX: 
[   17.113073] RDX: 0004 RSI: 0024 RDI: 0640
[   17.113615] RBP: 00024000 R08:  R09: 4000
[   17.114186] R10: 4000 R11: 00024000 R12: e382c900
[   17.114743] R13: 0640 R14: 0004 R15: 0024
[   17.115288] FS:  () GS:979539b4()
knlGS:
[   17.115911] CS:  0010 DS:  ES:  CR0: 80050033
[   17.116356] CR2: 0698 CR3: 000133c22004 CR4: 00360ee0
[   17.116913] DR0:  DR1:  DR2: 
[   17.117467] DR3:  DR6: fffe0ff0 DR7: 0400
[   17.118016] Call Trace:
[   17.118214]  ? memblock_isolate_range+0xc4/0x139
[   17.118570]  ? firmware_map_remove+0x48/0x90
[   17.118908]  arch_remove_memory+0x7b/0xc0
[   17.119216]  __remove_memory+0x93/0xc0
[   17.119528]  acpi_memory_device_remove+0x67/0xe0
[   17.119890]  acpi_bus_trim+0x50/0x90
[   17.120167]  acpi_device_hotplug+0x2fc/0x460
[   17.120498]  acpi_hotplug_work_fn+0x15/0x20
[   17.120834]  process_one_work+0x2a0/0x650
[   17.121146]  worker_thread+0x34/0x3d0
[   17.121432]  ? process_one_work+0x650/0x650
[   17.121772]  kthread+0x118/0x130
[   17.122032]  ? kthread_create_on_node+0x60/0x60
[   17.122413]  ret_from_fork+0x3a/0x50
[   17.122727] Modules linked in:
[   17.122983] CR2: 0698
[   17.123250] ---[ end trace 389c4034f6d42e6f ]---
[   17.123618] RIP: 0010:__remove_pages+0x2f/0x520
[   17.123979] Code: 41 56 41 55 49 89 fd 41 54 55 53 48 89 d3 48 83
ec 68 48 89 4c 24 08 65 48 8b 04 25 28 00 00 00 48 89 44 24 60 31 c0
48 89 f8 <48> 2b 47 58 48 3d 00 19 00 00 0f 85 7f 03 00 00 48 85 c9 0f
84 df
[   17.125410] RSP: 0018:b43b815f3ca8 EFLAGS: 00010246
[   17.125818] RAX: 0640 RBX: 0004 RCX: 
[   17.126359] RDX: 0004 RSI: 0024 RDI: 0640
[   17.126906] RBP: 00024000 R08:  R09: 4000
[   17.127453] R10: 4000 R11: 00024000 R12: e382c900
[   17.128008] R13: 0640 R14: 0004 R15: 0024
[   17.128555] FS:  () GS:979539b4()
knlGS:
[   17.129182] CS:  0010 DS:  ES:  CR0: 80050033
[   17.129627] CR2: 0698 CR3: 000133c22004 CR4: 00360ee0
[   17.130182] DR0:  DR1:  DR2: 
[   17.130744] DR3:  DR6: fffe0ff0 DR7: 0400
[   17.131293] BUG: sleeping function called from invalid context at
include/linux/percpu-rwsem.h:34
[   17.132050] in_atomic(): 0, irqs_disabled(): 1, pid: 313, name: kworker/u16:5
[   17.132596] INFO: lockdep is turned off.
[   17.132908] irq event stamp: 14046
[   17.133175] hardirqs last  enabled at (14045): []
kfree+0xba/0x230
[   17.133777] hardirqs last disabled at (14046): []
trace_hardi

Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

2019-05-17 Thread Dan Williams
On Fri, May 17, 2019 at 8:57 AM Kees Cook  wrote:
>
> On Fri, May 17, 2019 at 08:08:27AM -0700, Dan Williams wrote:
> > As far as I can see it's mostly check_heap_object() that is the
> > problem, so I'm open to finding a way to just bypass that sub-routine.
> > However, as far as I can see none of the other block / filesystem user
> > copy implementations submit to the hardened checks, like
> > bio_copy_from_iter(), and iov_iter_copy_from_user_atomic() . So,
> > either those need to grow additional checks, or the hardened copy
> > implementation is targeting single object copy use cases, not
> > necessarily block-I/O. Yes, Kees, please advise.
>
> The intention is mainly for copies that haven't had explicit bounds
> checking already performed on them, yes. Is there something getting
> checked out of the slab, or is it literally just the overhead of doing
> the "is this slab?" check that you're seeing?

It's literally the overhead of "is this slab?" since it needs to go
retrieve the struct page and read that potentially cold cacheline. In
the case where that page is on memory media that is higher latency
than DRAM we get the ~37% performance loss that Jeff measured.

The path is via the filesystem ->write_iter() file operation. In the
DAX case the filesystem traps that path early, before submitting block
I/O, and routes it to the dax_iomap_actor() routine. That routine
validates that the logical file offset is within bounds of the file,
then it does a sector-to-pfn translation which validates that the
physical mapping is within bounds of the block device.

It seems dax_iomap_actor() is not a path where we'd be worried about
needing hardened user copy checks.
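
To make the trade-off concrete, here is a sketch of the kind of change being
discussed for drivers/nvdimm/pmem.c: call the underscore-prefixed iterator
helpers, which are the existing variants that skip check_copy_size() and
therefore check_heap_object(). This is illustrative and may not match the
posted patch line for line:

static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
                void *addr, size_t bytes, struct iov_iter *i)
{
        /* bounds were already validated by dax_iomap_actor() */
        return _copy_from_iter_flushcache(addr, bytes, i);
}

static size_t pmem_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
                void *addr, size_t bytes, struct iov_iter *i)
{
        return _copy_to_iter_mcsafe(addr, bytes, i);
}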


Re: NULL pointer dereference during memory hotremove

2019-05-17 Thread Pavel Tatashin
On Fri, May 17, 2019 at 1:22 PM Pavel Tatashin
 wrote:
>
> On Fri, May 17, 2019 at 10:38 AM Michal Hocko  wrote:
> >
> > On Fri 17-05-19 10:20:38, Pavel Tatashin wrote:
> > > This panic is unrelated to circular lock issue that I reported in a
> > > separate thread, that also happens during memory hotremove.
> > >
> > > xakep ~/x/linux$ git describe
> > > v5.1-12317-ga6a4b66bd8f4
> >
> > Does this happen on 5.0 as well?
>
> Yes, just reproduced it on 5.0 as well. Unfortunately, I do not have a
> script, and have to do it manually, also it does not happen every
> time, it happened on 3rd time for me.

Actually, sorry, I have not tested 5.0, I compiled 5.0, but my script
still tested v5.1-12317-ga6a4b66bd8f4 build. I will report later if I
am able to reproduce it on 5.0.

Pasha


Re: NULL pointer dereference during memory hotremove

2019-05-17 Thread Pavel Tatashin
On Fri, May 17, 2019 at 10:38 AM Michal Hocko  wrote:
>
> On Fri 17-05-19 10:20:38, Pavel Tatashin wrote:
> > This panic is unrelated to circular lock issue that I reported in a
> > separate thread, that also happens during memory hotremove.
> >
> > xakep ~/x/linux$ git describe
> > v5.1-12317-ga6a4b66bd8f4
>
> Does this happen on 5.0 as well?

Yes, just reproduced it on 5.0 as well. Unfortunately, I do not have a
script, and have to do it manually, also it does not happen every
time, it happened on 3rd time for me.

Pasha


Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

2019-05-17 Thread Kees Cook
On Fri, May 17, 2019 at 04:14:03PM +, David Laight wrote:
> From: Kees Cook
> > Sent: 17 May 2019 16:54
> ...
> > > I've changed some of our code to use __get_user() to avoid
> > > these stupid overheads.
> > 
> > __get_user() skips even access_ok() checking too, so that doesn't seem
> > like a good idea. Did you run access_ok() checks separately? (This
> > generally isn't recommended.)
> 
> Of course, I'm not THAT stupid :-)

Right, yes, I know. :) I just wanted to double-check since accidents
can happen. The number of underscores on these function is not really
a great way to indicate what they're doing. ;)

-- 
Kees Cook


RE: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

2019-05-17 Thread David Laight
From: Kees Cook
> Sent: 17 May 2019 16:54
...
> > I've changed some of our code to use __get_user() to avoid
> > these stupid overheads.
> 
> __get_user() skips even access_ok() checking too, so that doesn't seem
> like a good idea. Did you run access_ok() checks separately? (This
> generally isn't recommended.)

Of course, I'm not THAT stupid :-)

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

2019-05-17 Thread Kees Cook
On Fri, May 17, 2019 at 08:08:27AM -0700, Dan Williams wrote:
> As far as I can see it's mostly check_heap_object() that is the
> problem, so I'm open to finding a way to just bypass that sub-routine.
> However, as far as I can see none of the other block / filesystem user
> copy implementations submit to the hardened checks, like
> bio_copy_from_iter(), and iov_iter_copy_from_user_atomic() . So,
> either those need to grow additional checks, or the hardened copy
> implementation is targeting single object copy use cases, not
> necessarily block-I/O. Yes, Kees, please advise.

The intention is mainly for copies that haven't had explicit bounds
checking already performed on them, yes. Is there something getting
checked out of the slab, or is it literally just the overhead of doing
the "is this slab?" check that you're seeing?

-- 
Kees Cook


Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

2019-05-17 Thread Kees Cook
On Fri, May 17, 2019 at 09:06:26AM +, David Laight wrote:
> From: Jan Kara
> > Sent: 17 May 2019 09:48
> ...
> > So this last paragraph is not obvious to me as check_copy_size() does a lot
> > of various checks in CONFIG_HARDENED_USERCOPY case. I agree that some of
> > those checks don't make sense for PMEM pages but I'd rather handle that by
> > refining check_copy_size() and check_object_size() functions to detect and
> > appropriately handle pmem pages rather that generally skip all the checks
> > in pmem_copy_from/to_iter(). And yes, every check in such hot path is going
> > to cost performance but that's what user asked for with
> > CONFIG_HARDENED_USERCOPY... Kees?
> 
> Except that all the distros enable it by default.
> So you get the performance cost whether you (as a user) want it or not.

Note that it can be disabled on the kernel command line, but that seems
like a last resort. :)

> 
> I've changed some of our code to use __get_user() to avoid
> these stupid overheads.

__get_user() skips even access_ok() checking too, so that doesn't seem
like a good idea. Did you run access_ok() checks separately? (This
generally isn't recommended.)
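
For readers following along, the pattern being discussed looks roughly like
the sketch below; it assumes the v5.1-era two-argument access_ok() and a
caller-validated count, and the function itself is made up:

long sum_user_ints(const int __user *uptr, size_t count)
{
        long sum = 0;
        size_t i;
        int val;

        /* one range check up front ... */
        if (!access_ok(uptr, count * sizeof(*uptr)))
                return -EFAULT;

        /* ... then the unchecked accessor in the hot loop */
        for (i = 0; i < count; i++) {
                if (__get_user(val, uptr + i))
                        return -EFAULT;
                sum += val;
        }
        return sum;
}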

The usercopy hardening is intended to only kick in when the copy size
isn't a builtin constant -- it's attempting to do a bounds check on
the size, with the pointer used to figure out what bounds checking is
possible (basically "is this stack? check stack location/frame size"
or "is this kmem cache? check the allocation size") and then do bounds
checks from there. It tries to bail out early to avoid needless checking,
so if there is some additional logic to be added to check_object_size()
that is globally applicable, sure, let's do it.
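
As a rough paraphrase of that flow (illustrative pseudocode modelled on the
description above, not the actual kernel implementation):

static __always_inline void usercopy_check_sketch(const void *ptr,
                                                  unsigned long n, bool to_user)
{
        if (__builtin_constant_p(n))
                return; /* constant size: bail out early, no runtime check */

        if (object_is_on_stack(ptr)) {
                /* "is this stack?": check n against the frame/stack bounds */
        } else if (virt_addr_valid(ptr) && PageSlab(virt_to_head_page(ptr))) {
                /*
                 * "is this kmem cache?": fetching the struct page here is the
                 * potentially cold-cacheline access discussed in this thread;
                 * then check n against the object's allocation size.
                 */
        }
        /* otherwise: no finer-grained bounds information is available */
}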

I'm still not clear from this thread about the case that is getting
tripped/slowed? If you're already doing bounds checks somewhere else
and there isn't a chance for the pointer or size to change, then yeah,
it seems safe to skip the usercopy size checks. But what's the actual
code that is having a problem?

-- 
Kees Cook


Re: [ndctl PATCH v3 00/10] daxctl: add a new reconfigure-device command

2019-05-17 Thread Pavel Tatashin
On Thu, May 16, 2019 at 6:40 PM Vishal Verma  wrote:
>
> Changes in v3:
>  - In daxctl_dev_get_mode(), remove the subsystem warning, detect dax-class
>and simply make it return devdax

Hi Vishal,

I am still getting the same error as before:

# ndctl create-namespace --mode devdax --map mem -e namespace0.0 -f
[  141.525873] dax0.0 initialised, 524288 pages in 7ms
libdaxctl: __sysfs_device_parse: dax0.0: add_dev() failed
...

I am building it via buildroot, and can share the initramfs file or
anything that can help you with fixing this issue.

Thank you,
Pavel


Re: [PATCH] mm/nvdimm: Pick the right alignment default when creating dax devices

2019-05-17 Thread Aneesh Kumar K.V

On 5/17/19 8:19 PM, Vaibhav Jain wrote:

Hi Aneesh,

Apart from a minor review comment for changes to nd_pfn_validate() the
patch looks good to me.

Also, I tested this patch on a PPC64 qemu guest with a virtual nvdimm and
verified that the default alignment of a newly created devdax namespace was
64KiB instead of 16MiB. Below are the test results:

* Without the patch, creating a devdax namespace results in a namespace
   with a 16MiB default alignment. Using daxio to zero out the dax device
   results in a SIGBUS and a hashing failure.

   $ sudo ndctl create-namespace --mode=devdax  | grep align
 "align":16777216,
   "align":16777216

   $ sudo cat /sys/devices/ndbus0/region0/dax0.0/supported_alignments
   65536 16777216

   $ sudo daxio.static-debug  -z -o /dev/dax0.0
   Bus error (core dumped)

   $ dmesg | tail
   [  438.738958] lpar: Failed hash pte insert with error -4
   [  438.739412] hash-mmu: mm: Hashing failure ! EA=0x7fff1700 
access=0x8006 current=daxio
   [  438.739760] hash-mmu: trap=0x300 vsid=0x22cb7a3 ssize=1 base psize=2 
psize 10 pte=0xc00501002b86
   [  438.740143] daxio[3860]: bus error (7) at 7fff1700 nip 7fff973c007c 
lr 7fff973bff34 code 2 in libpmem.so.1.0.0[7fff973b+2]
   [  438.740634] daxio[3860]: code: 792945e4 7d494b78 e95f0098 7d494b78 
f93f00a0 4800012c e93f0088 f93f0120
   [  438.741015] daxio[3860]: code: e93f00a0 f93f0128 e93f0120 e95f0128 
 e93f0088 39290008 f93f0110

* With the patch, creating a devdax namespace results in a namespace
   with a 64KiB default alignment. Using daxio to zero out the dax device
   succeeds:
   
   $ sudo ndctl create-namespace --mode=devdax  | grep align

 "align":65536,
   "align":65536

   $ sudo cat /sys/devices/ndbus0/region0/dax0.0/supported_alignments
   65536

   $ daxio -z -o /dev/dax0.0
   daxio: copied 2130706432 bytes to device "/dev/dax0.0"

Hence,

Tested-by: Vaibhav Jain 

"Aneesh Kumar K.V"  writes:


Allow the arch to provide the supported alignments and use hugepage alignment only
if we support hugepages. Right now we depend on compile-time configs, whereas this
patch switches this to runtime discovery.

Architectures like ppc64 can have THP enabled in code, but then can have
hugepage size disabled by the hypervisor. This allows us to create dax devices
with PAGE_SIZE alignment in this case.

Existing dax namespace with alignment larger than PAGE_SIZE will fail to
initialize in this specific case. We still allow fsdax namespace initialization.

With respect to identifying whether to enable hugepage faults for a dax device:
if THP is enabled at compile time, we default to taking hugepage faults, and in
the dax fault handler, if we find the fault size > alignment, we retry with a
PAGE_SIZE fault size.

Signed-off-by: Aneesh Kumar K.V 
---
  arch/powerpc/include/asm/libnvdimm.h |  9 
  arch/powerpc/mm/Makefile |  1 +
  arch/powerpc/mm/nvdimm.c | 34 
  arch/x86/include/asm/libnvdimm.h | 19 
  drivers/nvdimm/nd.h  |  6 -
  drivers/nvdimm/pfn_devs.c| 32 +-
  include/linux/huge_mm.h  |  7 +-
  7 files changed, 100 insertions(+), 8 deletions(-)
  create mode 100644 arch/powerpc/include/asm/libnvdimm.h
  create mode 100644 arch/powerpc/mm/nvdimm.c
  create mode 100644 arch/x86/include/asm/libnvdimm.h

diff --git a/arch/powerpc/include/asm/libnvdimm.h b/arch/powerpc/include/asm/libnvdimm.h
new file mode 100644
index ..d35fd7f48603
--- /dev/null
+++ b/arch/powerpc/include/asm/libnvdimm.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_LIBNVDIMM_H
+#define _ASM_POWERPC_LIBNVDIMM_H
+
+#define nd_pfn_supported_alignments nd_pfn_supported_alignments
+extern unsigned long *nd_pfn_supported_alignments(void);
+extern unsigned long nd_pfn_default_alignment(void);
+
+#endif
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index 0f499db315d6..42e4a399ba5d 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -20,3 +20,4 @@ obj-$(CONFIG_HIGHMEM) += highmem.o
  obj-$(CONFIG_PPC_COPRO_BASE)  += copro_fault.o
  obj-$(CONFIG_PPC_PTDUMP)  += ptdump/
  obj-$(CONFIG_KASAN)   += kasan/
+obj-$(CONFIG_NVDIMM_PFN)   += nvdimm.o
diff --git a/arch/powerpc/mm/nvdimm.c b/arch/powerpc/mm/nvdimm.c
new file mode 100644
index ..a29a4510715e
--- /dev/null
+++ b/arch/powerpc/mm/nvdimm.c
@@ -0,0 +1,34 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+#include 
+
+#include 
+/*
+ * We support only pte and pmd mappings for now.
+ */
+const unsigned long *nd_pfn_supported_alignments(void)
+{
+   static unsigned long supported_alignments[3];
+
+   supported_alignments[0] = PAGE_SIZE;
+
+   if (has_transparent_hugepage())
+   supported_alignments[1] = HPAGE_PMD_SIZE;
+   else
+   supported_alignments[1] = 0;
+
+   supported_alignments[2] = 0;
+   
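
As a rough, illustrative sketch (not code from the posted patch) of the
fallback behaviour described in the commit message (take the hugepage fault
path when THP is compiled in, but retry with a PAGE_SIZE fault when the
requested fault size exceeds the device alignment), a dax huge_fault handler
could look roughly like the following; dev_dax_fault_alignment() is a
hypothetical helper standing in for however the driver looks up the namespace
alignment:

/*
 * Illustrative sketch only -- not from the posted patch.
 * dev_dax_fault_alignment() is a hypothetical helper returning the
 * alignment the namespace was created with.
 */
static vm_fault_t sketch_dax_huge_fault(struct vm_fault *vmf,
					enum page_entry_size pe_size)
{
	unsigned long fault_size;

	switch (pe_size) {
	case PE_SIZE_PTE:
		fault_size = PAGE_SIZE;
		break;
	case PE_SIZE_PMD:
		fault_size = PMD_SIZE;
		break;
	default:
		return VM_FAULT_FALLBACK;
	}

	/*
	 * THP may be compiled in while the device alignment is smaller
	 * (e.g. 64KiB on ppc64 with hugepages disabled by the hypervisor).
	 * Returning VM_FAULT_FALLBACK makes the core MM retry the fault
	 * with a PAGE_SIZE mapping.
	 */
	if (fault_size > dev_dax_fault_alignment(vmf->vma))
		return VM_FAULT_FALLBACK;

	/* ... install the mapping at fault_size granularity ... */
	return VM_FAULT_NOPAGE;
}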

Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

2019-05-17 Thread Dan Williams
On Fri, May 17, 2019 at 1:47 AM Jan Kara  wrote:
>
> Let's add Kees to CC for usercopy expertise...
>
> On Thu 16-05-19 17:33:38, Dan Williams wrote:
> > Jeff discovered that performance improves from ~375K iops to ~519K iops
> > on a simple psync-write fio workload when moving the location of 'struct
> > page' from the default PMEM location to DRAM. This result is surprising
> > because the expectation is that 'struct page' for dax is only needed for
> > third party references to dax mappings. For example, a dax-mapped buffer
> > passed to another system call for direct-I/O requires 'struct page' for
> > sending the request down the driver stack and pinning the page. There is
> > no usage of 'struct page' for first party access to a file via
> > read(2)/write(2) and friends.
> >
> > However, this "no page needed" expectation is violated by
> > CONFIG_HARDENED_USERCOPY and the check_copy_size() performed in
> > copy_from_iter_full_nocache() and copy_to_iter_mcsafe(). The
> > check_heap_object() helper routine assumes the buffer is backed by a
> > page-allocator DRAM page and applies some checks. Those checks are invalid
> > (dax pages are not from the heap) and redundant (dax_iomap_actor() has
> > already validated that the I/O is within bounds).
>
> So this last paragraph is not obvious to me, as check_copy_size() does a lot
> of various checks in the CONFIG_HARDENED_USERCOPY case. I agree that some of
> those checks don't make sense for PMEM pages, but I'd rather handle that by
> refining the check_copy_size() and check_object_size() functions to detect and
> appropriately handle pmem pages rather than generally skipping all the checks
> in pmem_copy_from/to_iter(). And yes, every check in such a hot path is going
> to cost performance, but that's what the user asked for with
> CONFIG_HARDENED_USERCOPY... Kees?

As far as I can see it's mostly check_heap_object() that is the
problem, so I'm open to finding a way to just bypass that sub-routine.
However, none of the other block/filesystem user-copy implementations
appear to submit to the hardened checks, e.g. bio_copy_from_iter() and
iov_iter_copy_from_user_atomic(). So either those need to grow additional
checks, or the hardened copy implementation is targeting single-object
copy use cases, not necessarily block I/O. Yes, Kees, please advise.
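
For concreteness, one shape the refinement suggested above could take (an
untested, illustrative sketch only, loosely modeled on the mm/usercopy.c
structure of that era, not something that has been posted): teach
check_heap_object() to recognise ZONE_DEVICE (dax/pmem) backing pages and
return early, so dax buffers skip the slab/page-allocator checks while
ordinary copies keep the full hardening:

/*
 * Illustrative sketch only: skip the heap checks for ZONE_DEVICE pages.
 * The surrounding structure approximates mm/usercopy.c around v5.1.
 */
static void check_heap_object(const void *ptr, unsigned long n, bool to_user)
{
	struct page *page;

	if (!virt_addr_valid(ptr))
		return;

	page = virt_to_head_page(ptr);

	/*
	 * dax/pmem buffers are not heap objects; their bounds are already
	 * validated by dax_iomap_actor(), so the slab and page-span checks
	 * below are neither applicable nor needed for them.
	 */
	if (is_zone_device_page(page))
		return;

	if (PageSlab(page)) {
		/* Check the slab allocator for flags and size. */
		__check_heap_object(ptr, n, page, to_user);
	} else {
		/* Verify the object does not incorrectly span multiple pages. */
		check_page_span(ptr, n, page, to_user);
	}
}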


Re: NULL pointer dereference during memory hotremove

2019-05-17 Thread David Hildenbrand
On 17.05.19 16:38, Michal Hocko wrote:
> On Fri 17-05-19 10:20:38, Pavel Tatashin wrote:
>> This panic is unrelated to the circular lock issue that I reported in a
>> separate thread, which also happens during memory hotremove.
>>
>> xakep ~/x/linux$ git describe
>> v5.1-12317-ga6a4b66bd8f4
> 
> Does this happen on 5.0 as well?
> 

We have on the list

[PATCH V3 1/4] mm/hotplug: Reorder arch_remove_memory() call in
__remove_memory()

Can that help?

-- 

Thanks,

David / dhildenb


Re: [PATCH] mm/nvdimm: Pick the right alignment default when creating dax devices

2019-05-17 Thread Vaibhav Jain
Hi Aneesh,

Apart from a minor review comment on the changes to nd_pfn_validate(), the
patch looks good to me.

Also, I tested this patch on a PPC64 qemu guest with a virtual nvdimm and
verified that the default alignment of a newly created devdax namespace was
64KiB instead of 16MiB. Below are the test results:

* Without the patch, creating a devdax namespace results in a namespace
  with a 16MiB default alignment. Using daxio to zero out the dax device
  results in a SIGBUS and a hashing failure.

  $ sudo ndctl create-namespace --mode=devdax  | grep align
"align":16777216,
  "align":16777216

  $ sudo cat /sys/devices/ndbus0/region0/dax0.0/supported_alignments
  65536 16777216

  $ sudo daxio.static-debug  -z -o /dev/dax0.0
  Bus error (core dumped)

  $ dmesg | tail
  [  438.738958] lpar: Failed hash pte insert with error -4
  [  438.739412] hash-mmu: mm: Hashing failure ! EA=0x7fff1700 
access=0x8006 current=daxio
  [  438.739760] hash-mmu: trap=0x300 vsid=0x22cb7a3 ssize=1 base psize=2 
psize 10 pte=0xc00501002b86
  [  438.740143] daxio[3860]: bus error (7) at 7fff1700 nip 7fff973c007c lr 
7fff973bff34 code 2 in libpmem.so.1.0.0[7fff973b+2]
  [  438.740634] daxio[3860]: code: 792945e4 7d494b78 e95f0098 7d494b78 
f93f00a0 4800012c e93f0088 f93f0120 
  [  438.741015] daxio[3860]: code: e93f00a0 f93f0128 e93f0120 e95f0128 
 e93f0088 39290008 f93f0110 

* With the patch, creating a devdax namespace results in a namespace
  with a 64KiB default alignment. Using daxio to zero out the dax device
  succeeds:
  
  $ sudo ndctl create-namespace --mode=devdax  | grep align
"align":65536,
  "align":65536

  $ sudo cat /sys/devices/ndbus0/region0/dax0.0/supported_alignments
  65536

  $ daxio -z -o /dev/dax0.0
  daxio: copied 2130706432 bytes to device "/dev/dax0.0"

Hence,

Tested-by: Vaibhav Jain 

"Aneesh Kumar K.V"  writes:

> Allow the arch to provide the supported alignments and use hugepage alignment
> only if we support hugepages. Right now we depend on compile-time configs,
> whereas this patch switches this to runtime discovery.
>
> Architectures like ppc64 can have THP enabled in code but still have the
> hugepage size disabled by the hypervisor. This allows us to create dax devices
> with PAGE_SIZE alignment in this case.
>
> An existing dax namespace with an alignment larger than PAGE_SIZE will fail to
> initialize in this specific case. We still allow fsdax namespace
> initialization.
>
> With respect to identifying whether to enable hugepage faults for a dax device:
> if THP is enabled at compile time, we default to taking a hugepage fault, and
> in the dax fault handler, if we find the fault size > alignment, we retry with
> a PAGE_SIZE fault size.
>
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/libnvdimm.h |  9 
>  arch/powerpc/mm/Makefile |  1 +
>  arch/powerpc/mm/nvdimm.c | 34 
>  arch/x86/include/asm/libnvdimm.h | 19 
>  drivers/nvdimm/nd.h  |  6 -
>  drivers/nvdimm/pfn_devs.c| 32 +-
>  include/linux/huge_mm.h  |  7 +-
>  7 files changed, 100 insertions(+), 8 deletions(-)
>  create mode 100644 arch/powerpc/include/asm/libnvdimm.h
>  create mode 100644 arch/powerpc/mm/nvdimm.c
>  create mode 100644 arch/x86/include/asm/libnvdimm.h
>
> diff --git a/arch/powerpc/include/asm/libnvdimm.h b/arch/powerpc/include/asm/libnvdimm.h
> new file mode 100644
> index ..d35fd7f48603
> --- /dev/null
> +++ b/arch/powerpc/include/asm/libnvdimm.h
> @@ -0,0 +1,9 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_POWERPC_LIBNVDIMM_H
> +#define _ASM_POWERPC_LIBNVDIMM_H
> +
> +#define nd_pfn_supported_alignments nd_pfn_supported_alignments
> +extern unsigned long *nd_pfn_supported_alignments(void);
> +extern unsigned long nd_pfn_default_alignment(void);
> +
> +#endif
> diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
> index 0f499db315d6..42e4a399ba5d 100644
> --- a/arch/powerpc/mm/Makefile
> +++ b/arch/powerpc/mm/Makefile
> @@ -20,3 +20,4 @@ obj-$(CONFIG_HIGHMEM)   += highmem.o
>  obj-$(CONFIG_PPC_COPRO_BASE) += copro_fault.o
>  obj-$(CONFIG_PPC_PTDUMP) += ptdump/
>  obj-$(CONFIG_KASAN)  += kasan/
> +obj-$(CONFIG_NVDIMM_PFN) += nvdimm.o
> diff --git a/arch/powerpc/mm/nvdimm.c b/arch/powerpc/mm/nvdimm.c
> new file mode 100644
> index ..a29a4510715e
> --- /dev/null
> +++ b/arch/powerpc/mm/nvdimm.c
> @@ -0,0 +1,34 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include 
> +#include 
> +
> +#include 
> +/*
> + * We support only pte and pmd mappings for now.
> + */
> +const unsigned long *nd_pfn_supported_alignments(void)
> +{
> + static unsigned long supported_alignments[3];
> +
> + supported_alignments[0] = PAGE_SIZE;
> +
> + if (has_transparent_hugepage())
> + supported_alignments[1] = HPAGE_PMD_SIZE;
> + else
> + s

Re: NULL pointer dereference during memory hotremove

2019-05-17 Thread Michal Hocko
On Fri 17-05-19 10:20:38, Pavel Tatashin wrote:
> This panic is unrelated to the circular lock issue that I reported in a
> separate thread, which also happens during memory hotremove.
> 
> xakep ~/x/linux$ git describe
> v5.1-12317-ga6a4b66bd8f4

Does this happen on 5.0 as well?
-- 
Michal Hocko
SUSE Labs


NULL pointer dereference during memory hotremove

2019-05-17 Thread Pavel Tatashin
This panic is unrelated to the circular lock issue that I reported in a
separate thread, which also happens during memory hotremove.

xakep ~/x/linux$ git describe
v5.1-12317-ga6a4b66bd8f4

Config is attached; the qemu script follows:

qemu-system-x86_64  \
-enable-kvm \
-cpu host   \
-parallel none  \
-echr 1 \
-serial none\
-chardev stdio,id=console,signal=off,mux=on \
-serial chardev:console \
-mon chardev=console\
-vga none   \
-display none   \
-kernel pmem/native/arch/x86/boot/bzImage   \
-m 8G,slots=1,maxmem=16G\
-smp 8  \
-fsdev local,id=virtfs1,path=/,security_model=none  \
-device virtio-9p-pci,fsdev=virtfs1,mount_tag=hostfs\
-append 'earlyprintk=serial,ttyS0,115200 console=ttyS0
TERM=xterm ip=dhcp memmap=2G!6G loglevel=7'

The unusual thing about this script is that 2G is reserved for a pmem device
via memmap=2G!6G; otherwise it is a normal layout. Unfortunately, the panic
does not happen every time, but I have hit it a couple of times.


# QEMU 4.0.0 monitor - type 'help' for more information
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
# echo online_movable > /sys/devices/system/memory/memory79/state
[   40.219090] Built 1 zonelists, mobility grouping on.  Total pages: 1529279
[   40.223258] Policy zone: Normal
# (qemu) device_del dimm1
(qemu) [   49.624600] Offlined Pages 32768
[   49.625796] Built 1 zonelists, mobility grouping on.  Total pages: 1516352
[   49.627841] Policy zone: Normal
[   49.630932] BUG: kernel NULL pointer dereference, address: 0698
[   49.633704] #PF: supervisor read access in kernel mode
[   49.635689] #PF: error_code(0x) - not-present page
[   49.637620] PGD 800236b59067 P4D 800236b59067 PUD 2358fe067 PMD 0
[   49.640163] Oops:  [#1] SMP PTI
[   49.641223] CPU: 0 PID: 7 Comm: kworker/u16:0 Not tainted 5.1.0_pt_pmem #38
[   49.643183] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.12.0-20181126_142135-anatol 04/01/2014
[   49.645858] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[   49.647101] RIP: 0010:__remove_pages+0x1a/0x460
[   49.648165] Code: e9 bb a9 fd ff 0f 0b 66 0f 1f 84 00 00 00 00 00
41 57 48 89 f8 49 89 ff 41 56 49 89 f6 41 55 41 54 55 53 48 89 d3 48
83 ec 50 <48> 2b 47 58 48 89 4c 24 48 48 3d 00 19 00 00 75 09 48 85 c9
0f 85
[   49.651925] RSP: 0018:bd1000c8fcb8 EFLAGS: 00010286
[   49.652857] RAX: 0640 RBX: 0004 RCX: 
[   49.654139] RDX: 0004 RSI: 0024 RDI: 0640
[   49.655393] RBP:  R08:  R09: 4000
[   49.656523] R10: 4000 R11: 00024000 R12: 4000
[   49.657654] R13: 00024000 R14: 0024 R15: 0640
[   49.658828] FS:  () GS:9b4bf980()
knlGS:
[   49.660178] CS:  0010 DS:  ES:  CR0: 80050033
[   49.661033] CR2: 0698 CR3: 0002382e0006 CR4: 00360ef0
[   49.662114] DR0:  DR1:  DR2: 
[   49.663172] DR3:  DR6: fffe0ff0 DR7: 0400
[   49.664243] Call Trace:
[   49.664622]  ? memblock_isolate_range+0xc4/0x139
[   49.665290]  ? firmware_map_add_hotplug+0x7e/0xde
[   49.665908]  ? memblock_remove_region+0x30/0x74
[   49.666498]  arch_remove_memory+0x6f/0xa0
[   49.667012]  __remove_memory+0xab/0x130
[   49.667492]  ? walk_memory_range+0xa1/0xe0
[   49.668008]  acpi_memory_device_remove+0x67/0xe0
[   49.668595]  acpi_bus_trim+0x50/0x90
[   49.669051]  acpi_device_hotplug+0x2fa/0x3e0
[   49.669590]  acpi_hotplug_work_fn+0x15/0x20
[   49.670116]  process_one_work+0x2a0/0x650
[   49.670577]  worker_thread+0x34/0x3d0
[   49.670997]  ? process_one_work+0x650/0x650
[   49.671503]  kthread+0x118/0x130
[   49.671879]  ? kthread_create_on_node+0x60/0x60
[   49.672411]  ret_from_fork+0x3a/0x50
[   49.672836] Modules linked in:
[   49.673190] CR2: 0698
[   49.673583] ---[ end trace 6b727d3a8ce48aa1 ]---
[   49.674120] RIP: 0010:__remove_pages+0x1a/0x460
[   49.674624] Code: e9 bb a9 fd ff 0f 0b 66 0f 1f 84 00 00 00 00 00
41 57 48 89 f8 49 89 ff 41 56 49 89 f6 41 55 41 54 55

Re: [v5 0/3] "Hotremove" persistent memory

2019-05-17 Thread Pavel Tatashin
>
> I would think that ACPI hotplug would have a similar problem, but it does this:
>
> acpi_unbind_memory_blocks(info);
> __remove_memory(nid, info->start_addr, info->length);

ACPI does have exactly the same problem, so this is not a bug in this
series. I will submit a new version of my series with the comments
addressed, but without a fix for this issue.

I was able to reproduce this issue on the current mainline kernel.
Also, I have been thinking more about how to fix it, and there is no easy
fix without a major hotplug redesign. Basically, we have to remove the
sysfs memory entries before or after the memory is hotplugged/hotremoved,
but we also have to guarantee that the hotplug/hotremove will either
succeed or reinstate the sysfs entries.

Qemu script:

qemu-system-x86_64  \
-enable-kvm \
-cpu host   \
-parallel none  \
-echr 1 \
-serial none\
-chardev stdio,id=console,signal=off,mux=on \
-serial chardev:console \
-mon chardev=console\
-vga none   \
-display none   \
-kernel pmem/native/arch/x86/boot/bzImage   \
-m 8G,slots=1,maxmem=16G\
-smp 8  \
-fsdev local,id=virtfs1,path=/,security_model=none  \
-device virtio-9p-pci,fsdev=virtfs1,mount_tag=hostfs\
-append 'earlyprintk=serial,ttyS0,115200 console=ttyS0
TERM=xterm ip=dhcp loglevel=7'

Config is attached.

Steps to reproduce:
#
# QEMU 4.0.0 monitor - type 'help' for more information
(qemu) object_add memory-backend-ram,id=mem1,size=1G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
(qemu)

# echo online_movable > /sys/devices/system/memory/memory79/state
[   23.029552] Built 1 zonelists, mobility grouping on.  Total pages: 2045370
[   23.032591] Policy zone: Normal
# (qemu) device_del dimm1
(qemu) [   32.013950] Offlined Pages 32768
[   32.014307] Built 1 zonelists, mobility grouping on.  Total pages: 2031022
[   32.014843] Policy zone: Normal
[   32.015733]
[   32.015881] ==
[   32.016390] WARNING: possible circular locking dependency detected
[   32.016881] 5.1.0_pt_pmem #38 Not tainted
[   32.017202] --
[   32.017680] kworker/u16:4/380 is trying to acquire lock:
[   32.018096] 675cc7e1 (kn->count#18){}, at:
kernfs_remove_by_name_ns+0x3b/0x80
[   32.018745]
[   32.018745] but task is already holding lock:
[   32.019201] 53e50a99 (mem_sysfs_mutex){+.+.}, at:
unregister_memory_section+0x1d/0xa0
[   32.019859]
[   32.019859] which lock already depends on the new lock.
[   32.019859]
[   32.020499]
[   32.020499] the existing dependency chain (in reverse order) is:
[   32.021080]
[   32.021080] -> #4 (mem_sysfs_mutex){+.+.}:
[   32.021522]__mutex_lock+0x8b/0x900
[   32.021843]hotplug_memory_register+0x26/0xa0
[   32.022231]__add_pages+0xe7/0x160
[   32.022545]add_pages+0xd/0x60
[   32.022835]add_memory_resource+0xc3/0x1d0
[   32.023207]__add_memory+0x57/0x80
[   32.023530]acpi_memory_device_add+0x13a/0x2d0
[   32.023928]acpi_bus_attach+0xf1/0x200
[   32.024272]acpi_bus_scan+0x3e/0x90
[   32.024597]acpi_device_hotplug+0x284/0x3e0
[   32.024972]acpi_hotplug_work_fn+0x15/0x20
[   32.025342]process_one_work+0x2a0/0x650
[   32.025755]worker_thread+0x34/0x3d0
[   32.026077]kthread+0x118/0x130
[   32.026442]ret_from_fork+0x3a/0x50
[   32.026766]
[   32.026766] -> #3 (mem_hotplug_lock.rw_sem){}:
[   32.027261]get_online_mems+0x39/0x80
[   32.027600]kmem_cache_create_usercopy+0x29/0x2c0
[   32.028019]kmem_cache_create+0xd/0x10
[   32.028367]ptlock_cache_init+0x1b/0x23
[   32.028724]start_kernel+0x1d2/0x4b8
[   32.029060]secondary_startup_64+0xa4/0xb0
[   32.029447]
[   32.029447] -> #2 (cpu_hotplug_lock.rw_sem){}:
[   32.030007]cpus_read_lock+0x39/0x80
[   32.030360]__offline_pages+0x32/0x790
[   32.030709]memory_subsys_offline+0x3a/0x60
[   32.031089]device_offline+0x7e/0xb0
[   32.031425]acpi_bus_offline+0xd8/0x140
[   32.031821]acpi_device_hotplug+0x1b2/0x3e0
[   32.032202]acpi_hotplug_work_fn+0x15/0x20
[   32.032576]process_o

RE: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

2019-05-17 Thread David Laight
From: Jan Kara
> Sent: 17 May 2019 09:48
...
> So this last paragraph is not obvious to me, as check_copy_size() does a lot
> of various checks in the CONFIG_HARDENED_USERCOPY case. I agree that some of
> those checks don't make sense for PMEM pages, but I'd rather handle that by
> refining the check_copy_size() and check_object_size() functions to detect and
> appropriately handle pmem pages rather than generally skipping all the checks
> in pmem_copy_from/to_iter(). And yes, every check in such a hot path is going
> to cost performance, but that's what the user asked for with
> CONFIG_HARDENED_USERCOPY... Kees?

Except that all the distros enable it by default.
So you get the performance cost whether you (as a user) want it or not.

I've changed some of our code to use __get_user() to avoid
these stupid overheads.

David



Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

2019-05-17 Thread Jan Kara
Let's add Kees to CC for usercopy expertise...

On Thu 16-05-19 17:33:38, Dan Williams wrote:
> Jeff discovered that performance improves from ~375K iops to ~519K iops
> on a simple psync-write fio workload when moving the location of 'struct
> page' from the default PMEM location to DRAM. This result is surprising
> because the expectation is that 'struct page' for dax is only needed for
> third party references to dax mappings. For example, a dax-mapped buffer
> passed to another system call for direct-I/O requires 'struct page' for
> sending the request down the driver stack and pinning the page. There is
> no usage of 'struct page' for first party access to a file via
> read(2)/write(2) and friends.
> 
> However, this "no page needed" expectation is violated by
> CONFIG_HARDENED_USERCOPY and the check_copy_size() performed in
> copy_from_iter_full_nocache() and copy_to_iter_mcsafe(). The
> check_heap_object() helper routine assumes the buffer is backed by a
> page-allocator DRAM page and applies some checks. Those checks are invalid
> (dax pages are not from the heap) and redundant (dax_iomap_actor() has
> already validated that the I/O is within bounds).

So this last paragraph is not obvious to me, as check_copy_size() does a lot
of various checks in the CONFIG_HARDENED_USERCOPY case. I agree that some of
those checks don't make sense for PMEM pages, but I'd rather handle that by
refining the check_copy_size() and check_object_size() functions to detect and
appropriately handle pmem pages rather than generally skipping all the checks
in pmem_copy_from/to_iter(). And yes, every check in such a hot path is going
to cost performance, but that's what the user asked for with
CONFIG_HARDENED_USERCOPY... Kees?

Honza

> 
> Bypass this overhead and call the 'no check' versions of the
> copy_{to,from}_iter operations directly.
> 
> Fixes: 0aed55af8834 ("x86, uaccess: introduce copy_from_iter_flushcache...")
> Cc: Jan Kara 
> Cc: 
> Cc: Jeff Moyer 
> Cc: Ingo Molnar 
> Cc: Christoph Hellwig 
> Cc: Al Viro 
> Cc: Thomas Gleixner 
> Cc: Matthew Wilcox 
> Reported-and-tested-by: Jeff Smits 
> Signed-off-by: Dan Williams 
> ---
>  drivers/nvdimm/pmem.c |9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 845c5b430cdd..c894f45e5077 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -281,16 +281,21 @@ static long pmem_dax_direct_access(struct dax_device 
> *dax_dev,
>   return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
>  }
>  
> +/*
> + * Use the 'no check' versions of copy_from_iter_flushcache() and
> + * copy_to_iter_mcsafe() to bypass HARDENED_USERCOPY overhead. Bounds
> + * checking is handled by dax_iomap_actor()
> + */
>  static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
>   void *addr, size_t bytes, struct iov_iter *i)
>  {
> - return copy_from_iter_flushcache(addr, bytes, i);
> + return _copy_from_iter_flushcache(addr, bytes, i);
>  }
>  
>  static size_t pmem_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
>   void *addr, size_t bytes, struct iov_iter *i)
>  {
> - return copy_to_iter_mcsafe(addr, bytes, i);
> + return _copy_to_iter_mcsafe(addr, bytes, i);
>  }
>  
>  static const struct dax_operations pmem_dax_ops = {
> 
-- 
Jan Kara 
SUSE Labs, CR
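
For reference on why switching to the underscore-prefixed helpers sidesteps
the hardening: the checked wrappers in include/linux/uio.h are thin inlines
that run check_copy_size() before delegating to the '_' variants, so calling
the latter directly skips exactly that check and nothing else. Roughly
(paraphrased from memory of the kernels of that era; treat as a sketch rather
than an exact quote of the header):

/* Approximate shape of the checked wrappers in include/linux/uio.h (~v5.1). */
static __always_inline __must_check
size_t copy_from_iter_flushcache(void *addr, size_t bytes, struct iov_iter *i)
{
	/* HARDENED_USERCOPY bounds/heap checking happens here ... */
	if (unlikely(!check_copy_size(addr, bytes, false)))
		return 0;
	/* ... before the actual flushcache copy is performed. */
	return _copy_from_iter_flushcache(addr, bytes, i);
}

static __always_inline __must_check
size_t copy_to_iter_mcsafe(void *addr, size_t bytes, struct iov_iter *i)
{
	if (unlikely(!check_copy_size(addr, bytes, true)))
		return 0;
	return _copy_to_iter_mcsafe(addr, bytes, i);
}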