Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem

2017-04-07 Thread Stephen Bates
On 2017-04-06, 6:33 AM, "Sagi Grimberg" wrote:

> Say it's connected via 2 legs, the bar is accessed from leg A and the
> data from the disk comes via leg B. In this case, the data is heading
> towards the p2p device via leg B (might be congested), the completion
> goes directly to the RC, and then the host issues a read from the
> bar via leg A. I don't understand what can guarantee ordering here.

> Stephen told me that this still guarantees ordering, but I honestly
> can't understand how, perhaps someone can explain to me in a simple
> way that I can understand.

Sagi

As long as legA, legB and the RC are all connected to the same switch then 
ordering will be preserved (I think many other topologies also work). Here is 
how it would work for the problem case you are concerned about (which is a read 
from the NVMe drive).

1. Disk device DMAs out the data to the p2pmem device via a string of PCIe 
MemWr TLPs.
2. Disk device writes to the completion queue (in system memory) via a MemWr 
TLP.
3. The last of the MemWrs from step 1 might have got stalled in the PCIe switch 
due to congestion but if so they are stalled in the egress path of the switch 
for the p2pmem port.
4. The RC determines the IO is complete when the TLP associated with step 2 
updates the memory associated with the CQ. It issues some operation to read the 
p2pmem.
5. Regardless of whether the MemRd TLP comes from the RC or another device 
connected to the switch, it is queued in the egress queue for the p2pmem port 
behind the last DMA TLP (from step 1). PCIe ordering ensures that this MemRd 
cannot overtake the MemWr (read requests can never pass posted writes). 
Therefore the MemRd cannot reach the p2pmem device until after the last DMA 
MemWr has.
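
The five steps above can be modeled as a toy simulation. The sketch below is
not PCIe switch code: it assumes a single FIFO egress queue for the p2pmem
port and models only the one ordering rule the argument relies on, namely that
a read request queued behind posted writes cannot overtake them. All names
(egress_queue, tlp_kind, egress_push, writes_delivered_before_read) are made
up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of one switch egress port; not real PCIe code. */
enum tlp_kind { TLP_MEM_WR, TLP_MEM_RD };

#define EGRESS_DEPTH 16

struct egress_queue {
	enum tlp_kind slots[EGRESS_DEPTH];
	size_t head;	/* next TLP to deliver to the device */
	size_t tail;	/* next free slot */
};

/* Queue a TLP behind everything already waiting at this port. */
static int egress_push(struct egress_queue *q, enum tlp_kind kind)
{
	if (q->tail == EGRESS_DEPTH)
		return -1;	/* the model is full */
	q->slots[q->tail++] = kind;
	return 0;
}

/*
 * Deliver TLPs strictly in arrival order. Returns how many MemWr TLPs
 * reached the device before the first MemRd did, or -1 if no MemRd
 * was queued.
 */
static int writes_delivered_before_read(struct egress_queue *q)
{
	int writes = 0;

	assert(q->head <= q->tail);
	while (q->head < q->tail) {
		enum tlp_kind kind = q->slots[q->head++];

		if (kind == TLP_MEM_RD)
			return writes;
		writes++;
	}
	return -1;
}
```

Pushing the DMA writes from step 1 and then the read from step 4 into the same
queue shows the read arriving only after every write, however long the writes
were stalled.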

I hope this helps!

Stephen


___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


panics related to nfit_test?

2017-04-07 Thread Linda Knippers
I'm trying to run the ndctl tests on 4.11-rc5.  I've never run them before but I
think I correctly followed all the directions for building and installing the
tools/testing/nvdimm components as described in the ndctl README.md.  I'm
seeing two problems that may be related and I'm wondering whether this could
be build/user error or something real.

1) Running the tests was causing my system to panic when the nfit_test module
is unloaded.  I determined I don't actually have to run a test to cause the
panic, just modprobe the modules as listed in ndctl nfit_test_init(), then
modprobe nfit_test, then rmmod nfit_test.  I'm doing this on a system without
NVDIMMs.  I get the same thing on a system with NVDIMMs although the other
modules are already loaded.

This is the panic I get, very reproducibly.

[53617.173340] nfit_test nfit_test.0: failed to evaluate _FIT



[53683.797952] BUG: unable to handle kernel NULL pointer dereference at (null)
[53683.837521] IP: __list_del_entry_valid+0x29/0xd0
[53683.861449] PGD 105f4fb067
[53683.861449] PUD 1054889067
[53683.874551] PMD 0
[53683.887664]
[53683.903937] Oops:  [#1] SMP
[53683.918657] Modules linked in: nfit_test(O-) nd_pmem(O) nd_e820(O) nd_blk(O) 
nd_btt(O)
dax_pmem(O) dax(O) nfit(O) libnvdimm(O) nfit_test_iomap(O) ip6t_rpfilter 
ipt_REJECT nf_reject_ipv4
ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat 
ebtable_broute bridge stp llc
ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle 
ip6table_security
ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat 
nf_conntrack
iptable_mangle iptable_security iptable_raw ebtable_filter ebtables 
ip6table_filter ip6_tables
iptable_filter intel_rapl sb_edac edac_core x86_pkg_temp_thermal 
intel_powerclamp coretemp vfat fat
kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc 
ipmi_ssif aesni_intel
crypto_simd glue_helper cryptd sg hpilo iTCO_wdt
[53684.252765] hpwdt ipmi_si ipmi_devintf iTCO_vendor_support ioatdma i2c_i801 
lpc_ich shpchp pcspkr
acpi_power_meter ipmi_msghandler dca wmi ip_tables xfs sd_mod mgag200 
i2c_algo_bit drm_kms_helper
syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm bnx2x tg3 mdio hpsa ptp 
i2c_core pps_core
libcrc32c scsi_transport_sas crc32c_intel
[53684.394684] CPU: 35 PID: 4087 Comm: rmmod Tainted: G W O 4.11.0-rc5+ #3
[53684.430295] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS 
P89 10/05/2016
[53684.469368] task: 9cdbaca9ad00 task.stack: bf3348cc8000
[53684.497175] RIP: 0010:__list_del_entry_valid+0x29/0xd0
[53684.521315] RSP: 0018:bf3348ccbd90 EFLAGS: 00010007
[53684.545823] RAX:  RBX:  RCX: 0006
[53684.579642] RDX: dead0200 RSI: 9cdbaf4268a0 RDI: bf334e302000
[53684.613132] RBP: bf3348ccbd90 R08:  R09: bf334e302000
[53684.646725] R10: 0004 R11: 9cdbaf4268a0 R12: bf3348ccbdc8
[53684.680100] R13:  R14:  R15: 9ce7a36f2400
[53684.713655] FS: 7f1fab239740() GS:9ce7af04() 
knlGS:
[53684.751875] CS: 0010 DS:  ES:  CR0: 80050033
[53684.778962] CR2:  CR3: 00106eb12000 CR4: 003406e0
[53684.812949] DR0:  DR1:  DR2: 
[53684.847826] DR3:  DR6: fffe0ff0 DR7: 0400
[53684.883234] Call Trace:
[53684.896228] release_nodes+0x76/0x260
[53684.913359] devres_release_all+0x3c/0x60
[53684.932192] device_release_driver_internal+0x151/0x1f0
[53684.956700] driver_detach+0x3f/0x80
[53684.973569] bus_remove_driver+0x55/0xd0
[53684.992057] driver_unregister+0x2c/0x50
[53685.010575] platform_driver_unregister+0x12/0x20
[53685.032584] nfit_test_exit+0x10/0xaa9 [nfit_test]
[53685.055372] SyS_delete_module+0x1ba/0x220
[53685.074931] do_syscall_64+0x67/0x180
[53685.092329] entry_SYSCALL64_slow_path+0x25/0x25
[53685.114144] RIP: 0033:0x7f1faa70dc27
[53685.131113] RSP: 002b:7ffc579ffa98 EFLAGS: 0202 ORIG_RAX: 
00b0
[53685.167000] RAX: ffda RBX: 02560340 RCX: 7f1faa70dc27
[53685.201314] RDX: 7f1faa77e000 RSI: 0800 RDI: 025603a8
[53685.234812] RBP:  R08: 7f1faa9d1060 R09: 7f1faa77e000
[53685.267909] R10: 7ffc579ff820 R11: 0202 R12: 7ffc57a00922
[53685.301350] R13:  R14: 02560340 R15: 02560010
[53685.335068] Code: 00 00 55 48 8b 07 48 ba 00 01 00 00 00 00 ad de 4c 8b 47 
08 48 89 e5 48 39 d0
74 27 48 ba 00 02 00 00 00 00 ad de 49 39 d0 74 7e <4d> 8b 00 4c 39 c7 75 55 4c 
8b 40 08 4c 39 c7 75
2b b8 01 00 00
[53685.427540] RIP: __list_del_entry_valid+0x29/0xd0 RSP: bf3348ccbd90
[53685.459123] CR2: 
[53685.477027] ---[ end trace 2392c114f429911a ]---
[53685.503198] Kernel panic - not syncing: Fatal exception
[53685.528001] Kernel Offset: 0x2da00

Re: KASLR causes intermittent boot failures on some systems

2017-04-07 Thread Thomas Garnier
CCing Kees for information.

On Fri, Apr 7, 2017 at 7:41 AM, Jeff Moyer wrote:
> Hi,
>
> commit 021182e52fe01 ("x86/mm: Enable KASLR for physical mapping memory
> regions") causes some of my systems with persistent memory (whether real
> or emulated) to fail to boot with a couple of different crash
> signatures.  The first signature is a NMI watchdog lockup of all but 1
> cpu, which causes much difficulty in extracting useful information from
> the console.  The second variant is an invalid paging request, listed
> below.
>
> On some systems, I haven't hit this problem at all.  Other systems
> experience a failed boot maybe 20-30% of the time.  To reproduce it,
> configure some emulated pmem on your system.  You can find directions
> for that here: https://nvdimm.wiki.kernel.org/

Did you try to repro on qemu?

>
> Install ndctl (https://github.com/pmem/ndctl).
> Configure the namespace:
> # ndctl create-namespace -f -e namespace0.0 -m memory
>
> Then just reboot several times (5 should be enough), and hopefully
> you'll hit the issue.
>
> I've attached both my .config and the dmesg output from a successful
> boot at the end of this mail.

Thanks for looking into it. I will look into getting a repro on qemu
or a dedicated machine.

If anyone has a guess on the cause, please let me know.

>
> Cheers,
> Jeff
>
> [9.874109] pmem0: detected capacity change from 0 to 206158430208
> [9.881652] BUG: unable to handle kernel paging request at 9406bfff
> [9.889431] IP: memcpy_erms+0x6/0x10
> [9.893422] PGD 0
> [9.893423]
> [9.897316] Oops:  [#1] SMP
> [9.900820] Modules linked in: isci mgag200 drm_kms_helper syscopyarea 
> sysfillrect sysimgblt igb fb_sys_fops ahci libsas ttm ptp libahci 
> crc32c_intel scsi_transport_sas nd_pmem pps_core nd_btt drm dca libata 
> i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
> [9.927322] CPU: 11 PID: 441 Comm: systemd-udevd Not tainted 4.11.0-rc5+ #1
> [9.935092] Hardware name: Intel Corporation LH Pass/SVRBD-ROW_P, BIOS 
> SE5C600.86B.02.01.SP06.050920141054 05/09/2014
> [9.946934] task: 92dedae12b80 task.stack: baeb0783c000
> [9.953539] RIP: 0010:memcpy_erms+0x6/0x10
> [9.958108] RSP: 0018:baeb0783f9b8 EFLAGS: 00010286
> [9.963939] RAX: 92e6dafef000 RBX:  RCX: 
> 1000
> [9.971904] RDX: 1000 RSI: 9406bfff RDI: 
> 92e6dafef000
> [9.979869] RBP: baeb0783fa38 R08:  R09: 
> 1780
> [9.987831] R10:  R11: 9406bfff R12: 
> 92d83bfaea98
> [9.995794] R13: 002f R14: 1000 R15: 
> 92e6dafef000
> [   10.003759] FS:  7fd4c2e618c0() GS:92e6de4c() 
> knlGS:
> [   10.012779] CS:  0010 DS:  ES:  CR0: 80050033
> [   10.019192] CR2: 9406bfff CR3: 00081a05c000 CR4: 
> 001406e0
> [   10.027158] Call Trace:
> [   10.029891]  ? pmem_do_bvec+0x93/0x290 [nd_pmem]
> [   10.035046]  ? radix_tree_node_alloc.constprop.20+0x85/0xc0
> [   10.041263]  ? radix_tree_node_alloc.constprop.20+0x85/0xc0
> [   10.047481]  pmem_rw_page+0x3a/0x60 [nd_pmem]
> [   10.052343]  bdev_read_page+0x81/0xb0
> [   10.056431]  do_mpage_readpage+0x56f/0x770
> [   10.060991]  ? I_BDEV+0x20/0x20
> [   10.064500]  ? lru_cache_add+0xe/0x10
> [   10.068584]  mpage_readpages+0x148/0x1e0
> [   10.072958]  ? I_BDEV+0x20/0x20
> [   10.076462]  ? I_BDEV+0x20/0x20
> [   10.079969]  ? alloc_pages_current+0x88/0x120
> [   10.084830]  blkdev_readpages+0x1d/0x20
> [   10.089111]  __do_page_cache_readahead+0x1ce/0x2c0
> [   10.094456]  force_page_cache_readahead+0xa2/0x100
> [   10.099800]  page_cache_sync_readahead+0x3f/0x50
> [   10.104956]  generic_file_read_iter+0x60d/0x8c0
> [   10.110014]  ? cp_new_stat+0x14f/0x180
> [   10.114187]  blkdev_read_iter+0x37/0x40
> [   10.118469]  __vfs_read+0xe0/0x150
> [   10.122253]  vfs_read+0x8c/0x130
> [   10.125856]  SyS_read+0x55/0xc0
> [   10.129354]  entry_SYSCALL_64_fastpath+0x1a/0xa9
> [   10.134508] RIP: 0033:0x7fd4c1d9d480
> [   10.138487] RSP: 002b:7fffa1f96e08 EFLAGS: 0246 ORIG_RAX: 
> 
> [   10.146934] RAX: ffda RBX: 7fffa1f968f0 RCX: 
> 7fd4c1d9d480
> [   10.154896] RDX: 0040 RSI: 559de3d6d978 RDI: 
> 0008
> [   10.162859] RBP: 00010300 R08: 0020 R09: 
> 0068
> [   10.170820] R10: 7fffa1f96b90 R11: 0246 R12: 
> 
> [   10.178783] R13: 7fffa1f97980 R14:  R15: 
> 
> [   10.186748] Code: ff 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 
> 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1  
> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38
> [   10.207813] RIP: memcpy_erms+0x6/0x10 RSP: baeb0783f9b8
> [   10.214022] CR2: 9406bfff
> [   10.217774] ---[ end trace 2e

Re: KASLR causes intermittent boot failures on some systems

2017-04-07 Thread Jeff Moyer
Thomas Garnier writes:

> CCing Kees for information.
>
> On Fri, Apr 7, 2017 at 7:41 AM, Jeff Moyer  wrote:
>> Hi,
>>
>> commit 021182e52fe01 ("x86/mm: Enable KASLR for physical mapping memory
>> regions") causes some of my systems with persistent memory (whether real
>> or emulated) to fail to boot with a couple of different crash
>> signatures.  The first signature is a NMI watchdog lockup of all but 1
>> cpu, which causes much difficulty in extracting useful information from
>> the console.  The second variant is an invalid paging request, listed
>> below.
>>
>> On some systems, I haven't hit this problem at all.  Other systems
>> experience a failed boot maybe 20-30% of the time.  To reproduce it,
>> configure some emulated pmem on your system.  You can find directions
>> for that here: https://nvdimm.wiki.kernel.org/
>
> Did you try to repro on qemu?

I did not.

-Jeff


Re: [PATCH v4 4/5] acpi_nfit, libnvdimm: Add support for clear poison list and bad blocks

2017-04-07 Thread Dan Williams
On Thu, Mar 16, 2017 at 3:59 PM, Dave Jiang wrote:
> Providing mechanism to clear poison list via the ndctl ND_CMD_CLEAR_ERROR
> call. We will update the poison list and also the badblocks at region level
> if the region is in dax mode or in pmem mode and not active.
>
> Signed-off-by: Dave Jiang 
> Reviewed-by: Johannes Thumshirn 
> ---
>  drivers/acpi/nfit/core.c |   24 ++
>  drivers/acpi/nfit/nfit.h |2 +
>  drivers/nvdimm/bus.c |   64 
> ++
>  drivers/nvdimm/core.c|   17 --
>  drivers/nvdimm/region.c  |   25 +++
>  include/linux/libnvdimm.h|7 
>  tools/testing/nvdimm/test/nfit.c |   21 +++-
>  7 files changed, 139 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index e7b05df..706eccd 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -94,6 +94,28 @@ static struct acpi_device *to_acpi_dev(struct 
> acpi_nfit_desc *acpi_desc)
> return to_acpi_device(acpi_desc->dev);
>  }
>
> +void acpi_nfit_forget_poison(struct nvdimm_bus_descriptor *nd_desc,
> +   unsigned int cmd, void *buf)
> +{
> +   struct acpi_nfit_desc *acpi_desc = to_acpi_nfit_desc(nd_desc);
> +   struct nvdimm_bus *nvdimm_bus = acpi_desc->nvdimm_bus;
> +   struct nd_cmd_clear_error *clear_err = buf;
> +   struct resource res;
> +
> +   if (!nvdimm_bus || !clear_err->cleared)
> +   return;
> +
> +   /* clearing the poison list we keep track of */
> +   __nvdimm_forget_poison(nvdimm_bus, clear_err->address,
> +   clear_err->cleared);
> +
> +   /* now sync the badblocks lists from the poison list */
> +   res.start = clear_err->address;
> +   res.end = clear_err->address + clear_err->cleared - 1;
> +   __nvdimm_bus_badblocks_clear(nvdimm_bus, &res);
> +}
> +EXPORT_SYMBOL_GPL(acpi_nfit_forget_poison);
> +
>  static int xlat_bus_status(void *buf, unsigned int cmd, u32 status)
>  {
> struct nd_cmd_clear_error *clear_err;
> @@ -353,6 +375,8 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc, 
> struct nvdimm *nvdimm,
> }
>
> xlat_rc = xlat_status(nvdimm, buf, cmd, fw_status);
> +   if (!nvdimm && cmd == ND_CMD_CLEAR_ERROR && xlat_rc >= 0)
> +   acpi_nfit_forget_poison(nd_desc, cmd, buf);

I think this needs to move out to __nd_ioctl(), otherwise we'll be
calling this in response to a kernel internal ND_CMD_CLEAR_ERROR. This
should only be invoked for external clear error. This also means we
don't need the previous patch to unconditionally retrieve xlat_rc.
Instead we can just have __nd_ioctl() provide a valid cmd_rc parameter
rather than NULL.
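
The distinction being asked for can be sketched in userspace C. This is only
a model of the calling convention, not the libnvdimm code: bus_ctl,
internal_clear_error, ioctl_clear_error and forget_calls are hypothetical
names, and the only behavior taken from the review comment is that the
ioctl-style caller supplies a valid cmd_rc pointer while a kernel-internal
caller passes NULL and must not trigger the poison-list update.

```c
#include <assert.h>
#include <stddef.h>

#define CMD_CLEAR_ERROR 1	/* stand-in for ND_CMD_CLEAR_ERROR */

static int forget_calls;	/* how many times the poison list was pruned */

/* Stand-in for the bus control op: runs the command, reports its status. */
static int bus_ctl(int cmd, int *cmd_rc)
{
	assert(cmd == CMD_CLEAR_ERROR);	/* the model knows one command */
	if (cmd_rc)
		*cmd_rc = 0;		/* pretend the command succeeded */
	return 0;
}

/* Kernel-internal caller: passes NULL and never prunes the list here. */
static int internal_clear_error(void)
{
	return bus_ctl(CMD_CLEAR_ERROR, NULL);
}

/* ioctl-style caller: asks for cmd_rc and prunes only on success. */
static int ioctl_clear_error(void)
{
	int cmd_rc = -1;
	int rc = bus_ctl(CMD_CLEAR_ERROR, &cmd_rc);

	if (rc < 0)
		return rc;
	if (cmd_rc >= 0)
		forget_calls++;	/* where the forget-poison hook would run */
	return rc;
}
```

Keeping the hook in the external entry point means the shared control op needs
no knowledge of who called it.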


Re: panics related to nfit_test?

2017-04-07 Thread Dan Williams
On Fri, Apr 7, 2017 at 6:28 AM, Linda Knippers wrote:
> I'm trying to run the ndctl tests on 4.11-rc5.  I've never run them before 
> but I
> think I correctly followed all the directions for building and installing the
> tools/testing/nvdimm components as described in the ndctl README.md.  I'm
> seeing two problems that may be related and I'm wondering whether this could
> be build/user error or something real.
>
> 1) Running the tests was causing my system to panic when the nfit_test module
> is unloaded.  I determined I don't actually have to run a test to cause the 
> panic, just
> modprobe the modules as listed in ndctl nfit_test_init(), then modprobe 
> nfit_test,
> then rmmod nfit_test.  I'm doing this on a system without NVDIMMs. I get
> the same thing on a system with NVDIMMs although the other modules are already
> loaded.
>
> This is the panic I get, very reproducibly.
>
> [53617.173340] nfit_test nfit_test.0: failed to evaluate _FIT
>
>  the rmmod.>
>
> [53683.797952] BUG: unable to handle kernel NULL pointer dereference at (null)
> [53683.837521] IP: __list_del_entry_valid+0x29/0xd0
> [53683.861449] PGD 105f4fb067
> [53683.861449] PUD 1054889067
> [53683.874551] PMD 0
> [53683.887664]
> [53683.903937] Oops:  [#1] SMP
> [53683.918657] Modules linked in: nfit_test(O-) nd_pmem(O) nd_e820(O) 
> nd_blk(O) nd_btt(O)
> dax_pmem(O) dax(O) nfit(O) libnvdimm(O) nfit_test_iomap(O) ip6t_rpfilter 
> ipt_REJECT nf_reject_ipv4
> ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat 
> ebtable_broute bridge stp llc
> ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle 
> ip6table_security
> ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat 
> nf_conntrack
> iptable_mangle iptable_security iptable_raw ebtable_filter ebtables 
> ip6table_filter ip6_tables
> iptable_filter intel_rapl sb_edac edac_core x86_pkg_temp_thermal 
> intel_powerclamp coretemp vfat fat
> kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel 
> pcbc ipmi_ssif aesni_intel
> crypto_simd glue_helper cryptd sg hpilo iTCO_wdt
> [53684.252765] hpwdt ipmi_si ipmi_devintf iTCO_vendor_support ioatdma 
> i2c_i801 lpc_ich shpchp pcspkr
> acpi_power_meter ipmi_msghandler dca wmi ip_tables xfs sd_mod mgag200 
> i2c_algo_bit drm_kms_helper
> syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm bnx2x tg3 mdio hpsa ptp 
> i2c_core pps_core
> libcrc32c scsi_transport_sas crc32c_intel
> [53684.394684] CPU: 35 PID: 4087 Comm: rmmod Tainted: G W O 4.11.0-rc5+ #3
> [53684.430295] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, 
> BIOS P89 10/05/2016
> [53684.469368] task: 9cdbaca9ad00 task.stack: bf3348cc8000
> [53684.497175] RIP: 0010:__list_del_entry_valid+0x29/0xd0
> [53684.521315] RSP: 0018:bf3348ccbd90 EFLAGS: 00010007
> [53684.545823] RAX:  RBX:  RCX: 
> 0006
> [53684.579642] RDX: dead0200 RSI: 9cdbaf4268a0 RDI: 
> bf334e302000
> [53684.613132] RBP: bf3348ccbd90 R08:  R09: 
> bf334e302000
> [53684.646725] R10: 0004 R11: 9cdbaf4268a0 R12: 
> bf3348ccbdc8
> [53684.680100] R13:  R14:  R15: 
> 9ce7a36f2400
> [53684.713655] FS: 7f1fab239740() GS:9ce7af04() 
> knlGS:
> [53684.751875] CS: 0010 DS:  ES:  CR0: 80050033
> [53684.778962] CR2:  CR3: 00106eb12000 CR4: 
> 003406e0
> [53684.812949] DR0:  DR1:  DR2: 
> 
> [53684.847826] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [53684.883234] Call Trace:
> [53684.896228] release_nodes+0x76/0x260
> [53684.913359] devres_release_all+0x3c/0x60
> [53684.932192] device_release_driver_internal+0x151/0x1f0
> [53684.956700] driver_detach+0x3f/0x80
> [53684.973569] bus_remove_driver+0x55/0xd0
> [53684.992057] driver_unregister+0x2c/0x50
> [53685.010575] platform_driver_unregister+0x12/0x20
> [53685.032584] nfit_test_exit+0x10/0xaa9 [nfit_test]
> [53685.055372] SyS_delete_module+0x1ba/0x220

Can you send your kernel config? I've seen reports of this crash
signature from the team trying to integrate the ndctl unit tests into
the 0day kbuild robot, but I have thus far been unable to reproduce
them. On my system if I do:

# modprobe nfit_test
# rmmod nfit_test
rmmod: ERROR: Module nfit_test is in use

Are you saying you are able to remove nfit_test on your system without
first disabling regions?


Re: [PATCH] x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions

2017-04-07 Thread Kani, Toshimitsu
On Thu, 2017-04-06 at 13:59 -0700, Dan Williams wrote:
> Before we rework the "pmem api" to stop abusing __copy_user_nocache()
> for memcpy_to_pmem() we need to fix cases where we may strand dirty
> data in the cpu cache. The problem occurs when copy_from_iter_pmem()
> is used for arbitrary data transfers from userspace. There is no
> guarantee that these transfers, performed by dax_iomap_actor(), will
> have aligned destinations or aligned transfer lengths. Backstop the
> usage __copy_user_nocache() with explicit cache management in these
> unaligned cases.
> 
> Yes, copy_from_iter_pmem() is now too big for an inline, but
> addressing that is saved for a later patch that moves the entirety of
> the "pmem api" into the pmem driver directly.

The change looks good to me.  Should we also avoid cache flushing in
the case of size=4B & dest aligned by 4B?
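
The alignment question can be made concrete with a small predicate. This is a
sketch under stated assumptions, not the patch under review: it assumes the
copy routine bypasses the cache only for whole 8-byte words written to an
8-byte-aligned destination, plus the single 4-byte, 4-byte-aligned case raised
above, and that anything else may leave dirty lines that need an explicit
write-back. The helper names is_aligned and copy_needs_flush are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool is_aligned(uintptr_t value, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);	/* power of two */
	return (value & (align - 1)) == 0;
}

/*
 * Hypothetical helper: true when copying len bytes to dest might leave
 * dirty data in the CPU cache under the assumptions stated above.
 */
static bool copy_needs_flush(uintptr_t dest, size_t len)
{
	if (len == 0)
		return false;
	/* The case raised in the reply: one 4-byte store, 4-byte aligned. */
	if (len == 4 && is_aligned(dest, 4))
		return false;
	/* Unaligned start, or a tail that is not a whole 8-byte word. */
	return !is_aligned(dest, 8) || !is_aligned(len, 8);
}
```

Under these assumptions a 64-byte copy to an aligned address needs no flush,
while the same copy shifted by one byte does.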

Thanks,
-Toshi


Re: [RFC 5/8] scatterlist: Modify SG copy functions to support io memory.

2017-04-07 Thread Logan Gunthorpe
Hi Dan,

On 03/04/17 06:07 PM, Dan Williams wrote:
> The completely agnostic part is where I get worried, but I shouldn't
> say anymore until I actually read the patch. The worry is cases where
> this agnostic enabling allows unsuspecting code paths to do the wrong
> thing. Like bypass iomem safety.

Yup, you're right, the iomem safety issue is a really difficult problem.
I think replacing struct page with pfn_t in a bunch of places is
probably going to be a requirement for my work. However, this is going
to be a very large undertaking.

I've done an audit of sg_page users and there will indeed be some
difficult cases. However, I'm going to start doing some cleanup and
semantic changes to hopefully move in that direction. The first step
I've chosen to look at is to create an sg_kmap interface which replaces
about 77 (out of ~340) sg_page users. I'm hoping the new interface can
have the semantic that sg_kmap can fail (which would happen in the case
that no suitable page exists).

Eventually, I'd want to get to a place where sg_page either doesn't
exist or can fail and is always checked. At that point swapping out
pfn_t in the sgl would be manageable.
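
To illustrate the "can fail" semantic, here is a minimal userspace sketch. It
is not the proposed kernel interface: struct sg_entry_sketch and
sg_kmap_sketch are hypothetical, and the choice of -EINVAL as the error is an
assumption. The point is only that a caller gets an error instead of a pointer
when no suitable page backs the entry.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified scatterlist entry; not the kernel's struct. */
struct sg_entry_sketch {
	void *cpu_addr;	/* NULL when the entry is I/O memory with no page */
	size_t offset;
	size_t length;
};

/*
 * Map an entry for CPU access. Unlike a bare page lookup this can fail,
 * so a caller cannot touch I/O memory by accident.
 */
static int sg_kmap_sketch(const struct sg_entry_sketch *sg, void **out)
{
	assert(sg != NULL && out != NULL);
	if (sg->cpu_addr == NULL) {
		*out = NULL;
		return -EINVAL;	/* no suitable page backs this entry */
	}
	*out = (char *)sg->cpu_addr + sg->offset;
	return 0;
}
```

Every converted caller would then have to check the return value, which is
what makes the iomem case visible at each call site.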

Thoughts?

Logan


[PATCH] libnvdimm: fix reconfig_mutex, mmap_sem, and jbd2_handle lockdep splat

2017-04-07 Thread Dan Williams
Holding the reconfig_mutex over a potential userspace fault sets up a
lockdep dependency chain between filesystem-DAX and the libnvdimm ioctl
path. Move the user access outside of the lock.

 [ INFO: possible circular locking dependency detected ]
 4.11.0-rc3+ #13 Tainted: GW  O
 ---
 fallocate/16656 is trying to acquire lock:
  (&nvdimm_bus->reconfig_mutex){+.+.+.}, at: [] 
nvdimm_bus_lock+0x21/0x30 [libnvdimm]
 but task is already holding lock:
  (jbd2_handle){..}, at: [] 
start_this_handle+0x104/0x460

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (jbd2_handle){..}:
lock_acquire+0xbd/0x200
start_this_handle+0x16a/0x460
jbd2__journal_start+0xe9/0x2d0
__ext4_journal_start_sb+0x89/0x1c0
ext4_dirty_inode+0x32/0x70
__mark_inode_dirty+0x235/0x670
generic_update_time+0x87/0xd0
touch_atime+0xa9/0xd0
ext4_file_mmap+0x90/0xb0
mmap_region+0x370/0x5b0
do_mmap+0x415/0x4f0
vm_mmap_pgoff+0xd7/0x120
SyS_mmap_pgoff+0x1c5/0x290
SyS_mmap+0x22/0x30
entry_SYSCALL_64_fastpath+0x1f/0xc2

-> #1 (&mm->mmap_sem){++}:
lock_acquire+0xbd/0x200
__might_fault+0x70/0xa0
__nd_ioctl+0x683/0x720 [libnvdimm]
nvdimm_ioctl+0x8b/0xe0 [libnvdimm]
do_vfs_ioctl+0xa8/0x740
SyS_ioctl+0x79/0x90
do_syscall_64+0x6c/0x200
return_from_SYSCALL_64+0x0/0x7a

-> #0 (&nvdimm_bus->reconfig_mutex){+.+.+.}:
__lock_acquire+0x16b6/0x1730
lock_acquire+0xbd/0x200
__mutex_lock+0x88/0x9b0
mutex_lock_nested+0x1b/0x20
nvdimm_bus_lock+0x21/0x30 [libnvdimm]
nvdimm_forget_poison+0x25/0x50 [libnvdimm]
nvdimm_clear_poison+0x106/0x140 [libnvdimm]
pmem_do_bvec+0x1c2/0x2b0 [nd_pmem]
pmem_make_request+0xf9/0x270 [nd_pmem]
generic_make_request+0x118/0x3b0
submit_bio+0x75/0x150

Cc: 
Fixes: 62232e45f4a2 ("libnvdimm: control (ioctl) messages for nvdimm_bus and 
nvdimm devices")
Cc: Dave Jiang 
Reported-by: Vishal Verma 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/bus.c |6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 23d4a1728cdf..351bac8f6503 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -934,8 +934,14 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, 
struct nvdimm *nvdimm,
rc = nd_desc->ndctl(nd_desc, nvdimm, cmd, buf, buf_len, NULL);
if (rc < 0)
goto out_unlock;
+   nvdimm_bus_unlock(&nvdimm_bus->dev);
+
if (copy_to_user(p, buf, buf_len))
rc = -EFAULT;
+
+   vfree(buf);
+   return rc;
+
  out_unlock:
nvdimm_bus_unlock(&nvdimm_bus->dev);
  out:
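
The shape of the change in the diff above can be modeled in userspace. This is
a sketch, not the driver code: a boolean stands in for the reconfig_mutex,
copy_to_user_sketch stands in for copy_to_user(), and all names are
hypothetical. It only demonstrates the ordering the patch establishes: run the
command under the lock, drop the lock, then do the copy that may fault.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

static bool bus_locked;		/* stand-in for the reconfig_mutex */
static bool faulted_under_lock;	/* set if a "user" copy ran while locked */

static void bus_lock(void)
{
	assert(!bus_locked);
	bus_locked = true;
}

static void bus_unlock(void)
{
	assert(bus_locked);
	bus_locked = false;
}

/* Stand-in for copy_to_user(): may fault, so it must not run locked. */
static int copy_to_user_sketch(void *dst, const void *src, size_t len)
{
	if (bus_locked)
		faulted_under_lock = true;
	memcpy(dst, src, len);
	return 0;
}

/* Shape of the fixed path: command under the lock, copy-out after it. */
static int ioctl_sketch(void *user_buf, size_t len)
{
	char kbuf[64];

	if (len > sizeof(kbuf))
		return -1;

	bus_lock();
	memset(kbuf, 0xab, len);	/* pretend the command filled the buffer */
	bus_unlock();

	return copy_to_user_sketch(user_buf, kbuf, len);
}
```

Because the copy happens after bus_unlock(), a page fault taken during it can
no longer create a dependency from the fault path back to the bus lock.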



Re: panics related to nfit_test?

2017-04-07 Thread Linda Knippers
On 04/07/2017 01:12 PM, Linda Knippers wrote:
> On 04/07/2017 12:44 PM, Dan Williams wrote:
>> On Fri, Apr 7, 2017 at 6:28 AM, Linda Knippers  
>> wrote:
>> I've seen reports of this crash
>> signature from the team trying to integrate the ndctl unit tests into
>> the 0day kbuild robot, but I have thus far been unable to reproduce
>> them. On my system if I do:
>>
>> # modprobe nfit_test
>> # rmmod nfit_test
>> rmmod: ERROR: Module nfit_test is in use
>>
>> Are you saying you are able to remove nfit_test on your system without
>> first disabling regions?
> 
> No, sorry.  I missed that step in my description.  I'm doing 'ndctl 
> disable-region all'
> before the rmmod.

I've been doing a bit more testing and once, I had 'ndctl check' make it through
all the tests and pass.  A few times I've made it part way through the tests
before I hit the panic.  However, if I just modprobe the modules, disable the
regions, and then rmmod nfit_test, it panics for me 100% of the time.  Try this
in a script.

modprobe nfit
modprobe dax
modprobe dax_pmem
modprobe libnvdimm
modprobe nd_blk
modprobe nd_btt
modprobe nd_e820
modprobe nd_pmem
lsmod |grep nfit
modprobe nfit_test
lsmod |grep nfit
ndctl disable-region all
rmmod nfit_test

-- ljk

> 
> -- ljk
> 
> 



[PATCH] libnvdimm: fix btt vs clear poison locking

2017-04-07 Thread Dan Williams
The following warning results from holding a lane spinlock,
preempt_disable(), or the btt map spinlock and then trying to take the
reconfig_mutex to walk the poison list and potentially add new entries.

 BUG: sleeping function called from invalid context at 
kernel/locking/mutex.c:747
 in_atomic(): 1, irqs_disabled(): 0, pid: 17159, name: dd
 [..]
 Call Trace:
  dump_stack+0x85/0xc8
  ___might_sleep+0x184/0x250
  __might_sleep+0x4a/0x90
  __mutex_lock+0x58/0x9b0
  ? nvdimm_bus_lock+0x21/0x30 [libnvdimm]
  ? __nvdimm_bus_badblocks_clear+0x2f/0x60 [libnvdimm]
  ? acpi_nfit_forget_poison+0x79/0x80 [nfit]
  ? _raw_spin_unlock+0x27/0x40
  mutex_lock_nested+0x1b/0x20
  nvdimm_bus_lock+0x21/0x30 [libnvdimm]
  nvdimm_forget_poison+0x25/0x50 [libnvdimm]
  nvdimm_clear_poison+0x106/0x140 [libnvdimm]
  nsio_rw_bytes+0x164/0x270 [libnvdimm]
  btt_write_pg+0x1de/0x3e0 [nd_btt]
  ? blk_queue_enter+0x30/0x290
  btt_make_request+0x11a/0x310 [nd_btt]
  ? blk_queue_enter+0xb7/0x290
  ? blk_queue_enter+0x30/0x290
  generic_make_request+0x118/0x3b0

As a minimal fix, disable error clearing when the BTT is enabled. For
the final fix a larger rework of the poison list locking is needed.

Note that this is not a problem in the blk case since that path never
calls nvdimm_clear_poison().

Cc: 
Fixes: 82bf1037f2ca ("libnvdimm: check and clear poison before writing to pmem")
Cc: Dave Jiang 
Reported-by: Vishal Verma 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/claim.c |   10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index b3323c0697f6..36da71e5a591 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -243,7 +243,15 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
}
 
if (unlikely(is_bad_pmem(&nsio->bb, sector, sz_align))) {
-   if (IS_ALIGNED(offset, 512) && IS_ALIGNED(size, 512)) {
+   /*
+* FIXME: nsio_rw_bytes() may be called from atomic
+* context in the BTT case and nvdimm_clear_poison()
+* takes a sleeping lock. Until the locking can be
+* reworked this capability depends on !BTT or BROKEN.
+*/
+   if ((!IS_ENABLED(CONFIG_BTT) || IS_ENABLED(CONFIG_BROKEN))
+   && IS_ALIGNED(offset, 512)
+   && IS_ALIGNED(size, 512)) {
long cleared;
 
cleared = nvdimm_clear_poison(&ndns->dev, offset, size);
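
The new condition in the hunk above reduces to a small truth table, modeled
here as a plain function. This is a sketch: the two booleans stand in for the
compile-time IS_ENABLED(CONFIG_BTT) and IS_ENABLED(CONFIG_BROKEN) checks, and
may_clear_poison and SECTOR_SIZE_SKETCH are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

#define SECTOR_SIZE_SKETCH 512u

/*
 * Model of the patched condition: clearing is attempted only when BTT is
 * not built in (or the build opted into BROKEN) and the range covers
 * whole 512-byte sectors.
 */
static bool may_clear_poison(bool btt_enabled, bool broken_enabled,
			     size_t offset, size_t size)
{
	if (btt_enabled && !broken_enabled)
		return false;
	return offset % SECTOR_SIZE_SKETCH == 0 &&
	       size % SECTOR_SIZE_SKETCH == 0;
}
```

With BTT enabled and BROKEN unset the function always returns false, which
matches the commit message's "disable error clearing when the BTT is enabled".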



Re: KASLR causes intermittent boot failures on some systems

2017-04-07 Thread Kees Cook
On Fri, Apr 7, 2017 at 7:41 AM, Jeff Moyer wrote:
> Hi,
>
> commit 021182e52fe01 ("x86/mm: Enable KASLR for physical mapping memory
> regions") causes some of my systems with persistent memory (whether real
> or emulated) to fail to boot with a couple of different crash
> signatures.  The first signature is a NMI watchdog lockup of all but 1
> cpu, which causes much difficulty in extracting useful information from
> the console.  The second variant is an invalid paging request, listed
> below.

Just to rule out some of the stuff in the boot path, does booting with
"nokaslr" solve this? (i.e. I want to figure out if this is from some
of the rearrangements done that are exposed under that commit, or if
it is genuinely the randomization that is killing the systems...)

> On some systems, I haven't hit this problem at all.  Other systems
> experience a failed boot maybe 20-30% of the time.  To reproduce it,
> configure some emulated pmem on your system.  You can find directions
> for that here: https://nvdimm.wiki.kernel.org/
>
> Install ndctl (https://github.com/pmem/ndctl).
> Configure the namespace:
> # ndctl create-namespace -f -e namespace0.0 -m memory
>
> Then just reboot several times (5 should be enough), and hopefully
> you'll hit the issue.
>
> I've attached both my .config and the dmesg output from a successful
> boot at the end of this mail.

Thanks! Considering I know nothing about pmem (yet), I bet there is
some oversight in what's happening with how KASLR scans for available
memory areas. I'll carve out some time next week to look into this.

-Kees

-- 
Kees Cook
Pixel Security


Re: panics related to nfit_test?

2017-04-07 Thread Dan Williams
On Fri, Apr 7, 2017 at 1:28 PM, Linda Knippers wrote:
> On 04/07/2017 01:12 PM, Linda Knippers wrote:
>> On 04/07/2017 12:44 PM, Dan Williams wrote:
>>> On Fri, Apr 7, 2017 at 6:28 AM, Linda Knippers  
>>> wrote:
>>> I've seen reports of this crash
>>> signature from the team trying to integrate the ndctl unit tests into
>>> the 0day kbuild robot, but I have thus far been unable to reproduce
>>> them. On my system if I do:
>>>
>>> # modprobe nfit_test
>>> # rmmod nfit_test
>>> rmmod: ERROR: Module nfit_test is in use
>>>
>>> Are you saying you are able to remove nfit_test on your system without
>>> first disabling regions?
>>
>> No, sorry.  I missed that step in my description.  I'm doing 'ndctl 
>> disable-region all'
>> before the rmmod.
>
> I've been doing a bit more testing and once, I had 'ndctl check' make it 
> through
> all the tests and pass.  A few times I've made it part way through the tests 
> before
> I hit the panic.  However, if I just modprobe the modules, disable the 
> regions,
> and then rmmod nfit_test, it panics for me 100% of the time.  Try this in a 
> script.
>
> modprobe nfit
> modprobe dax
> modprobe dax_pmem
> modprobe libnvdimm
> modprobe nd_blk
> modprobe nd_btt
> modprobe nd_e820
> modprobe nd_pmem
> lsmod |grep nfit
> modprobe nfit_test
> lsmod |grep nfit
> ndctl disable-region all
> rmmod nfit_test
>

What distribution are you using? This loop is running fine in my
Fedora Rawhide virtual machine environment. The other report of this
was from a Debian environment. So I wonder if there are some timing
differences related to udev or libkmod that prevent me from hitting
the failure condition?


Re: panics related to nfit_test?

2017-04-07 Thread Linda Knippers


On 04/07/2017 05:46 PM, Dan Williams wrote:
> On Fri, Apr 7, 2017 at 1:28 PM, Linda Knippers  wrote:
>> On 04/07/2017 01:12 PM, Linda Knippers wrote:
>>> On 04/07/2017 12:44 PM, Dan Williams wrote:
>>>> On Fri, Apr 7, 2017 at 6:28 AM, Linda Knippers wrote:
>>>> I've seen reports of this crash
>>>> signature from the team trying to integrate the ndctl unit tests into
>>>> the 0day kbuild robot, but I have thus far been unable to reproduce
>>>> them. On my system if I do:
>>>>
>>>> # modprobe nfit_test
>>>> # rmmod nfit_test
>>>> rmmod: ERROR: Module nfit_test is in use
>>>>
>>>> Are you saying you are able to remove nfit_test on your system without
>>>> first disabling regions?
>>>
>>> No, sorry.  I missed that step in my description.  I'm doing
>>> 'ndctl disable-region all' before the rmmod.
>>
>> I've been doing a bit more testing and once, I had 'ndctl check' make it
>> through all the tests and pass.  A few times I've made it part way through
>> the tests before I hit the panic.  However, if I just modprobe the modules,
>> disable the regions, and then rmmod nfit_test, it panics for me 100% of the
>> time.  Try this in a script.
>>
>> modprobe nfit
>> modprobe dax
>> modprobe dax_pmem
>> modprobe libnvdimm
>> modprobe nd_blk
>> modprobe nd_btt
>> modprobe nd_e820
>> modprobe nd_pmem
>> lsmod |grep nfit
>> modprobe nfit_test
>> lsmod |grep nfit
>> ndctl disable-region all
>> rmmod nfit_test
>>
> 
> What distribution are you using? This loop is running fine in my
> Fedora Rawhide virtual machine environment. The other report of this
> was from a Debian environment. So I wonder if there are some timing
> differences related to udev or libkmod that prevent me from hitting
> the failure condition?

I'm running RHEL7.3 with a 4.11-rc5 kernel on bare metal with no
physical NVDIMMs.   My system is a 2-socket box with E5-2695 v4
processors and a total of 72 cores with HT on.  Maybe you need
more cores.

-- ljk


[PATCH v5 2/4] libnvdimm: Add 'resource' sysfs attribute to regions

2017-04-07 Thread Dave Jiang
Add a sysfs attribute that exports the physical start address of the
region. This supports clearing poison from a user application via
ND_IOCTL_CLEAR_ERROR.

Signed-off-by: Dave Jiang 
Reviewed-by: Johannes Thumshirn 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/region_devs.c |   13 +
 1 file changed, 13 insertions(+)

diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 3500fc8..8de5a04 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -463,6 +463,15 @@ static struct device_attribute dev_attr_nd_badblocks = {
.show = nd_badblocks_show,
 };
 
+static ssize_t resource_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct nd_region *nd_region = to_nd_region(dev);
+
+   return sprintf(buf, "%#llx\n", nd_region->ndr_start);
+}
+static DEVICE_ATTR_RO(resource);
+
 static struct attribute *nd_region_attributes[] = {
&dev_attr_size.attr,
&dev_attr_nstype.attr,
@@ -476,6 +485,7 @@ static struct attribute *nd_region_attributes[] = {
&dev_attr_namespace_seed.attr,
&dev_attr_init_namespaces.attr,
&dev_attr_nd_badblocks.attr,
+   &dev_attr_resource.attr,
NULL,
 };
 
@@ -495,6 +505,9 @@ static umode_t region_visible(struct kobject *kobj, struct attribute *a, int n)
if (!is_nd_pmem(dev) && a == &dev_attr_nd_badblocks.attr)
return 0;
 
+   if (!is_nd_pmem(dev) && a == &dev_attr_resource.attr)
+   return 0;
+
if (a != &dev_attr_set_cookie.attr
&& a != &dev_attr_available_size.attr)
return a->mode;
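
For anyone consuming the new attribute from userspace: resource_show()
prints the region start with "%#llx", so the text can be parsed with
strtoull(). A minimal sketch, assuming the attribute text has already been
read into a buffer; the helper name parse_region_resource() is made up for
illustration and is not part of the patch.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse the text emitted by resource_show(),
 * e.g. "0x100000000\n", into a physical address.
 * Returns 0 on success, -1 if the buffer is not a valid number.
 */
static int parse_region_resource(const char *buf, unsigned long long *addr)
{
	char *end;

	errno = 0;
	/* base 0 lets strtoull() honour the "0x" prefix added by "%#llx" */
	*addr = strtoull(buf, &end, 0);
	if (errno || end == buf)
		return -1;
	/* accept only an optional trailing newline after the number */
	if (*end == '\n')
		end++;
	return *end == '\0' ? 0 : -1;
}
```

The parsed value would then be the address handed to the clear-error
ioctl described in the commit message.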



[PATCH v5 3/4] libnvdimm: add support for clear poison list and badblocks for device dax

2017-04-07 Thread Dave Jiang
Provide a mechanism to clear the poison list via the ndctl
ND_CMD_CLEAR_ERROR call. We update the poison list and also the badblocks
at the region level if the region is in dax mode, or in pmem mode and not
active. In other words, we force badblocks to be cleared through write
requests if the address is currently accessed through a block device;
otherwise it can only be done via the ioctl+dsm path.

Signed-off-by: Dave Jiang 
Reviewed-by: Johannes Thumshirn 
---
 drivers/nvdimm/bus.c  |   78 +
 drivers/nvdimm/core.c |   17 --
 drivers/nvdimm/region.c   |   25 ++
 include/linux/libnvdimm.h |7 +++-
 4 files changed, 115 insertions(+), 12 deletions(-)

diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 351bac8..af64e9c 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -27,6 +27,7 @@
 #include 
 #include "nd-core.h"
 #include "nd.h"
+#include "pfn.h"
 
 int nvdimm_major;
 static int nvdimm_bus_major;
@@ -218,11 +219,19 @@ long nvdimm_clear_poison(struct device *dev, phys_addr_t phys,
if (cmd_rc < 0)
return cmd_rc;
 
-   nvdimm_clear_from_poison_list(nvdimm_bus, phys, len);
return clear_err.cleared;
 }
 EXPORT_SYMBOL_GPL(nvdimm_clear_poison);
 
+void __nvdimm_bus_badblocks_clear(struct nvdimm_bus *nvdimm_bus,
+   struct resource *res)
+{
+   lockdep_assert_held(&nvdimm_bus->reconfig_mutex);
+   device_for_each_child(&nvdimm_bus->dev, (void *)res,
+   nvdimm_region_badblocks_clear);
+}
+EXPORT_SYMBOL_GPL(__nvdimm_bus_badblocks_clear);
+
 static int nvdimm_bus_match(struct device *dev, struct device_driver *drv);
 
 static struct bus_type nvdimm_bus_type = {
@@ -769,16 +778,55 @@ void wait_nvdimm_bus_probe_idle(struct device *dev)
} while (true);
 }
 
-static int pmem_active(struct device *dev, void *data)
+static int nd_pmem_forget_poison_check(struct device *dev, void *data)
 {
-   if (is_nd_pmem(dev) && dev->driver)
+   struct nd_cmd_clear_error *clear_err =
+   (struct nd_cmd_clear_error *)data;
+   struct nd_btt *nd_btt = is_nd_btt(dev) ? to_nd_btt(dev) : NULL;
+   struct nd_pfn *nd_pfn = is_nd_pfn(dev) ? to_nd_pfn(dev) : NULL;
+   struct nd_dax *nd_dax = is_nd_dax(dev) ? to_nd_dax(dev) : NULL;
+   struct nd_namespace_common *ndns = NULL;
+   struct nd_namespace_io *nsio;
+   resource_size_t offset = 0, end_trunc = 0, start, end, pstart, pend;
+
+   if (nd_dax || !dev->driver)
+   return 0;
+
+   start = clear_err->address;
+   end = clear_err->address + clear_err->cleared - 1;
+
+   if (nd_btt || nd_pfn || nd_dax) {
+   if (nd_btt)
+   ndns = nd_btt->ndns;
+   else if (nd_pfn)
+   ndns = nd_pfn->ndns;
+   else if (nd_dax)
+   ndns = nd_dax->nd_pfn.ndns;
+
+   if (!ndns)
+   return 0;
+   } else
+   ndns = to_ndns(dev);
+
+   nsio = to_nd_namespace_io(&ndns->dev);
+   pstart = nsio->res.start + offset;
+   pend = nsio->res.end - end_trunc;
+
+   if ((pstart >= start) && (pend <= end))
return -EBUSY;
+
return 0;
+
+}
+
+static int nd_ns_forget_poison_check(struct device *dev, void *data)
+{
+   return device_for_each_child(dev, data, nd_pmem_forget_poison_check);
 }
 
 /* set_config requires an idle interleave set */
 static int nd_cmd_clear_to_send(struct nvdimm_bus *nvdimm_bus,
-   struct nvdimm *nvdimm, unsigned int cmd)
+   struct nvdimm *nvdimm, unsigned int cmd, void *data)
 {
struct nvdimm_bus_descriptor *nd_desc = nvdimm_bus->nd_desc;
 
@@ -792,8 +840,8 @@ static int nd_cmd_clear_to_send(struct nvdimm_bus *nvdimm_bus,
 
/* require clear error to go through the pmem driver */
if (!nvdimm && cmd == ND_CMD_CLEAR_ERROR)
-   return device_for_each_child(&nvdimm_bus->dev, NULL,
-   pmem_active);
+   return device_for_each_child(&nvdimm_bus->dev, data,
+   nd_ns_forget_poison_check);
 
if (!nvdimm || cmd != ND_CMD_SET_CONFIG_DATA)
return 0;
@@ -927,13 +975,29 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, struct nvdimm *nvdimm,
}
 
nvdimm_bus_lock(&nvdimm_bus->dev);
-   rc = nd_cmd_clear_to_send(nvdimm_bus, nvdimm, cmd);
+   rc = nd_cmd_clear_to_send(nvdimm_bus, nvdimm, cmd, buf);
if (rc)
goto out_unlock;
 
rc = nd_desc->ndctl(nd_desc, nvdimm, cmd, buf, buf_len, NULL);
if (rc < 0)
goto out_unlock;
+
+   if (cmd == ND_CMD_CLEAR_ERROR) {
+   struct nd_cmd_clear_error *clear_err = buf;
+   struct resource res;
+
+   if (clear_err->cleared) {
+   /* clearing the poison list we keep
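
A note on the range test in nd_pmem_forget_poison_check(): as posted, it
returns -EBUSY only when the active namespace lies entirely inside the
cleared range. The posting does not say whether full containment or any
overlap is the intended condition, so the sketch below only contrasts the
two predicates on inclusive ranges. The struct and function names are
invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Inclusive address range, in the spirit of struct resource (illustrative). */
struct span {
	uint64_t start;
	uint64_t end;
};

/* True when 'inner' lies entirely inside 'outer' (the shape of the test in the patch). */
static bool span_contains(struct span outer, struct span inner)
{
	return inner.start >= outer.start && inner.end <= outer.end;
}

/* True when the two ranges share at least one byte. */
static bool span_overlaps(struct span a, struct span b)
{
	return a.start <= b.end && b.start <= a.end;
}
```

A namespace that is larger than the cleared range but still covers it
would pass the containment test while failing an overlap test, which is
the case worth checking against the intent of the patch.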

[PATCH v5 1/4] libnvdimm: add mechanism to publish badblocks at the region level

2017-04-07 Thread Dave Jiang
The badblocks sysfs file will be exported at the region level. When the
nvdimm event notifier fires for NVDIMM_REVALIDATE_POISON, the badblocks in
the region will be updated.

Signed-off-by: Dave Jiang 
Reviewed-by: Johannes Thumshirn 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/nd.h  |1 +
 drivers/nvdimm/region.c  |   24 
 drivers/nvdimm/region_devs.c |   19 +++
 3 files changed, 44 insertions(+)

diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 2a99c83..c3b33cf 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -154,6 +154,7 @@ struct nd_region {
u64 ndr_start;
int id, num_lanes, ro, numa_node;
void *provider_data;
+   struct badblocks bb;
struct nd_interleave_set *nd_set;
struct nd_percpu_lane __percpu *lane;
struct nd_mapping mapping[0];
diff --git a/drivers/nvdimm/region.c b/drivers/nvdimm/region.c
index 8f24177..869a886 100644
--- a/drivers/nvdimm/region.c
+++ b/drivers/nvdimm/region.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include "nd-core.h"
 #include "nd.h"
 
 static int nd_region_probe(struct device *dev)
@@ -52,6 +53,17 @@ static int nd_region_probe(struct device *dev)
if (rc && err && rc == err)
return -ENODEV;
 
+   if (is_nd_pmem(&nd_region->dev)) {
+   struct resource ndr_res;
+
+   if (devm_init_badblocks(dev, &nd_region->bb))
+   return -ENODEV;
+   ndr_res.start = nd_region->ndr_start;
+   ndr_res.end = nd_region->ndr_start + nd_region->ndr_size - 1;
+   nvdimm_badblocks_populate(nd_region,
+   &nd_region->bb, &ndr_res);
+   }
+
nd_region->btt_seed = nd_btt_create(nd_region);
nd_region->pfn_seed = nd_pfn_create(nd_region);
nd_region->dax_seed = nd_dax_create(nd_region);
@@ -104,6 +116,18 @@ static int child_notify(struct device *dev, void *data)
 
 static void nd_region_notify(struct device *dev, enum nvdimm_event event)
 {
+   if (event == NVDIMM_REVALIDATE_POISON) {
+   struct nd_region *nd_region = to_nd_region(dev);
+   struct resource res;
+
+   if (is_nd_pmem(&nd_region->dev)) {
+   res.start = nd_region->ndr_start;
+   res.end = nd_region->ndr_start +
+   nd_region->ndr_size - 1;
+   nvdimm_badblocks_populate(nd_region,
+   &nd_region->bb, &res);
+   }
+   }
device_for_each_child(dev, &event, child_notify);
 }
 
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index b7cb506..3500fc8 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -448,6 +448,21 @@ static ssize_t read_only_store(struct device *dev,
 }
 static DEVICE_ATTR_RW(read_only);
 
+static ssize_t nd_badblocks_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct nd_region *nd_region = to_nd_region(dev);
+
+   return badblocks_show(&nd_region->bb, buf, 0);
+}
+static struct device_attribute dev_attr_nd_badblocks = {
+   .attr = {
+   .name = "badblocks",
+   .mode = S_IRUGO
+   },
+   .show = nd_badblocks_show,
+};
+
 static struct attribute *nd_region_attributes[] = {
&dev_attr_size.attr,
&dev_attr_nstype.attr,
@@ -460,6 +475,7 @@ static struct attribute *nd_region_attributes[] = {
&dev_attr_available_size.attr,
&dev_attr_namespace_seed.attr,
&dev_attr_init_namespaces.attr,
+   &dev_attr_nd_badblocks.attr,
NULL,
 };
 
@@ -476,6 +492,9 @@ static umode_t region_visible(struct kobject *kobj, struct attribute *a, int n)
if (!is_nd_pmem(dev) && a == &dev_attr_dax_seed.attr)
return 0;
 
+   if (!is_nd_pmem(dev) && a == &dev_attr_nd_badblocks.attr)
+   return 0;
+
if (a != &dev_attr_set_cookie.attr
&& a != &dev_attr_available_size.attr)
return a->mode;
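
Both the probe and the notify path build an inclusive resource
[ndr_start, ndr_start + ndr_size - 1] before calling
nvdimm_badblocks_populate(). The arithmetic can be sketched in userspace
as below. This is an illustration only, not a copy of
nvdimm_badblocks_populate(): the 512-byte sector unit, the outward
rounding, and all names are assumptions of the sketch.

```c
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors (assumption for this sketch) */

struct byte_range {
	uint64_t start;	/* first byte */
	uint64_t end;	/* last byte, inclusive */
};

/* Build the inclusive range for a region of 'size' bytes (size > 0). */
static struct byte_range region_range(uint64_t start, uint64_t size)
{
	struct byte_range r = { start, start + size - 1 };
	return r;
}

/*
 * Clamp a poisoned byte range to the region and convert it to a
 * region-relative sector offset and count. Returns 0 if the poison
 * does not touch the region at all, 1 otherwise.
 */
static int poison_to_sectors(struct byte_range region, struct byte_range poison,
			     uint64_t *first_sector, uint64_t *nr_sectors)
{
	uint64_t s, e;

	if (poison.end < region.start || poison.start > region.end)
		return 0;

	s = poison.start > region.start ? poison.start : region.start;
	e = poison.end < region.end ? poison.end : region.end;

	/* round outward so a partially poisoned sector is still reported */
	*first_sector = (s - region.start) >> SECTOR_SHIFT;
	*nr_sectors = ((e - region.start) >> SECTOR_SHIFT) - *first_sector + 1;
	return 1;
}
```

The "- 1" matters: without it the range would claim one byte past the
region, and a poison entry starting exactly at the next region's base
would be attributed to the wrong region.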



[PATCH v5 4/4] device-dax, tools/testing/nvdimm: enable device-dax with mock resources

2017-04-07 Thread Dave Jiang
Provide a replacement pgoff_to_phys() that translates an nfit_test
resource (allocated by vmalloc()) to a pfn.

Signed-off-by: Dave Jiang 
Reviewed-by: Johannes Thumshirn 
Signed-off-by: Dan Williams 
---
 drivers/dax/dax-private.h  |   61 
 drivers/dax/dax.c  |   52 --
 tools/testing/nvdimm/Kbuild|3 +-
 tools/testing/nvdimm/dax-dev.c |   49 
 4 files changed, 118 insertions(+), 47 deletions(-)
 create mode 100644 drivers/dax/dax-private.h
 create mode 100644 tools/testing/nvdimm/dax-dev.c

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
new file mode 100644
index 000..c45ac94a
--- /dev/null
+++ b/drivers/dax/dax-private.h
@@ -0,0 +1,61 @@
+/*
+ * Copyright(c) 2016 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __DAX_PRIVATE_H__
+#define __DAX_PRIVATE_H__
+
+#include 
+#include 
+
+/**
+ * struct dax_region - mapping infrastructure for dax devices
+ * @id: kernel-wide unique region for a memory range
+ * @base: linear address corresponding to @res
+ * @kref: to pin while other agents have a need to do lookups
+ * @dev: parent device backing this region
+ * @align: allocation and mapping alignment for child dax devices
+ * @res: physical address range of the region
+ * @pfn_flags: identify whether the pfns are paged back or not
+ */
+struct dax_region {
+   int id;
+   struct ida ida;
+   void *base;
+   struct kref kref;
+   struct device *dev;
+   unsigned int align;
+   struct resource res;
+   unsigned long pfn_flags;
+};
+
+/**
+ * struct dax_dev - subdivision of a dax region
+ * @region - parent region
+ * @inode - inode
+ * @dev - device backing the character device
+ * @cdev - core chardev data
+ * @alive - !alive + rcu grace period == no new mappings can be established
+ * @id - child id in the region
+ * @num_resources - number of physical address extents in this device
+ * @res - array of physical address ranges
+ */
+struct dax_dev {
+   struct dax_region *region;
+   struct inode *inode;
+   struct device dev;
+   struct cdev cdev;
+   bool alive;
+   int id;
+   int num_resources;
+   struct resource res[0];
+};
+#endif
diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index 80c6db279..3a29b97 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include "dax-private.h"
 #include "dax.h"
 
 static dev_t dax_devt;
@@ -34,48 +35,6 @@ static struct kmem_cache *dax_cache __read_mostly;
 static struct super_block *dax_superblock __read_mostly;
 MODULE_PARM_DESC(nr_dax, "max number of device-dax instances");
 
-/**
- * struct dax_region - mapping infrastructure for dax devices
- * @id: kernel-wide unique region for a memory range
- * @base: linear address corresponding to @res
- * @kref: to pin while other agents have a need to do lookups
- * @dev: parent device backing this region
- * @align: allocation and mapping alignment for child dax devices
- * @res: physical address range of the region
- * @pfn_flags: identify whether the pfns are paged back or not
- */
-struct dax_region {
-   int id;
-   struct ida ida;
-   void *base;
-   struct kref kref;
-   struct device *dev;
-   unsigned int align;
-   struct resource res;
-   unsigned long pfn_flags;
-};
-
-/**
- * struct dax_dev - subdivision of a dax region
- * @region - parent region
- * @dev - device backing the character device
- * @cdev - core chardev data
- * @alive - !alive + rcu grace period == no new mappings can be established
- * @id - child id in the region
- * @num_resources - number of physical address extents in this device
- * @res - array of physical address ranges
- */
-struct dax_dev {
-   struct dax_region *region;
-   struct inode *inode;
-   struct device dev;
-   struct cdev cdev;
-   bool alive;
-   int id;
-   int num_resources;
-   struct resource res[0];
-};
-
 static ssize_t id_show(struct device *dev,
struct device_attribute *attr, char *buf)
 {
@@ -396,7 +355,8 @@ static int check_vma(struct dax_dev *dax_dev, struct vm_area_struct *vma,
return 0;
 }
 
-static phys_addr_t pgoff_to_phys(struct dax_dev *dax_dev, pgoff_t pgoff,
+/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c */
+__weak phys_addr_t dax_pgoff_to_phys(struct dax_dev *dax_dev, pgoff_t pgoff,
unsigned long size)
 {
struct
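
The hunk is cut off in this archive, so the function body is not shown.
As a rough idea of what translating a page offset to a physical address
over several extents involves, here is a userspace sketch. It is not the
body from the patch: the extent layout, the 4 KiB page size, the
UINT64_MAX error value and every name are assumptions of the sketch.

```c
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096ULL	/* assumed page size for the sketch */

struct extent {
	uint64_t start;	/* first byte */
	uint64_t end;	/* last byte, inclusive */
};

/*
 * Translate a page offset into a device made of 'n' extents to a
 * physical address, requiring 'size' bytes to fit inside one extent.
 * Returns UINT64_MAX when the offset is out of range or the mapping
 * would straddle an extent boundary.
 */
static uint64_t demo_pgoff_to_phys(const struct extent *ext, int n,
				   uint64_t pgoff, uint64_t size)
{
	int i;

	for (i = 0; i < n; i++) {
		uint64_t len = ext[i].end - ext[i].start + 1;
		uint64_t off = pgoff * DEMO_PAGE_SIZE;

		if (off < len) {
			/* the whole mapping must stay inside this extent */
			if (size == 0 || size > len - off)
				return UINT64_MAX;
			return ext[i].start + off;
		}
		/* skip this extent and keep looking in the next one */
		pgoff -= len / DEMO_PAGE_SIZE;
	}
	return UINT64_MAX;
}
```

A mock replacement only has to honour the same contract for its callers,
whatever backing memory it translates to.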

[PATCH v6 1/4] libnvdimm: add mechanism to publish badblocks at the region level

2017-04-07 Thread Dave Jiang
The badblocks sysfs file will be exported at the region level. When the
nvdimm event notifier fires for NVDIMM_REVALIDATE_POISON, the badblocks in
the region will be updated.

Signed-off-by: Dave Jiang 
Reviewed-by: Johannes Thumshirn 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/nd.h  |1 +
 drivers/nvdimm/region.c  |   24 
 drivers/nvdimm/region_devs.c |   19 +++
 3 files changed, 44 insertions(+)

diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 2a99c83..c3b33cf 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -154,6 +154,7 @@ struct nd_region {
u64 ndr_start;
int id, num_lanes, ro, numa_node;
void *provider_data;
+   struct badblocks bb;
struct nd_interleave_set *nd_set;
struct nd_percpu_lane __percpu *lane;
struct nd_mapping mapping[0];
diff --git a/drivers/nvdimm/region.c b/drivers/nvdimm/region.c
index 8f24177..869a886 100644
--- a/drivers/nvdimm/region.c
+++ b/drivers/nvdimm/region.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include "nd-core.h"
 #include "nd.h"
 
 static int nd_region_probe(struct device *dev)
@@ -52,6 +53,17 @@ static int nd_region_probe(struct device *dev)
if (rc && err && rc == err)
return -ENODEV;
 
+   if (is_nd_pmem(&nd_region->dev)) {
+   struct resource ndr_res;
+
+   if (devm_init_badblocks(dev, &nd_region->bb))
+   return -ENODEV;
+   ndr_res.start = nd_region->ndr_start;
+   ndr_res.end = nd_region->ndr_start + nd_region->ndr_size - 1;
+   nvdimm_badblocks_populate(nd_region,
+   &nd_region->bb, &ndr_res);
+   }
+
nd_region->btt_seed = nd_btt_create(nd_region);
nd_region->pfn_seed = nd_pfn_create(nd_region);
nd_region->dax_seed = nd_dax_create(nd_region);
@@ -104,6 +116,18 @@ static int child_notify(struct device *dev, void *data)
 
 static void nd_region_notify(struct device *dev, enum nvdimm_event event)
 {
+   if (event == NVDIMM_REVALIDATE_POISON) {
+   struct nd_region *nd_region = to_nd_region(dev);
+   struct resource res;
+
+   if (is_nd_pmem(&nd_region->dev)) {
+   res.start = nd_region->ndr_start;
+   res.end = nd_region->ndr_start +
+   nd_region->ndr_size - 1;
+   nvdimm_badblocks_populate(nd_region,
+   &nd_region->bb, &res);
+   }
+   }
device_for_each_child(dev, &event, child_notify);
 }
 
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index b7cb506..3500fc8 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -448,6 +448,21 @@ static ssize_t read_only_store(struct device *dev,
 }
 static DEVICE_ATTR_RW(read_only);
 
+static ssize_t nd_badblocks_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct nd_region *nd_region = to_nd_region(dev);
+
+   return badblocks_show(&nd_region->bb, buf, 0);
+}
+static struct device_attribute dev_attr_nd_badblocks = {
+   .attr = {
+   .name = "badblocks",
+   .mode = S_IRUGO
+   },
+   .show = nd_badblocks_show,
+};
+
 static struct attribute *nd_region_attributes[] = {
&dev_attr_size.attr,
&dev_attr_nstype.attr,
@@ -460,6 +475,7 @@ static struct attribute *nd_region_attributes[] = {
&dev_attr_available_size.attr,
&dev_attr_namespace_seed.attr,
&dev_attr_init_namespaces.attr,
+   &dev_attr_nd_badblocks.attr,
NULL,
 };
 
@@ -476,6 +492,9 @@ static umode_t region_visible(struct kobject *kobj, struct attribute *a, int n)
if (!is_nd_pmem(dev) && a == &dev_attr_dax_seed.attr)
return 0;
 
+   if (!is_nd_pmem(dev) && a == &dev_attr_nd_badblocks.attr)
+   return 0;
+
if (a != &dev_attr_set_cookie.attr
&& a != &dev_attr_available_size.attr)
return a->mode;



[PATCH v6 3/4] libnvdimm: add support for clear poison list and badblocks for device dax

2017-04-07 Thread Dave Jiang
Provide a mechanism to clear the poison list via the ndctl
ND_CMD_CLEAR_ERROR call. We update the poison list and also the badblocks
at the region level if the region is in dax mode, or in pmem mode and not
active. In other words, we force badblocks to be cleared through write
requests if the address is currently accessed through a block device;
otherwise it can only be done via the ioctl+dsm path.

Signed-off-by: Dave Jiang 
Reviewed-by: Johannes Thumshirn 
---
 drivers/nvdimm/bus.c  |   82 -
 drivers/nvdimm/core.c |   17 +++--
 drivers/nvdimm/region.c   |   25 ++
 include/linux/libnvdimm.h |7 +++-
 4 files changed, 117 insertions(+), 14 deletions(-)

diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 351bac8..2ae7658 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -27,6 +27,7 @@
 #include 
 #include "nd-core.h"
 #include "nd.h"
+#include "pfn.h"
 
 int nvdimm_major;
 static int nvdimm_bus_major;
@@ -218,11 +219,19 @@ long nvdimm_clear_poison(struct device *dev, phys_addr_t phys,
if (cmd_rc < 0)
return cmd_rc;
 
-   nvdimm_clear_from_poison_list(nvdimm_bus, phys, len);
return clear_err.cleared;
 }
 EXPORT_SYMBOL_GPL(nvdimm_clear_poison);
 
+void __nvdimm_bus_badblocks_clear(struct nvdimm_bus *nvdimm_bus,
+   struct resource *res)
+{
+   lockdep_assert_held(&nvdimm_bus->reconfig_mutex);
+   device_for_each_child(&nvdimm_bus->dev, (void *)res,
+   nvdimm_region_badblocks_clear);
+}
+EXPORT_SYMBOL_GPL(__nvdimm_bus_badblocks_clear);
+
 static int nvdimm_bus_match(struct device *dev, struct device_driver *drv);
 
 static struct bus_type nvdimm_bus_type = {
@@ -769,16 +778,55 @@ void wait_nvdimm_bus_probe_idle(struct device *dev)
} while (true);
 }
 
-static int pmem_active(struct device *dev, void *data)
+static int nd_pmem_forget_poison_check(struct device *dev, void *data)
 {
-   if (is_nd_pmem(dev) && dev->driver)
+   struct nd_cmd_clear_error *clear_err =
+   (struct nd_cmd_clear_error *)data;
+   struct nd_btt *nd_btt = is_nd_btt(dev) ? to_nd_btt(dev) : NULL;
+   struct nd_pfn *nd_pfn = is_nd_pfn(dev) ? to_nd_pfn(dev) : NULL;
+   struct nd_dax *nd_dax = is_nd_dax(dev) ? to_nd_dax(dev) : NULL;
+   struct nd_namespace_common *ndns = NULL;
+   struct nd_namespace_io *nsio;
+   resource_size_t offset = 0, end_trunc = 0, start, end, pstart, pend;
+
+   if (nd_dax || !dev->driver)
+   return 0;
+
+   start = clear_err->address;
+   end = clear_err->address + clear_err->cleared - 1;
+
+   if (nd_btt || nd_pfn || nd_dax) {
+   if (nd_btt)
+   ndns = nd_btt->ndns;
+   else if (nd_pfn)
+   ndns = nd_pfn->ndns;
+   else if (nd_dax)
+   ndns = nd_dax->nd_pfn.ndns;
+
+   if (!ndns)
+   return 0;
+   } else
+   ndns = to_ndns(dev);
+
+   nsio = to_nd_namespace_io(&ndns->dev);
+   pstart = nsio->res.start + offset;
+   pend = nsio->res.end - end_trunc;
+
+   if ((pstart >= start) && (pend <= end))
return -EBUSY;
+
return 0;
+
+}
+
+static int nd_ns_forget_poison_check(struct device *dev, void *data)
+{
+   return device_for_each_child(dev, data, nd_pmem_forget_poison_check);
 }
 
 /* set_config requires an idle interleave set */
 static int nd_cmd_clear_to_send(struct nvdimm_bus *nvdimm_bus,
-   struct nvdimm *nvdimm, unsigned int cmd)
+   struct nvdimm *nvdimm, unsigned int cmd, void *data)
 {
struct nvdimm_bus_descriptor *nd_desc = nvdimm_bus->nd_desc;
 
@@ -792,8 +840,8 @@ static int nd_cmd_clear_to_send(struct nvdimm_bus *nvdimm_bus,
 
/* require clear error to go through the pmem driver */
if (!nvdimm && cmd == ND_CMD_CLEAR_ERROR)
-   return device_for_each_child(&nvdimm_bus->dev, NULL,
-   pmem_active);
+   return device_for_each_child(&nvdimm_bus->dev, data,
+   nd_ns_forget_poison_check);
 
if (!nvdimm || cmd != ND_CMD_SET_CONFIG_DATA)
return 0;
@@ -820,7 +868,7 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, struct nvdimm *nvdimm,
const char *cmd_name, *dimm_name;
unsigned long cmd_mask;
void *buf;
-   int rc, i;
+   int rc, i, cmd_rc;
 
if (nvdimm) {
desc = nd_cmd_dimm_desc(cmd);
@@ -927,13 +975,29 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, struct nvdimm *nvdimm,
}
 
nvdimm_bus_lock(&nvdimm_bus->dev);
-   rc = nd_cmd_clear_to_send(nvdimm_bus, nvdimm, cmd);
+   rc = nd_cmd_clear_to_send(nvdimm_bus, nvdimm, cmd, buf);
if (rc)
goto out_unlock;
 
-   rc = nd_desc->ndctl(nd_desc, nvdimm, cmd, buf, buf_len
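
nd_cmd_clear_to_send() depends on device_for_each_child() stopping at the
first callback that returns nonzero and handing that value back, which is
how -EBUSY from nd_pmem_forget_poison_check() would reach the ioctl path.
A userspace model of that contract follows; the array of "children" and
all demo_* names are stand-ins for illustration, not kernel APIs.

```c
#include <errno.h>
#include <stddef.h>

struct demo_dev {
	int busy;	/* stand-in for "namespace is claimed by a driver" */
};

typedef int (*demo_child_fn)(struct demo_dev *dev, void *data);

/* Call fn on each child; stop and return the first nonzero result. */
static int demo_for_each_child(struct demo_dev *children, size_t n,
			       void *data, demo_child_fn fn)
{
	size_t i;
	int rc = 0;

	for (i = 0; i < n && rc == 0; i++)
		rc = fn(&children[i], data);
	return rc;
}

/* Callback: veto the operation if a child is busy, counting visits. */
static int demo_busy_check(struct demo_dev *dev, void *data)
{
	int *visited = data;

	(*visited)++;
	return dev->busy ? -EBUSY : 0;
}
```

Nesting two such walks, as nd_ns_forget_poison_check() does, keeps the
same property: the first veto at the inner level ends the outer walk too.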

[PATCH v6 2/4] libnvdimm: Add 'resource' sysfs attribute to regions

2017-04-07 Thread Dave Jiang
Add a sysfs attribute that exports the physical start address of the
region. This supports clearing poison from a user application via
ND_IOCTL_CLEAR_ERROR.

Signed-off-by: Dave Jiang 
Reviewed-by: Johannes Thumshirn 
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/region_devs.c |   13 +
 1 file changed, 13 insertions(+)

diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 3500fc8..8de5a04 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -463,6 +463,15 @@ static struct device_attribute dev_attr_nd_badblocks = {
.show = nd_badblocks_show,
 };
 
+static ssize_t resource_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   struct nd_region *nd_region = to_nd_region(dev);
+
+   return sprintf(buf, "%#llx\n", nd_region->ndr_start);
+}
+static DEVICE_ATTR_RO(resource);
+
 static struct attribute *nd_region_attributes[] = {
&dev_attr_size.attr,
&dev_attr_nstype.attr,
@@ -476,6 +485,7 @@ static struct attribute *nd_region_attributes[] = {
&dev_attr_namespace_seed.attr,
&dev_attr_init_namespaces.attr,
&dev_attr_nd_badblocks.attr,
+   &dev_attr_resource.attr,
NULL,
 };
 
@@ -495,6 +505,9 @@ static umode_t region_visible(struct kobject *kobj, struct attribute *a, int n)
if (!is_nd_pmem(dev) && a == &dev_attr_nd_badblocks.attr)
return 0;
 
+   if (!is_nd_pmem(dev) && a == &dev_attr_resource.attr)
+   return 0;
+
if (a != &dev_attr_set_cookie.attr
&& a != &dev_attr_available_size.attr)
return a->mode;
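
The region_visible() hunk follows the usual sysfs pattern: the callback
returns 0 to hide an attribute and the attribute's mode to show it. A
simplified userspace model is sketched below. It matches on attribute
names where the kernel compares attribute pointers, it leaves out the
other checks in region_visible(), and the demo_* names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

struct demo_attr {
	const char *name;
	unsigned short mode;	/* e.g. 0444 for read-only */
};

/* Return the attribute's mode if it should appear, or 0 to hide it. */
static unsigned short demo_visible(int is_pmem, const struct demo_attr *a)
{
	if (!is_pmem && (strcmp(a->name, "badblocks") == 0 ||
			 strcmp(a->name, "resource") == 0))
		return 0;
	return a->mode;
}
```

With this shape, non-pmem regions never expose "badblocks" or "resource",
so userspace can treat the absence of the file as "not applicable".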



[PATCH v6 4/4] device-dax, tools/testing/nvdimm: enable device-dax with mock resources

2017-04-07 Thread Dave Jiang
Provide a replacement pgoff_to_phys() that translates an nfit_test
resource (allocated by vmalloc()) to a pfn.

Signed-off-by: Dave Jiang 
Reviewed-by: Johannes Thumshirn 
Signed-off-by: Dan Williams 
---
 drivers/dax/dax-private.h  |   61 
 drivers/dax/dax.c  |   52 --
 tools/testing/nvdimm/Kbuild|3 +-
 tools/testing/nvdimm/dax-dev.c |   49 
 4 files changed, 118 insertions(+), 47 deletions(-)
 create mode 100644 drivers/dax/dax-private.h
 create mode 100644 tools/testing/nvdimm/dax-dev.c

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
new file mode 100644
index 000..c45ac94a
--- /dev/null
+++ b/drivers/dax/dax-private.h
@@ -0,0 +1,61 @@
+/*
+ * Copyright(c) 2016 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __DAX_PRIVATE_H__
+#define __DAX_PRIVATE_H__
+
+#include 
+#include 
+
+/**
+ * struct dax_region - mapping infrastructure for dax devices
+ * @id: kernel-wide unique region for a memory range
+ * @base: linear address corresponding to @res
+ * @kref: to pin while other agents have a need to do lookups
+ * @dev: parent device backing this region
+ * @align: allocation and mapping alignment for child dax devices
+ * @res: physical address range of the region
+ * @pfn_flags: identify whether the pfns are paged back or not
+ */
+struct dax_region {
+   int id;
+   struct ida ida;
+   void *base;
+   struct kref kref;
+   struct device *dev;
+   unsigned int align;
+   struct resource res;
+   unsigned long pfn_flags;
+};
+
+/**
+ * struct dax_dev - subdivision of a dax region
+ * @region - parent region
+ * @inode - inode
+ * @dev - device backing the character device
+ * @cdev - core chardev data
+ * @alive - !alive + rcu grace period == no new mappings can be established
+ * @id - child id in the region
+ * @num_resources - number of physical address extents in this device
+ * @res - array of physical address ranges
+ */
+struct dax_dev {
+   struct dax_region *region;
+   struct inode *inode;
+   struct device dev;
+   struct cdev cdev;
+   bool alive;
+   int id;
+   int num_resources;
+   struct resource res[0];
+};
+#endif
diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index 80c6db279..3a29b97 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include "dax-private.h"
 #include "dax.h"
 
 static dev_t dax_devt;
@@ -34,48 +35,6 @@ static struct kmem_cache *dax_cache __read_mostly;
 static struct super_block *dax_superblock __read_mostly;
 MODULE_PARM_DESC(nr_dax, "max number of device-dax instances");
 
-/**
- * struct dax_region - mapping infrastructure for dax devices
- * @id: kernel-wide unique region for a memory range
- * @base: linear address corresponding to @res
- * @kref: to pin while other agents have a need to do lookups
- * @dev: parent device backing this region
- * @align: allocation and mapping alignment for child dax devices
- * @res: physical address range of the region
- * @pfn_flags: identify whether the pfns are paged back or not
- */
-struct dax_region {
-   int id;
-   struct ida ida;
-   void *base;
-   struct kref kref;
-   struct device *dev;
-   unsigned int align;
-   struct resource res;
-   unsigned long pfn_flags;
-};
-
-/**
- * struct dax_dev - subdivision of a dax region
- * @region - parent region
- * @dev - device backing the character device
- * @cdev - core chardev data
- * @alive - !alive + rcu grace period == no new mappings can be established
- * @id - child id in the region
- * @num_resources - number of physical address extents in this device
- * @res - array of physical address ranges
- */
-struct dax_dev {
-   struct dax_region *region;
-   struct inode *inode;
-   struct device dev;
-   struct cdev cdev;
-   bool alive;
-   int id;
-   int num_resources;
-   struct resource res[0];
-};
-
 static ssize_t id_show(struct device *dev,
struct device_attribute *attr, char *buf)
 {
@@ -396,7 +355,8 @@ static int check_vma(struct dax_dev *dax_dev, struct vm_area_struct *vma,
return 0;
 }
 
-static phys_addr_t pgoff_to_phys(struct dax_dev *dax_dev, pgoff_t pgoff,
+/* see "strong" declaration in tools/testing/nvdimm/dax-dev.c */
+__weak phys_addr_t dax_pgoff_to_phys(struct dax_dev *dax_dev, pgoff_t pgoff,
unsigned long size)
 {
struct

[ndctl PATCH v4 1/6] libndctl: add a ndctl_namespace_is_active helper

2017-04-07 Thread Vishal Verma
The pattern of checking if a namespace is currently active was repeated
in many places. Convert the scattered usage into a libndctl API. This
gets rid of util_namespace_active from util/json.c which was an awkward
place for this anyway.
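
The rule the new helper consolidates can be sketched as a pure predicate, independent of libndctl. This is only an illustration: `struct ns_state` and `ns_is_active()` are hypothetical stand-ins for the `ndctl_*_get_*()` and `ndctl_*_is_enabled()` queries and are not part of the library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of a namespace; not a libndctl type. */
struct ns_state {
	bool has_btt, btt_enabled;
	bool has_pfn, pfn_enabled;
	bool has_dax, dax_enabled;
	bool ns_enabled;	/* the raw namespace device is enabled */
};

/*
 * A namespace counts as active if any claiming personality (btt, pfn
 * or dax) is enabled, or if nothing claims it and the raw namespace
 * itself is enabled.
 */
bool ns_is_active(const struct ns_state *s)
{
	assert(s != NULL);

	if (s->has_btt && s->btt_enabled)
		return true;
	if (s->has_pfn && s->pfn_enabled)
		return true;
	if (s->has_dax && s->dax_enabled)
		return true;
	if (!s->has_btt && !s->has_pfn && !s->has_dax)
		return s->ns_enabled;
	return false;
}
```

Note that a namespace claimed by a disabled btt/pfn/dax instance is reported inactive even when the raw namespace device is enabled, which matches the condition in the patch.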

Cc: Dan Williams 
Signed-off-by: Vishal Verma 
Signed-off-by: Dan Williams 
---
 ndctl/builtin-list.c   |  2 +-
 ndctl/lib/libndctl.c   | 15 +++
 ndctl/lib/libndctl.sym |  1 +
 ndctl/libndctl.h.in|  2 ++
 util/json.c| 17 +
 util/json.h|  1 -
 6 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/ndctl/builtin-list.c b/ndctl/builtin-list.c
index e8d0070..536d333 100644
--- a/ndctl/builtin-list.c
+++ b/ndctl/builtin-list.c
@@ -84,7 +84,7 @@ static struct json_object *list_namespaces(struct ndctl_region *region,
if (param.mode && mode_to_type(param.mode) != mode)
continue;
 
-   if (!list.idle && !util_namespace_active(ndns))
+   if (!list.idle && !ndctl_namespace_is_active(ndns))
continue;
 
if (!jnamespaces) {
diff --git a/ndctl/lib/libndctl.c b/ndctl/lib/libndctl.c
index 090ec0b..ae029c5 100644
--- a/ndctl/lib/libndctl.c
+++ b/ndctl/lib/libndctl.c
@@ -3234,6 +3234,21 @@ static void region_refresh_children(struct ndctl_region *region)
daxs_init(region);
 }
 
+NDCTL_EXPORT bool ndctl_namespace_is_active(struct ndctl_namespace *ndns)
+{
+   struct ndctl_btt *btt = ndctl_namespace_get_btt(ndns);
+   struct ndctl_pfn *pfn = ndctl_namespace_get_pfn(ndns);
+   struct ndctl_dax *dax = ndctl_namespace_get_dax(ndns);
+
+   if ((btt && ndctl_btt_is_enabled(btt))
+   || (pfn && ndctl_pfn_is_enabled(pfn))
+   || (dax && ndctl_dax_is_enabled(dax))
+   || (!btt && !pfn && !dax
+   && ndctl_namespace_is_enabled(ndns)))
+   return true;
+   return false;
+}
+
 /*
  * Return 0 if enabled, < 0 if failed to enable, and > 0 if claimed by
  * another device and that device is enabled.  In the > 0 case a
diff --git a/ndctl/lib/libndctl.sym b/ndctl/lib/libndctl.sym
index ca5165a..705ec4c 100644
--- a/ndctl/lib/libndctl.sym
+++ b/ndctl/lib/libndctl.sym
@@ -172,6 +172,7 @@ global:
ndctl_namespace_enable;
ndctl_namespace_disable;
ndctl_namespace_disable_invalidate;
+   ndctl_namespace_is_active;
ndctl_namespace_is_valid;
ndctl_namespace_is_configured;
ndctl_namespace_delete;
diff --git a/ndctl/libndctl.h.in b/ndctl/libndctl.h.in
index d38aa45..586eb26 100644
--- a/ndctl/libndctl.h.in
+++ b/ndctl/libndctl.h.in
@@ -13,6 +13,7 @@
 #ifndef _LIBNDCTL_H_
 #define _LIBNDCTL_H_
 
+#include 
 #include 
 #include 
 
@@ -484,6 +485,7 @@ int ndctl_namespace_is_enabled(struct ndctl_namespace *ndns);
 int ndctl_namespace_enable(struct ndctl_namespace *ndns);
 int ndctl_namespace_disable(struct ndctl_namespace *ndns);
 int ndctl_namespace_disable_invalidate(struct ndctl_namespace *ndns);
+bool ndctl_namespace_is_active(struct ndctl_namespace *ndns);
 int ndctl_namespace_is_valid(struct ndctl_namespace *ndns);
 int ndctl_namespace_is_configured(struct ndctl_namespace *ndns);
 int ndctl_namespace_delete(struct ndctl_namespace *ndns);
diff --git a/util/json.c b/util/json.c
index d6a8d4c..82d8073 100644
--- a/util/json.c
+++ b/util/json.c
@@ -86,21 +86,6 @@ struct json_object *util_dimm_to_json(struct ndctl_dimm *dimm)
return NULL;
 }
 
-bool util_namespace_active(struct ndctl_namespace *ndns)
-{
-   struct ndctl_btt *btt = ndctl_namespace_get_btt(ndns);
-   struct ndctl_pfn *pfn = ndctl_namespace_get_pfn(ndns);
-   struct ndctl_dax *dax = ndctl_namespace_get_dax(ndns);
-
-   if ((btt && ndctl_btt_is_enabled(btt))
-   || (pfn && ndctl_pfn_is_enabled(pfn))
-   || (dax && ndctl_dax_is_enabled(dax))
-   || (!btt && !pfn && !dax
-   && ndctl_namespace_is_enabled(ndns)))
-   return true;
-   return false;
-}
-
 struct json_object *util_daxctl_dev_to_json(struct daxctl_dev *dev)
 {
const char *devname = daxctl_dev_get_devname(dev);
@@ -334,7 +319,7 @@ struct json_object *util_namespace_to_json(struct ndctl_namespace *ndns,
json_object_object_add(jndns, "blockdev", jobj);
}
 
-   if (!util_namespace_active(ndns)) {
+   if (!ndctl_namespace_is_active(ndns)) {
jobj = json_object_new_string("disabled");
if (!jobj)
goto err;
diff --git a/util/json.h b/util/json.h
index a9afb2d..2449c2d 100644
--- a/util/json.h
+++ b/util/json.h
@@ -6,7 +6,6 @@
 
 struct json_object;
 void util_display_json_array(FILE *f_out, struct json_object *jarray, int jflag);
-bool util_namespace_active(struct ndctl_namespace *ndns);
 struct json_object *util_bus_to_json(struct ndctl_bus

[ndctl PATCH v4 0/6] Add ndctl check-namespace

2017-04-07 Thread Vishal Verma
Changes in v4:
- Change the bitmap code to the kernel's GPLv2 routines instead of the
  LGPL ccan/bitmap.
- Upgrade a few messages from 'info' to 'err'

Changes in v3:
- Move the addition of ccan/bitmap to its own patch(es) (Dan)
- Drop the changelog update from the spec (Dan)
- Fix the [verse] section in the documentation text for check-namespace (Dan)
- Unify all namespace_disable paths to perform checking for a mounted
  filesystem (Dan)
- Change the logging to use util/log.h (Dan)
- Use BTT_START_OFFSET for the initial offset, and store it in bttc (Jeff, Dan)
- Fix a number of line > 80 chars (everything but strings) (Jeff)
- Fix short write error handling, add fsync (Jeff)
- Save system page size in bttc to avoid calling sysconf repeatedly (Jeff)
- In check_log_map(), loop through the entire log even in case of an error,
  and if there was a saved error, fail. (Jeff)
- btt-check.sh: in the post repair test, validate that the data read back
  is the same as what was written (Jeff)
- Stop playing games with pre-adding/subtracting the initial 4K offset (Jeff)
- btt_read_info doesn't need to use 'rc', return directly.

Changes in v2:
- Move checking functionality to a separate file (Dan, Jeff)
- Rename btt-structs.h to check.h (Dan)
- Don't provide a configure option for building the checker, always
  build it in. (Dan, Jeff)
- Fix the Documentation example to also include disable-namespace (Linda)
- Update the description text to note the namespace needs to be disabled
  before checking (Linda)
- Use util/size.h for sizes (Dan)
- Use --repair to do repairs instead of --dry-run to disable repairs (Dan)
- Fix btt_read_info short read error handling (Jeff)
- Simplify the map lookup/write routines (Jeff)
- Differentiate the use of BTT_PG_SIZE, sysconf(_SC_PAGESIZE), and SZ_4K
  (for the fixed start offset) in the different places they're used (Jeff)
- Add the missing msync when copying over info2 (Jeff)
- Add unit tests to test the checker (Jeff)
- Add a missing error case check in do_xaction_namespace for check
- Add a --force option that allows running on an active namespace (Jeff)
- Add a bitmap test for checking all internal blocks are referenced exactly
  once between the map and flog (Jeff)
- Remove unused #defines in check.h
- Add comments to explain what we do with raw_mode (Jeff)
- Add some sanity checking when parsing an arena's metadata (Jeff)
- Refactor some read-verify sequences into a helper that combines the two (Jeff)
- Additional bounds checking on the 'offset' in recover_first_sb attempt 3
  (Jeff)
- Add a missing ACTION_DESTROY string in parse_namespace_options (Dan)
- Use uXX, and cpu_to_XX from ccan/endian (Dan)
- Move the fletcher64 routine to util/ as it is shared by builtin-dimm.c (Dan)
- Open the raw block device only once with O_EXCL instead of every time on
  read/write/mmap (Dan)
- Add a new 'inform' routine in util/usage.c, and use it for some non-critical
  messages (Dan)
- Remove namespace_is_offline() from builtin-check.c. Instead, use
  util_namespace_active() from util/json.c
- Add a missing return value check after info block restoration in
  discover_arenas

Vishal Verma (6):
  libndctl: add a ndctl_namespace_is_active helper
  libndctl: add a ndctl_namespace_disable_safe() API
  ndctl: move the fletcher64 routine to util/
  util: add util/bitmap in preparation for the BTT checker
  ndctl: add a BTT check utility
  ndctl, test: Add a unit test for the BTT checker

 Documentation/Makefile.am   |   1 +
 Documentation/ndctl-check-namespace.txt |  64 +++
 Documentation/ndctl.txt |   1 +
 Makefile.am |   4 +-
 builtin.h   |   1 +
 contrib/ndctl   |   3 +
 ndctl/Makefile.am   |   1 +
 ndctl/builtin-check.c   | 988 
 ndctl/builtin-dimm.c|  18 +-
 ndctl/builtin-list.c|   2 +-
 ndctl/builtin-xaction-namespace.c   | 112 ++--
 ndctl/check.h   | 127 
 ndctl/lib/libndctl.c|  59 ++
 ndctl/lib/libndctl.sym  |   2 +
 ndctl/libndctl.h.in |   3 +
 ndctl/ndctl.c   |   1 +
 test/Makefile.am|   5 +-
 test/btt-check.sh   | 172 ++
 util/bitmap.c   | 115 
 util/bitmap.h   |  32 ++
 util/fletcher.c |  23 +
 util/fletcher.h |   8 +
 util/json.c |  17 +-
 util/json.h |   1 -
 util/util.h |  12 +
 25 files changed, 1696 insertions(+), 76 deletions(-)
 create mode 100644 Documentation/ndctl-check-namespace.txt
 create mode 100644 ndctl/builtin-check.c
 create mode 100644 ndctl/check.h
 create mode 100755 test/btt-check.sh
 create mode 100644 util/bitma

[ndctl PATCH v4 2/6] libndctl: add a ndctl_namespace_disable_safe() API

2017-04-07 Thread Vishal Verma
Disabling a namespace which has a filesystem mounted on it is unsafe as
filesystems are not prepared for a block device to be yanked from under
them. The destroy_namespace routine checked for an active mount by
performing an O_EXCL open of the backing block device, but many other
callers of ndctl_namespace_disable* could benefit from this checking.

Codify the mounted filesystem check in a new libndctl API -
ndctl_namespace_disable_safe(), and use it for the destroy/disable
namespace ndctl commands as well as the upcoming check-namespace
command.
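
The mounted-filesystem check described above rests on a Linux-specific behavior: opening a block device with O_EXCL (and without O_CREAT) fails with EBUSY while the device is in use by the system, for example while it is mounted. A minimal sketch of that probe follows; `probe_exclusive()` is a hypothetical helper for illustration, not the libndctl implementation.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 0 if @path could be opened exclusively,
 * or a negative errno value (-EBUSY for a block device that is in use).
 *
 * The exclusive claim only lasts while the descriptor is open, so a
 * real caller would disable the namespace before closing it; this
 * sketch closes it right away and only reports the probe result.
 */
int probe_exclusive(const char *path)
{
	int fd;

	assert(path != NULL);

	fd = open(path, O_RDWR | O_EXCL);
	if (fd < 0)
		return -errno;
	close(fd);
	return 0;
}
```

A caller would typically build the path from the namespace's block device name (as in "/dev/<bdev>") and treat any non-zero return as a reason not to disable the namespace.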

Cc: Dan Williams 
Signed-off-by: Vishal Verma 
Signed-off-by: Dan Williams 
---
 ndctl/builtin-xaction-namespace.c | 46 +++
 ndctl/lib/libndctl.c  | 44 +
 ndctl/lib/libndctl.sym|  1 +
 ndctl/libndctl.h.in   |  1 +
 4 files changed, 54 insertions(+), 38 deletions(-)

diff --git a/ndctl/builtin-xaction-namespace.c b/ndctl/builtin-xaction-namespace.c
index 46d651e..d6b0c37 100644
--- a/ndctl/builtin-xaction-namespace.c
+++ b/ndctl/builtin-xaction-namespace.c
@@ -731,10 +731,7 @@ static int namespace_destroy(struct ndctl_region *region,
struct ndctl_pfn *pfn = ndctl_namespace_get_pfn(ndns);
struct ndctl_dax *dax = ndctl_namespace_get_dax(ndns);
struct ndctl_btt *btt = ndctl_namespace_get_btt(ndns);
-   const char *bdev = NULL;
-   bool dax_active = false;
-   char path[50];
-   int fd, rc;
+   int rc;
 
if (ndctl_region_get_ro(region)) {
error("%s: read-only, re-configuration disabled\n",
@@ -742,42 +739,15 @@ static int namespace_destroy(struct ndctl_region *region,
return -ENXIO;
}
 
-   if (pfn && ndctl_pfn_is_enabled(pfn))
-   bdev = ndctl_pfn_get_block_device(pfn);
-   else if (dax && ndctl_dax_is_enabled(dax))
-   dax_active = true;
-   else if (btt && ndctl_btt_is_enabled(btt))
-   bdev = ndctl_btt_get_block_device(btt);
-   else if (ndctl_namespace_is_enabled(ndns))
-   bdev = ndctl_namespace_get_block_device(ndns);
-
-   if ((bdev || dax_active) && !force) {
+   if (ndctl_namespace_is_active(ndns) && !force) {
error("%s is active, specify --force for re-configuration\n",
devname);
return -EBUSY;
-   } else if (bdev) {
-   sprintf(path, "/dev/%s", bdev);
-   fd = open(path, O_RDWR|O_EXCL);
-   if (fd >= 0) {
-   /*
-* Got it, now block new mounts while we have it
-* pinned.
-*/
-   ndctl_namespace_disable_invalidate(ndns);
-   close(fd);
-   } else {
-   /*
-* Yes, TOCTOU hole, but if you're racing namespace
-* creation you have other problems, and there's nothing
-* stopping the !bdev case from racing to mount an fs or
-* re-enabling the namepace.
-*/
-   error("%s: %s failed exlusive open: %s\n",
-   devname, bdev, strerror(errno));
-   return -errno;
-   }
-   } else if (dax_active)
-   ndctl_namespace_disable_invalidate(ndns);
+   } else {
+   rc = ndctl_namespace_disable_safe(ndns);
+   if (rc)
+   return rc;
+   }
 
if (pfn || btt || dax) {
rc = zero_info_block(ndns);
@@ -869,7 +839,7 @@ static int do_xaction_namespace(const char *namespace,
continue;
switch (action) {
case ACTION_DISABLE:
-   rc = ndctl_namespace_disable_invalidate(ndns);
+   rc = ndctl_namespace_disable_safe(ndns);
break;
case ACTION_ENABLE:
rc = ndctl_namespace_enable(ndns);
diff --git a/ndctl/lib/libndctl.c b/ndctl/lib/libndctl.c
index ae029c5..a3481b1 100644
--- a/ndctl/lib/libndctl.c
+++ b/ndctl/lib/libndctl.c
@@ -3346,6 +3346,50 @@ NDCTL_EXPORT int ndctl_namespace_disable_invalidate(struct ndctl_namespace *ndns
return ndctl_namespace_disable(ndns);
 }
 
+NDCTL_EXPORT int ndctl_namespace_disable_safe(struct ndctl_namespace *ndns)
+{
+   const char *devname = ndctl_namespace_get_devname(ndns);
+   struct ndctl_ctx *ctx = ndctl_namespace_get_ctx(ndns);
+   struct ndctl_pfn *pfn = ndctl_namespace_get_pfn(ndns);
+   struct ndctl_btt *btt = ndctl_namespace_get_btt(ndns);
+   const char *bdev = NULL;
+   char path[50];
+   int fd;
+
+   i

[ndctl PATCH v4 6/6] ndctl, test: Add a unit test for the BTT checker

2017-04-07 Thread Vishal Verma
Add a new unit test that will set up BTTs, corrupt them in known ways,
and test that the checker is able to detect or repair the corruption in
the expected way.

Cc: Jeff Moyer 
Cc: Dan Williams 
Signed-off-by: Vishal Verma 
Signed-off-by: Dan Williams 
---
 test/Makefile.am  |   3 +-
 test/btt-check.sh | 172 ++
 2 files changed, 174 insertions(+), 1 deletion(-)
 create mode 100755 test/btt-check.sh

diff --git a/test/Makefile.am b/test/Makefile.am
index 24afcea..b09d2dd 100644
--- a/test/Makefile.am
+++ b/test/Makefile.am
@@ -9,7 +9,8 @@ TESTS =\
create.sh \
clear.sh \
dax-errors.sh \
-   daxdev-errors.sh
+   daxdev-errors.sh \
+   btt-check.sh
 
 check_PROGRAMS =\
libndctl \
diff --git a/test/btt-check.sh b/test/btt-check.sh
new file mode 100755
index 000..11821d2
--- /dev/null
+++ b/test/btt-check.sh
@@ -0,0 +1,172 @@
+#!/bin/bash -E
+
+[ -f "../ndctl/ndctl" ] && [ -x "../ndctl/ndctl" ] && ndctl="../ndctl/ndctl"
+[ -f "./ndctl/ndctl" ] && [ -x "./ndctl/ndctl" ] && ndctl="./ndctl/ndctl"
+[ -z "$ndctl" ] && echo "Couldn't find an ndctl binary" && exit 1
+bus="nfit_test.0"
+json2var="s/[{}\",]//g; s/:/=/g"
+dev=""
+mode=""
+size=""
+sector_size=""
+blockdev=""
+bs=4096
+rc=77
+
+trap 'err $LINENO' ERR
+
+# sample json:
+# {
+#   "dev":"namespace5.0",
+#   "mode":"sector",
+#   "size":32440320,
+#   "uuid":"51805176-e124-4635-ae17-0e6a4a16671a",
+#   "sector_size":4096,
+#   "blockdev":"pmem5s"
+# }
+
+# $1: Line number
+# $2: exit code
+err()
+{
+   [ -n "$2" ] && rc="$2"
+   echo "test/btt-check: failed at line $1"
+   exit "$rc"
+}
+
+create()
+{
+   json=$($ndctl create-namespace -b "$bus" -t pmem -m sector)
+   eval "$(echo "$json" | sed -e "$json2var")"
+   [ -n "$dev" ] || err "$LINENO" 2
+   [ "$mode" = "sector" ] || err "$LINENO" 2
+   [ -n "$size" ] || err "$LINENO" 2
+   [ -n "$sector_size" ] || err "$LINENO" 2
+   [ -n "$blockdev" ] || err "$LINENO" 2
+   [ $size -gt 0 ] || err "$LINENO" 2
+}
+
+reset()
+{
+   $ndctl disable-region -b "$bus" all
+   $ndctl zero-labels -b "$bus" all
+   $ndctl enable-region -b "$bus" all
+}
+
+# re-enable the BTT namespace, and do IO to it in an attempt to
+# verify it still comes up ok, and functions as expected
+post_repair_test()
+{
+   echo "${FUNCNAME[0]}: I/O to BTT namespace"
+   test -b /dev/$blockdev
+   dd if=/dev/urandom of=test-bin bs=$sector_size count=$((size/sector_size)) > /dev/null 2>&1
+   dd if=test-bin of=/dev/$blockdev bs=$sector_size count=$((size/sector_size)) > /dev/null 2>&1
+   dd if=/dev/$blockdev of=test-bin-read bs=$sector_size count=$((size/sector_size)) > /dev/null 2>&1
+   diff test-bin test-bin-read
+   rm -f test-bin*
+   echo "done"
+}
+
+test_normal()
+{
+   echo "=== ${FUNCNAME[0]} ==="
+   # disable the namespace
+   $ndctl disable-namespace $dev
+   $ndctl check-namespace $dev
+   $ndctl enable-namespace $dev
+   post_repair_test
+}
+
+test_force()
+{
+   echo "=== ${FUNCNAME[0]} ==="
+   $ndctl check-namespace --force $dev
+   post_repair_test
+}
+
+set_raw()
+{
+   $ndctl disable-namespace $dev
+   echo -n "set raw_mode: "
+   echo 1 | tee /sys/bus/nd/devices/$dev/force_raw
+   $ndctl enable-namespace $dev
+   raw_bdev="${blockdev%%s}"
+   test -b /dev/$raw_bdev
+   raw_size="$(cat /sys/bus/nd/devices/$dev/size)"
+}
+
+unset_raw()
+{
+   $ndctl disable-namespace $dev
+   echo -n "set raw_mode: "
+   echo 0 | tee /sys/bus/nd/devices/$dev/force_raw
+   $ndctl enable-namespace $dev
+   raw_bdev=""
+}
+
+test_bad_info2()
+{
+   echo "=== ${FUNCNAME[0]} ==="
+   set_raw
+   seek="$((raw_size/bs - 1))"
+   echo "wiping info2 block (offset = $seek blocks)"
+   dd if=/dev/zero of=/dev/$raw_bdev bs=$bs count=1 seek=$seek
+   unset_raw
+   $ndctl disable-namespace $dev
+   $ndctl check-namespace $dev 2>&1 | grep "info2 needs to be restored"
+   $ndctl check-namespace --repair $dev
+   $ndctl enable-namespace $dev
+   post_repair_test
+}
+
+test_bad_info()
+{
+   echo "=== ${FUNCNAME[0]} ==="
+   set_raw
+   echo "wiping info block"
+   dd if=/dev/zero of=/dev/$raw_bdev bs=$bs count=1 seek=1
+   unset_raw
+   $ndctl disable-namespace $dev
+   $ndctl check-namespace $dev 2>&1 | grep "info block at offset 0x1000 needs to be restored"
+   $ndctl check-namespace --repair $dev
+   $ndctl enable-namespace $dev
+   post_repair_test
+}
+
+test_bitmap()
+{
+   echo "=== ${FUNCNAME[0]} ==="
+   reset && create
+   set_raw
+   # scribble over the last 4K of the map
+   rm -f /tmp/scribble
+   for (( i=0 ; i<512 ; i++ )); do
+   echo -n -e \\x1e\\x1e\\x00\\xc0\\x1e\\x1e\\x00\\xc0 >> /tmp/scribble
+   done
+   seek="$((raw_size/bs - (256*64/bs) - 2)

[ndctl PATCH v4 3/6] ndctl: move the fletcher64 routine to util/

2017-04-07 Thread Vishal Verma
In preparation for check-namespace, since it will also use the
fletcher64 routine, move it to util/ so that it can be shared by both
builtin-check.c and builtin-dimm.c
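
For readers who want to try the checksum outside the ndctl tree, here is a standalone sketch of the same loop using only standard C types. `fletcher64_host()` is a hypothetical name; it sums the 32-bit words in host byte order and leaves out the little-endian conversion that the patch's `le` flag selects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Sum the buffer as 32-bit words in host byte order. Trailing bytes
 * that do not fill a whole word are ignored, as in the original loop.
 */
uint64_t fletcher64_host(const void *addr, size_t len)
{
	const unsigned char *p = addr;
	uint32_t lo32 = 0;
	uint64_t hi32 = 0;
	size_t i;

	assert(addr != NULL || len == 0);

	for (i = 0; i < len / sizeof(uint32_t); i++) {
		uint32_t word;

		/* memcpy avoids alignment and aliasing assumptions */
		memcpy(&word, p + i * sizeof(word), sizeof(word));
		lo32 += word;
		hi32 += lo32;
	}

	return hi32 << 32 | lo32;
}
```

For the three words 1, 2, 3 the running sums are lo32 = 1, 3, 6 and hi32 = 1, 4, 10, so the result carries 10 in the upper half and 6 in the lower half.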

Cc: Dan Williams 
Signed-off-by: Vishal Verma 
Signed-off-by: Dan Williams 
---
 Makefile.am  |  3 ++-
 ndctl/builtin-dimm.c | 18 ++
 util/fletcher.c  | 23 +++
 util/fletcher.h  |  8 
 4 files changed, 35 insertions(+), 17 deletions(-)
 create mode 100644 util/fletcher.c
 create mode 100644 util/fletcher.h

diff --git a/Makefile.am b/Makefile.am
index 06cd1b0..5453b2a 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -68,6 +68,7 @@ libutil_a_SOURCES = \
util/help.c \
util/strbuf.c \
util/wrapper.c \
-   util/filter.c
+   util/filter.c \
+   util/fletcher.c
 
 nobase_include_HEADERS = daxctl/libdaxctl.h
diff --git a/ndctl/builtin-dimm.c b/ndctl/builtin-dimm.c
index 637b10b..93f9530 100644
--- a/ndctl/builtin-dimm.c
+++ b/ndctl/builtin-dimm.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -358,7 +359,7 @@ struct nvdimm_data {
 };
 
 /*
- * Note, best_seq(), inc_seq(), fletcher64(), sizeof_namespace_index()
+ * Note, best_seq(), inc_seq(), sizeof_namespace_index()
  * nvdimm_num_label_slots(), label_validate(), and label_write_index()
  * are copied from drivers/nvdimm/label.c in the Linux kernel with the
  * following modifications:
@@ -371,21 +372,6 @@ struct nvdimm_data {
  * 7/ dropped clear_bit_le() usage in label_write_index
  */
 
-static u64 fletcher64(void *addr, size_t len, bool le)
-{
-   u32 *buf = addr;
-   u32 lo32 = 0;
-   u64 hi32 = 0;
-   size_t i;
-
-   for (i = 0; i < len / sizeof(u32); i++) {
-   lo32 += le ? le32_to_cpu((le32) buf[i]) : buf[i];
-   hi32 += lo32;
-   }
-
-   return hi32 << 32 | lo32;
-}
-
 static unsigned inc_seq(unsigned seq)
 {
static const unsigned next[] = { 0, 2, 3, 1 };
diff --git a/util/fletcher.c b/util/fletcher.c
new file mode 100644
index 000..cee2fc3
--- /dev/null
+++ b/util/fletcher.c
@@ -0,0 +1,23 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * Note, fletcher64() is copied from drivers/nvdimm/label.c in the Linux kernel
+ */
+u64 fletcher64(void *addr, size_t len, bool le)
+{
+   u32 *buf = addr;
+   u32 lo32 = 0;
+   u64 hi32 = 0;
+   size_t i;
+
+   for (i = 0; i < len / sizeof(u32); i++) {
+   lo32 += le ? le32_to_cpu((le32) buf[i]) : buf[i];
+   hi32 += lo32;
+   }
+
+   return hi32 << 32 | lo32;
+}
diff --git a/util/fletcher.h b/util/fletcher.h
new file mode 100644
index 000..e3bbce3
--- /dev/null
+++ b/util/fletcher.h
@@ -0,0 +1,8 @@
+#ifndef _NDCTL_FLETCHER_H_
+#define _NDCTL_FLETCHER_H_
+
+#include 
+
+u64 fletcher64(void *addr, size_t len, bool le);
+
+#endif /* _NDCTL_FLETCHER_H_ */
-- 
2.9.3



[ndctl PATCH v4 5/6] ndctl: add a BTT check utility

2017-04-07 Thread Vishal Verma
Add the check-namespace command to ndctl. This will check the BTT
metadata layout for the given namespace, and if requested, correct any
errors found. Not all metadata corruption is detectable or fixable.

Cc: Dan Williams 
Cc: Jeff Moyer 
Cc: Linda Knippers 
Signed-off-by: Vishal Verma 
Signed-off-by: Dan Williams 
---
 Documentation/Makefile.am   |   1 +
 Documentation/ndctl-check-namespace.txt |  64 +++
 Documentation/ndctl.txt |   1 +
 builtin.h   |   1 +
 contrib/ndctl   |   3 +
 ndctl/Makefile.am   |   1 +
 ndctl/builtin-check.c   | 988 
 ndctl/builtin-xaction-namespace.c   |  66 ++-
 ndctl/check.h   | 127 
 ndctl/ndctl.c   |   1 +
 test/Makefile.am|   2 +
 util/util.h |   1 +
 12 files changed, 1254 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/ndctl-check-namespace.txt
 create mode 100644 ndctl/builtin-check.c
 create mode 100644 ndctl/check.h

diff --git a/Documentation/Makefile.am b/Documentation/Makefile.am
index 6daeb56..eea11e0 100644
--- a/Documentation/Makefile.am
+++ b/Documentation/Makefile.am
@@ -12,6 +12,7 @@ man1_MANS = \
ndctl-disable-namespace.1 \
ndctl-create-namespace.1 \
ndctl-destroy-namespace.1 \
+   ndctl-check-namespace.1 \
ndctl-list.1 \
daxctl-list.1
 
diff --git a/Documentation/ndctl-check-namespace.txt b/Documentation/ndctl-check-namespace.txt
new file mode 100644
index 000..232f22d
--- /dev/null
+++ b/Documentation/ndctl-check-namespace.txt
@@ -0,0 +1,64 @@
+ndctl-check-namespace(1)
+=
+
+NAME
+
+ndctl-check-namespace - check namespace metadata consistency
+
+SYNOPSIS
+
+[verse]
+'ndctl check-namespace'  []
+
+DESCRIPTION
+---
+
+A namespace in the 'sector' mode will have metadata on it to describe
+the kernel BTT (Block Translation Table). The check-namespace command
+can be used to check the consistency of this metadata, and optionally,
+also attempt to repair it, if it has enough information to do so.
+
+The namespace being checked has to be disabled before initiating a
+check on it as a precautionary measure. The --force option can override
+this.
+
+EXAMPLES
+
+
+Check a namespace (only report errors)
+[verse]
+ndctl disable-namespace namespace0.0
+ndctl check-namespace namespace0.0
+
+Check a namespace, and perform repairs if possible
+[verse]
+ndctl disable-namespace namespace0.0
+ndctl check-namespace --repair namespace0.0
+
+OPTIONS
+---
+-R::
+--repair::
+   Perform metadata repairs if possible. Without this option,
+   the raw namespace contents will not be touched.
+
+-f::
+--force::
+   Unless this option is specified, a check-namespace operation
+   will fail if the namespace is presently active. Specifying
+   --force causes the namespace to be disabled before checking.
+
+-v::
+--verbose::
+   Emit debug messages for the namespace check process.
+
+-r::
+--region=::
+include::xable-region-options.txt[]
+
+SEE ALSO
+
+linkndctl:ndctl-disable-namespace[1],
+linkndctl:ndctl-enable-namespace[1],
+http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf[NVDIMM Namespace
+Specification]
diff --git a/Documentation/ndctl.txt b/Documentation/ndctl.txt
index 883a59c..c26cc2f 100644
--- a/Documentation/ndctl.txt
+++ b/Documentation/ndctl.txt
@@ -34,6 +34,7 @@ SEE ALSO
 
 linkndctl:ndctl-create-namespace[1],
 linkndctl:ndctl-destroy-namespace[1],
+linkndctl:ndctl-check-namespace[1],
 linkndctl:ndctl-enable-region[1],
 linkndctl:ndctl-disable-region[1],
 linkndctl:ndctl-enable-dimm[1],
diff --git a/builtin.h b/builtin.h
index 9b66196..200bd8e 100644
--- a/builtin.h
+++ b/builtin.h
@@ -13,6 +13,7 @@ int cmd_enable_namespace(int argc, const char **argv, void *ctx);
 int cmd_create_namespace(int argc, const char **argv, void *ctx);
 int cmd_destroy_namespace(int argc, const char **argv, void *ctx);
 int cmd_disable_namespace(int argc, const char **argv, void *ctx);
+int cmd_check_namespace(int argc, const char **argv, void *ctx);
 int cmd_enable_region(int argc, const char **argv, void *ctx);
 int cmd_disable_region(int argc, const char **argv, void *ctx);
 int cmd_enable_dimm(int argc, const char **argv, void *ctx);
diff --git a/contrib/ndctl b/contrib/ndctl
index ea7303c..c97adcc 100755
--- a/contrib/ndctl
+++ b/contrib/ndctl
@@ -194,6 +194,9 @@ __ndctl_comp_non_option_args()
destroy-namespace)
opts="$(__ndctl_get_ns) all"
;;
+   check-namespace)
+   opts="$(__ndctl_get_ns -i) all"
+   ;;
enable-region)
opts="$(__ndctl_get_regions -i) all"
;;
diff --git a/ndctl/Makefile.am b/ndctl/Makefile.am
index c563e94..f9158d9 100644
--- a/ndctl/Makefile.am
+++ b/ndctl/M

[ndctl PATCH v4 4/6] util: add util/bitmap in preparation for the BTT checker

2017-04-07 Thread Vishal Verma
The BTT checker will include a bitmap test where we mark a bit for
each post-map and free block, and check if the bitmap is full. Add
util/bitmap based on the kernel's bitmap code to facilitate this.
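
The idea behind the bitmap test can be sketched without the util/bitmap helpers: mark one bit per referenced block, reject a block that is marked twice, and require every bit to be set at the end. `blocks_referenced_once()` and its parameters are hypothetical names used for illustration; the checker itself works on the BTT map and free-list entries.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/*
 * Returns true when every block number in [0, nblocks) appears exactly
 * once in refs[]. Returns false on a duplicate, an out-of-range
 * reference, a missing block, or an allocation failure.
 */
bool blocks_referenced_once(const unsigned int *refs, size_t nrefs,
			    size_t nblocks)
{
	unsigned long *map;
	size_t i, seen = 0;
	bool ok = true;

	assert(refs != NULL || nrefs == 0);

	map = calloc(nblocks / WORD_BITS + 1, sizeof(*map));
	if (!map)
		return false;

	for (i = 0; i < nrefs; i++) {
		unsigned long *word, bit;

		if (refs[i] >= nblocks) {	/* out of range */
			ok = false;
			break;
		}
		word = &map[refs[i] / WORD_BITS];
		bit = 1UL << (refs[i] % WORD_BITS);
		if (*word & bit) {		/* referenced twice */
			ok = false;
			break;
		}
		*word |= bit;
		seen++;
	}

	free(map);
	/* the bitmap is full exactly when every block was seen once */
	return ok && seen == nblocks;
}
```

Counting the newly set bits stands in for a final "is the bitmap full" scan here; the patch instead provides bitmap_full() for that step.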

Cc: Dan Williams 
Signed-off-by: Vishal Verma 
---
 Makefile.am   |   3 +-
 util/bitmap.c | 115 ++
 util/bitmap.h |  32 
 util/util.h   |  11 ++
 4 files changed, 160 insertions(+), 1 deletion(-)
 create mode 100644 util/bitmap.c
 create mode 100644 util/bitmap.h

diff --git a/Makefile.am b/Makefile.am
index 5453b2a..2b46736 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -69,6 +69,7 @@ libutil_a_SOURCES = \
util/strbuf.c \
util/wrapper.c \
util/filter.c \
-   util/fletcher.c
+   util/fletcher.c\
+   util/bitmap.c
 
 nobase_include_HEADERS = daxctl/libdaxctl.h
diff --git a/util/bitmap.c b/util/bitmap.c
new file mode 100644
index 000..31e8c3a
--- /dev/null
+++ b/util/bitmap.c
@@ -0,0 +1,115 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+unsigned long *bitmap_alloc(unsigned long nbits)
+{
+   return calloc(BITS_TO_LONGS(nbits), sizeof(unsigned long));
+}
+
+void bitmap_set(unsigned long *map, unsigned int start, int len)
+{
+   unsigned long *p = map + BIT_WORD(start);
+   const unsigned int size = start + len;
+   int bits_to_set = BITS_PER_LONG - (start % BITS_PER_LONG);
+   unsigned long mask_to_set = BITMAP_FIRST_WORD_MASK(start);
+
+   while (len - bits_to_set >= 0) {
+   *p |= mask_to_set;
+   len -= bits_to_set;
+   bits_to_set = BITS_PER_LONG;
+   mask_to_set = ~0UL;
+   p++;
+   }
+   if (len) {
+   mask_to_set &= BITMAP_LAST_WORD_MASK(size);
+   *p |= mask_to_set;
+   }
+}
+
+void bitmap_clear(unsigned long *map, unsigned int start, int len)
+{
+   unsigned long *p = map + BIT_WORD(start);
+   const unsigned int size = start + len;
+   int bits_to_clear = BITS_PER_LONG - (start % BITS_PER_LONG);
+   unsigned long mask_to_clear = BITMAP_FIRST_WORD_MASK(start);
+
+   while (len - bits_to_clear >= 0) {
+   *p &= ~mask_to_clear;
+   len -= bits_to_clear;
+   bits_to_clear = BITS_PER_LONG;
+   mask_to_clear = ~0UL;
+   p++;
+   }
+   if (len) {
+   mask_to_clear &= BITMAP_LAST_WORD_MASK(size);
+   *p &= ~mask_to_clear;
+   }
+}
+
+/**
+ * test_bit - Determine whether a bit is set
+ * @nr: bit number to test
+ * @addr: Address to start counting from
+ */
+int test_bit(unsigned int nr, const volatile unsigned long *addr)
+{
+   return 1UL & (addr[BIT_WORD(nr)] >> (nr & (BITS_PER_LONG-1)));
+}
+
+/*
+ * This is a common helper function for find_next_bit and
+ * find_next_zero_bit.  The difference is the "invert" argument, which
+ * is XORed with each fetched word before searching it for one bits.
+ */
+static unsigned long _find_next_bit(const unsigned long *addr,
+   unsigned long nbits, unsigned long start, unsigned long invert)
+{
+   unsigned long tmp;
+
+   if (!nbits || start >= nbits)
+   return nbits;
+
+   tmp = addr[start / BITS_PER_LONG] ^ invert;
+
+   /* Handle 1st word. */
+   tmp &= BITMAP_FIRST_WORD_MASK(start);
+   start = round_down(start, BITS_PER_LONG);
+
+   while (!tmp) {
+   start += BITS_PER_LONG;
+   if (start >= nbits)
+   return nbits;
+
+   tmp = addr[start / BITS_PER_LONG] ^ invert;
+   }
+
+   return min(start + __builtin_ffsl(tmp), nbits);
+}
+
+/*
+ * Find the next set bit in a memory region.
+ */
+unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
+   unsigned long offset)
+{
+   return _find_next_bit(addr, size, offset, 0UL);
+}
+
+unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
+unsigned long offset)
+{
+   return _find_next_bit(addr, size, offset, ~0UL);
+}
+
+int bitmap_full(const unsigned long *src, unsigned int nbits)
+{
+   if (small_const_nbits(nbits))
+   return ! (~(*src) & BITMAP_LAST_WORD_MASK(nbits));
+
+   return find_next_zero_bit(src, nbits, 0UL) == nbits;
+}
diff --git a/util/bitmap.h b/util/bitmap.h
new file mode 100644
index 000..826ae28
--- /dev/null
+++ b/util/bitmap.h
@@ -0,0 +1,32 @@
+#ifndef _NDCTL_BITMAP_H_
+#define _NDCTL_BITMAP_H_
+
+#include 
+#include 
+
+#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
+
+#define BIT(nr)(1UL << (nr))
+#define BIT_MASK(nr)   (1UL << ((nr) % BITS_PER_LONG))
+#define BIT_WORD(nr)   ((nr) / BITS_PER_LONG)
+#define BITS_PER_BYTE  8
+#define BITS_TO_LONGS(nr)  DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
+
+#define BITMAP_FIRST_

Re: [PATCH] x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions

2017-04-07 Thread Dan Williams
On Fri, Apr 7, 2017 at 10:41 AM, Kani, Toshimitsu  wrote:
> On Thu, 2017-04-06 at 13:59 -0700, Dan Williams wrote:
>> Before we rework the "pmem api" to stop abusing __copy_user_nocache()
>> for memcpy_to_pmem() we need to fix cases where we may strand dirty
>> data in the cpu cache. The problem occurs when copy_from_iter_pmem()
>> is used for arbitrary data transfers from userspace. There is no
>> guarantee that these transfers, performed by dax_iomap_actor(), will
>> have aligned destinations or aligned transfer lengths. Backstop the
>> usage of __copy_user_nocache() with explicit cache management in these
>> unaligned cases.
>>
>> Yes, copy_from_iter_pmem() is now too big for an inline, but
>> addressing that is saved for a later patch that moves the entirety of
>> the "pmem api" into the pmem driver directly.
>
> The change looks good to me.  Should we also avoid cache flushing in
> the case of size=4B & dest aligned by 4B?

Yes, since you fixed the 4B aligned case we should skip cache flushing
in that case. I'll send a v2.


[PATCH v2] x86, pmem: fix broken __copy_user_nocache cache-bypass assumptions

2017-04-07 Thread Dan Williams
Before we rework the "pmem api" to stop abusing __copy_user_nocache()
for memcpy_to_pmem() we need to fix cases where we may strand dirty data
in the cpu cache. The problem occurs when copy_from_iter_pmem() is used
for arbitrary data transfers from userspace. There is no guarantee that
these transfers, performed by dax_iomap_actor(), will have aligned
destinations or aligned transfer lengths. Backstop the usage of
__copy_user_nocache() with explicit cache management in these unaligned
cases.

Yes, copy_from_iter_pmem() is now too big for an inline, but addressing
that is saved for a later patch that moves the entirety of the "pmem
api" into the pmem driver directly.

Fixes: 5de490daec8b ("pmem: add copy_from_iter_pmem() and clear_pmem()")
Cc: 
Cc: 
Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Ingo Molnar 
Cc: Christoph Hellwig 
Cc: Toshi Kani 
Cc: "H. Peter Anvin" 
Cc: Al Viro 
Cc: Thomas Gleixner 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
[toshi: trailing bytes flush only needed in the 4B misalign case]
Signed-off-by: Dan Williams 
---
v2: Change the condition for flushing the last cacheline of the
destination from 8-byte to 4-byte misalignment (Toshi)

 arch/x86/include/asm/pmem.h |   41 ++---
 1 file changed, 30 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index 2c1ebeb4d737..cf4e68faedc4 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -55,7 +55,8 @@ static inline int arch_memcpy_from_pmem(void *dst, const void *src, size_t n)
  * @size:  number of bytes to write back
  *
  * Write back a cache range using the CLWB (cache line write back)
- * instruction.
+ * instruction. Note that @size is internally rounded up to be cache
+ * line size aligned.
  */
 static inline void arch_wb_cache_pmem(void *addr, size_t size)
 {
@@ -69,15 +70,6 @@ static inline void arch_wb_cache_pmem(void *addr, size_t size)
clwb(p);
 }
 
-/*
- * copy_from_iter_nocache() on x86 only uses non-temporal stores for iovec
- * iterators, so for other types (bvec & kvec) we must do a cache write-back.
- */
-static inline bool __iter_needs_pmem_wb(struct iov_iter *i)
-{
-   return iter_is_iovec(i) == false;
-}
-
 /**
  * arch_copy_from_iter_pmem - copy data from an iterator to PMEM
  * @addr:  PMEM destination address
@@ -94,7 +86,34 @@ static inline size_t arch_copy_from_iter_pmem(void *addr, size_t bytes,
/* TODO: skip the write-back by always using non-temporal stores */
len = copy_from_iter_nocache(addr, bytes, i);
 
-   if (__iter_needs_pmem_wb(i))
+   /*
+* In the iovec case on x86_64 copy_from_iter_nocache() uses
+* non-temporal stores for the bulk of the transfer, but we need
+* to manually flush if the transfer is unaligned. In the
+* non-iovec case the entire destination needs to be flushed.
+*/
+   if (iter_is_iovec(i)) {
+   unsigned long dest = (unsigned long) addr;
+
+   /*
+* If the destination is not 8-byte aligned then
+* __copy_user_nocache (on x86_64) uses cached copies
+*/
+   if (dest & 7) {
+   arch_wb_cache_pmem(addr, 1);
+   dest = ALIGN(dest, 8);
+   }
+
+   /*
+* If the remaining transfer length, after accounting
+* for destination alignment, is not 4-byte aligned
+* then __copy_user_nocache() falls back to cached
+* copies for the trailing bytes in the final cacheline
+* of the transfer.
+*/
+   if ((bytes - (dest - (unsigned long) addr)) & 3)
+   arch_wb_cache_pmem(addr + bytes - 1, 1);
+   } else
arch_wb_cache_pmem(addr, bytes);
 
return len;
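
The two conditions described in the comments above can be modeled outside the kernel. The following is a minimal userspace sketch, not the kernel code: `flush_plan`, `align_up()` and `plan_flush()` are hypothetical names introduced here, and the thresholds (8 bytes for the head, 4 bytes for the tail) are taken from the comments in the diff. It only reports which ends of the destination would need an explicit write-back; it performs no cache operation itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result: which ends of the destination need a write-back. */
struct flush_plan {
	bool head;	/* cache line holding the first byte */
	bool tail;	/* cache line holding the last byte */
};

/* Round x up to a multiple of a; a must be a power of two. */
static uintptr_t align_up(uintptr_t x, uintptr_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/*
 * Model of the unaligned-copy rules from the comments in the patch:
 *  - a destination that is not 8-byte aligned gets a head flush;
 *  - a remaining length that is not a multiple of 4 gets a tail flush.
 */
static struct flush_plan plan_flush(uintptr_t dest, size_t bytes)
{
	struct flush_plan plan = { false, false };
	size_t head_len = align_up(dest, 8) - dest;

	if (bytes == 0)
		return plan;

	if (head_len != 0)
		plan.head = true;

	/* Assumption: a copy that ends inside the head needs no tail flush. */
	if (bytes > head_len && ((bytes - head_len) & 3) != 0)
		plan.tail = true;

	return plan;
}
```

For example, `plan_flush(0x1003, 64)` reports both a head and a tail flush, while `plan_flush(0x1000, 64)` reports neither.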



[PATCH] device-dax: switch to srcu, fix rcu_read_lock() vs pte allocation

2017-04-07 Thread Dan Williams
The following warning triggers with a new unit test that stresses the
device-dax interface.

 ===
 [ ERR: suspicious RCU usage.  ]
 4.11.0-rc4+ #1049 Tainted: G   O
 ---
 ./include/linux/rcupdate.h:521 Illegal context switch in RCU read-side critical section!

 other info that might help us debug this:

 rcu_scheduler_active = 2, debug_locks = 0
 2 locks held by fio/9070:
  #0:  (&mm->mmap_sem){++}, at: [] __do_page_fault+0x167/0x4f0
  #1:  (rcu_read_lock){..}, at: [] dax_dev_huge_fault+0x32/0x620 [dax]

 Call Trace:
  dump_stack+0x86/0xc3
  lockdep_rcu_suspicious+0xd7/0x110
  ___might_sleep+0xac/0x250
  __might_sleep+0x4a/0x80
  __alloc_pages_nodemask+0x23a/0x360
  alloc_pages_current+0xa1/0x1f0
  pte_alloc_one+0x17/0x80
  __pte_alloc+0x1e/0x120
  __get_locked_pte+0x1bf/0x1d0
  insert_pfn.isra.70+0x3a/0x100
  ? lookup_memtype+0xa6/0xd0
  vm_insert_mixed+0x64/0x90
  dax_dev_huge_fault+0x520/0x620 [dax]
  ? dax_dev_huge_fault+0x32/0x620 [dax]
  dax_dev_fault+0x10/0x20 [dax]
  __do_fault+0x1e/0x140
  __handle_mm_fault+0x9af/0x10d0
  handle_mm_fault+0x16d/0x370
  ? handle_mm_fault+0x47/0x370
  __do_page_fault+0x28c/0x4f0
  trace_do_page_fault+0x58/0x2a0
  do_async_page_fault+0x1a/0xa0
  async_page_fault+0x28/0x30

Inserting a page table entry may trigger an allocation while we are
holding a read lock to keep the device instance alive for the duration
of the fault. Use srcu for this keep-alive protection.

Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
Cc: 
Signed-off-by: Dan Williams 
---
 drivers/dax/Kconfig |1 +
 drivers/dax/dax.c   |   13 +++--
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 3e2ab3b14eea..9e95bf94eb13 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -2,6 +2,7 @@ menuconfig DEV_DAX
tristate "DAX: direct access to differentiated memory"
default m if NVDIMM_DAX
depends on TRANSPARENT_HUGEPAGE
+   select SRCU
help
  Support raw access to differentiated (persistence, bandwidth,
  latency...) memory via an mmap(2) capable character
diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
index 80c6db279ae1..806f180c80d8 100644
--- a/drivers/dax/dax.c
+++ b/drivers/dax/dax.c
@@ -25,6 +25,7 @@
 #include "dax.h"
 
 static dev_t dax_devt;
+DEFINE_STATIC_SRCU(dax_srcu);
 static struct class *dax_class;
 static DEFINE_IDA(dax_minor_ida);
 static int nr_dax = CONFIG_NR_DEV_DAX;
@@ -60,7 +61,7 @@ struct dax_region {
  * @region - parent region
  * @dev - device backing the character device
  * @cdev - core chardev data
- * @alive - !alive + rcu grace period == no new mappings can be established
+ * @alive - !alive + srcu grace period == no new mappings can be established
  * @id - child id in the region
  * @num_resources - number of physical address extents in this device
  * @res - array of physical address ranges
@@ -569,7 +570,7 @@ static int __dax_dev_pud_fault(struct dax_dev *dax_dev, struct vm_fault *vmf)
 static int dax_dev_huge_fault(struct vm_fault *vmf,
enum page_entry_size pe_size)
 {
-   int rc;
+   int rc, id;
struct file *filp = vmf->vma->vm_file;
struct dax_dev *dax_dev = filp->private_data;
 
@@ -578,7 +579,7 @@ static int dax_dev_huge_fault(struct vm_fault *vmf,
? "write" : "read",
vmf->vma->vm_start, vmf->vma->vm_end);
 
-   rcu_read_lock();
+   id = srcu_read_lock(&dax_srcu);
switch (pe_size) {
case PE_SIZE_PTE:
rc = __dax_dev_pte_fault(dax_dev, vmf);
@@ -592,7 +593,7 @@ static int dax_dev_huge_fault(struct vm_fault *vmf,
default:
return VM_FAULT_FALLBACK;
}
-   rcu_read_unlock();
+   srcu_read_unlock(&dax_srcu, id);
 
return rc;
 }
@@ -713,11 +714,11 @@ static void unregister_dax_dev(void *dev)
 * Note, rcu is not protecting the liveness of dax_dev, rcu is
 * ensuring that any fault handlers that might have seen
 * dax_dev->alive == true, have completed.  Any fault handlers
-* that start after synchronize_rcu() has started will abort
+* that start after synchronize_srcu() has started will abort
 * upon seeing dax_dev->alive == false.
 */
dax_dev->alive = false;
-   synchronize_rcu();
+   synchronize_srcu(&dax_srcu);
unmap_mapping_range(dax_dev->inode->i_mapping, 0, 0, 1);
cdev_del(cdev);
device_unregister(dev);
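
The keep-alive protocol described in the comment above (clear `alive`, wait for a grace period, then tear down) can be illustrated in userspace. This is only an analogy under stated assumptions: it uses a C11 atomic reader count rather than SRCU, and `toy_dev`, `toy_init()`, `toy_fault()` and `toy_unregister()` are hypothetical names that are not part of the device-dax driver. As with an SRCU read-side section, the read side here is allowed to block, and every exit path leaves the read side.

```c
#include <assert.h>	/* used by the usage example only */
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a device with an "alive" flag. */
struct toy_dev {
	atomic_int readers;	/* fault handlers currently in flight */
	atomic_bool alive;	/* false once teardown has started */
};

static void toy_init(struct toy_dev *d)
{
	atomic_init(&d->readers, 0);
	atomic_init(&d->alive, true);
}

/* Returns 0 if the fault was handled, -1 if the device is going away. */
static int toy_fault(struct toy_dev *d)
{
	int rc = -1;

	atomic_fetch_add(&d->readers, 1);	/* enter the read side */
	if (atomic_load(&d->alive)) {
		/* Work that may sleep (e.g. an allocation) would go here. */
		rc = 0;
	}
	atomic_fetch_sub(&d->readers, 1);	/* leave on every path */

	return rc;
}

static void toy_unregister(struct toy_dev *d)
{
	/* New fault handlers abort once they observe alive == false. */
	atomic_store(&d->alive, false);

	/* Stand-in for a grace period: wait for in-flight handlers. */
	while (atomic_load(&d->readers) > 0)
		sched_yield();

	/* At this point it would be safe to tear down mappings. */
}
```

A handler that runs before `toy_unregister()` returns 0; one that starts afterwards returns -1 without touching the device. The sequentially consistent atomics ensure that either the handler sees `alive == false` or the teardown path sees a non-zero reader count.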



Re: [PATCH v5 3/4] libnvdimm: add support for clear poison list and badblocks for device dax

2017-04-07 Thread Dan Williams
On Fri, Apr 7, 2017 at 3:07 PM, Dave Jiang  wrote:
> Provide a mechanism to clear the poison list via the ndctl ND_CMD_CLEAR_ERROR
> call. We will update the poison list and also the badblocks at region level
> if the region is in dax mode or in pmem mode and not active. In other
> words we force badblocks to be cleared through write requests if the
> address is currently accessed through a block device, otherwise it can
> only be done via the ioctl+dsm path.
>
> Signed-off-by: Dave Jiang 
> Reviewed-by: Johannes Thumshirn 
> ---
>  drivers/nvdimm/bus.c  |   78 
> +
>  drivers/nvdimm/core.c |   17 --
>  drivers/nvdimm/region.c   |   25 ++
>  include/linux/libnvdimm.h |7 +++-
>  4 files changed, 115 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
> index 351bac8..af64e9c 100644
> --- a/drivers/nvdimm/bus.c
> +++ b/drivers/nvdimm/bus.c
> @@ -27,6 +27,7 @@
>  #include 
>  #include "nd-core.h"
>  #include "nd.h"
> +#include "pfn.h"
>
>  int nvdimm_major;
>  static int nvdimm_bus_major;
> @@ -218,11 +219,19 @@ long nvdimm_clear_poison(struct device *dev, phys_addr_t phys,
> if (cmd_rc < 0)
> return cmd_rc;
>
> -   nvdimm_clear_from_poison_list(nvdimm_bus, phys, len);
> return clear_err.cleared;

This seems like a typo. We want to clear poison in the pmem write path.


Re: [PATCH v5 3/4] libnvdimm: add support for clear poison list and badblocks for device dax

2017-04-07 Thread Dan Williams
On Fri, Apr 7, 2017 at 5:31 PM, Dan Williams  wrote:
> On Fri, Apr 7, 2017 at 3:07 PM, Dave Jiang  wrote:
>> Provide a mechanism to clear the poison list via the ndctl ND_CMD_CLEAR_ERROR
>> call. We will update the poison list and also the badblocks at region level
>> if the region is in dax mode or in pmem mode and not active. In other
>> words we force badblocks to be cleared through write requests if the
>> address is currently accessed through a block device, otherwise it can
>> only be done via the ioctl+dsm path.
>>
>> Signed-off-by: Dave Jiang 
>> Reviewed-by: Johannes Thumshirn 
>> ---
>>  drivers/nvdimm/bus.c  |   78 
>> +
>>  drivers/nvdimm/core.c |   17 --
>>  drivers/nvdimm/region.c   |   25 ++
>>  include/linux/libnvdimm.h |7 +++-
>>  4 files changed, 115 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
>> index 351bac8..af64e9c 100644
>> --- a/drivers/nvdimm/bus.c
>> +++ b/drivers/nvdimm/bus.c
>> @@ -27,6 +27,7 @@
>>  #include 
>>  #include "nd-core.h"
>>  #include "nd.h"
>> +#include "pfn.h"
>>
>>  int nvdimm_major;
>>  static int nvdimm_bus_major;
>> @@ -218,11 +219,19 @@ long nvdimm_clear_poison(struct device *dev, phys_addr_t phys,
>> if (cmd_rc < 0)
>> return cmd_rc;
>>
>> -   nvdimm_clear_from_poison_list(nvdimm_bus, phys, len);
>> return clear_err.cleared;
>
> This seems like a typo. We want to clear poison in the pmem write path.

Changing this into an nvdimm_forget_poison() call works (passes your
daxdev-errors.sh test), and that's what I've pushed out to my pending
branch to let 0day beat up on it a bit.


Re: [ndctl PATCH v4 0/6] Add ndctl check-namespace

2017-04-07 Thread Dan Williams
On Fri, Apr 7, 2017 at 4:17 PM, Vishal Verma  wrote:
> Changes in v4:
> - Change the bitmap code to the kernel's GPLv2 routines instead of the
>   LGPL ccan/bitmap.
> - Upgrade a few messages from 'info' to 'err'

Thanks Vishal! Applied and pushed out to 'pending'.

https://github.com/pmem/ndctl/tree/pending