Questions about vNVDIMM on qemu/KVM

2018-05-22 Thread Yasunori Goto
Hello,

I'm investigating the status of vNVDIMM on qemu/KVM,
and I have some questions about it. I would be glad if anyone could answer them.

In my understanding, qemu/KVM has a feature to expose an NFIT to the guest,
and it is still being extended with platform-capability information by this patch set:
https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg04756.html

And libvirt also supports this feature with 
https://libvirt.org/formatdomain.html#elementsMemory


However, virtio-pmem is being developed now, and it is better for
architectures that detect NVDIMM regions without ACPI (like s390x).
In addition, it is also necessary to flush guest contents on a vNVDIMM
that is backed by a file.


Q1) Will the ACPI.NFIT bus of qemu/KVM remain once virtio-pmem arrives?
What will the role of each be if both NFIT and virtio-pmem are available?
If my understanding is correct, both NFIT and virtio-pmem are used to
detect vNVDIMM regions, but only one seems to be necessary.

Otherwise, is the NFIT bus kept just for compatibility, and is
virtio-pmem the promising way forward?


Q2) What bus is (or will be) created for virtio-pmem?
I could confirm that the ACPI.NFIT bus is created as below,
and I heard another bus will be created for virtio-pmem, but I could not
find out concretely which bus that will be.
---
  # ndctl list -B
  {
 "provider":"ACPI.NFIT",
 "dev":"ndbus0"
  }
---
   
I think this affects what operations the user will be able to perform,
and what notifications are necessary for vNVDIMM.
ACPI defines some operations such as namespace control, and notifications
for NVDIMM health status and other events.
(I suppose that other status notifications might be necessary for vNVDIMM,
 but I'm not sure yet...)

If my understanding is wrong, please correct me.

Thanks,
---
Yasunori Goto




Re: [PATCH 07/11] mm, madvise_inject_error: fix page count leak

2018-05-22 Thread Naoya Horiguchi
On Tue, May 22, 2018 at 07:40:09AM -0700, Dan Williams wrote:
> The madvise_inject_error() routine uses get_user_pages() to lookup the
> pfn and other information for injected error, but it fails to release
> that pin.
> 
> The dax-dma-vs-truncate warning catches this failure with the following
> signature:
> 
>  Injecting memory failure for pfn 0x208900 at process virtual address 
> 0x7f3908d0
>  Memory failure: 0x208900: reserved kernel page still referenced by 1 users
>  Memory failure: 0x208900: recovery action for reserved kernel page: Failed
>  WARNING: CPU: 37 PID: 9566 at fs/dax.c:348 dax_disassociate_entry+0x4e/0x90
>  CPU: 37 PID: 9566 Comm: umount Tainted: GW  OE 4.17.0-rc6+ #1900
>  [..]
>  RIP: 0010:dax_disassociate_entry+0x4e/0x90
>  RSP: 0018:c9000a9b3b30 EFLAGS: 00010002
>  RAX: ea0008224000 RBX: 00208a00 RCX: 00208900
>  RDX: 0001 RSI: 8804058c6160 RDI: 0008
>  RBP: 0822000a R08: 0002 R09: 00208800
>  R10:  R11: 00208801 R12: 8804058c6168
>  R13:  R14: 0002 R15: 0001
>  FS:  7f4548027fc0() GS:880431d4() knlGS:
>  CS:  0010 DS:  ES:  CR0: 80050033
>  CR2: 56316d5f8988 CR3: 0004298cc000 CR4: 000406e0
>  Call Trace:
>   __dax_invalidate_mapping_entry+0xab/0xe0
>   dax_delete_mapping_entry+0xf/0x20
>   truncate_exceptional_pvec_entries.part.14+0x1d4/0x210
>   truncate_inode_pages_range+0x291/0x920
>   ? kmem_cache_free+0x1f8/0x300
>   ? lock_acquire+0x9f/0x200
>   ? truncate_inode_pages_final+0x31/0x50
>   ext4_evict_inode+0x69/0x740
> 
> Cc: 
> Fixes: bd1ce5f91f54 ("HWPOISON: avoid grabbing the page count...")
> Cc: Michal Hocko 
> Cc: Andi Kleen 
> Cc: Wu Fengguang 
> Signed-off-by: Dan Williams 
> ---
>  mm/madvise.c |   11 ---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 4d3c922ea1a1..246fa4d4eee2 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -631,11 +631,13 @@ static int madvise_inject_error(int behavior,
>  
>  
>   for (; start < end; start += PAGE_SIZE << order) {
> + unsigned long pfn;
>   int ret;
>  
>   ret = get_user_pages_fast(start, 1, 0, &page);
>   if (ret != 1)
>   return ret;
> + pfn = page_to_pfn(page);
>  
>   /*
>* When soft offlining hugepages, after migrating the page
> @@ -651,17 +653,20 @@ static int madvise_inject_error(int behavior,
>  
>   if (behavior == MADV_SOFT_OFFLINE) {
>   pr_info("Soft offlining pfn %#lx at process virtual 
> address %#lx\n",
> - page_to_pfn(page), start);
> + pfn, start);
>  
>   ret = soft_offline_page(page, MF_COUNT_INCREASED);
> + put_page(page);
>   if (ret)
>   return ret;
>   continue;
>   }
> + put_page(page);

We keep the page count pinned after isolating the error page
in order to make sure that the error page is disabled and never reused.
That is not explicit enough on its own, so a comment here would be helpful.

BTW, looking at kernel messages like "Memory failure: 0x208900:
reserved kernel page still referenced by 1 users", memory_failure()
treats dev_pagemap pages as "reserved kernel pages" (MF_MSG_KERNEL).
If the memory error handler recovers a dev_pagemap page in its own special
way, we could define a new action_page_types entry such as MF_MSG_DAX.
Reporting "Memory failure: 0xX: recovery action for dax page:
Failed" might be more helpful from the end user's perspective.

Thanks,
Naoya Horiguchi

> +
>   pr_info("Injecting memory failure for pfn %#lx at process 
> virtual address %#lx\n",
> - page_to_pfn(page), start);
> + pfn, start);
>  
> - ret = memory_failure(page_to_pfn(page), MF_COUNT_INCREASED);
> + ret = memory_failure(pfn, MF_COUNT_INCREASED);
>   if (ret)
>   return ret;
>   }
> 




Re: [dm-devel] [patch 4/4] dm-writecache: use new API for flushing

2018-05-22 Thread Jeff Moyer
Hi, Mike,

Mike Snitzer  writes:

> Looking at Mikulas' wrapper API that you and hch are calling into
> question:
>
> For ARM it is using arch/arm64/mm/flush.c:arch_wb_cache_pmem().
> (And ARM does seem to be providing CONFIG_ARCH_HAS_PMEM_API.)
>
> Whereas x86_64 is using memcpy_flushcache() as provided by
> CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE.
> (Yet ARM does provide arch/arm64/lib/uaccess_flushcache.c:memcpy_flushcache)
>
> Just seems this isn't purely about ARM lacking on an API level (given on
> x86_64 Mikulas isn't only using CONFIG_ARCH_HAS_PMEM_API).
>
> Seems this is more to do with x86_64 having efficient Non-temporal
> stores?

Yeah, I think you've got that all right.

> Anyway, I'm still trying to appreciate the details here before I can
> make any forward progress.

Making data persistent on x64 requires 3 steps:
1) copy the data into pmem   (store instructions)
2) flush the cache lines associated with the data (clflush, clflush_opt, clwb)
3) wait on the flush to complete (sfence)

I'm not sure if other architectures require step 3.  Mikulas'
implementation seems to imply that arm64 doesn't require the fence.

The current pmem api provides:

memcpy*   -- step 1
memcpy_flushcache -- this combines steps 1 and 2
dax_flush -- step 2
wmb*  -- step 3

* not strictly part of the pmem api

So, if you didn't care about performance, you could write generic code
that only used memcpy, dax_flush, and wmb (assuming other arches
actually need the wmb).  What Mikulas did was to abstract out an API
that could be called by generic code that would work optimally on all
architectures.
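
As a rough illustration, the fully generic (non-optimized) path strings
those calls together like this; purely an illustrative sketch, not an
existing kernel helper, and it assumes 'dst' points into pmem mapped via
dax_direct_access() with 'dax_dev' being the corresponding dax_device:

#include <linux/dax.h>		/* dax_flush()	*/
#include <linux/string.h>	/* memcpy()	*/
#include <asm/barrier.h>	/* wmb()	*/

/* Persist 'len' bytes at 'dst' using the generic calls listed above. */
static void persist_generic(struct dax_device *dax_dev, void *dst,
			    const void *src, size_t len)
{
	memcpy(dst, src, len);		/* step 1: copy the data         */
	dax_flush(dax_dev, dst, len);	/* step 2: write back cache lines */
	wmb();				/* step 3: wait for the flushes   */
}

On x86-64, memcpy_flushcache() collapses steps 1 and 2 into a single pass
over the data, which is why letting each architecture pick its optimal
combination matters for performance.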

This looks like a worth-while addition to the PMEM API, to me.  Mikulas,
what do you think about refactoring the code as Christoph suggested?

Cheers,
Jeff


Re: [PATCH v4 06/14] PCI/P2PDMA: Add P2P DMA driver writer's documentation

2018-05-22 Thread Logan Gunthorpe
Thanks for the review Randy! I'll make the changes for the next time we
post the series.

On 22/05/18 03:24 PM, Randy Dunlap wrote:
>> +The first task an orchestrator driver must do is compile a list of
>> +all client drivers that will be involved in a given transaction. For
>> +example, the NVMe Target driver creates a list including all NVMe drives
>  ^^
>or drivers 
> ?
> Could be either, I guess, but the previous sentence says "compile a list of 
> drivers."

I did mean "drives". But perhaps "devices" would be more clear. A list
of all NVMe drivers doesn't make much sense as I'm pretty sure there is
only one NVMe driver.

Logan


Re: [PATCH v4 06/14] PCI/P2PDMA: Add P2P DMA driver writer's documentation

2018-05-22 Thread Randy Dunlap
On 04/23/2018 04:30 PM, Logan Gunthorpe wrote:
> Add a restructured text file describing how to write drivers
> with support for P2P DMA transactions. The document describes
> how to use the APIs that were added in the previous few
> commits.
> 
> Also adds an index for the PCI documentation tree even though this
> is the only PCI document that has been converted to restructured text
> at this time.
> 
> Signed-off-by: Logan Gunthorpe 
> Cc: Jonathan Corbet 
> ---
>  Documentation/PCI/index.rst |  14 +++
>  Documentation/driver-api/pci/index.rst  |   1 +
>  Documentation/driver-api/pci/p2pdma.rst | 166 
> 
>  Documentation/index.rst |   3 +-
>  4 files changed, 183 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/PCI/index.rst
>  create mode 100644 Documentation/driver-api/pci/p2pdma.rst


> diff --git a/Documentation/driver-api/pci/p2pdma.rst 
> b/Documentation/driver-api/pci/p2pdma.rst
> new file mode 100644
> index ..49a512c405b2
> --- /dev/null
> +++ b/Documentation/driver-api/pci/p2pdma.rst
> @@ -0,0 +1,166 @@
> +
> +PCI Peer-to-Peer DMA Support
> +
> +
> +The PCI bus has pretty decent support for performing DMA transfers
> +between two endpoints on the bus. This type of transaction is
> +henceforth called Peer-to-Peer (or P2P). However, there are a number of
> +issues that make P2P transactions tricky to do in a perfectly safe way.
> +
> +One of the biggest issues is that PCI Root Complexes are not required
> +to support forwarding packets between Root Ports. To make things worse,
> +there is no simple way to determine if a given Root Complex supports
> +this or not. (See PCIe r4.0, sec 1.3.1). Therefore, as of this writing,
> +the kernel only supports doing P2P when the endpoints involved are all
> +behind the same PCIe root port as the spec guarantees that all
> +packets will always be routable but does not require routing between
> +root ports.
> +
> +The second issue is that to make use of existing interfaces in Linux,
> +memory that is used for P2P transactions needs to be backed by struct
> +pages. However, PCI BARs are not typically cache coherent so there are
> +a few corner case gotchas with these pages so developers need to
> +be careful about what they do with them.
> +
> +
> +Driver Writer's Guide
> +=
> +
> +In a given P2P implementation there may be three or more different
> +types of kernel drivers in play:
> +
> +* Providers - A driver which provides or publishes P2P resources like

   * Provider -

> +  memory or doorbell registers to other drivers.
> +* Clients - A driver which makes use of a resource by setting up a

   * Client -

> +  DMA transaction to or from it.
> +* Orchestrators - A driver which orchestrates the flow of data between

   * Orchestrator -

> +  clients and providers
> +
> +In many cases there could be overlap between these three types (ie.

  (i.e.,

> +it may be typical for a driver to be both a provider and a client).
> +
> +For example, in the NVMe Target Copy Offload implementation:
> +
> +* The NVMe PCI driver is both a client, provider and orchestrator
> +  in that it exposes any CMB (Controller Memory Buffer) as a P2P memory
> +  resource (provider), it accepts P2P memory pages as buffers in requests
> +  to be used directly (client) and it can also make use the CMB as
> +  submission queue entries.
> +* The RDMA driver is a client in this arrangement so that an RNIC
> +  can DMA directly to the memory exposed by the NVMe device.
> +* The NVMe Target driver (nvmet) can orchestrate the data from the RNIC
> +  to the P2P memory (CMB) and then to the NVMe device (and vice versa).
> +
> +This is currently the only arrangement supported by the kernel but
> +one could imagine slight tweaks to this that would allow for the same
> +functionality. For example, if a specific RNIC added a BAR with some
> +memory behind it, its driver could add support as a P2P provider and
> +then the NVMe Target could use the RNIC's memory instead of the CMB
> +in cases where the NVMe cards in use do not have CMB support.
> +
> +
> +Provider Drivers
> +
> +
> +A provider simply needs to register a BAR (or a portion of a BAR)
> +as a P2P DMA resource using :c:func:`pci_p2pdma_add_resource()`.
> +This will register struct pages for all the specified memory.
> +
> +After that it may optionally publish all of its resources as
> +P2P memory using :c:func:`pci_p2pmem_publish()`. This will allow
> +any orchestrator drivers to find and use the memory. When marked in
> +this way, the resource must be regular memory with no side effects.
> +
> +For the time being this is fairly rudimentary in that all resources
> +are typically going to be P2P memory. Future work will likely expand
> +this to include other types 

Re: [patch 4/4] dm-writecache: use new API for flushing

2018-05-22 Thread Mike Snitzer
On Tue, May 22 2018 at  3:27pm -0400,
Dan Williams  wrote:

> On Tue, May 22, 2018 at 12:19 PM, Mike Snitzer  wrote:
> > On Tue, May 22 2018 at  3:00pm -0400,
> > Dan Williams  wrote:
> >
> >> On Tue, May 22, 2018 at 11:41 AM, Mike Snitzer  wrote:
> >> > On Tue, May 22 2018 at  2:39am -0400,
> >> > Christoph Hellwig  wrote:
> >> >
> >> >> On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote:
> >> >> > Use new API for flushing persistent memory.
> >> >>
> >> >> The sentence doesnt make much sense.  'A new API', 'A better
> >> >> abstraction' maybe?
> >> >>
> >> >> >
> >> >> > The problem is this:
> >> >> > * on X86-64, non-temporal stores have the best performance
> >> >> > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
> >> >> >   should flush cache as late as possible, because it performs better 
> >> >> > this
> >> >> >   way.
> >> >> >
> >> >> > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To 
> >> >> > commit
> >> >> > data persistently, all three functions must be called.
> >> >> >
> >> >> > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
> >> >> > (unlike pmem_memcpy) guarantees that 8-byte values are written 
> >> >> > atomically.
> >> >> >
> >> >> > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> >> >> > pmem_commit is wmb.
> >> >> >
> >> >> > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> >> >> > pmem_commit is empty.
> >> >>
> >> >> All these should be provided by the pmem layer, and be properly
> >> >> documented.  And be sorted before adding your new target that uses
> >> >> them.
> >> >
> >> > I don't see that as a hard requirement.  Mikulas did the work to figure
> >> > out what is more optimal on x86_64 vs amd64.  It makes a difference for
> >> > his target and that is sufficient to carry it locally until/when it is
> >> > either elevated to pmem.
> >> >
> >> > We cannot even get x86 and swait maintainers to reply to repeat requests
> >> > for review.  Stacking up further deps on pmem isn't high on my list.
> >> >
> >>
> >> Except I'm being responsive.
> >
> > Except you're looking to immediately punt to linux-arm-kernel ;)
> 
> Well, I'm not, not really. I'm saying drop ARM support, it's not ready.
> 
> >
> >> I agree with Christoph that we should
> >> build pmem helpers at an architecture level and not per-driver. Let's
> >> make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch
> >> up to x86 in this space. We already have PowerPC enabling PMEM API, so
> >> I don't see an unreasonable barrier to ask the same of ARM. This patch
> >> is not even cc'd to linux-arm-kernel. Has the subject been broached
> >> with them?
> >
> > No idea.  Not by me.
> >
> > The thing is, I'm no expert in pmem.  You are.  Coordinating the change
> > with ARM et al feels unnecessarily limiting and quicky moves outside my
> > control.
> >
> > Serious question: Why can't this code land in this dm-writecache target
> > and then be lifted (or obsoleted)?
> 
> Because we already have an API, and we don't want to promote local
> solutions to global problems, or carry  unnecessary technical debt.
> 
> >
> > But if you think it worthwhile to force ARM to step up then fine.  That
> > does limit the availability of using writecache on ARM while they get
> > the PMEM API together.
> >
> > I'll do whatever you want.. just put the smack down and tell me how it
> > is ;)
> 
> I'd say just control the variables you can control. Drop the ARM
> support if you want to move forward and propose extensions / updates
> to the pmem api for x86 and I'll help push those since I was involved
> in pushing the x86 pmem api in the first instance. That way you don't
> need to touch this driver as new archs add their pmem api enabling.

Looking at Mikulas' wrapper API that you and hch are calling into
question:

For ARM it is using arch/arm64/mm/flush.c:arch_wb_cache_pmem().
(And ARM does seem to be providing CONFIG_ARCH_HAS_PMEM_API.)

Whereas x86_64 is using memcpy_flushcache() as provided by
CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE.
(Yet ARM does provide arch/arm64/lib/uaccess_flushcache.c:memcpy_flushcache)

It just seems this isn't purely about ARM lacking at the API level (given
that on x86_64 Mikulas isn't only using CONFIG_ARCH_HAS_PMEM_API).

It seems this has more to do with x86_64 having efficient non-temporal
stores?
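
Spelled out, that mapping amounts to something like the following sketch
(just restating what Mikulas' commit message, quoted above, describes;
not the actual dm-writecache code):

#if defined(CONFIG_X86_64)
static inline void pmem_memcpy(void *dst, const void *src, size_t n)
{
	memcpy_flushcache(dst, src, n);		/* non-temporal stores */
}
static inline void pmem_flush(void *addr, size_t n)	{ }
static inline void pmem_commit(void)			{ wmb(); }
#elif defined(CONFIG_ARM64)
static inline void pmem_memcpy(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
}
static inline void pmem_flush(void *addr, size_t n)
{
	arch_wb_cache_pmem(addr, n);		/* flush as late as possible */
}
static inline void pmem_commit(void)			{ }
#endif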

Anyway, I'm still trying to appreciate the details here before I can
make any forward progress.

Mike


Re: [patch 4/4] dm-writecache: use new API for flushing

2018-05-22 Thread Dan Williams
On Tue, May 22, 2018 at 12:19 PM, Mike Snitzer  wrote:
> On Tue, May 22 2018 at  3:00pm -0400,
> Dan Williams  wrote:
>
>> On Tue, May 22, 2018 at 11:41 AM, Mike Snitzer  wrote:
>> > On Tue, May 22 2018 at  2:39am -0400,
>> > Christoph Hellwig  wrote:
>> >
>> >> On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote:
>> >> > Use new API for flushing persistent memory.
>> >>
>> >> The sentence doesnt make much sense.  'A new API', 'A better
>> >> abstraction' maybe?
>> >>
>> >> >
>> >> > The problem is this:
>> >> > * on X86-64, non-temporal stores have the best performance
>> >> > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
>> >> >   should flush cache as late as possible, because it performs better 
>> >> > this
>> >> >   way.
>> >> >
>> >> > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To 
>> >> > commit
>> >> > data persistently, all three functions must be called.
>> >> >
>> >> > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
>> >> > (unlike pmem_memcpy) guarantees that 8-byte values are written 
>> >> > atomically.
>> >> >
>> >> > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
>> >> > pmem_commit is wmb.
>> >> >
>> >> > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
>> >> > pmem_commit is empty.
>> >>
>> >> All these should be provided by the pmem layer, and be properly
>> >> documented.  And be sorted before adding your new target that uses
>> >> them.
>> >
>> > I don't see that as a hard requirement.  Mikulas did the work to figure
>> > out what is more optimal on x86_64 vs amd64.  It makes a difference for
>> > his target and that is sufficient to carry it locally until/when it is
>> > either elevated to pmem.
>> >
>> > We cannot even get x86 and swait maintainers to reply to repeat requests
>> > for review.  Stacking up further deps on pmem isn't high on my list.
>> >
>>
>> Except I'm being responsive.
>
> Except you're looking to immediately punt to linux-arm-kernel ;)

Well, I'm not, not really. I'm saying drop ARM support, it's not ready.

>
>> I agree with Christoph that we should
>> build pmem helpers at an architecture level and not per-driver. Let's
>> make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch
>> up to x86 in this space. We already have PowerPC enabling PMEM API, so
>> I don't see an unreasonable barrier to ask the same of ARM. This patch
>> is not even cc'd to linux-arm-kernel. Has the subject been broached
>> with them?
>
> No idea.  Not by me.
>
> The thing is, I'm no expert in pmem.  You are.  Coordinating the change
> with ARM et al feels unnecessarily limiting and quicky moves outside my
> control.
>
> Serious question: Why can't this code land in this dm-writecache target
> and then be lifted (or obsoleted)?

Because we already have an API, and we don't want to promote local
solutions to global problems, or carry  unnecessary technical debt.

>
> But if you think it worthwhile to force ARM to step up then fine.  That
> does limit the availability of using writecache on ARM while they get
> the PMEM API together.
>
> I'll do whatever you want.. just put the smack down and tell me how it
> is ;)

I'd say just control the variables you can control. Drop the ARM
support if you want to move forward and propose extensions / updates
to the pmem api for x86 and I'll help push those since I was involved
in pushing the x86 pmem api in the first instance. That way you don't
need to touch this driver as new archs add their pmem api enabling.


Re: [patch 4/4] dm-writecache: use new API for flushing

2018-05-22 Thread Mike Snitzer
On Tue, May 22 2018 at  3:00pm -0400,
Dan Williams  wrote:

> On Tue, May 22, 2018 at 11:41 AM, Mike Snitzer  wrote:
> > On Tue, May 22 2018 at  2:39am -0400,
> > Christoph Hellwig  wrote:
> >
> >> On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote:
> >> > Use new API for flushing persistent memory.
> >>
> >> The sentence doesnt make much sense.  'A new API', 'A better
> >> abstraction' maybe?
> >>
> >> >
> >> > The problem is this:
> >> > * on X86-64, non-temporal stores have the best performance
> >> > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
> >> >   should flush cache as late as possible, because it performs better this
> >> >   way.
> >> >
> >> > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
> >> > data persistently, all three functions must be called.
> >> >
> >> > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
> >> > (unlike pmem_memcpy) guarantees that 8-byte values are written 
> >> > atomically.
> >> >
> >> > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> >> > pmem_commit is wmb.
> >> >
> >> > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> >> > pmem_commit is empty.
> >>
> >> All these should be provided by the pmem layer, and be properly
> >> documented.  And be sorted before adding your new target that uses
> >> them.
> >
> > I don't see that as a hard requirement.  Mikulas did the work to figure
> > out what is more optimal on x86_64 vs amd64.  It makes a difference for
> > his target and that is sufficient to carry it locally until/when it is
> > either elevated to pmem.
> >
> > We cannot even get x86 and swait maintainers to reply to repeat requests
> > for review.  Stacking up further deps on pmem isn't high on my list.
> >
> 
> Except I'm being responsive.

Except you're looking to immediately punt to linux-arm-kernel ;)

> I agree with Christoph that we should
> build pmem helpers at an architecture level and not per-driver. Let's
> make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch
> up to x86 in this space. We already have PowerPC enabling PMEM API, so
> I don't see an unreasonable barrier to ask the same of ARM. This patch
> is not even cc'd to linux-arm-kernel. Has the subject been broached
> with them?

No idea.  Not by me.

The thing is, I'm no expert in pmem.  You are.  Coordinating the change
with ARM et al feels unnecessarily limiting and quickly moves outside my
control.

Serious question: Why can't this code land in this dm-writecache target
and then be lifted (or obsoleted)?

But if you think it worthwhile to force ARM to step up then fine.  That
does limit the availability of using writecache on ARM while they get
the PMEM API together.

I'll do whatever you want.. just put the smack down and tell me how it
is ;)

Thanks,
Mike


Re: [patch 4/4] dm-writecache: use new API for flushing

2018-05-22 Thread Dan Williams
On Tue, May 22, 2018 at 11:41 AM, Mike Snitzer  wrote:
> On Tue, May 22 2018 at  2:39am -0400,
> Christoph Hellwig  wrote:
>
>> On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote:
>> > Use new API for flushing persistent memory.
>>
>> The sentence doesnt make much sense.  'A new API', 'A better
>> abstraction' maybe?
>>
>> >
>> > The problem is this:
>> > * on X86-64, non-temporal stores have the best performance
>> > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
>> >   should flush cache as late as possible, because it performs better this
>> >   way.
>> >
>> > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
>> > data persistently, all three functions must be called.
>> >
>> > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
>> > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
>> >
>> > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
>> > pmem_commit is wmb.
>> >
>> > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
>> > pmem_commit is empty.
>>
>> All these should be provided by the pmem layer, and be properly
>> documented.  And be sorted before adding your new target that uses
>> them.
>
> I don't see that as a hard requirement.  Mikulas did the work to figure
> out what is more optimal on x86_64 vs amd64.  It makes a difference for
> his target and that is sufficient to carry it locally until/when it is
> either elevated to pmem.
>
> We cannot even get x86 and swait maintainers to reply to repeat requests
> for review.  Stacking up further deps on pmem isn't high on my list.
>

Except I'm being responsive. I agree with Christoph that we should
build pmem helpers at an architecture level and not per-driver. Let's
make this driver depend on ARCH_HAS_PMEM_API and require ARM to catch
up to x86 in this space. We already have PowerPC enabling PMEM API, so
I don't see an unreasonable barrier to ask the same of ARM. This patch
is not even cc'd to linux-arm-kernel. Has the subject been broached
with them?


Re: [patch 4/4] dm-writecache: use new API for flushing

2018-05-22 Thread Mike Snitzer
On Tue, May 22 2018 at  2:39am -0400,
Christoph Hellwig  wrote:

> On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote:
> > Use new API for flushing persistent memory.
> 
> The sentence doesnt make much sense.  'A new API', 'A better
> abstraction' maybe?
> 
> > 
> > The problem is this:
> > * on X86-64, non-temporal stores have the best performance
> > * ARM64 doesn't have non-temporal stores, so we must flush cache. We
> >   should flush cache as late as possible, because it performs better this
> >   way.
> > 
> > We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
> > data persistently, all three functions must be called.
> > 
> > The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
> > (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
> > 
> > On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> > pmem_commit is wmb.
> > 
> > On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> > pmem_commit is empty.
> 
> All these should be provided by the pmem layer, and be properly
> documented.  And be sorted before adding your new target that uses
> them.

I don't see that as a hard requirement.  Mikulas did the work to figure
out what is most optimal on x86_64 vs arm64.  It makes a difference for
his target, and that is sufficient to carry it locally until/unless it is
elevated to the pmem layer.

We cannot even get the x86 and swait maintainers to reply to repeated
requests for review.  Stacking up further dependencies on pmem isn't high
on my list.

Mike


[ndctl PATCH] test: add a MADV_HWPOISON test

2018-05-22 Thread Dan Williams
Check that injecting soft-poison to a dax mapping results in SIGBUS with
the expected BUS_MCEERR_AR siginfo data.

Signed-off-by: Dan Williams 
---
 test/dax-pmd.c |  130 ++--
 1 file changed, 126 insertions(+), 4 deletions(-)

diff --git a/test/dax-pmd.c b/test/dax-pmd.c
index 65bee6ffe907..abff4f9fd199 100644
--- a/test/dax-pmd.c
+++ b/test/dax-pmd.c
@@ -12,6 +12,7 @@
  */
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -192,15 +193,130 @@ int test_dax_directio(int dax_fd, unsigned long align, 
void *dax_addr, off_t off
return rc;
 }
 
+static sigjmp_buf sj_env;
+static int sig_mcerr_ao, sig_mcerr_ar, sig_count;
+
+static void sigbus_hdl(int sig, siginfo_t *si, void *ptr)
+{
+   switch (si->si_code) {
+   case BUS_MCEERR_AO:
+   fprintf(stderr, "%s: BUS_MCEERR_AO addr: %p len: %d\n",
+   __func__, si->si_addr, 1 << si->si_addr_lsb);
+   sig_mcerr_ao++;
+   break;
+   case BUS_MCEERR_AR:
+   fprintf(stderr, "%s: BUS_MCEERR_AR addr: %p len: %d\n",
+   __func__, si->si_addr, 1 << si->si_addr_lsb);
+   sig_mcerr_ar++;
+   break;
+   default:
+   sig_count++;
+   break;
+   }
+
+   siglongjmp(sj_env, 1);
+}
+
+static int test_dax_poison(int dax_fd, unsigned long align, void *dax_addr,
+   off_t offset)
+{
+   unsigned char *addr = MAP_FAILED;
+   struct sigaction act;
+   unsigned x = x;
+   void *buf;
+   int rc;
+
+   if (posix_memalign(&buf, 4096, 4096) != 0)
+   return -ENOMEM;
+
+   memset(&act, 0, sizeof(act));
+   act.sa_sigaction = sigbus_hdl;
+   act.sa_flags = SA_SIGINFO;
+
+   if (sigaction(SIGBUS, &act, 0)) {
+   fail();
+   rc = -errno;
+   goto out;
+   }
+
+   /* dirty the block on disk to bypass the default zero page */
+   rc = pwrite(dax_fd, buf, 4096, offset + align / 2);
+   if (rc < 4096) {
+   fail();
+   rc = -ENXIO;
+   goto out;
+   }
+   fsync(dax_fd);
+
+   addr = mmap(dax_addr, 2*align, PROT_READ|PROT_WRITE,
+   MAP_SHARED_VALIDATE|MAP_POPULATE|MAP_SYNC, dax_fd, 
offset);
+   if (addr == MAP_FAILED) {
+   fail();
+   rc = -errno;
+   goto out;
+   }
+
+   if (sigsetjmp(sj_env, 1)) {
+   if (sig_mcerr_ar) {
+   fprintf(stderr, "madvise triggered 'action required' 
sigbus\n");
+   goto clear_error;
+   } else if (sig_count) {
+   fail();
+   return -ENXIO;
+   }
+   }
+
+   rc = madvise(addr + align / 2, 4096, MADV_HWPOISON);
+   if (rc) {
+   fail();
+   rc = -errno;
+   goto out;
+   }
+
+   /* clear the error */
+clear_error:
+   if (!sig_mcerr_ar) {
+   fail();
+   rc = -ENXIO;
+   goto out;
+   }
+
+   rc = fallocate(dax_fd, FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE,
+   offset + align / 2, 4096);
+   if (rc) {
+   fail();
+   rc = -errno;
+   goto out;
+   }
+
+   rc = pwrite(dax_fd, buf, 4096, offset + align / 2);
+   if (rc < 4096) {
+   fail();
+   rc = -ENXIO;
+   goto out;
+   }
+   fsync(dax_fd);
+
+   /* check that we can fault in the poison page */
+   x = *(volatile unsigned *) addr + align / 2;
+   rc = 0;
+
+out:
+   if (addr != MAP_FAILED)
+   munmap(addr, 2 * align);
+   free(buf);
+   return rc;
+}
+
 /* test_pmd assumes that fd references a pre-allocated + dax-capable file */
 static int test_pmd(int fd)
 {
-   unsigned long long m_align, p_align;
+   unsigned long long m_align, p_align, pmd_off;
struct fiemap_extent *ext;
+   void *base, *pmd_addr;
struct fiemap *map;
int rc = -ENXIO;
unsigned long i;
-   void *base;
 
if (fd < 0) {
fail();
@@ -249,9 +365,15 @@ static int test_pmd(int fd)
m_align = ALIGN(base, HPAGE_SIZE) - ((unsigned long) base);
p_align = ALIGN(ext->fe_physical, HPAGE_SIZE) - ext->fe_physical;
 
-   rc = test_dax_directio(fd, HPAGE_SIZE, (char *) base + m_align,
-   ext->fe_logical + p_align);
+   pmd_addr = (char *) base + m_align;
+   pmd_off =  ext->fe_logical + p_align;
+   rc = test_dax_directio(fd, HPAGE_SIZE, pmd_addr, pmd_off);
+   if (rc)
+   goto err_directio;
+
+   rc = test_dax_poison(fd, HPAGE_SIZE, pmd_addr, pmd_off);
 
+ err_directio:
  err_extent:
  err_mmap:
free(map);


[ndctl PATCH] ndctl: hide null uuids

2018-05-22 Thread Dan Williams
Clean up the namespace listing to hide the 'raw_uuid' field when it is
zero.

Signed-off-by: Dan Williams 
---
 util/json.c |   19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/util/json.c b/util/json.c
index 1772177a8abd..c606e1cdf770 100644
--- a/util/json.c
+++ b/util/json.c
@@ -652,10 +652,12 @@ static struct json_object 
*util_dax_badblocks_to_json(struct ndctl_dax *dax,
 
 static struct json_object *util_raw_uuid(struct ndctl_namespace *ndns)
 {
-   uuid_t raw_uuid;
char buf[40];
+   uuid_t raw_uuid;
 
ndctl_namespace_get_uuid(ndns, raw_uuid);
+   if (uuid_is_null(raw_uuid))
+   return NULL;
uuid_unparse(raw_uuid, buf);
return json_object_new_string(buf);
 }
@@ -734,9 +736,8 @@ struct json_object *util_namespace_to_json(struct 
ndctl_namespace *ndns,
json_object_object_add(jndns, "uuid", jobj);
 
jobj = util_raw_uuid(ndns);
-   if (!jobj)
-   goto err;
-   json_object_object_add(jndns, "raw_uuid", jobj);
+   if (jobj)
+   json_object_object_add(jndns, "raw_uuid", jobj);
bdev = ndctl_btt_get_block_device(btt);
} else if (pfn) {
ndctl_pfn_get_uuid(pfn, uuid);
@@ -746,9 +747,8 @@ struct json_object *util_namespace_to_json(struct 
ndctl_namespace *ndns,
goto err;
json_object_object_add(jndns, "uuid", jobj);
jobj = util_raw_uuid(ndns);
-   if (!jobj)
-   goto err;
-   json_object_object_add(jndns, "raw_uuid", jobj);
+   if (jobj)
+   json_object_object_add(jndns, "raw_uuid", jobj);
bdev = ndctl_pfn_get_block_device(pfn);
} else if (dax) {
struct daxctl_region *dax_region;
@@ -761,9 +761,8 @@ struct json_object *util_namespace_to_json(struct 
ndctl_namespace *ndns,
goto err;
json_object_object_add(jndns, "uuid", jobj);
jobj = util_raw_uuid(ndns);
-   if (!jobj)
-   goto err;
-   json_object_object_add(jndns, "raw_uuid", jobj);
+   if (jobj)
+   json_object_object_add(jndns, "raw_uuid", jobj);
if ((flags & UTIL_JSON_DAX) && dax_region) {
jobj = util_daxctl_region_to_json(dax_region, NULL,
flags);



[PATCH 10/11] mm, memory_failure: teach memory_failure() about dev_pagemap pages

2018-05-22 Thread Dan Williams
mce: Uncorrected hardware memory error in user-access at af34214200
{1}[Hardware Error]: It has been corrected by h/w and requires no further 
action
mce: [Hardware Error]: Machine check events logged
{1}[Hardware Error]: event severity: corrected
Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
[..]
Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
mce: Memory error not recovered

In contrast to typical memory, dev_pagemap pages may be dax mapped. With
dax there is no possibility to map in another page dynamically since dax
establishes 1:1 physical address to file offset associations. Also
dev_pagemap pages associated with NVDIMM / persistent memory devices can
internally remap/repair addresses with poison. While memory_failure()
assumes that it can discard typical poisoned pages and keep them
unmapped indefinitely, dev_pagemap pages may be returned to service
after the error is cleared.

Teach memory_failure() to detect and handle MEMORY_DEVICE_HOST
dev_pagemap pages that have poison consumed by userspace. Mark the
memory as UC instead of unmapping it completely to allow ongoing access
via the device driver (nd_pmem). Later, nd_pmem will grow support for
marking the page back to WB when the error is cleared.

Cc: Jan Kara 
Cc: Christoph Hellwig 
Cc: Jérôme Glisse 
Cc: Matthew Wilcox 
Cc: Naoya Horiguchi 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 mm/memory-failure.c |  117 +++
 1 file changed, 117 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 42a193ee14d3..f95036f99a79 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -55,6 +55,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include "internal.h"
@@ -1112,6 +1113,117 @@ static int memory_failure_hugetlb(unsigned long pfn, 
int flags)
return res;
 }
 
+static unsigned long dax_mapping_size(struct page *page)
+{
+   struct address_space *mapping = page->mapping;
+   pgoff_t pgoff = page_to_pgoff(page);
+   struct vm_area_struct *vma;
+   unsigned long size = 0;
+
+   i_mmap_lock_read(mapping);
+   vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+   unsigned long address = vma_address(page, vma);
+   pgd_t *pgd;
+   p4d_t *p4d;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *pte;
+
+   pgd = pgd_offset(vma->vm_mm, address);
+   if (!pgd_present(*pgd))
+   continue;
+   p4d = p4d_offset(pgd, address);
+   if (!p4d_present(*p4d))
+   continue;
+   pud = pud_offset(p4d, address);
+   if (!pud_present(*pud))
+   continue;
+   if (pud_devmap(*pud)) {
+   size = PUD_SIZE;
+   break;
+   }
+   pmd = pmd_offset(pud, address);
+   if (!pmd_present(*pmd))
+   continue;
+   if (pmd_devmap(*pmd)) {
+   size = PMD_SIZE;
+   break;
+   }
+   pte = pte_offset_map(pmd, address);
+   if (!pte_present(*pte))
+   continue;
+   if (pte_devmap(*pte)) {
+   size = PAGE_SIZE;
+   break;
+   }
+   }
+   i_mmap_unlock_read(mapping);
+   return size;
+}
+
+static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
+   struct dev_pagemap *pgmap)
+{
+   struct page *page = pfn_to_page(pfn);
+   const bool unmap_success = true;
+   unsigned long size;
+   LIST_HEAD(tokill);
+   int rc = -EBUSY;
+   loff_t start;
+
+   lock_page(page);
+   if (hwpoison_filter(page)) {
+   rc = 0;
+   goto out;
+   }
+
+   switch (pgmap->type) {
+   case MEMORY_DEVICE_PRIVATE:
+   case MEMORY_DEVICE_PUBLIC:
+   /*
+* TODO: Handle HMM pages which may need coordination
+* with device-side memory.
+*/
+   goto out;
+   default:
+   if (!page->mapping)
+   goto out;
+   break;
+   }
+
+   /*
+* If the page is not mapped in userspace then report it as
+* unhandled.
+*/
+   size = dax_mapping_size(page);
+   if (!size) {
+   pr_err("Memory failure: %#lx: failed to unmap page\n", pfn);
+   goto out;
+   }
+
+   SetPageHWPoison(page);
+
+   /*
+* Unlike System-RAM there is no possibility to swap in a
+* different physical page at a given 

[PATCH 07/11] mm, madvise_inject_error: fix page count leak

2018-05-22 Thread Dan Williams
The madvise_inject_error() routine uses get_user_pages() to lookup the
pfn and other information for injected error, but it fails to release
that pin.

The dax-dma-vs-truncate warning catches this failure with the following
signature:

 Injecting memory failure for pfn 0x208900 at process virtual address 
0x7f3908d0
 Memory failure: 0x208900: reserved kernel page still referenced by 1 users
 Memory failure: 0x208900: recovery action for reserved kernel page: Failed
 WARNING: CPU: 37 PID: 9566 at fs/dax.c:348 dax_disassociate_entry+0x4e/0x90
 CPU: 37 PID: 9566 Comm: umount Tainted: GW  OE 4.17.0-rc6+ #1900
 [..]
 RIP: 0010:dax_disassociate_entry+0x4e/0x90
 RSP: 0018:c9000a9b3b30 EFLAGS: 00010002
 RAX: ea0008224000 RBX: 00208a00 RCX: 00208900
 RDX: 0001 RSI: 8804058c6160 RDI: 0008
 RBP: 0822000a R08: 0002 R09: 00208800
 R10:  R11: 00208801 R12: 8804058c6168
 R13:  R14: 0002 R15: 0001
 FS:  7f4548027fc0() GS:880431d4() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 56316d5f8988 CR3: 0004298cc000 CR4: 000406e0
 Call Trace:
  __dax_invalidate_mapping_entry+0xab/0xe0
  dax_delete_mapping_entry+0xf/0x20
  truncate_exceptional_pvec_entries.part.14+0x1d4/0x210
  truncate_inode_pages_range+0x291/0x920
  ? kmem_cache_free+0x1f8/0x300
  ? lock_acquire+0x9f/0x200
  ? truncate_inode_pages_final+0x31/0x50
  ext4_evict_inode+0x69/0x740

Cc: 
Fixes: bd1ce5f91f54 ("HWPOISON: avoid grabbing the page count...")
Cc: Michal Hocko 
Cc: Andi Kleen 
Cc: Wu Fengguang 
Signed-off-by: Dan Williams 
---
 mm/madvise.c |   11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 4d3c922ea1a1..246fa4d4eee2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -631,11 +631,13 @@ static int madvise_inject_error(int behavior,
 
 
for (; start < end; start += PAGE_SIZE << order) {
+   unsigned long pfn;
int ret;
 
ret = get_user_pages_fast(start, 1, 0, &page);
if (ret != 1)
return ret;
+   pfn = page_to_pfn(page);
 
/*
 * When soft offlining hugepages, after migrating the page
@@ -651,17 +653,20 @@ static int madvise_inject_error(int behavior,
 
if (behavior == MADV_SOFT_OFFLINE) {
pr_info("Soft offlining pfn %#lx at process virtual 
address %#lx\n",
-   page_to_pfn(page), start);
+   pfn, start);
 
ret = soft_offline_page(page, MF_COUNT_INCREASED);
+   put_page(page);
if (ret)
return ret;
continue;
}
+   put_page(page);
+
pr_info("Injecting memory failure for pfn %#lx at process 
virtual address %#lx\n",
-   page_to_pfn(page), start);
+   pfn, start);
 
-   ret = memory_failure(page_to_pfn(page), MF_COUNT_INCREASED);
+   ret = memory_failure(pfn, MF_COUNT_INCREASED);
if (ret)
return ret;
}



[PATCH 11/11] libnvdimm, pmem: restore page attributes when clearing errors

2018-05-22 Thread Dan Williams
Use clear_mce_nospec() to restore WB mode for the kernel linear mapping
of a pmem page that was marked 'HWPoison'. A page with 'HWPoison' set
has also been marked UC in PAT (page attribute table) via
set_mce_nospec() to prevent speculative retrievals of poison.

The 'HWPoison' flag is only cleared when overwriting an entire page.

Signed-off-by: Dan Williams 
---
 drivers/nvdimm/pmem.c |   26 ++
 drivers/nvdimm/pmem.h |   13 +
 2 files changed, 39 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 9d714926ecf5..04ee1fdee219 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -51,6 +52,30 @@ static struct nd_region *to_region(struct pmem_device *pmem)
return to_nd_region(to_dev(pmem)->parent);
 }
 
+static void hwpoison_clear(struct pmem_device *pmem,
+   phys_addr_t phys, unsigned int len)
+{
+   unsigned long pfn_start, pfn_end, pfn;
+
+   /* only pmem in the linear map supports HWPoison */
+   if (is_vmalloc_addr(pmem->virt_addr))
+   return;
+
+   pfn_start = PHYS_PFN(phys);
+   pfn_end = pfn_start + PHYS_PFN(len);
+   for (pfn = pfn_start; pfn < pfn_end; pfn++) {
+   struct page *page = pfn_to_page(pfn);
+
+   /*
+* Note, no need to hold a get_dev_pagemap() reference
+* here since we're in the driver I/O path and
+* outstanding I/O requests pin the dev_pagemap.
+*/
+   if (test_and_clear_pmem_poison(page))
+   clear_mce_nospec(pfn);
+   }
+}
+
 static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
phys_addr_t offset, unsigned int len)
 {
@@ -65,6 +90,7 @@ static blk_status_t pmem_clear_poison(struct pmem_device 
*pmem,
if (cleared < len)
rc = BLK_STS_IOERR;
if (cleared > 0 && cleared / 512) {
+   hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
cleared /= 512;
dev_dbg(dev, "%#llx clear %ld sector%s\n",
(unsigned long long) sector, cleared,
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index a64ebc78b5df..59cfe13ea8a8 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef __NVDIMM_PMEM_H__
 #define __NVDIMM_PMEM_H__
+#include 
 #include 
 #include 
 #include 
@@ -27,4 +28,16 @@ struct pmem_device {
 
 long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff,
long nr_pages, void **kaddr, pfn_t *pfn);
+
+#ifdef CONFIG_MEMORY_FAILURE
+static inline bool test_and_clear_pmem_poison(struct page *page)
+{
+   return TestClearPageHWPoison(page);
+}
+#else
+static inline bool test_and_clear_pmem_poison(struct page *page)
+{
+   return false;
+}
+#endif
 #endif /* __NVDIMM_PMEM_H__ */



[PATCH 08/11] x86, memory_failure: introduce {set, clear}_mce_nospec()

2018-05-22 Thread Dan Williams
Currently memory_failure() returns zero if the error was handled. On
that result mce_unmap_kpfn() is called to zap the page out of the kernel
linear mapping to prevent speculative fetches of potentially poisoned
memory. However, in the case of dax mapped devmap pages the page may be
in active permanent use by the device driver, so it cannot be unmapped
from the kernel.

Instead of marking the page not present, marking the page UC should
be sufficient for preventing poison from being pre-fetched into the
cache. Convert mce_unmap_kpfn() to set_mce_nospec(), remapping the page as
UC to hide it from speculative accesses.

Given that persistent memory errors can be cleared by the driver,
include a facility to restore the page to cacheable operation,
clear_mce_nospec().

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: 
Cc: Tony Luck 
Cc: Borislav Petkov 
Cc: 
Signed-off-by: Dan Williams 
---
 arch/x86/include/asm/set_memory.h |   29 ++
 arch/x86/kernel/cpu/mcheck/mce-internal.h |   15 ---
 arch/x86/kernel/cpu/mcheck/mce.c  |   38 ++---
 include/linux/set_memory.h|   14 +++
 4 files changed, 46 insertions(+), 50 deletions(-)

diff --git a/arch/x86/include/asm/set_memory.h 
b/arch/x86/include/asm/set_memory.h
index bd090367236c..debc1fee1457 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -88,4 +88,33 @@ extern int kernel_set_to_readonly;
 void set_kernel_text_rw(void);
 void set_kernel_text_ro(void);
 
+#ifdef CONFIG_X86_64
+/*
+ * Mark the linear address as UC to disable speculative pre-fetches into
+ * potentially poisoned memory.
+ */
+static inline int set_mce_nospec(unsigned long pfn)
+{
+   int rc;
+
+   rc = set_memory_uc((unsigned long) __va(PFN_PHYS(pfn)), 1);
+   if (rc)
+   pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
+   return rc;
+}
+#define set_mce_nospec set_mce_nospec
+
+/* Restore full speculative operation to the pfn. */
+static inline int clear_mce_nospec(unsigned long pfn)
+{
+   return set_memory_wb((unsigned long) __va(PFN_PHYS(pfn)), 1);
+}
+#define clear_mce_nospec clear_mce_nospec
+#else
+/*
+ * Few people would run a 32-bit kernel on a machine that supports
+ * recoverable errors because they have too much memory to boot 32-bit.
+ */
+#endif
+
 #endif /* _ASM_X86_SET_MEMORY_H */
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h 
b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 374d1aa66952..ceb67cd5918f 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -113,21 +113,6 @@ static inline void mce_register_injector_chain(struct 
notifier_block *nb)  { }
 static inline void mce_unregister_injector_chain(struct notifier_block *nb)
{ }
 #endif
 
-#ifndef CONFIG_X86_64
-/*
- * On 32-bit systems it would be difficult to safely unmap a poison page
- * from the kernel 1:1 map because there are no non-canonical addresses that
- * we can use to refer to the address without risking a speculative access.
- * However, this isn't much of an issue because:
- * 1) Few unmappable pages are in the 1:1 map. Most are in HIGHMEM which
- *are only mapped into the kernel as needed
- * 2) Few people would run a 32-bit kernel on a machine that supports
- *recoverable errors because they have too much memory to boot 32-bit.
- */
-static inline void mce_unmap_kpfn(unsigned long pfn) {}
-#define mce_unmap_kpfn mce_unmap_kpfn
-#endif
-
 struct mca_config {
bool dont_log_ce;
bool cmci_disabled;
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 42cf2880d0ed..a0fbf0a8b7e6 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -50,7 +51,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include "mce-internal.h"
 
@@ -108,10 +108,6 @@ static struct irq_work mce_irq_work;
 
 static void (*quirk_no_way_out)(int bank, struct mce *m, struct pt_regs *regs);
 
-#ifndef mce_unmap_kpfn
-static void mce_unmap_kpfn(unsigned long pfn);
-#endif
-
 /*
  * CPU/chipset specific EDAC code can register a notifier call here to print
  * MCE errors in a human-readable form.
@@ -602,7 +598,7 @@ static int srao_decode_notifier(struct notifier_block *nb, 
unsigned long val,
if (mce_usable_address(mce) && (mce->severity == MCE_AO_SEVERITY)) {
pfn = mce->addr >> PAGE_SHIFT;
if (!memory_failure(pfn, 0))
-   mce_unmap_kpfn(pfn);
+   set_mce_nospec(pfn);
}
 
return NOTIFY_OK;
@@ -1070,38 +1066,10 @@ static int do_memory_failure(struct mce *m)
if 

[PATCH 06/11] filesystem-dax: perform __dax_invalidate_mapping_entry() under the page lock

2018-05-22 Thread Dan Williams
Hold the page lock while invalidating mapping entries to prevent races
between rmap using the address_space and the filesystem freeing the
address_space.

This is more complicated than the simple description implies because
dev_pagemap pages that fsdax uses do not have any concept of page size.
Size information is stored in the radix and can only be safely read
while holding the xa_lock. Since lock_page() can not be taken while
holding xa_lock, drop xa_lock and speculatively lock all the associated
pages. Once all the pages are locked re-take the xa_lock and revalidate
that the radix entry did not change.

Cc: Jan Kara 
Cc: Christoph Hellwig 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 fs/dax.c |   91 ++
 1 file changed, 85 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 2e4682cd7c69..e6d44d336283 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -319,6 +319,13 @@ static unsigned long dax_radix_end_pfn(void *entry)
for (pfn = dax_radix_pfn(entry); \
pfn < dax_radix_end_pfn(entry); pfn++)
 
+#define for_each_mapped_pfn_reverse(entry, pfn) \
+   for (pfn = dax_radix_end_pfn(entry) - 1; \
+   dax_entry_size(entry) \
+   && pfn >= dax_radix_pfn(entry); \
+   pfn--)
+
+
 static void dax_associate_entry(void *entry, struct address_space *mapping,
struct vm_area_struct *vma, unsigned long address)
 {
@@ -497,6 +504,80 @@ static void *grab_mapping_entry(struct address_space 
*mapping, pgoff_t index,
return entry;
 }
 
+static bool dax_lock_pages(struct address_space *mapping, pgoff_t index,
+   void **entry)
+{
+   struct radix_tree_root *pages = &mapping->i_pages;
+   unsigned long pfn;
+   void *entry2;
+
+   xa_lock_irq(pages);
+   *entry = get_unlocked_mapping_entry(mapping, index, NULL);
+   if (!*entry || WARN_ON_ONCE(!radix_tree_exceptional_entry(*entry))) {
+   put_unlocked_mapping_entry(mapping, index, entry);
+   xa_unlock_irq(pages);
+   return false;
+   }
+
+   /*
+* In the limited case there are no races to prevent with rmap,
+* because rmap can not perform pfn_to_page().
+*/
+   if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+   return true;
+
+   /*
+* Now, drop the xa_lock, grab all the page locks then validate
+* that the entry has not changed and return with the xa_lock
+* held.
+*/
+   xa_unlock_irq(pages);
+
+   /*
+* Retry until the entry stabilizes or someone else invalidates
+* the entry;
+*/
+   for (;;) {
+   for_each_mapped_pfn(*entry, pfn)
+   lock_page(pfn_to_page(pfn));
+
+   xa_lock_irq(pages);
+   entry2 = get_unlocked_mapping_entry(mapping, index, NULL);
+   if (!entry2 || 
WARN_ON_ONCE(!radix_tree_exceptional_entry(entry2))
+   || entry2 != *entry) {
+   put_unlocked_mapping_entry(mapping, index, entry2);
+   xa_unlock_irq(pages);
+
+   for_each_mapped_pfn_reverse(*entry, pfn)
+   unlock_page(pfn_to_page(pfn));
+
+   if (!entry2 || !radix_tree_exceptional_entry(entry2))
+   return false;
+   *entry = entry2;
+   continue;
+   }
+   break;
+   }
+
+   return true;
+}
+
+static void dax_unlock_pages(struct address_space *mapping, pgoff_t index,
+   void *entry)
+{
+   struct radix_tree_root *pages = &mapping->i_pages;
+   unsigned long pfn;
+
+   put_unlocked_mapping_entry(mapping, index, entry);
+   xa_unlock_irq(pages);
+
+   if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+   return;
+
+   for_each_mapped_pfn_reverse(entry, pfn)
+   unlock_page(pfn_to_page(pfn));
+}
+
 static int __dax_invalidate_mapping_entry(struct address_space *mapping,
  pgoff_t index, bool trunc)
 {
@@ -504,10 +585,8 @@ static int __dax_invalidate_mapping_entry(struct 
address_space *mapping,
void *entry;
struct radix_tree_root *pages = &mapping->i_pages;
 
-   xa_lock_irq(pages);
-   entry = get_unlocked_mapping_entry(mapping, index, NULL);
-   if (!entry || WARN_ON_ONCE(!radix_tree_exceptional_entry(entry)))
-   goto out;
+   if (!dax_lock_pages(mapping, index, &entry))
+   return ret;
if (!trunc &&
(radix_tree_tag_get(pages, index, PAGECACHE_TAG_DIRTY) ||
 radix_tree_tag_get(pages, index, PAGECACHE_TAG_TOWRITE)))
@@ -517,8 +596,8 @@ static int 

[PATCH 09/11] mm, memory_failure: pass page size to kill_proc()

2018-05-22 Thread Dan Williams
Given that ZONE_DEVICE / dev_pagemap pages are never assembled into
compound pages, the size determination logic in kill_proc() needs
updating for the dev_pagemap case. In preparation for dev_pagemap
support rework memory_failure() and kill_proc() to pass / consume the page
size explicitly.

Cc: Naoya Horiguchi 
Signed-off-by: Dan Williams 
---
 mm/memory-failure.c |   16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 9d142b9b86dc..42a193ee14d3 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -179,18 +179,16 @@ EXPORT_SYMBOL_GPL(hwpoison_filter);
  * ``action required'' if error happened in current execution context
  */
 static int kill_proc(struct task_struct *t, unsigned long addr,
-   unsigned long pfn, struct page *page, int flags)
+   unsigned long pfn, unsigned size_shift, int flags)
 {
-   short addr_lsb;
int ret;
 
pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory 
corruption\n",
pfn, t->comm, t->pid);
-   addr_lsb = compound_order(compound_head(page)) + PAGE_SHIFT;
 
if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)addr,
-  addr_lsb, current);
+  size_shift, current);
} else {
/*
 * Don't use force here, it's convenient if the signal
@@ -199,7 +197,7 @@ static int kill_proc(struct task_struct *t, unsigned long 
addr,
 * to SIG_IGN, but hopefully no one will do that?
 */
ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)addr,
- addr_lsb, t);  /* synchronous? */
+ size_shift, t);  /* synchronous? */
}
if (ret < 0)
pr_info("Memory failure: Error sending signal to %s:%d: %d\n",
@@ -318,7 +316,7 @@ static void add_to_kill(struct task_struct *tsk, struct 
page *p,
  * wrong earlier.
  */
 static void kill_procs(struct list_head *to_kill, int forcekill,
- bool fail, struct page *page, unsigned long pfn,
+ bool fail, unsigned size_shift, unsigned long pfn,
  int flags)
 {
struct to_kill *tk, *next;
@@ -343,7 +341,7 @@ static void kill_procs(struct list_head *to_kill, int 
forcekill,
 * process anyways.
 */
else if (kill_proc(tk->tsk, tk->addr,
- pfn, page, flags) < 0)
+ pfn, size_shift, flags) < 0)
pr_err("Memory failure: %#lx: Cannot send 
advisory machine check signal to %s:%d\n",
   pfn, tk->tsk->comm, tk->tsk->pid);
}
@@ -928,6 +926,7 @@ static bool hwpoison_user_mappings(struct page *p, unsigned 
long pfn,
struct address_space *mapping;
LIST_HEAD(tokill);
bool unmap_success;
+   unsigned size_shift;
int kill = 1, forcekill;
struct page *hpage = *hpagep;
bool mlocked = PageMlocked(hpage);
@@ -1012,7 +1011,8 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 * any accesses to the poisoned memory.
 */
forcekill = PageDirty(hpage) || (flags & MF_MUST_KILL);
-   kill_procs(&tokill, forcekill, !unmap_success, p, pfn, flags);
+   size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
+   kill_procs(&tokill, forcekill, !unmap_success, size_shift, pfn, flags);
 
return unmap_success;
 }



[PATCH 03/11] device-dax: enable page_mapping()

2018-05-22 Thread Dan Williams
In support of enabling memory_failure() handling for device-dax
mappings, set the ->mapping association of pages backing device-dax
mappings. The rmap implementation requires page_mapping() to return the
address_space hosting the vmas that map the page.

The ->mapping pointer is never cleared. There is no possibility for the
page to become associated with another address_space while the device is
enabled. When the device is disabled the 'struct page' array for the
device is destroyed / later reinitialized to zero.
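
For context, this is roughly the file-backed reverse-map walk that
memory_failure() relies on to find the tasks to signal; a simplified sketch
of rmap_walk_file(), with the kill-list bookkeeping elided:

    struct address_space *mapping = page_mapping(page); /* needs ->mapping */
    pgoff_t pgoff = page_to_pgoff(page);                /* needs ->index   */
    struct vm_area_struct *vma;

    i_mmap_lock_read(mapping);
    vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
            unsigned long addr = vma_address(page, vma);
            /* ... record vma->vm_mm's owner and addr for the SIGBUS ... */
    }
    i_mmap_unlock_read(mapping);

Without a valid ->mapping the walk cannot even locate the interval tree,
which is why device-dax now populates it at fault time.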

Signed-off-by: Dan Williams 
---
 drivers/dax/device.c |   47 +++
 1 file changed, 31 insertions(+), 16 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 686de08e120b..8e986478d48d 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -245,13 +245,12 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 }
 
 static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
-   struct vm_fault *vmf)
+   struct vm_fault *vmf, pfn_t *pfn)
 {
struct vm_area_struct *vma = vmf->vma;
	struct device *dev = &dev_dax->dev;
struct dax_region *dax_region;
phys_addr_t phys;
-   pfn_t pfn;
unsigned int fault_size = PAGE_SIZE;
 
if (check_vma(dev_dax, vma, __func__))
@@ -273,13 +272,13 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
return VM_FAULT_SIGBUS;
}
 
-   pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+   *pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-   return vmf_insert_mixed(vma, vmf->address, pfn);
+   return vmf_insert_mixed(vma, vmf->address, *pfn);
 }
 
 static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
-   struct vm_fault *vmf)
+   struct vm_fault *vmf, pfn_t *pfn)
 {
unsigned long pmd_addr = vmf->address & PMD_MASK;
struct vm_area_struct *vma = vmf->vma;
@@ -287,7 +286,6 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
struct dax_region *dax_region;
phys_addr_t phys;
pgoff_t pgoff;
-   pfn_t pfn;
unsigned int fault_size = PMD_SIZE;
 
if (check_vma(dev_dax, vma, __func__))
@@ -322,15 +320,15 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
return VM_FAULT_SIGBUS;
}
 
-   pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+   *pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-   return vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, pfn,
+   return vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, *pfn,
vmf->flags & FAULT_FLAG_WRITE);
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
-   struct vm_fault *vmf)
+   struct vm_fault *vmf, pfn_t *pfn)
 {
unsigned long pud_addr = vmf->address & PUD_MASK;
struct vm_area_struct *vma = vmf->vma;
@@ -338,7 +336,6 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
struct dax_region *dax_region;
phys_addr_t phys;
pgoff_t pgoff;
-   pfn_t pfn;
unsigned int fault_size = PUD_SIZE;
 
 
@@ -374,9 +371,9 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
return VM_FAULT_SIGBUS;
}
 
-   pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+   *pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-   return vmf_insert_pfn_pud(vma, vmf->address, vmf->pud, pfn,
+   return vmf_insert_pfn_pud(vma, vmf->address, vmf->pud, *pfn,
vmf->flags & FAULT_FLAG_WRITE);
 }
 #else
@@ -390,9 +387,11 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
enum page_entry_size pe_size)
 {
-   int rc, id;
struct vm_area_struct *vma = vmf->vma;
struct file *filp = vma->vm_file;
+   unsigned long fault_size;
+   int rc, id;
+   pfn_t pfn;
struct dev_dax *dev_dax = filp->private_data;
 
	dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) size = %d\n", current->comm,
@@ -402,17 +401,33 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
id = dax_read_lock();
switch (pe_size) {
case PE_SIZE_PTE:
-   rc = __dev_dax_pte_fault(dev_dax, vmf);
+   fault_size = PAGE_SIZE;
+   rc = __dev_dax_pte_fault(dev_dax, vmf, &pfn);
break;
case PE_SIZE_PMD:
-   rc = __dev_dax_pmd_fault(dev_dax, vmf);
+   fault_size = PMD_SIZE;
+   rc = __dev_dax_pmd_fault(dev_dax, vmf, &pfn);
break;
case PE_SIZE_PUD:
-   rc = __dev_dax_pud_fault(dev_dax, 

[PATCH 05/11] filesystem-dax: set page->index

2018-05-22 Thread Dan Williams
In support of enabling memory_failure() handling for filesystem-dax
mappings, set ->index to the pgoff of the page. The rmap implementation
requires ->index to bound the search through the vma interval tree. The
index is set and cleared at dax_associate_entry() and
dax_disassociate_entry() time respectively.
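
As an illustration of the bookkeeping (distilled from the hunk below,
assuming an x86-64 PMD entry, i.e. 2MiB backed by 512 base pages):

    /* Every page backing one DAX entry gets a consecutive ->index. */
    unsigned long size = dax_entry_size(entry);  /* PAGE_SIZE or PMD_SIZE */
    unsigned long pfn, index = linear_page_index(vma, address & ~(size - 1));
    int i = 0;

    for_each_mapped_pfn(entry, pfn)
            pfn_to_page(pfn)->index = index + i++; /* 512 values for a PMD */

That bound is what lets the rmap walk search the vma interval tree for
exactly the pgoff range covered by the entry.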

Cc: Jan Kara 
Cc: Christoph Hellwig 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 fs/dax.c |   11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index aaec72ded1b6..2e4682cd7c69 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -319,18 +319,22 @@ static unsigned long dax_radix_end_pfn(void *entry)
for (pfn = dax_radix_pfn(entry); \
pfn < dax_radix_end_pfn(entry); pfn++)
 
-static void dax_associate_entry(void *entry, struct address_space *mapping)
+static void dax_associate_entry(void *entry, struct address_space *mapping,
+   struct vm_area_struct *vma, unsigned long address)
 {
-   unsigned long pfn;
+   unsigned long size = dax_entry_size(entry), pfn, index;
+   int i = 0;
 
if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
return;
 
+   index = linear_page_index(vma, address & ~(size - 1));
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);
 
WARN_ON_ONCE(page->mapping);
page->mapping = mapping;
+   page->index = index + i++;
}
 }
 
@@ -348,6 +352,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
WARN_ON_ONCE(page->mapping && page->mapping != mapping);
page->mapping = NULL;
+   page->index = 0;
}
 }
 
@@ -604,7 +609,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
new_entry = dax_radix_locked_entry(pfn, flags);
if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
dax_disassociate_entry(entry, mapping, false);
-   dax_associate_entry(new_entry, mapping);
+   dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
}
 
if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {



[PATCH 04/11] device-dax: set page->index

2018-05-22 Thread Dan Williams
In support of enabling memory_failure() handling for device-dax
mappings, set ->index to the pgoff of the page. The rmap implementation
requires ->index to bound the search through the vma interval tree.

The ->index value is never cleared. There is no possibility for the
page to become associated with another pgoff while the device is
enabled. When the device is disabled the 'struct page' array for the
device is destroyed and ->index is reinitialized to zero.

Signed-off-by: Dan Williams 
---
 drivers/dax/device.c |4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 8e986478d48d..b33e45ee4f70 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -418,7 +418,10 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
 
if (rc == VM_FAULT_NOPAGE) {
unsigned long i;
+   pgoff_t pgoff;
 
+   pgoff = linear_page_index(vma, vmf->address
+   & ~(fault_size - 1));
for (i = 0; i < fault_size / PAGE_SIZE; i++) {
struct page *page;
 
@@ -426,6 +429,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
if (page->mapping)
continue;
page->mapping = filp->f_mapping;
+   page->index = pgoff + i;
}
}
dax_read_unlock(id);



[PATCH 02/11] device-dax: cleanup vm_fault de-reference chains

2018-05-22 Thread Dan Williams
Define a local 'vma' variable rather than repetitively de-referencing
the passed in 'struct vm_fault *' instance.

Signed-off-by: Dan Williams 
---
 drivers/dax/device.c |   30 --
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index d44d98c54d0f..686de08e120b 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -247,13 +247,14 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax 
*dev_dax, pgoff_t pgoff,
 static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
struct vm_fault *vmf)
 {
+   struct vm_area_struct *vma = vmf->vma;
	struct device *dev = &dev_dax->dev;
struct dax_region *dax_region;
phys_addr_t phys;
pfn_t pfn;
unsigned int fault_size = PAGE_SIZE;
 
-   if (check_vma(dev_dax, vmf->vma, __func__))
+   if (check_vma(dev_dax, vma, __func__))
return VM_FAULT_SIGBUS;
 
dax_region = dev_dax->region;
@@ -274,13 +275,14 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
 
pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-   return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+   return vmf_insert_mixed(vma, vmf->address, pfn);
 }
 
 static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
struct vm_fault *vmf)
 {
unsigned long pmd_addr = vmf->address & PMD_MASK;
+   struct vm_area_struct *vma = vmf->vma;
	struct device *dev = &dev_dax->dev;
struct dax_region *dax_region;
phys_addr_t phys;
@@ -288,7 +290,7 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
pfn_t pfn;
unsigned int fault_size = PMD_SIZE;
 
-   if (check_vma(dev_dax, vmf->vma, __func__))
+   if (check_vma(dev_dax, vma, __func__))
return VM_FAULT_SIGBUS;
 
dax_region = dev_dax->region;
@@ -310,11 +312,10 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
return VM_FAULT_FALLBACK;
 
/* if we are outside of the VMA */
-   if (pmd_addr < vmf->vma->vm_start ||
-   (pmd_addr + PMD_SIZE) > vmf->vma->vm_end)
+   if (pmd_addr < vma->vm_start || (pmd_addr + PMD_SIZE) > vma->vm_end)
return VM_FAULT_SIGBUS;
 
-   pgoff = linear_page_index(vmf->vma, pmd_addr);
+   pgoff = linear_page_index(vma, pmd_addr);
phys = dax_pgoff_to_phys(dev_dax, pgoff, PMD_SIZE);
if (phys == -1) {
dev_dbg(dev, "pgoff_to_phys(%#lx) failed\n", pgoff);
@@ -323,7 +324,7 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
 
pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-   return vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd, pfn,
+   return vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, pfn,
vmf->flags & FAULT_FLAG_WRITE);
 }
 
@@ -332,6 +333,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
struct vm_fault *vmf)
 {
unsigned long pud_addr = vmf->address & PUD_MASK;
+   struct vm_area_struct *vma = vmf->vma;
	struct device *dev = &dev_dax->dev;
struct dax_region *dax_region;
phys_addr_t phys;
@@ -340,7 +342,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
unsigned int fault_size = PUD_SIZE;
 
 
-   if (check_vma(dev_dax, vmf->vma, __func__))
+   if (check_vma(dev_dax, vma, __func__))
return VM_FAULT_SIGBUS;
 
dax_region = dev_dax->region;
@@ -362,11 +364,10 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
return VM_FAULT_FALLBACK;
 
/* if we are outside of the VMA */
-   if (pud_addr < vmf->vma->vm_start ||
-   (pud_addr + PUD_SIZE) > vmf->vma->vm_end)
+   if (pud_addr < vma->vm_start || (pud_addr + PUD_SIZE) > vma->vm_end)
return VM_FAULT_SIGBUS;
 
-   pgoff = linear_page_index(vmf->vma, pud_addr);
+   pgoff = linear_page_index(vma, pud_addr);
phys = dax_pgoff_to_phys(dev_dax, pgoff, PUD_SIZE);
if (phys == -1) {
dev_dbg(dev, "pgoff_to_phys(%#lx) failed\n", pgoff);
@@ -375,7 +376,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
 
pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-   return vmf_insert_pfn_pud(vmf->vma, vmf->address, vmf->pud, pfn,
+   return vmf_insert_pfn_pud(vma, vmf->address, vmf->pud, pfn,
vmf->flags & FAULT_FLAG_WRITE);
 }
 #else
@@ -390,12 +391,13 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
enum page_entry_size pe_size)
 {
int rc, id;
-   struct file *filp = vmf->vma->vm_file;
+   struct vm_area_struct *vma = vmf->vma;
+   struct file *filp = vma->vm_file;
struct dev_dax *dev_dax = 

[PATCH 01/11] device-dax: convert to vmf_insert_mixed and vm_fault_t

2018-05-22 Thread Dan Williams
Use the new return type vm_fault_t for the fault and huge_fault handlers.
For now, this just documents that the functions return a VM_FAULT value
rather than an errno.  Once all instances are converted, vm_fault_t will
become a distinct type.

See commit 1c8f422059ae ("mm: change return type to vm_fault_t").

Previously vm_insert_mixed() returned an error code which the driver then
mapped into a VM_FAULT_* value. The new function vmf_insert_mixed() removes
that translation step by returning a VM_FAULT_* value directly.

Signed-off-by: Souptick Joarder 
Reviewed-by: Matthew Wilcox 
Reviewed-by: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 drivers/dax/device.c|   26 +++---
 include/linux/huge_mm.h |5 +++--
 mm/huge_memory.c|4 ++--
 3 files changed, 16 insertions(+), 19 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index aff2c1594220..d44d98c54d0f 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -244,11 +244,11 @@ __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
return -1;
 }
 
-static int __dev_dax_pte_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
+   struct vm_fault *vmf)
 {
	struct device *dev = &dev_dax->dev;
struct dax_region *dax_region;
-   int rc = VM_FAULT_SIGBUS;
phys_addr_t phys;
pfn_t pfn;
unsigned int fault_size = PAGE_SIZE;
@@ -274,17 +274,11 @@ static int __dev_dax_pte_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
 
pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
 
-   rc = vm_insert_mixed(vmf->vma, vmf->address, pfn);
-
-   if (rc == -ENOMEM)
-   return VM_FAULT_OOM;
-   if (rc < 0 && rc != -EBUSY)
-   return VM_FAULT_SIGBUS;
-
-   return VM_FAULT_NOPAGE;
+   return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
 }
 
-static int __dev_dax_pmd_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
+   struct vm_fault *vmf)
 {
unsigned long pmd_addr = vmf->address & PMD_MASK;
	struct device *dev = &dev_dax->dev;
@@ -334,7 +328,8 @@ static int __dev_dax_pmd_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-static int __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
+   struct vm_fault *vmf)
 {
unsigned long pud_addr = vmf->address & PUD_MASK;
	struct device *dev = &dev_dax->dev;
@@ -384,13 +379,14 @@ static int __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
vmf->flags & FAULT_FLAG_WRITE);
 }
 #else
-static int __dev_dax_pud_fault(struct dev_dax *dev_dax, struct vm_fault *vmf)
+static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
+   struct vm_fault *vmf)
 {
return VM_FAULT_FALLBACK;
 }
 #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 
-static int dev_dax_huge_fault(struct vm_fault *vmf,
+static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf,
enum page_entry_size pe_size)
 {
int rc, id;
@@ -420,7 +416,7 @@ static int dev_dax_huge_fault(struct vm_fault *vmf,
return rc;
 }
 
-static int dev_dax_fault(struct vm_fault *vmf)
+static vm_fault_t dev_dax_fault(struct vm_fault *vmf)
 {
return dev_dax_huge_fault(vmf, PE_SIZE_PTE);
 }
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a8a126259bc4..d3bbf6bea9e9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -3,6 +3,7 @@
 #define _LINUX_HUGE_MM_H
 
 #include <linux/sched/coredump.h>
+#include <linux/mm_types.h>
 
 #include <linux/fs.h> /* only for vma_is_dax() */
 
@@ -46,9 +47,9 @@ extern bool move_huge_pmd(struct vm_area_struct *vma, 
unsigned long old_addr,
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, pgprot_t newprot,
int prot_numa);
-int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, pfn_t pfn, bool write);
-int vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
+vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
pud_t *pud, pfn_t pfn, bool write);
 enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a3a1815f8e11..6af976472a5d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -755,7 +755,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
spin_unlock(ptl);
 }
 
-int 

[PATCH 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages

2018-05-22 Thread Dan Williams
As it stands, memory_failure() gets thoroughly confused by dev_pagemap
backed mappings. The recovery code has specific enabling for several
possible page states and needs new enabling to handle poison in dax
mappings.

In order to support reliable reverse mapping of user space addresses add
new locking in the fsdax implementation to prevent races between
page-address_space disassociation events and the rmap performed in the
memory_failure() path. Additionally, since dev_pagemap pages are hidden
from the page allocator, add a mechanism to determine the size of the
mapping that encompasses a given poisoned pfn. Lastly, since pmem errors
can be repaired, change the protection against speculative access to
poisoned memory, mce_unmap_kpfn(), to be reversible and otherwise allow
ongoing access from the kernel.
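
To make that last point concrete, the intended pairing of patches 8 and 11
looks roughly like this (the helper names come from the patch titles below;
the pfn-based signatures are an assumption here, see the x86 patch for the
actual definitions):

    /* memory_failure() path (patch 8): keep the pfn mapped but fenced off
     * from speculative access rather than unmapping it outright. */
    set_mce_nospec(pfn);

    /* pmem error-clearing path (patch 11): once the poison has been
     * cleared on the media, restore the normal kernel mapping. */
    clear_mce_nospec(pfn);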

---

Dan Williams (11):
  device-dax: convert to vmf_insert_mixed and vm_fault_t
  device-dax: cleanup vm_fault de-reference chains
  device-dax: enable page_mapping()
  device-dax: set page->index
  filesystem-dax: set page->index
  filesystem-dax: perform __dax_invalidate_mapping_entry() under the page 
lock
  mm, madvise_inject_error: fix page count leak
  x86, memory_failure: introduce {set,clear}_mce_nospec()
  mm, memory_failure: pass page size to kill_proc()
  mm, memory_failure: teach memory_failure() about dev_pagemap pages
  libnvdimm, pmem: restore page attributes when clearing errors


 arch/x86/include/asm/set_memory.h |   29 ++
 arch/x86/kernel/cpu/mcheck/mce-internal.h |   15 ---
 arch/x86/kernel/cpu/mcheck/mce.c  |   38 +---
 drivers/dax/device.c  |   91 
 drivers/nvdimm/pmem.c |   26 ++
 drivers/nvdimm/pmem.h |   13 +++
 fs/dax.c  |  102 --
 include/linux/huge_mm.h   |5 +
 include/linux/set_memory.h|   14 +++
 mm/huge_memory.c  |4 -
 mm/madvise.c  |   11 ++
 mm/memory-failure.c   |  133 +++--
 12 files changed, 370 insertions(+), 111 deletions(-)


Re: [dm-devel] [patch 4/4] dm-writecache: use new API for flushing

2018-05-22 Thread Christoph Hellwig
On Sat, May 19, 2018 at 07:25:07AM +0200, Mikulas Patocka wrote:
> Use new API for flushing persistent memory.

The sentence doesn't make much sense.  'A new API' or 'a better
abstraction', maybe?

> 
> The problem is this:
> * on X86-64, non-temporal stores have the best performance
> * ARM64 doesn't have non-temporal stores, so we must flush cache. We
>   should flush cache as late as possible, because it performs better this
>   way.
> 
> We introduce functions pmem_memcpy, pmem_flush and pmem_commit. To commit
> data persistently, all three functions must be called.
> 
> The macro pmem_assign may be used instead of pmem_memcpy. pmem_assign
> (unlike pmem_memcpy) guarantees that 8-byte values are written atomically.
> 
> On X86, pmem_memcpy is memcpy_flushcache, pmem_flush is empty and
> pmem_commit is wmb.
> 
> On ARM64, pmem_memcpy is memcpy, pmem_flush is arch_wb_cache_pmem and
> pmem_commit is empty.

All these should be provided by the pmem layer, and be properly
documented.  And be sorted before adding your new target that uses
them.
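
For readers following along, a minimal sketch of the abstraction under
discussion, written out from Mikulas' description above (the per-arch
mappings are as quoted; where these helpers should live and how they get
documented is exactly the point being raised):

    #ifdef CONFIG_X86_64
    /* Non-temporal stores: no explicit cache flush, just an sfence to order. */
    #define pmem_memcpy(dst, src, n)  memcpy_flushcache((dst), (src), (n))
    #define pmem_flush(addr, len)     do { } while (0)
    #define pmem_commit()             wmb()
    #else /* e.g. arm64: cached stores, flush the range as late as possible */
    #define pmem_memcpy(dst, src, n)  memcpy((dst), (src), (n))
    #define pmem_flush(addr, len)     arch_wb_cache_pmem((addr), (len))
    #define pmem_commit()             do { } while (0)
    #endif

    /* Data is only durable after all three steps: pmem_memcpy(), then
     * pmem_flush() on the written range, then pmem_commit(). pmem_assign()
     * is the 8-byte-atomic variant of pmem_memcpy() mentioned above. */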


Re: [PATCH v10] mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS

2018-05-22 Thread Christoph Hellwig
On Mon, May 21, 2018 at 11:04:10AM +0200, Jan Kara wrote:
> We definitely do have customers using "execute in place" on s390x from
> dcssblk. I've got about two bug reports for it when customers were updating
> from old kernels using original XIP to kernels using DAX. So we need to
> keep that working.

That is all good and fine, but I think the time has come for s390 to
migrate to providing the pmem API so that we can get rid of these
special cases, especially given that the old XIP/legacy DAX has all
kinds of known bugs at this point in time.


Re: [PATCH v11 7/7] xfs, dax: introduce xfs_break_dax_layouts()

2018-05-22 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH v11 2/7] mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS

2018-05-22 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH v11 1/7] memremap: split devm_memremap_pages() and memremap() infrastructure

2018-05-22 Thread Christoph Hellwig
Looks good:

Reviewed-by: Christoph Hellwig 