Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush

2017-10-19 Thread Jason Gunthorpe
On Mon, Oct 16, 2017 at 03:02:52PM +0300, Sagi Grimberg wrote:
> But why should the kernel ever need to mangle the CQ? If a lease break
> deregisters the MR, the device is expected to generate remote
> protection errors on its own.

The point is to avoid protection errors - a hitless changeover when
the DAX mapping changes, like ODP does.

The only way to get there is to notify the app before the mappings
change.. Dan suggested having ibv_poll_cq() return this indication..

Jason
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush

2017-10-17 Thread Christoph Hellwig
On Mon, Oct 16, 2017 at 12:44:31PM -0700, Dan Williams wrote:
> > While I agree with the need for a per-MR notification mechanism, one
> > thing we lose by walking away from MAP_DIRECT is a way for a
> > hypervisor to coordinate pass through of a DAX mapping to an RDMA
> > device in a guest. That will remain a case where we will still need to
> > use device-dax. I'm fine if that's the answer, but just want to be
> > clear about all the places we need to protect a DAX mapping against
> > RDMA from a non-ODP device.
> 
> For this specific issue perhaps we promote FL_LAYOUT as a lease-type
> that can be set by fcntl().

I don't think it is a good userspace interface, mostly because it
is about things that don't matter for userspace (block mappings).

It makes sense as a kernel interface for callers that want to pin
down memory long-term, but for userspace the fact that the block
mapping changes doesn't matter - what matters is that their long-term
pin is broken by something.


Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush

2017-10-16 Thread Dan Williams
On Mon, Oct 16, 2017 at 10:43 AM, Dan Williams  wrote:
> On Mon, Oct 16, 2017 at 12:26 AM, Christoph Hellwig  wrote:
>> On Fri, Oct 13, 2017 at 11:31:45AM -0600, Jason Gunthorpe wrote:
>>> I don't think that really represents how lots of apps actually use
>>> RDMA.
>>>
>>> RDMA is often buried down in the software stack (eg in a MPI), and by
>>> the time a mapping gets used for RDMA transfer the link between the
>>> FD, mmap and the MR is totally opaque.
>>>
>>> Having a MR specific notification means the low level RDMA libraries
>>> have a chance to deal with everything for the app.
>>>
>>> Eg consider a HPC app using MPI that uses some DAX aware library to
>>> get DAX backed mmap's. It then passes memory in those mmaps to the
>>> MPI library to do transfers. The MPI creates the MR on demand.
>>>
>>
>> I suspect one of the more interesting use cases might be a file server,
>> for which that's not the case.  But otherwise I agree with the above,
>> and also think that notifying the MR handle is the only way to go for
>> another very important reason:  fencing.  What if the application/library
>> does not react to the notification?  With a per-MR notification we
>> can unregister the MR in kernel space and have a rock solid fencing
>> mechanism.  And that is the most important bit here.
>
> While I agree with the need for a per-MR notification mechanism, one
> thing we lose by walking away from MAP_DIRECT is a way for a
> hypervisor to coordinate pass through of a DAX mapping to an RDMA
> device in a guest. That will remain a case where we will still need to
> use device-dax. I'm fine if that's the answer, but just want to be
> clear about all the places we need to protect a DAX mapping against
> RDMA from a non-ODP device.

For this specific issue perhaps we promote FL_LAYOUT as a lease-type
that can be set by fcntl().


Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush

2017-10-16 Thread Dan Williams
On Mon, Oct 16, 2017 at 12:26 AM, Christoph Hellwig  wrote:
> On Fri, Oct 13, 2017 at 11:31:45AM -0600, Jason Gunthorpe wrote:
>> I don't think that really represents how lots of apps actually use
>> RDMA.
>>
>> RDMA is often buried down in the software stack (eg in a MPI), and by
>> the time a mapping gets used for RDMA transfer the link between the
>> FD, mmap and the MR is totally opaque.
>>
>> Having a MR specific notification means the low level RDMA libraries
>> have a chance to deal with everything for the app.
>>
>> Eg consider a HPC app using MPI that uses some DAX aware library to
>> get DAX backed mmap's. It then passes memory in those mmaps to the
>> MPI library to do transfers. The MPI creates the MR on demand.
>>
>
> I suspect one of the more interesting use cases might be a file server,
> for which that's not the case.  But otherwise I agree with the above,
> and also think that notifying the MR handle is the only way to go for
> another very important reason:  fencing.  What if the application/library
> does not react to the notification?  With a per-MR notification we
> can unregister the MR in kernel space and have a rock solid fencing
> mechanism.  And that is the most important bit here.

While I agree with the need for a per-MR notification mechanism, one
thing we lose by walking away from MAP_DIRECT is a way for a
hypervisor to coordinate pass through of a DAX mapping to an RDMA
device in a guest. That will remain a case where we will still need to
use device-dax. I'm fine if that's the answer, but just want to be
clear about all the places we need to protect a DAX mapping against
RDMA from a non-ODP device.


Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush

2017-10-16 Thread Sagi Grimberg



I don't think that really represents how lots of apps actually use
RDMA.

RDMA is often buried down in the software stack (eg in a MPI), and by
the time a mapping gets used for RDMA transfer the link between the
FD, mmap and the MR is totally opaque.

Having a MR specific notification means the low level RDMA libraries
have a chance to deal with everything for the app.

Eg consider a HPC app using MPI that uses some DAX aware library to
get DAX backed mmap's. It then passes memory in those mmaps to the
MPI library to do transfers. The MPI creates the MR on demand.



I suspect one of the more interesting use cases might be a file server,
for which that's not the case.  But otherwise I agree with the above,
and also think that notifying the MR handle is the only way to go for
another very important reason:  fencing.  What if the application/library
does not react to the notification?  With a per-MR notification we
can unregister the MR in kernel space and have a rock solid fencing
mechanism.  And that is the most important bit here.


I agree we must deregister the MR in kernel space. As said, I think
it's perfectly reasonable to let user-space see error completions and
provide a query mechanism at MR granularity (unfortunately this will
probably need driver assistance, as drivers know how their devices
report MR-granularity access violations).


Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush

2017-10-16 Thread Sagi Grimberg


Hey folks, (chiming in very late here...)


I think, if you want to build a uAPI for notification of MR lease
break, then you need show how it fits into the above software model:
  - How it can be hidden in a RDMA specific library


So, here's a strawman: can ibv_poll_cq() start returning ibv_wc_status
== IBV_WC_LOC_PROT_ERR when file coherency is lost? This would make
the solution generic across DAX and non-DAX. What's your feeling for
how well applications are prepared to deal with that status return?


Stuffing an entry into the CQ is difficult. The CQ is in user memory
and it is DMA'd from the HCA for several pieces of hardware, so the
kernel can't just stuff something in there. It can be done
with HW support by having the HCA DMA it via an exception path or
something, but even then, you run into questions like CQ overflow and
accounting issues since it is not meant for this.


But why should the kernel ever need to mangle the CQ? If a lease break
deregisters the MR, the device is expected to generate remote
protection errors on its own.

And in that case, I think we need a query mechanism rather than an
event mechanism, so that when the application starts seeing protection
errors it can query the relevant MR (I think most if not all devices
have that information in their internal completion queue entries).



So, you need a side channel of some kind, either in certain drivers or
generically..


  - How lease break can be done hitlessly, so the library user never
needs to know it is happening or see failed/missed transfers


I agree that the application should not be aware of lease breakages, but
seeing failed transfers is perfectly acceptable given that an access
violation is happening (my assumption is that failed transfers are error
completions reported in the user completion queue). What we need is a
framework to help user-space recover sanely: query which MR had the
access violation, restore it, and re-establish the queue pair.



iommu redirect should be hitless and behave like the page cache case
where RDMA targets pages that are no longer part of the file.


Yes, if the iommu can be fenced properly it sounds doable.


  - Whatever fast path checking is needed does not kill performance


What do you consider a fast path? I was assuming that memory
registration is a slow path, and iommu operations are asynchronous so
should not impact performance of ongoing operations beyond typical
iommu overhead.


ibv_poll_cq() and ibv_post_send() would be a fast path.

Where this struggled before is that in creating a side channel you also
now have to check that side channel, and checking it at high speed is
quite hard.. Even quiescing things to be able to tear down the MR
has performance implications on post send...


This is exactly why I think we should not have it, but instead give
building blocks to recover sanely from error completions...


Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush

2017-10-16 Thread Christoph Hellwig
On Fri, Oct 13, 2017 at 11:22:21AM -0700, Dan Williams wrote:
> So, here's a strawman: can ibv_poll_cq() start returning ibv_wc_status
> == IBV_WC_LOC_PROT_ERR when file coherency is lost? This would make
> the solution generic across DAX and non-DAX. What's your feeling for
> how well applications are prepared to deal with that status return?

The problem isn't local protection errors, but remote protection errors
when we modify an MR with an rkey that the remote side accesses.

> >  - How lease break can be done hitlessly, so the library user never
> >needs to know it is happening or see failed/missed transfers
> 
> iommu redirect should be hitless and behave like the page cache case
> where RDMA targets pages that are no longer part of the file.

But systems that care about performance (e.g. the usual RDMA users) usually
don't use an IOMMU due to the performance impact.  Especially as HCAs
already have their own built-in iommus (aka the MR mechanism).

Note that file systems already have a mechanism like you mention above
to keep extents that are busy from being reallocated.  E.g. take a look at
fs/xfs/xfs_extent_busy.c.  The downside is that this could lock down
a massive amount of space in the busy list if we, for example, have an
MR covering a huge file that is truncated down.  So even if we'd want that
scheme we'd need some sort of ulimit for the amount of DAX pages locked
down in get_user_pages.


Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush

2017-10-16 Thread Christoph Hellwig
On Fri, Oct 13, 2017 at 11:31:45AM -0600, Jason Gunthorpe wrote:
> I don't think that really represents how lots of apps actually use
> RDMA.
> 
> RDMA is often buried down in the software stack (eg in a MPI), and by
> the time a mapping gets used for RDMA transfer the link between the
> FD, mmap and the MR is totally opaque.
> 
> Having a MR specific notification means the low level RDMA libraries
> have a chance to deal with everything for the app.
> 
> Eg consider a HPC app using MPI that uses some DAX aware library to
> get DAX backed mmap's. It then passes memory in those mmaps to the
> MPI library to do transfers. The MPI creates the MR on demand.
> 

I suspect one of the more interesting use cases might be a file server,
for which that's not the case.  But otherwise I agree with the above,
and also think that notifying the MR handle is the only way to go for
another very important reason:  fencing.  What if the application/library
does not react to the notification?  With a per-MR notification we
can unregister the MR in kernel space and have a rock solid fencing
mechanism.  And that is the most important bit here.


Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush

2017-10-16 Thread Christoph Hellwig
On Fri, Oct 13, 2017 at 10:38:22AM -0600, Jason Gunthorpe wrote:
> > scheme specific to RDMA which seems like a waste to me when we can
> > generically signal an event on the fd for any event that affects any
> > of the vma's on the file. The FL_LAYOUT lease impacts the entire file,
> > so as far as I can see delaying the notification until MR-init is too
> > late, too granular, and too RDMA specific.
> 
> But for RDMA a FD is not what we care about - we want the MR handle so
> the app knows which MR needs fixing.

Yes.  Although the fd for the ibX device might be a good handle to
transport that information, unlike the fd for the mapped file.


Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush

2017-10-13 Thread Jason Gunthorpe
On Fri, Oct 13, 2017 at 11:22:21AM -0700, Dan Williams wrote:
> > So, who should be responsible for MR coherency? Today we say the MPI
> > is responsible. But we can't really expect the MPI
> > to hook SIGIO and somehow try to reverse engineer what MRs are
> > impacted from a FD that may not even still be open.
> 
> Ok, that's good insight that I didn't have. Userspace needs more help
> than just an fd notification.

Glad to help!

> > I think, if you want to build a uAPI for notification of MR lease
> > break, then you need show how it fits into the above software model:
> >  - How it can be hidden in a RDMA specific library
> 
> So, here's a strawman: can ibv_poll_cq() start returning ibv_wc_status
> == IBV_WC_LOC_PROT_ERR when file coherency is lost? This would make
> the solution generic across DAX and non-DAX. What's your feeling for
> how well applications are prepared to deal with that status return?

Stuffing an entry into the CQ is difficult. The CQ is in user memory
and it is DMA'd from the HCA for several pieces of hardware, so the
kernel can't just stuff something in there. It can be done
with HW support by having the HCA DMA it via an exception path or
something, but even then, you run into questions like CQ overflow and
accounting issues since it is not meant for this.

So, you need a side channel of some kind, either in certain drivers or
generically..

> >  - How lease break can be done hitlessly, so the library user never
> >needs to know it is happening or see failed/missed transfers
> 
> iommu redirect should be hitless and behave like the page cache case
> where RDMA targets pages that are no longer part of the file.

Yes, if the iommu can be fenced properly it sounds doable.

> >  - Whatever fast path checking is needed does not kill performance
> 
> What do you consider a fast path? I was assuming that memory
> registration is a slow path, and iommu operations are asynchronous so
> should not impact performance of ongoing operations beyond typical
> iommu overhead.

ibv_poll_cq() and ibv_post_send() would be a fast path.

Where this struggled before is that in creating a side channel you also
now have to check that side channel, and checking it at high speed is
quite hard.. Even quiescing things to be able to tear down the MR
has performance implications on post send...

Now that I see this whole thing in this light it seems very similar
to the MPI-driven user space mmu notification ideas and has similar
challenges. FWIW, RDMA banged its head on this issue for 10 years and
it was ODP that emerged as the solution.

One option might be to use an async event notification, 'MR
de-coherence', and rely on a main polling loop to catch it.

This is good enough for DAX because the lease requestor would wait
until the async event was processed. It would also be acceptable for
the general MPI case too, but only if this lease concept were wider
than just DAX, e.g. an MR leases a piece of a VMA, and if anything
changes that VMA (e.g. munmap, mmap, mremap, etc.) then it has to
wait for the MR to release the lease, i.e. munmap would block until
the async event is processed. ODP-light in userspace, essentially.

IIRC this sort of suggestion was never explored, something like:

poll(fd)
ibv_read_async_event(fd)
if (event == MR_DECOHERENCE) {
    quiesce_network();
    ibv_restore_mr(mr);
    restore_network();
}

The implementation of ibv_restore_mr would have to make a new MR that
pointed to the same virtual memory addresses, but was backed by the
*new* physical pages. This means it has to unblock the lease, and wait
for the lease requestor to complete executing.

Jason


Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush

2017-10-13 Thread Dan Williams
On Fri, Oct 13, 2017 at 10:31 AM, Jason Gunthorpe
 wrote:
> On Fri, Oct 13, 2017 at 10:01:04AM -0700, Dan Williams wrote:
>> On Fri, Oct 13, 2017 at 9:38 AM, Jason Gunthorpe
>>  wrote:
>> > On Fri, Oct 13, 2017 at 08:14:55AM -0700, Dan Williams wrote:
>> >
>> >> scheme specific to RDMA which seems like a waste to me when we can
>> >> generically signal an event on the fd for any event that affects any
>> >> of the vma's on the file. The FL_LAYOUT lease impacts the entire file,
>> >> so as far as I can see delaying the notification until MR-init is too
>> >> late, too granular, and too RDMA specific.
>> >
>> > But for RDMA a FD is not what we care about - we want the MR handle so
>> > the app knows which MR needs fixing.
>>
>> I'd rather put the onus on userspace to remember where it used a
>> MAP_DIRECT mapping and be aware that all the mappings of that file are
>> subject to a lease break. Sure, we could build up a pile of kernel
>> infrastructure to notify on a per-MR basis, but I think that would
>> only be worth it if leases were range based. As it is, the entire file
>> is covered by a lease instance and all MRs that might reference that
>> file get one notification. That said, we can always arrange for a
>> per-driver callback at lease-break time so that it can do something
>> above and beyond the default notification.
>
> I don't think that really represents how lots of apps actually use
> RDMA.
>
> RDMA is often buried down in the software stack (eg in a MPI), and by
> the time a mapping gets used for RDMA transfer the link between the
> FD, mmap and the MR is totally opaque.
>
> Having a MR specific notification means the low level RDMA libraries
> have a chance to deal with everything for the app.
>
> Eg consider a HPC app using MPI that uses some DAX aware library to
> get DAX backed mmap's. It then passes memory in those mmaps to the
> MPI library to do transfers. The MPI creates the MR on demand.
>
> So, who should be responsible for MR coherency? Today we say the MPI
> is responsible. But we can't really expect the MPI
> to hook SIGIO and somehow try to reverse engineer what MRs are
> impacted from a FD that may not even still be open.

Ok, that's good insight that I didn't have. Userspace needs more help
than just an fd notification.

> I think, if you want to build a uAPI for notification of MR lease
> break, then you need show how it fits into the above software model:
>  - How it can be hidden in a RDMA specific library

So, here's a strawman: can ibv_poll_cq() start returning ibv_wc_status
== IBV_WC_LOC_PROT_ERR when file coherency is lost? This would make
the solution generic across DAX and non-DAX. What's your feeling for
how well applications are prepared to deal with that status return?

>  - How lease break can be done hitlessly, so the library user never
>needs to know it is happening or see failed/missed transfers

iommu redirect should be hitless and behave like the page cache case
where RDMA targets pages that are no longer part of the file.

>  - Whatever fast path checking is needed does not kill performance

What do you consider a fast path? I was assuming that memory
registration is a slow path, and iommu operations are asynchronous so
should not impact performance of ongoing operations beyond typical
iommu overhead.


Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush

2017-10-13 Thread Jason Gunthorpe
On Fri, Oct 13, 2017 at 10:01:04AM -0700, Dan Williams wrote:
> On Fri, Oct 13, 2017 at 9:38 AM, Jason Gunthorpe
>  wrote:
> > On Fri, Oct 13, 2017 at 08:14:55AM -0700, Dan Williams wrote:
> >
> >> scheme specific to RDMA which seems like a waste to me when we can
> >> generically signal an event on the fd for any event that affects any
> >> of the vma's on the file. The FL_LAYOUT lease impacts the entire file,
> >> so as far as I can see delaying the notification until MR-init is too
> >> late, too granular, and too RDMA specific.
> >
> > But for RDMA a FD is not what we care about - we want the MR handle so
> > the app knows which MR needs fixing.
> 
> I'd rather put the onus on userspace to remember where it used a
> MAP_DIRECT mapping and be aware that all the mappings of that file are
> subject to a lease break. Sure, we could build up a pile of kernel
> infrastructure to notify on a per-MR basis, but I think that would
> only be worth it if leases were range based. As it is, the entire file
> is covered by a lease instance and all MRs that might reference that
> file get one notification. That said, we can always arrange for a
> per-driver callback at lease-break time so that it can do something
> above and beyond the default notification.

I don't think that really represents how lots of apps actually use
RDMA.

RDMA is often buried down in the software stack (eg in a MPI), and by
the time a mapping gets used for RDMA transfer the link between the
FD, mmap and the MR is totally opaque.

Having a MR specific notification means the low level RDMA libraries
have a chance to deal with everything for the app.

Eg consider a HPC app using MPI that uses some DAX aware library to
get DAX backed mmap's. It then passes memory in those mmaps to the
MPI library to do transfers. The MPI creates the MR on demand.

So, who should be responsible for MR coherency? Today we say the MPI
is responsible. But we can't really expect the MPI
to hook SIGIO and somehow try to reverse engineer what MRs are
impacted from a FD that may not even still be open.

I think, if you want to build a uAPI for notification of MR lease
break, then you need show how it fits into the above software model:
 - How it can be hidden in a RDMA specific library
 - How lease break can be done hitlessly, so the library user never
   needs to know it is happening or see failed/missed transfers
 - Whatever fast path checking is needed does not kill performance

Jason


Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush

2017-10-13 Thread Dan Williams
On Fri, Oct 13, 2017 at 9:38 AM, Jason Gunthorpe
 wrote:
> On Fri, Oct 13, 2017 at 08:14:55AM -0700, Dan Williams wrote:
>
>> scheme specific to RDMA which seems like a waste to me when we can
>> generically signal an event on the fd for any event that affects any
>> of the vma's on the file. The FL_LAYOUT lease impacts the entire file,
>> so as far as I can see delaying the notification until MR-init is too
>> late, too granular, and too RDMA specific.
>
> But for RDMA a FD is not what we care about - we want the MR handle so
> the app knows which MR needs fixing.

I'd rather put the onus on userspace to remember where it used a
MAP_DIRECT mapping and be aware that all the mappings of that file are
subject to a lease break. Sure, we could build up a pile of kernel
infrastructure to notify on a per-MR basis, but I think that would
only be worth it if leases were range based. As it is, the entire file
is covered by a lease instance and all MRs that might reference that
file get one notification. That said, we can always arrange for a
per-driver callback at lease-break time so that it can do something
above and beyond the default notification.


Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush

2017-10-13 Thread Jason Gunthorpe
On Fri, Oct 13, 2017 at 08:14:55AM -0700, Dan Williams wrote:

> scheme specific to RDMA which seems like a waste to me when we can
> generically signal an event on the fd for any event that affects any
> of the vma's on the file. The FL_LAYOUT lease impacts the entire file,
> so as far as I can see delaying the notification until MR-init is too
> late, too granular, and too RDMA specific.

But for RDMA a FD is not what we care about - we want the MR handle so
the app knows which MR needs fixing.

Jason


Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush

2017-10-13 Thread Dan Williams
On Thu, Oct 12, 2017 at 11:57 PM, Christoph Hellwig  wrote:
> On Thu, Oct 12, 2017 at 10:41:39AM -0700, Dan Williams wrote:
>> So, you're jumping into this review at v9 where I've split the patches
>> that take an initial MAP_DIRECT lease out from the patches that take
>> FL_LAYOUT leases at memory registration time. You can see a previous
>> attempt in "[PATCH v8 00/14] MAP_DIRECT for DAX RDMA and userspace
>> flush" which should be in your inbox.
>
> The point is that your problem has absolutely nothing to do with mmap,
> and all with get_user_pages.
>
> get_user_pages on DAX doesn't give the same guarantees as on pagecache
> or anonymous memory, and that is the problem we need to fix.  In fact
> I'm pretty sure if we try hard enough (and we might have to try
> very hard) we can see the same problem with plain direct I/O and without
> any RDMA involved, e.g. do a larger direct I/O write to memory that is
> mmap()ed from a DAX file, then truncate the DAX file and reallocate
> the blocks, and we might corrupt that new file.  We'll probably need
> a special setup where there is little other chance but to reallocate
> those used blocks.

I'll take a harder look at this...

> So what we need to do first is to fix get_user_pages vs unmapping
> DAX mmap()ed blocks, be that from a hole punch, truncate, COW
> operation, etc.
>
> Then we need to look into the special case of a long-living non-transient
> get_user_pages that RDMA does - we can't just reject any truncate or
> other operation for that, so that's where something like my layout
> lease suggestion comes into play - but the call that should get the
> lease is not the mmap - it's the memory registration call that does
> the get_user_pages.

Yes, mmap is not the place to get the lease for a later
get_user_pages, and my patches do take an additional lease at
get_user_pages / MR init time. However, the mmap call has the
file-descriptor for SIGIO the MR-init call does not. If we delay all
of the setup it to MR time then we need to invent a notification
scheme specific to RDMA which seems like a waste to me when we can
generically signal an event on the fd for any event that affects any
of the vma's on the file. The FL_LAYOUT lease impacts the entire file,
so as far as I can see delaying the notification until MR-init is too
late, too granular, and too RDMA specific.


Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush

2017-10-13 Thread Christoph Hellwig
On Thu, Oct 12, 2017 at 10:41:39AM -0700, Dan Williams wrote:
> So, you're jumping into this review at v9 where I've split the patches
> that take an initial MAP_DIRECT lease out from the patches that take
> FL_LAYOUT leases at memory registration time. You can see a previous
> attempt in "[PATCH v8 00/14] MAP_DIRECT for DAX RDMA and userspace
> flush" which should be in your inbox.

The point is that your problem has absolutely nothing to do with mmap,
and all with get_user_pages.

get_user_pages on DAX doesn't give the same guarantees as on pagecache
or anonymous memory, and that is the problem we need to fix.  In fact
I'm pretty sure if we try hard enough (and we might have to try
very hard) we can see the same problem with plain direct I/O and without
any RDMA involved, e.g. do a larger direct I/O write to memory that is
mmap()ed from a DAX file, then truncate the DAX file and reallocate
the blocks, and we might corrupt that new file.  We'll probably need
a special setup where there is little other chance but to reallocate
those used blocks.

So what we need to do first is to fix get_user_pages vs unmapping
DAX mmap()ed blocks, be that from a hole punch, truncate, COW
operation, etc.

Then we need to look into the special case of a long-living non-transient
get_user_pages that RDMA does - we can't just reject any truncate or
other operation for that, so that's where something like my layout
lease suggestion comes into play - but the call that should get the
lease is not the mmap - it's the memory registration call that does
the get_user_pages.


Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush

2017-10-12 Thread Dan Williams
On Thu, Oct 12, 2017 at 7:23 AM, Christoph Hellwig  wrote:
> Sorry for chiming in so late, been extremely busy lately.
>
> From quickly glancing over what the now finally described use case is
> (which contradicts the subject btw - it's not about flushing, it's
> about not removing block mappings under an MR) and the previous comments
> I think that mmap is simply the wrong kind of interface for this.
>
> What we want is support for a new kind of userspace memory registration
> in the RDMA code that uses the pnfs export interface, both getting the
> block (or rather byte in this case) mapping, and also getting the
> FL_LAYOUT lease for the memory registration.
>
> That btw is exactly what I do for the pNFS RDMA layout, just in-kernel.

...and this is exactly my plan.

So, you're jumping into this review at v9 where I've split the patches
that take an initial MAP_DIRECT lease out from the patches that take
FL_LAYOUT leases at memory registration time. You can see a previous
attempt in "[PATCH v8 00/14] MAP_DIRECT for DAX RDMA and userspace
flush" which should be in your inbox.

I'm not proposing mmap as the memory registration interface, it's the
"register for notification of lease break" interface. Here's my
proposed sequence:

addr = mmap(..., MAP_DIRECT.., fd); <- register a vma for "direct"
memory registrations with an FL_LAYOUT lease that at a lease break
event sends SIGIO on the fd used for mmap.

ibv_reg_mr(..., addr, ...); <- check for a valid MAP_DIRECT vma, and
take out another FL_LAYOUT lease. This lease force revokes the RDMA
mapping when it expires, and it relies on the process receiving SIGIO
as the 'break' notification.

fallocate(fd, PUNCH_HOLE...) <- breaks all the FL_LAYOUT leases, the
vma owner gets notified by fd.
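For concreteness, the sequence above might look as follows from
userspace. This is an illustrative sketch only: MAP_DIRECT is precisely
what this series proposes, so the flag, its interaction with
MAP_SHARED_VALIDATE, and the SIGIO delivery details are all assumptions
about the proposed interface, not mainline API, and this does not
compile as-is.

```c
/*
 * Sketch of the proposed flow. MAP_DIRECT does not exist in mainline;
 * everything here is the interface under discussion, not current API.
 */
int fd = open("/mnt/dax/file", O_RDWR);

/*
 * Register the vma for "direct" memory registrations; the kernel takes
 * an FL_LAYOUT lease and will deliver SIGIO on this fd at lease break.
 */
void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
		  MAP_SHARED_VALIDATE | MAP_DIRECT, fd, 0);

/*
 * The memory registration checks for a valid MAP_DIRECT vma and takes
 * out a second FL_LAYOUT lease covering the pinned range.
 */
struct ibv_mr *mr = ibv_reg_mr(pd, addr, len, IBV_ACCESS_REMOTE_WRITE);

/*
 * Meanwhile, any block-map mutation, e.g.:
 *
 *	fallocate(fd, FALLOC_FL_PUNCH_HOLE, offset, len);
 *
 * breaks the leases; the application sees SIGIO and must quiesce RDMA
 * before the lease-break timeout force-revokes the mapping.
 */
```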

Al rightly points out that the fd may be closed by the time the event
fires since the lease follows the vma lifetime. I see two ways to
solve this: document that the process may get notifications on a stale
fd if close() happens before munmap(), or, similar to how we call
locks_remove_posix() in filp_close(), add a routine to disable any
lease notifiers on close(). I'll investigate the second option because
this seems to be a general problem with leases.

For RDMA I am presently re-working the implementation [1]. Inspired by
a discussion with Jason [2], I am going to add something like
ib_umem_ops to allow drivers to override the default policy of what
happens on a lease that expires. The default action is to invalidate
device access to the memory with iommu_unmap(), but I want to allow
for drivers to do something smarter or choose to not support DAX
mappings at all.
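As a sketch of that idea (no ib_umem_ops structure exists in mainline;
every name here is hypothetical), the per-driver override point might
look like:

```c
struct ib_umem;

/* Hypothetical per-driver policy for lease expiry on a pinned umem. */
struct ib_umem_ops {
	/*
	 * Called when the FL_LAYOUT lease backing this registration
	 * expires. The default action would revoke device access via
	 * iommu_unmap(); smarter hardware could instead quiesce the
	 * queue pair and re-register the region.
	 */
	void (*lease_expired)(struct ib_umem *umem);

	/* Let a driver opt out of DAX-backed registrations entirely. */
	bool (*supports_dax)(void);
};
```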

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-October/012785.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-October/012793.html


Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush

2017-10-12 Thread Christoph Hellwig
Sorry for chiming in so late, been extremely busy lately.

From quickly glancing over what the now finally described use case is
(which contradicts the subject btw - it's not about flushing, it's
about not removing block mappings under an MR) and the previous comments
I think that mmap is simply the wrong kind of interface for this.

What we want is support for a new kind of userspace memory registration in the
RDMA code that uses the pnfs export interface, both getting the block (or
rather byte in this case) mapping, and also gets the FL_LAYOUT lease for the
memory registration.

That btw is exactly what I do for the pNFS RDMA layout, just in-kernel.


[PATCH v9 0/6] MAP_DIRECT for DAX userspace flush

2017-10-11 Thread Dan Williams
Changes since v8 [1]:
* Move MAP_SHARED_VALIDATE definition next to MAP_SHARED in all arch
  headers (Jan)

* Include xfs_layout.h directly in all the files that call
  xfs_break_layouts() (Dave)

* Clarify / add more comments to the MAP_DIRECT checks at fault time
  (Dave)

* Rename iomap_can_allocate() to break_layouts_nowait() to make plain
  why we are bailing out of iomap_begin.

* Defer the lease_direct mechanism and RDMA core changes to a later
  patch series.

* EXT4 support is in the works and will be rebased on Jan's MAP_SYNC
  patches.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-October/012772.html

---

MAP_DIRECT is a mechanism that allows an application to establish a
mapping where the kernel will not change the block-map, or otherwise
dirty the block-map metadata of a file without notification. It supports
a "flush from userspace" model where persistent memory applications can
bypass the overhead of ongoing coordination of writes with the
filesystem, and it provides safety to RDMA operations involving DAX
mappings.

The kernel always has the ability to revoke access and convert the file
back to normal operation after performing a "lease break". Similar to
fcntl leases, there is no way for userspace to cancel the lease break
process once it has started, it can only delay it via the
/proc/sys/fs/lease-break-time setting.

MAP_DIRECT enables XFS to supplant the device-dax interface for
mmap-write access to persistent memory with no ongoing coordination with
the filesystem via fsync/msync syscalls.

The MAP_DIRECT mechanism is complementary to MAP_SYNC. Here are some
scenarios where you would choose one over the other:

* 3rd party DMA / RDMA to DAX with hardware that does not support
  on-demand paging (shared virtual memory) => MAP_DIRECT

* Support for reflinked inodes, fallocate-punch-hole, truncate, or any
  other operation that mutates the block map of an actively
  mapped file => MAP_SYNC

* Userspace flush => MAP_SYNC or MAP_DIRECT

* Assurances that the file's block map metadata is stable, i.e. minimize
  worst case fault latency by locking out updates => MAP_DIRECT

---

Dan Williams (6):
  mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap 
flags
  fs, mm: pass fd to ->mmap_validate()
  fs: MAP_DIRECT core
  xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT
  fs, xfs, iomap: introduce break_layout_nowait()
  xfs: wire up MAP_DIRECT


 arch/alpha/include/uapi/asm/mman.h   |1 
 arch/mips/include/uapi/asm/mman.h|1 
 arch/mips/kernel/vdso.c  |2 
 arch/parisc/include/uapi/asm/mman.h  |1 
 arch/tile/mm/elf.c   |3 
 arch/x86/mm/mpx.c|3 
 arch/xtensa/include/uapi/asm/mman.h  |1 
 fs/Kconfig   |1 
 fs/Makefile  |2 
 fs/aio.c |2 
 fs/mapdirect.c   |  237 ++
 fs/xfs/Kconfig   |4 
 fs/xfs/Makefile  |1 
 fs/xfs/xfs_file.c|  108 
 fs/xfs/xfs_ioctl.c   |1 
 fs/xfs/xfs_iomap.c   |3 
 fs/xfs/xfs_iops.c|1 
 fs/xfs/xfs_layout.c  |   45 +
 fs/xfs/xfs_layout.h  |   13 +
 fs/xfs/xfs_pnfs.c|   31 ---
 fs/xfs/xfs_pnfs.h|8 -
 include/linux/fs.h   |   11 +
 include/linux/mapdirect.h|   40 
 include/linux/mm.h   |9 +
 include/linux/mman.h |   42 +
 include/uapi/asm-generic/mman-common.h   |1 
 include/uapi/asm-generic/mman.h  |1 
 ipc/shm.c|3 
 mm/internal.h|2 
 mm/mmap.c|   28 ++-
 mm/nommu.c   |5 -
 mm/util.c|7 -
 tools/include/uapi/asm-generic/mman-common.h |1 
 33 files changed, 557 insertions(+), 62 deletions(-)
 create mode 100644 fs/mapdirect.c
 create mode 100644 fs/xfs/xfs_layout.c
 create mode 100644 fs/xfs/xfs_layout.h
 create mode 100644 include/linux/mapdirect.h