Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
On Mon, Oct 16, 2017 at 03:02:52PM +0300, Sagi Grimberg wrote:
> But why should the kernel ever need to mangle the CQ? if a lease break
> would deregister the MR the device is expected to generate remote
> protection errors on its own.

The point is to avoid protection errors - a hitless changeover when the
DAX mapping changes, like ODP does.

The only way to get there is to notify the app before the mappings
change.. Dan suggested having ibv_poll_cq return this indication..

Jason
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
On Mon, Oct 16, 2017 at 12:44:31PM -0700, Dan Williams wrote:
> > While I agree with the need for a per-MR notification mechanism, one
> > thing we lose by walking away from MAP_DIRECT is a way for a
> > hypervisor to coordinate pass through of a DAX mapping to an RDMA
> > device in a guest. That will remain a case where we will still need to
> > use device-dax. I'm fine if that's the answer, but just want to be
> > clear about all the places we need to protect a DAX mapping against
> > RDMA from a non-ODP device.
>
> For this specific issue perhaps we promote FL_LAYOUT as a lease-type
> that can be set by fcntl().

I don't think it is a good userspace interface, mostly because it is
about things that don't matter for userspace (block mappings).

It makes sense as a kernel interface for callers that want to pin down
memory long-term, but for userspace the fact that the block mapping
changes doesn't matter - it matters that their long term pin is broken
by something.
Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
On Mon, Oct 16, 2017 at 10:43 AM, Dan Williams wrote:
> On Mon, Oct 16, 2017 at 12:26 AM, Christoph Hellwig wrote:
>> On Fri, Oct 13, 2017 at 11:31:45AM -0600, Jason Gunthorpe wrote:
>>> I don't think that really represents how lots of apps actually use
>>> RDMA.
>>>
>>> RDMA is often buried down in the software stack (eg in a MPI), and by
>>> the time a mapping gets used for RDMA transfer the link between the
>>> FD, mmap and the MR is totally opaque.
>>>
>>> Having a MR specific notification means the low level RDMA libraries
>>> have a chance to deal with everything for the app.
>>>
>>> Eg consider a HPC app using MPI that uses some DAX aware library to
>>> get DAX backed mmap's. It then passes memory in those mmaps to the
>>> MPI library to do transfers. The MPI creates the MR on demand.
>>>
>>
>> I suspect one of the more interesting use cases might be a file server,
>> for which that's not the case. But otherwise I agree with the above,
>> and also think that notifying the MR handle is the only way to go for
>> another very important reason: fencing. What if the application/library
>> does not react to the notification? With a per-MR notification we
>> can unregister the MR in kernel space and have a rock solid fencing
>> mechanism. And that is the most important bit here.
>
> While I agree with the need for a per-MR notification mechanism, one
> thing we lose by walking away from MAP_DIRECT is a way for a
> hypervisor to coordinate pass through of a DAX mapping to an RDMA
> device in a guest. That will remain a case where we will still need to
> use device-dax. I'm fine if that's the answer, but just want to be
> clear about all the places we need to protect a DAX mapping against
> RDMA from a non-ODP device.

For this specific issue perhaps we promote FL_LAYOUT as a lease-type
that can be set by fcntl().
Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
On Mon, Oct 16, 2017 at 12:26 AM, Christoph Hellwig wrote:
> On Fri, Oct 13, 2017 at 11:31:45AM -0600, Jason Gunthorpe wrote:
>> I don't think that really represents how lots of apps actually use
>> RDMA.
>>
>> RDMA is often buried down in the software stack (eg in a MPI), and by
>> the time a mapping gets used for RDMA transfer the link between the
>> FD, mmap and the MR is totally opaque.
>>
>> Having a MR specific notification means the low level RDMA libraries
>> have a chance to deal with everything for the app.
>>
>> Eg consider a HPC app using MPI that uses some DAX aware library to
>> get DAX backed mmap's. It then passes memory in those mmaps to the
>> MPI library to do transfers. The MPI creates the MR on demand.
>>
>
> I suspect one of the more interesting use cases might be a file server,
> for which that's not the case. But otherwise I agree with the above,
> and also think that notifying the MR handle is the only way to go for
> another very important reason: fencing. What if the application/library
> does not react to the notification? With a per-MR notification we
> can unregister the MR in kernel space and have a rock solid fencing
> mechanism. And that is the most important bit here.

While I agree with the need for a per-MR notification mechanism, one
thing we lose by walking away from MAP_DIRECT is a way for a
hypervisor to coordinate pass through of a DAX mapping to an RDMA
device in a guest. That will remain a case where we will still need to
use device-dax. I'm fine if that's the answer, but just want to be
clear about all the places we need to protect a DAX mapping against
RDMA from a non-ODP device.
Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
>> I don't think that really represents how lots of apps actually use
>> RDMA.
>>
>> RDMA is often buried down in the software stack (eg in a MPI), and by
>> the time a mapping gets used for RDMA transfer the link between the
>> FD, mmap and the MR is totally opaque.
>>
>> Having a MR specific notification means the low level RDMA libraries
>> have a chance to deal with everything for the app.
>>
>> Eg consider a HPC app using MPI that uses some DAX aware library to
>> get DAX backed mmap's. It then passes memory in those mmaps to the
>> MPI library to do transfers. The MPI creates the MR on demand.
>
> I suspect one of the more interesting use cases might be a file server,
> for which that's not the case. But otherwise I agree with the above,
> and also think that notifying the MR handle is the only way to go for
> another very important reason: fencing. What if the application/library
> does not react to the notification? With a per-MR notification we
> can unregister the MR in kernel space and have a rock solid fencing
> mechanism. And that is the most important bit here.

I agree we must deregister the MR in kernel space. As said, I think
it's perfectly reasonable to let user-space see error completions and
provide a query mechanism at MR granularity (unfortunately this will
probably need the drivers' assistance, as they know how their device
reports access violations at MR granularity).
Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
Hey folks, (chiming in very late here...)

>>> I think, if you want to build a uAPI for notification of MR lease
>>> break, then you need show how it fits into the above software model:
>>> - How it can be hidden in a RDMA specific library
>>
>> So, here's a strawman: can ibv_poll_cq() start returning ibv_wc_status
>> == IBV_WC_LOC_PROT_ERR when file coherency is lost. This would make
>> the solution generic across DAX and non-DAX. What's your feeling for
>> how well applications are prepared to deal with that status return?
>
> Stuffing an entry into the CQ is difficult. The CQ is in user memory
> and it is DMA'd from the HCA for several pieces of hardware, so the
> kernel can't just stuff something in there. It can be done
> with HW support by having the HCA DMA it via an exception path or
> something, but even then, you run into questions like CQ overflow and
> accounting issues since it is not meant for this.

But why should the kernel ever need to mangle the CQ? If a lease break
would deregister the MR, the device is expected to generate remote
protection errors on its own.

And in that case, I think we need a query mechanism rather than an event
mechanism, so when the application starts seeing protection errors
it can query the relevant MR (I think most if not all devices have that
information in their internal completion queue entries).

> So, you need a side channel of some kind, either in certain drivers or
> generically..

>>> - How lease break can be done hitlessly, so the library user never
>>> needs to know it is happening or see failed/missed transfers

I agree that the application should not be aware of lease breakages, but
seeing failed transfers is perfectly acceptable given that an access
violation is happening (my assumption is that failed transfers are error
completions reported in the user completion queue). What we need to have
is a framework to help user-space to recover sanely, which is to query
what MR had the access violation, restore it, and re-establish the queue
pair.

>> iommu redirect should be hitless and behave like the page cache case
>> where RDMA targets pages that are no longer part of the file.
>
> Yes, if the iommu can be fenced properly it sounds doable.

>>> - Whatever fast path checking is needed does not kill performance
>>
>> What do you consider a fast path? I was assuming that memory
>> registration is a slow path, and iommu operations are asynchronous so
>> should not impact performance of ongoing operations beyond typical
>> iommu overhead.
>
> ibv_poll_cq() and ibv_post_send() would be a fast path.
>
> Where this struggled before is in creating a side channel you also now
> have to check that side channel, and checking it at high performance
> is quite hard.. Even quiescing things to be able to tear down the MR
> has performance implications on post send...

This is exactly why I think we should not have it, but instead give
building blocks to recover sanely from error completions...
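The recovery model described above (see error completions, find out which MR was hit, restore it) could look roughly like the following on the library side. This is only a sketch under loud assumptions: `struct sketch_wc`, its `mr_id` field, the `SKETCH_WC_*` statuses and `collect_faulted_mrs()` are all hypothetical stand-ins, since the thread's premise is that real work completions do not identify the faulting MR and a driver-assisted query mechanism would be needed to supply it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for verbs completion statuses. */
enum sketch_wc_status {
    SKETCH_WC_SUCCESS,
    SKETCH_WC_LOC_PROT_ERR,
    SKETCH_WC_REM_ACCESS_ERR,
};

/*
 * Hypothetical completion record. The mr_id field is an assumption:
 * it models the answer a per-MR query mechanism would return.
 */
struct sketch_wc {
    enum sketch_wc_status status;
    uint32_t mr_id;
};

/*
 * Walk a batch of completions and collect the distinct MR ids named by
 * error completions. Returns how many ids were written to out (<= cap).
 * The caller would then restore each MR and re-establish the queue pair.
 */
size_t collect_faulted_mrs(const struct sketch_wc *wcs, size_t n,
                           uint32_t *out, size_t cap)
{
    size_t found = 0;

    assert(out != NULL || cap == 0);
    for (size_t i = 0; i < n; i++) {
        int seen = 0;

        if (wcs[i].status == SKETCH_WC_SUCCESS)
            continue;
        for (size_t j = 0; j < found; j++)
            if (out[j] == wcs[i].mr_id)
                seen = 1;
        if (!seen && found < cap)
            out[found++] = wcs[i].mr_id;
    }
    return found;
}
```

Nothing here runs on the post-send or poll fast path beyond the status check that applications already perform; the extra work happens only once an error completion has been seen.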
Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
On Fri, Oct 13, 2017 at 11:22:21AM -0700, Dan Williams wrote:
> So, here's a strawman: can ibv_poll_cq() start returning ibv_wc_status
> == IBV_WC_LOC_PROT_ERR when file coherency is lost. This would make
> the solution generic across DAX and non-DAX. What's your feeling for
> how well applications are prepared to deal with that status return?

The problem isn't local protection errors, but remote protection errors
when we modify a MR with an rkey that the remote side accesses.

> > - How lease break can be done hitlessly, so the library user never
> > needs to know it is happening or see failed/missed transfers
>
> iommu redirect should be hitless and behave like the page cache case
> where RDMA targets pages that are no longer part of the file.

But systems that care about performance (e.g. the usual RDMA users)
usually don't use an IOMMU due to the performance impact. Especially
as HCAs already have their own built-in iommus (aka the MR mechanism).

Note that file systems already have a mechanism like you mention above
to keep extents that are busy from being reallocated. E.g. take a look
at fs/xfs/xfs_extent_busy.c. The downside is that this could lock down
a massive amount of space in the busy list if we for example have a MR
covering a huge file that is truncated down. So even if we'd want that
scheme we'd need some sort of ulimit for the amount of DAX pages locked
down in get_user_pages.
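To make the "some sort of ulimit" idea concrete, here is a minimal accounting sketch in the spirit of a per-user pinned-page cap. Everything in it is assumed for illustration: `struct pin_account`, `pin_charge()` and `pin_uncharge()` are hypothetical names, not kernel interfaces, and a real implementation would also need locking and a policy for who owns the account.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-user budget for DAX pages held by long-term pins. */
struct pin_account {
    size_t limit_pages;   /* the "ulimit" */
    size_t pinned_pages;  /* pages currently charged */
};

/*
 * Charge npages against the budget before pinning.
 * Returns 0 on success, -1 if the pin would exceed the limit.
 * The subtraction form avoids overflow in pinned_pages + npages.
 */
int pin_charge(struct pin_account *acct, size_t npages)
{
    assert(acct->pinned_pages <= acct->limit_pages);
    if (npages > acct->limit_pages - acct->pinned_pages)
        return -1;
    acct->pinned_pages += npages;
    return 0;
}

/* Return npages to the budget once the pin (e.g. the MR) is released. */
void pin_uncharge(struct pin_account *acct, size_t npages)
{
    assert(npages <= acct->pinned_pages);
    acct->pinned_pages -= npages;
}
```

With such a cap, a truncate under a huge MR can strand at most `limit_pages` worth of extents in the busy list, which is the bound the paragraph above asks for.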
Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
On Fri, Oct 13, 2017 at 11:31:45AM -0600, Jason Gunthorpe wrote:
> I don't think that really represents how lots of apps actually use
> RDMA.
>
> RDMA is often buried down in the software stack (eg in a MPI), and by
> the time a mapping gets used for RDMA transfer the link between the
> FD, mmap and the MR is totally opaque.
>
> Having a MR specific notification means the low level RDMA libraries
> have a chance to deal with everything for the app.
>
> Eg consider a HPC app using MPI that uses some DAX aware library to
> get DAX backed mmap's. It then passes memory in those mmaps to the
> MPI library to do transfers. The MPI creates the MR on demand.
>

I suspect one of the more interesting use cases might be a file server,
for which that's not the case. But otherwise I agree with the above,
and also think that notifying the MR handle is the only way to go for
another very important reason: fencing. What if the application/library
does not react to the notification? With a per-MR notification we
can unregister the MR in kernel space and have a rock solid fencing
mechanism. And that is the most important bit here.
Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
On Fri, Oct 13, 2017 at 10:38:22AM -0600, Jason Gunthorpe wrote:
> > scheme specific to RDMA which seems like a waste to me when we can
> > generically signal an event on the fd for any event that affects any
> > of the vma's on the file. The FL_LAYOUT lease impacts the entire file,
> > so as far as I can see delaying the notification until MR-init is too
> > late, too granular, and too RDMA specific.
>
> But for RDMA a FD is not what we care about - we want the MR handle so
> the app knows which MR needs fixing.

Yes. Although the fd for the ibX device might be a good handle to
transport that information, unlike the fd for the mapped file.
Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
On Fri, Oct 13, 2017 at 11:22:21AM -0700, Dan Williams wrote:
> > So, who should be responsible for MR coherency? Today we say the MPI
> > is responsible. But we can't really expect the MPI
> > to hook SIGIO and somehow try to reverse engineer what MRs are
> > impacted from a FD that may not even still be open.
>
> Ok, that's good insight that I didn't have. Userspace needs more help
> than just an fd notification.

Glad to help!

> > I think, if you want to build a uAPI for notification of MR lease
> > break, then you need show how it fits into the above software model:
> > - How it can be hidden in a RDMA specific library
>
> So, here's a strawman: can ibv_poll_cq() start returning ibv_wc_status
> == IBV_WC_LOC_PROT_ERR when file coherency is lost. This would make
> the solution generic across DAX and non-DAX. What's your feeling for
> how well applications are prepared to deal with that status return?

Stuffing an entry into the CQ is difficult. The CQ is in user memory
and it is DMA'd from the HCA for several pieces of hardware, so the
kernel can't just stuff something in there. It can be done
with HW support by having the HCA DMA it via an exception path or
something, but even then, you run into questions like CQ overflow and
accounting issues since it is not meant for this.

So, you need a side channel of some kind, either in certain drivers or
generically..

> > - How lease break can be done hitlessly, so the library user never
> > needs to know it is happening or see failed/missed transfers
>
> iommu redirect should be hitless and behave like the page cache case
> where RDMA targets pages that are no longer part of the file.

Yes, if the iommu can be fenced properly it sounds doable.

> > - Whatever fast path checking is needed does not kill performance
>
> What do you consider a fast path? I was assuming that memory
> registration is a slow path, and iommu operations are asynchronous so
> should not impact performance of ongoing operations beyond typical
> iommu overhead.

ibv_poll_cq() and ibv_post_send() would be a fast path.

Where this struggled before is in creating a side channel you also now
have to check that side channel, and checking it at high performance
is quite hard.. Even quiescing things to be able to tear down the MR
has performance implications on post send...

Now that I see this whole thing in this light it seems so very similar
to the MPI driven user space mmu notifications ideas and has similar
challenges. FWIW, RDMA banged its head on this issue for 10 years and
it was ODP that emerged as the solution.

One option might be to use an async event notification 'MR
de-coherence' and rely on a main polling loop to catch it.

This is good enough for dax because the lease-requestor would wait
until the async event was processed. It would also be acceptable for
the general MPI case too, but only if this lease concept was wider
than just DAX, eg a MR leases a piece of VMA, and if anything anyhow
changes that VMA (eg munmap, mmap, mremap, etc) then it has to wait
for the MR to release the lease. ie munmap would block until the
async event is processed. ODP-light in userspace, essentially.

IIRC this sort of suggestion was never explored, something like:

    poll(fd)
    ibv_read_async_event(fd)
    if (event == MR_DECOHERENCE) {
        quiesce_network();
        ibv_restore_mr(mr);
        restore_network();
    }

The implementation of ibv_restore_mr would have to make a new MR that
pointed to the same virtual memory addresses, but was backed by the
*new* physical pages. This means it has to unblock the lease, and wait
for the lease requestor to complete executing.

Jason
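The polling-loop idea in the message above can be exercised without RDMA hardware by standing in a plain file descriptor for the device's async event channel. The sketch below is an illustration only: `SKETCH_EVENT_MR_DECOHERENCE`, `struct sketch_mr` and `service_async_event()` are hypothetical, no such event or `ibv_restore_mr()` exists in libibverbs (whose real async entry point is `ibv_get_async_event()`), and the "restore" step merely bumps a generation counter where a real library would re-register the MR against the new physical pages.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical event codes delivered on the async channel. */
enum sketch_event {
    SKETCH_EVENT_OTHER = 0,
    SKETCH_EVENT_MR_DECOHERENCE = 1,
};

/* Hypothetical library-side view of one memory registration. */
struct sketch_mr {
    int network_quiesced;
    unsigned generation;  /* bumped each time the MR is rebuilt */
};

static void quiesce_network(struct sketch_mr *mr)
{
    mr->network_quiesced = 1;
}

static void restore_mr(struct sketch_mr *mr)
{
    /* Rebuilding the MR is only safe once traffic has been stopped. */
    assert(mr->network_quiesced);
    mr->generation++;
}

static void restore_network(struct sketch_mr *mr)
{
    mr->network_quiesced = 0;
}

/*
 * One iteration of the main polling loop.
 * Returns 1 if the MR was restored, 0 if there was nothing to do,
 * and -1 on a poll/read error.
 */
int service_async_event(int fd, struct sketch_mr *mr, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int event;
    int rc = poll(&pfd, 1, timeout_ms);

    if (rc < 0)
        return -1;
    if (rc == 0 || !(pfd.revents & POLLIN))
        return 0;
    if (read(fd, &event, sizeof(event)) != (ssize_t)sizeof(event))
        return -1;
    if (event != SKETCH_EVENT_MR_DECOHERENCE)
        return 0;

    quiesce_network(mr);
    restore_mr(mr);
    restore_network(mr);
    return 1;
}
```

The lease requestor's wait is not modelled here; in the scheme described above it would block until this handler has finished with the old MR.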
Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
On Fri, Oct 13, 2017 at 10:31 AM, Jason Gunthorpe wrote:
> On Fri, Oct 13, 2017 at 10:01:04AM -0700, Dan Williams wrote:
>> On Fri, Oct 13, 2017 at 9:38 AM, Jason Gunthorpe wrote:
>> > On Fri, Oct 13, 2017 at 08:14:55AM -0700, Dan Williams wrote:
>> >
>> >> scheme specific to RDMA which seems like a waste to me when we can
>> >> generically signal an event on the fd for any event that affects any
>> >> of the vma's on the file. The FL_LAYOUT lease impacts the entire file,
>> >> so as far as I can see delaying the notification until MR-init is too
>> >> late, too granular, and too RDMA specific.
>> >
>> > But for RDMA a FD is not what we care about - we want the MR handle so
>> > the app knows which MR needs fixing.
>>
>> I'd rather put the onus on userspace to remember where it used a
>> MAP_DIRECT mapping and be aware that all the mappings of that file are
>> subject to a lease break. Sure, we could build up a pile of kernel
>> infrastructure to notify on a per-MR basis, but I think that would
>> only be worth it if leases were range based. As it is, the entire file
>> is covered by a lease instance and all MRs that might reference that
>> file get one notification. That said, we can always arrange for a
>> per-driver callback at lease-break time so that it can do something
>> above and beyond the default notification.
>
> I don't think that really represents how lots of apps actually use
> RDMA.
>
> RDMA is often buried down in the software stack (eg in a MPI), and by
> the time a mapping gets used for RDMA transfer the link between the
> FD, mmap and the MR is totally opaque.
>
> Having a MR specific notification means the low level RDMA libraries
> have a chance to deal with everything for the app.
>
> Eg consider a HPC app using MPI that uses some DAX aware library to
> get DAX backed mmap's. It then passes memory in those mmaps to the
> MPI library to do transfers. The MPI creates the MR on demand.
>
> So, who should be responsible for MR coherency? Today we say the MPI
> is responsible. But we can't really expect the MPI
> to hook SIGIO and somehow try to reverse engineer what MRs are
> impacted from a FD that may not even still be open.

Ok, that's good insight that I didn't have. Userspace needs more help
than just an fd notification.

> I think, if you want to build a uAPI for notification of MR lease
> break, then you need show how it fits into the above software model:
> - How it can be hidden in a RDMA specific library

So, here's a strawman: can ibv_poll_cq() start returning ibv_wc_status
== IBV_WC_LOC_PROT_ERR when file coherency is lost. This would make
the solution generic across DAX and non-DAX. What's your feeling for
how well applications are prepared to deal with that status return?

> - How lease break can be done hitlessly, so the library user never
> needs to know it is happening or see failed/missed transfers

iommu redirect should be hitless and behave like the page cache case
where RDMA targets pages that are no longer part of the file.

> - Whatever fast path checking is needed does not kill performance

What do you consider a fast path? I was assuming that memory
registration is a slow path, and iommu operations are asynchronous so
should not impact performance of ongoing operations beyond typical
iommu overhead.
Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
On Fri, Oct 13, 2017 at 10:01:04AM -0700, Dan Williams wrote:
> On Fri, Oct 13, 2017 at 9:38 AM, Jason Gunthorpe wrote:
> > On Fri, Oct 13, 2017 at 08:14:55AM -0700, Dan Williams wrote:
> >
> >> scheme specific to RDMA which seems like a waste to me when we can
> >> generically signal an event on the fd for any event that affects any
> >> of the vma's on the file. The FL_LAYOUT lease impacts the entire file,
> >> so as far as I can see delaying the notification until MR-init is too
> >> late, too granular, and too RDMA specific.
> >
> > But for RDMA a FD is not what we care about - we want the MR handle so
> > the app knows which MR needs fixing.
>
> I'd rather put the onus on userspace to remember where it used a
> MAP_DIRECT mapping and be aware that all the mappings of that file are
> subject to a lease break. Sure, we could build up a pile of kernel
> infrastructure to notify on a per-MR basis, but I think that would
> only be worth it if leases were range based. As it is, the entire file
> is covered by a lease instance and all MRs that might reference that
> file get one notification. That said, we can always arrange for a
> per-driver callback at lease-break time so that it can do something
> above and beyond the default notification.

I don't think that really represents how lots of apps actually use
RDMA.

RDMA is often buried down in the software stack (eg in a MPI), and by
the time a mapping gets used for RDMA transfer the link between the
FD, mmap and the MR is totally opaque.

Having a MR specific notification means the low level RDMA libraries
have a chance to deal with everything for the app.

Eg consider a HPC app using MPI that uses some DAX aware library to
get DAX backed mmap's. It then passes memory in those mmaps to the
MPI library to do transfers. The MPI creates the MR on demand.

So, who should be responsible for MR coherency? Today we say the MPI
is responsible. But we can't really expect the MPI
to hook SIGIO and somehow try to reverse engineer what MRs are
impacted from a FD that may not even still be open.

I think, if you want to build a uAPI for notification of MR lease
break, then you need show how it fits into the above software model:
 - How it can be hidden in a RDMA specific library
 - How lease break can be done hitlessly, so the library user never
   needs to know it is happening or see failed/missed transfers
 - Whatever fast path checking is needed does not kill performance

Jason
Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
On Fri, Oct 13, 2017 at 9:38 AM, Jason Gunthorpe wrote:
> On Fri, Oct 13, 2017 at 08:14:55AM -0700, Dan Williams wrote:
>
>> scheme specific to RDMA which seems like a waste to me when we can
>> generically signal an event on the fd for any event that affects any
>> of the vma's on the file. The FL_LAYOUT lease impacts the entire file,
>> so as far as I can see delaying the notification until MR-init is too
>> late, too granular, and too RDMA specific.
>
> But for RDMA a FD is not what we care about - we want the MR handle so
> the app knows which MR needs fixing.

I'd rather put the onus on userspace to remember where it used a
MAP_DIRECT mapping and be aware that all the mappings of that file are
subject to a lease break. Sure, we could build up a pile of kernel
infrastructure to notify on a per-MR basis, but I think that would
only be worth it if leases were range based. As it is, the entire file
is covered by a lease instance and all MRs that might reference that
file get one notification. That said, we can always arrange for a
per-driver callback at lease-break time so that it can do something
above and beyond the default notification.
Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
On Fri, Oct 13, 2017 at 08:14:55AM -0700, Dan Williams wrote:

> scheme specific to RDMA which seems like a waste to me when we can
> generically signal an event on the fd for any event that affects any
> of the vma's on the file. The FL_LAYOUT lease impacts the entire file,
> so as far as I can see delaying the notification until MR-init is too
> late, too granular, and too RDMA specific.

But for RDMA a FD is not what we care about - we want the MR handle so
the app knows which MR needs fixing.

Jason
Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
On Thu, Oct 12, 2017 at 11:57 PM, Christoph Hellwig wrote:
> On Thu, Oct 12, 2017 at 10:41:39AM -0700, Dan Williams wrote:
>> So, you're jumping into this review at v9 where I've split the patches
>> that take an initial MAP_DIRECT lease out from the patches that take
>> FL_LAYOUT leases at memory registration time. You can see a previous
>> attempt in "[PATCH v8 00/14] MAP_DIRECT for DAX RDMA and userspace
>> flush" which should be in your inbox.
>
> The point is that your problem has absolutely nothing to do with mmap,
> and all with get_user_pages.
>
> get_user_pages on DAX doesn't give the same guarantees as on pagecache
> or anonymous memory, and that is the problem we need to fix. In fact
> I'm pretty sure if we try hard enough (and we might have to try
> very hard) we can see the same problem with plain direct I/O and without
> any RDMA involved, e.g. do a larger direct I/O write to memory that is
> mmap()ed from a DAX file, then truncate the DAX file and reallocate
> the blocks, and we might corrupt that new file. We'll probably need
> a special setup where there is little other chance but to reallocate
> those used blocks.

I'll take a harder look at this...

> So what we need to do first is to fix get_user_pages vs unmapping
> DAX mmap()ed blocks, be that from a hole punch, truncate, COW
> operation, etc.
>
> Then we need to look into the special case of a long-living non-transient
> get_user_pages that RDMA does - we can't just reject any truncate or
> other operation for that, so that's where something like my layout
> lease suggestion comes into play - but the call that should get the
> lease is not the mmap - it's the memory registration call that does
> the get_user_pages.

Yes, mmap is not the place to get the lease for a later
get_user_pages, and my patches do take an additional lease at
get_user_pages / MR init time. However, the mmap call has the
file-descriptor for SIGIO; the MR-init call does not. If we delay all
of the setup to MR time then we need to invent a notification
scheme specific to RDMA which seems like a waste to me when we can
generically signal an event on the fd for any event that affects any
of the vma's on the file. The FL_LAYOUT lease impacts the entire file,
so as far as I can see delaying the notification until MR-init is too
late, too granular, and too RDMA specific.
Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
On Thu, Oct 12, 2017 at 10:41:39AM -0700, Dan Williams wrote:
> So, you're jumping into this review at v9 where I've split the patches
> that take an initial MAP_DIRECT lease out from the patches that take
> FL_LAYOUT leases at memory registration time. You can see a previous
> attempt in "[PATCH v8 00/14] MAP_DIRECT for DAX RDMA and userspace
> flush" which should be in your inbox.

The point is that your problem has absolutely nothing to do with mmap,
and all with get_user_pages.

get_user_pages on DAX doesn't give the same guarantees as on pagecache
or anonymous memory, and that is the problem we need to fix. In fact
I'm pretty sure if we try hard enough (and we might have to try
very hard) we can see the same problem with plain direct I/O and without
any RDMA involved, e.g. do a larger direct I/O write to memory that is
mmap()ed from a DAX file, then truncate the DAX file and reallocate
the blocks, and we might corrupt that new file. We'll probably need
a special setup where there is little other chance but to reallocate
those used blocks.

So what we need to do first is to fix get_user_pages vs unmapping
DAX mmap()ed blocks, be that from a hole punch, truncate, COW
operation, etc.

Then we need to look into the special case of a long-living non-transient
get_user_pages that RDMA does - we can't just reject any truncate or
other operation for that, so that's where something like my layout
lease suggestion comes into play - but the call that should get the
lease is not the mmap - it's the memory registration call that does
the get_user_pages.
Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
On Thu, Oct 12, 2017 at 7:23 AM, Christoph Hellwig wrote:
> Sorry for chiming in so late, been extremely busy lately.
>
> From quickly glancing over what the now finally described use case is
> (which contradicts the subject btw - it's not about flushing, it's
> about not removing block mapping under a MR) and the previous comments
> I think that mmap is simply the wrong kind of interface for this.
>
> What we want is support for a new kind of userspace memory
> registration in the RDMA code that uses the pnfs export interface,
> both getting the block (or rather byte in this case) mapping, and also
> gets the FL_LAYOUT lease for the memory registration.
>
> That btw is exactly what I do for the pNFS RDMA layout, just in-kernel.

...and this is exactly my plan.

So, you're jumping into this review at v9 where I've split the patches
that take an initial MAP_DIRECT lease out from the patches that take
FL_LAYOUT leases at memory registration time. You can see a previous
attempt in "[PATCH v8 00/14] MAP_DIRECT for DAX RDMA and userspace
flush" which should be in your inbox.

I'm not proposing mmap as the memory registration interface, it's the
"register for notification of lease break" interface. Here's my
proposed sequence:

    addr = mmap(..., MAP_DIRECT.., fd);
    <- register a vma for "direct" memory registrations with an
       FL_LAYOUT lease that at a lease break event sends SIGIO on the
       fd used for mmap.

    ibv_reg_mr(..., addr, ...);
    <- check for a valid MAP_DIRECT vma, and take out another FL_LAYOUT
       lease. This lease force revokes the RDMA mapping when it
       expires, and it relies on the process receiving SIGIO as the
       'break' notification.

    fallocate(fd, PUNCH_HOLE...)
    <- breaks all the FL_LAYOUT leases, the vma owner gets notified by
       fd.

Al rightly points out that the fd may be closed by the time the event
fires since the lease follows the vma lifetime. I see two ways to
solve this: document that the process may get notifications on a stale
fd if close() happens before munmap(), or, similar to how we call
locks_remove_posix() in filp_close(), add a routine to disable any
lease notifiers on close(). I'll investigate the second option because
this seems to be a general problem with leases.

For RDMA I am presently re-working the implementation [1]. Inspired by
a discussion with Jason [2], I am going to add something like
ib_umem_ops to allow drivers to override the default policy of what
happens on a lease that expires. The default action is to invalidate
device access to the memory with iommu_unmap(), but I want to allow
for drivers to do something smarter or choose to not support DAX
mappings at all.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-October/012785.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2017-October/012793.html
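The "something like ib_umem_ops" idea, a default lease-expiry action that a driver may override, can be pictured as an ordinary ops table. The sketch below is a userspace model under stated assumptions, not the kernel code from the referenced patches: every `sketch_*` name is invented for illustration, and clearing `device_access` merely stands in for the `iommu_unmap()` default described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_umem;

/* Hypothetical per-driver hook table for lease expiry. */
struct sketch_umem_ops {
    int (*lease_break)(struct sketch_umem *umem);
};

/* Hypothetical model of a pinned user memory region behind an MR. */
struct sketch_umem {
    bool device_access;                 /* device can still DMA here */
    unsigned remaps;                    /* times a driver re-pointed it */
    const struct sketch_umem_ops *ops;  /* NULL means default policy */
};

/* Default policy: revoke device access (stand-in for iommu_unmap()). */
int sketch_default_lease_break(struct sketch_umem *umem)
{
    umem->device_access = false;
    return 0;
}

/* Example override: a smarter driver re-points its own translation. */
int sketch_smart_lease_break(struct sketch_umem *umem)
{
    umem->remaps++;
    return 0;
}

const struct sketch_umem_ops sketch_smart_ops = {
    .lease_break = sketch_smart_lease_break,
};

/* Called when the lease expires: use the override if one is set. */
int sketch_umem_lease_break(struct sketch_umem *umem)
{
    assert(umem != NULL);
    if (umem->ops && umem->ops->lease_break)
        return umem->ops->lease_break(umem);
    return sketch_default_lease_break(umem);
}
```

A driver that chooses not to support DAX mappings at all would, in this model, refuse at registration time rather than supply a `lease_break` hook.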
Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
Sorry for chiming in so late, been extremely busy lately.

From quickly glancing over what the now finally described use case is
(which contradicts the subject btw - it's not about flushing, it's
about not removing block mapping under a MR) and the previous comments
I think that mmap is simply the wrong kind of interface for this.

What we want is support for a new kind of userspace memory
registration in the RDMA code that uses the pnfs export interface,
both getting the block (or rather byte in this case) mapping, and also
gets the FL_LAYOUT lease for the memory registration.

That btw is exactly what I do for the pNFS RDMA layout, just in-kernel.
[PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
Changes since v8 [1]:
* Move MAP_SHARED_VALIDATE definition next to MAP_SHARED in all arch
  headers (Jan)

* Include xfs_layout.h directly in all the files that call
  xfs_break_layouts() (Dave)

* Clarify / add more comments to the MAP_DIRECT checks at fault time
  (Dave)

* Rename iomap_can_allocate() to break_layouts_nowait() to make it
  plain the reason we are bailing out of iomap_begin.

* Defer the lease_direct mechanism and RDMA core changes to a later
  patch series.

* EXT4 support is in the works and will be rebased on Jan's MAP_SYNC
  patches.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-October/012772.html

---

MAP_DIRECT is a mechanism that allows an application to establish a
mapping where the kernel will not change the block-map, or otherwise
dirty the block-map metadata of a file without notification. It
supports a "flush from userspace" model where persistent memory
applications can bypass the overhead of ongoing coordination of writes
with the filesystem, and it provides safety to RDMA operations
involving DAX mappings.

The kernel always has the ability to revoke access and convert the file
back to normal operation after performing a "lease break". Similar to
fcntl leases, there is no way for userspace to cancel the lease break
process once it has started; it can only delay it via the
/proc/sys/fs/lease-break-time setting.

MAP_DIRECT enables XFS to supplant the device-dax interface for
mmap-write access to persistent memory with no ongoing coordination
with the filesystem via fsync/msync syscalls.

The MAP_DIRECT mechanism is complementary to MAP_SYNC. Here are some
scenarios where you would choose one over the other:

* 3rd party DMA / RDMA to DAX with hardware that does not support
  on-demand paging (shared virtual memory) => MAP_DIRECT

* Support for reflinked inodes, fallocate-punch-hole, truncate, or any
  other operation that mutates the block map of an actively mapped
  file => MAP_SYNC

* Userspace flush => MAP_SYNC or MAP_DIRECT

* Assurances that the file's block map metadata is stable, i.e.
  minimize worst case fault latency by locking out updates =>
  MAP_DIRECT

---

Dan Williams (6):
      mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
      fs, mm: pass fd to ->mmap_validate()
      fs: MAP_DIRECT core
      xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT
      fs, xfs, iomap: introduce break_layout_nowait()
      xfs: wire up MAP_DIRECT

 arch/alpha/include/uapi/asm/mman.h           |   1
 arch/mips/include/uapi/asm/mman.h            |   1
 arch/mips/kernel/vdso.c                      |   2
 arch/parisc/include/uapi/asm/mman.h          |   1
 arch/tile/mm/elf.c                           |   3
 arch/x86/mm/mpx.c                            |   3
 arch/xtensa/include/uapi/asm/mman.h          |   1
 fs/Kconfig                                   |   1
 fs/Makefile                                  |   2
 fs/aio.c                                     |   2
 fs/mapdirect.c                               | 237 ++
 fs/xfs/Kconfig                               |   4
 fs/xfs/Makefile                              |   1
 fs/xfs/xfs_file.c                            | 108
 fs/xfs/xfs_ioctl.c                           |   1
 fs/xfs/xfs_iomap.c                           |   3
 fs/xfs/xfs_iops.c                            |   1
 fs/xfs/xfs_layout.c                          |  45 +
 fs/xfs/xfs_layout.h                          |  13 +
 fs/xfs/xfs_pnfs.c                            |  31 ---
 fs/xfs/xfs_pnfs.h                            |   8 -
 include/linux/fs.h                           |  11 +
 include/linux/mapdirect.h                    |  40
 include/linux/mm.h                           |   9 +
 include/linux/mman.h                         |  42 +
 include/uapi/asm-generic/mman-common.h       |   1
 include/uapi/asm-generic/mman.h              |   1
 ipc/shm.c                                    |   3
 mm/internal.h                                |   2
 mm/mmap.c                                    |  28 ++-
 mm/nommu.c                                   |   5 -
 mm/util.c                                    |   7 -
 tools/include/uapi/asm-generic/mman-common.h |   1
 33 files changed, 557 insertions(+), 62 deletions(-)
 create mode 100644 fs/mapdirect.c
 create mode 100644 fs/xfs/xfs_layout.c
 create mode 100644 fs/xfs/xfs_layout.h
 create mode 100644 include/linux/mapdirect.h
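Since the cover letter notes that userspace can only delay a lease
break via /proc/sys/fs/lease-break-time, an application may want to
know how long it has to react once a break starts. A minimal sketch of
reading that setting follows; the helper names
parse_lease_break_time() and read_lease_break_time() are made up for
illustration, and the sketch assumes the file holds a single
non-negative decimal number of seconds.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse the text form of the sysctl value.
 * Returns the number of seconds, or -1 if the text is not a
 * non-negative decimal number.
 */
long parse_lease_break_time(const char *text)
{
    char *end;
    long secs;

    assert(text != NULL);
    errno = 0;
    secs = strtol(text, &end, 10);
    if (errno != 0 || end == text || secs < 0)
        return -1;
    return secs;
}

/*
 * Hypothetical helper: read the setting from the given path, normally
 * "/proc/sys/fs/lease-break-time". Returns seconds, or -1 on error.
 */
long read_lease_break_time(const char *path)
{
    char buf[32];
    char *line;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    line = fgets(buf, sizeof(buf), fp);
    fclose(fp);
    return line != NULL ? parse_lease_break_time(buf) : -1;
}
```

A caller would pass "/proc/sys/fs/lease-break-time" and treat the
result as an upper bound on how long it may keep using the mapping
after the break notification; a return of -1 means the value could not
be determined.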