Re: Enabling peer to peer device transactions for PCIe devices
On 23/11/16 02:55 PM, Jason Gunthorpe wrote:
>>> Only ODP hardware allows changing the DMA address on the fly, and it
>>> works at the page table level. We do not need special handling for
>>> RDMA.
>>
>> I am aware of ODP but, as noted by others, it doesn't provide a general
>> solution to the points above.
>
> How do you mean?

I was only saying it wasn't general in that it wouldn't work for IB hardware that doesn't support ODP, or for other hardware that doesn't do similar things (like an NVMe drive).

It makes sense for hardware that supports ODP to allow MRs that don't pin the underlying memory and to provide for migrations that the hardware can follow. But most DMA engines will require the memory to be pinned, and any complex allocators (GPU or otherwise) should respect that. That seems like it should be the default way most of this works -- and I think it wouldn't actually take too much effort to make it all work now as is. (Our iopmem work is actually quite small and simple.)

>> It's also worth noting that #4 makes use of ZONE_DEVICE (#2) so they are
>> really the same option. iopmem is really just one way to get BAR
>> addresses to user-space while inside the kernel it's ZONE_DEVICE.
>
> Seems fine for RDMA?

Yeah, we've had RDMA and O_DIRECT transfers to PCIe-backed ZONE_DEVICE memory working for some time. I'd say it's a good fit. The main question we've had is how to expose PCIe BARs to userspace to be used as MRs and such.

Logan

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: Enabling peer to peer device transactions for PCIe devices
On Wed, Nov 23, 2016 at 02:11:29PM -0700, Logan Gunthorpe wrote:
> Perhaps I am not following what Serguei is asking for, but I
> understood the desire was for a complex GPU allocator that could
> migrate pages between GPU and CPU memory under control of the GPU
> driver, among other things. The desire is for DMA to continue to work
> even after these migrations happen.

The main issue is how to solve the use cases where p2p is requested/initiated via CPU pointers and such pointers could point to a non-system memory location, e.g. VRAM.

It would provide a consistent working model where the user deals only with pointers (HSA, CUDA, OpenCL 2.0 SVM), as well as a performance optimization that avoids double-buffering and extra special-case code when dealing with PCIe device memory.

Examples are:

- RDMA network operations: RDMA MRs where the registered memory could be
  e.g. VRAM. Currently this is solved using the so-called PeerDirect
  interface, which is out-of-tree and provided as part of OFED.
- File operations (fread/fwrite) where the user wants to transfer file
  data directly to/from e.g. VRAM.

Challenges are:

- Because the graphics sub-system must support overcommit (at least each
  application/process should independently see all resources), ideally
  such memory should be movable without changing the CPU pointer value,
  as well as "paged out", supporting a "page fault" at least on access
  from the CPU.
- We must co-exist with the existing DRM infrastructure, as well as
  support sharing VRAM between different processes.
- We should be able to deal with large allocations: tens or hundreds of
  MBs, maybe GBs.
- We may have PCIe devices where p2p does not work.
- Potentially any GPU memory should be supported, including memory
  carved out from system RAM (e.g. allocated via get_free_pages()).

Note:

- In the RDMA MR case the life-span of the "pinning"
  (get_user_pages()/put_page()) may be defined/controlled by the
  application, not the kernel, which maybe should be treated as a
  special case.
The original proposal was to create "struct pages" for VRAM to allow get_user_pages() to work transparently, similar to how it is/was done for the "DAX device" case. Unfortunately, based on my understanding, the "DAX device" implementation deals only with permanently "locked" memory (fixed location) unrelated to the get_user_pages()/put_page() scope, which doesn't satisfy the requirement of "eviction"/"moving" of memory while keeping the CPU address intact.

> The desire is for DMA to continue to work
> even after these migrations happen

At least some kind of MM notifier callback to inform about the change in location (pre- and post-), similar to how it is done for system pages. My understanding is that it will not solve the RDMA MR issue, where the "lock" could last for the whole application lifetime, but (a) it will not make the RDMA MR case worse and (b) it should be enough for all other cases where get_user_pages()/put_page() is controlled by the kernel.
Re: [PATCH] x86: fix kaslr and memmap collision
On Tue, Nov 22, 2016 at 11:01:32AM -0800, Dan Williams wrote:
> On Tue, Nov 22, 2016 at 10:54 AM, Kees Cook wrote:
> > On Tue, Nov 22, 2016 at 9:26 AM, Dan Williams wrote:
> >> No, you're right, we need to handle multiple ranges. Since the
> >> mem_avoid array is statically allocated perhaps we can handle up to 4
> >> memmap= entries, but past that point disable kaslr for that boot?
> >
> > Yeah, that seems fine to me. I assume it's rare to have 4?
>
> It should be rare to have *one* since ACPI 6.0 added support for
> communicating persistent memory ranges. However there are legacy
> nvdimm users that I know are doing at least 2, but I have a hard time
> imagining they would ever do more than 4.

I doubt it's rare amongst the people using RAM to emulate pmem for filesystem testing purposes. My "pmem" test VM always has at least 2 ranges set to give me two discrete pmem devices, and I have used 4 from time to time to do things like test multi-volume scratch XFS filesystems in xfstests (i.e. data, log and realtime volumes) so I didn't need to play games with partitioning or DM...

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
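As a concrete illustration of the multi-range setup Dave describes, a kernel command line carving four discrete emulated pmem devices out of RAM might look like this (the sizes and start addresses below are made up for the example and must not overlap real reservations on a given machine):

```
# memmap=<size>!<start> marks a RAM range as persistent memory.
# Four ranges => four devices, /dev/pmem0 .. /dev/pmem3.
memmap=2G!16G memmap=2G!18G memmap=2G!20G memmap=2G!22G
```

It is exactly this kind of boot that produces the 4-entry mem_avoid case being discussed for KASLR.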
Re: Enabling peer to peer device transactions for PCIe devices
On Wed, Nov 23, 2016 at 02:42:12PM -0800, Dan Williams wrote:
> > The crucial part for this discussion is the ability to fence and block
> > DMA for a specific range. This is the hardware capability that lets
> > page migration happen: fence DMA, migrate page, update page
> > table in HCA, unblock DMA.
>
> Wait, ODP requires migratable pages, ZONE_DEVICE pages are not
> migratable.

Does it? I didn't think so.. Does ZONE_DEVICE break MMU notifiers/etc or something? There is certainly nothing about the hardware that cares about ZONE_DEVICE vs system memory.

I used 'migration' in the broader sense of doing any transformation to the page such that the DMA address changes - not the specific kernel MM process...

> You can't replace a PCIe mapping with just any other System RAM
> physical address, right?

I thought that was exactly what HMM was trying to do? Migrate pages between CPU and GPU memory as needed. As Serguei has said, this process needs to be driven by the GPU driver.

The peer-peer issue is how you do that while RDMA is possible on those pages, because when the page migrates to GPU memory you want the RDMA to follow it seamlessly.

This is why page table mirroring is the best solution - use the existing MM machinery to link the DMA driver and whatever is controlling the VMA.

> At least not without a filesystem recording where things went, but
> at that point we're no longer talking about the base P2P-DMA mapping

In the filesystem/DAX case, it would be the filesystem that initiates any change in the page's physical address. ODP *follows* changes in the VMA; it does not cause any change in address mapping. That has to be done by whoever is in charge of the VMA.

> something like pnfs-rdma to a DAX filesystem.

Something in the kernel (i.e. nfs-rdma) would be entirely different. We generally don't do long-lived mappings in the kernel for RDMA (certainly not for NFS), so it is much more like your basic everyday DMA operation: map, execute, unmap.
We probably don't need to use page table mirroring for this.

ODP comes in when userspace mmaps a DAX file and then tries to use it for RDMA. Page table mirroring lets the DAX filesystem decide to move the backing pages at any time. When it wants to do that, it interacts with the MM in the usual way, which links to ODP and makes sure the migration is seamless.

Jason
Re: [PATCH] ndctl: introduce 4k allocation support for creating namespace
Some needed changes I noticed while trying to take this onto the 'pending' branch.

On Mon, Oct 24, 2016 at 4:21 PM, Dave Jiang wrote:
> Existing implementation defaults all pages allocated as 2M superpages.
> For the nfit_test dax device we need 4k pages allocated to work properly.
> Add an --align,-a option to provide the alignment. It will accept
> 4k and 2M at the moment as valid parameters. No -a option still defaults
> to 2M.
>
> Signed-off-by: Dave Jiang
> ---
>  ndctl/builtin-xaction-namespace.c | 22 --
>  util/size.h                       |  1 +
>  2 files changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/ndctl/builtin-xaction-namespace.c b/ndctl/builtin-xaction-namespace.c
> index 9b1702d..89ce6ce 100644
> --- a/ndctl/builtin-xaction-namespace.c
> +++ b/ndctl/builtin-xaction-namespace.c
> @@ -49,6 +49,7 @@ static struct parameters {
>  	const char *region;
>  	const char *reconfig;
>  	const char *sector_size;
> +	const char *align;
>  } param;
>
>  void builtin_xaction_namespace_reset(void)
> @@ -71,6 +72,7 @@ struct parsed_parameters {
>  	enum ndctl_namespace_mode mode;
>  	unsigned long long size;
>  	unsigned long sector_size;
> +	unsigned long align;
>  };
>
>  #define debug(fmt, ...) \
> @@ -104,6 +106,8 @@
>  OPT_STRING('l', "sector-size", _size, "lba-size", \
>  	"specify the logical sector size in bytes"), \
>  OPT_STRING('t', "type", , "type", \
>  	"specify the type of namespace to create 'pmem' or 'blk'"), \
> +OPT_STRING('a', "align", , "align", \
> +	"specify the namespace alignment in bytes (default: 0x20 (2M))"), \
>  OPT_BOOLEAN('f', "force", , "reconfigure namespace even if currently active")
>
>  static const struct option base_options[] = {
> @@ -319,7 +323,7 @@ static int setup_namespace(struct ndctl_region *region,
>
>  		try(ndctl_pfn, set_uuid, pfn, uuid);
>  		try(ndctl_pfn, set_location, pfn, p->loc);
> -		try(ndctl_pfn, set_align, pfn, SZ_2M);
> +		try(ndctl_pfn, set_align, pfn, p->align);

This will now collide with the new "ndctl_pfn_has_align()" check that got added to fix support for pre-4.5 kernels.

>  		try(ndctl_pfn, set_namespace, pfn, ndns);
>  		rc = ndctl_pfn_enable(pfn);
>  	} else if (p->mode == NDCTL_NS_MODE_DAX) {
> @@ -327,7 +331,7 @@ static int setup_namespace(struct ndctl_region *region,
>
>  		try(ndctl_dax, set_uuid, dax, uuid);
>  		try(ndctl_dax, set_location, dax, p->loc);
> -		try(ndctl_dax, set_align, dax, SZ_2M);
> +		try(ndctl_dax, set_align, dax, p->align);
>  		try(ndctl_dax, set_namespace, dax, ndns);
>  		rc = ndctl_dax_enable(dax);
>  	} else if (p->mode == NDCTL_NS_MODE_SAFE) {
> @@ -383,6 +387,20 @@ static int validate_namespace_options(struct ndctl_region *region,
>
>  	memset(p, 0, sizeof(*p));
>
> +	if (param.align) {
> +		p->align = parse_size64(param.align);
> +		switch (p->align) {
> +		case SZ_4K:
> +		case SZ_2M:
> +			break;
> +		case SZ_1G: /* unsupported yet... */
> +		default:
> +			debug("%s: invalid align\n", __func__);
> +			return -EINVAL;
> +		}
> +	} else
> +		p->align = SZ_2M;
> +

I think this check should come after we have determined that the mode is either "memory" or "dax", and error out otherwise.

Also, when the alignment is not 2M, we should check that the kernel has alignment-setting support with the new ndctl_pfn_has_align() api. Note that kernels that support device-dax implicitly support the alignment property.
Re: Enabling peer to peer device transactions for PCIe devices
On Wed, Nov 23, 2016 at 02:11:29PM -0700, Logan Gunthorpe wrote:
> > As I said, there is no possible special handling. Standard IB hardware
> > does not support changing the DMA address once a MR is created. Forget
> > about doing that.
>
> Yeah, that's essentially the point I was trying to make. Not to mention
> all the other unrelated hardware that can't DMA to an address that might
> disappear mid-transfer.

Right, it is impossible to ask for generic page migration with ongoing DMA. That is simply not supported by any of the hardware at all.

> > Only ODP hardware allows changing the DMA address on the fly, and it
> > works at the page table level. We do not need special handling for
> > RDMA.
>
> I am aware of ODP but, as noted by others, it doesn't provide a general
> solution to the points above.

How do you mean?

Perhaps I am not following what Serguei is asking for, but I understood the desire was for a complex GPU allocator that could migrate pages between GPU and CPU memory under control of the GPU driver, among other things. The desire is for DMA to continue to work even after these migrations happen.

Page table mirroring *is* the general solution for this problem. The GPU driver controls the VMA and the DMA driver mirrors that VMA.

Do you know of another option that doesn't just degenerate to page table mirroring?

Remember, there are two facets to the RDMA ODP implementation; I feel there is some confusion here.

The crucial part for this discussion is the ability to fence and block DMA for a specific range. This is the hardware capability that lets page migration happen: fence DMA, migrate page, update page table in HCA, unblock DMA.

Without that hardware support the DMA address must be unchanging, and there is nothing we can do about it. This is why standard IB hardware must have fixed MRs - it lacks the fence capability.

The other part is the page faulting implementation, but that is not required, and, to Serguei's point, is not desired for GPUs anyhow.
> > To me this means at least items #1 and #3 should be removed from
> > Alexander's list.
>
> It's also worth noting that #4 makes use of ZONE_DEVICE (#2) so they are
> really the same option. iopmem is really just one way to get BAR
> addresses to user-space while inside the kernel it's ZONE_DEVICE.

Seems fine for RDMA?

Didn't we just strike off everything on the list except #2? :\

Jason
Re: Enabling peer to peer device transactions for PCIe devices
On 23/11/16 01:33 PM, Jason Gunthorpe wrote:
> On Wed, Nov 23, 2016 at 02:58:38PM -0500, Serguei Sagalovitch wrote:
>
>> We do not want to have "highly" dynamic translation due to
>> performance cost. We need to support "overcommit" but would
>> like to minimize the impact. To support RDMA MRs for GPU/VRAM/PCIe
>> device memory (which is a must) we need to either globally force
>> pinning for the scope of get_user_pages()/put_page() or have
>> special handling for RDMA MRs and similar cases.
>
> As I said, there is no possible special handling. Standard IB hardware
> does not support changing the DMA address once a MR is created. Forget
> about doing that.

Yeah, that's essentially the point I was trying to make. Not to mention all the other unrelated hardware that can't DMA to an address that might disappear mid-transfer.

> Only ODP hardware allows changing the DMA address on the fly, and it
> works at the page table level. We do not need special handling for
> RDMA.

I am aware of ODP but, as noted by others, it doesn't provide a general solution to the points above.

> Like I said, this is the direction the industry seems to be moving in,
> so any solution here should focus on VMAs/page tables as the way to link
> the peer-peer devices.

Yes, this was the appeal to us of using ZONE_DEVICE.

> To me this means at least items #1 and #3 should be removed from
> Alexander's list.

It's also worth noting that #4 makes use of ZONE_DEVICE (#2) so they are really the same option. iopmem is really just one way to get BAR addresses to user-space while inside the kernel it's ZONE_DEVICE.

Logan
Re: Enabling peer to peer device transactions for PCIe devices
On Wed, Nov 23, 2016 at 02:58:38PM -0500, Serguei Sagalovitch wrote:
> We do not want to have "highly" dynamic translation due to
> performance cost. We need to support "overcommit" but would
> like to minimize the impact. To support RDMA MRs for GPU/VRAM/PCIe
> device memory (which is a must) we need to either globally force
> pinning for the scope of get_user_pages()/put_page() or have
> special handling for RDMA MRs and similar cases.

As I said, there is no possible special handling. Standard IB hardware does not support changing the DMA address once a MR is created. Forget about doing that.

Only ODP hardware allows changing the DMA address on the fly, and it works at the page table level. We do not need special handling for RDMA.

> Generally it could be difficult to correctly handle "DMA in
> progress" due to the facts that (a) DMA could originate from
> numerous PCIe devices simultaneously, including requests to
> receive network data.

We handle all of this today in the kernel via the page pinning mechanism. This needs to be copied into the peer-peer memory and GPU memory schemes as well. A pinned page means the DMA address cannot be changed and there is active non-CPU access to it. Any hardware that does not support page table mirroring must go this route.

> (b) in the HSA case DMA could originate from user space without kernel
> driver knowledge. So without corresponding h/w support
> everywhere I do not see how it could be solved effectively.

All true user-triggered DMA must go through some kind of coherent page table mirroring scheme (e.g. this is what CAPI does; presumably AMD's HSA is similar). A page table mirroring scheme is basically the same as what ODP does.

Like I said, this is the direction the industry seems to be moving in, so any solution here should focus on VMAs/page tables as the way to link the peer-peer devices.

To me this means at least items #1 and #3 should be removed from Alexander's list.
Jason
Re: Enabling peer to peer device transactions for PCIe devices
On 2016-11-23 02:32 PM, Jason Gunthorpe wrote:
> On Wed, Nov 23, 2016 at 02:14:40PM -0500, Serguei Sagalovitch wrote:
>> On 2016-11-23 02:05 PM, Jason Gunthorpe wrote:
>>> As Bart says, it would be best to be combined with something like
>>> Mellanox's ODP MRs, which allows a page to be evicted and then trigger
>>> a CPU interrupt if a DMA is attempted so it can be brought back.
>> Please note that in the general case (including the MR one) we could
>> have a "page fault" from a different PCIe device. So all PCIe devices
>> must be synchronized.
> Standard RDMA MRs require pinned pages, the DMA address cannot change
> while the MR exists (there is no hardware support for this at all), so
> page faulting from any other device is out of the question while they
> exist. This is the same requirement as typical simple driver DMA which
> requires pages pinned until the simple device completes DMA.
>
> ODP RDMA MRs do not require that, they just page fault like the CPU or
> really anything and the kernel has to make sense of concurrent page
> faults from multiple sources.
>
> The upshot is that GPU scenarios that rely on highly dynamic
> virtual->physical translation cannot sanely be combined with standard
> long-life RDMA MRs.

We do not want to have "highly" dynamic translation due to performance cost. We need to support "overcommit" but would like to minimize the impact. To support RDMA MRs for GPU/VRAM/PCIe device memory (which is a must) we need to either globally force pinning for the scope of get_user_pages()/put_page() or have special handling for RDMA MRs and similar cases.

Generally it could be difficult to correctly handle "DMA in progress" due to the facts that (a) DMA could originate from numerous PCIe devices simultaneously, including requests to receive network data, and (b) in the HSA case DMA could originate from user space without kernel driver knowledge. So without corresponding h/w support everywhere I do not see how it could be solved effectively.

> Certainly, any solution for GPUs must follow the typical page pinning
> semantics, changing the DMA address of a page must be blocked while any
> DMA is in progress.
>
>>> Does HMM solve the peer-peer problem? Does it do it generically or
>>> only for drivers that are mirroring translation tables?
>> In its current form HMM doesn't solve the peer-peer problem. Currently
>> it allows "mirroring" of "malloc" memory on the GPU, which is not
>> always what is needed. Additionally there is a need to be able to share
>> VRAM allocations between different processes.
> Humm, so it can be removed from Alexander's list then :\

HMM is very useful for some types of scenarios, and it could significantly simplify (for performance) implementations of some features, e.g. OpenCL 2.0 SVM.

> As Dan suggested, maybe we need to do both. Some kind of fix for
> get_user_pages() for smaller mappings (e.g. ZONE_DEVICE) and a
> mandatory API conversion to get_user_dma_sg() for other cases?
>
> Jason

Sincerely yours,
Serguei Sagalovitch
Re: Enabling peer to peer device transactions for PCIe devices
On Wed, Nov 23, 2016 at 02:14:40PM -0500, Serguei Sagalovitch wrote:
> On 2016-11-23 02:05 PM, Jason Gunthorpe wrote:
> > As Bart says, it would be best to be combined with something like
> > Mellanox's ODP MRs, which allows a page to be evicted and then trigger
> > a CPU interrupt if a DMA is attempted so it can be brought back.
>
> Please note that in the general case (including the MR one) we could
> have a "page fault" from a different PCIe device. So all PCIe devices
> must be synchronized.

Standard RDMA MRs require pinned pages; the DMA address cannot change while the MR exists (there is no hardware support for this at all), so page faulting from any other device is out of the question while they exist. This is the same requirement as typical simple driver DMA, which requires pages pinned until the simple device completes DMA.

ODP RDMA MRs do not require that; they just page fault like the CPU or really anything, and the kernel has to make sense of concurrent page faults from multiple sources.

The upshot is that GPU scenarios that rely on highly dynamic virtual->physical translation cannot sanely be combined with standard long-life RDMA MRs.

Certainly, any solution for GPUs must follow the typical page pinning semantics: changing the DMA address of a page must be blocked while any DMA is in progress.

> > Does HMM solve the peer-peer problem? Does it do it generically or
> > only for drivers that are mirroring translation tables?
>
> In its current form HMM doesn't solve the peer-peer problem. Currently
> it allows "mirroring" of "malloc" memory on the GPU, which is not
> always what is needed. Additionally there is a need to be able to share
> VRAM allocations between different processes.

Humm, so it can be removed from Alexander's list then :\

As Dan suggested, maybe we need to do both. Some kind of fix for get_user_pages() for smaller mappings (e.g. ZONE_DEVICE) and a mandatory API conversion to get_user_dma_sg() for other cases?
Jason
Re: Enabling peer to peer device transactions for PCIe devices
On 2016-11-23 03:51 AM, Christian König wrote:
> Am 23.11.2016 um 08:49 schrieb Daniel Vetter:
>> On Tue, Nov 22, 2016 at 01:21:03PM -0800, Dan Williams wrote:
>>> On Tue, Nov 22, 2016 at 1:03 PM, Daniel Vetter wrote:
>>>> On Tue, Nov 22, 2016 at 9:35 PM, Serguei Sagalovitch wrote:
>>>>> On 2016-11-22 03:10 PM, Daniel Vetter wrote:
>>>>>> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams wrote:
>>>>>>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch wrote:
>>>>>>>> I personally like the "device-DAX" idea but my concerns are:
>>>>>>>> - How well will it co-exist with the DRM infrastructure /
>>>>>>>>   implementations in the part dealing with CPU pointers?
>>>>>>> Inside the kernel a device-DAX range is "just memory" in the
>>>>>>> sense that you can perform pfn_to_page() on it and issue I/O,
>>>>>>> but the vma is not migratable. To be honest I do not know how
>>>>>>> well that co-exists with drm infrastructure.
>>>>>>>> - How well will we be able to handle the case when we need to
>>>>>>>>   "move"/"evict" memory/data to a new location, so the CPU
>>>>>>>>   pointer should point to the new physical location/address
>>>>>>>>   (which may not be in PCI device memory at all)?
>>>>>>> So, device-DAX deliberately avoids support for in-kernel
>>>>>>> migration or overcommit. Those cases are left to the core mm or
>>>>>>> drm. The device-dax interface is for cases where all that is
>>>>>>> needed is a direct mapping to a statically-allocated
>>>>>>> physical-address range, be it persistent memory or some other
>>>>>>> special reserved memory range.
>>>>>> For some of the fancy use-cases (e.g. to be comparable to what
>>>>>> HMM can pull off) I think we want all the magic in core mm, i.e.
>>>>>> migration and overcommit. At least that seems to be the very
>>>>>> strong drive in all general-purpose gpu abstractions and
>>>>>> implementations, where memory is allocated with malloc, and then
>>>>>> mapped/moved into vram/gpu address space through some magic,
>>>>> It is possible that it goes the other way around: memory is
>>>>> requested to be allocated and should be kept in vram for
>>>>> performance reasons, but due to a possible overcommit case we
>>>>> need at least temporarily to "move" such an allocation to system
>>>>> memory.
>>>> With migration I meant migrating both ways of course. And with
>>>> stuff like numactl we can also influence where exactly the
>>>> malloc'ed memory is allocated originally, at least if we'd expose
>>>> the vram range as a very special numa node that happens to be far
>>>> away and not hold any cpu cores.
>>> I don't think we should be using numa distance to reverse engineer
>>> a certain allocation behavior. The latency data should be truthful,
>>> but you're right we'll need a mechanism to keep general purpose
>>> allocations out of that range by default. Btw, strict isolation is
>>> another design point of device-dax, but I think in this case we're
>>> describing something between the two extremes of full isolation and
>>> full compatibility with existing numactl apis.
>> Yes, agreed. My idea with exposing vram sections using numa nodes
>> wasn't to reuse all the existing allocation policies directly, those
>> won't work. So at boot-up your default numa policy would exclude any
>> vram nodes.
>>
>> But I think (as an -mm layman) that numa gives us a lot of the tools
>> and policy interface that we need to implement what we want for gpus.
> Agree completely. From a ten mile high view our GPUs are just command
> processors with local memory as well. Basically this is also the whole
> idea of what AMD is pushing with HSA for a while.
>
> It's just that a lot of problems start to pop up when you look at all
> the nasty details. For example only part of the GPU memory is usually
> accessible by the CPU. So even when numa nodes expose a good
> foundation for this I think there is still a lot of code to write.
>
> BTW: I should probably start to read into the numa code of the kernel.
> Any good pointers for that?

I would assume that the "page" allocation logic itself should be inside the graphics driver, due to possibly different requirements, especially from graphics: alignment, etc.

> Regards,
> Christian.
>
>> Wrt isolation: There's a sliding scale of what different users
>> expect, from full auto everything, including migrating pages around
>> if needed, to full isolation; all seems to be on the table. As long
>> as we keep vram nodes out of any default allocation numasets, full
>> isolation should be possible.
>> -Daniel

Sincerely yours,
Serguei Sagalovitch
Re: Enabling peer to peer device transactions for PCIe devices
On 2016-11-23 02:05 PM, Jason Gunthorpe wrote:
> On Wed, Nov 23, 2016 at 10:13:03AM -0700, Logan Gunthorpe wrote:
>> an MR would be very tricky. The MR may be relied upon by another host
>> and the kernel would have to inform user-space the MR was invalid,
>> then user-space would have to tell the remote application.
> As Bart says, it would be best to be combined with something like
> Mellanox's ODP MRs, which allows a page to be evicted and then trigger
> a CPU interrupt if a DMA is attempted so it can be brought back.

Please note that in the general case (including the MR one) we could have a "page fault" from a different PCIe device. So all PCIe devices must be synchronized.

> This includes the usual fencing mechanism so the CPU can block, flush,
> and then evict a page coherently.
>
> This is the general direction the industry is going in: link PCI DMA
> directly to dynamic user page tables, including support for demand
> faulting and synchronicity.
>
> Mellanox ODP is a rough implementation of mirroring a process's page
> table via the kernel, while IBM's CAPI (and CCIX, PCI ATS?) is
> probably a good example of where this is ultimately headed.
>
> CAPI allows a PCI DMA to directly target an ASID associated with a
> user process and then use the usual CPU machinery to do the page
> translation for the DMA. This includes page faults for evicted pages,
> and obviously allows eviction and migration..
>
> So, of all the solutions in the original list, I would discard
> anything that isn't VMA focused. Emulating what CAPI does in hardware
> with software is probably the best choice, or we have to do it all
> again when CAPI-style hardware broadly rolls out :(
>
> DAX and GPU allocators should create VMAs and manipulate them in the
> usual way to achieve migration, windowing, cache, movement or swap of
> the potentially peer-peer memory pages. They would have to respect the
> usual rules for a VMA, including pinning.
>
> DMA drivers would use the usual approaches for dealing with DMA from a
> VMA: short term pin or long term coherent translation mirror.
>
> So, to my view (looking from RDMA), the main problem with peer-peer is
> how do you DMA translate VMAs that point at non struct page memory?
>
> Does HMM solve the peer-peer problem? Does it do it generically or
> only for drivers that are mirroring translation tables?

In its current form HMM doesn't solve the peer-peer problem. Currently it allows "mirroring" of "malloc" memory on the GPU, which is not always what is needed. Additionally there is a need to be able to share VRAM allocations between different processes.

> From a RDMA perspective we could use something other than
> get_user_pages() to pin and DMA translate a VMA if the core community
> could decide on an API. e.g. get_user_dma_sg() would probably be quite
> usable.
>
> Jason
Re: Enabling peer to peer device transactions for PCIe devices
On Wed, Nov 23, 2016 at 10:40:47AM -0800, Dan Williams wrote:
> I don't think that was designed for the case where the backing memory
> is a special/static physical address range rather than anonymous
> "System RAM", right?

The hardware doesn't care where the memory is. ODP is just a generic mechanism to provide demand-fault behavior for a mirrored page table.

ODP has the same issue as everything else: it needs to translate a page table entry into a DMA address, and we have no API to do that when the page table points to peer-peer memory.

Jason
Re: Enabling peer to peer device transactions for PCIe devices
On Wed, Nov 23, 2016 at 10:13:03AM -0700, Logan Gunthorpe wrote:
> an MR would be very tricky. The MR may be relied upon by another host
> and the kernel would have to inform user-space the MR was invalid, then
> user-space would have to tell the remote application.

As Bart says, it would be best to be combined with something like Mellanox's ODP MRs, which allow a page to be evicted and then trigger a CPU interrupt if a DMA is attempted so it can be brought back. This includes the usual fencing mechanism so the CPU can block, flush, and then evict a page coherently.

This is the general direction the industry is going in: link PCI DMA directly to dynamic user page tables, including support for demand faulting and synchronicity.

Mellanox ODP is a rough implementation of mirroring a process's page table via the kernel, while IBM's CAPI (and CCIX, PCI ATS?) is probably a good example of where this is ultimately headed.

CAPI allows a PCI DMA to directly target an ASID associated with a user process and then use the usual CPU machinery to do the page translation for the DMA. This includes page faults for evicted pages, and obviously allows eviction and migration.

So, of all the solutions in the original list, I would discard anything that isn't VMA focused. Emulating what CAPI does in hardware with software is probably the best choice, or we have to do it all again when CAPI-style hardware broadly rolls out :(

DAX and GPU allocators should create VMAs and manipulate them in the usual way to achieve migration, windowing, cache, movement or swap of the potentially peer-peer memory pages. They would have to respect the usual rules for a VMA, including pinning.

DMA drivers would use the usual approaches for dealing with DMA from a VMA: short term pin or long term coherent translation mirror.

So, to my view (looking from RDMA), the main problem with peer-peer is how do you DMA translate VMAs that point at non struct page memory?

Does HMM solve the peer-peer problem?
Does it do it generically or only for drivers that are mirroring translation tables? >From a RDMA perspective we could use something other than get_user_pages() to pin and DMA translate a VMA if the core community could decide on an API. eg get_user_dma_sg() would probably be quite usable. Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
[PATCH 3/6] dax: add tracepoint infrastructure, PMD tracing
Tracepoints are the standard way to capture debugging and tracing information in many parts of the kernel, including the XFS and ext4 filesystems. Create a tracepoint header for FS DAX and add the first DAX tracepoints to the PMD fault handler. This allows DAX tracing to be done in the same way as filesystem tracing, so that developers can look at them together and get a coherent idea of what the system is doing.

I added both an entry and an exit tracepoint because future patches will add tracepoints to child functions of dax_iomap_pmd_fault() like dax_pmd_load_hole() and dax_pmd_insert_mapping(). We want those messages to be wrapped by the parent function's tracepoints so the code flow is more easily understood. Having entry and exit tracepoints for faults also lets us easily see which filesystem functions were called during the fault. These filesystem functions get executed via iomap_begin() and iomap_end() calls, for example, and will have their own tracepoints.

For PMD faults we primarily want to understand the faulting address and whether the handler fell back to 4k faults. If it fell back, the tracepoints should let us understand why.

I named the new tracepoint header file "fs_dax.h" to allow device DAX to have its own separate tracing header in the same directory at some point.
Here is an example output for these events from a successful PMD fault:

big-2057 [000] 136.396855: dax_pmd_fault: shared mapping write address 0x10505000 vm_start 0x1020 vm_end 0x1070 pgoff 0x200 max_pgoff 0x1400
big-2057 [000] 136.397943: dax_pmd_fault_done: shared mapping write address 0x10505000 vm_start 0x1020 vm_end 0x1070 pgoff 0x200 max_pgoff 0x1400 NOPAGE

Signed-off-by: Ross Zwisler
Suggested-by: Dave Chinner
---
 fs/dax.c                      | 29 +---
 include/linux/mm.h            | 14 ++
 include/trace/events/fs_dax.h | 61 +++
 3 files changed, 94 insertions(+), 10 deletions(-)
 create mode 100644 include/trace/events/fs_dax.h

diff --git a/fs/dax.c b/fs/dax.c
index cc8a069..1aa7616 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -35,6 +35,9 @@
 #include
 #include "internal.h"

+#define CREATE_TRACE_POINTS
+#include
+
 /* We choose 4096 entries - same as per-zone page wait tables */
 #define DAX_WAIT_TABLE_BITS 12
 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
@@ -1310,6 +1313,16 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	loff_t pos;
 	int error;

+	/*
+	 * Check whether offset isn't beyond end of file now. Caller is
+	 * supposed to hold locks serializing us with truncate / punch hole so
+	 * this is a reliable test.
+	 */
+	pgoff = linear_page_index(vma, pmd_addr);
+	max_pgoff = (i_size_read(inode) - 1) >> PAGE_SHIFT;
+
+	trace_dax_pmd_fault(vma, address, flags, pgoff, max_pgoff, 0);
+
 	/* Fall back to PTEs if we're going to COW */
 	if (write && !(vma->vm_flags & VM_SHARED))
 		goto fallback;
@@ -1320,16 +1333,10 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	if ((pmd_addr + PMD_SIZE) > vma->vm_end)
 		goto fallback;

-	/*
-	 * Check whether offset isn't beyond end of file now. Caller is
-	 * supposed to hold locks serializing us with truncate / punch hole so
-	 * this is a reliable test.
-	 */
-	pgoff = linear_page_index(vma, pmd_addr);
-	max_pgoff = (i_size_read(inode) - 1) >> PAGE_SHIFT;
-
-	if (pgoff > max_pgoff)
-		return VM_FAULT_SIGBUS;
+	if (pgoff > max_pgoff) {
+		result = VM_FAULT_SIGBUS;
+		goto out;
+	}

 	/* If the PMD would extend beyond the file size */
 	if ((pgoff | PG_PMD_COLOUR) > max_pgoff)
@@ -1400,6 +1407,8 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		split_huge_pmd(vma, pmd, address);
 		count_vm_event(THP_FAULT_FALLBACK);
 	}
+out:
+	trace_dax_pmd_fault_done(vma, address, flags, pgoff, max_pgoff, result);
 	return result;
 }
 EXPORT_SYMBOL_GPL(dax_iomap_pmd_fault);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a5f52c0..e373917 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1107,6 +1107,20 @@ static inline void clear_page_pfmemalloc(struct page *page)
 			 VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE | \
 			 VM_FAULT_FALLBACK)

+#define VM_FAULT_RESULT_TRACE \
+	{ VM_FAULT_OOM,			"OOM" }, \
+	{ VM_FAULT_SIGBUS,		"SIGBUS" }, \
+	{ VM_FAULT_MAJOR,		"MAJOR" }, \
+	{ VM_FAULT_WRITE,		"WRITE" }, \
+	{ VM_FAULT_HWPOISON,		"HWPOISON" }, \
+	{ VM_FAULT_HWPOISON_LARGE,
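[Editor's note] The pgoff/max_pgoff arithmetic and the PMD "colour" test that this patch moves around are easy to sanity-check in userspace. The sketch below is an illustrative Python mirror of the three outcomes the fault handler can take, assuming x86-64's 4 KiB pages and 2 MiB PMDs; the example addresses are invented.

```python
PAGE_SHIFT = 12                               # 4 KiB pages (x86-64)
PMD_SIZE = 1 << 21                            # 2 MiB huge page
PG_PMD_COLOUR = (PMD_SIZE >> PAGE_SHIFT) - 1  # 511: pgoff bits a PMD must cover

def pmd_fault_outcome(address, vm_start, vm_pgoff, i_size):
    """Userspace mirror of the bounds checks in dax_iomap_pmd_fault()."""
    pmd_addr = address & ~(PMD_SIZE - 1)
    # linear_page_index(): file page offset backing this virtual address
    pgoff = ((pmd_addr - vm_start) >> PAGE_SHIFT) + vm_pgoff
    max_pgoff = (i_size - 1) >> PAGE_SHIFT
    if pgoff > max_pgoff:
        return "SIGBUS"      # fault is entirely beyond end-of-file
    if (pgoff | PG_PMD_COLOUR) > max_pgoff:
        return "FALLBACK"    # a full PMD would extend past end-of-file
    return "PMD_OK"

# Invented numbers: a mapping of a file, faulting 4 MiB into the VMA,
# against a 32 MiB file, a 1 MiB file, and a file ending mid-PMD.
for size in (32 << 20, 1 << 20, (1024 + 100) << 12):
    print(pmd_fault_outcome(0x10400000, 0x10000000, 0, size))
# -> PMD_OK, SIGBUS, FALLBACK
```

This matches the commit message's goal: when a PMD fault falls back to 4k faults, the entry/exit pair shows which of these checks fired.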
[PATCH 6/6] dax: add tracepoints to dax_pmd_insert_mapping()
Add tracepoints to dax_pmd_insert_mapping(), following the same logging conventions as the tracepoints in dax_iomap_pmd_fault().

Here is an example PMD fault showing the new tracepoints:

big-1544 [006] 48.153479: dax_pmd_fault: shared mapping write address 0x10505000 vm_start 0x1020 vm_end 0x1070 pgoff 0x200 max_pgoff 0x1400
big-1544 [006] 48.155230: dax_pmd_insert_mapping: shared mapping write address 0x10505000 length 0x20 pfn 0x100600 DEV|MAP radix_entry 0xc000e
big-1544 [006] 48.155266: dax_pmd_fault_done: shared mapping write address 0x10505000 vm_start 0x1020 vm_end 0x1070 pgoff 0x200 max_pgoff 0x1400 NOPAGE

Signed-off-by: Ross Zwisler
---
 fs/dax.c                      | 10 +++---
 include/linux/pfn_t.h         |  6 ++
 include/trace/events/fs_dax.h | 42 ++
 3 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 2824414..d6ba4a3 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1236,10 +1236,10 @@ static int dax_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
 		.size = PMD_SIZE,
 	};
 	long length = dax_map_atomic(bdev, );
-	void *ret;
+	void *ret = NULL;

 	if (length < 0) /* dax_map_atomic() failed */
-		return VM_FAULT_FALLBACK;
+		goto fallback;
 	if (length < PMD_SIZE)
 		goto unmap_fallback;
 	if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)
@@ -1252,13 +1252,17 @@ static int dax_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
 	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, dax.sector,
 			RADIX_DAX_PMD);
 	if (IS_ERR(ret))
-		return VM_FAULT_FALLBACK;
+		goto fallback;
 	*entryp = ret;

+	trace_dax_pmd_insert_mapping(vma, address, write, length, dax.pfn, ret);
 	return vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);

 unmap_fallback:
 	dax_unmap_atomic(bdev, );
+fallback:
+	trace_dax_pmd_insert_mapping_fallback(vma, address, write, length,
+			dax.pfn, ret);
 	return VM_FAULT_FALLBACK;
 }

diff --git a/include/linux/pfn_t.h b/include/linux/pfn_t.h
index a3d90b9..033fc7b 100644
--- a/include/linux/pfn_t.h
+++ b/include/linux/pfn_t.h
@@ -15,6 +15,12 @@
 #define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3))
 #define PFN_MAP (1ULL << (BITS_PER_LONG_LONG - 4))

+#define PFN_FLAGS_TRACE \
+	{ PFN_SG_CHAIN,	"SG_CHAIN" }, \
+	{ PFN_SG_LAST,	"SG_LAST" }, \
+	{ PFN_DEV,	"DEV" }, \
+	{ PFN_MAP,	"MAP" }
+
 static inline pfn_t __pfn_to_pfn_t(unsigned long pfn, u64 flags)
 {
 	pfn_t pfn_t = { .val = pfn | (flags & PFN_FLAGS_MASK), };
diff --git a/include/trace/events/fs_dax.h b/include/trace/events/fs_dax.h
index 8814b1a..a03f820 100644
--- a/include/trace/events/fs_dax.h
+++ b/include/trace/events/fs_dax.h
@@ -87,6 +87,48 @@ DEFINE_EVENT(dax_pmd_load_hole_class, name, \
 DEFINE_PMD_LOAD_HOLE_EVENT(dax_pmd_load_hole);
 DEFINE_PMD_LOAD_HOLE_EVENT(dax_pmd_load_hole_fallback);

+DECLARE_EVENT_CLASS(dax_pmd_insert_mapping_class,
+	TP_PROTO(struct vm_area_struct *vma, unsigned long address, int write,
+		long length, pfn_t pfn, void *radix_entry),
+	TP_ARGS(vma, address, write, length, pfn, radix_entry),
+	TP_STRUCT__entry(
+		__field(unsigned long, vm_flags)
+		__field(unsigned long, address)
+		__field(int, write)
+		__field(long, length)
+		__field(u64, pfn_val)
+		__field(void *, radix_entry)
+	),
+	TP_fast_assign(
+		__entry->vm_flags = vma->vm_flags;
+		__entry->address = address;
+		__entry->write = write;
+		__entry->length = length;
+		__entry->pfn_val = pfn.val;
+		__entry->radix_entry = radix_entry;
+	),
+	TP_printk("%s mapping %s address %#lx length %#lx pfn %#llx %s"
+			" radix_entry %#lx",
+		__entry->vm_flags & VM_SHARED ? "shared" : "private",
+		__entry->write ? "write" : "read",
+		__entry->address,
+		__entry->length,
+		__entry->pfn_val & ~PFN_FLAGS_MASK,
+		__print_flags(__entry->pfn_val & PFN_FLAGS_MASK, "|",
+			PFN_FLAGS_TRACE),
+		(unsigned long)__entry->radix_entry
+	)
+)
+
+#define DEFINE_PMD_INSERT_MAPPING_EVENT(name) \
+DEFINE_EVENT(dax_pmd_insert_mapping_class, name, \
+	TP_PROTO(struct vm_area_struct *vma, unsigned long address, \
+		int write, long length, pfn_t pfn, void *radix_entry), \
+	TP_ARGS(vma, address, write, length, pfn, radix_entry))
+
+DEFINE_PMD_INSERT_MAPPING_EVENT(dax_pmd_insert_mapping);
+DEFINE_PMD_INSERT_MAPPING_EVENT(dax_pmd_insert_mapping_fallback);
+
 #endif /*
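[Editor's note] The "DEV|MAP" string in the sample trace line above comes from __print_flags() walking the PFN_FLAGS_TRACE table. Here is an illustrative userspace model of that decoding; the bit positions follow the PFN_* definitions in include/linux/pfn_t.h, and FLAGS_MASK is a simplified stand-in for the kernel's PFN_FLAGS_MASK.

```python
BITS_PER_LONG_LONG = 64

# Same order as the PFN_FLAGS_TRACE table in the patch.
PFN_FLAGS = [
    (1 << (BITS_PER_LONG_LONG - 1), "SG_CHAIN"),
    (1 << (BITS_PER_LONG_LONG - 2), "SG_LAST"),
    (1 << (BITS_PER_LONG_LONG - 3), "DEV"),
    (1 << (BITS_PER_LONG_LONG - 4), "MAP"),
]
FLAGS_MASK = sum(bit for bit, _ in PFN_FLAGS)  # simplified PFN_FLAGS_MASK

def print_flags(val, delim="|"):
    """Userspace model of __print_flags(val, "|", PFN_FLAGS_TRACE)."""
    return delim.join(name for bit, name in PFN_FLAGS if val & bit)

# Rebuild the pfn value behind the sample line "pfn 0x100600 DEV|MAP":
PFN_DEV, PFN_MAP = PFN_FLAGS[2][0], PFN_FLAGS[3][0]
val = 0x100600 | PFN_DEV | PFN_MAP
print(hex(val & ~FLAGS_MASK), print_flags(val & FLAGS_MASK))  # 0x100600 DEV|MAP
```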
[PATCH 4/6] dax: update MAINTAINERS entries for FS DAX
Add the new include/trace/events/fs_dax.h tracepoint header, update Matthew's email address and add myself as a maintainer for filesystem DAX.

Signed-off-by: Ross Zwisler
Suggested-by: Matthew Wilcox
---
 MAINTAINERS | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 2a58eea..8fef4bf 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3855,10 +3855,12 @@ S:	Maintained
 F:	drivers/i2c/busses/i2c-diolan-u2c.c

 DIRECT ACCESS (DAX)
-M:	Matthew Wilcox
+M:	Matthew Wilcox
+M:	Ross Zwisler
 L:	linux-fsde...@vger.kernel.org
 S:	Supported
 F:	fs/dax.c
+F:	include/trace/events/fs_dax.h

 DIRECTORY NOTIFICATION (DNOTIFY)
 M:	Eric Paris
--
2.7.4
[PATCH 5/6] dax: add tracepoints to dax_pmd_load_hole()
Add tracepoints to dax_pmd_load_hole(), following the same logging conventions as the tracepoints in dax_iomap_pmd_fault().

Here is an example PMD fault showing the new tracepoints:

read_big-1393 [007] 32.133809: dax_pmd_fault: shared mapping read address 0x1040 vm_start 0x1020 vm_end 0x1060 pgoff 0x200 max_pgoff 0x1400
read_big-1393 [007] 32.134067: dax_pmd_load_hole: shared mapping read address 0x1040 zero_page ea0002b98000 radix_entry 0x1e
read_big-1393 [007] 32.134069: dax_pmd_fault_done: shared mapping read address 0x1040 vm_start 0x1020 vm_end 0x1060 pgoff 0x200 max_pgoff 0x1400 NOPAGE

Signed-off-by: Ross Zwisler
---
 fs/dax.c                      | 13 +
 include/trace/events/fs_dax.h | 32 
 2 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 1aa7616..2824414 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1269,32 +1269,37 @@ static int dax_pmd_load_hole(struct vm_area_struct *vma, pmd_t *pmd,
 	struct address_space *mapping = vma->vm_file->f_mapping;
 	unsigned long pmd_addr = address & PMD_MASK;
 	struct page *zero_page;
+	void *ret = NULL;
 	spinlock_t *ptl;
 	pmd_t pmd_entry;
-	void *ret;

 	zero_page = mm_get_huge_zero_page(vma->vm_mm);

 	if (unlikely(!zero_page))
-		return VM_FAULT_FALLBACK;
+		goto fallback;

 	ret = dax_insert_mapping_entry(mapping, vmf, *entryp, 0,
 			RADIX_DAX_PMD | RADIX_DAX_HZP);
 	if (IS_ERR(ret))
-		return VM_FAULT_FALLBACK;
+		goto fallback;
 	*entryp = ret;

 	ptl = pmd_lock(vma->vm_mm, pmd);
 	if (!pmd_none(*pmd)) {
 		spin_unlock(ptl);
-		return VM_FAULT_FALLBACK;
+		goto fallback;
 	}

 	pmd_entry = mk_pmd(zero_page, vma->vm_page_prot);
 	pmd_entry = pmd_mkhuge(pmd_entry);
 	set_pmd_at(vma->vm_mm, pmd_addr, pmd, pmd_entry);
 	spin_unlock(ptl);
+	trace_dax_pmd_load_hole(vma, address, zero_page, ret);
 	return VM_FAULT_NOPAGE;
+
+fallback:
+	trace_dax_pmd_load_hole_fallback(vma, address, zero_page, ret);
+	return VM_FAULT_FALLBACK;
 }

 int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
diff --git a/include/trace/events/fs_dax.h b/include/trace/events/fs_dax.h
index f9ed4eb..8814b1a 100644
--- a/include/trace/events/fs_dax.h
+++ b/include/trace/events/fs_dax.h
@@ -54,6 +54,38 @@ DEFINE_EVENT(dax_pmd_fault_class, name, \
 DEFINE_PMD_FAULT_EVENT(dax_pmd_fault);
 DEFINE_PMD_FAULT_EVENT(dax_pmd_fault_done);

+DECLARE_EVENT_CLASS(dax_pmd_load_hole_class,
+	TP_PROTO(struct vm_area_struct *vma, unsigned long address,
+		struct page *zero_page, void *radix_entry),
+	TP_ARGS(vma, address, zero_page, radix_entry),
+	TP_STRUCT__entry(
+		__field(unsigned long, vm_flags)
+		__field(unsigned long, address)
+		__field(struct page *, zero_page)
+		__field(void *, radix_entry)
+	),
+	TP_fast_assign(
+		__entry->vm_flags = vma->vm_flags;
+		__entry->address = address;
+		__entry->zero_page = zero_page;
+		__entry->radix_entry = radix_entry;
+	),
+	TP_printk("%s mapping read address %#lx zero_page %p radix_entry %#lx",
+		__entry->vm_flags & VM_SHARED ? "shared" : "private",
+		__entry->address,
+		__entry->zero_page,
+		(unsigned long)__entry->radix_entry
+	)
+)
+
+#define DEFINE_PMD_LOAD_HOLE_EVENT(name) \
+DEFINE_EVENT(dax_pmd_load_hole_class, name, \
+	TP_PROTO(struct vm_area_struct *vma, unsigned long address, \
+		struct page *zero_page, void *radix_entry), \
+	TP_ARGS(vma, address, zero_page, radix_entry))
+
+DEFINE_PMD_LOAD_HOLE_EVENT(dax_pmd_load_hole);
+DEFINE_PMD_LOAD_HOLE_EVENT(dax_pmd_load_hole_fallback);
+
 #endif /* _TRACE_FS_DAX_H */
--
2.7.4
[PATCH 1/6] dax: fix build breakage with ext4, dax and !iomap
With the current Kconfig setup it is possible to have the following:

CONFIG_EXT4_FS=y
CONFIG_FS_DAX=y
CONFIG_FS_IOMAP=n	# this is in fs/Kconfig & isn't user accessible

With this config we get build failures in ext4_dax_fault() because the iomap functions in fs/dax.c are missing:

fs/built-in.o: In function `ext4_dax_fault':
file.c:(.text+0x7f3ac): undefined reference to `dax_iomap_fault'
file.c:(.text+0x7f404): undefined reference to `dax_iomap_fault'
fs/built-in.o: In function `ext4_file_read_iter':
file.c:(.text+0x7fc54): undefined reference to `dax_iomap_rw'
fs/built-in.o: In function `ext4_file_write_iter':
file.c:(.text+0x7fe9a): undefined reference to `dax_iomap_rw'
file.c:(.text+0x7feed): undefined reference to `dax_iomap_rw'
fs/built-in.o: In function `ext4_block_zero_page_range':
inode.c:(.text+0x85c0d): undefined reference to `iomap_zero_range'

Now that the struct buffer_head based DAX fault paths and I/O path have been removed, we really depend on iomap support being present for DAX. Make this explicit by selecting FS_IOMAP if we compile in DAX support.

Signed-off-by: Ross Zwisler
---
 fs/Kconfig      | 1 +
 fs/dax.c        | 2 --
 fs/ext2/Kconfig | 1 -
 3 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 8e9e5f41..18024bf 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -38,6 +38,7 @@ config FS_DAX
 	bool "Direct Access (DAX) support"
 	depends on MMU
 	depends on !(ARM || MIPS || SPARC)
+	select FS_IOMAP
 	help
 	  Direct Access (DAX) can be used on memory-backed block devices.
 	  If the block device supports DAX and the filesystem supports DAX,
diff --git a/fs/dax.c b/fs/dax.c
index be39633..d8fe3eb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -968,7 +968,6 @@ int __dax_zero_page_range(struct block_device *bdev, sector_t sector,
 }
 EXPORT_SYMBOL_GPL(__dax_zero_page_range);

-#ifdef CONFIG_FS_IOMAP
 static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
 {
 	return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
@@ -1405,4 +1404,3 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 }
 EXPORT_SYMBOL_GPL(dax_iomap_pmd_fault);
 #endif /* CONFIG_FS_DAX_PMD */
-#endif /* CONFIG_FS_IOMAP */
diff --git a/fs/ext2/Kconfig b/fs/ext2/Kconfig
index 36bea5a..c634874e 100644
--- a/fs/ext2/Kconfig
+++ b/fs/ext2/Kconfig
@@ -1,6 +1,5 @@
 config EXT2_FS
 	tristate "Second extended fs support"
-	select FS_IOMAP if FS_DAX
 	help
 	  Ext2 is a standard Linux file system for hard disks.
--
2.7.4
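[Editor's note] After this patch the FS_DAX entry reads as below (assembled from the hunks above); any configuration with FS_DAX=y now pulls in FS_IOMAP automatically, so the EXT4+DAX+!IOMAP combination that broke the build can no longer be produced:

```
config FS_DAX
	bool "Direct Access (DAX) support"
	depends on MMU
	depends on !(ARM || MIPS || SPARC)
	select FS_IOMAP
	help
	  Direct Access (DAX) can be used on memory-backed block devices.
```

Because `select` forces the symbol on unconditionally, the per-filesystem `select FS_IOMAP if FS_DAX` in fs/ext2/Kconfig becomes redundant and is dropped.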
Re: Enabling peer to peer device transactions for PCIe devices
On Wed, Nov 23, 2016 at 9:27 AM, Bart Van Assche wrote:
> On 11/23/2016 09:13 AM, Logan Gunthorpe wrote:
>>
>> IMO any memory that has been registered for a P2P transaction should be
>> locked from being evicted. So if there's a get_user_pages call it needs
>> to be pinned until the put_page. The main issue being with the RDMA
>> case: handling an eviction when a chunk of memory has been registered as
>> an MR would be very tricky. The MR may be relied upon by another host
>> and the kernel would have to inform user-space the MR was invalid, then
>> user-space would have to tell the remote application.
>
> Hello Logan,
>
> Are you aware that the Linux kernel already supports ODP (On Demand
> Paging)? See also the output of git grep -nHi on.demand.paging. See also
> https://www.openfabrics.org/images/eventpresos/workshops2014/DevWorkshop/presos/Tuesday/pdf/04_ODP_update.pdf.

I don't think that was designed for the case where the backing memory is a special/static physical address range rather than anonymous "System RAM", right? I think we should handle the graphics P2P concerns separately from the general P2P-DMA case, since the latter does not require the higher-order memory management facilities. Using ZONE_DEVICE/DAX mappings to avoid changes to every driver that wants to support P2P-DMA separately from typical DMA still seems the path of least resistance.
Re: Enabling peer to peer device transactions for PCIe devices
On 11/23/2016 09:13 AM, Logan Gunthorpe wrote:
> IMO any memory that has been registered for a P2P transaction should be
> locked from being evicted. So if there's a get_user_pages call it needs
> to be pinned until the put_page. The main issue being with the RDMA
> case: handling an eviction when a chunk of memory has been registered as
> an MR would be very tricky. The MR may be relied upon by another host
> and the kernel would have to inform user-space the MR was invalid, then
> user-space would have to tell the remote application.

Hello Logan,

Are you aware that the Linux kernel already supports ODP (On Demand Paging)? See also the output of git grep -nHi on.demand.paging. See also https://www.openfabrics.org/images/eventpresos/workshops2014/DevWorkshop/presos/Tuesday/pdf/04_ODP_update.pdf.

Bart.
Re: Enabling peer to peer device transactions for PCIe devices
On 11/22/2016 11:49 PM, Daniel Vetter wrote:
> Yes, agreed. My idea with exposing vram sections using numa nodes wasn't
> to reuse all the existing allocation policies directly, those won't work.
> So at boot-up your default numa policy would exclude any vram nodes.
>
> But I think (as an -mm layman) that numa gives us a lot of the tools and
> policy interface that we need to implement what we want for gpus.

Are you suggesting creating NUMA nodes for video RAM (I assume that's what you mean by vram) where that RAM is not at all CPU-accessible?
Re: Enabling peer to peer device transactions for PCIe devices
On 23.11.2016 at 08:49, Daniel Vetter wrote:
> On Tue, Nov 22, 2016 at 01:21:03PM -0800, Dan Williams wrote:
>> On Tue, Nov 22, 2016 at 1:03 PM, Daniel Vetter wrote:
>>> On Tue, Nov 22, 2016 at 9:35 PM, Serguei Sagalovitch wrote:
>>>> On 2016-11-22 03:10 PM, Daniel Vetter wrote:
>>>>> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams wrote:
>>>>>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch wrote:
>>>>>>> I personally like the "device-DAX" idea but my concerns are:
>>>>>>> - How well will it co-exist with the DRM infrastructure /
>>>>>>>   implementations in the part dealing with CPU pointers?
>>>>>>
>>>>>> Inside the kernel a device-DAX range is "just memory" in the sense
>>>>>> that you can perform pfn_to_page() on it and issue I/O, but the vma
>>>>>> is not migratable. To be honest I do not know how well that
>>>>>> co-exists with drm infrastructure.
>>>>>>
>>>>>>> - How well will we be able to handle the case when we need to
>>>>>>>   "move"/"evict" memory/data to a new location, so the CPU pointer
>>>>>>>   should point to the new physical location/address (which may not
>>>>>>>   be in PCI device memory at all)?
>>>>>>
>>>>>> So, device-DAX deliberately avoids support for in-kernel migration
>>>>>> or overcommit. Those cases are left to the core mm or drm. The
>>>>>> device-dax interface is for cases where all that is needed is a
>>>>>> direct mapping to a statically-allocated physical-address range, be
>>>>>> it persistent memory or some other special reserved memory range.
>>>>>
>>>>> For some of the fancy use-cases (e.g. to be comparable to what HMM
>>>>> can pull off) I think we want all the magic in core mm, i.e.
>>>>> migration and overcommit. At least that seems to be the very strong
>>>>> drive in all general-purpose gpu abstractions and implementations,
>>>>> where memory is allocated with malloc, and then mapped/moved into
>>>>> vram/gpu address space through some magic,
>>>>
>>>> It is possible that it goes the other way around: memory is requested
>>>> to be allocated and should be kept in vram for performance reasons,
>>>> but due to a possible overcommit case we need at least temporarily to
>>>> "move" such an allocation to system memory.
>>>
>>> With migration I meant migrating both ways of course. And with stuff
>>> like numactl we can also influence where exactly the malloc'ed memory
>>> is allocated originally, at least if we'd expose the vram range as a
>>> very special numa node that happens to be far away and not hold any
>>> cpu cores.
>>
>> I don't think we should be using numa distance to reverse engineer a
>> certain allocation behavior. The latency data should be truthful, but
>> you're right we'll need a mechanism to keep general purpose allocations
>> out of that range by default. Btw, strict isolation is another design
>> point of device-dax, but I think in this case we're describing
>> something between the two extremes of full isolation and full
>> compatibility with existing numactl apis.
>
> Yes, agreed. My idea with exposing vram sections using numa nodes wasn't
> to reuse all the existing allocation policies directly, those won't
> work. So at boot-up your default numa policy would exclude any vram
> nodes.
>
> But I think (as an -mm layman) that numa gives us a lot of the tools and
> policy interface that we need to implement what we want for gpus.

Agree completely. From a ten-mile-high view our GPUs are just command processors with local memory as well. Basically this is also the whole idea of what AMD has been pushing with HSA for a while.

It's just that a lot of problems start to pop up when you look at all the nasty details. For example, only part of the GPU memory is usually accessible by the CPU. So even though numa nodes expose a good foundation for this, I think there is still a lot of code to write.

BTW: I should probably start to read into the numa code of the kernel. Any good pointers for that?

Regards,
Christian.

> Wrt isolation: there's a sliding scale of what different users expect,
> from full-auto everything, including migrating pages around if needed,
> to full isolation; all of it seems to be on the table. As long as we
> keep vram nodes out of any default allocation numasets, full isolation
> should be possible.
> -Daniel
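[Editor's note] To make the "exclude any vram nodes from the default numa policy" idea above concrete: one simple model is that the default allocation nodemask contains only nodes that hold CPU cores, so a CPU-less vram node is never touched unless a task explicitly binds to it (the way numactl --membind would). The node layout below is invented for illustration; this is a toy policy model, not kernel behavior.

```python
# (node_id, has_cpus, kind) -- layout invented for illustration
nodes = [
    (0, True,  "dram"),
    (1, True,  "dram"),
    (2, False, "vram"),   # GPU memory exposed as a CPU-less node
]

def default_nodemask(nodes):
    """Default policy: allocate only from nodes that hold CPU cores."""
    return {nid for nid, has_cpus, _ in nodes if has_cpus}

def alloc_node(nodes, explicit_bind=None):
    """Pick the first allowed node; explicit_bind mimics an explicit
    membind to a node set, overriding the default policy."""
    allowed = explicit_bind if explicit_bind is not None else default_nodemask(nodes)
    for nid, _, _ in nodes:
        if nid in allowed:
            return nid
    raise MemoryError("no allowed node")

print(alloc_node(nodes))       # default policy never picks the vram node
print(alloc_node(nodes, {2}))  # opting in to the vram node explicitly
```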