Re: [PATCH] mm/gup: combine put_compound_head() and unpin_user_page()
On Wed, Dec 09, 2020 at 12:57:38PM -0800, John Hubbard wrote:
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 98eb8e6d2609c3..7b33b7d4b324d7 100644
> > +++ b/mm/gup.c
> > @@ -123,6 +123,28 @@ static __maybe_unused struct page
> > *try_grab_compound_head(struct page *page,
> >  	return NULL;
> >  }
> >
> > +static void put_compound_head(struct page *page, int refs, unsigned int
> > flags)
> > +{
>
> It might be nice to rename "page" to "head", here.
>
> While reading this I toyed with the idea of having this at the top:
>
> 	VM_BUG_ON_PAGE(compound_head(page) != page, page);
>
> ...but it's overkill in a static function with pretty clear call sites. So I
> think it's just right as-is.

Matt's folio patches would take care of this; when that is available all
this becomes a lot better.

Thanks,
Jason
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
[PATCH] mm/gup: combine put_compound_head() and unpin_user_page()
These functions accomplish the same thing but have different
implementations.

unpin_user_page() has a bug where it calls mod_node_page_state() after
calling put_page(), which creates a risk that the page could have been
hot-unplugged from the system.

Fix this by using put_compound_head() as the only implementation.
__unpin_devmap_managed_user_page() and related can be deleted as well in
favour of the simpler, but slower, version in put_compound_head() that
has an extra atomic page_ref_sub, but always calls put_page() which
internally contains the special devmap code.

Move put_compound_head() to be directly after try_grab_compound_head()
so people can find it in future.

Fixes: 1970dc6f5226 ("mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting")
Signed-off-by: Jason Gunthorpe
---
 mm/gup.c | 103 +--
 1 file changed, 23 insertions(+), 80 deletions(-)

With Matt's folio idea I'd next go to make a put_folio(folio, refs),
which would cleanly eliminate that extra atomic here without duplicating
the devmap special case.

This should also be called 'ungrab_compound_head' as we seem to be using
the word 'grab' to mean 'pin or get' depending on GUP flags.

diff --git a/mm/gup.c b/mm/gup.c
index 98eb8e6d2609c3..7b33b7d4b324d7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -123,6 +123,28 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
 	return NULL;
 }
 
+static void put_compound_head(struct page *page, int refs, unsigned int flags)
+{
+	if (flags & FOLL_PIN) {
+		mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED,
+				    refs);
+
+		if (hpage_pincount_available(page))
+			hpage_pincount_sub(page, refs);
+		else
+			refs *= GUP_PIN_COUNTING_BIAS;
+	}
+
+	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
+	/*
+	 * Calling put_page() for each ref is unnecessarily slow. Only the last
+	 * ref needs a put_page().
+	 */
+	if (refs > 1)
+		page_ref_sub(page, refs - 1);
+	put_page(page);
+}
+
 /**
  * try_grab_page() - elevate a page's refcount by a flag-dependent amount
  *
@@ -177,41 +199,6 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
 	return true;
 }
 
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-static bool __unpin_devmap_managed_user_page(struct page *page)
-{
-	int count, refs = 1;
-
-	if (!page_is_devmap_managed(page))
-		return false;
-
-	if (hpage_pincount_available(page))
-		hpage_pincount_sub(page, 1);
-	else
-		refs = GUP_PIN_COUNTING_BIAS;
-
-	count = page_ref_sub_return(page, refs);
-
-	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
-	/*
-	 * devmap page refcounts are 1-based, rather than 0-based: if
-	 * refcount is 1, then the page is free and the refcount is
-	 * stable because nobody holds a reference on the page.
-	 */
-	if (count == 1)
-		free_devmap_managed_page(page);
-	else if (!count)
-		__put_page(page);
-
-	return true;
-}
-#else
-static bool __unpin_devmap_managed_user_page(struct page *page)
-{
-	return false;
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
 /**
  * unpin_user_page() - release a dma-pinned page
  * @page:	pointer to page to be released
@@ -223,28 +210,7 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
  */
 void unpin_user_page(struct page *page)
 {
-	int refs = 1;
-
-	page = compound_head(page);
-
-	/*
-	 * For devmap managed pages we need to catch refcount transition from
-	 * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
-	 * page is free and we need to inform the device driver through
-	 * callback. See include/linux/memremap.h and HMM for details.
-	 */
-	if (__unpin_devmap_managed_user_page(page))
-		return;
-
-	if (hpage_pincount_available(page))
-		hpage_pincount_sub(page, 1);
-	else
-		refs = GUP_PIN_COUNTING_BIAS;
-
-	if (page_ref_sub_and_test(page, refs))
-		__put_page(page);
-
-	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
+	put_compound_head(compound_head(page), 1, FOLL_PIN);
 }
 EXPORT_SYMBOL(unpin_user_page);
 
@@ -2062,29 +2028,6 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
  * This code is based heavily on the PowerPC implementation by Nick Piggin.
  */
 #ifdef CONFIG_HAVE_FAST_GUP
-
-static void put_compound_head(struct page *page, int refs, unsigned int flags)
-{
-	if (flags & FOLL_PIN) {
-		mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED,
-
Re: regression from 5.10.0-rc3: BUG: Bad page state in process kworker/41:0 pfn:891066 during fio on devdax
On Mon, Nov 09, 2020 at 01:54:42PM -0400, Jason Gunthorpe wrote:
> On Mon, Nov 09, 2020 at 09:26:19AM -0800, Dan Williams wrote:
> > On Mon, Nov 9, 2020 at 6:12 AM Jason Gunthorpe wrote:
> > >
> > > Wow, this is surprising
> > >
> > > This has been widely backported already, Dan please check??
> > >
> > > I thought pgprot_decrypted was a NOP on most x86 platforms -
> > > sme_me_mask == 0:
> > >
> > > #define __sme_set(x)	((x) | sme_me_mask)
> > > #define __sme_clr(x)	((x) & ~sme_me_mask)
> > >
> > > ??
> > >
> > > Confused how this can be causing DAX issues
> >
> > Does that correctly preserve the "soft" pte bits? Especially
> > PTE_DEVMAP that DAX uses?
> >
> > I'll check...
>
> extern u64 sme_me_mask;
> #define __pgprot(x)	((pgprot_t) { (x) } )
> #define pgprot_val(x)	((x).pgprot)
> #define __sme_clr(x)	((x) & ~sme_me_mask)
> #define pgprot_decrypted(prot)	__pgprot(__sme_clr(pgprot_val(prot)))
>
> static inline int io_remap_pfn_range(struct vm_area_struct *vma,
> 				     unsigned long addr, unsigned long pfn,
> 				     unsigned long size, pgprot_t prot)
> {
> 	return remap_pfn_range(vma, addr, pfn, size, pgprot_decrypted(prot));
> }
>
> Not seeing how that could change the pgprot in any harmful way?
>
> Yi, are you using a platform where sme_me_mask != 0 ?
>
> That code looks clearly like it would only trigger on AMD SME systems,
> is that what you are using?

Can't be, the system is too old:

[  398.455914] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/05/2016

I'm at a total loss how this change could even do anything on a non-AMD
system, let alone how this intersects in any way with DEVDAX, which I
could not find being used with io_remap_pfn_range()

How confident are you in the bisection?

Jason
Re: [PATCH v3 00/23] device-dax: Support sub-dividing soft-reserved ranges
On Fri, Jul 31, 2020 at 08:24:58PM -0700, Dan Williams wrote:
> - Fix test_hmm and other compilation fixups (Ralph)

The hmm parts look OK

Acked-by: Jason Gunthorpe

Jason
Re: [PATCH v2 11/12] PM, libnvdimm: Add 'mem-quiet' state and callback for firmware activation
On Thu, Jul 09, 2020 at 09:10:06AM -0700, Dan Williams wrote:
> On Thu, Jul 9, 2020 at 8:39 AM Jason Gunthorpe wrote:
> >
> > On Thu, Jul 09, 2020 at 04:00:51PM +0100, Christoph Hellwig wrote:
> > > On Mon, Jul 06, 2020 at 06:59:32PM -0700, Dan Williams wrote:
> > > > The runtime firmware activation capability of Intel NVDIMM devices
> > > > requires memory transactions to be disabled for 100s of microseconds.
> > > > This timeout is large enough to cause in-flight DMA to fail and other
> > > > application detectable timeouts. Arrange for firmware activation to be
> > > > executed while the system is "quiesced", all processes and device-DMA
> > > > frozen.
> > > >
> > > > It is already required that invoking device ->freeze() callbacks is
> > > > sufficient to cease DMA. A device that continues memory writes outside
> > > > of user-direction violates expectations of the PM core to
> > > > establish a coherent hibernation image.
> > > >
> > > > That said, RDMA devices are an example of a device that accesses
> > > > memory outside of user process direction.
> >
> > Are you saying freeze doesn't work for some RDMA drivers? That would
> > be a driver bug, I think.
>
> Right, it's more my hunch than a known bug at this point, but in my
> experience with testing server class hardware when I've reported
> power management bugs I've sometimes got the incredulous response "who
> suspends / hibernates servers!?". I can drop that comment.
>
> Are there protocol timeouts that might need to be adjusted for a 100s
> of microseconds blip in memory controller response?

Survivability depends a lot on HW support; it has to suspend, not
discard, DMAs that it needs to issue. Most likely things are as you
say, and HW doesn't support safe short time suspend.

The usual use of PM stuff here is to make the machine ready for kexec

Jason
Re: [PATCH v2 11/12] PM, libnvdimm: Add 'mem-quiet' state and callback for firmware activation
On Thu, Jul 09, 2020 at 04:00:51PM +0100, Christoph Hellwig wrote:
> On Mon, Jul 06, 2020 at 06:59:32PM -0700, Dan Williams wrote:
> > The runtime firmware activation capability of Intel NVDIMM devices
> > requires memory transactions to be disabled for 100s of microseconds.
> > This timeout is large enough to cause in-flight DMA to fail and other
> > application detectable timeouts. Arrange for firmware activation to be
> > executed while the system is "quiesced", all processes and device-DMA
> > frozen.
> >
> > It is already required that invoking device ->freeze() callbacks is
> > sufficient to cease DMA. A device that continues memory writes outside
> > of user-direction violates expectations of the PM core to
> > establish a coherent hibernation image.
> >
> > That said, RDMA devices are an example of a device that accesses
> > memory outside of user process direction.

Are you saying freeze doesn't work for some RDMA drivers? That would
be a driver bug, I think.

The consequences of doing freeze are pretty serious, but it should
still stop DMA.

Jason
Re: [PATCH v3] libnvdimm: Enable unit test infrastructure compile checks
On Sat, Aug 31, 2019 at 11:52:47AM -0700, Dan Williams wrote:
> The infrastructure to mock core libnvdimm routines for unit testing
> purposes is prone to bitrot relative to refactoring of that core.
> Arrange for the unit test core to be built when CONFIG_COMPILE_TEST=y.
> This does not result in a functional unit test environment, it is only a
> helper for 0day to catch unit test build regressions.
>
> Note that there are a few x86isms in the implementation, so this does
> not bother compile testing on architectures other than 64-bit x86.
>
> Cc: Jérôme Glisse
> Cc: Jason Gunthorpe
> Reported-by: Christoph Hellwig
> Signed-off-by: Dan Williams
> Link: https://lore.kernel.org/r/156097224232.1086847.9463861924683372741.st...@dwillia2-desk3.amr.corp.intel.com
>
> Changes since v2:
> - Fixed a 0day report when the unit test infrastructure is built-in.
>   Arrange for it to only compile as a module. This has received a build
>   success notification from the robot over 142 configs.
>
> Hi Jason,
>
> As we discussed previously please take this through the hmm tree to give
> compile coverage for devm_memremap_pages() updates.

Okay, applied again and 0-day passed

Thanks,
Jason
Re: [PATCH 2/4] memremap: remove the dev field in struct dev_pagemap
On Wed, Aug 21, 2019 at 01:24:20PM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 20, 2019 at 07:58:22PM -0700, Dan Williams wrote:
> > On Tue, Aug 20, 2019 at 6:27 AM Jason Gunthorpe wrote:
> > >
> > > On Mon, Aug 19, 2019 at 06:44:02PM -0700, Dan Williams wrote:
> > > > On Sun, Aug 18, 2019 at 2:12 AM Christoph Hellwig wrote:
> > > > >
> > > > > The dev field in struct dev_pagemap is only used to print dev_name in
> > > > > two places, which are at best nice to have. Just remove the field
> > > > > and thus the name in those two messages.
> > > > >
> > > > > Signed-off-by: Christoph Hellwig
> > > > > Reviewed-by: Ira Weiny
> > > >
> > > > Needs the below as well.
> > > >
> > > > /me goes to check if he ever merged the fix to make the unit test
> > > > stuff get built by default with COMPILE_TEST [1]. Argh! Nope, didn't
> > > > submit it for 5.3-rc1, sorry for the thrash.
> > > >
> > > > You can otherwise add:
> > > >
> > > > Reviewed-by: Dan Williams
> > > >
> > > > [1]: https://lore.kernel.org/lkml/156097224232.1086847.9463861924683372741.st...@dwillia2-desk3.amr.corp.intel.com/
> > >
> > > Can you get this merged? Do you want it to go with this series?
> >
> > Yeah, makes some sense to let you merge it so that you can get
> > kbuild-robot reports about any follow-on memremap_pages() work that
> > may trip up the build. Otherwise let me know and I'll get it queued
> > with the other v5.4 libnvdimm pending bits.
>
> Done, I used it already to test build the last series from CH..

It failed 0-day, I'm guessing some missing kconfig stuff

For now I dropped it, but, if you send a v2 I can forward it toward
0-day again!

Thanks,
Jason
Re: [PATCH 2/4] memremap: remove the dev field in struct dev_pagemap
On Tue, Aug 20, 2019 at 07:58:22PM -0700, Dan Williams wrote:
> On Tue, Aug 20, 2019 at 6:27 AM Jason Gunthorpe wrote:
> >
> > On Mon, Aug 19, 2019 at 06:44:02PM -0700, Dan Williams wrote:
> > > On Sun, Aug 18, 2019 at 2:12 AM Christoph Hellwig wrote:
> > > >
> > > > The dev field in struct dev_pagemap is only used to print dev_name in
> > > > two places, which are at best nice to have. Just remove the field
> > > > and thus the name in those two messages.
> > > >
> > > > Signed-off-by: Christoph Hellwig
> > > > Reviewed-by: Ira Weiny
> > >
> > > Needs the below as well.
> > >
> > > /me goes to check if he ever merged the fix to make the unit test
> > > stuff get built by default with COMPILE_TEST [1]. Argh! Nope, didn't
> > > submit it for 5.3-rc1, sorry for the thrash.
> > >
> > > You can otherwise add:
> > >
> > > Reviewed-by: Dan Williams
> > >
> > > [1]: https://lore.kernel.org/lkml/156097224232.1086847.9463861924683372741.st...@dwillia2-desk3.amr.corp.intel.com/
> >
> > Can you get this merged? Do you want it to go with this series?
>
> Yeah, makes some sense to let you merge it so that you can get
> kbuild-robot reports about any follow-on memremap_pages() work that
> may trip up the build. Otherwise let me know and I'll get it queued
> with the other v5.4 libnvdimm pending bits.

Done, I used it already to test build the last series from CH..

Jason
Re: [PATCH 2/4] memremap: remove the dev field in struct dev_pagemap
On Mon, Aug 19, 2019 at 06:44:02PM -0700, Dan Williams wrote:
> On Sun, Aug 18, 2019 at 2:12 AM Christoph Hellwig wrote:
> >
> > The dev field in struct dev_pagemap is only used to print dev_name in
> > two places, which are at best nice to have. Just remove the field
> > and thus the name in those two messages.
> >
> > Signed-off-by: Christoph Hellwig
> > Reviewed-by: Ira Weiny
>
> Needs the below as well.
>
> /me goes to check if he ever merged the fix to make the unit test
> stuff get built by default with COMPILE_TEST [1]. Argh! Nope, didn't
> submit it for 5.3-rc1, sorry for the thrash.
>
> You can otherwise add:
>
> Reviewed-by: Dan Williams
>
> [1]: https://lore.kernel.org/lkml/156097224232.1086847.9463861924683372741.st...@dwillia2-desk3.amr.corp.intel.com/

Can you get this merged? Do you want it to go with this series?

Jason
Re: add a not device managed memremap_pages v3
On Sun, Aug 18, 2019 at 11:05:53AM +0200, Christoph Hellwig wrote:
> Hi Dan and Jason,
>
> Bharata has been working on secure page management for kvmppc guests,
> and one thing I noticed is that he had to fake up a struct device
> just so that it could be passed to the devm_memremap_pages
> infrastructure for device private memory.
>
> This series adds non-device managed versions of the
> devm_request_free_mem_region and devm_memremap_pages functions for
> his use case.
>
> Changes since v2:
> - improved changelogs that take the v2 changes into account
>
> Changes since v1:
> - don't overload devm_request_free_mem_region
> - export the memremap_pages and munmap_pages as kvmppc can be a module

Looks good, I fixed up the patch with Dan's note and reviewed them as
well. Applied to hmm.git as requested

Thanks,
Jason
Re: add a not device managed memremap_pages v2
On Fri, Aug 16, 2019 at 02:36:07PM +0200, Christoph Hellwig wrote:
> > > Changes since v1:
> > > - don't overload devm_request_free_mem_region
> > > - export the memremap_pages and munmap_pages as kvmppc can be a module
> >
> > What tree do we want this to go through? Dan are you running a pgmap
> > tree still? Do we know of any conflicts?
>
> The last changes in this area went through the hmm tree. There are
> no known conflicts, and the kvmppc driver that needs this already
> has a dependency on the hmm tree for the migrate_vma_* changes.

OK by me, Dan can you ack or review?

Thanks
Jason
Re: add a not device managed memremap_pages v2
On Fri, Aug 16, 2019 at 08:54:30AM +0200, Christoph Hellwig wrote:
> Hi Dan and Jason,
>
> Bharata has been working on secure page management for kvmppc guests,
> and one thing I noticed is that he had to fake up a struct device
> just so that it could be passed to the devm_memremap_pages
> infrastructure for device private memory.
>
> This series adds non-device managed versions of the
> devm_request_free_mem_region and devm_memremap_pages functions for
> his use case.
>
> Changes since v1:
> - don't overload devm_request_free_mem_region
> - export the memremap_pages and munmap_pages as kvmppc can be a module

What tree do we want this to go through? Dan are you running a pgmap
tree still? Do we know of any conflicts?

Thanks,
Jason
Re: [PATCH 5/5] memremap: provide a not device managed memremap_pages
On Sun, Aug 11, 2019 at 10:12:47AM +0200, Christoph Hellwig wrote:
> The kvmppc ultravisor code wants a device private memory pool that is
> system wide and not attached to a device. Instead of faking up one
> provide a low-level memremap_pages for it. Note that this function is
> not exported, and doesn't have a cleanup routine associated with it to
> discourage use from more driver like users.
>
> Signed-off-by: Christoph Hellwig
>  include/linux/memremap.h |  1 +
>  mm/memremap.c            | 74
>  2 files changed, 45 insertions(+), 30 deletions(-)
>
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 8f0013e18e14..eac23e88a94a 100644
> +++ b/include/linux/memremap.h
> @@ -123,6 +123,7 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
>  }
>
>  #ifdef CONFIG_ZONE_DEVICE
> +void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
>  void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap);
>  struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 09a087ca30ff..7b7575330db4 100644
> +++ b/mm/memremap.c
> @@ -137,27 +137,12 @@ static void dev_pagemap_percpu_release(struct percpu_ref *ref)
>  	complete(&pgmap->done);
>  }
>
> -/**
> - * devm_memremap_pages - remap and provide memmap backing for the given resource
> - * @dev: hosting device for @res
> - * @pgmap: pointer to a struct dev_pagemap
> - *
> - * Notes:
> - * 1/ At a minimum the res and type members of @pgmap must be initialized
> - *    by the caller before passing it to this function
> - *
> - * 2/ The altmap field may optionally be initialized, in which case
> - *    PGMAP_ALTMAP_VALID must be set in pgmap->flags.
> - *
> - * 3/ The ref field may optionally be provided, in which pgmap->ref must be
> - *    'live' on entry and will be killed and reaped at
> - *    devm_memremap_pages_release() time, or if this routine fails.
> - *
> - * 4/ res is expected to be a host memory range that could feasibly be
> - *    treated as a "System RAM" range, i.e. not a device mmio range, but
> - *    this is not enforced.
> +/*
> + * This version is not intended for system resources only, and there is no

Was 'is not' what was intended here? I'm having a hard time reading
this.

Jason
Re: [PATCH 2/5] resource: add a not device managed request_free_mem_region variant
On Sun, Aug 11, 2019 at 10:12:44AM +0200, Christoph Hellwig wrote:
> Just add a simple macro that passes a NULL dev argument to
> dev_request_free_mem_region, and call request_mem_region in the
> function for that particular case.
>
> Signed-off-by: Christoph Hellwig
>  include/linux/ioport.h | 2 ++
>  kernel/resource.c      | 5 -
>  2 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index 0dcc48cafa80..528ae6cbb1b4 100644
> +++ b/include/linux/ioport.h
> @@ -297,6 +297,8 @@ static inline bool resource_overlaps(struct resource *r1, struct resource *r2)
>
>  struct resource *devm_request_free_mem_region(struct device *dev,
>  		struct resource *base, unsigned long size, const char *name);
> +#define request_free_mem_region(base, size, name) \
> +	devm_request_free_mem_region(NULL, base, size, name)
>
>  #endif /* __ASSEMBLY__ */
>  #endif /* _LINUX_IOPORT_H */
> diff --git a/kernel/resource.c b/kernel/resource.c
> index 0ddc558586a7..3a826b3cc883 100644
> +++ b/kernel/resource.c
> @@ -1671,7 +1671,10 @@ struct resource *devm_request_free_mem_region(struct device *dev,
>  				REGION_DISJOINT)
>  			continue;
>
> -		res = devm_request_mem_region(dev, addr, size, name);
> +		if (dev)
> +			res = devm_request_mem_region(dev, addr, size, name);
> +		else
> +			res = request_mem_region(addr, size, name);

It is a bit jarring to have something called devm_* that doesn't
actually do the devm_ part on some paths. Maybe this function should
be called __request_free_mem_region() with another name wrapper macro?

Jason
Re: dev_pagemap related cleanups v4
On Tue, Jul 02, 2019 at 04:17:48PM -0700, Dan Williams wrote:
> On Tue, Jul 2, 2019 at 11:42 AM Jason Gunthorpe wrote:
> >
> > On Mon, Jul 01, 2019 at 10:25:17AM +0200, Christoph Hellwig wrote:
> > > And I've demonstrated that I can't send patch series.. While this
> > > has all the right patches, it also has the extra patches already
> > > in the hmm tree, and four extra patches I wanted to send once
> > > this series is merged. I'll give up for now, please use the git
> > > url for anything serious, as it contains the right thing.
> >
> > Okay, I sorted it all out and temporarily put it here:
> >
> > https://github.com/jgunthorpe/linux/commits/hmm
> >
> > Bit involved job:
> >  - Took Ira's v4 patch into hmm.git and confirmed it matches what
> >    Andrew has in linux-next after all the fixups
> >  - Checked your github v4 and the v3 that hit the mailing list were
> >    substantially similar (I never did get a clean v4) and largely
> >    went with the github version
> >  - Based CH's v4 series on -rc7 and put back the removal hunk in swap.c
> >    so it compiles
> >  - Merge'd CH's series to hmm.git and fixed all the conflicts with Ira
> >    and Ralph's patches (such that swap.c remains unchanged)
> >  - Added Dan's ack's and tested-by's
>
> Looks good. Test merge (with some collisions, see below) also passes
> my test suite.

Okay, published toward linux-next now

> > I think this fairly closely follows what was posted to the mailing
> > list.
> >
> > As it was more than a simple 'git am', I'll let it sit on github until
> > I hear OK's then I'll move it to kernel.org's hmm.git and it will hit
> > linux-next. 0-day should also run on this whole thing from my github.
> >
> > What I know is outstanding:
> >  - The conflicting ARM patches, I understand Andrew will handle these
> >    post-linux-next
> >  - The conflict with AMD GPU in -next, I am waiting to hear from AMD
>
> Just a heads up that this also collides with the "sub-section" patches
> in Andrew's tree. The resolution is straightforward, mostly just
> colliding updates to arch_{add,remove}_memory() call sites in
> kernel/memremap.c and collisions with pgmap_altmap() usage.

Okay, thanks

Jason
Re: dev_pagemap related cleanups v4
On Mon, Jul 01, 2019 at 10:25:17AM +0200, Christoph Hellwig wrote:
> And I've demonstrated that I can't send patch series.. While this
> has all the right patches, it also has the extra patches already
> in the hmm tree, and four extra patches I wanted to send once
> this series is merged. I'll give up for now, please use the git
> url for anything serious, as it contains the right thing.

Okay, I sorted it all out and temporarily put it here:

https://github.com/jgunthorpe/linux/commits/hmm

Bit involved job:
 - Took Ira's v4 patch into hmm.git and confirmed it matches what
   Andrew has in linux-next after all the fixups
 - Checked your github v4 and the v3 that hit the mailing list were
   substantially similar (I never did get a clean v4) and largely
   went with the github version
 - Based CH's v4 series on -rc7 and put back the removal hunk in swap.c
   so it compiles
 - Merge'd CH's series to hmm.git and fixed all the conflicts with Ira
   and Ralph's patches (such that swap.c remains unchanged)
 - Added Dan's ack's and tested-by's

I think this fairly closely follows what was posted to the mailing
list.

As it was more than a simple 'git am', I'll let it sit on github until
I hear OK's then I'll move it to kernel.org's hmm.git and it will hit
linux-next. 0-day should also run on this whole thing from my github.

What I know is outstanding:
 - The conflicting ARM patches, I understand Andrew will handle these
   post-linux-next
 - The conflict with AMD GPU in -next, I am waiting to hear from AMD

Otherwise I think we are done with hmm.git for this cycle.

Unfortunately this is still not enough to progress rdma's ODP, so we
will need to do this again next cycle :(

I'll be working on patches once I get all the merge window prep I have
to do done.

Regards,
Jason
Re: [PATCH 16/25] device-dax: use the dev_pagemap internal refcount
On Fri, Jun 28, 2019 at 10:10:12AM -0700, Dan Williams wrote:
> On Fri, Jun 28, 2019 at 10:08 AM Dan Williams wrote:
> >
> > On Fri, Jun 28, 2019 at 10:02 AM Jason Gunthorpe wrote:
> > >
> > > On Fri, Jun 28, 2019 at 09:27:44AM -0700, Dan Williams wrote:
> > > > On Fri, Jun 28, 2019 at 8:39 AM Jason Gunthorpe wrote:
> > > > >
> > > > > On Wed, Jun 26, 2019 at 02:27:15PM +0200, Christoph Hellwig wrote:
> > > > > > The functionality is identical to the one currently open coded in
> > > > > > device-dax.
> > > > > >
> > > > > > Signed-off-by: Christoph Hellwig
> > > > > > Reviewed-by: Ira Weiny
> > > > > >  drivers/dax/dax-private.h |  4
> > > > > >  drivers/dax/device.c      | 43 ---
> > > > > >  2 files changed, 47 deletions(-)
> > > > >
> > > > > DanW: I think this series has reached enough review, did you want
> > > > > to ack/test any further?
> > > > >
> > > > > This needs to land in hmm.git soon to make the merge window.
> > > >
> > > > I was awaiting a decision about resolving the collision with Ira's
> > > > patch before testing the final result again [1]. You can go ahead and
> > > > add my reviewed-by for the series, but my tested-by should be on the
> > > > final state of the series.
> > >
> > > The conflict looks OK to me, I think we can let Andrew and Linus
> > > resolve it.
> >
> > Andrew's tree effectively always rebases since it's a quilt series.
> > I'd recommend pulling Ira's patch out of -mm and applying it with the
> > rest of hmm reworks. Any other git tree I'd agree with just doing the
> > late conflict resolution, but I'm not clear on what's the best
> > practice when conflicting with -mm.

What happens depends on timing as things arrive to Linus. I promised
to send hmm.git early, so I understand that Andrew will quilt rebase
his tree to Linus's and fix the conflict in Ira's patch before he
sends it.

> Regardless the patch is buggy. If you want to do the conflict
> resolution it should be because the DEVICE_PUBLIC removal effectively
> does the same fix otherwise we're knowingly leaving a broken point in
> the history.

I'm not sure I understand your concern, is there something wrong with
CH's series as it stands? hmm is a non-rebasing git tree, so as long
as the series is correct *when I apply it* there is no broken history.

I assumed the conflict resolution for Ira's patch was to simply take
the deletion of the if block from CH's series - right?

If we do need to take Ira's patch into hmm.git it will go after CH's
series (and Ira will have to rebase/repost it), so I think there is
nothing to do at this moment - unless you are saying there is a
problem with the series in CH's git tree?

Regards,
Jason
Re: [PATCH 16/25] device-dax: use the dev_pagemap internal refcount
On Fri, Jun 28, 2019 at 09:27:44AM -0700, Dan Williams wrote:
> On Fri, Jun 28, 2019 at 8:39 AM Jason Gunthorpe wrote:
> >
> > On Wed, Jun 26, 2019 at 02:27:15PM +0200, Christoph Hellwig wrote:
> > > The functionality is identical to the one currently open coded in
> > > device-dax.
> > >
> > > Signed-off-by: Christoph Hellwig
> > > Reviewed-by: Ira Weiny
> > >  drivers/dax/dax-private.h |  4
> > >  drivers/dax/device.c      | 43 ---
> > >  2 files changed, 47 deletions(-)
> >
> > DanW: I think this series has reached enough review, did you want
> > to ack/test any further?
> >
> > This needs to land in hmm.git soon to make the merge window.
>
> I was awaiting a decision about resolving the collision with Ira's
> patch before testing the final result again [1]. You can go ahead and
> add my reviewed-by for the series, but my tested-by should be on the
> final state of the series.

The conflict looks OK to me, I think we can let Andrew and Linus
resolve it.

Thanks,
Jason
Re: [PATCH 16/25] device-dax: use the dev_pagemap internal refcount
On Wed, Jun 26, 2019 at 02:27:15PM +0200, Christoph Hellwig wrote:
> The functionality is identical to the one currently open coded in
> device-dax.
>
> Signed-off-by: Christoph Hellwig
> Reviewed-by: Ira Weiny
> ---
>  drivers/dax/dax-private.h |  4
>  drivers/dax/device.c      | 43 ---
>  2 files changed, 47 deletions(-)

DanW: I think this series has reached enough review, did you want
to ack/test any further?

This needs to land in hmm.git soon to make the merge window.

Thanks,
Jason
Re: [PATCH 24/25] mm: remove the HMM config option
On Wed, Jun 26, 2019 at 02:27:23PM +0200, Christoph Hellwig wrote: > All the mm/hmm.c code is better keyed off HMM_MIRROR. Also let nouveau > depend on it instead of the mix of a dummy dependency symbol plus the > actually selected one. Drop various odd dependencies, as the code is > pretty portable. > > Signed-off-by: Christoph Hellwig > --- > drivers/gpu/drm/nouveau/Kconfig | 3 +-- > include/linux/hmm.h | 5 + > include/linux/mm_types.h| 2 +- > mm/Kconfig | 27 --- > mm/Makefile | 2 +- > mm/hmm.c| 2 -- > 6 files changed, 8 insertions(+), 33 deletions(-) Makes more sense to me too Reviewed-by: Jason Gunthorpe Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 12/25] memremap: add a migrate_to_ram method to struct dev_pagemap_ops
On Wed, Jun 26, 2019 at 02:27:11PM +0200, Christoph Hellwig wrote: > This replaces the hacky ->fault callback, which is currently directly > called from common code through a hmm specific data structure as an > exercise in layering violations. > > Signed-off-by: Christoph Hellwig > Reviewed-by: Ralph Campbell > --- > include/linux/hmm.h | 6 -- > include/linux/memremap.h | 6 ++ > include/linux/swapops.h | 15 --- > kernel/memremap.c| 35 --- > mm/hmm.c | 13 + > mm/memory.c | 9 ++--- > 6 files changed, 17 insertions(+), 67 deletions(-) Reviewed-by: Jason Gunthorpe I've heard there are some other use models for fault() here beyond migrate to ram, but we can rename it if we ever see them. > +static vm_fault_t hmm_devmem_migrate_to_ram(struct vm_fault *vmf) > { > - struct hmm_devmem *devmem = page->pgmap->data; > + struct hmm_devmem *devmem = vmf->page->pgmap->data; > > - return devmem->ops->fault(devmem, vma, addr, page, flags, pmdp); > + return devmem->ops->fault(devmem, vmf->vma, vmf->address, vmf->page, > + vmf->flags, vmf->pmd); > } Next cycle we should probably rename this fault to migrate_to_ram as well and pass in the vmf.. Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 20/25] mm: remove hmm_vma_alloc_locked_page
On Wed, Jun 26, 2019 at 02:27:19PM +0200, Christoph Hellwig wrote: > The only user of it has just been removed, and there wasn't really any need > to wrap a basic memory allocator to start with. > > Signed-off-by: Christoph Hellwig > --- > include/linux/hmm.h | 3 --- > mm/hmm.c| 14 -- > 2 files changed, 17 deletions(-) Reviewed-by: Jason Gunthorpe Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 03/25] mm: remove hmm_devmem_add_resource
On Wed, Jun 26, 2019 at 02:27:02PM +0200, Christoph Hellwig wrote: > This function has never been used since it was first added to the kernel > more than a year and a half ago, and if we ever grow a consumer of the > MEMORY_DEVICE_PUBLIC infrastructure it can easily use devm_memremap_pages > directly. > > Signed-off-by: Christoph Hellwig > Reviewed-by: Jason Gunthorpe > Reviewed-by: John Hubbard > Acked-by: Michal Hocko > --- > include/linux/hmm.h | 3 --- > mm/hmm.c| 50 - > 2 files changed, 53 deletions(-) This should be squashed to the new earlier patch? Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 18/22] mm: mark DEVICE_PUBLIC as broken
On Tue, Jun 25, 2019 at 09:29:15AM +0200, Christoph Hellwig wrote: > On Thu, Jun 20, 2019 at 09:26:48PM +0200, Michal Hocko wrote: > > On Thu 13-06-19 11:43:21, Christoph Hellwig wrote: > > > The code hasn't been used since it was added to the tree, and doesn't > > > appear to actually be usable. Mark it as BROKEN until either a user > > > comes along or we finally give up on it. > > > > I would go even further and simply remove all the DEVICE_PUBLIC code. > > I looked into that as I now got the feedback twice. It would > create a conflict with another tree cleaning things up around the > is_device_private definition, but otherwise I'd be glad to just remove > it. > > Jason, as this goes through your tree, do you mind the additional > conflict? Which tree and what does the resolution look like? Also, I don't want to be making the decision if we should keep/remove DEVICE_PUBLIC, so let's get an Ack from Andrew/etc? My main reluctance is that I know there is HW out there that can do coherent, and I want to believe they are coming with patches, just too slowly. But I'd also rather those people defend themselves :P Thanks, Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 18/22] mm: mark DEVICE_PUBLIC as broken
On Thu, Jun 13, 2019 at 06:23:04PM -0700, John Hubbard wrote: > On 6/13/19 5:43 PM, Ira Weiny wrote: > > On Thu, Jun 13, 2019 at 07:58:29PM +0000, Jason Gunthorpe wrote: > >> On Thu, Jun 13, 2019 at 12:53:02PM -0700, Ralph Campbell wrote: > >>> > ... > >> Hum, so the only thing this config does is short circuit here: > >> > >> static inline bool is_device_public_page(const struct page *page) > >> { > >> return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) && > >> IS_ENABLED(CONFIG_DEVICE_PUBLIC) && > >> is_zone_device_page(page) && > >> page->pgmap->type == MEMORY_DEVICE_PUBLIC; > >> } > >> > >> Which is called all over the place.. > > > > yes but the earlier patch: > > > > [PATCH 03/22] mm: remove hmm_devmem_add_resource > > > > Removes the only place type is set to MEMORY_DEVICE_PUBLIC. > > > > So I think it is ok. Frankly I was wondering if we should remove the public > > type altogether but conceptually it seems ok. But I don't see any users of > > it > > so... should we get rid of it in the code rather than turning the config > > off? > > > > Ira > > That seems reasonable. I recall that the hope was for those IBM Power 9 > systems to use _PUBLIC, as they have hardware-based coherent device (GPU) > memory, and so the memory really is visible to the CPU. And the IBM team > was thinking of taking advantage of it. But I haven't seen anything on > that front for a while. Does anyone know who those people are and can we encourage them to send some patches? :) Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: dev_pagemap related cleanups
On Thu, Jun 13, 2019 at 11:21:01PM +0200, Christoph Hellwig wrote: > On Thu, Jun 13, 2019 at 08:40:46PM +0000, Jason Gunthorpe wrote: > > > Perhaps we should pull those out and resend them through hmm.git? > > > > It could be done - but how bad is the conflict resolution? > > Trivial. All but one patch just apply using git-am, and the other one > just has a few lines of offsets. Okay, NP then, trivial ones are OK to send to Linus.. If Andrew gets them into -rc5 then I will get rc5 into hmm.git next week. Thanks, Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: dev_pagemap related cleanups
On Thu, Jun 13, 2019 at 11:27:39AM -0700, Dan Williams wrote: > On Thu, Jun 13, 2019 at 2:43 AM Christoph Hellwig wrote: > > > > Hi Dan, Jérôme and Jason, > > > > below is a series that cleans up the dev_pagemap interface so that > > it is more easily usable, which removes the need to wrap it in hmm > > and thus allowing to kill a lot of code > > > > Diffstat: > > > > 22 files changed, 245 insertions(+), 802 deletions(-) > > Hooray! > > > Git tree: > > > > git://git.infradead.org/users/hch/misc.git hmm-devmem-cleanup > > I just realized this collides with the dev_pagemap release rework in > Andrew's tree (commit ids below are from next.git and are not stable) > > 4422ee8476f0 mm/devm_memremap_pages: fix final page put race > 771f0714d0dc PCI/P2PDMA: track pgmap references per resource, not globally > af37085de906 lib/genalloc: introduce chunk owners > e0047ff8aa77 PCI/P2PDMA: fix the gen_pool_add_virt() failure path > 0315d47d6ae9 mm/devm_memremap_pages: introduce devm_memunmap_pages > 216475c7eaa8 drivers/base/devres: introduce devm_release_action() > > CONFLICT (content): Merge conflict in tools/testing/nvdimm/test/iomap.c > CONFLICT (content): Merge conflict in mm/hmm.c > CONFLICT (content): Merge conflict in kernel/memremap.c > CONFLICT (content): Merge conflict in include/linux/memremap.h > CONFLICT (content): Merge conflict in drivers/pci/p2pdma.c > CONFLICT (content): Merge conflict in drivers/nvdimm/pmem.c > CONFLICT (content): Merge conflict in drivers/dax/device.c > CONFLICT (content): Merge conflict in drivers/dax/dax-private.h > > Perhaps we should pull those out and resend them through hmm.git? It could be done - but how bad is the conflict resolution? I'd be more comfortable to take a PR from you to merge into hmm.git, rather than raw patches, then apply CH's series on top. I think. That way if something goes wrong you can send your PR to Linus directly. 
> It also turns out the nvdimm unit tests crash with this signature on > that branch where base v5.2-rc3 passes: > > BUG: kernel NULL pointer dereference, address: 0008 > [..] > CPU: 15 PID: 1414 Comm: lt-libndctl Tainted: G OE > 5.2.0-rc3+ #3399 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 > 02/06/2015 > RIP: 0010:percpu_ref_kill_and_confirm+0x1e/0x180 > [..] > Call Trace: > release_nodes+0x234/0x280 > device_release_driver_internal+0xe8/0x1b0 > bus_remove_device+0xf2/0x160 > device_del+0x166/0x370 > unregister_dev_dax+0x23/0x50 > release_nodes+0x234/0x280 > device_release_driver_internal+0xe8/0x1b0 > unbind_store+0x94/0x120 > kernfs_fop_write+0xf0/0x1a0 > vfs_write+0xb7/0x1b0 > ksys_write+0x5c/0xd0 > do_syscall_64+0x60/0x240 Too bad the trace didn't say which devm cleanup triggered it.. Did dev_pagemap_percpu_exit get called with a NULL pgmap->ref ? Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 22/22] mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
On Thu, Jun 13, 2019 at 11:43:25AM +0200, Christoph Hellwig wrote: > The migrate_vma helper is only used by nouveau to migrate device private > pages around. Other HMM_MIRROR users like amdgpu or infiniband don't > need it. > > Signed-off-by: Christoph Hellwig > --- > drivers/gpu/drm/nouveau/Kconfig | 1 + > mm/Kconfig | 1 - > 2 files changed, 1 insertion(+), 1 deletion(-) Yes, the thing that calls migrate_vma() should be the thing that has the kconfig stuff. Reviewed-by: Jason Gunthorpe Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 21/22] mm: remove the HMM config option
On Thu, Jun 13, 2019 at 11:43:24AM +0200, Christoph Hellwig wrote: > All the mm/hmm.c code is better keyed off HMM_MIRROR. Also let nouveau > depend on it instead of the mix of a dummy dependency symbol plus the > actually selected one. Drop various odd dependencies, as the code is > pretty portable. I don't really know, but I thought this needed the arch restriction for the same reason get_user_pages has various unique arch specific implementations (it does seem to have some open coded GUP like thing)? I was hoping we could do this after your common gup series? But sooner is better too. Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 18/22] mm: mark DEVICE_PUBLIC as broken
On Thu, Jun 13, 2019 at 12:53:02PM -0700, Ralph Campbell wrote: > > On 6/13/19 12:44 PM, Jason Gunthorpe wrote: > > On Thu, Jun 13, 2019 at 11:43:21AM +0200, Christoph Hellwig wrote: > > > The code hasn't been used since it was added to the tree, and doesn't > > > appear to actually be usable. Mark it as BROKEN until either a user > > > comes along or we finally give up on it. > > > > > > Signed-off-by: Christoph Hellwig > > > mm/Kconfig | 1 + > > > 1 file changed, 1 insertion(+) > > > > > > diff --git a/mm/Kconfig b/mm/Kconfig > > > index 0d2ba7e1f43e..406fa45e9ecc 100644 > > > +++ b/mm/Kconfig > > > @@ -721,6 +721,7 @@ config DEVICE_PRIVATE > > > config DEVICE_PUBLIC > > > bool "Addressable device memory (like GPU memory)" > > > depends on ARCH_HAS_HMM > > > + depends on BROKEN > > > select HMM > > > select DEV_PAGEMAP_OPS > > > > This seems a bit harsh, we do have another kconfig that selects this > > one today: > > > > config DRM_NOUVEAU_SVM > > bool "(EXPERIMENTAL) Enable SVM (Shared Virtual Memory) support" > > depends on ARCH_HAS_HMM > > depends on DRM_NOUVEAU > > depends on STAGING > > select HMM_MIRROR > > select DEVICE_PRIVATE > > default n > > help > >Say Y here if you want to enable experimental support for > >Shared Virtual Memory (SVM). > > > > Maybe it should be depends on STAGING not broken? > > > > or maybe nouveau_svm doesn't actually need DEVICE_PRIVATE? > > > > Jason > > I think you are confusing DEVICE_PRIVATE for DEVICE_PUBLIC. > DRM_NOUVEAU_SVM does use DEVICE_PRIVATE but not DEVICE_PUBLIC. Indeed you are correct, never mind Hum, so the only thing this config does is short circuit here: static inline bool is_device_public_page(const struct page *page) { return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) && IS_ENABLED(CONFIG_DEVICE_PUBLIC) && is_zone_device_page(page) && page->pgmap->type == MEMORY_DEVICE_PUBLIC; } Which is called all over the place.. So, yes, we really don't want any distro or something to turn this on until it has a use. 
Reviewed-by: Jason Gunthorpe Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
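The short circuit Jason points at works because IS_ENABLED() folds to an integer constant during preprocessing, so with CONFIG_DEVICE_PUBLIC hidden behind BROKEN the whole predicate becomes constant-false and the per-page loads compile away at every call site. A minimal userspace sketch of that pattern, with stand-in config symbols and a toy page struct (none of these names are the kernel's; the macro machinery is a simplified copy of include/linux/kconfig.h):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for Kconfig symbols: defined to 1 when enabled, absent when
 * not.  CONFIG_DEVICE_PUBLIC is deliberately left undefined, as if
 * "depends on BROKEN" had hidden it from every config. */
#define CONFIG_DEV_PAGEMAP_OPS 1

/* Simplified IS_ENABLED(): expands to the constant 1 if the option is
 * defined to 1, else to the constant 0, without any #ifdef at the use
 * site. */
#define __ARG_PLACEHOLDER_1 0,
#define __take_second_arg(__ignored, val, ...) val
#define ____is_defined(arg1_or_junk) __take_second_arg(arg1_or_junk 1, 0)
#define ___is_defined(val) ____is_defined(__ARG_PLACEHOLDER_##val)
#define __is_defined(x) ___is_defined(x)
#define IS_ENABLED(option) __is_defined(option)

/* Toy stand-in for struct page: only the fields the predicate reads. */
enum pgmap_type { MEMORY_DEVICE_PRIVATE, MEMORY_DEVICE_PUBLIC };
struct toy_page { bool zone_device; enum pgmap_type type; };

static bool is_device_public_page(const struct toy_page *page)
{
	/* With CONFIG_DEVICE_PUBLIC off this is constant-false, so the
	 * compiler never emits the two per-page checks that follow. */
	return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
	       IS_ENABLED(CONFIG_DEVICE_PUBLIC) &&
	       page->zone_device &&
	       page->type == MEMORY_DEVICE_PUBLIC;
}
```

Even a page that looks exactly like device-public memory tests false while the option is off, which is the "distro must not turn this on until it has a use" point above.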
Re: [PATCH 20/22] mm: sort out the DEVICE_PRIVATE Kconfig mess
On Thu, Jun 13, 2019 at 11:43:23AM +0200, Christoph Hellwig wrote: > The ZONE_DEVICE support doesn't depend on anything HMM related, just on > various bits of arch support as indicated by the architecture. Also > don't select the option from nouveau as it isn't present in many setups, > and depend on it instead. > > Signed-off-by: Christoph Hellwig > drivers/gpu/drm/nouveau/Kconfig | 2 +- > mm/Kconfig | 5 ++--- > 2 files changed, 3 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/nouveau/Kconfig b/drivers/gpu/drm/nouveau/Kconfig > index dba2613f7180..6303d203ab1d 100644 > +++ b/drivers/gpu/drm/nouveau/Kconfig > @@ -85,10 +85,10 @@ config DRM_NOUVEAU_BACKLIGHT > config DRM_NOUVEAU_SVM > bool "(EXPERIMENTAL) Enable SVM (Shared Virtual Memory) support" > depends on ARCH_HAS_HMM > + depends on DEVICE_PRIVATE > depends on DRM_NOUVEAU > depends on STAGING > select HMM_MIRROR > - select DEVICE_PRIVATE > default n > help > Say Y here if you want to enable experimental support for Ben, I heard you might have a patch like this in your tree, but I don't think I could find your tree?? Do you have any nouveau/Kconfig patches that might conflict? Thanks Does this fix the randconfigs failures that have been reported? > diff --git a/mm/Kconfig b/mm/Kconfig > index 406fa45e9ecc..4dbd718c8cf4 100644 > +++ b/mm/Kconfig > @@ -677,13 +677,13 @@ config ARCH_HAS_HMM_MIRROR > > config ARCH_HAS_HMM > bool > - default y > depends on (X86_64 || PPC64) > depends on ZONE_DEVICE > depends on MMU && 64BIT > depends on MEMORY_HOTPLUG > depends on MEMORY_HOTREMOVE > depends on SPARSEMEM_VMEMMAP > + default y What is the reason we have this ARCH thing anyhow? Does hmm have arch dependencies someplace? Thanks Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 18/22] mm: mark DEVICE_PUBLIC as broken
On Thu, Jun 13, 2019 at 11:43:21AM +0200, Christoph Hellwig wrote: > The code hasn't been used since it was added to the tree, and doesn't > appear to actually be usable. Mark it as BROKEN until either a user > comes along or we finally give up on it. > > Signed-off-by: Christoph Hellwig > mm/Kconfig | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/mm/Kconfig b/mm/Kconfig > index 0d2ba7e1f43e..406fa45e9ecc 100644 > +++ b/mm/Kconfig > @@ -721,6 +721,7 @@ config DEVICE_PRIVATE > config DEVICE_PUBLIC > bool "Addressable device memory (like GPU memory)" > depends on ARCH_HAS_HMM > + depends on BROKEN > select HMM > select DEV_PAGEMAP_OPS This seems a bit harsh, we do have another kconfig that selects this one today: config DRM_NOUVEAU_SVM bool "(EXPERIMENTAL) Enable SVM (Shared Virtual Memory) support" depends on ARCH_HAS_HMM depends on DRM_NOUVEAU depends on STAGING select HMM_MIRROR select DEVICE_PRIVATE default n help Say Y here if you want to enable experimental support for Shared Virtual Memory (SVM). Maybe it should be depends on STAGING not broken? or maybe nouveau_svm doesn't actually need DEVICE_PRIVATE? Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 17/22] mm: remove hmm_devmem_add
On Thu, Jun 13, 2019 at 11:43:20AM +0200, Christoph Hellwig wrote: > There isn't really much value add in the hmm_devmem_add wrapper. Just > factor out a little helper to find the resource, and otherwise let the > driver implement the dev_pagemap_ops directly. Was this commit message written when other patches were squashed in here? I think the helper this mentions was from an earlier patch > Signed-off-by: Christoph Hellwig > --- > Documentation/vm/hmm.rst | 26 > include/linux/hmm.h | 129 --- > mm/hmm.c | 115 -- > 3 files changed, 270 deletions(-) I looked for in-flight patches that might be using these APIs and found nothing. To be sent patches can use the new API with no loss in functionality... Reviewed-by: Jason Gunthorpe Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 14/22] nouveau: use alloc_page_vma directly
On Thu, Jun 13, 2019 at 11:43:17AM +0200, Christoph Hellwig wrote: > hmm_vma_alloc_locked_page is scheduled to go away, use the proper > mm function directly. > > Signed-off-by: Christoph Hellwig > --- > drivers/gpu/drm/nouveau/nouveau_dmem.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) Reviewed-by: Jason Gunthorpe Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 11/22] memremap: remove the data field in struct dev_pagemap
On Thu, Jun 13, 2019 at 11:43:14AM +0200, Christoph Hellwig wrote: > struct dev_pagemap is always embedded into a containing structure, so > there is no need to an additional private data field. > > Signed-off-by: Christoph Hellwig > --- > drivers/nvdimm/pmem.c| 2 +- > include/linux/memremap.h | 3 +-- > kernel/memremap.c| 2 +- > mm/hmm.c | 9 + > 4 files changed, 8 insertions(+), 8 deletions(-) Reviewed-by: Jason Gunthorpe Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 09/22] memremap: lift the devmap_enable manipulation into devm_memremap_pages
On Thu, Jun 13, 2019 at 11:43:12AM +0200, Christoph Hellwig wrote: > Just check if there is a ->page_free operation set and take care of the > static key enable, as well as the put using device managed resources. > diff --git a/mm/hmm.c b/mm/hmm.c > index c76a1b5defda..6dc769feb2e1 100644 > +++ b/mm/hmm.c > @@ -1378,8 +1378,6 @@ struct hmm_devmem *hmm_devmem_add(const struct > hmm_devmem_ops *ops, > void *result; > int ret; > > - dev_pagemap_get_ops(); > - Where was the matching dev_pagemap_put_ops() for this hmm case? This is a bug fix too? The nouveau driver is the only one to actually call this hmm function and it does it as part of a probe function. Seems reasonable, however, in the unlikely event that it fails to init 'dmem' the driver will retain a dev_pagemap_get_ops until it unloads. This imbalance doesn't seem worth worrying about. Reviewed-by: Jason Gunthorpe Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
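The imbalance discussed here, a reference taken in probe with no matching put on an early-failure path, is exactly the pairing problem devm-style undo actions solve: register the release at the moment you take the resource, and every later exit path stays balanced. A hedged userspace sketch of that pattern (the refcount, the one-slot action list, and all names are stand-ins for dev_pagemap_get_ops()/put and devm_add_action_or_reset(), not the real API):

```c
#include <assert.h>
#include <stddef.h>

static int ops_refcount;                  /* stands in for the static key count */

static void pagemap_get_ops(void) { ops_refcount++; }
static void pagemap_put_ops(void *unused) { (void)unused; ops_refcount--; }

/* Toy devm machinery: remembers one undo action and runs it at device
 * teardown (which also happens when probe ultimately fails). */
typedef void (*devm_action_t)(void *);
static devm_action_t pending_action;

static int devm_add_action_or_reset(devm_action_t action, void *data)
{
	(void)data;
	pending_action = action;
	return 0;                         /* 0 on success, like the kernel helper */
}

static void device_teardown(void)
{
	if (pending_action) {
		pending_action(NULL);
		pending_action = NULL;
	}
}

/* probe: take the reference and immediately register its undo, so even a
 * failure after this point cannot leak the reference until unload. */
static int example_probe(int fail_later)
{
	pagemap_get_ops();
	if (devm_add_action_or_reset(pagemap_put_ops, NULL))
		return -1;                /* _or_reset drops the ref itself on failure */
	if (fail_later)
		return -1;                /* devm still runs pagemap_put_ops() */
	return 0;
}
```

With this shape the "fails to init 'dmem'" case Jason describes would not retain the reference until module unload; teardown drops it.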
Re: [PATCH 08/22] memremap: pass a struct dev_pagemap to ->kill
On Thu, Jun 13, 2019 at 11:43:11AM +0200, Christoph Hellwig wrote: > Passing the actual typed structure leads to more understandable code > vs the actual references. > > Signed-off-by: Christoph Hellwig > drivers/dax/device.c | 7 +++ > drivers/nvdimm/pmem.c | 6 +++--- > drivers/pci/p2pdma.c | 6 +++--- > include/linux/memremap.h | 2 +- > kernel/memremap.c | 4 ++-- > mm/hmm.c | 4 ++-- > tools/testing/nvdimm/test/iomap.c | 6 ++ > 7 files changed, 16 insertions(+), 19 deletions(-) > > diff --git a/drivers/dax/device.c b/drivers/dax/device.c > index 4adab774dade..e23fa1bd8c97 100644 > +++ b/drivers/dax/device.c > @@ -37,13 +37,12 @@ static void dev_dax_percpu_exit(void *data) > percpu_ref_exit(ref); > } > > -static void dev_dax_percpu_kill(struct percpu_ref *data) > +static void dev_dax_percpu_kill(struct dev_pagemap *pgmap) > { Looks like it was always like this, but I also can't see a reason to use the percpu as a handle for a dev_pagemap callback. Reviewed-by: Jason Gunthorpe Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 07/22] memremap: move dev_pagemap callbacks into a separate structure
On Thu, Jun 13, 2019 at 11:43:10AM +0200, Christoph Hellwig wrote: > The dev_pagemap is a growing too many callbacks. Move them into a > separate ops structure so that they are not duplicated for multiple > instances, and an attacker can't easily overwrite them. Reviewed-by: Jason Gunthorpe Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 06/22] mm: factor out a devm_request_free_mem_region helper
On Thu, Jun 13, 2019 at 11:43:09AM +0200, Christoph Hellwig wrote: > Keep the physical address allocation that hmm_add_device does with the > rest of the resource code, and allow future reuse of it without the hmm > wrapper. > > Signed-off-by: Christoph Hellwig > include/linux/ioport.h | 2 ++ > kernel/resource.c | 39 +++ > mm/hmm.c | 33 - > 3 files changed, 45 insertions(+), 29 deletions(-) > > diff --git a/include/linux/ioport.h b/include/linux/ioport.h > index da0ebaec25f0..76a33ae3bf6c 100644 > +++ b/include/linux/ioport.h > @@ -286,6 +286,8 @@ static inline bool resource_overlaps(struct resource *r1, > struct resource *r2) > return (r1->start <= r2->end && r1->end >= r2->start); > } > > +struct resource *devm_request_free_mem_region(struct device *dev, > + struct resource *base, unsigned long size); > > #endif /* __ASSEMBLY__ */ > #endif /* _LINUX_IOPORT_H */ > diff --git a/kernel/resource.c b/kernel/resource.c > index 158f04ec1d4f..99c58134ed1c 100644 > +++ b/kernel/resource.c > @@ -1628,6 +1628,45 @@ void resource_list_free(struct list_head *head) > } > EXPORT_SYMBOL(resource_list_free); > > +#ifdef CONFIG_DEVICE_PRIVATE > +/** > + * devm_request_free_mem_region - find free region for device private memory > + * > + * @dev: device struct to bind the resource to > + * @size: size in bytes of the device memory to add > + * @base: resource tree to look in > + * > + * This function tries to find an empty range of physical address big enough > to > + * contain the new resource, so that it can later be hotplugged as ZONE_DEVICE > + * memory, which in turn allocates struct pages.
> + */ > +struct resource *devm_request_free_mem_region(struct device *dev, > + struct resource *base, unsigned long size) > +{ > + resource_size_t end, addr; > + struct resource *res; > + > + size = ALIGN(size, 1UL << PA_SECTION_SHIFT); > + end = min_t(unsigned long, base->end, (1UL << MAX_PHYSMEM_BITS) - 1); Even fixed it to use min_t > + addr = end - size + 1UL; > + for (; addr > size && addr >= base->start; addr -= size) { > + if (region_intersects(addr, size, 0, IORES_DESC_NONE) != > + REGION_DISJOINT) > + continue; The FIXME about the algorithm cost seems justified though, yikes. > + > + res = devm_request_mem_region(dev, addr, size, dev_name(dev)); > + if (!res) > + return ERR_PTR(-ENOMEM); > + res->desc = IORES_DESC_DEVICE_PRIVATE_MEMORY; I wonder if IORES_DESC_DEVICE_PRIVATE_MEMORY should be a function argument? Not really any substantive remark, so Reviewed-by: Jason Gunthorpe Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
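The cost worry above comes from the shape of that loop: it walks down from the top of the range one size-aligned step at a time, probing each candidate with region_intersects(), so a huge address space with a small size means many probes. A userspace sketch of just the search (the busy-range array and these names are stand-ins; the real code asks the iomem resource tree):

```c
#include <assert.h>
#include <stddef.h>

struct range { unsigned long start, end; };   /* inclusive bounds */

/* Stand-in for region_intersects(): does [addr, addr + size) overlap any
 * busy range? */
static int intersects(const struct range *busy, size_t n,
		      unsigned long addr, unsigned long size)
{
	for (size_t i = 0; i < n; i++)
		if (addr <= busy[i].end && addr + size - 1 >= busy[i].start)
			return 1;
	return 0;
}

/* Mirrors the loop in devm_request_free_mem_region(): start at the top
 * of `base`, step down by `size`, and return the first disjoint
 * candidate, or 0 if none fits.  One probe per step is the quadratic-ish
 * cost the FIXME is about. */
static unsigned long find_free_top_down(const struct range *base,
					const struct range *busy, size_t n,
					unsigned long size)
{
	unsigned long addr = base->end - size + 1;

	for (; addr > size && addr >= base->start; addr -= size)
		if (!intersects(busy, n, addr, size))
			return addr;
	return 0;
}
```

On a toy range with the top section busy, the search settles on the highest free size-aligned step, which matches the top-down allocation behaviour of the real helper.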
Re: [PATCH 04/22] mm: don't clear ->mapping in hmm_devmem_free
On Thu, Jun 13, 2019 at 11:43:07AM +0200, Christoph Hellwig wrote: > ->mapping isn't even used by HMM users, and the field at the same offset > in the zone_device part of the union is declared as pad. (Which btw is > rather confusing, as DAX uses ->pgmap and ->mapping from two different > sides of the union, but DAX doesn't use hmm_devmem_free). > > Signed-off-by: Christoph Hellwig > --- > mm/hmm.c | 2 -- > 1 file changed, 2 deletions(-) Hurm, is hmm following this comment from mm_types.h? * If you allocate the page using alloc_pages(), you can use some of the * space in struct page for your own purposes. The five words in the main * union are available, except for bit 0 of the first word which must be * kept clear. Many users use this word to store a pointer to an object * which is guaranteed to be aligned. If you use the same storage as * page->mapping, you must restore it to NULL before freeing the page. Maybe the assumption was that a driver is using ->mapping ? However, nouveau is the only driver that uses this path, and it never touches page->mapping either (nor in -next). It looks like if a driver were to start using mapping then the driver should be responsible to set it back to NULL before being done with the page. Reviewed-by: Jason Gunthorpe Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
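The mm_types.h rule quoted above boils down to a simple contract: a driver may borrow the ->mapping word for its own pointer while it owns the page (keeping bit 0 clear), but must put NULL back before the page returns to the allocator. A toy sketch of that contract (the struct and function names are stand-ins, not mm code):

```c
#include <assert.h>
#include <stddef.h>

struct toy_page { void *mapping; };

/* While the driver owns the page it may store an aligned private pointer
 * in the ->mapping slot, per the alloc_pages() comment in mm_types.h... */
static void driver_attach(struct toy_page *page, void *priv)
{
	page->mapping = priv;
}

/* ...but it must restore NULL before freeing, so the allocator (and any
 * later owner) never sees a stale driver pointer where a mapping is
 * expected.  free_page() would follow this in real code. */
static void driver_release(struct toy_page *page)
{
	page->mapping = NULL;
}
```

Which is the conclusion above: if a driver starts using ->mapping, the NULL-before-free responsibility is the driver's, not hmm's.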
Re: [PATCH 03/22] mm: remove hmm_devmem_add_resource
On Thu, Jun 13, 2019 at 11:43:06AM +0200, Christoph Hellwig wrote: > This function has never been used since it was first added to the kernel > more than a year and a half ago, and if we ever grow a consumer of the > MEMORY_DEVICE_PUBLIC infrastructure it can easily use devm_memremap_pages > directly now that we've simplified the API for it. nit: Have we simplified the interface for devm_memremap_pages() at this point, or are you talking about later patches in this series. I checked this and all the called functions are exported symbols, so there is no blocker for a future driver to call devm_memremap_pages(), maybe even with all this boiler plate, in future. If we eventually get many users that need some simplified registration then we should add a devm_memremap_pages_simplified() interface and factor out that code when we can see the pattern. Reviewed-by: Jason Gunthorpe Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 02/22] mm: remove the struct hmm_device infrastructure
On Thu, Jun 13, 2019 at 11:43:05AM +0200, Christoph Hellwig wrote: > This code is a trivial wrapper around device model helpers, which > should have been integrated into the driver device model usage from > the start. Assuming it actually had users, which it never had since > the code was added more than 1 1/2 years ago. > > Signed-off-by: Christoph Hellwig > --- > include/linux/hmm.h | 20 > mm/hmm.c| 80 - > 2 files changed, 100 deletions(-) I haven't looked in detail at this device memory stuff.. But I did check a bit through the mailing list archives for some clue what this was supposed to be for (wow, this is from 2016!) The commit that added this says: This introduce a dummy HMM device class so device driver can use it to create hmm_device for the sole purpose of registering device memory. Which I just can't understand at all. If we need a 'struct device' for some 'device memory' purpose then it probably ought to be the 'struct pci_device' holding the BAR, not a fake device. I also can't comprehend why a supposed fake device would need a chardev, with a standard 'hmm_deviceX' name, without also defining a core kernel ABI for that char dev.. If this comes back it needs a proper explanation and review, with a user. Reviewed-by: Jason Gunthorpe Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 01/22] mm: remove the unused ARCH_HAS_HMM_DEVICE Kconfig option
On Thu, Jun 13, 2019 at 11:43:04AM +0200, Christoph Hellwig wrote: > Signed-off-by: Christoph Hellwig > --- > mm/Kconfig | 10 -- > 1 file changed, 10 deletions(-) So long as we are willing to run a hmm.git we can merge only complete features going forward. Thus we don't need the crazy process described in the commit message that (deliberately) introduced this unused kconfig. I also tried to find something in-flight for 5.3 that would need this and found nothing Reviewed-by: Jason Gunthorpe Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: dev_pagemap related cleanups
On Thu, Jun 13, 2019 at 11:43:03AM +0200, Christoph Hellwig wrote: > Hi Dan, Jérôme and Jason, > > below is a series that cleans up the dev_pagemap interface so that > it is more easily usable, which removes the need to wrap it in hmm > and thus allowing to kill a lot of code Do you want some of this to run through hmm.git? I see many patches that don't seem to have inter-dependencies.. Thanks, Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v5 10/13] nvme-pci: Add support for P2P memory in requests
On Thu, Aug 30, 2018 at 12:53:49PM -0600, Logan Gunthorpe wrote: > For P2P requests, we must use the pci_p2pmem_map_sg() function > instead of the dma_map_sg functions. > > With that, we can then indicate PCI_P2P support in the request queue. > For this, we create an NVME_F_PCI_P2P flag which tells the core to > set QUEUE_FLAG_PCI_P2P in the request queue. > > Signed-off-by: Logan Gunthorpe > Reviewed-by: Sagi Grimberg > Reviewed-by: Christoph Hellwig > drivers/nvme/host/core.c | 4 > drivers/nvme/host/nvme.h | 1 + > drivers/nvme/host/pci.c | 17 + > 3 files changed, 18 insertions(+), 4 deletions(-) > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index dd8ec1dd9219..6033ce2fd3e9 100644 > +++ b/drivers/nvme/host/core.c > @@ -3051,7 +3051,11 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, > unsigned nsid) > ns->queue = blk_mq_init_queue(ctrl->tagset); > if (IS_ERR(ns->queue)) > goto out_free_ns; > + > blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue); > + if (ctrl->ops->flags & NVME_F_PCI_P2PDMA) > + blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue); > + > ns->queue->queuedata = ns; > ns->ctrl = ctrl; > > diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h > index bb4a2003c097..4030743c90aa 100644 > +++ b/drivers/nvme/host/nvme.h > @@ -343,6 +343,7 @@ struct nvme_ctrl_ops { > unsigned int flags; > #define NVME_F_FABRICS (1 << 0) > #define NVME_F_METADATA_SUPPORTED(1 << 1) > +#define NVME_F_PCI_P2PDMA(1 << 2) > int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val); > int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val); > int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val); > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > index 2902585c6ddf..bb2120d30e39 100644 > +++ b/drivers/nvme/host/pci.c > @@ -737,8 +737,13 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, > struct request *req, > goto out; > > ret = BLK_STS_RESOURCE; > - nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, 
iod->nents, dma_dir, > - DMA_ATTR_NO_WARN); > + > + if (is_pci_p2pdma_page(sg_page(iod->sg))) > + nr_mapped = pci_p2pdma_map_sg(dev->dev, iod->sg, iod->nents, > + dma_dir); > + else > + nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents, > + dma_dir, DMA_ATTR_NO_WARN); > if (!nr_mapped) > goto out; > > @@ -780,7 +785,10 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct > request *req) > DMA_TO_DEVICE : DMA_FROM_DEVICE; > > if (iod->nents) { > - dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir); > + /* P2PDMA requests do not need to be unmapped */ > + if (!is_pci_p2pdma_page(sg_page(iod->sg))) > + dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir); This seems like a poor direction, if we add IOMMU hairpin support we will need unmapping. Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
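Jason's objection is about symmetry: today pci_p2pdma_map_sg() just hands back PCI bus addresses and creates no mapping state, so skipping dma_unmap_sg() for P2P pages happens to be safe, but the moment an IOMMU sits in the P2P path, map and unmap must pair again. A toy sketch of why the skip works only under that assumption (all names here are stand-ins, not the NVMe driver's):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for a request; is_p2p models is_pci_p2pdma_page(sg_page(sg)). */
struct toy_req { bool is_p2p; bool has_iommu_state; };

static void toy_map(struct toy_req *req)
{
	if (req->is_p2p) {
		/* pci_p2pdma_map_sg(): today a pass-through to PCI bus
		 * addresses, so nothing is allocated to tear down. */
		req->has_iommu_state = false;
	} else {
		/* dma_map_sg_attrs(): may allocate IOMMU/SWIOTLB state
		 * that dma_unmap_sg() must release. */
		req->has_iommu_state = true;
	}
}

static void toy_unmap(struct toy_req *req)
{
	/* The unmap path repeats the same predicate.  This stays correct
	 * only while the P2P branch really is stateless; an IOMMU-mapped
	 * P2P path (the "hairpin" case) would break the asymmetry. */
	if (!req->is_p2p && req->has_iommu_state)
		req->has_iommu_state = false;
}
```

In other words the driver is encoding a property of the current P2PDMA implementation, not of the DMA API, which is why the direction looks fragile.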
Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory
On Mon, Mar 26, 2018 at 11:30:38AM -0600, Logan Gunthorpe wrote: > > > On 26/03/18 10:41 AM, Jason Gunthorpe wrote: > > On Mon, Mar 26, 2018 at 12:11:38PM +0100, Jonathan Cameron wrote: > >> On Tue, 13 Mar 2018 10:43:55 -0600 > >> Logan Gunthorpe wrote: > >> > >>> On 12/03/18 09:28 PM, Sinan Kaya wrote: > >>>> On 3/12/2018 3:35 PM, Logan Gunthorpe wrote: > >>>> Regarding the switch business, it is amazing how much trouble you went > >>>> to > >>>> to limit this functionality to very specific hardware. > >>>> > >>>> I thought that we reached an agreement that the code would not impose > >>>> any limits on what the user wants. > >>>> > >>>> What happened to all the emails we exchanged? > >>> > >>> It turns out that root ports that support P2P are far less common than > >>> anyone thought. So it will likely have to be a white list. > >> > >> This came as a bit of a surprise to our PCIe architect. > > > > I don't think it is a hardware problem. > > The latest and greatest Power9 CPUs still explicitly do not support > this. I think this is another case of the HW can do it but the SW support is missing. IOMMU configuration and maybe firmware too, for instance. If I recall I saw a presentation that Coral was expected to use P2P between the network and GPU. > And, if I recall correctly, the ARM64 device we played with did > not either -- but I suspect that will differ depending on vendor. Wouldn't surprise me at all to see broken implementations in ARM64.. But even there it needs IOMMU enablement to work at all if I recall. Basically, this is probably not a HW problem that needs a HW bit, but an OS/firmware problem to do all the enablement.. Jason
Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory
On Mon, Mar 26, 2018 at 12:11:38PM +0100, Jonathan Cameron wrote: > On Tue, 13 Mar 2018 10:43:55 -0600 > Logan Gunthorpe wrote: > > > On 12/03/18 09:28 PM, Sinan Kaya wrote: > > > On 3/12/2018 3:35 PM, Logan Gunthorpe wrote: > > > Regarding the switch business, it is amazing how much trouble you went > > > to > > > to limit this functionality to very specific hardware. > > > > > > I thought that we reached an agreement that the code would not impose > > > any limits on what the user wants. > > > > > > What happened to all the emails we exchanged? > > > > It turns out that root ports that support P2P are far less common than > > anyone thought. So it will likely have to be a white list. > > This came as a bit of a surprise to our PCIe architect. I don't think it is a hardware problem. I know Mellanox and Nvidia have been doing p2p on Intel root complexes for something like 5-6 years now.. I don't have the details, but it does seem to work. I have heard some chips give poor performance.. Also AMD GPU SLI uses P2P these days, so this isn't exactly a niche feature in Intel/AMD land. I think the main issue here is that there is some BIOS involvement to set things up properly. Eg in GPU land motherboards certify for 'crossfire' support. > His follow up was whether it was worth raising an ECR for the PCIe spec > to add a capability bit to allow this to be discovered. This might > long term avoid the need to maintain the white list for new devices. If it is primarily a BIOS issue then it should be an ACPI thing, right? Jason
Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
On Mon, Mar 05, 2018 at 01:42:12PM -0700, Keith Busch wrote: > On Mon, Mar 05, 2018 at 01:10:53PM -0700, Jason Gunthorpe wrote: > > So when reading the above mlx code, we see the first wmb() being used > > to ensure that CPU stores to cacheable memory are visible to the DMA > > triggered by the doorbell ring. > > IIUC, we don't need a similar barrier for NVMe to ensure memory is > visible to DMA since the SQE memory is allocated DMA coherent when the > SQ is not within a CMB. You still need it. DMA coherent just means you don't need to call the DMA API after writing; it says nothing about CPU ordering. Eg on x86 DMA coherent is just normal system memory, and you do need the SFENCE between system memory stores and DMA-triggering MMIO, apparently. Jason
Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB
On Mon, Mar 05, 2018 at 09:57:27PM +0200, Sagi Grimberg wrote: > Keith, while we're on this, regardless of cmb, is SQE memcopy and DB update > ordering always guaranteed? > > If you look at mlx4 (rdma device driver) that works exactly the same as > nvme you will find: > -- > qp->sq.head += nreq; > > /* > * Make sure that descriptors are written before > * doorbell record. > */ > wmb(); > > writel(qp->doorbell_qpn, >to_mdev(ibqp->device)->uar_map + MLX4_SEND_DOORBELL); > > /* > * Make sure doorbells don't leak out of SQ spinlock > * and reach the HCA out of order. > */ > mmiowb(); > -- > > That seems to explicitly make sure to place a barrier before updating > the doorbell. So as I see it, either ordering is guaranteed and the > above code is redundant, or nvme needs to do the same. A wmb() is always required before operations that can trigger DMA. The reason ARM has a barrier in writel() is not to make it ordered with respect to CPU stores to cacheable memory, but to make it ordered with respect to *other* writels. Eg Linux defines this: writel(A, mem); writel(B, mem); to always produce two TLPs on PCI-E when mem is UC BAR memory. And ARM cannot guarantee that without the extra barrier. So now we see stuff like this: writel_relaxed(A, mem); writel_relaxed(B, mem+4); which says the TLPs to A and B can be issued in any order.. So when reading the above mlx code, we see the first wmb() being used to ensure that CPU stores to cacheable memory are visible to the DMA triggered by the doorbell ring. The mmiowb() is used to ensure that DB writes are not combined and not issued in any order other than implied by the lock that encloses the whole thing. This is needed because uar_map is WC memory. We don't have ordering with respect to two writel's here, so if ARM performance was a concern the writel could be switched to writel_relaxed(). Presumably nvme has similar requirements, although I guess the DB register is mapped UC not WC?
Jason
Re: [LSF/MM TOPIC] Filesystem-DAX, page-pinning, and RDMA
On Mon, Jan 29, 2018 at 06:33:25PM -0500, Jerome Glisse wrote: > I would also like to participate; in my view the burden should > be on GUP users, so if hardware is not ODP capable then you should at > least be able to kill the mapping/GUP and force the hardware to redo a > GUP if it gets any more transactions on the affected umem. Can non-ODP hardware > do that? Or is it out of the question? For RDMA we can have the HW forcibly tear down the MR, but it is incredibly disruptive and nobody running applications would be happy with this outcome. Jason
Re: [PATCH 02/12] pci-p2p: Add sysfs group to display p2pmem stats
On Thu, Jan 04, 2018 at 03:50:40PM -0600, Bjorn Helgaas wrote: > > This is similar to /sys/bus/pci/drivers_autoprobe, but > > affects only the VFs associated with a specific PF. > > + > > +What: /sys/bus/pci/devices/.../p2pmem/available > > I wonder if "p2pdma" would be a more suggestive term? It's not really > the *memory* that is peer-to-peer; the peer-to-peer part is referring > to *access* to the memory. There have been out of tree patches using P2P DMA for a long time, and some of the use cases have nothing to do with 'memory' - eg DMA to 'registers'. I notice that this series particularly focuses on treating the target BAR as 'memory' - ie it puts genalloc on top of the BAR, and I guess treats all addresses as equal and interchangeable. If this series gets accepted I would expect proposals to extend this infrastructure to allow for P2P for registers in some way as well. So I think the 'p2pmem' name is a good choice only when it is in places that talk about the genalloc part of this design. We should reserve p2pdma to talk about the generic infrastructure unrelated to the genalloc pool. Since these sysfs's seem to report the genalloc pool status, p2pmem seems like a good choice to me. > > @@ -82,6 +130,9 @@ static int pci_p2pmem_setup(struct pci_dev *pdev) > > if (error) > > goto out_pool_destroy; > > > > + if (sysfs_create_group(&pdev->dev.kobj, &p2pmem_group)) > > + dev_warn(&pdev->dev, "failed to create p2p sysfs group\n"); > > Not sure the warning (by itself) is worthwhile. If we were going to > disable the feature if sysfs_create_group() failed, that's one thing, > but we aren't doing anything except generating a warning, which the > user can't really do anything with. If the user is looking for the > sysfs file, its absence will be obvious even without the message. Don't most of the failure paths inside sysfs_create_group cause prints anyhow? Jason
Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
On Mon, Oct 16, 2017 at 03:02:52PM +0300, Sagi Grimberg wrote: > But why should the kernel ever need to mangle the CQ? if a lease break > would deregister the MR the device is expected to generate remote > protection errors on its own. The point is to avoid protection errors - a hitless changeover when the DAX mapping changes, like ODP does. The only way to get there is to notify the app before the mappings change.. Dan suggested having ibv_poll_cq return this indication.. Jason
Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
On Fri, Oct 13, 2017 at 11:22:21AM -0700, Dan Williams wrote: > > So, who should be responsible for MR coherency? Today we say the MPI > > is responsible. But we can't really expect the MPI > > to hook SIGIO and somehow try to reverse engineer what MRs are > > impacted from an FD that may not even still be open. > > Ok, that's good insight that I didn't have. Userspace needs more help > than just an fd notification. Glad to help! > > I think, if you want to build a uAPI for notification of MR lease > > break, then you need to show how it fits into the above software model: > > - How it can be hidden in an RDMA specific library > > So, here's a strawman: can ibv_poll_cq() start returning ibv_wc_status > == IBV_WC_LOC_PROT_ERR when file coherency is lost? This would make > the solution generic across DAX and non-DAX. What's your feeling for > how well applications are prepared to deal with that status return? Stuffing an entry into the CQ is difficult. The CQ is in user memory and it is DMA'd from the HCA for several pieces of hardware, so the kernel can't just stuff something in there. It can be done with HW support by having the HCA DMA it via an exception path or something, but even then, you run into questions like CQ overflow and accounting issues since it is not meant for this. So, you need a side channel of some kind, either in certain drivers or generically.. > > - How lease break can be done hitlessly, so the library user never > >needs to know it is happening or see failed/missed transfers > > iommu redirect should be hitless and behave like the page cache case > where RDMA targets pages that are no longer part of the file. Yes, if the iommu can be fenced properly it sounds doable. > > - Whatever fast path checking is needed does not kill performance > > What do you consider a fast path?
I was assuming that memory > registration is a slow path, and iommu operations are asynchronous so > should not impact performance of ongoing operations beyond typical > iommu overhead. ibv_poll_cq() and ibv_post_send() would be a fast path. Where this struggled before is that once you create a side channel you also have to check that side channel, and checking it at high performance is quite hard.. Even quiescing things to be able to tear down the MR has performance implications on post send... Now that I see this whole thing in this light it seems very similar to the MPI driven user space mmu notification ideas and has similar challenges. FWIW, RDMA banged its head on this issue for 10 years and it was ODP that emerged as the solution. One option might be to use an async event notification 'MR de-coherence' and rely on a main polling loop to catch it. This is good enough for dax because the lease-requestor would wait until the async event was processed. It would also be acceptable for the general MPI case too, but only if this lease concept was wider than just DAX, eg an MR leases a piece of a VMA, and if anything anyhow changes that VMA (eg munmap, mmap, mremap, etc) then it has to wait for the MR to release the lease. ie munmap would block until the async event is processed. ODP-light in userspace, essentially. IIRC this sort of suggestion was never explored, something like: poll(fd) ibv_read_async_event(fd) if (event == MR_DECOHERENCE) { quiesce_network(); ibv_restore_mr(mr); restore_network(); } The implementation of ibv_restore_mr would have to make a new MR that pointed to the same virtual memory addresses, but was backed by the *new* physical pages. This means it has to unblock the lease, and wait for the lease requestor to complete executing. Jason
Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
On Fri, Oct 13, 2017 at 10:01:04AM -0700, Dan Williams wrote: > On Fri, Oct 13, 2017 at 9:38 AM, Jason Gunthorpe > wrote: > > On Fri, Oct 13, 2017 at 08:14:55AM -0700, Dan Williams wrote: > > > >> scheme specific to RDMA which seems like a waste to me when we can > >> generically signal an event on the fd for any event that affects any > >> of the vma's on the file. The FL_LAYOUT lease impacts the entire file, > >> so as far as I can see delaying the notification until MR-init is too > >> late, too granular, and too RDMA specific. > > > > But for RDMA an FD is not what we care about - we want the MR handle > > so the app knows which MR needs fixing. > > I'd rather put the onus on userspace to remember where it used a > MAP_DIRECT mapping and be aware that all the mappings of that file are > subject to a lease break. Sure, we could build up a pile of kernel > infrastructure to notify on a per-MR basis, but I think that would > only be worth it if leases were range based. As it is, the entire file > is covered by a lease instance and all MRs that might reference that > file get one notification. That said, we can always arrange for a > per-driver callback at lease-break time so that it can do something > above and beyond the default notification. I don't think that really represents how lots of apps actually use RDMA. RDMA is often buried down in the software stack (eg in an MPI), and by the time a mapping gets used for RDMA transfer the link between the FD, mmap and the MR is totally opaque. Having an MR specific notification means the low level RDMA libraries have a chance to deal with everything for the app. Eg consider an HPC app using MPI that uses some DAX aware library to get DAX backed mmap's. It then passes memory in those mmaps to the MPI library to do transfers. The MPI creates the MR on demand. So, who should be responsible for MR coherency? Today we say the MPI is responsible.
But we can't really expect the MPI to hook SIGIO and somehow try to reverse engineer what MRs are impacted from an FD that may not even still be open. I think, if you want to build a uAPI for notification of MR lease break, then you need to show how it fits into the above software model: - How it can be hidden in an RDMA specific library - How lease break can be done hitlessly, so the library user never needs to know it is happening or see failed/missed transfers - Whatever fast path checking is needed does not kill performance Jason
Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush
On Fri, Oct 13, 2017 at 08:14:55AM -0700, Dan Williams wrote: > scheme specific to RDMA which seems like a waste to me when we can > generically signal an event on the fd for any event that affects any > of the vma's on the file. The FL_LAYOUT lease impacts the entire file, > so as far as I can see delaying the notification until MR-init is too > late, too granular, and too RDMA specific. But for RDMA an FD is not what we care about - we want the MR handle so the app knows which MR needs fixing. Jason
Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
On Fri, Oct 13, 2017 at 08:50:47AM +0200, Christoph Hellwig wrote: > > However, chatting this over with a few more people I have an alternate > > solution that effectively behaves the same as how non-ODP hardware > > handles this case of hole punch / truncation today. So, today if this > > scenario happens on a page-cache backed mapping, the file blocks are > > unmapped and the RDMA continues into pinned pages that are no longer > > part of the file. We can achieve the same thing with the iommu, just > > re-target the I/O into memory that isn't part of the file. That way > > hardware does not see I/O errors and the DAX data consistency model is > > no worse than the page-cache case. > > Yikes. Well, as much as you say Yikes, Dan is correct, this does match the semantics RDMA MR's already have. They become non-coherent if their underlying object is changed, and there are many ways to get there. I've never thought about it, but it does sound like ftruncate, fallocate, etc on a normal file would break the MR coherency too?? There have been efforts in the past driven by the MPI people to create, essentially, something like a 'lease-break' SIGIO. Except it was intended to be general, and wanted to solve all the problems related to MR de-coherence. This was complicated and never became acceptable to mainline. Instead ODP was developed, and ODP actually solves all these problems sanely. Thinking about it some more, and with your other comments on get_user_pages in this thread, I tend to agree. It doesn't make sense to develop a user space lease break API for MR's that is a DAX specific feature. Along the same lines, it also doesn't make sense to force-invalidate MR's linked to DAX regions, while leaving MR's linked to other regions that have the same problem alone.
If you want to make non-ODP MR's work better, then you need to have a general overall solution to tell userspace when the MR becomes (or I guess, is becoming) non-coherent, that covers all the cases that break MR coherence, not just via DAX. Otherwise, I think Dan is right: keeping the current semantic of having MRs just do something wrong, but not corrupt memory, when they lose coherence, is broadly consistent with how non-ODP MRs work today. Jason
Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
On Tue, Oct 10, 2017 at 01:17:26PM -0700, Dan Williams wrote: > Also keep in mind that what triggers the lease break is another > application trying to write or punch holes in a file that is mapped > for RDMA. So, if the hardware can't handle the iommu mapping getting > invalidated asynchronously and the application can't react in the > lease break timeout period then the administrator should arrange for > the file to not be written or truncated while it is mapped. That makes sense, but why not return ENOSYS or something to the app trying to alter the file if the RDMA hardware can't support this, instead of having the RDMA app deal with this lease break weirdness? > It's already the case that get_user_pages() does not lock down file > associations, so if your application is contending with these types of > file changes it likely already has a problem keeping transactions in > sync with the file state even without DAX. Yes, things go weird in non-ODP RDMA cases like this.. Also, just to be clear, I would expect an app using the SIGIO interface to basically halt ongoing RDMA, wait for MRs to become unused locally and remotely, destroy the MRs, then somehow establish new MRs that cover the same logical map (eg what ODP would do transparently) after the lease breaker has made their changes, then restart their IO. Does your SIGIO approach have a race-free way to do those last steps? > > So, not being able to support DAX on certain RDMA hardware is not > > an unreasonable situation in our space. > > That makes sense, but it still seems to me that this proposed solution > allows more than enough ways to avoid that worst case scenario where > hardware reacts badly to iommu invalidation. Yes, although I am concerned that returning PCI-E errors is such an unusual and untested path for some of our RDMA drivers that they may malfunction badly...
Again, going back to the question of who would ever use this, I would be very reluctant to deploy a production configuration relying on the iommu invalidate or SIGIO techniques, when ODP HW is available and works flawlessly. > be blacklisted from supporting DAX altogether. In other words this is > a starting point to incrementally enhance or disable specific drivers, > but with the assurance that the kernel can always do the safe thing > when / if the driver is missing a finer grained solution. Seems reasonable.. I think existing HW will have an easier time adding invalidate, while new hardware really should implement ODP. Jason
Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
On Tue, Oct 10, 2017 at 10:39:27AM -0700, Dan Williams wrote: > On Tue, Oct 10, 2017 at 10:25 AM, Jason Gunthorpe > >> Have a look at the patch [1], I don't touch the ODP path. > > > > But, does ODP work OK already? I'm not clear on that.. > > It had better. If the mapping is invalidated I would hope that > generates an io fault that gets handled by the driver to setup the new > mapping. I don't see how it can work otherwise. I would assume so too... > > This is why ODP should be the focus because this cannot work fully > > reliably otherwise.. > > The lease break time is configurable. If that application can't > respond to a stop request within a timeout of its own choosing then it > should not be using DAX mappings. Well, no RDMA application can really do this, unless you set the timeout to multiple minutes, on par with network timeouts. Again, these details are why I think this kind of DAX and non-ODP MRs are probably practically not too useful for a production system. Great for test of course, but in that case SIGKILL would be fine too... > > Well, what about using SIGKILL if the lease-break-time hits? The > > kernel will clean up the MRs when the process exits and this will > > fence DMA to that memory. > > Can you point me to where the MR cleanup code fences DMA and quiesces > the device? Yes. The MR's are associated with an fd. When the fd is closed ib_uverbs_close triggers ib_uverbs_cleanup_ucontext which runs through all the objects, including MRs, and deletes them. The specification for deleting an MR requires a synchronous fence with the hardware. After MR deletion the hardware will not DMA to any pages described by the old MR, and those pages will be unpinned. > > But, still, if you really want to be fine grained, then I think > > invalidating the impacted MR's is a better solution for RDMA than > > trying to do it with the IOMMU... > > If there's a better routine for handling ib_umem_lease_break() I'd > love to use it. Right now I'm reaching for the only tool I know for > kernel enforced revocation of DMA access. Well, you'd have to code something in the MR code to keep track of DAX MRs and issue an out of band invalidate to impacted MRs to create the fence. This probably needs some driver work; I'm not sure if all the hardware can do out of band invalidate to any MR or not.. Generally speaking, in RDMA, when a new feature like this comes along we have to push a lot of the work down to the driver authors, and the approach has historically been that new features only work on some hardware (as much as I dislike this, it is pragmatic). So, not being able to support DAX on certain RDMA hardware is not an unreasonable situation in our space. Jason
Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
On Mon, Oct 09, 2017 at 12:28:29PM -0700, Dan Williams wrote: > > I don't think this has ever come up in the context of an all-device MR > > invalidate requirement. Drivers already have code to invalidate > > specific MRs, but to find all MRs that touch certain pages and then > > invalidate them would be new code. > > > > We also have ODP aware drivers that can retarget an MR to new > > physical pages. If the block map changes DAX should synchronously > > retarget the ODP MR, not halt DMA. > > Have a look at the patch [1], I don't touch the ODP path. But, does ODP work OK already? I'm not clear on that.. > > Most likely ODP & DAX would need to be used together to get robust > > user applications, as having the user QP's go to an error state at > > random times (due to DMA failures) during operation is never going to > > be acceptable... > > It's not random. The process that set up the mapping and registered > the memory gets SIGIO when someone else tries to modify the file map. > That process then gets /proc/sys/fs/lease-break-time seconds to fix > the problem before the kernel force revokes the DMA access. Well, the process can't fix the problem in bounded time, so it is random if it will fail or not. MR lifetime is under the control of the remote side, and the time to complete the network exchanges required to release the MRs is hard to bound. So even if I implement SIGIO properly my app will still likely have random QP failures under various cases and workloads. :( This is why ODP should be the focus because this cannot work fully reliably otherwise.. > > Perhaps you might want to initially only support ODP MR mappings with > > DAX and then the DMA fencing issue goes away? > > I'd rather try to fix the non-ODP DAX case instead of just turning it off. Well, what about using SIGKILL if the lease-break-time hits? The kernel will clean up the MRs when the process exits and this will fence DMA to that memory. But, still, if you really want to be fine grained, then I think invalidating the impacted MR's is a better solution for RDMA than trying to do it with the IOMMU... Jason
Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
On Mon, Oct 09, 2017 at 12:05:30PM -0700, Dan Williams wrote: > On Mon, Oct 9, 2017 at 11:58 AM, Jason Gunthorpe > wrote: > > On Fri, Oct 06, 2017 at 03:35:54PM -0700, Dan Williams wrote: > >> otherwise be quiesced. The need for this knowledge is driven by a need > >> to make RDMA transfers to DAX mappings safe. If the DAX file's block map > >> changes we need to be able to reliably stop accesses to blocks that have been > >> freed or re-assigned to a new file. > > > > If RDMA is driving this need, why not invalidate backing RDMA MRs > > instead of requiring an IOMMU to do it? RDMA MRs are finer grained and > > do not suffer from the re-use problem David W. brought up with IOVAs.. > > Sounds promising. All I want in the end is to be sure that the kernel > is enabled to stop any in-flight RDMA at will without asking > userspace. Does this require per-RDMA driver opt-in or is there a > common call that can be made? I don't think this has ever come up in the context of an all-device MR invalidate requirement. Drivers already have code to invalidate specific MRs, but to find all MRs that touch certain pages and then invalidate them would be new code. We also have ODP aware drivers that can retarget an MR to new physical pages. If the block map changes DAX should synchronously retarget the ODP MR, not halt DMA. Most likely ODP & DAX would need to be used together to get robust user applications, as having the user QP's go to an error state at random times (due to DMA failures) during operation is never going to be acceptable... Perhaps you might want to initially only support ODP MR mappings with DAX and then the DMA fencing issue goes away? Cheers, Jason
Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()
On Fri, Oct 06, 2017 at 03:35:54PM -0700, Dan Williams wrote: > otherwise be quiesced. The need for this knowledge is driven by a need > to make RDMA transfers to DAX mappings safe. If the DAX file's block map > changes we need to be able to reliably stop accesses to blocks that have been > freed or re-assigned to a new file. If RDMA is driving this need, why not invalidate backing RDMA MRs instead of requiring an IOMMU to do it? RDMA MRs are finer grained and do not suffer from the re-use problem David W. brought up with IOVAs.. Jason
Re: [PATCH v2 15/21] xen-blkfront: Make use of the new sg_map helper function
On Thu, Apr 27, 2017 at 05:03:45PM -0600, Logan Gunthorpe wrote: > > > On 27/04/17 04:11 PM, Jason Gunthorpe wrote: > > On Thu, Apr 27, 2017 at 03:53:37PM -0600, Logan Gunthorpe wrote: > > Well, that is in the current form, with more users it would make sense > > to optimize for the single page case, eg by providing the existing > > call, providing a faster single-page-only variant of the copy, perhaps > > even one that is inlined. > > Ok, does it make sense then to have an sg_copy_page_to_buffer (or some > such... I'm having trouble thinking of a sane name that isn't too long). > That just does k(un)map_atomic and memcpy? I could try that if it makes > sense to people. It seems the most robust: test for iomem, and jump to a slow path copy, otherwise inline the kmap and memcpy. Every place doing memcpy from sgl will need that pattern to be correct. > > sg_miter will still fail when the sg contains __iomem, however I would > > expect that the sg_copy will work with iomem, by using the __iomem > > memcpy variant. > > Yes, that's true. Any sg_miters that ever see iomem will need to be > converted to support it. This isn't much different than the other > kmap(sg_page()) users I was converting that will also fail if they see > iomem. Though, I suspect an sg_miter user would be easier to convert to > iomem than a random kmap user. How? sg_miter seems like the next nightmare down this path: what is sg_miter_next supposed to do when something hits an iomem sgl? miter.addr is supposed to be a kernel pointer that must not be __iomem.. Jason
Re: [PATCH v2 15/21] xen-blkfront: Make use of the new sg_map helper function
On Thu, Apr 27, 2017 at 03:53:37PM -0600, Logan Gunthorpe wrote: > On 27/04/17 02:53 PM, Jason Gunthorpe wrote: > > blkfront is one of the drivers I looked at, and it appears to only be > > memcpying with the bvec_data pointer, so I wonder why it does not use > > sg_copy_X_buffer instead.. > > But you'd potentially end up calling sg_copy_to_buffer multiple times > per page within the sg (given that gnttab_foreach_grant_in_range might > call blkif_copy_from_grant/blkif_setup_rw_req_grant multiple times). > Even calling sg_copy_to_buffer once per page seems rather inefficient as > it uses sg_miter internally. Well, that is in the current form, with more users it would make sense to optimize for the single page case, eg by providing the existing call, providing a faster single-page-only variant of the copy, perhaps even one that is inlined. > Switching the for_each_sg to sg_miter is probably the nicer solution as > it takes care of the mapping and the offset/length accounting for you > and will have similar performance. sg_miter will still fail when the sg contains __iomem, however I would expect that the sg_copy will work with iomem, by using the __iomem memcpy variant. So, sg_copy should always be preferred in this new world with mixed __iomem since it is the only primitive that can transparently handle it. Jason
Re: [PATCH v2 15/21] xen-blkfront: Make use of the new sg_map helper function
On Thu, Apr 27, 2017 at 02:19:24PM -0600, Logan Gunthorpe wrote:
>
> On 26/04/17 01:37 AM, Roger Pau Monné wrote:
> > On Tue, Apr 25, 2017 at 12:21:02PM -0600, Logan Gunthorpe wrote:
> >> Straightforward conversion to the new helper, except due to the lack
> >> of error path, we have to use SG_MAP_MUST_NOT_FAIL which may BUG_ON in
> >> certain cases in the future.
> >>
> >> Signed-off-by: Logan Gunthorpe
> >> Cc: Boris Ostrovsky
> >> Cc: Juergen Gross
> >> Cc: Konrad Rzeszutek Wilk
> >> Cc: "Roger Pau Monné"
> >>  drivers/block/xen-blkfront.c | 20 +++-
> >>  1 file changed, 11 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> >> index 3945963..ed62175 100644
> >> +++ b/drivers/block/xen-blkfront.c
> >> @@ -816,8 +816,9 @@ static int blkif_queue_rw_req(struct request *req, struct blkfront_ring_info *ri
> >>     BUG_ON(sg->offset + sg->length > PAGE_SIZE);
> >>
> >>     if (setup.need_copy) {
> >> -           setup.bvec_off = sg->offset;
> >> -           setup.bvec_data = kmap_atomic(sg_page(sg));
> >> +           setup.bvec_off = 0;
> >> +           setup.bvec_data = sg_map(sg, 0, SG_KMAP_ATOMIC |
> >> +                                    SG_MAP_MUST_NOT_FAIL);
> >
> > I assume that sg_map already adds sg->offset to the address?
>
> Correct.
>
> > Also wondering whether we can get rid of bvec_off and just increment
> > bvec_data, adding Julien who IIRC added this code.
>
> bvec_off is used to keep track of the offset within the current mapping
> so it's not a great idea given that you'd want to kunmap_atomic the
> original address and not something with an offset. It would be nice if
> this could be converted to use the sg_miter interface but that's a much
> more invasive change that would require someone who knows this code and
> can properly test it.

I'd be very grateful if someone actually took that on.

blkfront is one of the drivers I looked at, and it appears to only be memcpying with the bvec_data pointer, so I wonder why it does not use sg_copy_X_buffer instead..
Jason
Re: [PATCH v2 01/21] scatterlist: Introduce sg_map helper functions
On Thu, Apr 27, 2017 at 08:53:38AM +0200, Christoph Hellwig wrote:
> > The main difficulty we
> > have now is that neither of those functions are expected to fail and we
> > need them to be able to in cases where the page doesn't map to system
> > RAM. This patch series is trying to address it for users of scatterlist.
> > I'm certainly open to other suggestions.
>
> I think you'll need to follow the existing kmap semantics and never
> fail the iomem version either. Otherwise you'll have a special case
> that's almost never used that has a different error path.

How about first switching as many call sites as possible to use sg_copy_X_buffer instead of kmap? A random audit of Logan's series suggests this is actually a fairly common thing.

eg drivers/mmc/host/sdhci.c is only doing this:

	buffer = sdhci_kmap_atomic(sg, &flags);
	memcpy(buffer, align, size);
	sdhci_kunmap_atomic(buffer, &flags);

drivers/scsi/mvsas/mv_sas.c is this:

+	to = sg_map(sg_resp, 0, SG_KMAP_ATOMIC);
+	memcpy(to,
+	       slot->response + sizeof(struct mvs_err_info),
+	       sg_dma_len(sg_resp));
+	sg_unmap(sg_resp, to, 0, SG_KMAP_ATOMIC);

etc.

Lots of other places seem similar, if not sometimes a little bit more convoluted..

Switching all the trivial cases to use copy might bring more clarity as to what is actually required for the remaining few users? If there are only a few then it may no longer matter if the API is not idyllic.

Jason
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Wed, Apr 19, 2017 at 01:41:49PM -0600, Logan Gunthorpe wrote:
> > But.. it could point to a GPU and the GPU struct device could have a
> > proxy dma_ops like Dan pointed out.
>
> Seems a bit awkward to me that in order for the intended use case, you
> have to proxy the dma_ops. I'd probably still suggest throwing a couple
> ops for things like this in the dev_pagemap.

Another option is adding a new 'struct completer_dma_ops *' to struct device for this use case.

Seems like a waste to expand dev_pagemap when we only need a unique value per struct device?

Jason
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Wed, Apr 19, 2017 at 01:02:49PM -0600, Logan Gunthorpe wrote:
>
> On 19/04/17 12:32 PM, Jason Gunthorpe wrote:
> > On Wed, Apr 19, 2017 at 12:01:39PM -0600, Logan Gunthorpe wrote:
> > Not entirely, it would have to call through the whole process
> > including the arch_p2p_cross_segment()..
>
> Hmm, yes. Though it's still not clear what, if anything,
> arch_p2p_cross_segment would be doing.

Sets up the iommu for arches that place an iommu between the pci root port and other pci root ports.

> In my experience, if you are going between host bridges, the CPU
> address (or PCI address -- I'm not sure which seeing they are the
> same on my system) would still work fine

Try it with VT-d turned on. It shouldn't work, or there is a notable security hole in your platform..

> > const struct dma_map_ops *comp_ops = get_dma_ops(completer);
> > const struct dma_map_ops *init_ops = get_dma_ops(initiator);
>
> So, in this case, what device does the completer point to? The PCI
> device or a more specific GPU device? If it's the former, who's
> responsible for setting the new dma_ops? Typically the dma_ops are arch
> specific but now you'd be adding ones that are tied to hmm or the gpu.

Donno, that is for GPU folks to figure out :)

But.. it could point to a GPU and the GPU struct device could have a proxy dma_ops like Dan pointed out.

> >> I'm not sure I like the name pci_p2p_same_segment. It reads as though
> >> it's only checking if the devices are not the same segment.
> >
> > Well, that is exactly what it is doing. If it succeeds then the caller
> > knows the DMA will not flow outside the segment and no iommu setup/etc
> > is required.
>
> It appears to me like it's calculating the DMA address, and the check is
> just a side requirement. It reads as though it's only doing the check.

pci_p2p_same_segment_get_pa() then?

Jason
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Wed, Apr 19, 2017 at 12:01:39PM -0600, Logan Gunthorpe wrote:
> I'm just spit balling here but if HMM wanted to use unaddressable memory
> as a DMA target, it could set that function to create a window in gpu
> memory, then call the pci_p2p_same_segment and return the result as the
> dma address.

Not entirely, it would have to call through the whole process including the arch_p2p_cross_segment()..

Maybe we can start down the road of using ops for more iommu steps with something like this as the helper:

dma_addr_t dma_map_pa(struct device *initiator, struct page *page,
		      void *data)
{
	struct device *completer = get_p2p_completer(page);
	dma_addr_t pa;

	if (IS_ERR(completer))
		return SYSTEM_MEMORY;
		// Or maybe ? return init_ops->dma_map_pa(..);

	// Try the generic method
	pa = pci_p2p_same_segment(dev, p2p_src, page);
	if (pa != ERROR)
		goto out;

	// Try the arch specific helper
	const struct dma_map_ops *comp_ops = get_dma_ops(completer);
	const struct dma_map_ops *init_ops = get_dma_ops(initiator);

	/* FUTURE: Let something translate a HMM page into a DMA'ble
	   page, eg by mapping it into a GPU window. Maybe this
	   callback lives in devmap ? */
	page = comp_ops->translate_dma_page(completer, page);

	/* New dma_map_op is the same as arch_p2p_cross_segment in
	   prior version. Any arch specific data needed to program
	   the iommu flows through data */
	pa = init_ops->p2p_cross_segment_map(completer, initiator, page, data);

out:
	device_put(completer);
	return pa;
}

// map_sg op:
for (each sgl) {
	struct page *page = sg_page(s);
	struct arch_iommu_data data = {}; // pass through to ops->p2p_cross_segment
	dma_addr_t pa;

	pa = dma_map_pa(dev, page, &data);
	if (pa == ERROR)
		// fail

	if (!data.needs_iommu) {
		// Insert PA directly into the result SGL
		sg++;
		continue;
	}

	// Else pa & data describe how to setup the iommu
}

> > dma_addr_t pci_p2p_same_segment(struct device *initiator,
> >                                 struct device *completer,
> >                                 struct page *page)
>
> I'm not sure I like the name pci_p2p_same_segment.
> It reads as though
> it's only checking if the devices are not the same segment.

Well, that is exactly what it is doing. If it succeeds then the caller knows the DMA will not flow outside the segment and no iommu setup/etc is required.

That function cannot be expanded to include generic cross-segment traffic; a new function would be needed..

Jason
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Wed, Apr 19, 2017 at 10:48:51AM -0600, Logan Gunthorpe wrote:
> The pci_enable_p2p_bar function would then just need to call
> devm_memremap_pages with the dma_map callback set to a function that
> does the segment check and the offset calculation.

I don't see a use for the dma_map function pointer at this point..

It doesn't make a lot of sense for the completer of the DMA to provide a mapping op; the mapping process is *path* specific, not specific to a completer/initiator.

So, I would suggest more like this:

static inline struct device *get_p2p_src(struct page *page)
{
	struct device *res;
	struct dev_pagemap *pgmap;

	if (!is_zone_device_page(page))
		return NULL;

	pgmap = get_dev_pagemap(page_to_pfn(page), NULL);
	if (!pgmap || pgmap->type != MEMORY_DEVICE_P2P)
		/* For now ZONE_DEVICE memory that is not P2P is
		   assumed to be configured for DMA the same as CPU
		   memory. */
		return ERR_PTR(-EINVAL);

	res = pgmap->dev;
	device_get(res);
	put_dev_pagemap(pgmap);
	return res;
}

dma_addr_t pci_p2p_same_segment(struct device *initiator,
				struct device *completer,
				struct page *page)
{
	if (! PCI initiator & completer)
		return ERROR;
	if (!same segment initiator & completer)
		return ERROR;

	// Translate page directly to the value programmed into the BAR
	return (Completer's PCI BAR base address) +
	       (offset of page within BAR);
}

// dma_sg_map
for (each sgl) {
	struct page *page = sg_page(s);
	struct device *p2p_src = get_p2p_src(page);

	if (IS_ERR(p2p_src))
		// fail dma_sg

	if (p2p_src) {
		bool needs_iommu = false;

		pa = pci_p2p_same_segment(dev, p2p_src, page);
		if (pa == ERROR)
			pa = arch_p2p_cross_segment(dev, p2p_src, page,
						    &needs_iommu);

		device_put(p2p_src);

		if (pa == ERROR)
			// fail

		if (!needs_iommu) {
			// Insert PA directly into the result SGL
			sg++;
			continue;
		}
	} else {
		// CPU memory
		pa = page_to_phys(page);
	}
}

To me it looks like the code duplication across the iommu stuff comes from just duplicating the basic iommu algorithm in every driver.
To clean that up I think someone would need to hoist the overall sgl loop and use more ops callbacks, eg allocate_iommu_range, assign_page_to_range, dealloc_range, etc. This is a problem p2p makes worse, but isn't directly causing :\

Jason
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Wed, Apr 19, 2017 at 11:20:06AM +1000, Benjamin Herrenschmidt wrote:
> That helper wouldn't perform the actual iommu mapping. It would simply
> return something along the lines of:
>
>  - "use that alternate bus address and don't map in the iommu"

I was thinking only this one would be supported with a core code helper..

>  - "use that alternate bus address and do map in the iommu"
>  - "proceed as normal"

.. because I don't have an idea how a core code helper could figure out what the platform needs for the above two..

I think many of the iommu platforms will need to construct their own bus address in the iommu, and we already have easy access to the CPU address.

Presumably once a transaction passes through the iommu it needs to be using the CPU address for the target bar, otherwise things are going to be ambiguous..

Jason
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Tue, Apr 18, 2017 at 03:51:27PM -0700, Dan Williams wrote:
> > This really seems like much less trouble than trying to wrapper all
> > the arch's dma ops, and doesn't have the wonky restrictions.
>
> I don't think the root bus iommu drivers have any business knowing or
> caring about dma happening between devices lower in the hierarchy.

Maybe not, but performance requires some odd choices in this code.. :(

> > Setting up the iommu is fairly expensive, so getting rid of the
> > batching would kill performance..
>
> When we're crossing device and host memory boundaries how much
> batching is possible? As far as I can see you'll always be splitting
> the sgl on these dma mapping boundaries.

Splitting the sgl is different from iommu batching.

As an example, an O_DIRECT write of 1 MB with a single 4K P2P page in the middle.

The optimum behavior is to allocate a 1MB-4K iommu range and fill it with the CPU memory. Then return an SGL with three entries, two pointing into the range and one to the p2p.

It is creating each range which tends to be expensive, so creating two ranges (or worse, if every SGL created a range it would be 255) is very undesired.

Jason
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Tue, Apr 18, 2017 at 03:28:17PM -0700, Dan Williams wrote:
> Unlike the pci bus address offset case which I think is fundamental to
> support since shipping archs do this today

But we can support this by modifying those arch's unique dma_ops directly.

Eg as I explained, my p2p_same_segment_map_page() helper concept would do the offset adjustment for same-segment DMA.

If PPC calls that in their IOMMU drivers then they will have proper support for this basic p2p, and the right framework to move on to more advanced cases of p2p.

This really seems like much less trouble than trying to wrapper all the arch's dma ops, and doesn't have the wonky restrictions.

> I think it is ok to say p2p is restricted to a single sgl that gets
> to talk to host memory or a single device.

RDMA and GPU would be sad with this restriction...

> That said, what's wrong with a p2p aware map_sg implementation
> calling up to the host memory map_sg implementation on a per sgl
> basis?

Setting up the iommu is fairly expensive, so getting rid of the batching would kill performance..

Jason
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Tue, Apr 18, 2017 at 03:31:58PM -0600, Logan Gunthorpe wrote:
> 1) It means that sg_has_p2p has to walk the entire sg and check every
> page. Then map_sg_p2p/map_sg has to walk it again and repeat the check
> then do some operation per page. If anyone is concerned about the
> dma_map performance this could be an issue.

dma_map performance is a concern; this is why I suggest this as an interim solution until all dma_ops are migrated. Ideally sg_has_p2p would be a fast path that checked some kind of flags bit set during sg_assign_page...

This would probably all have to be protected with CONFIG_P2P until it becomes performance neutral. People without an iommu are not going to want to walk the sg list at all..

> 2) Without knowing exactly what the arch specific code may need to do
> it's hard to say that this is exactly the right approach. If every
> dma_ops provider has to do exactly this on every page it may lead to a
> lot of duplicate code:

I think someone would have to start to look at it to make a determination..

I suspect the main server oriented iommu dma op will want to have proper p2p support anyhow and will probably have its own unique control flow..

> The only thing I'm presently aware of is the segment check and applying
> the offset to the physical address

Well, I called the function p2p_same_segment_map_page() in my last suggestion for a reason - that is all the helper does. The intention would be for real iommu drivers to call that helper for the one simple case, and if it fails then use their own routines to figure out if cross-segment P2P is possible and configure the iommu as needed.

> bus specific and not arch specific which I think is what Dan may be
> getting at. So it may make sense to just have a pci_map_sg_p2p() which
> takes a dma_ops struct it would use for any page that isn't a p2p page.

Like I keep saying, dma_ops are not really designed to be stacked.
Try and write a stacked map_sg function like you describe and you will see how horrible it quickly becomes.

Setting up an iommu is very expensive, so we need to batch it for the entire sg list. Thus a trivial implementation that iterates over all sg list entries is not desired.

So first an sg list without p2p memory would have to be created, passed to the lower level ops, then brought back. Remember, the returned sg list will have a different number of entries than the original. Now another complex loop is needed to split/merge back in the p2p sg elements to get a return result.

Finally, we have to undo all of this when doing unmap.

Basically, all this list processing is a huge overhead compared to just putting a helper call in the existing sg iteration loop of the actual op. Particularly if the actual op is a no-op like no-mmu x86 would use.

Since dma mapping is a performance path we must be careful not to create intrinsic inefficiencies with otherwise nice layering :)

Jason
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Tue, Apr 18, 2017 at 02:11:33PM -0700, Dan Williams wrote:
> > I think this opens an even bigger can of worms..
>
> No, I don't think it does. You'd only shim when the target page is
> backed by a device, not host memory, and you can figure this out by a
> is_zone_device_page()-style lookup.

The bigger can of worms is how do you meaningfully stack dma_ops. What does the p2p provider do when it detects a p2p page?

Jason
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Tue, Apr 18, 2017 at 12:48:35PM -0700, Dan Williams wrote:
> > Yes, I noticed this problem too and that makes sense. It just means
> > every dma_ops will probably need to be modified to either support p2p
> > pages or fail on them. Though, the only real difficulty there is that it
> > will be a lot of work.
>
> I don't think you need to go touch all dma_ops, I think you can just
> arrange for devices that are going to do dma to get redirected to a
> p2p aware provider of operations that overrides the system default
> dma_ops. I.e. just touch get_dma_ops().

I don't follow; when does get_dma_ops() return a p2p aware provider? It has no way to know if the DMA is going to involve p2p, since get_dma_ops is called with the device initiating the DMA.

So you'd always return the P2P shim on a system that has registered P2P memory?

Even so, how does this shim work? dma_ops are not really intended to be stacked. How would we make unmap work, for instance? What happens when the underlying iommu dma ops actually natively understand p2p and don't want the shim?

I think this opens an even bigger can of worms.. Let's find a strategy to safely push this into dma_ops.

What about something more incremental like this instead:

- dma_ops will set map_sg_p2p == map_sg when they are updated to
  support p2p, otherwise DMA on P2P pages will fail for those ops.
- When all ops support p2p we remove the if and ops->map_sg, then
  just call map_sg_p2p.
- For now the scatterlist maintains a bit when pages are added
  indicating if p2p memory might be present in the list.
- Unmap for p2p and non-p2p is the same, the underlying ops driver
  has to make it work.
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 0977317c6835c2..505ed7d502053d 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -103,6 +103,9 @@ struct dma_map_ops {
 	int (*map_sg)(struct device *dev, struct scatterlist *sg,
 		      int nents, enum dma_data_direction dir,
 		      unsigned long attrs);
+	int (*map_sg_p2p)(struct device *dev, struct scatterlist *sg,
+			  int nents, enum dma_data_direction dir,
+			  unsigned long attrs);
 	void (*unmap_sg)(struct device *dev,
 			 struct scatterlist *sg, int nents,
 			 enum dma_data_direction dir,
@@ -244,7 +247,15 @@ static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
 	for_each_sg(sg, s, nents, i)
 		kmemcheck_mark_initialized(sg_virt(s), s->length);
 	BUG_ON(!valid_dma_direction(dir));
-	ents = ops->map_sg(dev, sg, nents, dir, attrs);
+
+	if (sg_has_p2p(sg)) {
+		if (ops->map_sg_p2p)
+			ents = ops->map_sg_p2p(dev, sg, nents, dir, attrs);
+		else
+			return 0;
+	} else
+		ents = ops->map_sg(dev, sg, nents, dir, attrs);
+
 	BUG_ON(ents < 0);
 	debug_dma_map_sg(dev, sg, nents, ents, dir);
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Tue, Apr 18, 2017 at 01:35:32PM -0600, Logan Gunthorpe wrote:
> > Ultimately every dma_ops will need special code to support P2P with
> > the special hardware that ops is controlling, so it makes some sense
> > to start by pushing the check down there in the first place. This
> > advice is partially motivated by how dma_map_sg is just a small
> > wrapper around the function pointer call...
>
> Yes, I noticed this problem too and that makes sense. It just means
> every dma_ops will probably need to be modified to either support p2p
> pages or fail on them. Though, the only real difficulty there is that it
> will be a lot of work.

I think this is why progress on this keeps getting stuck - every solution is a lot of work.

> > Where p2p_same_segment_map_page checks if the two devices are on the
> > 'same switch' and if so returns the address translated to match the
> > bus address programmed into the BAR or fails. We know this case is
> > required to work by the PCI spec, so it makes sense to use it as the
> > first canned helper.
>
> I've also suggested that this check should probably be done (or perhaps
> duplicated) before we even get to the map stage.

Since the mechanics of the check are essentially unique to every dma-ops I would not hoist it out of the map function without a really good reason.

> In the case of nvme-fabrics we'd probably want to let the user know
> when they try to configure it or at least fall back to allocating
> regular memory instead.

You could try to do a dummy mapping / create a MR early on to detect this.

FWIW, I wonder if from an RDMA perspective we have another problem.. Should we allow P2P memory to be used with the local DMA lkey? There are potential designs around virtualization that would not allow that. Should we mandate that P2P memory be in its own MR?

Jason
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Tue, Apr 18, 2017 at 12:30:59PM -0600, Logan Gunthorpe wrote:
> > - The dma ops provider must be able to tell if source memory is bar
> >   mapped and recover the pci device backing the mapping.
>
> Do you mean to say that every dma-ops provider needs to be taught about
> p2p backed pages? I was hoping we could have dma_map_* just use special
> p2p dma-ops if it was passed p2p pages (though there are some
> complications to this too).

I think that is how it will end up working out if this is the path..

Ultimately every dma_ops will need special code to support P2P with the special hardware that ops is controlling, so it makes some sense to start by pushing the check down there in the first place. This advice is partially motivated by how dma_map_sg is just a small wrapper around the function pointer call...

Something like:

foo_dma_map_sg(...)
{
	for (every page in sg)
		if (page is p2p)
			dma_addr[i] = p2p_same_segment_map_page(...);
}

Where p2p_same_segment_map_page checks if the two devices are on the 'same switch' and if so returns the address translated to match the bus address programmed into the BAR, or fails. We know this case is required to work by the PCI spec, so it makes sense to use it as the first canned helper.

This also proves out the basic idea that the dma ops can recover the pci device and perform an inspection of the traversed fabric path.

From there every arch would have to expand the implementation to support a wider range of things. Eg x86 with no iommu and no offset could allow every address to be used based on a host bridge white list.

Jason
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Tue, Apr 18, 2017 at 10:27:47AM -0700, Dan Williams wrote:
> > FWIW, RDMA probably wouldn't want to use a p2mem device either, we
> > already have APIs that map BAR memory to user space, and would like to
> > keep using them. An 'enable P2P for bar' helper function sounds better
> > to me.
>
> ...and I think it's not a helper function as much as asking the bus
> provider "can these two device dma to each other".

What I mean is I could write in an RDMA driver:

	/* Allow the memory in BAR 1 to be the target of P2P transactions */
	pci_enable_p2p_bar(dev, 1);

And not require anything else..

> The "helper" is the dma api redirecting through a software-iommu
> that handles bus address translation differently than it would
> handle host memory dma mapping.

Not sure; until we see what arches actually need to do here it is hard to design common helpers.

Here are a few obvious things that arches will need to implement to support this broadly:

- Virtualization might need to do a hypervisor call to get the right
  translation, or consult some hypervisor specific description table.
- Anything using IOMMUs for virtualization will need to setup IOMMU
  permissions to allow the P2P flow, this might require translation to
  an address cookie.
- Fail if the PCI devices are in different domains, or setup hardware to
  do completion bus/device/function translation.
- All platforms can succeed if the PCI devices are under the same
  'segment', but where segments begin is somewhat platform specific
  knowledge. (This is the 'same switch' idea Logan has talked about.)

So, we can eventually design helpers for various common scenarios, but until we see what arch code actually needs to do it seems premature. Much of this seems to involve interaction with some kind of hardware, or consultation of some kind of currently platform specific data, so I'm not sure what a software-iommu would be doing??
The main thing to agree on is that this code belongs under dma ops and that arches have to support struct page mapped BAR addresses in their dma ops inputs. Is that reasonable?

Jason
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Mon, Apr 17, 2017 at 08:23:16AM +1000, Benjamin Herrenschmidt wrote:
> Thanks :-) There's a reason why I'm insisting on this. We have constant
> requests for this today. We have hacks in the GPU drivers to do it for
> GPUs behind a switch, but those are just that, ad-hoc hacks in the
> drivers. We have similar grossness around the corner with some CAPI
> NICs trying to DMA to GPUs. I have people trying to use PLX DMA engines
> to whack nVME devices.

A lot of people feel this way in the RDMA community too. We have had vendors shipping out of tree code to enable P2P for RDMA with GPUs for years and years now. :(

Attempts to get things in mainline have always run into the same sort of road blocks you've identified in this thread..

FWIW, I read this discussion and it sounds closer to an agreement than I've ever seen in the past.

From Ben's comments, I would think that the 'first class' support that is needed here is simply a function to return the 'struct device' backing a CPU address range.

This is the minimal required information for the arch or IOMMU code under the dma ops to figure out the fabric source/dest, compute the traffic path, determine if P2P is even possible, what translation hardware is crossed, and what DMA address should be used.

If there is going to be more core support for this stuff I think it will be under the topic of more robustly describing the fabric to the core and core helpers to extract data from the description: eg compute the path, check if the path crosses translation, etc.

But that isn't really related to P2P, and is probably better left to the arch authors to figure out where they need to enhance the existing topology data..
I think the key agreement to get out of Logan's series is that P2P DMA means:

- The BAR will be backed by struct pages
- Passing the CPU __iomem address of the BAR to the DMA API is valid
  and, long term, dma ops providers are expected to fail or return the
  right DMA address
- Mapping BAR memory into userspace and back to the kernel via
  get_user_pages works transparently, and with the DMA API above
- The dma ops provider must be able to tell if source memory is bar
  mapped and recover the pci device backing the mapping.

At least this is what we'd like in RDMA :)

FWIW, RDMA probably wouldn't want to use a p2mem device either; we already have APIs that map BAR memory to user space, and would like to keep using them. An 'enable P2P for bar' helper function sounds better to me.

Jason
Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
On Thu, Apr 13, 2017 at 06:26:31PM -0500, Bjorn Helgaas wrote:
> > Ah, thanks for the tip! On my system, this translation returns the same
> > address so it was not necessary. And, yes, that means this would have to
> > find its way into the dma mapping routine somehow. This means we'll
> > eventually need a way to look-up the p2pmem device from the struct page.
> > Which means we will likely need a new flag bit in the struct page or
> > something. The big difficulty I see is testing. Do you know what
> > architectures or in what circumstances are these translations used?
>
> Any caller of pci_add_resource_offset() uses CPU addresses different from
> the PCI bus addresses (unless the offset is zero, of course). All ACPI
> platforms also support this translation (see "translation_offset"), though
> in most x86 systems the offset is zero. I'm aware of one x86 system that
> was tested with a non-zero offset but I don't think it was shipped that
> way.

I'd suggest just detecting if there is any translation in bus addresses anywhere and just hard disabling P2P on such systems.

On modern hardware with 64 bit BARs there is very little reason to have translation, so I think this is a legacy feature.

Jason
Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem
On Thu, Apr 06, 2017 at 08:33:38AM +0300, Sagi Grimberg wrote:
>
> >> Note that the nvme completion queues are still on the host memory, so
> >> this means we have lost the ordering between data and completions as
> >> they go to different pcie targets.
> >
> > Hmm, in this simple up/down case with a switch, I think it might
> > actually be OK.
> >
> > Transactions might not complete at the NVMe device before the CPU
> > processes the RDMA completion, however due to the PCI-E ordering rules
> > new TLPs directed to the NVMe will complete after the RDMA TLPs and
> > thus observe the new data. (eg order preserving)
> >
> > It would be very hard to use P2P if fabric ordering is not preserved..
>
> I think it still can race if the p2p device is connected with more than
> a single port to the switch.
>
> Say it's connected via 2 legs, the bar is accessed from leg A and the
> data from the disk comes via leg B. In this case, the data is heading
> towards the p2p device via leg B (might be congested), the completion
> goes directly to the RC, and then the host issues a read from the
> bar via leg A. I don't understand what can guarantee ordering here.

Right, this is why I qualified my statement with 'simple up/down case'. Make it any more complex and it clearly stops working sanely, but I wouldn't worry about unusual PCI-E fabrics at this point..

> Stephen told me that this still guarantees ordering, but I honestly
> can't understand how, perhaps someone can explain to me in a simple
> way that I can understand.

AFAIK PCI-E ordering is explicitly per link, so things that need order must always traverse the same link.

Jason
Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem
On Tue, Apr 04, 2017 at 01:59:26PM +0300, Sagi Grimberg wrote: > Note that the nvme completion queues are still on the host memory, so > this means we have lost the ordering between data and completions as > they go to different pcie targets. Hmm, in this simple up/down case with a switch, I think it might actually be OK. Transactions might not complete at the NVMe device before the CPU processes the RDMA completion, however due to the PCI-E ordering rules new TLPs directed to the NVMe will complete after the RDMA TLPs and thus observe the new data. (eg order preserving) It would be very hard to use P2P if fabric ordering is not preserved.. Jason
Re: [PATCH v2 01/16] chardev: add helper function to register char devs with a struct device
On Sun, Feb 26, 2017 at 10:21:25AM -0800, Dan Williams wrote: > > + * cdev_device_add() adds the char device represented by @cdev to the > > system, > > + * just as cdev_add does. It then adds @dev to the system using device_add > > + * The dev_t for the char device will be taken from the struct device which > > + * needs to be initialized first. This helper function correctly takes a > > + * reference to the parent device so the parent will not get released until > > + * all references to the cdev are released. > > + * > > + * This function should be used whenever the struct cdev and the > > + * struct device are members of the same structure whose lifetime is > > + * managed by the struct device. > > + */ > > Perhaps add a note here that userspace may have invoked file > operations between cdev_add() and a failing device_add(), so > additional cleanup beyond put_device() (like mmap invalidation) might > be needed. That can be a later follow-on patch. Yes please, that is way too subtle. Suggest: NOTE: Callers must assume that userspace was able to open the cdev and can call cdev fops callbacks at any time, even if this function fails. I would also add a note to cdev_device_del: NOTE: This guarantees that associated sysfs callbacks are not running or runnable, however any open cdevs will remain and their fops remain callable even after this returns. Since I have seen a lot of confusion on that point as well.. Jason
Re: [PATCH v2 06/16] tpm-chip: utilize new cdev_device_add helper function
On Sat, Feb 25, 2017 at 11:38:07PM -0700, Logan Gunthorpe wrote: > Replace the open coded registration of the cdev and dev with the > new device_add_cdev() helper. The helper replaces a common pattern by > taking the proper reference against the parent device and adding both > the cdev and the device. > > Signed-off-by: Logan Gunthorpe > drivers/char/tpm/tpm-chip.c | 19 +++ > 1 file changed, 3 insertions(+), 16 deletions(-) > > diff --git a/drivers/char/tpm/tpm-chip.c b/drivers/char/tpm/tpm-chip.c > index a77262d..933f96e 100644 > +++ b/drivers/char/tpm/tpm-chip.c > @@ -187,7 +187,6 @@ struct tpm_chip *tpm_chip_alloc(struct device *dev, > > cdev_init(&chip->cdev, &tpm_fops); > chip->cdev.owner = THIS_MODULE; > - chip->cdev.kobj.parent = &chip->dev.kobj; > > return chip; > > @@ -230,27 +229,16 @@ static int tpm_add_char_device(struct tpm_chip *chip) > { > int rc; > > - rc = cdev_add(&chip->cdev, chip->dev.devt, 1); > + rc = cdev_device_add(&chip->cdev, &chip->dev); > if (rc) { > dev_err(&chip->dev, > - "unable to cdev_add() %s, major %d, minor %d, err=%d\n", > + "unable to cdev_device_add() %s, major %d, minor %d, > err=%d\n", > dev_name(&chip->dev), MAJOR(chip->dev.devt), > MINOR(chip->dev.devt), rc); > > return rc; > } > > - rc = device_add(&chip->dev); > - if (rc) { > - dev_err(&chip->dev, > - "unable to device_register() %s, major %d, minor %d, > err=%d\n", > - dev_name(&chip->dev), MAJOR(chip->dev.devt), > - MINOR(chip->dev.devt), rc); > - > - cdev_del(&chip->cdev); > - return rc; > - } This will collide with the patch I sent: https://patchwork.kernel.org/patch/9589001/ Jarkko: The resolution will be to continue to call tpm_del_char_device from the cdev_device_add error path and drop all calls to cdev_del Reviewed-by: Jason Gunthorpe Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH] device-dax: fix cdev leak
On Thu, Feb 23, 2017 at 11:22:03AM -0800, Dan Williams wrote: > If device_add() fails, cleanup the cdev. Otherwise, we leak a kobj_map() > with a stale device number. > > Fixes: ba09c01d2fa8 ("dax: convert to the cdev api") > Cc: > Cc: Logan Gunthorpe > Reported-by: Jason Gunthorpe > Signed-off-by: Dan Williams > drivers/dax/dax.c |1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c > index ed758b74ddf0..0f8008dd0b0c 100644 > +++ b/drivers/dax/dax.c > @@ -724,6 +724,7 @@ struct dax_dev *devm_create_dax_dev(struct dax_region > *dax_region, > dev_set_name(dev, "dax%d.%d", dax_region->id, dax_dev->id); > rc = device_add(dev); > if (rc) { > + cdev_del(&dax_dev->cdev); This probably should call into unregister_dax_dev and just skip the device_unregister part. Once cdev_add returns it is possible for a mmap to have been created, so cleanup after that point has to go through all the other unregister_dax_dev steps. Jason
Re: [PATCH 07/14] infiniband: utilize new device_add_cdev helper function
On Tue, Feb 21, 2017 at 04:54:05PM -0700, Logan Gunthorpe wrote: > Is that true? Once device_register or device_add is called then you need > to use put_device. General rule is once kref_init has been called then you should use kref_put and not kfree. device_initialize ultimately calls kref_init.. Reasoning that you don't 'need' to use put_device until device_add is just too hard. For instance, there is still another bug in ib_ucm_add_one: if (device_add(&ucm_dev->dev)) goto err_cdev; if (device_create_file(&ucm_dev->dev, &dev_attr_ibdev)) goto err_dev; [] err_dev: device_unregister(&ucm_dev->dev); err_cdev: ib_ucm_cdev_del(ucm_dev); err: kfree(ucm_dev); return; } If we go down the err_dev path then device_unregister will probably kfree ucm_dev - the argument is not guaranteed valid after device_unregister returns - this is what makes it different from device_del. The simplest fix is to change the unwind into: err_dev: device_del(&ucm_dev->dev); err_cdev: ib_ucm_cdev_del(ucm_dev); err: put_device(&ucm_dev->dev); return; } And the only way to keep our idiomatic goto unwind working is if all 'goto errs' can call put_device - which requires the device_initialize be done before any errors are possible. > In fact device_add is what does the first get_device so this doesn't > add up to me. Not quite, kref_init creates a kref with a count of 1 - eg the caller owns the ref, and that ref must be put to trigger kref release. Thus good kref usage should always pair a kref_put with the kref_init. The get_device at the start of device_add pairs with the put_device at the end of device_del and the kref_init pairs with the put_device at the end of device_unregister. (notice that device_unregister ends up calling put_device twice in a row..) I'll send you a clean patch for your v2. > I know the DAX code only uses put_device after device_add. Looks to me like that code fails to call cdev_del if device_add fails?
This approach is problematic because it is trying to do the ida removals in the release function. That is not necessary and has the side effect of making the release function uncallable until too late in the process. Jason
Re: [PATCH 07/14] infiniband: utilize new device_add_cdev helper function
On Mon, Feb 20, 2017 at 10:00:46PM -0700, Logan Gunthorpe wrote: > This patch updates core/ucm.c which didn't originally use the > cdev.kobj.parent with it's parent device. I did not look heavily into > whether this was a bug or not, but it seems likely to me there would > be a use before free. Hum, it is probably safe - ib_ucm_remove_one can only happen on module unload and the cdev holds the module lock while open. Even so device_add_cdev should be used anyhow. > I also took a look at core/uverbs_main.c, core/user_mad.c and > hw/hfi1/device.c which utilize cdev.kobj.parent but because the > infiniband core seems to use kobjs internally instead of struct devices > they could not be converted to use the new helper API and still > directly manipulate the internals of the kobj. Yes, and hfi1 had the same use-after-free bug when it was first submitted as well. IMHO cdev should have a helper entry for this style of use case as well. > diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c > index e0a995b..38ea316 100644 > +++ b/drivers/infiniband/core/ucm.c > @@ -1283,18 +1283,20 @@ static void ib_ucm_add_one(struct ib_device *device) > set_bit(devnum, dev_map); > } > > + device_initialize(&ucm_dev->dev); Ah, this needs more help. Once device_initialize is called put_device should be the error-unwind, can you use something more like this? From ac7a35ea98066c9a9e3f3e54557c0ccda44b0ffc Mon Sep 17 00:00:00 2001 From: Jason Gunthorpe Date: Tue, 21 Feb 2017 12:07:55 -0700 Subject: [PATCH] infiniband: utilize new device_add_cdev helper The use after free is not triggerable here because the cdev holds the module lock and device_unregister is only triggered by module unload, however make the change for consistency. To make this work the cdev_del needs to move out of the struct device release function.
Signed-off-by: Jason Gunthorpe --- drivers/infiniband/core/ucm.c | 25 ++--- 1 file changed, 14 insertions(+), 11 deletions(-) diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c index 7713ef089c3ccc..acda8d941381f3 100644 --- a/drivers/infiniband/core/ucm.c +++ b/drivers/infiniband/core/ucm.c @@ -1202,12 +1202,16 @@ static void ib_ucm_release_dev(struct device *dev) struct ib_ucm_device *ucm_dev; ucm_dev = container_of(dev, struct ib_ucm_device, dev); + kfree(ucm_dev); +} + +static void ib_ucm_cdev_del(struct ib_ucm_device *ucm_dev) +{ cdev_del(&ucm_dev->cdev); if (ucm_dev->devnum < IB_UCM_MAX_DEVICES) clear_bit(ucm_dev->devnum, dev_map); else clear_bit(ucm_dev->devnum - IB_UCM_MAX_DEVICES, overflow_map); - kfree(ucm_dev); } static const struct file_operations ucm_fops = { @@ -1263,6 +1267,7 @@ static void ib_ucm_add_one(struct ib_device *device) if (!ucm_dev) return; + device_initialize(&ucm_dev->dev); ucm_dev->ib_dev = device; devnum = find_first_zero_bit(dev_map, IB_UCM_MAX_DEVICES); @@ -1280,18 +1285,19 @@ static void ib_ucm_add_one(struct ib_device *device) set_bit(devnum, dev_map); } + ucm_dev->dev.devt = base; + ucm_dev->dev.release = ib_ucm_release_dev; + cdev_init(&ucm_dev->cdev, &ucm_fops); ucm_dev->cdev.owner = THIS_MODULE; kobject_set_name(&ucm_dev->cdev.kobj, "ucm%d", ucm_dev->devnum); - if (cdev_add(&ucm_dev->cdev, base, 1)) + if (device_add_cdev(&ucm_dev->dev, &ucm_dev->cdev)) goto err; ucm_dev->dev.class = &cm_class; ucm_dev->dev.parent = device->dma_device; - ucm_dev->dev.devt = ucm_dev->cdev.dev; - ucm_dev->dev.release = ib_ucm_release_dev; dev_set_name(&ucm_dev->dev, "ucm%d", ucm_dev->devnum); - if (device_register(&ucm_dev->dev)) + if (device_add(&ucm_dev->dev)) goto err_cdev; if (device_create_file(&ucm_dev->dev, &dev_attr_ibdev)) @@ -1303,13 +1309,9 @@ static void ib_ucm_add_one(struct ib_device *device) err_dev: device_unregister(&ucm_dev->dev); err_cdev: - cdev_del(&ucm_dev->cdev); - if (ucm_dev->devnum < 
IB_UCM_MAX_DEVICES) - clear_bit(devnum, dev_map); - else - clear_bit(devnum, overflow_map); + ib_ucm_cdev_del(ucm_dev); err: - kfree(ucm_dev); + put_device(&ucm_dev->dev); return; } @@ -1320,6 +1322,7 @@ static void ib_ucm_remove_one(struct ib_device *device, void *client_data) if (!ucm_dev) return; + ib_ucm_cdev_del(ucm_dev); device_unregister(&ucm_dev->dev); } -- 2.7.4
Re: [PATCH 05/14] tpm-chip: utilize new device_add_cdev helper function
On Mon, Feb 20, 2017 at 10:00:44PM -0700, Logan Gunthorpe wrote: > Signed-off-by: Logan Gunthorpe > drivers/char/tpm/tpm-chip.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) Reviewed-by: Jason Gunthorpe Jason
Re: Enabling peer to peer device transactions for PCIe devices
On Thu, Jan 12, 2017 at 10:11:29AM -0500, Jerome Glisse wrote: > On Wed, Jan 11, 2017 at 10:54:39PM -0600, Stephen Bates wrote: > > > What we want is for RDMA, O_DIRECT, etc to just work with special VMAs > > > (ie. at least those backed with ZONE_DEVICE memory). Then > > > GPU/NVME/DAX/whatever drivers can just hand these VMAs to userspace > > > (using whatever interface is most appropriate) and userspace can do what > > > it pleases with them. This makes _so_ much sense and actually largely > > > already works today (as demonstrated by iopmem). > So i say let solve the IOMMU issue first and let everyone use it in their > own way with their device. I do not think we can share much more than > that. Solve it for the easy ZONE_DIRECT/etc case then. Jason
Re: Enabling peer to peer device transactions for PCIe devices
On Fri, Jan 06, 2017 at 12:37:22PM -0500, Jerome Glisse wrote: > On Fri, Jan 06, 2017 at 11:56:30AM -0500, Serguei Sagalovitch wrote: > > On 2017-01-05 08:58 PM, Jerome Glisse wrote: > > > On Thu, Jan 05, 2017 at 05:30:34PM -0700, Jason Gunthorpe wrote: > > > > On Thu, Jan 05, 2017 at 06:23:52PM -0500, Jerome Glisse wrote: > > > > > > > > > > I still don't understand what you're driving at - you've said in both > > > > > > cases a user VMA exists. > > > > > In the former case no, there is no VMA directly but if you want one > > > > > than > > > > > a device can provide one. But such VMA is useless as CPU access is not > > > > > expected. > > > > I disagree it is useless, the VMA is going to be necessary to support > > > > upcoming things like CAPI, you need it to support O_DIRECT from the > > > > filesystem, DPDK, etc. This is why I am opposed to any model that is > > > > not VMA based for setting up RDMA - that is short-sighted and does > > > > not seem to reflect where the industry is going. > > > > > > > > So focus on having VMA backed by actual physical memory that covers > > > > your GPU objects and ask how do we wire up the '__user *' to the DMA > > > > API in the best way so the DMA API still has enough information to > > > > setup IOMMUs and whatnot. > > > I am talking about 2 different thing. Existing hardware and API where you > > > _do not_ have a vma and you do not need one. This is just > > > > existing stuff. > > I do not understand why you assume that existing API doesn't need one. > > I would say that a lot of __existing__ user level API and their support in > > kernel (especially outside of graphics domain) assumes that we have vma and > > deal with __user * pointers. +1 > Well i am thinking to GPUDirect here. Some of GPUDirect use case do not have > vma (struct vm_area_struct) associated with them they directly apply to GPU > object that aren't expose to CPU. Yes some use case have vma for share buffer. Let's stop talking about GPU direct.
Today we can't even make a VMA pointing at a PCI BAR work properly in the kernel - let's start there, please. People can argue over other options once that is done. > For HMM plan is to restrict to ODP and either to replace ODP with HMM or > change > ODP to not use get_user_pages_remote() but directly fetch informations from > CPU page table. Everything else stay as it is. I posted patchset to replace > ODP with HMM in the past. Make a generic API for all of this and you'd have my vote.. IMHO, you must support basic pinning semantics - that is necessary to support generic short lived DMA (eg filesystem, etc). That hardware can clearly do that if it can support ODP. Jason
Re: Enabling peer to peer device transactions for PCIe devices
On Thu, Jan 05, 2017 at 06:23:52PM -0500, Jerome Glisse wrote: > > I still don't understand what you're driving at - you've said in both > > cases a user VMA exists. > > In the former case no, there is no VMA directly but if you want one than > a device can provide one. But such VMA is useless as CPU access is not expected. I disagree it is useless, the VMA is going to be necessary to support upcoming things like CAPI, you need it to support O_DIRECT from the filesystem, DPDK, etc. This is why I am opposed to any model that is not VMA based for setting up RDMA - that is short-sighted and does not seem to reflect where the industry is going. So focus on having VMA backed by actual physical memory that covers your GPU objects and ask how do we wire up the '__user *' to the DMA API in the best way so the DMA API still has enough information to setup IOMMUs and whatnot. > What i was trying to get accross is that no matter what level you > consider in the end you still need something at the DMA API level. > And that the 2 different use case (device vma or regular vma) means > 2 differents API for the device driver. I agree we need new stuff at the DMA API level, but I am opposed to the idea we need two API paths that the *driver* has to figure out. That is fundamentally not what I want as a driver developer. Give me a common API to convert '__user *' to a scatter list and pin the pages. This needs to figure out your two cases. And Huge Pages. And ZONE_DIRECT.. (a better get_user_pages) Give me an API to take the scatter list and DMA map it, handling all the stuff associated with peer-peer. (a better dma_map_sg) Give me a notifier scheme to rework my scatter list when physical pages need to change (mmu notifiers) Use the scatter list memory to convey needed information from the first step to the second. Do not bother the driver with distinctions on what kind of memory is behind that VMA.
Don't ask me to use get_user_pages or gpu_get_user_pages, do not ask me to use dma_map_sg or dma_map_sg_peer_direct. The Driver Doesn't Need To Know. IMHO this is why GPU direct is not mergeable - it creates a crazy parallel mini-mm subsystem inside RDMA and uses that to connect to a GPU driver, everything is expected to have parallel paths for GPU direct and normal MM. No good at all. > > So, how do you identify these GPU objects? How do you expect RDMA to > > convert them to scatter lists? How will ODP work? > > No ODP on those. If you want vma, the GPU device driver can provide You said you needed invalidate, that has to be done via ODP. Jason
Re: Enabling peer to peer device transactions for PCIe devices
On Thu, Jan 05, 2017 at 03:19:36PM -0500, Jerome Glisse wrote: > > Always having a VMA changes the discussion - the question is how to > > create a VMA that represents IO device memory, and how do DMA > > consumers extract the correct information from that VMA to pass to the > > kernel DMA API so it can setup peer-peer DMA. > > Well my point is that it can't be. In HMM case inside a single VMA > you [..] > In the GPUDirect case the idea is that you have a specific device vma > that you map for peer to peer. [..] I still don't understand what you're driving at - you've said in both cases a user VMA exists. From my perspective in RDMA, all I want is a core kernel flow to convert a '__user *' into a scatter list of DMA addresses, that works no matter what is backing that VMA, be it HMM, a 'hidden' GPU object, or struct page memory. A '__user *' pointer is the only way to setup a RDMA MR, and I see no reason to have another API at this time. The details of how to translate to a scatter list are a MM subject, and the MM folks need to get it right. I just don't care if that routine works at a page level, or a whole VMA level, or some combination of both, that is up to the MM team to figure out :) > a page level. Expectation here is that the GPU userspace expose a special > API to allow RDMA to directly happen on GPU object allocated through > GPU specific API (ie it is not regular memory and it is not accessible > by CPU). So, how do you identify these GPU objects? How do you expect RDMA to convert them to scatter lists? How will ODP work? > > We have MMU notifiers to handle this today in RDMA. Async RDMA MR > > Invalidate like you see in the above out of tree patches is totally > > crazy and shouldn't be in mainline. Use ODP capable RDMA hardware. > > Well there is still a large base of hardware that do not have such > feature and some people would like to be able to keep using those. Hopefully someone will figure out how to do that without the crazy async MR invalidation.
Jason ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: Enabling peer to peer device transactions for PCIe devices
On Thu, Jan 05, 2017 at 02:54:24PM -0500, Jerome Glisse wrote: > Mellanox and NVidia support peer to peer with what they market a > GPUDirect. It only works without IOMMU. It is probably not upstream : > > https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg21402.html > > I thought it was but it seems it require an out of tree driver to work. Right, it is out of tree and not under consideration for mainline. > Wether there is a vma or not isn't important to the issue anyway. If > you want to enforce VMA rule for RDMA it is an RDMA specific discussion > in which i don't want to be involve, it is not my turf :) Always having a VMA changes the discussion - the question is how to create a VMA that represents IO device memory, and how do DMA consumers extract the correct information from that VMA to pass to the kernel DMA API so it can setup peer-peer DMA. > What matter is the back channel API between peer-to-peer device. Like > the above patchset points out for GPU we need to be able to invalidate > a mapping at any point in time. Pining is not something we want to > live with. We have MMU notifiers to handle this today in RDMA. Async RDMA MR Invalidate like you see in the above out of tree patches is totally crazy and shouldn't be in mainline. Use ODP capable RDMA hardware. Jason
Re: Enabling peer to peer device transactions for PCIe devices
On Thu, Jan 05, 2017 at 01:39:29PM -0500, Jerome Glisse wrote: > 1) peer-to-peer because of userspace specific API like NVidia GPU > direct (AMD is pushing its own similar API i just can't remember > marketing name). This does not happen through a vma, this happens > through specific device driver call going through device specific > ioctl on both side (GPU and RDMA). So both kernel driver are aware > of each others. Today you can only do user-initiated RDMA operations in conjunction with a VMA. We'd need a really big and strong reason to create an entirely new non-VMA based memory handle scheme for RDMA. So my inclination is to just completely push back on this idea. You need a VMA to do RDMA. GPUs need to create VMAs for the memory they want to RDMA from, even if the VMA handle just causes SIGBUS for any CPU access. Jason
Re: Enabling peer to peer device transactions for PCIe devices
On Tue, Dec 06, 2016 at 09:51:15AM -0700, Logan Gunthorpe wrote: > Hey, > > On 06/12/16 09:38 AM, Jason Gunthorpe wrote: > >>> I'm not opposed to mapping /dev/nvmeX. However, the lookup is trivial > >>> to accomplish in sysfs through /sys/dev/char to find the sysfs path of the > >>> device-dax instance under the nvme device, or if you already have the nvme > >>> sysfs path the dax instance(s) will appear under the "dax" sub-directory. > >> > >> Personally I think mapping the dax resource in the sysfs tree is a nice > >> way to do this and a bit more intuitive than mapping a /dev/nvmeX. > > > > It is still not at all clear to me what userpsace is supposed to do > > with this on nvme.. How is the CMB usable from userspace? > > The flow is pretty simple. For example to write to NVMe from an RDMA device: > > 1) Obtain a chunk of the CMB to use as a buffer(either by mmaping > /dev/nvmx, the device dax char device or through a block layer interface > (which sounds like a good suggestion from Christoph, but I'm not really > sure how it would look). Okay, so clearly this needs a kernel side NVMe specific allocator and locking so users don't step on each other.. Or as Christoph says some kind of general mechanism to get these bounce buffers.. > 2) Create an MR with the buffer and use an RDMA function to fill it with > data from a remote host. This will cause the RDMA hardware to write > directly to the memory in the NVMe card. > > 3) Using O_DIRECT, write the buffer to a file on the NVMe filesystem. > When the address reaches hardware the NVMe will recognize it as local > memory and copy it directly there. Ah, I see. As a first draft I'd stick with some kind of API built into the /dev/nvmeX that backs the filesystem. The user app would fstat the target file, open /dev/block/MAJOR(st_dev):MINOR(st_dev), do some ioctl to get a CMB mmap, and then proceed from there.. 
When that is all working kernel-side, it would make sense to look at a more general mechanism that could be used unprivileged?? > Thus we are able to transfer data to any file on an NVMe device without > going through system memory. This has benefits on systems with lots of > activity in system memory but step 3 is likely to be slowish due to the > need to pin/unpin the memory for every transaction. This is similar to the GPU issues too.. On NVMe you don't need to pin the pages, you just need to lock that VMA so it doesn't get freed from the NVMe CMB allocator while the IO is running... Probably in the long run the get_user_pages is going to have to be pushed down into drivers.. Future MMU coherent IO hardware also does not need the pinning or other overheads. Jason
Re: Enabling peer to peer device transactions for PCIe devices
> > I'm not opposed to mapping /dev/nvmeX. However, the lookup is trivial > > to accomplish in sysfs through /sys/dev/char to find the sysfs path of the > > device-dax instance under the nvme device, or if you already have the nvme > > sysfs path the dax instance(s) will appear under the "dax" sub-directory. > > Personally I think mapping the dax resource in the sysfs tree is a nice > way to do this and a bit more intuitive than mapping a /dev/nvmeX. It is still not at all clear to me what userspace is supposed to do with this on nvme.. How is the CMB usable from userspace? Jason
Re: Enabling peer to peer device transactions for PCIe devices
On Mon, Dec 05, 2016 at 12:27:20PM -0700, Logan Gunthorpe wrote: > > > On 05/12/16 12:14 PM, Jason Gunthorpe wrote: > >But CMB sounds much more like the GPU case where there is a > >specialized allocator handing out the BAR to consumers, so I'm not > >sure a general purpose chardev makes a lot of sense? > > I don't think it will ever need to be as complicated as the GPU case. There > will probably only ever be a relatively small amount of memory behind the > CMB and really the only users are those doing P2P work. Thus the specialized > allocator could be pretty simple and I expect it would be fine to just > return -ENOMEM if there is not enough memory. NVMe might have to deal with pci-e hot-unplug, which is a similar problem-class to the GPU case.. In any event the allocator still needs to track which regions are in use and be able to hook 'free' from userspace. That does suggest it should be integrated into the nvme driver and not a bolt on driver.. Jason
Re: Enabling peer to peer device transactions for PCIe devices
On Mon, Dec 05, 2016 at 10:48:58AM -0800, Dan Williams wrote: > On Mon, Dec 5, 2016 at 10:39 AM, Logan Gunthorpe wrote: > > On 05/12/16 11:08 AM, Dan Williams wrote: > >> > >> I've already recommended that iopmem not be a block device and instead > >> be a device-dax instance. I also don't think it should claim the PCI > >> ID, rather the driver that wants to map one of its bars this way can > >> register the memory region with the device-dax core. > >> > >> I'm not sure there are enough device drivers that want to do this to > >> have it be a generic /sys/.../resource_dmableX capability. It still > >> seems to be an exotic one-off type of configuration. > > > > > > Yes, this is essentially my thinking. Except I think the userspace interface > > should really depend on the device itself. Device dax is a good choice for > > many and I agree the block device approach wouldn't be ideal. > > > > Specifically for NVME CMB: I think it would make a lot of sense to just hand > > out these mappings with an mmap call on /dev/nvmeX. I expect CMB buffers > > would be volatile and thus you wouldn't need to keep track of where in the > > BAR the region came from. Thus, the mmap call would just be an allocator > > from BAR memory. If device-dax were used, userspace would need to lookup > > which device-dax instance corresponds to which nvme drive. > > I'm not opposed to mapping /dev/nvmeX. However, the lookup is trivial > to accomplish in sysfs through /sys/dev/char to find the sysfs path > of But CMB sounds much more like the GPU case where there is a specialized allocator handing out the BAR to consumers, so I'm not sure a general purpose chardev makes a lot of sense? Jason
Re: Enabling peer to peer device transactions for PCIe devices
On Mon, Dec 05, 2016 at 09:40:38AM -0800, Dan Williams wrote: > > If it is kernel only with physical addresses we don't need a uAPI for > > it, so I'm not sure #1 is at all related to iopmem. > > > > Most people who want #1 probably can just mmap > > /sys/../pci/../resourceX to get a user handle to it, or pass around > > __iomem pointers in the kernel. This has been asked for before with > > RDMA. > > > > I'm still not really clear what iopmem is for, or why DAX should ever > > be involved in this.. > > Right, by default remap_pfn_range() does not establish DMA capable > mappings. You can think of iopmem as remap_pfn_range() converted to > use devm_memremap_pages(). Given the extra constraints of > devm_memremap_pages() it seems reasonable to have those DMA capable > mappings be optionally established via a separate driver. Except the iopmem driver claims the PCI ID, and presents a block interface which is really *NOT* what people who have asked for this in the past have wanted. IIRC it was embedded stuff eg RDMA video directly out of a capture card or a similar kind of thinking. It is a good point about devm_memremap_pages limitations, but maybe that just says to create a /sys/.../resource_dmableX ? Or is there some reason why people want a filesystem on top of BAR memory? That does not seem to have been covered yet.. Jason