Re: [PATCH] mm/gup: combine put_compound_head() and unpin_user_page()

2020-12-09 Thread Jason Gunthorpe
On Wed, Dec 09, 2020 at 12:57:38PM -0800, John Hubbard wrote:
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 98eb8e6d2609c3..7b33b7d4b324d7 100644
> > +++ b/mm/gup.c
> > @@ -123,6 +123,28 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
> > return NULL;
> >   }
> > +static void put_compound_head(struct page *page, int refs, unsigned int flags)
> > +{
> 
> It might be nice to rename "page" to "head", here.
> 
> While reading this I toyed with the idea of having this at the top:
> 
>   VM_BUG_ON_PAGE(compound_head(page) != page, page);
> 
> ...but it's overkill in a static function with pretty clear call sites. So I
> think it's just right as-is.

Matt's folio patches would take care of this; when that is available
all this becomes a lot better.

Thanks,
Jason


[PATCH] mm/gup: combine put_compound_head() and unpin_user_page()

2020-12-09 Thread Jason Gunthorpe
These functions accomplish the same thing but have different
implementations.

unpin_user_page() has a bug where it calls mod_node_page_state() after
calling put_page(), which creates a risk that the page could have been
hot-unplugged from the system.

Fix this by using put_compound_head() as the only implementation.

__unpin_devmap_managed_user_page() and related can be deleted as well in
favour of the simpler, but slower, version in put_compound_head() that has
an extra atomic page_ref_sub, but always calls put_page() which internally
contains the special devmap code.

Move put_compound_head() to be directly after try_grab_compound_head() so
people can find it in future.

Fixes: 1970dc6f5226 ("mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting")
Signed-off-by: Jason Gunthorpe 
---
 mm/gup.c | 103 +--
 1 file changed, 23 insertions(+), 80 deletions(-)

With Matt's folio idea, the next step would be to make a
  put_folio(folio, refs)

Which would cleanly eliminate that extra atomic here without duplicating the
devmap special case.

This should also be called 'ungrab_compound_head' as we seem to be using the
word 'grab' to mean 'pin or get' depending on GUP flags.
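
Very roughly, I'm thinking of something like this (pure sketch; the
folio type and all of these helpers are hypothetical until that work
lands):

static void put_folio(struct folio *folio, int refs)
{
	VM_BUG_ON_FOLIO(folio_ref_count(folio) < refs, folio);
	/*
	 * One atomic sub covers all the refs, and the final put still
	 * funnels through the devmap-aware free path the way put_page()
	 * does today.
	 */
	if (folio_ref_sub_and_test(folio, refs))
		__put_folio(folio);	/* hypothetical final-put helper */
}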

diff --git a/mm/gup.c b/mm/gup.c
index 98eb8e6d2609c3..7b33b7d4b324d7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -123,6 +123,28 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
return NULL;
 }
 
+static void put_compound_head(struct page *page, int refs, unsigned int flags)
+{
+   if (flags & FOLL_PIN) {
+   mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED,
+   refs);
+
+   if (hpage_pincount_available(page))
+   hpage_pincount_sub(page, refs);
+   else
+   refs *= GUP_PIN_COUNTING_BIAS;
+   }
+
+   VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
+   /*
+* Calling put_page() for each ref is unnecessarily slow. Only the last
+* ref needs a put_page().
+*/
+   if (refs > 1)
+   page_ref_sub(page, refs - 1);
+   put_page(page);
+}
+
 /**
  * try_grab_page() - elevate a page's refcount by a flag-dependent amount
  *
@@ -177,41 +199,6 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
return true;
 }
 
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-static bool __unpin_devmap_managed_user_page(struct page *page)
-{
-   int count, refs = 1;
-
-   if (!page_is_devmap_managed(page))
-   return false;
-
-   if (hpage_pincount_available(page))
-   hpage_pincount_sub(page, 1);
-   else
-   refs = GUP_PIN_COUNTING_BIAS;
-
-   count = page_ref_sub_return(page, refs);
-
-   mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
-   /*
-* devmap page refcounts are 1-based, rather than 0-based: if
-* refcount is 1, then the page is free and the refcount is
-* stable because nobody holds a reference on the page.
-*/
-   if (count == 1)
-   free_devmap_managed_page(page);
-   else if (!count)
-   __put_page(page);
-
-   return true;
-}
-#else
-static bool __unpin_devmap_managed_user_page(struct page *page)
-{
-   return false;
-}
-#endif /* CONFIG_DEV_PAGEMAP_OPS */
-
 /**
  * unpin_user_page() - release a dma-pinned page
  * @page:pointer to page to be released
@@ -223,28 +210,7 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
  */
 void unpin_user_page(struct page *page)
 {
-   int refs = 1;
-
-   page = compound_head(page);
-
-   /*
-* For devmap managed pages we need to catch refcount transition from
-* GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
-* page is free and we need to inform the device driver through
-* callback. See include/linux/memremap.h and HMM for details.
-*/
-   if (__unpin_devmap_managed_user_page(page))
-   return;
-
-   if (hpage_pincount_available(page))
-   hpage_pincount_sub(page, 1);
-   else
-   refs = GUP_PIN_COUNTING_BIAS;
-
-   if (page_ref_sub_and_test(page, refs))
-   __put_page(page);
-
-   mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
+   put_compound_head(compound_head(page), 1, FOLL_PIN);
 }
 EXPORT_SYMBOL(unpin_user_page);
 
@@ -2062,29 +2028,6 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
  * This code is based heavily on the PowerPC implementation by Nick Piggin.
  */
 #ifdef CONFIG_HAVE_FAST_GUP
-
-static void put_compound_head(struct page *page, int refs, unsigned int flags)
-{
-   if (flags & FOLL_PIN) {
-   mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED,
-

Re: regression from 5.10.0-rc3: BUG: Bad page state in process kworker/41:0 pfn:891066 during fio on devdax

2020-11-09 Thread Jason Gunthorpe
On Mon, Nov 09, 2020 at 01:54:42PM -0400, Jason Gunthorpe wrote:
> On Mon, Nov 09, 2020 at 09:26:19AM -0800, Dan Williams wrote:
> > On Mon, Nov 9, 2020 at 6:12 AM Jason Gunthorpe  wrote:
> > >
> > > Wow, this is surprising
> > >
> > > This has been widely backported already, Dan please check??
> > >
> > > I thought pgprot_decrypted was a NOP on most x86 platforms -
> > > sme_me_mask == 0:
> > >
> > > #define __sme_set(x)  ((x) | sme_me_mask)
> > > #define __sme_clr(x)  ((x) & ~sme_me_mask)
> > >
> > > ??
> > >
> > > Confused how this can be causing DAX issues
> > 
> > Does that correctly preserve the "soft" pte bits? Especially
> > PTE_DEVMAP that DAX uses?
> > 
> > I'll check...
> 
>  extern u64 sme_me_mask;
>  #define __pgprot(x)  ((pgprot_t) { (x) } )
>  #define pgprot_val(x)  ((x).pgprot)
>  #define __sme_clr(x)  ((x) & ~sme_me_mask)
>  #define pgprot_decrypted(prot)   __pgprot(__sme_clr(pgprot_val(prot)))
> 
>  static inline int io_remap_pfn_range(struct vm_area_struct *vma,
>  unsigned long addr, unsigned long pfn,
>  unsigned long size, pgprot_t prot)
>  {
>    return remap_pfn_range(vma, addr, pfn, size, pgprot_decrypted(prot));
>  }
> 
> Not seeing how that could change the pgprot in any harmful way?
> 
> Yi, are you using a platform where sme_me_mask != 0 ? 
> 
> That code looks clearly like it would only trigger on AMD SME systems,
> is that what you are using?

Can't be, the system is too old:

 [  398.455914] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/05/2016

I'm at a total loss as to how this change could even do anything on a
non-AMD system, let alone how it intersects in any way with DEVDAX,
which I could not find being used with io_remap_pfn_range().
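
To be concrete, with sme_me_mask == 0 the whole transform is a bit
identity; e.g. this stand-alone check (illustrative user-space code,
not kernel code) can never fire:

#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint64_t sme_me_mask = 0;		/* non-SME system */
	uint64_t prot = 0x8000000000000867;	/* arbitrary pgprot bits */

	/* pgprot_decrypted(prot) reduces to prot & ~sme_me_mask */
	assert((prot & ~sme_me_mask) == prot);
	return 0;
}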

How confident are you in the bisection?

Jason


Re: [PATCH v3 00/23] device-dax: Support sub-dividing soft-reserved ranges

2020-08-04 Thread Jason Gunthorpe
On Fri, Jul 31, 2020 at 08:24:58PM -0700, Dan Williams wrote:

> - Fix test_hmm and other compilation fixups (Ralph)

The hmm parts look OK

Acked-by: Jason Gunthorpe 

Jason


Re: [PATCH v2 11/12] PM, libnvdimm: Add 'mem-quiet' state and callback for firmware activation

2020-07-09 Thread Jason Gunthorpe
On Thu, Jul 09, 2020 at 09:10:06AM -0700, Dan Williams wrote:
> On Thu, Jul 9, 2020 at 8:39 AM Jason Gunthorpe  wrote:
> >
> > On Thu, Jul 09, 2020 at 04:00:51PM +0100, Christoph Hellwig wrote:
> > > On Mon, Jul 06, 2020 at 06:59:32PM -0700, Dan Williams wrote:
> > > > The runtime firmware activation capability of Intel NVDIMM devices
> > > > requires memory transactions to be disabled for 100s of microseconds.
> > > > This timeout is large enough to cause in-flight DMA to fail and other
> > > > application detectable timeouts. Arrange for firmware activation to be
> > > > executed while the system is "quiesced", all processes and device-DMA
> > > > frozen.
> > > >
> > > > It is already required that invoking device ->freeze() callbacks is
> > > > sufficient to cease DMA. A device that continues memory writes outside
> > > > of user-direction violates expectations of the PM core to
> > > > establish a coherent hibernation image.
> > > >
> > > > That said, RDMA devices are an example of a device that accesses memory
> > > > outside of user process direction.
> >
> > Are you saying freeze doesn't work for some RDMA drivers? That would
> > be a driver bug, I think.
> 
> Right, it's more my hunch than a known bug at this point, but in my
> experience with testing server class hardware when I've reported a
> power management bugs I've sometimes got the incredulous response "who
> suspends / hibernates servers!?". I can drop that comment.
> 
> Are there protocol timeouts that might need to be adjusted for a 100s
> of microseconds blip in memory controller response?

Survivability depends a lot on HW support; it has to suspend, not
discard, the DMAs that it needs to issue. Most likely things are as you
say, and HW doesn't support a safe short-time suspend. The usual use of
PM stuff here is to make the machine ready for kexec.

Jason


Re: [PATCH v2 11/12] PM, libnvdimm: Add 'mem-quiet' state and callback for firmware activation

2020-07-09 Thread Jason Gunthorpe
On Thu, Jul 09, 2020 at 04:00:51PM +0100, Christoph Hellwig wrote:
> On Mon, Jul 06, 2020 at 06:59:32PM -0700, Dan Williams wrote:
> > The runtime firmware activation capability of Intel NVDIMM devices
> > requires memory transactions to be disabled for 100s of microseconds.
> > This timeout is large enough to cause in-flight DMA to fail and other
> > application detectable timeouts. Arrange for firmware activation to be
> > executed while the system is "quiesced", all processes and device-DMA
> > frozen.
> > 
> > It is already required that invoking device ->freeze() callbacks is
> > sufficient to cease DMA. A device that continues memory writes outside
> > of user-direction violates expectations of the PM core to
> > establish a coherent hibernation image.
> > 
> > That said, RDMA devices are an example of a device that accesses memory
> > outside of user process direction.

Are you saying freeze doesn't work for some RDMA drivers? That would
be a driver bug, I think.

The consequences of doing freeze are pretty serious, but it should
still stop DMA.

Jason


Re: [PATCH v3] libnvdimm: Enable unit test infrastructure compile checks

2019-09-01 Thread Jason Gunthorpe
On Sat, Aug 31, 2019 at 11:52:47AM -0700, Dan Williams wrote:
> The infrastructure to mock core libnvdimm routines for unit testing
> purposes is prone to bitrot relative to refactoring of that core.
> Arrange for the unit test core to be built when CONFIG_COMPILE_TEST=y.
> This does not result in a functional unit test environment, it is only a
> helper for 0day to catch unit test build regressions.
> 
> Note that there are a few x86isms in the implementation, so this does
> not bother compile testing on architectures other than 64-bit x86.
> 
> Cc: Jérôme Glisse 
> Cc: Jason Gunthorpe 
> Reported-by: Christoph Hellwig 
> Signed-off-by: Dan Williams 
> Link: 
> https://lore.kernel.org/r/156097224232.1086847.9463861924683372741.st...@dwillia2-desk3.amr.corp.intel.com
> Changes since v2:
> - Fixed a 0day report when the unit test infrastructure is built-in.
>   Arrange for it to only compile as a module. This has received a build
>   success notification from the robot over 142 configs.
> 
> Hi Jason,
> 
> As we discussed previously please take this through the hmm tree to give
> compile coverage for devm_memremap_pages() updates.

Okay, applied again and 0-day passed

Thanks,
Jason


Re: [PATCH 2/4] memremap: remove the dev field in struct dev_pagemap

2019-08-21 Thread Jason Gunthorpe
On Wed, Aug 21, 2019 at 01:24:20PM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 20, 2019 at 07:58:22PM -0700, Dan Williams wrote:
> > On Tue, Aug 20, 2019 at 6:27 AM Jason Gunthorpe  wrote:
> > >
> > > On Mon, Aug 19, 2019 at 06:44:02PM -0700, Dan Williams wrote:
> > > > On Sun, Aug 18, 2019 at 2:12 AM Christoph Hellwig  wrote:
> > > > >
> > > > > The dev field in struct dev_pagemap is only used to print dev_name in
> > > > > two places, which are at best nice to have.  Just remove the field
> > > > > and thus the name in those two messages.
> > > > >
> > > > > Signed-off-by: Christoph Hellwig 
> > > > > Reviewed-by: Ira Weiny 
> > > >
> > > > Needs the below as well.
> > > >
> > > > /me goes to check if he ever merged the fix to make the unit test
> > > > stuff get built by default with COMPILE_TEST [1]. Argh! Nope, didn't
> > > > submit it for 5.3-rc1, sorry for the thrash.
> > > >
> > > > You can otherwise add:
> > > >
> > > > Reviewed-by: Dan Williams 
> > > >
> > > > [1]: 
> > > > https://lore.kernel.org/lkml/156097224232.1086847.9463861924683372741.st...@dwillia2-desk3.amr.corp.intel.com/
> > >
> > > Can you get this merged? Do you want it to go with this series?
> > 
> > Yeah, makes some sense to let you merge it so that you can get
> > kbuild-robot reports about any follow-on memremap_pages() work that
> > may trip up the build. Otherwise let me know and I'll get it queued
> > with the other v5.4 libnvdimm pending bits.
> 
> Done, I used it already to test build the last series from CH..

It failed 0-day, I'm guessing some missing kconfig stuff

For now I dropped it, but, if you send a v2 I can forward it toward
0-day again!

Thanks,
Jason


Re: [PATCH 2/4] memremap: remove the dev field in struct dev_pagemap

2019-08-21 Thread Jason Gunthorpe
On Tue, Aug 20, 2019 at 07:58:22PM -0700, Dan Williams wrote:
> On Tue, Aug 20, 2019 at 6:27 AM Jason Gunthorpe  wrote:
> >
> > On Mon, Aug 19, 2019 at 06:44:02PM -0700, Dan Williams wrote:
> > > On Sun, Aug 18, 2019 at 2:12 AM Christoph Hellwig  wrote:
> > > >
> > > > The dev field in struct dev_pagemap is only used to print dev_name in
> > > > two places, which are at best nice to have.  Just remove the field
> > > > and thus the name in those two messages.
> > > >
> > > > Signed-off-by: Christoph Hellwig 
> > > > Reviewed-by: Ira Weiny 
> > >
> > > Needs the below as well.
> > >
> > > /me goes to check if he ever merged the fix to make the unit test
> > > stuff get built by default with COMPILE_TEST [1]. Argh! Nope, didn't
> > > submit it for 5.3-rc1, sorry for the thrash.
> > >
> > > You can otherwise add:
> > >
> > > Reviewed-by: Dan Williams 
> > >
> > > [1]: 
> > > https://lore.kernel.org/lkml/156097224232.1086847.9463861924683372741.st...@dwillia2-desk3.amr.corp.intel.com/
> >
> > Can you get this merged? Do you want it to go with this series?
> 
> Yeah, makes some sense to let you merge it so that you can get
> kbuild-robot reports about any follow-on memremap_pages() work that
> may trip up the build. Otherwise let me know and I'll get it queued
> with the other v5.4 libnvdimm pending bits.

Done, I used it already to test build the last series from CH..

Jason


Re: [PATCH 2/4] memremap: remove the dev field in struct dev_pagemap

2019-08-20 Thread Jason Gunthorpe
On Mon, Aug 19, 2019 at 06:44:02PM -0700, Dan Williams wrote:
> On Sun, Aug 18, 2019 at 2:12 AM Christoph Hellwig  wrote:
> >
> > The dev field in struct dev_pagemap is only used to print dev_name in
> > two places, which are at best nice to have.  Just remove the field
> > and thus the name in those two messages.
> >
> > Signed-off-by: Christoph Hellwig 
> > Reviewed-by: Ira Weiny 
> 
> Needs the below as well.
> 
> /me goes to check if he ever merged the fix to make the unit test
> stuff get built by default with COMPILE_TEST [1]. Argh! Nope, didn't
> submit it for 5.3-rc1, sorry for the thrash.
> 
> You can otherwise add:
> 
> Reviewed-by: Dan Williams 
> 
> [1]: 
> https://lore.kernel.org/lkml/156097224232.1086847.9463861924683372741.st...@dwillia2-desk3.amr.corp.intel.com/

Can you get this merged? Do you want it to go with this series?

Jason


Re: add a not device managed memremap_pages v3

2019-08-20 Thread Jason Gunthorpe
On Sun, Aug 18, 2019 at 11:05:53AM +0200, Christoph Hellwig wrote:
> Hi Dan and Jason,
> 
> Bharata has been working on secure page management for kvmppc guests,
> and one thing I noticed is that he had to fake up a struct device
> just so that it could be passed to the devm_memremap_pages
> infrastructure for device private memory.
> 
> This series adds non-device managed versions of the
> devm_request_free_mem_region and devm_memremap_pages functions for
> his use case.
> 
> Changes since v2:
>  - improved changelogs that take the v2 changes into account
> 
> Changes since v1:
>  - don't overload devm_request_free_mem_region
>  - export the memremap_pages and munmap_pages as kvmppc can be a module

Looks good, I fixed up the patch with Dan's note and reviewed them as
well.

Applied to hmm.git as requested

Thanks,
Jason


Re: add a not device managed memremap_pages v2

2019-08-16 Thread Jason Gunthorpe
On Fri, Aug 16, 2019 at 02:36:07PM +0200, Christoph Hellwig wrote:
> > > Changes since v1:
> > >  - don't overload devm_request_free_mem_region
> > >  - export the memremap_pages and munmap_pages as kvmppc can be a module
> > 
> > What tree do we want this to go through? Dan are you running a pgmap
> > tree still? Do we know of any conflicts?
> 
> The last changes in this area went through the hmm tree.  There are
> no known conflicts, and the kvmppc driver that needs this already
> has a dependency on the hmm tree for the migrate_vma_* changes.

OK by me, Dan can you ack or review? Thanks

Jason


Re: add a not device managed memremap_pages v2

2019-08-16 Thread Jason Gunthorpe
On Fri, Aug 16, 2019 at 08:54:30AM +0200, Christoph Hellwig wrote:
> Hi Dan and Jason,
> 
> Bharata has been working on secure page management for kvmppc guests,
> and one I thing I noticed is that he had to fake up a struct device
> just so that it could be passed to the devm_memremap_pages
> instrastructure for device private memory.
> 
> This series adds non-device managed versions of the
> devm_request_free_mem_region and devm_memremap_pages functions for
> his use case.
> 
> Changes since v1:
>  - don't overload devm_request_free_mem_region
>  - export the memremap_pages and munmap_pages as kvmppc can be a module

What tree do we want this to go through? Dan are you running a pgmap
tree still? Do we know of any conflicts?

Thanks,
Jason


Re: [PATCH 5/5] memremap: provide a not device managed memremap_pages

2019-08-11 Thread Jason Gunthorpe
On Sun, Aug 11, 2019 at 10:12:47AM +0200, Christoph Hellwig wrote:
> The kvmppc ultravisor code wants a device private memory pool that is
> system wide and not attached to a device.  Instead of faking up one
> provide a low-level memremap_pages for it.  Note that this function is
> not exported, and doesn't have a cleanup routine associated with it to
> discourage use from more driver like users.
> 
> Signed-off-by: Christoph Hellwig 
>  include/linux/memremap.h |  1 +
>  mm/memremap.c | 74 
>  2 files changed, 45 insertions(+), 30 deletions(-)
> 
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 8f0013e18e14..eac23e88a94a 100644
> +++ b/include/linux/memremap.h
> @@ -123,6 +123,7 @@ static inline struct vmem_altmap *pgmap_altmap(struct 
> dev_pagemap *pgmap)
>  }
>  
>  #ifdef CONFIG_ZONE_DEVICE
> +void *memremap_pages(struct dev_pagemap *pgmap, int nid);
>  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
>  void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap);
>  struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 09a087ca30ff..7b7575330db4 100644
> +++ b/mm/memremap.c
> @@ -137,27 +137,12 @@ static void dev_pagemap_percpu_release(struct 
> percpu_ref *ref)
>   complete(&pgmap->done);
>  }
>  
> -/**
> - * devm_memremap_pages - remap and provide memmap backing for the given 
> resource
> - * @dev: hosting device for @res
> - * @pgmap: pointer to a struct dev_pagemap
> - *
> - * Notes:
> - * 1/ At a minimum the res and type members of @pgmap must be initialized
> - *by the caller before passing it to this function
> - *
> - * 2/ The altmap field may optionally be initialized, in which case
> - *PGMAP_ALTMAP_VALID must be set in pgmap->flags.
> - *
> - * 3/ The ref field may optionally be provided, in which pgmap->ref must be
> - *'live' on entry and will be killed and reaped at
> - *devm_memremap_pages_release() time, or if this routine fails.
> - *
> - * 4/ res is expected to be a host memory range that could feasibly be
> - *treated as a "System RAM" range, i.e. not a device mmio range, but
> - *this is not enforced.
> +/*
> + * This version is not intended for system resources only, and there is no

Was 'is not' what was intended here? I'm having a hard time reading
this.

Jason


Re: [PATCH 2/5] resource: add a not device managed request_free_mem_region variant

2019-08-11 Thread Jason Gunthorpe
On Sun, Aug 11, 2019 at 10:12:44AM +0200, Christoph Hellwig wrote:
> Just add a simple macro that passes a NULL dev argument to
> dev_request_free_mem_region, and call request_mem_region in the
> function for that particular case.
> 
> Signed-off-by: Christoph Hellwig 
>  include/linux/ioport.h | 2 ++
>  kernel/resource.c  | 5 -
>  2 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index 0dcc48cafa80..528ae6cbb1b4 100644
> +++ b/include/linux/ioport.h
> @@ -297,6 +297,8 @@ static inline bool resource_overlaps(struct resource *r1, struct resource *r2)
>  
>  struct resource *devm_request_free_mem_region(struct device *dev,
>   struct resource *base, unsigned long size, const char *name);
> +#define request_free_mem_region(base, size, name) \
> + devm_request_free_mem_region(NULL, base, size, name)
>  
>  #endif /* __ASSEMBLY__ */
>  #endif   /* _LINUX_IOPORT_H */
> diff --git a/kernel/resource.c b/kernel/resource.c
> index 0ddc558586a7..3a826b3cc883 100644
> +++ b/kernel/resource.c
> @@ -1671,7 +1671,10 @@ struct resource *devm_request_free_mem_region(struct device *dev,
>   REGION_DISJOINT)
>   continue;
>  
> - res = devm_request_mem_region(dev, addr, size, name);
> + if (dev)
> + res = devm_request_mem_region(dev, addr, size, name);
> + else
> + res = request_mem_region(addr, size, name);

It is a bit jarring to have something called devm_* that doesn't
actually do the devm_ part on some paths.

Maybe this function should be called __request_free_mem_region() with
another name wrapper macro?
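
Something along these lines, perhaps (sketch only):

struct resource *__request_free_mem_region(struct device *dev,
		struct resource *base, unsigned long size, const char *name);

#define devm_request_free_mem_region(dev, base, size, name) \
	__request_free_mem_region(dev, base, size, name)
#define request_free_mem_region(base, size, name) \
	__request_free_mem_region(NULL, base, size, name)

That way the devm_ spelling always has a dev and the NULL is confined
to the non-devm macro.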

Jason


Re: dev_pagemap related cleanups v4

2019-07-02 Thread Jason Gunthorpe
On Tue, Jul 02, 2019 at 04:17:48PM -0700, Dan Williams wrote:
> On Tue, Jul 2, 2019 at 11:42 AM Jason Gunthorpe  wrote:
> >
> > On Mon, Jul 01, 2019 at 10:25:17AM +0200, Christoph Hellwig wrote:
> > > And I've demonstrated that I can't send patch series..  While this
> > > has all the right patches, it also has the extra patches already
> > > in the hmm tree, and four extra patches I wanted to send once
> > > this series is merged.  I'll give up for now, please use the git
> > > url for anything serious, as it contains the right thing.
> >
> > Okay, I sorted it all out and temporarily put it here:
> >
> > https://github.com/jgunthorpe/linux/commits/hmm
> >
> > Bit involved job:
> > - Took Ira's v4 patch into hmm.git and confirmed it matches what
> >   Andrew has in linux-next after all the fixups
> > - Checked your github v4 and the v3 that hit the mailing list were
> >   substantially similar (I never did get a clean v4) and largely
> >   went with the github version
> > - Based CH's v4 series on -rc7 and put back the removal hunk in swap.c
> >   so it compiles
> > - Merge'd CH's series to hmm.git and fixed all the conflicts with Ira
> >   and Ralph's patches (such that swap.c remains unchanged)
> > - Added Dan's ack's and tested-by's
> 
> Looks good. Test merge (with some collisions, see below) also passes
> my test suite.

Okay, published toward linux-next now

> >
> > I think this fairly closely follows what was posted to the mailing
> > list.
> >
> > As it was more than a simple 'git am', I'll let it sit on github until
> > I hear OK's then I'll move it to kernel.org's hmm.git and it will hit
> > linux-next. 0-day should also run on this whole thing from my github.
> >
> > What I know is outstanding:
> >  - The conflicting ARM patches, I understand Andrew will handle these
> >post-linux-next
> >  - The conflict with AMD GPU in -next, I am waiting to hear from AMD
> 
> Just a heads up that this also collides with the "sub-section" patches
> in Andrew's tree. The resolution is straightforward, mostly just
> colliding updates to arch_{add,remove}_memory() call sites in
> kernel/memremap.c and collisions with pgmap_altmap() usage.

Okay, thanks

Jason


Re: dev_pagemap related cleanups v4

2019-07-02 Thread Jason Gunthorpe
On Mon, Jul 01, 2019 at 10:25:17AM +0200, Christoph Hellwig wrote:
> And I've demonstrated that I can't send patch series..  While this
> has all the right patches, it also has the extra patches already
> in the hmm tree, and four extra patches I wanted to send once
> this series is merged.  I'll give up for now, please use the git
> url for anything serious, as it contains the right thing.

Okay, I sorted it all out and temporarily put it here:

https://github.com/jgunthorpe/linux/commits/hmm

Bit involved job:
- Took Ira's v4 patch into hmm.git and confirmed it matches what
  Andrew has in linux-next after all the fixups
- Checked your github v4 and the v3 that hit the mailing list were
  substantially similar (I never did get a clean v4) and largely
  went with the github version
- Based CH's v4 series on -rc7 and put back the removal hunk in swap.c
  so it compiles
- Merge'd CH's series to hmm.git and fixed all the conflicts with Ira
  and Ralph's patches (such that swap.c remains unchanged)
- Added Dan's ack's and tested-by's

I think this fairly closely follows what was posted to the mailing
list.

As it was more than a simple 'git am', I'll let it sit on github until
I hear OK's then I'll move it to kernel.org's hmm.git and it will hit
linux-next. 0-day should also run on this whole thing from my github.

What I know is outstanding:
 - The conflicting ARM patches, I understand Andrew will handle these
   post-linux-next
 - The conflict with AMD GPU in -next, I am waiting to hear from AMD

Otherwise I think we are done with hmm.git for this
cycle.

Unfortunately this is still not enough to progress rdma's ODP, so we
will need to do this again next cycle :( I'll be working on patches
once I get all the merge window prep I have to do done.

Regards,
Jason


Re: [PATCH 16/25] device-dax: use the dev_pagemap internal refcount

2019-06-28 Thread Jason Gunthorpe
On Fri, Jun 28, 2019 at 10:10:12AM -0700, Dan Williams wrote:
> On Fri, Jun 28, 2019 at 10:08 AM Dan Williams  wrote:
> >
> > On Fri, Jun 28, 2019 at 10:02 AM Jason Gunthorpe  wrote:
> > >
> > > On Fri, Jun 28, 2019 at 09:27:44AM -0700, Dan Williams wrote:
> > > > > On Fri, Jun 28, 2019 at 8:39 AM Jason Gunthorpe  wrote:
> > > > >
> > > > > On Wed, Jun 26, 2019 at 02:27:15PM +0200, Christoph Hellwig wrote:
> > > > > > The functionality is identical to the one currently open coded in
> > > > > > device-dax.
> > > > > >
> > > > > > Signed-off-by: Christoph Hellwig 
> > > > > > Reviewed-by: Ira Weiny 
> > > > > >  drivers/dax/dax-private.h |  4 
> > > > > >  drivers/dax/device.c  | 43 ---
> > > > > >  2 files changed, 47 deletions(-)
> > > > >
> > > > > DanW: I think this series has reached enough review, did you want
> > > > > to ack/test any further?
> > > > >
> > > > > This needs to land in hmm.git soon to make the merge window.
> > > >
> > > > I was awaiting a decision about resolving the collision with Ira's
> > > > patch before testing the final result again [1]. You can go ahead and
> > > > add my reviewed-by for the series, but my tested-by should be on the
> > > > final state of the series.
> > >
> > > The conflict looks OK to me, I think we can let Andrew and Linus
> > > resolve it.
> >
> > Andrew's tree effectively always rebases since it's a quilt series.
> > I'd recommend pulling Ira's patch out of -mm and applying it with the
> > rest of hmm reworks. Any other git tree I'd agree with just doing the
> > late conflict resolution, but I'm not clear on what's the best
> > practice when conflicting with -mm.

What happens depends on timing as things arrive to Linus. I promised
to send hmm.git early, so I understand that Andrew will quilt rebase
his tree to Linus's and fix the conflict in Ira's patch before he
sends it.

> Regardless the patch is buggy. If you want to do the conflict
> resolution it should be because the DEVICE_PUBLIC removal effectively
> does the same fix otherwise we're knowingly leaving a broken point in
> the history.

I'm not sure I understand your concern, is there something wrong with
CH's series as it stands? hmm is a non-rebasing git tree, so as long
as the series is correct *when I apply it* there is no broken history.

I assumed the conflict resolution for Ira's patch was to simply take
the deletion of the if block from CH's series - right?

If we do need to take Ira's patch into hmm.git it will go after CH's
series (and Ira will have to rebase/repost it), so I think there is
nothing to do at this moment - unless you are saying there is a
problem with the series in CH's git tree?

Regards,
Jason


Re: [PATCH 16/25] device-dax: use the dev_pagemap internal refcount

2019-06-28 Thread Jason Gunthorpe
On Fri, Jun 28, 2019 at 09:27:44AM -0700, Dan Williams wrote:
> On Fri, Jun 28, 2019 at 8:39 AM Jason Gunthorpe  wrote:
> >
> > On Wed, Jun 26, 2019 at 02:27:15PM +0200, Christoph Hellwig wrote:
> > > The functionality is identical to the one currently open coded in
> > > device-dax.
> > >
> > > Signed-off-by: Christoph Hellwig 
> > > Reviewed-by: Ira Weiny 
> > >  drivers/dax/dax-private.h |  4 
> > >  drivers/dax/device.c  | 43 ---
> > >  2 files changed, 47 deletions(-)
> >
> > DanW: I think this series has reached enough review, did you want
> > to ack/test any further?
> >
> > This needs to land in hmm.git soon to make the merge window.
> 
> I was awaiting a decision about resolving the collision with Ira's
> patch before testing the final result again [1]. You can go ahead and
> add my reviewed-by for the series, but my tested-by should be on the
> final state of the series.

The conflict looks OK to me, I think we can let Andrew and Linus
resolve it.

Thanks,
Jason


Re: [PATCH 16/25] device-dax: use the dev_pagemap internal refcount

2019-06-28 Thread Jason Gunthorpe
On Wed, Jun 26, 2019 at 02:27:15PM +0200, Christoph Hellwig wrote:
> The functionality is identical to the one currently open coded in
> device-dax.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Ira Weiny 
> ---
>  drivers/dax/dax-private.h |  4 
>  drivers/dax/device.c  | 43 ---
>  2 files changed, 47 deletions(-)

DanW: I think this series has reached enough review, did you want
to ack/test any further?

This needs to land in hmm.git soon to make the merge window.

Thanks,
Jason


Re: [PATCH 24/25] mm: remove the HMM config option

2019-06-27 Thread Jason Gunthorpe
On Wed, Jun 26, 2019 at 02:27:23PM +0200, Christoph Hellwig wrote:
> All the mm/hmm.c code is better keyed off HMM_MIRROR.  Also let nouveau
> depend on it instead of the mix of a dummy dependency symbol plus the
> actually selected one.  Drop various odd dependencies, as the code is
> pretty portable.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/gpu/drm/nouveau/Kconfig |  3 +--
>  include/linux/hmm.h |  5 +
>  include/linux/mm_types.h|  2 +-
>  mm/Kconfig  | 27 ---
>  mm/Makefile |  2 +-
>  mm/hmm.c|  2 --
>  6 files changed, 8 insertions(+), 33 deletions(-)

Makes more sense to me too

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH 12/25] memremap: add a migrate_to_ram method to struct dev_pagemap_ops

2019-06-27 Thread Jason Gunthorpe
On Wed, Jun 26, 2019 at 02:27:11PM +0200, Christoph Hellwig wrote:
> This replaces the hacky ->fault callback, which is currently directly
> called from common code through a hmm specific data structure as an
> exercise in layering violations.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Ralph Campbell 
> ---
>  include/linux/hmm.h  |  6 --
>  include/linux/memremap.h |  6 ++
>  include/linux/swapops.h  | 15 ---
>  kernel/memremap.c| 35 ---
>  mm/hmm.c | 13 +
>  mm/memory.c  |  9 ++---
>  6 files changed, 17 insertions(+), 67 deletions(-)

Reviewed-by: Jason Gunthorpe 
 
I've heard there are some other use models for fault() here beyond
migrate to ram, but we can rename it if we ever see them.

> +static vm_fault_t hmm_devmem_migrate_to_ram(struct vm_fault *vmf)
>  {
> - struct hmm_devmem *devmem = page->pgmap->data;
> + struct hmm_devmem *devmem = vmf->page->pgmap->data;
>  
> - return devmem->ops->fault(devmem, vma, addr, page, flags, pmdp);
> + return devmem->ops->fault(devmem, vmf->vma, vmf->address, vmf->page,
> + vmf->flags, vmf->pmd);
>  }

Next cycle we should probably rename this fault to migrate_to_ram as
well and pass in the vmf..
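
i.e. the legacy op would take the same shape as the new
dev_pagemap_ops one, something like (sketch):

	struct hmm_devmem_ops {
		vm_fault_t (*migrate_to_ram)(struct hmm_devmem *devmem,
					     struct vm_fault *vmf);
		/* other ops unchanged */
	};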

Jason


Re: [PATCH 20/25] mm: remove hmm_vma_alloc_locked_page

2019-06-27 Thread Jason Gunthorpe
On Wed, Jun 26, 2019 at 02:27:19PM +0200, Christoph Hellwig wrote:
> The only user of it has just been removed, and there wasn't really any need
> to wrap a basic memory allocator to start with.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  include/linux/hmm.h |  3 ---
>  mm/hmm.c| 14 --
>  2 files changed, 17 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH 03/25] mm: remove hmm_devmem_add_resource

2019-06-27 Thread Jason Gunthorpe
On Wed, Jun 26, 2019 at 02:27:02PM +0200, Christoph Hellwig wrote:
> This function has never been used since it was first added to the kernel
> more than a year and a half ago, and if we ever grow a consumer of the
> MEMORY_DEVICE_PUBLIC infrastructure it can easily use devm_memremap_pages
> directly.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Jason Gunthorpe 
> Reviewed-by: John Hubbard 
> Acked-by: Michal Hocko 
> ---
>  include/linux/hmm.h |  3 ---
>  mm/hmm.c| 50 -
>  2 files changed, 53 deletions(-)

This should be squashed to the new earlier patch?

Jason


Re: [PATCH 18/22] mm: mark DEVICE_PUBLIC as broken

2019-06-25 Thread Jason Gunthorpe
On Tue, Jun 25, 2019 at 09:29:15AM +0200, Christoph Hellwig wrote:
> On Thu, Jun 20, 2019 at 09:26:48PM +0200, Michal Hocko wrote:
> > On Thu 13-06-19 11:43:21, Christoph Hellwig wrote:
> > > The code hasn't been used since it was added to the tree, and doesn't
> > > appear to actually be usable.  Mark it as BROKEN until either a user
> > > comes along or we finally give up on it.
> > 
> > I would go even further and simply remove all the DEVICE_PUBLIC code.
> 
> I looked into that as I now got the feedback twice.  It would
> create a conflict with another tree cleaning things up around the
> is_device_private definition, but otherwise I'd be glad to just remove
> it.
> 
> Jason, as this goes through your tree, do you mind the additional
> conflict?

Which tree and what does the resolution look like?

Also, I don't want to be making the decision if we should keep/remove
DEVICE_PUBLIC, so let's get an Ack from Andrew/etc?

My main reluctance is that I know there is HW out there that can do
coherent, and I want to believe they are coming with patches, just
too slowly. But I'd also rather those people defend themselves :P

Thanks,
Jason


Re: [PATCH 18/22] mm: mark DEVICE_PUBLIC as broken

2019-06-19 Thread Jason Gunthorpe
On Thu, Jun 13, 2019 at 06:23:04PM -0700, John Hubbard wrote:
> On 6/13/19 5:43 PM, Ira Weiny wrote:
> > On Thu, Jun 13, 2019 at 07:58:29PM +0000, Jason Gunthorpe wrote:
> >> On Thu, Jun 13, 2019 at 12:53:02PM -0700, Ralph Campbell wrote:
> >>>
> ...
> >> Hum, so the only thing this config does is short circuit here:
> >>
> >> static inline bool is_device_public_page(const struct page *page)
> >> {
> >> return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
> >> IS_ENABLED(CONFIG_DEVICE_PUBLIC) &&
> >> is_zone_device_page(page) &&
> >> page->pgmap->type == MEMORY_DEVICE_PUBLIC;
> >> }
> >>
> >> Which is called all over the place.. 
> > 
> >   yes but the earlier patch:
> > 
> > [PATCH 03/22] mm: remove hmm_devmem_add_resource
> > 
> > Removes the only place type is set to MEMORY_DEVICE_PUBLIC.
> > 
> > So I think it is ok.  Frankly I was wondering if we should remove the public
> > type altogether but conceptually it seems ok.  But I don't see any users of it
> > so...  should we get rid of it in the code rather than turning the config off?
> > 
> > Ira
> 
> That seems reasonable. I recall that the hope was for those IBM Power 9
> systems to use _PUBLIC, as they have hardware-based coherent device (GPU)
> memory, and so the memory really is visible to the CPU. And the IBM team
> was thinking of taking advantage of it. But I haven't seen anything on
> that front for a while.

Does anyone know who those people are and can we encourage them to
send some patches? :)

Jason


Re: dev_pagemap related cleanups

2019-06-13 Thread Jason Gunthorpe
On Thu, Jun 13, 2019 at 11:21:01PM +0200, Christoph Hellwig wrote:
> On Thu, Jun 13, 2019 at 08:40:46PM +0000, Jason Gunthorpe wrote:
> > > Perhaps we should pull those out and resend them through hmm.git?
> > 
> > It could be done - but how bad is the conflict resolution?
> 
> Trivial.  All but one patch just apply using git-am, and the other one
> just has a few lines of offsets.

Okay, NP then, trivial ones are OK to send to Linus..

If Andrew gets them into -rc5 then I will get rc5 into hmm.git next
week.

Thanks,
Jason


Re: dev_pagemap related cleanups

2019-06-13 Thread Jason Gunthorpe
On Thu, Jun 13, 2019 at 11:27:39AM -0700, Dan Williams wrote:
> On Thu, Jun 13, 2019 at 2:43 AM Christoph Hellwig  wrote:
> >
> > Hi Dan, Jérôme and Jason,
> >
> > below is a series that cleans up the dev_pagemap interface so that
> > it is more easily usable, which removes the need to wrap it in hmm
> > and thus allowing to kill a lot of code
> >
> > Diffstat:
> >
> >  22 files changed, 245 insertions(+), 802 deletions(-)
> 
> Hooray!
> 
> > Git tree:
> >
> > git://git.infradead.org/users/hch/misc.git hmm-devmem-cleanup
> 
> I just realized this collides with the dev_pagemap release rework in
> Andrew's tree (commit ids below are from next.git and are not stable)
> 
> 4422ee8476f0 mm/devm_memremap_pages: fix final page put race
> 771f0714d0dc PCI/P2PDMA: track pgmap references per resource, not globally
> af37085de906 lib/genalloc: introduce chunk owners
> e0047ff8aa77 PCI/P2PDMA: fix the gen_pool_add_virt() failure path
> 0315d47d6ae9 mm/devm_memremap_pages: introduce devm_memunmap_pages
> 216475c7eaa8 drivers/base/devres: introduce devm_release_action()
> 
> CONFLICT (content): Merge conflict in tools/testing/nvdimm/test/iomap.c
> CONFLICT (content): Merge conflict in mm/hmm.c
> CONFLICT (content): Merge conflict in kernel/memremap.c
> CONFLICT (content): Merge conflict in include/linux/memremap.h
> CONFLICT (content): Merge conflict in drivers/pci/p2pdma.c
> CONFLICT (content): Merge conflict in drivers/nvdimm/pmem.c
> CONFLICT (content): Merge conflict in drivers/dax/device.c
> CONFLICT (content): Merge conflict in drivers/dax/dax-private.h
> 
> Perhaps we should pull those out and resend them through hmm.git?

It could be done - but how bad is the conflict resolution?

I'd be more comfortable to take a PR from you to merge into hmm.git,
rather than raw patches, then apply CH's series on top. I think.

That way if something goes wrong you can send your PR to Linus
directly.

> It also turns out the nvdimm unit tests crash with this signature on
> that branch where base v5.2-rc3 passes:
> 
> BUG: kernel NULL pointer dereference, address: 0008
> [..]
> CPU: 15 PID: 1414 Comm: lt-libndctl Tainted: G   OE 5.2.0-rc3+ #3399
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
> RIP: 0010:percpu_ref_kill_and_confirm+0x1e/0x180
> [..]
> Call Trace:
>  release_nodes+0x234/0x280
>  device_release_driver_internal+0xe8/0x1b0
>  bus_remove_device+0xf2/0x160
>  device_del+0x166/0x370
>  unregister_dev_dax+0x23/0x50
>  release_nodes+0x234/0x280
>  device_release_driver_internal+0xe8/0x1b0
>  unbind_store+0x94/0x120
>  kernfs_fop_write+0xf0/0x1a0
>  vfs_write+0xb7/0x1b0
>  ksys_write+0x5c/0xd0
>  do_syscall_64+0x60/0x240

Too bad the trace didn't say which devm cleanup triggered it.. Did
dev_pagemap_percpu_exit get called with a NULL pgmap->ref?

Jason


Re: [PATCH 22/22] mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR

2019-06-13 Thread Jason Gunthorpe
On Thu, Jun 13, 2019 at 11:43:25AM +0200, Christoph Hellwig wrote:
> The migrate_vma helper is only used by noveau to migrate device private
> pages around.  Other HMM_MIRROR users like amdgpu or infiniband don't
> need it.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/gpu/drm/nouveau/Kconfig | 1 +
>  mm/Kconfig  | 1 -
>  2 files changed, 1 insertion(+), 1 deletion(-)

Yes, the thing that calls migrate_vma() should be the thing that has
the kconfig stuff.

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH 21/22] mm: remove the HMM config option

2019-06-13 Thread Jason Gunthorpe
On Thu, Jun 13, 2019 at 11:43:24AM +0200, Christoph Hellwig wrote:
> All the mm/hmm.c code is better keyed off HMM_MIRROR.  Also let nouveau
> depend on it instead of the mix of a dummy dependency symbol plus the
> actually selected one.  Drop various odd dependencies, as the code is
> pretty portable.

I don't really know, but I thought this needed the arch restriction
for the same reason get_user_pages has various unique arch specific
implementations (it does seem to have some open coded GUP like thing)?

I was hoping we could do this after your common gup series? But sooner
is better too.

Jason


Re: [PATCH 18/22] mm: mark DEVICE_PUBLIC as broken

2019-06-13 Thread Jason Gunthorpe
On Thu, Jun 13, 2019 at 12:53:02PM -0700, Ralph Campbell wrote:
> 
> On 6/13/19 12:44 PM, Jason Gunthorpe wrote:
> > On Thu, Jun 13, 2019 at 11:43:21AM +0200, Christoph Hellwig wrote:
> > > The code hasn't been used since it was added to the tree, and doesn't
> > > appear to actually be usable.  Mark it as BROKEN until either a user
> > > comes along or we finally give up on it.
> > > 
> > > Signed-off-by: Christoph Hellwig 
> > >   mm/Kconfig | 1 +
> > >   1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > index 0d2ba7e1f43e..406fa45e9ecc 100644
> > > +++ b/mm/Kconfig
> > > @@ -721,6 +721,7 @@ config DEVICE_PRIVATE
> > >   config DEVICE_PUBLIC
> > >   bool "Addressable device memory (like GPU memory)"
> > >   depends on ARCH_HAS_HMM
> > > + depends on BROKEN
> > >   select HMM
> > >   select DEV_PAGEMAP_OPS
> > 
> > This seems a bit harsh, we do have another kconfig that selects this
> > one today:
> > 
> > config DRM_NOUVEAU_SVM
> >  bool "(EXPERIMENTAL) Enable SVM (Shared Virtual Memory) support"
> >  depends on ARCH_HAS_HMM
> >  depends on DRM_NOUVEAU
> >  depends on STAGING
> >  select HMM_MIRROR
> >  select DEVICE_PRIVATE
> >  default n
> >  help
> >Say Y here if you want to enable experimental support for
> >Shared Virtual Memory (SVM).
> > 
> > Maybe it should be depends on STAGING not broken?
> > 
> > or maybe nouveau_svm doesn't actually need DEVICE_PRIVATE?
> > 
> > Jason
> 
> I think you are confusing DEVICE_PRIVATE for DEVICE_PUBLIC.
> DRM_NOUVEAU_SVM does use DEVICE_PRIVATE but not DEVICE_PUBLIC.

Indeed you are correct, never mind

Hum, so the only thing this config does is short circuit here:

static inline bool is_device_public_page(const struct page *page)
{
return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
IS_ENABLED(CONFIG_DEVICE_PUBLIC) &&
is_zone_device_page(page) &&
page->pgmap->type == MEMORY_DEVICE_PUBLIC;
}

Which is called all over the place.. (and with BROKEN set,
IS_ENABLED(CONFIG_DEVICE_PUBLIC) is compile-time false, so all those
call sites constant-fold away)

So, yes, we really don't want any distro or something to turn this on
until it has a use.

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH 20/22] mm: sort out the DEVICE_PRIVATE Kconfig mess

2019-06-13 Thread Jason Gunthorpe
On Thu, Jun 13, 2019 at 11:43:23AM +0200, Christoph Hellwig wrote:
> The ZONE_DEVICE support doesn't depend on anything HMM related, just on
> various bits of arch support as indicated by the architecture.  Also
> don't select the option from nouveau as it isn't present in many setups,
> and depend on it instead.
> 
> Signed-off-by: Christoph Hellwig 
>  drivers/gpu/drm/nouveau/Kconfig | 2 +-
>  mm/Kconfig  | 5 ++---
>  2 files changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/Kconfig b/drivers/gpu/drm/nouveau/Kconfig
> index dba2613f7180..6303d203ab1d 100644
> +++ b/drivers/gpu/drm/nouveau/Kconfig
> @@ -85,10 +85,10 @@ config DRM_NOUVEAU_BACKLIGHT
>  config DRM_NOUVEAU_SVM
>   bool "(EXPERIMENTAL) Enable SVM (Shared Virtual Memory) support"
>   depends on ARCH_HAS_HMM
> + depends on DEVICE_PRIVATE
>   depends on DRM_NOUVEAU
>   depends on STAGING
>   select HMM_MIRROR
> - select DEVICE_PRIVATE
>   default n
>   help
> Say Y here if you want to enable experimental support for

Ben, I heard you might have a patch like this in your tree, but I
don't think I could find your tree?? 

Do you have any nouveau/Kconfig patches that might conflict? Thanks

Does this fix the randconfig failures that have been reported?

> diff --git a/mm/Kconfig b/mm/Kconfig
> index 406fa45e9ecc..4dbd718c8cf4 100644
> +++ b/mm/Kconfig
> @@ -677,13 +677,13 @@ config ARCH_HAS_HMM_MIRROR
>  
>  config ARCH_HAS_HMM
>   bool
> - default y
>   depends on (X86_64 || PPC64)
>   depends on ZONE_DEVICE
>   depends on MMU && 64BIT
>   depends on MEMORY_HOTPLUG
>   depends on MEMORY_HOTREMOVE
>   depends on SPARSEMEM_VMEMMAP
> + default y

What is the reason we have this ARCH thing anyhow? Does hmm have arch
dependencies someplace?

Thanks
Jason


Re: [PATCH 18/22] mm: mark DEVICE_PUBLIC as broken

2019-06-13 Thread Jason Gunthorpe
On Thu, Jun 13, 2019 at 11:43:21AM +0200, Christoph Hellwig wrote:
> The code hasn't been used since it was added to the tree, and doesn't
> appear to actually be usable.  Mark it as BROKEN until either a user
> comes along or we finally give up on it.
> 
> Signed-off-by: Christoph Hellwig 
>  mm/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 0d2ba7e1f43e..406fa45e9ecc 100644
> +++ b/mm/Kconfig
> @@ -721,6 +721,7 @@ config DEVICE_PRIVATE
>  config DEVICE_PUBLIC
>   bool "Addressable device memory (like GPU memory)"
>   depends on ARCH_HAS_HMM
> + depends on BROKEN
>   select HMM
>   select DEV_PAGEMAP_OPS

This seems a bit harsh, we do have another kconfig that selects this
one today:

config DRM_NOUVEAU_SVM
bool "(EXPERIMENTAL) Enable SVM (Shared Virtual Memory) support"
depends on ARCH_HAS_HMM
depends on DRM_NOUVEAU
depends on STAGING
select HMM_MIRROR
select DEVICE_PRIVATE
default n
help
  Say Y here if you want to enable experimental support for
  Shared Virtual Memory (SVM).

Maybe it should be depends on STAGING not broken?

or maybe nouveau_svm doesn't actually need DEVICE_PRIVATE?

Jason


Re: [PATCH 17/22] mm: remove hmm_devmem_add

2019-06-13 Thread Jason Gunthorpe
On Thu, Jun 13, 2019 at 11:43:20AM +0200, Christoph Hellwig wrote:
> There isn't really much value add in the hmm_devmem_add wrapper.  Just
> factor out a little helper to find the resource, and otherwise let the
> driver implement the dev_pagemap_ops directly.

Was this commit message written when other patches were squashed in
here? I think the helper this mentions was from an earlier patch

> Signed-off-by: Christoph Hellwig 
> ---
>  Documentation/vm/hmm.rst |  26 
>  include/linux/hmm.h  | 129 ---
>  mm/hmm.c | 115 --
>  3 files changed, 270 deletions(-)

I looked for in-flight patches that might be using these APIs and
found nothing. To be sent patches can use the new API with no loss in
functionality...

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH 14/22] nouveau: use alloc_page_vma directly

2019-06-13 Thread Jason Gunthorpe
On Thu, Jun 13, 2019 at 11:43:17AM +0200, Christoph Hellwig wrote:
> hmm_vma_alloc_locked_page is scheduled to go away, use the proper
> mm function directly.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/gpu/drm/nouveau/nouveau_dmem.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH 11/22] memremap: remove the data field in struct dev_pagemap

2019-06-13 Thread Jason Gunthorpe
On Thu, Jun 13, 2019 at 11:43:14AM +0200, Christoph Hellwig wrote:
> struct dev_pagemap is always embedded into a containing structure, so
> there is no need for an additional private data field.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/nvdimm/pmem.c| 2 +-
>  include/linux/memremap.h | 3 +--
>  kernel/memremap.c| 2 +-
>  mm/hmm.c | 9 +
>  4 files changed, 8 insertions(+), 8 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH 09/22] memremap: lift the devmap_enable manipulation into devm_memremap_pages

2019-06-13 Thread Jason Gunthorpe
On Thu, Jun 13, 2019 at 11:43:12AM +0200, Christoph Hellwig wrote:
> Just check if there is a ->page_free operation set and take care of the
> static key enable, as well as the put using device managed resources.
> diff --git a/mm/hmm.c b/mm/hmm.c
> index c76a1b5defda..6dc769feb2e1 100644
> +++ b/mm/hmm.c
> @@ -1378,8 +1378,6 @@ struct hmm_devmem *hmm_devmem_add(const struct 
> hmm_devmem_ops *ops,
>   void *result;
>   int ret;
>  
> - dev_pagemap_get_ops();
> -

Where was the matching dev_pagemap_put_ops() for this hmm case? This
is a bug fix too?

The nouveau driver is the only one to actually call this hmm function
and it does it as part of a probe function. 

Seems reasonable, however, in the unlikely event that it fails to init
'dmem' the driver will retain a dev_pagemap_get_ops until it unloads.
This imbalance doesn't seem worth worrying about.

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH 08/22] memremap: pass a struct dev_pagemap to ->kill

2019-06-13 Thread Jason Gunthorpe
On Thu, Jun 13, 2019 at 11:43:11AM +0200, Christoph Hellwig wrote:
> Passing the actual typed structure leads to more understandable code
> vs the actual references.
> 
> Signed-off-by: Christoph Hellwig 
>  drivers/dax/device.c  | 7 +++
>  drivers/nvdimm/pmem.c | 6 +++---
>  drivers/pci/p2pdma.c  | 6 +++---
>  include/linux/memremap.h  | 2 +-
>  kernel/memremap.c | 4 ++--
>  mm/hmm.c  | 4 ++--
>  tools/testing/nvdimm/test/iomap.c | 6 ++
>  7 files changed, 16 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index 4adab774dade..e23fa1bd8c97 100644
> +++ b/drivers/dax/device.c
> @@ -37,13 +37,12 @@ static void dev_dax_percpu_exit(void *data)
>   percpu_ref_exit(ref);
>  }
>  
> -static void dev_dax_percpu_kill(struct percpu_ref *data)
> +static void dev_dax_percpu_kill(struct dev_pagemap *pgmap)
>  {

Looks like it was always like this, but I also can't see a reason to
use the percpu ref as a handle for a dev_pagemap callback.

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH 07/22] memremap: move dev_pagemap callbacks into a separate structure

2019-06-13 Thread Jason Gunthorpe
On Thu, Jun 13, 2019 at 11:43:10AM +0200, Christoph Hellwig wrote:
> The dev_pagemap is growing too many callbacks.  Move them into a
> separate ops structure so that they are not duplicated for multiple
> instances, and an attacker can't easily overwrite them.

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH 06/22] mm: factor out a devm_request_free_mem_region helper

2019-06-13 Thread Jason Gunthorpe
On Thu, Jun 13, 2019 at 11:43:09AM +0200, Christoph Hellwig wrote:
> Keep the physical address allocation that hmm_add_device does with the
> rest of the resource code, and allow future reuse of it without the hmm
> wrapper.
> 
> Signed-off-by: Christoph Hellwig 
>  include/linux/ioport.h |  2 ++
>  kernel/resource.c  | 39 +++
>  mm/hmm.c   | 33 -
>  3 files changed, 45 insertions(+), 29 deletions(-)
> 
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index da0ebaec25f0..76a33ae3bf6c 100644
> +++ b/include/linux/ioport.h
> @@ -286,6 +286,8 @@ static inline bool resource_overlaps(struct resource *r1, struct resource *r2)
> return (r1->start <= r2->end && r1->end >= r2->start);
>  }
>  
> +struct resource *devm_request_free_mem_region(struct device *dev,
> + struct resource *base, unsigned long size);
>  
>  #endif /* __ASSEMBLY__ */
>  #endif   /* _LINUX_IOPORT_H */
> diff --git a/kernel/resource.c b/kernel/resource.c
> index 158f04ec1d4f..99c58134ed1c 100644
> +++ b/kernel/resource.c
> @@ -1628,6 +1628,45 @@ void resource_list_free(struct list_head *head)
>  }
>  EXPORT_SYMBOL(resource_list_free);
>  
> +#ifdef CONFIG_DEVICE_PRIVATE
> +/**
> + * devm_request_free_mem_region - find free region for device private memory
> + *
> + * @dev: device struct to bind the resource too
> + * @size: size in bytes of the device memory to add
> + * @base: resource tree to look in
> + *
> + * This function tries to find an empty range of physical address big enough to
> + * contain the new resource, so that it can later be hotplugged as ZONE_DEVICE
> + * memory, which in turn allocates struct pages.
> + */
> +struct resource *devm_request_free_mem_region(struct device *dev,
> + struct resource *base, unsigned long size)
> +{
> + resource_size_t end, addr;
> + struct resource *res;
> +
> + size = ALIGN(size, 1UL << PA_SECTION_SHIFT);
> + end = min_t(unsigned long, base->end, (1UL << MAX_PHYSMEM_BITS) - 1);

You even fixed it to use min_t.

> + addr = end - size + 1UL;
> + for (; addr > size && addr >= base->start; addr -= size) {
> + if (region_intersects(addr, size, 0, IORES_DESC_NONE) !=
> + REGION_DISJOINT)
> + continue;

The FIXME about the algorithm cost seems justified though, yikes.

> +
> + res = devm_request_mem_region(dev, addr, size, dev_name(dev));
> + if (!res)
> +     return ERR_PTR(-ENOMEM);
> + res->desc = IORES_DESC_DEVICE_PRIVATE_MEMORY;

I wonder if IORES_DESC_DEVICE_PRIVATE_MEMORY should be a function
argument?
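Ie something like this, purely as a sketch (the extra 'desc' parameter
is my suggestion, not part of the posted patch):

        struct resource *devm_request_free_mem_region(struct device *dev,
                        struct resource *base, unsigned long size,
                        unsigned long desc);

Then a future caller could ask for a different IORES_DESC_* value
without needing a new helper.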

Not really any substantive remark, so

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH 04/22] mm: don't clear ->mapping in hmm_devmem_free

2019-06-13 Thread Jason Gunthorpe
On Thu, Jun 13, 2019 at 11:43:07AM +0200, Christoph Hellwig wrote:
> ->mapping isn't even used by HMM users, and the field at the same offset
> in the zone_device part of the union is declared as pad.  (Which btw is
> rather confusing, as DAX uses ->pgmap and ->mapping from two different
> sides of the union, but DAX doesn't use hmm_devmem_free).
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  mm/hmm.c | 2 --
>  1 file changed, 2 deletions(-)

Hurm, is hmm following this comment from mm_types.h?

 * If you allocate the page using alloc_pages(), you can use some of the
 * space in struct page for your own purposes.  The five words in the main
 * union are available, except for bit 0 of the first word which must be
 * kept clear.  Many users use this word to store a pointer to an object
 * which is guaranteed to be aligned.  If you use the same storage as
 * page->mapping, you must restore it to NULL before freeing the page.

Maybe the assumption was that a driver is using ->mapping ?

However, nouveau is the only driver that uses this path, and it never
touches page->mapping either (nor in -next).

It looks like if a driver were to start using mapping then the driver
should be responsible to set it back to NULL before being done with
the page.
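Ie the rule from mm_types.h, sketched (illustrative only; 'my_obj' is a
made-up aligned driver pointer):

        page->mapping = (struct address_space *)my_obj; /* bit 0 must stay clear */
        /* ... driver uses the page ... */
        page->mapping = NULL;   /* must be restored before freeing */
        __free_page(page);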

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH 03/22] mm: remove hmm_devmem_add_resource

2019-06-13 Thread Jason Gunthorpe
On Thu, Jun 13, 2019 at 11:43:06AM +0200, Christoph Hellwig wrote:
> This function has never been used since it was first added to the kernel
> more than a year and a half ago, and if we ever grow a consumer of the
> MEMORY_DEVICE_PUBLIC infrastructure it can easily use devm_memremap_pages
> directly now that we've simplified the API for it.

nit: Have we simplified the interface for devm_memremap_pages() at
this point, or are you talking about later patches in this series?

I checked this and all the called functions are exported symbols, so
there is no blocker for a future driver to call devm_memremap_pages(),
maybe even with all this boilerplate, in the future.

If we eventually get many users that need some simplified registration
then we should add a devm_memremap_pages_simplified() interface and
factor out that code when we can see the pattern.

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH 02/22] mm: remove the struct hmm_device infrastructure

2019-06-13 Thread Jason Gunthorpe
On Thu, Jun 13, 2019 at 11:43:05AM +0200, Christoph Hellwig wrote:
> This code is a trivial wrapper around device model helpers, which
> should have been integrated into the driver device model usage from
> the start.  Assuming it actually had users, which it never had since
> the code was added more than 1 1/2 years ago.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  include/linux/hmm.h | 20 
>  mm/hmm.c| 80 -
>  2 files changed, 100 deletions(-)

I haven't looked in detail at this device memory stuff.. But I did
check a bit through the mailing list archives for some clue what this
was supposed to be for (wow, this is from 2016!)

The commit that added this says:
  This introduce a dummy HMM device class so device driver can use it to
  create hmm_device for the sole purpose of registering device memory.

Which I just can't understand at all. 

If we need a 'struct device' for some 'device memory' purpose then it
probably ought to be the 'struct pci_device' holding the BAR, not a
fake device.

I also can't comprehend why a supposed fake device would need a
chardev, with a standard 'hmm_deviceX' name, without also defining a
core kernel ABI for that char dev..

If this comes back it needs a proper explanation and review, with a
user.

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH 01/22] mm: remove the unused ARCH_HAS_HMM_DEVICE Kconfig option

2019-06-13 Thread Jason Gunthorpe
On Thu, Jun 13, 2019 at 11:43:04AM +0200, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig 
> ---
>  mm/Kconfig | 10 --
>  1 file changed, 10 deletions(-)

So long as we are willing to run a hmm.git we can merge only complete
features going forward.

Thus we don't need the crazy process described in the commit message
that (deliberately) introduced this unused kconfig.

I also tried to find something in-flight for 5.3 that would need this
and found nothing.

Reviewed-by: Jason Gunthorpe 

Jason


Re: dev_pagemap related cleanups

2019-06-13 Thread Jason Gunthorpe
On Thu, Jun 13, 2019 at 11:43:03AM +0200, Christoph Hellwig wrote:
> Hi Dan, Jérôme and Jason,
> 
> below is a series that cleans up the dev_pagemap interface so that
> it is more easily usable, which removes the need to wrap it in hmm
> and thus allows killing a lot of code

Do you want some of this to run through hmm.git? I see many patches
that don't seem to have inter-dependencies..

Thanks,
Jason


Re: [PATCH v5 10/13] nvme-pci: Add support for P2P memory in requests

2018-09-04 Thread Jason Gunthorpe
On Thu, Aug 30, 2018 at 12:53:49PM -0600, Logan Gunthorpe wrote:
> For P2P requests, we must use the pci_p2pmem_map_sg() function
> instead of the dma_map_sg functions.
> 
> With that, we can then indicate PCI_P2P support in the request queue.
> For this, we create an NVME_F_PCI_P2P flag which tells the core to
> set QUEUE_FLAG_PCI_P2P in the request queue.
> 
> Signed-off-by: Logan Gunthorpe 
> Reviewed-by: Sagi Grimberg 
> Reviewed-by: Christoph Hellwig 
>  drivers/nvme/host/core.c |  4 
>  drivers/nvme/host/nvme.h |  1 +
>  drivers/nvme/host/pci.c  | 17 +
>  3 files changed, 18 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index dd8ec1dd9219..6033ce2fd3e9 100644
> +++ b/drivers/nvme/host/core.c
> @@ -3051,7 +3051,11 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, 
> unsigned nsid)
>   ns->queue = blk_mq_init_queue(ctrl->tagset);
>   if (IS_ERR(ns->queue))
>   goto out_free_ns;
> +
>   blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
> + if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
> + blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
> +
>   ns->queue->queuedata = ns;
>   ns->ctrl = ctrl;
>  
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index bb4a2003c097..4030743c90aa 100644
> +++ b/drivers/nvme/host/nvme.h
> @@ -343,6 +343,7 @@ struct nvme_ctrl_ops {
>   unsigned int flags;
>  #define NVME_F_FABRICS   (1 << 0)
>  #define NVME_F_METADATA_SUPPORTED(1 << 1)
> +#define NVME_F_PCI_P2PDMA(1 << 2)
>   int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
>   int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
>   int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 2902585c6ddf..bb2120d30e39 100644
> +++ b/drivers/nvme/host/pci.c
> @@ -737,8 +737,13 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, 
> struct request *req,
>   goto out;
>  
>   ret = BLK_STS_RESOURCE;
> - nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents, dma_dir,
> - DMA_ATTR_NO_WARN);
> +
> + if (is_pci_p2pdma_page(sg_page(iod->sg)))
> + nr_mapped = pci_p2pdma_map_sg(dev->dev, iod->sg, iod->nents,
> +   dma_dir);
> + else
> + nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
> +  dma_dir,  DMA_ATTR_NO_WARN);
>   if (!nr_mapped)
>   goto out;
>  
> @@ -780,7 +785,10 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct 
> request *req)
>   DMA_TO_DEVICE : DMA_FROM_DEVICE;
>  
>   if (iod->nents) {
> - dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
> + /* P2PDMA requests do not need to be unmapped */
> + if (!is_pci_p2pdma_page(sg_page(iod->sg)))
> + dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);

This seems like a poor direction; if we add IOMMU hairpin support we
will need unmapping.
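For illustration, the symmetric shape I would expect (a sketch only;
pci_p2pdma_unmap_sg() is hypothetical here and would have to be added
to the P2PDMA API alongside the map side):

        if (iod->nents) {
                if (is_pci_p2pdma_page(sg_page(iod->sg)))
                        pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents,
                                            dma_dir);
                else
                        dma_unmap_sg(dev->dev, iod->sg, iod->nents, dma_dir);
        }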

Jason


Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-26 Thread Jason Gunthorpe
On Mon, Mar 26, 2018 at 11:30:38AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 26/03/18 10:41 AM, Jason Gunthorpe wrote:
> > On Mon, Mar 26, 2018 at 12:11:38PM +0100, Jonathan Cameron wrote:
> >> On Tue, 13 Mar 2018 10:43:55 -0600
> >> Logan Gunthorpe  wrote:
> >>
> >>> On 12/03/18 09:28 PM, Sinan Kaya wrote:
> >>>> On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
> >>>> Regarding the switch business, It is amazing how much trouble you went 
> >>>> into
> >>>> limit this functionality into very specific hardware.
> >>>>
> >>>> I thought that we reached to an agreement that code would not impose
> >>>> any limits on what user wants.
> >>>>
> >>>> What happened to all the emails we exchanged?  
> >>>
> >>> It turns out that root ports that support P2P are far less common than 
> >>> anyone thought. So it will likely have to be a white list.
> >>
> >> This came as a bit of a surprise to our PCIe architect.
> > 
> > I don't think it is a hardware problem.
> 
> The latest and greatest Power9 CPUs still explicitly do not support
> this.

I think this is another case where the HW can do it but the SW support is
missing: IOMMU configuration and maybe firmware too, for instance.

If I recall I saw a presentation that Coral was expected to use P2P
between the network and GPU.

> And, if I recall correctly, the ARM64 device we played with did
> not either -- but I suspect that will differ depending on vendor.

Wouldn't surprise me at all to see broken implementations in
ARM64.. But even there it needs IOMMU enablement to work at all if I
recall.

Basically, this is probably not a HW problem that needs a HW bit, but
an OS/firmware problem to do all the enablement..

Jason


Re: [PATCH v3 01/11] PCI/P2PDMA: Support peer-to-peer memory

2018-03-26 Thread Jason Gunthorpe
On Mon, Mar 26, 2018 at 12:11:38PM +0100, Jonathan Cameron wrote:
> On Tue, 13 Mar 2018 10:43:55 -0600
> Logan Gunthorpe  wrote:
> 
> > On 12/03/18 09:28 PM, Sinan Kaya wrote:
> > > On 3/12/2018 3:35 PM, Logan Gunthorpe wrote:
> > > Regarding the switch business, It is amazing how much trouble you went 
> > > into
> > > limit this functionality into very specific hardware.
> > > 
> > > I thought that we reached to an agreement that code would not impose
> > > any limits on what user wants.
> > > 
> > > What happened to all the emails we exchanged?  
> > 
> > It turns out that root ports that support P2P are far less common than 
> > anyone thought. So it will likely have to be a white list.
> 
> This came as a bit of a surprise to our PCIe architect.

I don't think it is a hardware problem.

I know Mellanox and Nvidia have been doing p2p on Intel root complexes
for something like 5-6 years now.. I don't have the details, but it
does seem to work.

I have heard some chips give poor performance..

Also AMD GPU SLI uses P2P these days, so this isn't exactly a niche
feature in Intel/AMD land.

I think the main issue here is that there is some BIOS involvement to
set things up properly. Eg in GPU land motherboards certify for
'crossfire' support.

> His follow up was whether it was worth raising an ECR for the PCIe spec
> to add a capability bit to allow this to be discovered.  This might
> long term avoid the need to maintain the white list for new devices.

If it is primarily a BIOS issue then it should be an ACPI thing, right?

Jason


Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB

2018-03-05 Thread Jason Gunthorpe
On Mon, Mar 05, 2018 at 01:42:12PM -0700, Keith Busch wrote:
> On Mon, Mar 05, 2018 at 01:10:53PM -0700, Jason Gunthorpe wrote:
> > So when reading the above mlx code, we see the first wmb() being used
> > to ensure that CPU stores to cachable memory are visible to the DMA
> > triggered by the doorbell ring.
> 
> IIUC, we don't need a similar barrier for NVMe to ensure memory is
> visible to DMA since the SQE memory is allocated DMA coherent when the
> SQ is not within a CMB.

You still need it.

DMA coherent just means you don't need to call the DMA API after
writing, it says nothing about CPU ordering.

eg on x86 DMA coherent is just normal system memory, and you do need
the SFENCE between system memory stores and DMA-triggering MMIO,
apparently.

Jason


Re: [PATCH v2 07/10] nvme-pci: Use PCI p2pmem subsystem to manage the CMB

2018-03-05 Thread Jason Gunthorpe
On Mon, Mar 05, 2018 at 09:57:27PM +0200, Sagi Grimberg wrote:

> Keith, while we're on this, regardless of cmb, is SQE memcopy and DB update
> ordering always guaranteed?
> 
> If you look at mlx4 (rdma device driver) that works exactly the same as
> nvme you will find:
> --
> qp->sq.head += nreq;
> 
> /*
>  * Make sure that descriptors are written before
>  * doorbell record.
>  */
> wmb();
> 
> writel(qp->doorbell_qpn,
>to_mdev(ibqp->device)->uar_map + MLX4_SEND_DOORBELL);
> 
> /*
>  * Make sure doorbells don't leak out of SQ spinlock
>  * and reach the HCA out of order.
>  */
> mmiowb();
> --
> 
> That seems to explicitly make sure to place a barrier before updating
> the doorbell. So as I see it, either ordering is guaranteed and the
> above code is redundant, or nvme needs to do the same.

A wmb() is always required before operations that can trigger DMA.

The reason ARM has a barrier in writel() is not to make it ordered
with respect to CPU stores to cachable memory, but to make it ordered
with respect to *other* writels.

Eg Linux defines this:

writel(A, mem);
writel(B, mem);

To always produce two TLPs on PCI-E when mem is UC BAR memory.

And ARM cannot guarantee that without the extra barrier.

So now we see stuff like this:

writel_relaxed(A, mem);
writel_relaxed(B, mem+4);

Which says the TLPs to A and B can be issued in any order..

So when reading the above mlx code, we see the first wmb() being used
to ensure that CPU stores to cachable memory are visible to the DMA
triggered by the doorbell ring.

The mmiowb() is used to ensure that DB writes are not combined and not
issued in any order other than implied by the lock that encloses the
whole thing. This is needed because uar_map is WC memory.

We don't have ordering with respect to two writel's here, so if ARM
performance was a concern the writel could be switched to
writel_relaxed().
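Ie something like this sketch (same mlx4 snippet as above, just
illustrating the substitution; the wmb() and mmiowb() already provide
the cachable-store and doorbell ordering guarantees):

        qp->sq.head += nreq;

        /* Order the cachable SQ descriptor stores before the
           DMA-triggering doorbell write. */
        wmb();

        /* Relaxed is enough here: wmb() covers ordering vs. the
           descriptors, and there is no second MMIO write that must
           stay ordered with this one. */
        writel_relaxed(qp->doorbell_qpn,
                       to_mdev(ibqp->device)->uar_map + MLX4_SEND_DOORBELL);

        /* Keep doorbells from leaking out of the SQ spinlock. */
        mmiowb();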

Presumably nvme has similar requirements, although I guess the DB
register is mapped UC not WC?

Jason


Re: [LSF/MM TOPIC] Filesystem-DAX, page-pinning, and RDMA

2018-02-01 Thread Jason Gunthorpe
On Mon, Jan 29, 2018 at 06:33:25PM -0500, Jerome Glisse wrote:

> Between i would also like to participate, in my view the burden should
> be on GUP users, so if hardware is not ODP capable then you should at
> least be able to kill the mapping/GUP and force the hardware to redo a
> GUP if it get any more transaction on affect umem. Can non ODP hardware
> do that ? Or is it out of the question ?

For RDMA we can have the HW forcibly tear down the MR, but it is
incredibly disruptive and nobody running applications would be happy
with this outcome.

Jason


Re: [PATCH 02/12] pci-p2p: Add sysfs group to display p2pmem stats

2018-01-04 Thread Jason Gunthorpe
On Thu, Jan 04, 2018 at 03:50:40PM -0600, Bjorn Helgaas wrote:
> > This is similar to /sys/bus/pci/drivers_autoprobe, but
> > affects only the VFs associated with a specific PF.
> > +
> > +What:  /sys/bus/pci/devices/.../p2pmem/available
> 
> I wonder if "p2pdma" would be a more suggestive term?  It's not really
> the *memory* that is peer-to-peer; the peer-to-peer part is referring
> to *access* to the memory.

There have been out of tree patches using P2P DMA for a long time, and
some of the use cases have nothing to do with 'memory' - eg DMA to
'registers'.

I notice that this series particularly focuses on treating the target
BAR as 'memory' - ie it puts genalloc on top of the BAR, and I guess
treats all addresses as equal and interchangeable.

If this series gets accepted I would expect proposals to extend this
infrastructure to allow for P2P for registers in some way as well.

So I think the 'p2pmem' name is a good choice only when it is in
places that talk about the genalloc part of this design.

We should reserve p2pdma to talk about the generic infrastructure
unrelated to the genalloc pool.

Since these sysfs's seem to report the genalloc pool status, p2pmem
seems like a good choice to me.

> > @@ -82,6 +130,9 @@ static int pci_p2pmem_setup(struct pci_dev *pdev)
> > if (error)
> > goto out_pool_destroy;
> >  
> > +   if (sysfs_create_group(&pdev->dev.kobj, &p2pmem_group))
> > +   dev_warn(&pdev->dev, "failed to create p2p sysfs group\n");
> 
> Not sure the warning (by itself) is worthwhile.  If we were going to
> disable the feature if sysfs_create_group() failed, that's one thing,
> but we aren't doing anything except generating a warning, which the
> user can't really do anything with.  If the user is looking for the
> sysfs file, its absence will be obvious even without the message.

Don't most of the failure paths inside sysfs_create_group cause prints
anyhow?

Jason


Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush

2017-10-18 Thread Jason Gunthorpe
On Mon, Oct 16, 2017 at 03:02:52PM +0300, Sagi Grimberg wrote:
> But why should the kernel ever need to mangle the CQ? if a lease break
> would deregister the MR the device is expected to generate remote
> protection errors on its own.

The point is to avoid protection errors - a hitless changeover when the
DAX mapping changes, like ODP does.

The only way to get there is to notify the app before the mappings
change.. Dan suggested having ibv_poll_cq() return this indication..

Jason


Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush

2017-10-13 Thread Jason Gunthorpe
On Fri, Oct 13, 2017 at 11:22:21AM -0700, Dan Williams wrote:
> > So, who should be responsible for MR coherency? Today we say the MPI
> > is responsible. But we can't really expect the MPI
> > to hook SIGIO and somehow try to reverse engineer what MRs are
> > impacted from a FD that may not even still be open.
> 
> Ok, that's good insight that I didn't have. Userspace needs more help
> than just an fd notification.

Glad to help!

> > I think, if you want to build a uAPI for notification of MR lease
> > break, then you need show how it fits into the above software model:
> >  - How it can be hidden in a RDMA specific library
> 
> So, here's a strawman: can ibv_poll_cq() start returning ibv_wc_status
> == IBV_WC_LOC_PROT_ERR when file coherency is lost. This would make
> the solution generic across DAX and non-DAX. What's you're feeling for
> how well applications are prepared to deal with that status return?

Stuffing an entry into the CQ is difficult. The CQ is in user memory
and it is DMA'd from the HCA for several pieces of hardware, so the
kernel can't just stuff something in there. It can be done
with HW support by having the HCA DMA it via an exception path or
something, but even then, you run into questions like CQ overflow and
accounting issues since it is not meant for this.

So, you need a side channel of some kind, either in certain drivers or
generically..

> >  - How lease break can be done hitlessly, so the library user never
> >needs to know it is happening or see failed/missed transfers
> 
> iommu redirect should be hit less and behave like the page cache case
> where RDMA targets pages that are no longer part of the file.

Yes, if the iommu can be fenced properly it sounds doable.

> >  - Whatever fast path checking is needed does not kill performance
> 
> What do you consider a fast path? I was assuming that memory
> registration is a slow path, and iommu operations are asynchronous so
> should not impact performance of ongoing operations beyond typical
> iommu overhead.

ibv_poll_cq() and ibv_post_send() would be a fast path.

Where this struggled before is that once you create a side channel you
also have to check that side channel, and checking it at high performance
is quite hard.. Even quiescing things to be able to tear down the MR
has performance implications on post send...

Now that I see this whole thing in this light it seems very similar
to the MPI driven user space mmu notification ideas and has similar
challenges. FWIW, RDMA banged its head on this issue for 10 years and
it was ODP that emerged as the solution.

One option might be to use an async event notification 'MR
de-coherence' and rely on a main polling loop to catch it.

This is good enough for dax because the lease-requestor would wait
until the async event was processed. It would be acceptable for
the general MPI case too, but only if this lease concept was wider
than just DAX, eg an MR leases a piece of a VMA, and if anything
changes that VMA (eg munmap, mmap, mremap, etc) then it has to wait
for the MR to release the lease; ie munmap would block until the
async event is processed. ODP-light in userspace, essentially.

IIRC this sort of suggestion was never explored, something like:

poll(fd)
ibv_read_async_event(fd)
if (event == MR_DECOHERENCE) {
        quiesce_network();
        ibv_restore_mr(mr);
        restore_network();
}

The implementation of ibv_restore_mr would have to make a new MR that
pointed to the same virtual memory addresses, but was backed by the
*new* physical pages. This means it has to unblock the lease, and wait
for the lease requestor to complete executing.

Jason


Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush

2017-10-13 Thread Jason Gunthorpe
On Fri, Oct 13, 2017 at 10:01:04AM -0700, Dan Williams wrote:
> On Fri, Oct 13, 2017 at 9:38 AM, Jason Gunthorpe
>  wrote:
> > On Fri, Oct 13, 2017 at 08:14:55AM -0700, Dan Williams wrote:
> >
> >> scheme specific to RDMA which seems like a waste to me when we can
> >> generically signal an event on the fd for any event that effects any
> >> of the vma's on the file. The FL_LAYOUT lease impacts the entire file,
> >> so as far as I can see delaying the notification until MR-init is too
> >> late, too granular, and too RDMA specific.
> >
> > But for RDMA an FD is not what we care about - we want the MR handle so
> > the app knows which MR needs fixing.
> 
> I'd rather put the onus on userspace to remember where it used a
> MAP_DIRECT mapping and be aware that all the mappings of that file are
> subject to a lease break. Sure, we could build up a pile of kernel
> infrastructure to notify on a per-MR basis, but I think that would
> only be worth it if leases were range based. As it is, the entire file
> is covered by a lease instance and all MRs that might reference that
> file get one notification. That said, we can always arrange for a
> per-driver callback at lease-break time so that it can do something
> above and beyond the default notification.

I don't think that really represents how lots of apps actually use
RDMA.

RDMA is often buried down in the software stack (eg in an MPI), and by
the time a mapping gets used for RDMA transfer the link between the
FD, mmap and the MR is totally opaque.

Having a MR specific notification means the low level RDMA libraries
have a chance to deal with everything for the app.

Eg consider an HPC app using MPI that uses some DAX-aware library to
get DAX backed mmap's. It then passes memory in those mmaps to the
MPI library to do transfers. The MPI creates the MR on demand.

So, who should be responsible for MR coherency? Today we say the MPI
is responsible. But we can't really expect the MPI
to hook SIGIO and somehow try to reverse engineer what MRs are
impacted from a FD that may not even still be open.

I think, if you want to build a uAPI for notification of MR lease
break, then you need show how it fits into the above software model:
 - How it can be hidden in a RDMA specific library
 - How lease break can be done hitlessly, so the library user never
   needs to know it is happening or see failed/missed transfers
 - Whatever fast path checking is needed does not kill performance

Jason


Re: [PATCH v9 0/6] MAP_DIRECT for DAX userspace flush

2017-10-13 Thread Jason Gunthorpe
On Fri, Oct 13, 2017 at 08:14:55AM -0700, Dan Williams wrote:

> scheme specific to RDMA which seems like a waste to me when we can
> generically signal an event on the fd for any event that effects any
> of the vma's on the file. The FL_LAYOUT lease impacts the entire file,
> so as far as I can see delaying the notification until MR-init is too
> late, too granular, and too RDMA specific.

But for RDMA an FD is not what we care about - we want the MR handle so
the app knows which MR needs fixing.

Jason


Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()

2017-10-13 Thread Jason Gunthorpe
On Fri, Oct 13, 2017 at 08:50:47AM +0200, Christoph Hellwig wrote:

> > However, chatting this over with a few more people I have an alternate
> > solution that effectively behaves the same as how non-ODP hardware
> > handles this case of hole punch / truncation today. So, today if this
> > scenario happens on a page-cache backed mapping, the file blocks are
> > unmapped and the RDMA continues into pinned pages that are no longer
> > part of the file. We can achieve the same thing with the iommu, just
> > re-target the I/O into memory that isn't part of the file. That way
> > hardware does not see I/O errors and the DAX data consistency model is
> > no worse than the page-cache case.
> 
> Yikes.

Well, as much as you say Yikes, Dan is correct, this does match the
semantics RDMA MR's already have. They become non-coherent if their
underlying object is changed, and there are many ways to get there.
I've never thought about it, but it does sound like ftruncate,
fallocate, etc on a normal file would break the MR coherency too??

There have been efforts in the past driven by the MPI people to
create, essentially, something like a 'lease-break' SIGIO. Except it was
intended to be general, and wanted to solve all the problems related to
MR de-coherence. This was complicated and never became acceptable to
mainline.

Instead ODP was developed, and ODP actually solves all the problems
sanely.

Thinking about it some more, and with your other comments on
get_user_pages in this thread, I tend to agree. It doesn't make sense
to develop a user space lease break API for MR's that is a DAX
specific feature.

Along the same lines, it also doesn't make sense to force-invalidate
MR's linked to DAX regions, while leaving alone MR's linked to other
regions that have the same problem.

If you want to make non-ODP MR's work better, then you need to have a
general overall solution to tell userspace when the MR becomes (or I
guess, is becoming) non-coherent, that covers all the cases that break
MR coherence, not just via DAX.

Otherwise, I think Dan is right, keeping the current semantic of
having MRs just do something wrong, but not corrupt memory, when they
lose coherence, is broadly consistent with how non-ODP MRs work today.

Jason


Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()

2017-10-12 Thread Jason Gunthorpe
On Tue, Oct 10, 2017 at 01:17:26PM -0700, Dan Williams wrote:

> Also keep in mind that what triggers the lease break is another
> application trying to write or punch holes in a file that is mapped
> for RDMA. So, if the hardware can't handle the iommu mapping getting
> invalidated asynchronously and the application can't react in the
> lease break timeout period then the administrator should arrange for
> the file to not be written or truncated while it is mapped.

That makes sense, but why not return ENOSYS or something to the app
trying to alter the file if the RDMA hardware can't support this,
instead of having the RDMA app deal with this lease break weirdness?

> It's already the case that get_user_pages() does not lock down file
> associations, so if your application is contending with these types of
> file changes it likely already has a problem keeping transactions in
> sync with the file state even without DAX.

Yes, things go weird in non-ODP RDMA cases like this..

Also, just to be clear, I would expect an app using the SIGIO interface
to basically halt ongoing RDMA, wait for MRs to become unused locally
and remotely, destroy the MRs, then somehow establish new MRs that
cover the same logical map (eg what ODP would do transparently) after
the lease breaker has made their changes, then restart their IO.

Does your SIGIO approach have a race-free way to do those last steps?

> > So, not being able to support DAX on certain RDMA hardware is not
> > an unreasonable situation in our space.
> 
> That makes sense, but it still seems to me that this proposed solution
> allows more than enough ways to avoid that worst case scenario where
> hardware reacts badly to iommu invalidation.

Yes, although I am concerned that returning PCI-E errors is such an
unusual and untested path for some of our RDMA drivers that they may
malfunction badly...

Again, going back to the question of who would ever use this, I would
be very reluctant to deploy a production configuration relying on the iommu
invalidate or SIGIO techniques, when ODP HW is available and works
flawlessly.

> be blacklisted from supporting DAX altogether. In other words this is
> a starting point to incrementally enhance or disable specific drivers,
> but with the assurance that the kernel can always do the safe thing
> when / if the driver is missing a finer grained solution.

Seems reasonable.. I think existing HW will have an easier time adding
invalidate, while new hardware really should implement ODP.

Jason


Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()

2017-10-10 Thread Jason Gunthorpe
On Tue, Oct 10, 2017 at 10:39:27AM -0700, Dan Williams wrote:
> On Tue, Oct 10, 2017 at 10:25 AM, Jason Gunthorpe

> >> Have a look at the patch [1], I don't touch the ODP path.
> >
> > But, does ODP work OK already? I'm not clear on that..
> 
> It had better. If the mapping is invalidated I would hope that
> generates an io fault that gets handled by the driver to setup the new
> mapping. I don't see how it can work otherwise.

I would assume so too...

> > This is why ODP should be the focus because this cannot work fully
> > reliably otherwise..
> 
> The lease break time is configurable. If that application can't
> respond to a stop request within a timeout of its own choosing then it
> should not be using DAX mappings.

Well, no RDMA application can really do this, unless you set the
timeout to multiple minutes, on par with network timeouts.

Again, these details are why I think this kind of DAX and non-ODP MRs
are probably not very useful in practice for a production system. Great
for testing of course, but in that case SIGKILL would be fine too...

> > Well, what about using SIGKILL if the lease-break-time hits? The
> > kernel will clean up the MRs when the process exits and this will
> > fence DMA to that memory.
> 
> Can you point me to where the MR cleanup code fences DMA and quiesces
> the device?

Yes. The MR's are associated with an fd. When the fd is closed
ib_uverbs_close triggers ib_uverbs_cleanup_ucontext which runs through
all the objects, including MRs, and deletes them.

The specification for deleting a MR requires a synchronous fence with
the hardware. After MR deletion the hardware will not DMA to any pages
described by the old MR, and those pages will be unpinned.
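In verbs terms that fence is just the deregistration call itself
(sketch):

        /* Once ibv_dereg_mr() returns, the HCA will no longer DMA to
           the pages the MR described and their pins are dropped. */
        if (ibv_dereg_mr(mr))
                /* handle errno */;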

> > But, still, if you really want to be fine-grained, then I think
> > invalidating the impacted MR's is a better solution for RDMA than
> > trying to do it with the IOMMU...
> 
> If there's a better routine for handling ib_umem_lease_break() I'd
> love to use it. Right now I'm reaching for the only tool I know for
> kernel enforced revocation of DMA access.

Well, you'd have to code something in the MR code to keep track of DAX
MRs and issue an out of band invalidate to impacted MRs to create the
fence.

This probably needs some driver work, I'm not sure if all the hardware
can do out of band invalidate to any MR or not..

Generally speaking, in RDMA, when a new feature like this comes along
we have to push a lot of the work down to the driver authors, and the
approach has historically been that new features only work on some
hardware (as much as I dislike this, it is pragmatic).

So, not being able to support DAX on certain RDMA hardware is not
an unreasonable situation in our space.

Jason


Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()

2017-10-10 Thread Jason Gunthorpe
On Mon, Oct 09, 2017 at 12:28:29PM -0700, Dan Williams wrote:

> > I don't think this has ever come up in the context of an all-device MR
> > invalidate requirement. Drivers already have code to invalidate
> > specific MRs, but to find all MRs that touch certain pages and then
> > invalidate them would be new code.
> >
> > We also have ODP aware drivers that can retarget a MR to new
> > physical pages. If the block map changes DAX should synchronously
> > retarget the ODP MR, not halt DMA.
> 
> Have a look at the patch [1], I don't touch the ODP path.

But, does ODP work OK already? I'm not clear on that..

> > Most likely ODP & DAX would need to be used together to get robust
> > user applications, as having the user QP's go to an error state at
> > random times (due to DMA failures) during operation is never going to
> > be acceptable...
> 
> It's not random. The process that set up the mapping and registered
> the memory gets SIGIO when someone else tries to modify the file map.
> That process then gets /proc/sys/fs/lease-break-time seconds to fix
> the problem before the kernel force revokes the DMA access.

Well, the process can't fix the problem in bounded time, so it is
random whether it will fail or not.

MR lifetime is under the control of the remote side, and the time to
complete the network exchanges required to release the MRs is hard to
bound. So even if I implement SIGIO properly my app will still likely
have random QP failures under various cases and workloads. :(

This is why ODP should be the focus because this cannot work fully
reliably otherwise..

> > Perhaps you might want to initially only support ODP MR mappings with
> > DAX and then the DMA fencing issue goes away?
> 
> I'd rather try to fix the non-ODP DAX case instead of just turning it off.

Well, what about using SIGKILL if the lease-break-time hits? The
kernel will clean up the MRs when the process exits and this will
fence DMA to that memory.

But, still, if you really want to be fine-grained, then I think
invalidating the impacted MR's is a better solution for RDMA than
trying to do it with the IOMMU...

Jason


Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()

2017-10-09 Thread Jason Gunthorpe
On Mon, Oct 09, 2017 at 12:05:30PM -0700, Dan Williams wrote:
> On Mon, Oct 9, 2017 at 11:58 AM, Jason Gunthorpe
>  wrote:
> > On Fri, Oct 06, 2017 at 03:35:54PM -0700, Dan Williams wrote:
> >> otherwise be quiesced. The need for this knowledge is driven by a need
> >> to make RDMA transfers to DAX mappings safe. If the DAX file's block map
> >> changes we need to be able to reliably stop accesses to blocks that have been
> >> freed or re-assigned to a new file.
> >
> > If RDMA is driving this need, why not invalidate backing RDMA MRs
> > instead of requiring an IOMMU to do it? RDMA MRs are finer-grained and
> > do not suffer from the re-use problem David W. brought up with IOVAs..
> 
> Sounds promising. All I want in the end is to be sure that the kernel
> is enabled to stop any in-flight RDMA at will without asking
> userspace. Does this require per-RDMA driver opt-in or is there a
> common call that can be made?

I don't think this has ever come up in the context of an all-device MR
invalidate requirement. Drivers already have code to invalidate
specific MRs, but to find all MRs that touch certain pages and then
invalidate them would be new code.

We also have ODP aware drivers that can retarget a MR to new
physical pages. If the block map changes DAX should synchronously
retarget the ODP MR, not halt DMA.

Most likely ODP & DAX would need to be used together to get robust
user applications, as having the user QP's go to an error state at
random times (due to DMA failures) during operation is never going to
be acceptable...

Perhaps you might want to initially only support ODP MR mappings with
DAX and then the DMA fencing issue goes away?

Cheers,
Jason


Re: [PATCH v7 07/12] dma-mapping: introduce dma_has_iommu()

2017-10-09 Thread Jason Gunthorpe
On Fri, Oct 06, 2017 at 03:35:54PM -0700, Dan Williams wrote:
> otherwise be quiesced. The need for this knowledge is driven by a need
> to make RDMA transfers to DAX mappings safe. If the DAX file's block map
> changes we need to be able to reliably stop accesses to blocks that have been
> freed or re-assigned to a new file.

If RDMA is driving this need, why not invalidate backing RDMA MRs
instead of requiring an IOMMU to do it? RDMA MRs are finer-grained and
do not suffer from the re-use problem David W. brought up with IOVAs..

Jason


Re: [PATCH v2 15/21] xen-blkfront: Make use of the new sg_map helper function

2017-04-27 Thread Jason Gunthorpe
On Thu, Apr 27, 2017 at 05:03:45PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 27/04/17 04:11 PM, Jason Gunthorpe wrote:
> > On Thu, Apr 27, 2017 at 03:53:37PM -0600, Logan Gunthorpe wrote:
> > Well, that is in the current form, with more users it would make sense
> > to optimize for the single page case, eg by providing the existing
> > call, providing a faster single-page-only variant of the copy, perhaps
> > even one that is inlined.
> 
> Ok, does it make sense then to have an sg_copy_page_to_buffer (or some
> such... I'm having trouble thinking of a sane name that isn't too long).
> That just does k(un)map_atomic and memcpy? I could try that if it makes
> sense to people.

It seems the most robust: test for iomem, and jump to a slow path
copy; otherwise inline the kmap and memcpy.

Every place doing a memcpy from an sgl will need that pattern to be
correct.
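Roughly this shape, as a sketch (sg_is_iomem() and
sg_copy_iomem_to_buffer() are hypothetical names for the predicate and
the out-of-line slow path, and a single-entry sg is assumed for
brevity):

        static void copy_from_sg(struct scatterlist *sg, void *buf, size_t len)
        {
                if (sg_is_iomem(sg)) {
                        /* slow path, eg built on memcpy_fromio() */
                        sg_copy_iomem_to_buffer(sg, buf, len);
                } else {
                        void *p = kmap_atomic(sg_page(sg));

                        memcpy(buf, p + sg->offset, len);
                        kunmap_atomic(p);
                }
        }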

> > sg_miter will still fail when the sg contains __iomem, however I would
> > expect that the sg_copy will work with iomem, by using the __iomem
> > memcpy variant.
> 
> Yes, that's true. Any sg_miters that ever see iomem will need to be
> converted to support it. This isn't much different than the other
> kmap(sg_page()) users I was converting that will also fail if they see
> iomem. Though, I suspect an sg_miter user would be easier to convert to
> iomem than a random kmap user.

How? sg_miter seems like the next nightmare down this path; what is
sg_miter_next() supposed to do when something hits an iomem sgl?

miter.addr is supposed to be a kernel pointer that must not be
__iomem..

Jason


Re: [PATCH v2 15/21] xen-blkfront: Make use of the new sg_map helper function

2017-04-27 Thread Jason Gunthorpe
On Thu, Apr 27, 2017 at 03:53:37PM -0600, Logan Gunthorpe wrote:
> On 27/04/17 02:53 PM, Jason Gunthorpe wrote:
> > blkfront is one of the drivers I looked at, and it appears to only be
> > memcpying with the bvec_data pointer, so I wonder why it does not use
> > sg_copy_X_buffer instead..
> 
> But you'd potentially end up calling sg_copy_to_buffer multiple times
> per page within the sg (given that gnttab_foreach_grant_in_range might
> call blkif_copy_from_grant/blkif_setup_rw_req_grant multiple times).
> Even calling sg_copy_to_buffer once per page seems rather inefficient as
> it uses sg_miter internally.

Well, that is in the current form; with more users it would make sense
to optimize for the single page case, eg by keeping the existing call
while providing a faster single-page-only variant of the copy, perhaps
even one that is inlined.

> Switching the for_each_sg to sg_miter is probably the nicer solution as
> it takes care of the mapping and the offset/length accounting for you
> and will have similar performance.

sg_miter will still fail when the sg contains __iomem; however, I would
expect that the sg_copy will work with iomem, by using the __iomem
memcpy variant.

So, sg_copy should always be preferred in this new world with mixed
__iomem since it is the only primitive that can transparently handle
it.

Jason


Re: [PATCH v2 15/21] xen-blkfront: Make use of the new sg_map helper function

2017-04-27 Thread Jason Gunthorpe
On Thu, Apr 27, 2017 at 02:19:24PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 26/04/17 01:37 AM, Roger Pau Monné wrote:
> > On Tue, Apr 25, 2017 at 12:21:02PM -0600, Logan Gunthorpe wrote:
> >> Straightforward conversion to the new helper, except due to the lack
> >> of error path, we have to use SG_MAP_MUST_NOT_FAIL which may BUG_ON in
> >> certain cases in the future.
> >>
> >> Signed-off-by: Logan Gunthorpe 
> >> Cc: Boris Ostrovsky 
> >> Cc: Juergen Gross 
> >> Cc: Konrad Rzeszutek Wilk 
> >> Cc: "Roger Pau Monné" 
> >>  drivers/block/xen-blkfront.c | 20 +++-
> >>  1 file changed, 11 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> >> index 3945963..ed62175 100644
> >> +++ b/drivers/block/xen-blkfront.c
> >> @@ -816,8 +816,9 @@ static int blkif_queue_rw_req(struct request *req, 
> >> struct blkfront_ring_info *ri
> >>BUG_ON(sg->offset + sg->length > PAGE_SIZE);
> >>  
> >>if (setup.need_copy) {
> >> -  setup.bvec_off = sg->offset;
> >> -  setup.bvec_data = kmap_atomic(sg_page(sg));
> >> +  setup.bvec_off = 0;
> >> +  setup.bvec_data = sg_map(sg, 0, SG_KMAP_ATOMIC |
> >> +   SG_MAP_MUST_NOT_FAIL);
> > 
> > I assume that sg_map already adds sg->offset to the address?
> 
> Correct.
> 
> > Also wondering whether we can get rid of bvec_off and just increment 
> > bvec_data,
> > adding Julien who IIRC added this code.
> 
> bvec_off is used to keep track of the offset within the current mapping
> so it's not a great idea given that you'd want to kunmap_atomic the
> original address and not something with an offset. It would be nice if
> this could be converted to use the sg_miter interface but that's a much
> more invasive change that would require someone who knows this code and
> can properly test it. I'd be very grateful if someone actually took that on.

blkfront is one of the drivers I looked at, and it appears to only be
memcpying with the bvec_data pointer, so I wonder why it does not use
sg_copy_X_buffer instead..

Jason


Re: [PATCH v2 01/21] scatterlist: Introduce sg_map helper functions

2017-04-27 Thread Jason Gunthorpe
On Thu, Apr 27, 2017 at 08:53:38AM +0200, Christoph Hellwig wrote:

> > The main difficulty we
> > have now is that neither of those functions are expected to fail and we
> > need them to be able to in cases where the page doesn't map to system
> > RAM. This patch series is trying to address it for users of scatterlist.
> > I'm certainly open to other suggestions.
> 
> I think you'll need to follow the existing kmap semantics and never
> fail the iomem version either.  Otherwise you'll have a special case
> that's almost never used that has a different error path.

How about first switching as many call sites as possible to use
sg_copy_X_buffer instead of kmap?

A random audit of Logan's series suggests this is actually a fairly
common thing.

eg drivers/mmc/host/sdhci.c is only doing this:

buffer = sdhci_kmap_atomic(sg, &flags);
memcpy(buffer, align, size);
sdhci_kunmap_atomic(buffer, &flags);

drivers/scsi/mvsas/mv_sas.c is this:

+   to = sg_map(sg_resp, 0, SG_KMAP_ATOMIC);
+   memcpy(to,
+  slot->response + sizeof(struct mvs_err_info),
+  sg_dma_len(sg_resp));
+   sg_unmap(sg_resp, to, 0, SG_KMAP_ATOMIC);

etc.
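For instance, the sdhci snippet above should collapse to one call (a
sketch; note the copy direction is from the bounce buffer into the
single sg entry, so it is the _from_ variant):

        sg_copy_from_buffer(sg, 1, align, size);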

Lots of other places seem similar, if not sometimes a little bit more
convoluted..

Switching all the trivial cases to use copy might bring more clarity
as to what is actually required for the remaining few users. If there
are only a few then it may no longer matter if the API is not ideal.

Jason


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-19 Thread Jason Gunthorpe
On Wed, Apr 19, 2017 at 01:41:49PM -0600, Logan Gunthorpe wrote:

> > But.. it could point to a GPU and the GPU struct device could have a
> > proxy dma_ops like Dan pointed out.
> 
> Seems a bit awkward to me that in order for the intended use case, you
> have to proxy the dma_ops. I'd probably still suggest throwing a couple
> ops for things like this in the dev_pagemap.

Another option is adding a new 'struct completer_dma_ops *' to struct
device for this use case.

Seems like a waste to expand dev_pagemap when we only need a unique
value per struct device?

Jason


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-19 Thread Jason Gunthorpe
On Wed, Apr 19, 2017 at 01:02:49PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 19/04/17 12:32 PM, Jason Gunthorpe wrote:
> > On Wed, Apr 19, 2017 at 12:01:39PM -0600, Logan Gunthorpe wrote:
> > Not entirely, it would have to call through the whole process
> > including the arch_p2p_cross_segment()..
> 
> Hmm, yes. Though it's still not clear what, if anything,
> arch_p2p_cross_segment would be doing.

Sets up the iommu for arches that place an iommu between the pci root
port and other pci root ports.

> In my experience, if you are going between host bridges, the CPU
> address (or PCI address -- I'm not sure which seeing they are the
> same on my system) would still work fine

Try it with VT-D turned on. It shouldn't work, or there is a notable
security hole in your platform..

> >  const struct dma_map_ops *comp_ops = get_dma_ops(completer);
> >  const struct dma_map_ops *init_ops = get_dma_ops(initiator);
>
> So, in this case, what device does the completer point to? The PCI
> device or a more specific GPU device? If it's the former, who's
> responsible for setting the new dma_ops? Typically the dma_ops are arch
> specific but now you'd be adding ones that are tied to hmm or the gpu.

Donno, that is for GPU folks to figure out :)

But.. it could point to a GPU and the GPU struct device could have a
proxy dma_ops like Dan pointed out.

> >> I'm not sure I like the name pci_p2p_same_segment. It reads as though
> >> it's only checking if the devices are not the same segment.
> > 
> > Well, that is exactly what it is doing. If it succeeds then the caller
> > knows the DMA will not flow outside the segment and no iommu setup/etc
> > is required.
> 
> It appears to me like it's calculating the DMA address, and the check is
> just a side requirement. It reads as though it's only doing the check.

pci_p2p_same_segment_get_pa() then?

Jason


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-19 Thread Jason Gunthorpe
On Wed, Apr 19, 2017 at 12:01:39PM -0600, Logan Gunthorpe wrote:

> I'm just spit balling here but if HMM wanted to use unaddressable memory
> as a DMA target, it could set that function to create a window in the gpu
> memory, then call the pci_p2p_same_segment and return the result as the
> dma address.

Not entirely, it would have to call through the whole process
including the arch_p2p_cross_segment()..

Maybe we can start down the road of using ops for more iommu steps
with something like this as the helper:

dma_addr_t dma_map_pa(struct device *initiator, struct page *page,
                      void *data)
{
        const struct dma_map_ops *init_ops = get_dma_ops(initiator);
        const struct dma_map_ops *comp_ops;
        struct device *completer = get_p2p_completer(page);
        dma_addr_t pa;

        if (IS_ERR(completer))
                return SYSTEM_MEMORY;
                // Or maybe ?
                // return init_ops->dma_map_pa(..);

        // Try the generic method
        pa = pci_p2p_same_segment(initiator, completer, page);
        if (pa != ERROR)
                goto out;

        // Try the arch specific helper
        comp_ops = get_dma_ops(completer);

        /* FUTURE: Let something translate a HMM page into a DMA'ble
           page, eg by mapping it into a GPU window. Maybe this
           callback lives in devmap ? */
        page = comp_ops->translate_dma_page(completer, page);

        /* New dma_map_op is the same as arch_p2p_cross_segment in
           prior version. Any arch specific data needed to program
           the iommu flows through data */
        pa = init_ops->p2p_cross_segment_map(completer, initiator, page,
                                             data);

out:
        device_put(completer);
        return pa;
}

// map_sg op:
for (each sgl) {
        struct page *page = sg_page(s);
        struct arch_iommu_data data = {}; // pass through to ops->p2p_cross_segment_map
        dma_addr_t pa;

        pa = dma_map_pa(dev, page, &data);
        if (pa == ERROR)
                // fail

        if (!data.needs_iommu) {
                // Insert PA directly into the result SGL
                sg++;
                continue;
        }

        // Else pa & data describe how to setup the iommu
}

> > dma_addr_t pci_p2p_same_segment(struct device *initiator,
> > struct device *completer,
> > struct page *page)
> 
> I'm not sure I like the name pci_p2p_same_segment. It reads as though
> it's only checking if the devices are not the same segment.

Well, that is exactly what it is doing. If it succeeds then the caller
knows the DMA will not flow outside the segment and no iommu setup/etc
is required.

That function cannot be expanded to include generic cross-segment
traffic; a new function would be needed..

Jason


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-19 Thread Jason Gunthorpe
On Wed, Apr 19, 2017 at 10:48:51AM -0600, Logan Gunthorpe wrote:
> The pci_enable_p2p_bar function would then just need to call
> devm_memremap_pages with the dma_map callback set to a function that
> does the segment check and the offset calculation.

I don't see a use for the dma_map function pointer at this point..

It doesn't make a lot of sense for the completer of the DMA to provide
a mapping op; the mapping process is *path* specific, not specific to
a completer/initiator.

So, I would suggest more like this:

static inline struct device *get_p2p_src(struct page *page)
{
        struct device *res;
        struct dev_pagemap *pgmap;

        if (!is_zone_device_page(page))
                return NULL;

        pgmap = get_dev_pagemap(page_to_pfn(page), NULL);
        if (!pgmap || pgmap->type != MEMORY_DEVICE_P2P)
                /* For now ZONE_DEVICE memory that is not P2P is
                   assumed to be configured for DMA the same as CPU
                   memory. */
                return ERR_PTR(-EINVAL);
        res = pgmap->dev;
        device_get(res);
        put_dev_pagemap(pgmap);
        return res;
}

dma_addr_t pci_p2p_same_segment(struct device *initiator,
                                struct device *completer,
                                struct page *page)
{
        if (! PCI initiator & completer)
                return ERROR;
        if (!same segment initiator & completer)
                return ERROR;

        // Translate page directly to the value programmed into the BAR
        return (Completer's PCI BAR base address) + (offset of page within BAR);
}

// dma_sg_map

for (each sgl) {
        struct page *page = sg_page(s);
        struct device *p2p_src = get_p2p_src(page);

        if (IS_ERR(p2p_src))
                // fail dma_sg

        if (p2p_src) {
                bool needs_iommu = false;

                pa = pci_p2p_same_segment(dev, p2p_src, page);
                if (pa == ERROR)
                        pa = arch_p2p_cross_segment(dev, p2p_src, page,
                                                    &needs_iommu);

                device_put(p2p_src);

                if (pa == ERROR)
                        // fail

                if (!needs_iommu) {
                        // Insert PA directly into the result SGL
                        sg++;
                        continue;
                }
        }
        else
                // CPU memory
                pa = page_to_phys(page);


To me it looks like the code duplication across the iommu stuff comes
from just duplicating the basic iommu algorithm in every driver.

To clean that up I think someone would need to hoist the overall sgl
loop and use more ops callbacks eg allocate_iommu_range,
assign_page_to_range, dealloc_range, etc. This is a problem p2p makes
worse, but isn't directly causing :\

Jason
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-19 Thread Jason Gunthorpe
On Wed, Apr 19, 2017 at 11:20:06AM +1000, Benjamin Herrenschmidt wrote:

> That helper wouldn't perform the actual iommu mapping. It would simply
> return something along the lines of:
> 
>  - "use that alternate bus address and don't map in the iommu"

I was thinking only this one would be supported with a core code
helper..

>  - "use that alternate bus address and do map in the iommu"
>  - "proceed as normal"

.. because I don't have an idea how a core code helper could figure
out what the platform needs for the above two ..

I think many of the iommu platforms will need to construct their own
bus address in the iommu, and we already have easy access to the CPU
address.

Presumably once a transaction passes through the iommu it needs to be
using the CPU address for the target bar, otherwise things are going
to be ambiguous..

Jason


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Jason Gunthorpe
On Tue, Apr 18, 2017 at 03:51:27PM -0700, Dan Williams wrote:
> > This really seems like much less trouble than trying to wrapper all
> > the arch's dma ops, and doesn't have the wonky restrictions.
> 
> I don't think the root bus iommu drivers have any business knowing or
> caring about dma happening between devices lower in the hierarchy.

Maybe not, but performance requires some odd choices in this code.. :(

> > Setting up the iommu is fairly expensive, so getting rid of the
> > batching would kill performance..
> 
> When we're crossing device and host memory boundaries how much
> batching is possible? As far as I can see you'll always be splitting
> the sgl on these dma mapping boundaries.

Splitting the sgl is different from iommu batching.

As an example, an O_DIRECT write of 1 MB with a single 4K P2P page in
the middle.

The optimum behavior is to allocate a 1MB-4K iommu range and fill it
with the CPU memory. Then return a SGL with three entries, two
pointing into the range and one to the p2p.

It is creating each range which tends to be expensive, so creating two
ranges (or worse, if every SGL created a range it would be 255) is
very undesired.
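
Concretely, the mapped result in that example would look something like
this (addresses purely illustrative):

  sg[0]: dma_address = IOVA,          length = 508K  (CPU pages, in range)
  sg[1]: dma_address = PCI bus addr,  length = 4K    (the P2P page)
  sg[2]: dma_address = IOVA + 508K,   length = 512K  (CPU pages, in range)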

Jason


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Jason Gunthorpe
On Tue, Apr 18, 2017 at 03:28:17PM -0700, Dan Williams wrote:

> Unlike the pci bus address offset case which I think is fundamental to
> support since shipping archs do this today

But we can support this by modifying those arch's unique dma_ops
directly.

Eg as I explained, my p2p_same_segment_map_page() helper concept would
do the offset adjustment for same-segment DMA.

If PPC calls that in their IOMMU drivers then they will have proper
support for this basic p2p, and the right framework to move on to more
advanced cases of p2p.

This really seems like much less trouble than trying to wrapper all
the arch's dma ops, and doesn't have the wonky restrictions.

> I think it is ok to say p2p is restricted to a single sgl that gets
> to talk to host memory or a single device.

RDMA and GPU would be sad with this restriction...

> That said, what's wrong with a p2p aware map_sg implementation
> calling up to the host memory map_sg implementation on a per sgl
> basis?

Setting up the iommu is fairly expensive, so getting rid of the
batching would kill performance..

Jason


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Jason Gunthorpe
On Tue, Apr 18, 2017 at 03:31:58PM -0600, Logan Gunthorpe wrote:

> 1) It means that sg_has_p2p has to walk the entire sg and check every
> page. Then map_sg_p2p/map_sg has to walk it again and repeat the check
> then do some operation per page. If anyone is concerned about the
> dma_map performance this could be an issue.

dma_map performance is a concern, this is why I suggest this as an
interim solution until all dma_ops are migrated. Ideally sg_has_p2p
would be a fast path that checked some kind of flags bit set during
sg_assign_page... 

This would probably all have to be protected with CONFIG_P2P until it
becomes performance neutral. People without an iommu are not going to
want to walk the sg list at all..

> 2) Without knowing exactly what the arch specific code may need to do
> it's hard to say that this is exactly the right approach. If every
> dma_ops provider has to do exactly this on every page it may lead to a
> lot of duplicate code:

I think someone would have to start to look at it to make a
determination..

I suspect the main server oriented iommu dma op will want to have
proper p2p support anyhow and will probably have their unique control
flow..

> The only thing I'm presently aware of is the segment check and applying
> the offset to the physical address

Well, I called the function p2p_same_segment_map_page() in my last
suggestion for a reason - that is all the helper does.

The intention would be for real iommu drivers to call that helper for
the one simple case and if it fails then use their own routines to
figure out if cross-segment P2P is possible and configure the iommu as
needed.

> bus specific and not arch specific which I think is what Dan may be
> getting at. So it may make sense to just have a pci_map_sg_p2p() which
> takes a dma_ops struct it would use for any page that isn't a p2p page.

Like I keep saying, dma_ops are not really designed to be stacked.

Try and write a stacked map_sg function like you describe and you will
see how horrible it quickly becomes.

Setting up an iommu is very expensive, so we need to batch it for the
entire sg list. Thus a trivial implementation to iterate over all sg
list entries is not desired.

So first a sg list without p2p memory would have to be created, pass
to the lower level ops, then brought back. Remember, the returned sg
list will have a different number of entries than the original. Now
another complex loop is needed to split/merge back in the p2p sg
elements to get a return result.

Finally, we have to undo all of this when doing unmap.

Basically, all this list processing is a huge overhead compared to
just putting a helper call in the existing sg iteration loop of the
actual op.  Particularly if the actual op is a no-op like no-mmu x86
would use.

Since dma mapping is a performance path we must be careful not to
create intrinsic inefficiencies with otherwise nice layering :)

Jason


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Jason Gunthorpe
On Tue, Apr 18, 2017 at 02:11:33PM -0700, Dan Williams wrote:
> > I think this opens an even bigger can of worms..
> 
> No, I don't think it does. You'd only shim when the target page is
> backed by a device, not host memory, and you can figure this out by a
> is_zone_device_page()-style lookup.

The bigger can of worms is how do you meaningfully stack dma_ops.

What does the p2p provider do when it detects a p2p page?

Jason


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Jason Gunthorpe
On Tue, Apr 18, 2017 at 12:48:35PM -0700, Dan Williams wrote:

> > Yes, I noticed this problem too and that makes sense. It just means
> > every dma_ops will probably need to be modified to either support p2p
> > pages or fail on them. Though, the only real difficulty there is that it
> > will be a lot of work.
> 
> I don't think you need to go touch all dma_ops, I think you can just
> arrange for devices that are going to do dma to get redirected to a
> p2p aware provider of operations that overrides the system default
> dma_ops. I.e. just touch get_dma_ops().

I don't follow, when does get_dma_ops() return a p2p aware provider?
It has no way to know if the DMA is going to involve p2p, get_dma_ops
is called with the device initiating the DMA.

So you'd always return the P2P shim on a system that has registered
P2P memory?

Even so, how does this shim work? dma_ops are not really intended to
be stacked. How would we make unmap work, for instance? What happens
when the underlying iommu dma ops actually natively understands p2p
and doesn't want the shim?

I think this opens an even bigger can of worms..

Lets find a strategy to safely push this into dma_ops.

What about something more incremental like this instead:
- dma_ops will set map_sg_p2p == map_sg when they are updated to
  support p2p, otherwise DMA on P2P pages will fail for those ops.
- When all ops support p2p we remove the if and ops->map_sg, then
  just call map_sg_p2p
- For now the scatterlist maintains a bit when pages are added indicating if
  p2p memory might be present in the list.
- Unmap for p2p and non-p2p is the same, the underlying ops driver has
  to make it work.

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 0977317c6835c2..505ed7d502053d 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -103,6 +103,9 @@ struct dma_map_ops {
int (*map_sg)(struct device *dev, struct scatterlist *sg,
  int nents, enum dma_data_direction dir,
  unsigned long attrs);
+   int (*map_sg_p2p)(struct device *dev, struct scatterlist *sg,
+ int nents, enum dma_data_direction dir,
+ unsigned long attrs);
void (*unmap_sg)(struct device *dev,
 struct scatterlist *sg, int nents,
 enum dma_data_direction dir,
@@ -244,7 +247,15 @@ static inline int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
for_each_sg(sg, s, nents, i)
kmemcheck_mark_initialized(sg_virt(s), s->length);
BUG_ON(!valid_dma_direction(dir));
-   ents = ops->map_sg(dev, sg, nents, dir, attrs);
+
+   if (sg_has_p2p(sg)) {
+   if (ops->map_sg_p2p)
+   ents = ops->map_sg_p2p(dev, sg, nents, dir, attrs);
+   else
+   return 0;
+   } else
+   ents = ops->map_sg(dev, sg, nents, dir, attrs);
+
BUG_ON(ents < 0);
debug_dma_map_sg(dev, sg, nents, ents, dir);
 



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Jason Gunthorpe
On Tue, Apr 18, 2017 at 01:35:32PM -0600, Logan Gunthorpe wrote:

> > Ultimately every dma_ops will need special code to support P2P with
> > the special hardware that ops is controlling, so it makes some sense
> > to start by pushing the check down there in the first place. This
> > advice is partially motivated by how dma_map_sg is just a small
> > wrapper around the function pointer call...
> 
> Yes, I noticed this problem too and that makes sense. It just means
> every dma_ops will probably need to be modified to either support p2p
> pages or fail on them. Though, the only real difficulty there is that it
> will be a lot of work.

I think this is why progress on this keeps getting stuck - every
solution is a lot of work.

> > Where p2p_same_segment_map_page checks if the two devices are on the
> > 'same switch' and if so returns the address translated to match the
> > bus address programmed into the BAR or fails. We knows this case is
> > required to work by the PCI spec, so it makes sense to use it as the
> > first canned helper.
> 
> I've also suggested that this check should probably be done (or perhaps
> duplicated) before we even get to the map stage.

Since the mechanics of the check are essentially unique to every
dma-ops I would not hoist it out of the map function without a really
good reason.

> In the case of nvme-fabrics we'd probably want to let the user know
> when they try to configure it or at least fall back to allocating
> regular memory instead.

You could try to do a dummy mapping / create a MR early on to detect
this.

FWIW, I wonder if from a RDMA perspective we have another
problem.. Should we allow P2P memory to be used with the local DMA
lkey? There are potential designs around virtualization that would not
allow that. Should we mandate that P2P memory be in its own MR?

Jason


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Jason Gunthorpe
On Tue, Apr 18, 2017 at 12:30:59PM -0600, Logan Gunthorpe wrote:

> >  - The dma ops provider must be able to tell if source memory is bar
> >mapped and recover the pci device backing the mapping.
> 
> Do you mean to say that every dma-ops provider needs to be taught about
> p2p backed pages? I was hoping we could have dma_map_* just use special
> p2p dma-ops if it was passed p2p pages (though there are some
> complications to this too).

I think that is how it will end up working out if this is the path..

Ultimately every dma_ops will need special code to support P2P with
the special hardware that ops is controlling, so it makes some sense
to start by pushing the check down there in the first place. This
advice is partially motivated by how dma_map_sg is just a small
wrapper around the function pointer call...

Something like:

foo_dma_map_sg(...)
{
        for (every page in sg)
                if (page is p2p)
                        dma_addr[i] = p2p_same_segment_map_page(...);
}

Where p2p_same_segment_map_page checks if the two devices are on the
'same switch' and if so returns the address translated to match the
bus address programmed into the BAR or fails. We know this case is
required to work by the PCI spec, so it makes sense to use it as the
first canned helper.

This also proves out the basic idea that the dma ops can recover the
pci device and perform an inspection of the traversed fabric path.

From there every arch would have to expand the implementation to
support a wider range of things. Eg x86 with no iommu and no offset
could allow every address to be used based on a host bridge white
list.

Jason


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Jason Gunthorpe
On Tue, Apr 18, 2017 at 10:27:47AM -0700, Dan Williams wrote:
> > FWIW, RDMA probably wouldn't want to use a p2mem device either, we
> > already have APIs that map BAR memory to user space, and would like to
> > keep using them. A 'enable P2P for bar' helper function sounds better
> > to me.
> 
> ...and I think it's not a helper function as much as asking the bus
> provider "can these two device dma to each other".

What I mean I could write in a RDMA driver:

  /* Allow the memory in BAR 1 to be the target of P2P transactions */
  pci_enable_p2p_bar(dev, 1);

And not require anything else..

> The "helper" is the dma api redirecting through a software-iommu
> that handles bus address translation differently than it would
> handle host memory dma mapping.

Not sure, until we see what arches actually need to do here it is hard
to design common helpers.

Here are a few obvious things that arches will need to implement to
support this broadly:

- Virtualization might need to do a hypervisor call to get the right
  translation, or consult some hypervisor specific description table.

- Anything using IOMMUs for virtualization will need to setup IOMMU
  permissions to allow the P2P flow, this might require translation to
  an address cookie.

- Fail if the PCI devices are in different domains, or setup hardware to
  do completion bus/device/function translation.

- All platforms can succeed if the PCI devices are under the same
  'segment', but where segments begin is somewhat platform specific
  knowledge. (this is 'same switch' idea Logan has talked about)

So, we can eventually design helpers for various common scenarios, but
until we see what arch code actually needs to do it seems
premature. Much of this seems to involve interaction with some kind of
hardware, or consultation of some kind of currently platform specific
data, so I'm not sure what a software-iommu would be doing??

The main thing to agree on is that this code belongs under dma ops and
that arches have to support struct page mapped BAR addresses in their
dma ops inputs. Is that reasonable?

Jason


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Jason Gunthorpe
On Mon, Apr 17, 2017 at 08:23:16AM +1000, Benjamin Herrenschmidt wrote:
 
> Thanks :-) There's a reason why I'm insisting on this. We have constant
> requests for this today. We have hacks in the GPU drivers to do it for
> GPUs behind a switch, but those are just that, ad-hoc hacks in the
> drivers. We have similar grossness around the corner with some CAPI
> NICs trying to DMA to GPUs. I have people trying to use PLX DMA engines
> to whack nVME devices.

A lot of people feel this way in the RDMA community too. We have had
vendors shipping out of tree code to enable P2P for RDMA with GPU
years and years now. :(

Attempts to get things in mainline have always run into the same sort
of road blocks you've identified in this thread..

FWIW, I read this discussion and it sounds closer to an agreement than
I've ever seen in the past.

From Ben's comments, I would think that the 'first class' support that
is needed here is simply a function to return the 'struct device'
backing a CPU address range.

This is the minimal required information for the arch or IOMMU code
under the dma ops to figure out the fabric source/dest, compute the
traffic path, determine if P2P is even possible, what translation
hardware is crossed, and what DMA address should be used.

If there is going to be more core support for this stuff I think it
will be under the topic of more robustly describing the fabric to the
core and core helpers to extract data from the description: eg compute
the path, check if the path crosses translation, etc

But that isn't really related to P2P, and is probably better left to
the arch authors to figure out where they need to enhance the existing
topology data..

I think the key agreement to get out of Logan's series is that P2P DMA
means:
 - The BAR will be backed by struct pages
 - Passing the CPU __iomem address of the BAR to the DMA API is
   valid and, long term, dma ops providers are expected to fail
   or return the right DMA address
 - Mapping BAR memory into userspace and back to the kernel via
   get_user_pages works transparently, and with the DMA API above
 - The dma ops provider must be able to tell if source memory is bar
   mapped and recover the pci device backing the mapping.

At least this is what we'd like in RDMA :)

FWIW, RDMA probably wouldn't want to use a p2mem device either, we
already have APIs that map BAR memory to user space, and would like to
keep using them. An 'enable P2P for BAR' helper function sounds better
to me.

Jason


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-13 Thread Jason Gunthorpe
On Thu, Apr 13, 2017 at 06:26:31PM -0500, Bjorn Helgaas wrote:
> > Ah, thanks for the tip! On my system, this translation returns the same
> > address so it was not necessary. And, yes, that means this would have to
> > find its way into the dma mapping routine somehow. This means we'll
> > eventually need a way to look-up the p2pmem device from the struct page.
> > Which means we will likely need a new flag bit in the struct page or
> > something. The big difficulty I see is testing. Do you know what
> > architectures or in what circumstances are these translations used?
> 
> Any caller of pci_add_resource_offset() uses CPU addresses different from
> the PCI bus addresses (unless the offset is zero, of course).  All ACPI
> platforms also support this translation (see "translation_offset"), though
> in most x86 systems the offset is zero.  I'm aware of one x86 system that
> was tested with a non-zero offset but I don't think it was shipped that
> way.

I'd suggest just detecting if there is any translation in bus
addresses anywhere and just hard disabling P2P on such systems.

On modern hardware with 64 bit BARs there is very little reason to
have translation, so I think this is a legacy feature.
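
A sketch of such a check, using the standard pcibios_resource_to_bus()
helper; where exactly P2P setup would call this is left open:

/* True if the host bridge translates this BAR, i.e. the CPU address
 * and the PCI bus address of the resource differ. */
static bool pci_bar_is_translated(struct pci_dev *pdev, int bar)
{
        struct pci_bus_region region;
        struct resource *res = &pdev->resource[bar];

        pcibios_resource_to_bus(pdev->bus, &region, res);
        return region.start != res->start;
}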

Jason


Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem

2017-04-06 Thread Jason Gunthorpe
On Thu, Apr 06, 2017 at 08:33:38AM +0300, Sagi Grimberg wrote:
> 
> >>Note that the nvme completion queues are still on the host memory, so
> >>this means we have lost the ordering between data and completions as
> >>they go to different pcie targets.
> >
> >Hmm, in this simple up/down case with a switch, I think it might
> >actually be OK.
> >
> >Transactions might not complete at the NVMe device before the CPU
> >processes the RDMA completion, however due to the PCI-E ordering rules
> >new TLPs directed to the NVMe will complete after the RMDA TLPs and
> >thus observe the new data. (eg order preserving)
> >
> >It would be very hard to use P2P if fabric ordering is not preserved..
> 
> I think it still can race if the p2p device is connected with more than
> a single port to the switch.
> 
> Say it's connected via 2 legs, the bar is accessed from leg A and the
> data from the disk comes via leg B. In this case, the data is heading
> towards the p2p device via leg B (might be congested), the completion
> goes directly to the RC, and then the host issues a read from the
> bar via leg A. I don't understand what can guarantee ordering here.

Right, this is why I qualified my statement with 'simple up/down case'

Make it any more complex and it clearly stops working sanely, but I
wouldn't worry about unusual PCI-E fabrics at this point..

> Stephen told me that this still guarantees ordering, but I honestly
> can't understand how, perhaps someone can explain to me in a simple
> way that I can understand.

AFAIK PCI-E ordering is explicitly per link, so things that need order
must always traverse the same link.

Jason


Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem

2017-04-04 Thread Jason Gunthorpe
On Tue, Apr 04, 2017 at 01:59:26PM +0300, Sagi Grimberg wrote:
> Note that the nvme completion queues are still on the host memory, so
> this means we have lost the ordering between data and completions as
> they go to different pcie targets.

Hmm, in this simple up/down case with a switch, I think it might
actually be OK.

Transactions might not complete at the NVMe device before the CPU
processes the RDMA completion, however due to the PCI-E ordering rules
new TLPs directed to the NVMe will complete after the RDMA TLPs and
thus observe the new data. (eg order preserving)

It would be very hard to use P2P if fabric ordering is not preserved..

Jason


Re: [PATCH v2 01/16] chardev: add helper function to register char devs with a struct device

2017-02-27 Thread Jason Gunthorpe
On Sun, Feb 26, 2017 at 10:21:25AM -0800, Dan Williams wrote:
> > + * cdev_device_add() adds the char device represented by @cdev to the system,
> > + * just as cdev_add does. It then adds @dev to the system using device_add
> > + * The dev_t for the char device will be taken from the struct device which
> > + * needs to be initialized first. This helper function correctly takes a
> > + * reference to the parent device so the parent will not get released until
> > + * all references to the cdev are released.
> > + *
> > + * This function should be used whenever the struct cdev and the
> > + * struct device are members of the same structure whose lifetime is
> > + * managed by the struct device.
> > + */
> 
> Perhaps add a note here that userspace may have invoked file
> operations between cdev_add() and a failing device_add(), so
> additional cleanup beyond put_device() (like mmap invalidation) might
> be needed. That can be a later follow-on patch.

Yes please, that is way too subtle. Suggest:

NOTE: Callers must assume that userspace was able to open the cdev and
can call cdev fops callbacks at any time, even if this function fails.

I would also add a note to cdev_device_del:

NOTE: This guarantees that associated sysfs callbacks are not running
or runnable, however any open cdevs will remain and their fops remain
callable even after this returns.

Since I have seen a lot of confusion on that point as well..

Jason


Re: [PATCH v2 06/16] tpm-chip: utilize new cdev_device_add helper function

2017-02-27 Thread Jason Gunthorpe
On Sat, Feb 25, 2017 at 11:38:07PM -0700, Logan Gunthorpe wrote:
> Replace the open coded registration of the cdev and dev with the
> new device_add_cdev() helper. The helper replaces a common pattern by
> taking the proper reference against the parent device and adding both
> the cdev and the device.
> 
> Signed-off-by: Logan Gunthorpe 
>  drivers/char/tpm/tpm-chip.c | 19 +++
>  1 file changed, 3 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/char/tpm/tpm-chip.c b/drivers/char/tpm/tpm-chip.c
> index a77262d..933f96e 100644
> +++ b/drivers/char/tpm/tpm-chip.c
> @@ -187,7 +187,6 @@ struct tpm_chip *tpm_chip_alloc(struct device *dev,
>  
>   cdev_init(&chip->cdev, &tpm_fops);
>   chip->cdev.owner = THIS_MODULE;
> - chip->cdev.kobj.parent = &chip->dev.kobj;
>  
>   return chip;
>  
> @@ -230,27 +229,16 @@ static int tpm_add_char_device(struct tpm_chip *chip)
>  {
>   int rc;
>  
> - rc = cdev_add(&chip->cdev, chip->dev.devt, 1);
> + rc = cdev_device_add(&chip->cdev, &chip->dev);
>   if (rc) {
>   dev_err(&chip->dev,
> - "unable to cdev_add() %s, major %d, minor %d, err=%d\n",
> + "unable to cdev_device_add() %s, major %d, minor %d, 
> err=%d\n",
>   dev_name(&chip->dev), MAJOR(chip->dev.devt),
>   MINOR(chip->dev.devt), rc);
>  
>   return rc;
>   }
>  
> - rc = device_add(&chip->dev);
> - if (rc) {
> - dev_err(&chip->dev,
> - "unable to device_register() %s, major %d, minor %d, 
> err=%d\n",
> - dev_name(&chip->dev), MAJOR(chip->dev.devt),
> - MINOR(chip->dev.devt), rc);
> -
> - cdev_del(&chip->cdev);
> -     return rc;
> - }

This will collide with the patch I sent:

https://patchwork.kernel.org/patch/9589001/

Jarkko: The resolution will be to continue to call tpm_del_char_device
from the cdev_device_add error path and drop all calls to cdev_del

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH] device-dax: fix cdev leak

2017-02-23 Thread Jason Gunthorpe
On Thu, Feb 23, 2017 at 11:22:03AM -0800, Dan Williams wrote:
> If device_add() fails, cleanup the cdev. Otherwise, we leak a kobj_map()
> with a stale device number.
> 
> Fixes: ba09c01d2fa8 ("dax: convert to the cdev api")
> Cc: 
> Cc: Logan Gunthorpe 
> Reported-by: Jason Gunthorpe 
> Signed-off-by: Dan Williams 
>  drivers/dax/dax.c |1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/dax/dax.c b/drivers/dax/dax.c
> index ed758b74ddf0..0f8008dd0b0c 100644
> +++ b/drivers/dax/dax.c
> @@ -724,6 +724,7 @@ struct dax_dev *devm_create_dax_dev(struct dax_region *dax_region,
>   dev_set_name(dev, "dax%d.%d", dax_region->id, dax_dev->id);
>   rc = device_add(dev);
>   if (rc) {
> + cdev_del(&dax_dev->cdev);

This probably should call into unregister_dax_dev and just skip the
device_unregister part.

Once cdev_add returns it is possible for a mmap to have been created,
so cleanup after that point has to go through all the other
unregister_dax_dev steps.

Jason


Re: [PATCH 07/14] infiniband: utilize new device_add_cdev helper function

2017-02-21 Thread Jason Gunthorpe
On Tue, Feb 21, 2017 at 04:54:05PM -0700, Logan Gunthorpe wrote:

> Is that true? Once device_register or device_add is called then you need
> to use put_device.

General rule is once kref_init has been called then you should use
kref_put and not kfree.

device_initialize ultimately calls kref_init..

Reasoning that you don't 'need' to use put_device until device_add is
just too hard.

For instance, there is still another bug in ib_ucm_add_one:

if (device_add(&ucm_dev->dev))
goto err_cdev;

if (device_create_file(&ucm_dev->dev, &dev_attr_ibdev))
goto err_dev;
[]
err_dev:
device_unregister(&ucm_dev->dev);
err_cdev:
ib_ucm_cdev_del(ucm_dev);
err:
kfree(ucm_dev);
return;
}

If we go down the err_dev path then device_unregister will probably
kfree ucm_dev - the argument is not guarenteed valid after
device_unregister returns - this is what makes it different from
device_del.

The simplest fix is to change the unwind into:

err_dev:
device_del(&ucm_dev->dev);
err_cdev:
ib_ucm_cdev_del(ucm_dev);
err:
put_device(&ucm_dev->dev);
return;
}

And the only way to keep our idiomatic goto unwind working is if all
'goto errs' can call put_device - which requires the device_initialize
be done before any errors are possible.

> In fact device_add is what does the first get_device so this doesn't
> add up to me.

Not quite, kref_init creates a kref with a count of 1 - eg the caller
owns the ref, and that ref must be put to trigger kref release.

Thus good kref usage should always pair a kref_put with the kref_init.

The get_device at the start of device_add pairs with the put_device at
the end of device_del and the kref_init pairs with the put_device at
the end of device_unregister. (notice that device_unregister ends up
calling put_device twice in a row..)
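
In code form, the pairing described above (a sketch of the driver-core
contract; struct my_dev is a stand-in):

static int example_register(struct my_dev *md)
{
        device_initialize(&md->dev);    /* kref_init(): we own one ref */

        /* ... setup that can fail goes here; all errors may goto err ... */

        if (device_add(&md->dev))       /* takes its own ref internally */
                goto err;
        return 0;

err:
        put_device(&md->dev);           /* drop the initial ref; the
                                           release() callback frees md */
        return -ENODEV;
}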

I'll send you a clean patch for your v2.

> I know the DAX code only uses put_device after device_add.

Looks to me like that code fails to call cdev_del if device_add fails?

This approach is problematic because it is trying do the ida removals
in the release function. That is not necessary and has the side effect
of making the release function uncallable until too late in the
process.

Jason


Re: [PATCH 07/14] infiniband: utilize new device_add_cdev helper function

2017-02-21 Thread Jason Gunthorpe
On Mon, Feb 20, 2017 at 10:00:46PM -0700, Logan Gunthorpe wrote:
> This patch updates core/ucm.c which didn't originally use the
> cdev.kobj.parent with it's parent device. I did not look heavily into
> whether this was a bug or not, but it seems likely to me there would
> be a use before free.

Hum, it is probably safe - ib_ucm_remove_one can only happen on module
unload and the cdev holds the module lock while open.

Even so device_add_cdev should be used anyhow.

> I also took a look at core/uverbs_main.c, core/user_mad.c and
> hw/hfi1/device.c which utilize cdev.kobj.parent but because the
> infiniband core seems to use kobjs internally instead of struct devices
> they could not be converted to use the new helper API and still
> directly manipulate the internals of the kobj.

Yes, and hfi1 had the same use-after-free bug when it was first
submitted as well. IMHO cdev should have a helper entry for this style
of use case as well.

> diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c
> index e0a995b..38ea316 100644
> +++ b/drivers/infiniband/core/ucm.c
> @@ -1283,18 +1283,20 @@ static void ib_ucm_add_one(struct ib_device *device)
>   set_bit(devnum, dev_map);
>   }
>  
> + device_initialize(&ucm_dev->dev);

Ah, this needs more help. Once device_initialize is called put_device
should be the error-unwind, can you use something more like this?

From ac7a35ea98066c9a9e3f3e54557c0ccda44b0ffc Mon Sep 17 00:00:00 2001
From: Jason Gunthorpe 
Date: Tue, 21 Feb 2017 12:07:55 -0700
Subject: [PATCH] infiniband: utilize new device_add_cdev helper

The use after free is not triggerable here because the cdev holds
the module lock and the only device_unregister is only triggered by
module unload, however make the change for consistency.

To make this work the cdev_del needs to move out of the struct device
release function.

Signed-off-by: Jason Gunthorpe 
---
 drivers/infiniband/core/ucm.c | 25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c
index 7713ef089c3ccc..acda8d941381f3 100644
--- a/drivers/infiniband/core/ucm.c
+++ b/drivers/infiniband/core/ucm.c
@@ -1202,12 +1202,16 @@ static void ib_ucm_release_dev(struct device *dev)
struct ib_ucm_device *ucm_dev;
 
ucm_dev = container_of(dev, struct ib_ucm_device, dev);
+   kfree(ucm_dev);
+}
+
+static void ib_ucm_cdev_del(struct ib_ucm_device *ucm_dev)
+{
cdev_del(&ucm_dev->cdev);
if (ucm_dev->devnum < IB_UCM_MAX_DEVICES)
clear_bit(ucm_dev->devnum, dev_map);
else
clear_bit(ucm_dev->devnum - IB_UCM_MAX_DEVICES, overflow_map);
-   kfree(ucm_dev);
 }
 
 static const struct file_operations ucm_fops = {
@@ -1263,6 +1267,7 @@ static void ib_ucm_add_one(struct ib_device *device)
if (!ucm_dev)
return;
 
+   device_initialize(&ucm_dev->dev);
ucm_dev->ib_dev = device;
 
devnum = find_first_zero_bit(dev_map, IB_UCM_MAX_DEVICES);
@@ -1280,18 +1285,19 @@ static void ib_ucm_add_one(struct ib_device *device)
set_bit(devnum, dev_map);
}
 
+   ucm_dev->dev.devt = base;
+   ucm_dev->dev.release = ib_ucm_release_dev;
+
cdev_init(&ucm_dev->cdev, &ucm_fops);
ucm_dev->cdev.owner = THIS_MODULE;
kobject_set_name(&ucm_dev->cdev.kobj, "ucm%d", ucm_dev->devnum);
-   if (cdev_add(&ucm_dev->cdev, base, 1))
+   if (device_add_cdev(&ucm_dev->dev, &ucm_dev->cdev))
goto err;
 
ucm_dev->dev.class = &cm_class;
ucm_dev->dev.parent = device->dma_device;
-   ucm_dev->dev.devt = ucm_dev->cdev.dev;
-   ucm_dev->dev.release = ib_ucm_release_dev;
dev_set_name(&ucm_dev->dev, "ucm%d", ucm_dev->devnum);
-   if (device_register(&ucm_dev->dev))
+   if (device_add(&ucm_dev->dev))
goto err_cdev;
 
if (device_create_file(&ucm_dev->dev, &dev_attr_ibdev))
@@ -1303,13 +1309,9 @@ static void ib_ucm_add_one(struct ib_device *device)
 err_dev:
device_unregister(&ucm_dev->dev);
 err_cdev:
-   cdev_del(&ucm_dev->cdev);
-   if (ucm_dev->devnum < IB_UCM_MAX_DEVICES)
-   clear_bit(devnum, dev_map);
-   else
-   clear_bit(devnum, overflow_map);
+   ib_ucm_cdev_del(ucm_dev);
 err:
-   kfree(ucm_dev);
+   put_device(&ucm_dev->dev);
return;
 }
 
@@ -1320,6 +1322,7 @@ static void ib_ucm_remove_one(struct ib_device *device, void *client_data)
if (!ucm_dev)
return;
 
+   ib_ucm_cdev_del(ucm_dev);
device_unregister(&ucm_dev->dev);
 }
 
-- 
2.7.4



Re: [PATCH 05/14] tpm-chip: utilize new device_add_cdev helper function

2017-02-21 Thread Jason Gunthorpe
On Mon, Feb 20, 2017 at 10:00:44PM -0700, Logan Gunthorpe wrote:
> Signed-off-by: Logan Gunthorpe 
>  drivers/char/tpm/tpm-chip.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2017-01-12 Thread Jason Gunthorpe
On Thu, Jan 12, 2017 at 10:11:29AM -0500, Jerome Glisse wrote:
> On Wed, Jan 11, 2017 at 10:54:39PM -0600, Stephen Bates wrote:
> > > What we want is for RDMA, O_DIRECT, etc to just work with special VMAs
> > > (ie. at least those backed with ZONE_DEVICE memory). Then
> > > GPU/NVME/DAX/whatever drivers can just hand these VMAs to userspace
> > > (using whatever interface is most appropriate) and userspace can do what
> > > it pleases with them. This makes _so_ much sense and actually largely
> > > already works today (as demonstrated by iopmem).

> So i say let solve the IOMMU issue first and let everyone use it in their
> own way with their device. I do not think we can share much more than
> that.

Solve it for the easy ZONE_DIRECT/etc case then.

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2017-01-06 Thread Jason Gunthorpe
On Fri, Jan 06, 2017 at 12:37:22PM -0500, Jerome Glisse wrote:
> On Fri, Jan 06, 2017 at 11:56:30AM -0500, Serguei Sagalovitch wrote:
> > On 2017-01-05 08:58 PM, Jerome Glisse wrote:
> > > On Thu, Jan 05, 2017 at 05:30:34PM -0700, Jason Gunthorpe wrote:
> > > > On Thu, Jan 05, 2017 at 06:23:52PM -0500, Jerome Glisse wrote:
> > > > 
> > > > > > I still don't understand what you driving at - you've said in both
> > > > > > cases a user VMA exists.
> > > > > In the former case no, there is no VMA directly but if you want one 
> > > > > than
> > > > > a device can provide one. But such VMA is useless as CPU access is not
> > > > > expected.
> > > > I disagree it is useless, the VMA is going to be necessary to support
> > > > upcoming things like CAPI, you need it to support O_DIRECT from the
> > > > filesystem, DPDK, etc. This is why I am opposed to any model that is
> > > > not VMA based for setting up RDMA - that is shorted sighted and does
> > > > not seem to reflect where the industry is going.
> > > > 
> > > > So focus on having VMA backed by actual physical memory that covers
> > > > your GPU objects and ask how do we wire up the '__user *' to the DMA
> > > > API in the best way so the DMA API still has enough information to
> > > > setup IOMMUs and whatnot.
> > > I am talking about 2 different thing. Existing hardware and API where you
> > > _do not_ have a vma and you do not need one. This is just
> > > existing stuff.

> > I do not understand why you assume that existing API doesn't  need one.
> > I would say that a lot of __existing__ user level API and their support in
> > kernel (especially outside of graphics domain) assumes that we have vma and
> > deal with __user * pointers.

+1

> Well i am thinking to GPUDirect here. Some of GPUDirect use case do not have
> vma (struct vm_area_struct) associated with them they directly apply to GPU
> object that aren't expose to CPU. Yes some use case have vma for share buffer.

Lets stop talking about GPU direct. Today we can't even make VMA
pointing at a PCI bar work properly in the kernel - lets start there
please. People can argue over other options once that is done.

> For HMM plan is to restrict to ODP and either to replace ODP with HMM or
> change ODP to not use get_user_pages_remote() but directly fetch
> informations from CPU page table. Everything else stay as it is. I posted
> patchset to replace ODP with HMM in the past.

Make a generic API for all of this and you'd have my vote..

IMHO, you must support basic pinning semantics - that is necessary to
support generic short lived DMA (eg filesystem, etc). That hardware
can clearly do that if it can support ODP.

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2017-01-05 Thread Jason Gunthorpe
On Thu, Jan 05, 2017 at 06:23:52PM -0500, Jerome Glisse wrote:

> > I still don't understand what you driving at - you've said in both
> > cases a user VMA exists.
> 
> In the former case no, there is no VMA directly but if you want one than
> a device can provide one. But such VMA is useless as CPU access is not
> expected.

I disagree it is useless, the VMA is going to be necessary to support
upcoming things like CAPI, you need it to support O_DIRECT from the
filesystem, DPDK, etc. This is why I am opposed to any model that is
not VMA based for setting up RDMA - that is shorted sighted and does
not seem to reflect where the industry is going.

So focus on having VMA backed by actual physical memory that covers
your GPU objects and ask how do we wire up the '__user *' to the DMA
API in the best way so the DMA API still has enough information to
setup IOMMUs and whatnot.

> What i was trying to get accross is that no matter what level you
> consider in the end you still need something at the DMA API level.
> And that the 2 different use case (device vma or regular vma) means
> 2 differents API for the device driver.

I agree we need new stuff at the DMA API level, but I am opposed to
the idea we need two API paths that the *driver* has to figure out.
That is fundamentally not what I want as a driver developer.

Give me a common API to convert '__user *' to a scatter list and pin
the pages. This needs to figure out your two cases. And Huge
Pages. And ZONE_DIRECT.. (a better get_user_pages)

Give me an API to take the scatter list and DMA map it, handling all
the stuff associated with peer-peer. (a better dma_map_sg)

Give me a notifier scheme to rework my scatter list when physical
pages need to change (mmu notifiers)

Use the scatter list memory to convey needed information from the
first step to the second.
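
Sketched as declarations, just to make the shape of the request
concrete - every name below is hypothetical:

/* 1) A better get_user_pages: pin a '__user *' range into a scatter
 *    list no matter what backs the VMA (pages, BAR memory, HMM). */
int mm_pin_user_range(void __user *addr, size_t len, struct sg_table *sgt);

/* 2) A better dma_map_sg: DMA map the pinned list for @dev, doing any
 *    peer-to-peer translation internally. */
int mm_dma_map_pinned(struct device *dev, struct sg_table *sgt,
                      enum dma_data_direction dir);

/* 3) Notifier invoked when the physical pages behind the range change
 *    and the scatter list must be rebuilt (mmu-notifier style). */
struct mm_pin_notifier {
        void (*invalidate)(struct mm_pin_notifier *n,
                           unsigned long start, unsigned long end);
};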

Do not bother the driver with distinctions on what kind of memory is
behind that VMA. Don't ask me to use get_user_pages or
gpu_get_user_pages, do not ask me to use dma_map_sg or
dma_map_sg_peer_direct. The Driver Doesn't Need To Know.

IMHO this is why GPU direct is not mergable - it creates a crazy
parallel mini-mm subsystem inside RDMA and uses that to connect to a
GPU driver, everything is expected to have parallel paths for GPU
direct and normal MM. No good at all.

> > So, how do you identify these GPU objects? How do you expect RDMA
> > convert them to scatter lists? How will ODP work?
> 
> No ODP on those. If you want vma, the GPU device driver can provide

You said you needed invalidate, that has to be done via ODP.

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2017-01-05 Thread Jason Gunthorpe
On Thu, Jan 05, 2017 at 03:19:36PM -0500, Jerome Glisse wrote:

> > Always having a VMA changes the discussion - the question is how to
> > create a VMA that reprensents IO device memory, and how do DMA
> > consumers extract the correct information from that VMA to pass to the
> > kernel DMA API so it can setup peer-peer DMA.
> 
> Well my point is that it can't be. In HMM case inside a single VMA
> you
[..]

> In the GPUDirect case the idea is that you have a specific device vma
> that you map for peer to peer.

[..]

I still don't understand what you driving at - you've said in both
cases a user VMA exists.

From my perspective in RDMA, all I want is a core kernel flow to
convert a '__user *' into a scatter list of DMA addresses, that works no
matter what is backing that VMA, be it HMM, a 'hidden' GPU object, or
struct page memory.

A '__user *' pointer is the only way to setup a RDMA MR, and I see no
reason to have another API at this time.

The details of how to translate to a scatter list are an MM subject,
and the MM folks need to get it right.

I just don't care if that routine works at a page level, or a whole
VMA level, or some combination of both, that is up to the MM team to
figure out :)

> a page level. Expectation here is that the GPU userspace expose a special
> API to allow RDMA to directly happen on GPU object allocated through
> GPU specific API (ie it is not regular memory and it is not accessible
> by CPU).

So, how do you identify these GPU objects? How do you expect RDMA
convert them to scatter lists? How will ODP work?

> > We have MMU notifiers to handle this today in RDMA. Async RDMA MR
> > Invalidate like you see in the above out of tree patches is totally
> > crazy and shouldn't be in mainline. Use ODP capable RDMA hardware.
> 
> Well there is still a large base of hardware that do not have such
> feature and some people would like to be able to keep using those.

Hopefully someone will figure out how to do that without the crazy
async MR invalidation.

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2017-01-05 Thread Jason Gunthorpe
On Thu, Jan 05, 2017 at 02:54:24PM -0500, Jerome Glisse wrote:

> Mellanox and NVidia support peer to peer with what they market a
> GPUDirect. It only works without IOMMU. It is probably not upstream :
> 
> https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg21402.html
> 
> I thought it was but it seems it require an out of tree driver to work.

Right, it is out of tree and not under consideration for mainline.

> Wether there is a vma or not isn't important to the issue anyway. If
> you want to enforce VMA rule for RDMA it is an RDMA specific discussion
> in which i don't want to be involve, it is not my turf :)

Always having a VMA changes the discussion - the question is how to
create a VMA that represents IO device memory, and how do DMA
consumers extract the correct information from that VMA to pass to the
kernel DMA API so it can setup peer-peer DMA.

> What matter is the back channel API between peer-to-peer device. Like
> the above patchset points out for GPU we need to be able to invalidate
> a mapping at any point in time. Pining is not something we want to
> live with.

We have MMU notifiers to handle this today in RDMA. Async RDMA MR
Invalidate like you see in the above out of tree patches is totally
crazy and shouldn't be in mainline. Use ODP capable RDMA hardware.

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2017-01-05 Thread Jason Gunthorpe
On Thu, Jan 05, 2017 at 01:39:29PM -0500, Jerome Glisse wrote:

>   1) peer-to-peer because of userspace specific API like NVidia GPU
> direct (AMD is pushing its own similar API i just can't remember
> marketing name). This does not happen through a vma, this happens
> through specific device driver call going through device specific
> ioctl on both side (GPU and RDMA). So both kernel driver are aware
> of each others.

Today you can only do user-initiated RDMA operations in conjunction
with a VMA.

We'd need a really big and strong reason to create an entirely new
non-VMA based memory handle scheme for RDMA.

So my inclination is to just completely push back on this idea. You
need a VMA to do RDMA.

GPUs need to create VMAs for the memory they want to RDMA from, even
if the VMA handle just causes SIGBUS for any CPU access.
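
A minimal sketch of such a VMA, against the modern vm_operations_struct
(the gpu_obj_ names are stand-ins):

static vm_fault_t gpu_obj_fault(struct vm_fault *vmf)
{
        return VM_FAULT_SIGBUS;         /* no CPU access allowed */
}

static const struct vm_operations_struct gpu_obj_vm_ops = {
        .fault = gpu_obj_fault,
};

static int gpu_obj_mmap(struct file *file, struct vm_area_struct *vma)
{
        /* mmap succeeds so the '__user *' handle exists, but any CPU
         * touch faults with SIGBUS. */
        vma->vm_ops = &gpu_obj_vm_ops;
        return 0;
}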

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-06 Thread Jason Gunthorpe
On Tue, Dec 06, 2016 at 09:51:15AM -0700, Logan Gunthorpe wrote:
> Hey,
> 
> On 06/12/16 09:38 AM, Jason Gunthorpe wrote:
> >>> I'm not opposed to mapping /dev/nvmeX.  However, the lookup is trivial
> >>> to accomplish in sysfs through /sys/dev/char to find the sysfs path of the
> >>> device-dax instance under the nvme device, or if you already have the nvme
> >>> sysfs path the dax instance(s) will appear under the "dax" sub-directory.
> >>
> >> Personally I think mapping the dax resource in the sysfs tree is a nice
> >> way to do this and a bit more intuitive than mapping a /dev/nvmeX.
> > 
> > It is still not at all clear to me what userpsace is supposed to do
> > with this on nvme.. How is the CMB usable from userspace?
> 
> The flow is pretty simple. For example to write to NVMe from an RDMA device:
> 
> 1) Obtain a chunk of the CMB to use as a buffer(either by mmaping
> /dev/nvmx, the device dax char device or through a block layer interface
> (which sounds like a good suggestion from Christoph, but I'm not really
> sure how it would look).

Okay, so clearly this needs a kernel side NVMe specific allocator
and locking so users don't step on each other..

Or as Christoph says some kind of general mechanism to get these
bounce buffers..

> 2) Create an MR with the buffer and use an RDMA function to fill it with
> data from a remote host. This will cause the RDMA hardware to write
> directly to the memory in the NVMe card.
> 
> 3) Using O_DIRECT, write the buffer to a file on the NVMe filesystem.
> When the address reaches hardware the NVMe will recognize it as local
> memory and copy it directly there.

Ah, I see.

As a first draft I'd stick with some kind of API built into the
/dev/nvmeX that backs the filesystem. The user app would fstat the
target file, open /dev/block/MAJOR(st_dev):MINOR(st_dev), do some
ioctl to get a CMB mmap, and then proceed from there..
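
In userspace that might look like the sketch below; NVME_IOC_MAP_CMB and
struct nvme_cmb_map are made-up names for the ioctl being proposed:

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

struct nvme_cmb_map { uint64_t offset; uint64_t len; };  /* hypothetical */

static void *map_cmb_for(int fd, struct nvme_cmb_map *map)
{
        struct stat st;
        char path[64];
        int cfd;

        if (fstat(fd, &st) < 0)
                return NULL;
        snprintf(path, sizeof(path), "/dev/block/%u:%u",
                 major(st.st_dev), minor(st.st_dev));
        cfd = open(path, O_RDWR);
        if (cfd < 0 || ioctl(cfd, NVME_IOC_MAP_CMB, map) < 0)
                return NULL;
        /* The returned region lives in the NVMe CMB BAR; register it as
         * an RDMA MR, fill it remotely, then O_DIRECT write it to fd. */
        return mmap(NULL, map->len, PROT_READ | PROT_WRITE, MAP_SHARED,
                    cfd, map->offset);
}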

When that is all working kernel-side, it would make sense to look at a
more general mechanism that could be used unprivileged??

> Thus we are able to transfer data to any file on an NVMe device without
> going through system memory. This has benefits on systems with lots of
> activity in system memory but step 3 is likely to be slowish due to the
> need to pin/unpin the memory for every transaction.

This is similar to the GPU issues too.. On NVMe you don't need to pin
the pages, you just need to lock that VMA so it doesn't get freed from
the NVMe CMB allocator while the IO is running...

Probably in the long run the get_user_pages is going to have to be
pushed down into drivers.. Future MMU coherent IO hardware also does
not need the pinning or other overheads.

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-06 Thread Jason Gunthorpe
> > I'm not opposed to mapping /dev/nvmeX.  However, the lookup is trivial
> > to accomplish in sysfs through /sys/dev/char to find the sysfs path of the
> > device-dax instance under the nvme device, or if you already have the nvme
> > sysfs path the dax instance(s) will appear under the "dax" sub-directory.
> 
> Personally I think mapping the dax resource in the sysfs tree is a nice
> way to do this and a bit more intuitive than mapping a /dev/nvmeX.

It is still not at all clear to me what userspace is supposed to do
with this on nvme.. How is the CMB usable from userspace?

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-05 Thread Jason Gunthorpe
On Mon, Dec 05, 2016 at 12:27:20PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 05/12/16 12:14 PM, Jason Gunthorpe wrote:
> >But CMB sounds much more like the GPU case where there is a
> >specialized allocator handing out the BAR to consumers, so I'm not
> >sure a general purpose chardev makes a lot of sense?
> 
> I don't think it will ever need to be as complicated as the GPU case. There
> will probably only ever be a relatively small amount of memory behind the
> CMB and really the only users are those doing P2P work. Thus the specialized
> allocator could be pretty simple and I expect it would be fine to just
> return -ENOMEM if there is not enough memory.

NVMe might have to deal with pci-e hot-unplug, which is a similar
problem-class to the GPU case..

In any event the allocator still needs to track which regions are in
use and be able to hook 'free' from userspace. That does suggest it
should be integrated into the nvme driver and not a bolt on driver..
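
As a sketch of how small that could be if it lives in the driver, the
kernel's genalloc could track the BAR (the nvme_ name is made up):

#include <linux/genalloc.h>

/* Create one pool covering the CMB at probe time; chunks are handed
 * out with gen_pool_alloc() and returned from the vma close hook. */
static int nvme_cmb_pool_init(struct gen_pool **pool, void __iomem *cmb,
                              phys_addr_t phys, size_t len)
{
        *pool = gen_pool_create(PAGE_SHIFT, NUMA_NO_NODE);
        if (!*pool)
                return -ENOMEM;
        return gen_pool_add_virt(*pool, (unsigned long)cmb, phys, len,
                                 NUMA_NO_NODE);
}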

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-05 Thread Jason Gunthorpe
On Mon, Dec 05, 2016 at 10:48:58AM -0800, Dan Williams wrote:
> On Mon, Dec 5, 2016 at 10:39 AM, Logan Gunthorpe  wrote:
> > On 05/12/16 11:08 AM, Dan Williams wrote:
> >>
> >> I've already recommended that iopmem not be a block device and instead
> >> be a device-dax instance. I also don't think it should claim the PCI
> >> ID, rather the driver that wants to map one of its bars this way can
> >> register the memory region with the device-dax core.
> >>
> >> I'm not sure there are enough device drivers that want to do this to
> >> have it be a generic /sys/.../resource_dmableX capability. It still
> >> seems to be an exotic one-off type of configuration.
> >
> >
> > Yes, this is essentially my thinking. Except I think the userspace interface
> > should really depend on the device itself. Device dax is a good  choice for
> > many and I agree the block device approach wouldn't be ideal.
> >
> > Specifically for NVME CMB: I think it would make a lot of sense to just hand
> > out these mappings with an mmap call on /dev/nvmeX. I expect CMB buffers
> > would be volatile and thus you wouldn't need to keep track of where in the
> > BAR the region came from. Thus, the mmap call would just be an allocator
> > from BAR memory. If device-dax were used, userspace would need to lookup
> > which device-dax instance corresponds to which nvme drive.
> 
> I'm not opposed to mapping /dev/nvmeX.  However, the lookup is trivial
> to accomplish in sysfs through /sys/dev/char to find the sysfs path
> of

But CMB sounds much more like the GPU case where there is a
specialized allocator handing out the BAR to consumers, so I'm not
sure a general purpose chardev makes a lot of sense?

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-05 Thread Jason Gunthorpe
On Mon, Dec 05, 2016 at 09:40:38AM -0800, Dan Williams wrote:

> > If it is kernel only with physical addresess we don't need a uAPI for
> > it, so I'm not sure #1 is at all related to iopmem.
> >
> > Most people who want #1 probably can just mmap
> > /sys/../pci/../resourceX to get a user handle to it, or pass around
> > __iomem pointers in the kernel. This has been asked for before with
> > RDMA.
> >
> > I'm still not really clear what iopmem is for, or why DAX should ever
> > be involved in this..
> 
> Right, by default remap_pfn_range() does not establish DMA capable
> mappings. You can think of iopmem as remap_pfn_range() converted to
> use devm_memremap_pages(). Given the extra constraints of
> devm_memremap_pages() it seems reasonable to have those DMA capable
> mappings be optionally established via a separate driver.

Except the iopmem driver claims the PCI ID, and presents a block
interface which is really *NOT* what people who have asked for this in
the past have wanted. IIRC it was embedded stuff eg RDMA video
directly out of a capture card or a similar kind of thinking.

It is a good point about devm_memremap_pages limitations, but maybe
that just says to create a /sys/.../resource_dmableX ?

Or is there some reason why people want a filesystem on top of BAR
memory? That does not seem to have been covered yet..

Jason

