Re: question about page tables in DAX/FS/PMEM case

2019-02-21 Thread Jerome Glisse
On Thu, Feb 21, 2019 at 02:58:27PM -0800, Larry Bassel wrote:
> [adding linux-mm]
> 
> On 21 Feb 19 15:41, Jerome Glisse wrote:
> > On Wed, Feb 20, 2019 at 03:06:22PM -0800, Larry Bassel wrote:
> > > I'm working on sharing page tables in the DAX/XFS/PMEM/PMD case.
> > > 
> > > If multiple processes would use the identical page of PMDs corresponding
> > > to a 1 GiB address range of DAX/XFS/PMEM/PMDs, presumably one can instead
> > > of populating a new PUD, just atomically increment a refcount and point
> > > to the same PUD in the next level above.
> 
> Thanks for your feedback. Some comments/clarification below.
> 
> > 
> > I think page table sharing was discussed several times in the past and
> > the complexity involved versus the benefit was not clear. For 1GB
> > of virtual address you need:
> > #pte pages = 1G/(512 * 2^12)   = 512 pte pages
> > #pmd pages = 1G/(512 * 512 * 2^12) = 1   pmd page
> > 
> > So if we were to share the pmd directory page we would be saving a
> > total of 513 pages for every page table, or ~2MB. This goes up with
> > the number of processes that map the same range, i.e. if 10 processes map
> > the same range and share the same pmd then you are saving 9 * 2MB =
> > ~18MB of memory. This seems like a relatively modest saving.
> 
> The file blocksize = page size in what I am working on would
> be 2 MiB (sharing puds/pages of pmds), I'm not trying to
> support sharing pmds/pages of ptes. And yes, the savings in this
> case is actually even less than in your example (but see my example below).
> 
> > 
> > AFAIK there is no hardware benefit from sharing the page table
> > directory between different page tables. So the only benefit is the
> > amount of memory we save.
> 
> Yes, in our use case (high end Oracle database using DAX/XFS/PMEM/PMD)
> the main benefit would be memory savings:
> 
> A future system might have 6 TiB of PMEM on it and
> there might be 10,000 processes each mapping all of this 6 TiB.
> Here the savings would be approximately
> (6 TiB / 2 MiB) * 8 bytes (page table entry size) * 10,000 = ~240 GiB
> (and these page tables themselves would be in non-PMEM (ordinary RAM)).

Damn, you have a lot of processes; that must mean many cores. I want one of those :)

[...]

> > > If the process later munmaps this file or exits but there are still
> > > other users of the shared page of PMDs, I would need to
> > > detect that this has happened and act accordingly (#3 above)
> > > 
> > > Where will these page table entries be torn down?
> > > In the same code where any other page table is torn down?
> > > If this is the case, what would the cleanest way of telling that these
> > > page tables (PMDs, etc.) correspond to a DAX/FS/PMEM mapping
> > > (look at the physical address pointed to?) so that
> > > I could do the right thing here.
> > > 
> > > I understand that I may have missed something obvious here.
> > > 
> > 
> > There are many issues; here are the ones I can think of:
> > - finding a pmd/pud to share: you need to walk the reverse mapping
> >   of the range you are mapping to find whether any process or other
> >   virtual address already has a pud or pmd you can reuse. This can
> >   take more time than allocating page directory pages.
> > - if one process munmaps some portion of a shared pud you need to
> >   break the sharing; this means that munmap (or mremap) would need
> >   to handle this page table directory sharing case first
> > - many code paths in the kernel might need updates to understand this
> >   shared page table thing (mprotect, userfaultfd, ...)
> > - the locking rules are bound to be painful
> > - this might not work on all architectures, as some architectures
> >   associate information with the page table directory and that can not
> >   always be shared (it would need to be enabled arch by arch)
> 
> Yes, some architectures don't support DAX at all (note again that
> I'm not trying to share non-DAX page table here).

DAX is irrelevant here: DAX is a property of the underlying filesystem
and for the most part the core mm is blissfully unaware of it. So all
of the above still applies.

> > 
> > The nice thing:
> > - unmapping for migration: when you unmap a shared pud/pmd you can
> >   decrement the mapcount by the shared pud/pmd count; this could
> >   speed up migration
> 
> A followup question: the kernel does sharing of page tables for hugetlbfs
> (also 2 MiB pages), why aren't the above issues relevant there as well
> (or are they but we support it anyhow)?

huge

Re: question about page tables in DAX/FS/PMEM case

2019-02-21 Thread Jerome Glisse
On Wed, Feb 20, 2019 at 03:06:22PM -0800, Larry Bassel wrote:
> I'm working on sharing page tables in the DAX/XFS/PMEM/PMD case.
> 
> If multiple processes would use the identical page of PMDs corresponding
> to a 1 GiB address range of DAX/XFS/PMEM/PMDs, presumably one can instead
> of populating a new PUD, just atomically increment a refcount and point
> to the same PUD in the next level above.

I think page table sharing was discussed several times in the past and
the complexity involved versus the benefit was not clear. For 1GB
of virtual address you need:
#pte pages = 1G/(512 * 2^12)   = 512 pte pages
#pmd pages = 1G/(512 * 512 * 2^12) = 1   pmd page

So if we were to share the pmd directory page we would be saving a
total of 513 pages for every page table, or ~2MB. This goes up with
the number of processes that map the same range, i.e. if 10 processes map
the same range and share the same pmd then you are saving 9 * 2MB =
~18MB of memory. This seems like a relatively modest saving.
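
For anyone who wants to double-check the arithmetic above, here is a trivial
userspace C program (not kernel code; it just redoes the same computation):

/* Back-of-the-envelope check of the numbers above; plain userspace C. */
#include <stdio.h>

int main(void)
{
	unsigned long va = 1UL << 30;				/* 1GB of virtual address */
	unsigned long pte_pages = va / (512UL << 12);		/* one pte page maps 512 * 4KB */
	unsigned long pmd_pages = va / ((512UL * 512UL) << 12);	/* one pmd page maps 512 pte pages */
	unsigned long total = pte_pages + pmd_pages;

	printf("pte pages = %lu, pmd pages = %lu, total = %lu (~%lu KB)\n",
	       pte_pages, pmd_pages, total, total * 4);
	return 0;
}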

AFAIK there is no hardware benefit from sharing the page table
directory between different page tables. So the only benefit is the
amount of memory we save.

See below for comments on complexity to achieve this.

> 
> i.e.
> 
> OLD:
> process 1:
> VA -> levels of page tables -> PUD1 -> page of PMDs1
> process 2:
> VA -> levels of page tables -> PUD2 -> page of PMDs2
> 
> NEW:
> process 1:
> VA -> levels of page tables -> PUD1 -> page of PMDs1
> process 2:
> VA -> levels of page tables -> PUD1 -> page of PMDs1 (refcount 2)
> 
> There are several cases to consider:
> 
> 1. New mapping
> OLD:
> make a new PUD, populate the associated page of PMDs
> (at least partially) with PMD entries.
> NEW:
> same
> 
> 2. Mapping by a process same (same VA->PA and size and protections, etc.)
> as one that already exists
> OLD:
> make a new PUD, populate the associated page of PMDs
> (at least partially) with PMD entries.
> NEW:
> use same PUD, increase refcount (potentially even if this mapping is private
> in which case there may eventually be a copy-on-write -- see #5 below)
> 
> 3. Unmapping of a mapping which is the same as that from another process
> OLD:
> destroy the process's copy of mapping, free PUD, etc.
> NEW:
> decrease refcount, only if now 0 do we destroy mapping, etc.
> 
> 4. Unmapping of a mapping which is unique (refcount 1)
> OLD:
> destroy the process's copy of mapping, free PUD, etc.
> NEW:
> same
> 
> 5. Mapping was private (but same as another process), process writes
> OLD:
> break the PMD into PTEs, destroy PMD mapping, free PUD, etc..
> NEW:
> decrease refcount, only if now 0 do we destroy mapping, etc.
> we still break the PMD into PTEs.
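
A rough sketch of the bookkeeping that cases #2 and #3 above imply; this is
purely illustrative, the struct and function names below are made up and this
is not an actual patch (the spirit is close to what mm/hugetlb.c already does
for its shared PMDs):

/* Hypothetical bookkeeping for a shared page of PMDs (illustrative only). */
struct shared_pmd_page {
	struct page	*pmd_page;	/* the page holding the 512 PMD entries */
	atomic_t	refcount;	/* number of PUD entries pointing at it */
};

/* Case #2: another process maps the identical range. */
static void shared_pmd_get(struct shared_pmd_page *spp)
{
	atomic_inc(&spp->refcount);
}

/* Case #3: one user unmaps; only the last user really tears the PMDs down. */
static bool shared_pmd_put(struct shared_pmd_page *spp)
{
	return atomic_dec_and_test(&spp->refcount);	/* true => really free it */
}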
> 
> If I have a mmap of a DAX/FS/PMEM file and I take
> a page (either pte or PMD sized) fault on access to this file,
> the page table(s) are set up in dax_iomap_fault() in fs/dax.c (correct?).

Not exactly: the page tables are allocated long before dax_iomap_fault()
gets called. They are allocated by handle_mm_fault() and its child
functions.
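
For reference, here is a heavily trimmed sketch of that allocation path,
loosely based on mm/memory.c; error handling and the huge-page fault routing
are mostly elided, so treat it as an approximation rather than the actual code:

static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
				    unsigned long address, unsigned int flags)
{
	struct vm_fault vmf = {
		.vma = vma,
		.address = address & PAGE_MASK,
		.flags = flags,
	};
	struct mm_struct *mm = vma->vm_mm;
	pgd_t *pgd = pgd_offset(mm, address);
	p4d_t *p4d;

	p4d = p4d_alloc(mm, pgd, address);		/* allocate the p4d table if missing */
	if (!p4d)
		return VM_FAULT_OOM;
	vmf.pud = pud_alloc(mm, p4d, address);		/* allocate the pud table if missing */
	if (!vmf.pud)
		return VM_FAULT_OOM;
	vmf.pmd = pmd_alloc(mm, vmf.pud, address);	/* allocate the page of pmds if missing */
	if (!vmf.pmd)
		return VM_FAULT_OOM;

	/* For PMD-sized DAX faults the real code branches into
	 * vma->vm_ops->huge_fault (which is what ends up calling
	 * dax_iomap_fault()) instead of handle_pte_fault(); elided here. */
	return handle_pte_fault(&vmf);
}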

> 
> If the process later munmaps this file or exits but there are still
> other users of the shared page of PMDs, I would need to
> detect that this has happened and act accordingly (#3 above)
> 
> Where will these page table entries be torn down?
> In the same code where any other page table is torn down?
> If this is the case, what would the cleanest way of telling that these
> page tables (PMDs, etc.) correspond to a DAX/FS/PMEM mapping
> (look at the physical address pointed to?) so that
> I could do the right thing here.
> 
> I understand that I may have missed something obvious here.
> 

There are many issues; here are the ones I can think of:
- finding a pmd/pud to share: you need to walk the reverse mapping
  of the range you are mapping to find whether any process or other
  virtual address already has a pud or pmd you can reuse. This can
  take more time than allocating page directory pages.
- if one process munmaps some portion of a shared pud you need to
  break the sharing; this means that munmap (or mremap) would need
  to handle this page table directory sharing case first
- many code paths in the kernel might need updates to understand this
  shared page table thing (mprotect, userfaultfd, ...)
- the locking rules are bound to be painful
- this might not work on all architectures, as some architectures
  associate information with the page table directory and that can not
  always be shared (it would need to be enabled arch by arch)

The nice thing:
- unmapping for migration: when you unmap a shared pud/pmd you can
  decrement the mapcount by the shared pud/pmd count; this could
  speed up migration

This is what I could think of off the top of my head but there might be
other things. I believe the question is really one of benefit versus cost,
and to me at least the complexity cost outweighs the benefit for now.
Kirill Shutemov proposed a rework of how we do page tables and this kind of
rework 

Re: [Lsf-pc] [LSF/MM TOPIC] The end of the DAX experiment

2019-02-14 Thread Jerome Glisse
On Thu, Feb 14, 2019 at 12:20:11PM -0800, Dan Williams wrote:
> On Thu, Feb 14, 2019 at 12:09 PM Matthew Wilcox  wrote:
> >
> > On Thu, Feb 14, 2019 at 11:31:24AM -0800, Dan Williams wrote:
> > > On Thu, Feb 14, 2019 at 11:10 AM Jerome Glisse  wrote:
> > > > I am just again working on my struct page mapping patchset as well as
> > > > the generic page write protection that sits on top. I hope to be able
> > > > to post the v2 in a couple of weeks. You can always look at my posting
> > > > from last year for more details.
> > >
> > > Yes, I have that in mind as one of the contenders. However, it's not
> > > clear to me that its a suitable fit for filesystem-reflink. Others
> > > have floated the 'page proxy' idea, so it would be good to discuss the
> > > merits of the general approaches.
> >
> > ... and my preferred option of putting pfn entries in the page cache.
> 
> Another option to include the discussion.
> 
> > Or is that what you meant by "page proxy"?
> 
> Page proxy would be an object that a filesystem could allocate to
> point back to a single physical 'struct page *'. The proxy would
> contain an override for page->index.

Note that generic page write protection has such an object, kind of like
the stable_node in KSM. You overwrite page->mapping to point to this
generic struct, which has a pointer to a set of callbacks so that whatever
is protecting the page can offer an API to break the protection (break
sharing here).

So instead of having struct proxy_page -> struct page you would have the
reverse: struct page -> struct proxy, and so you do not have to change
much in the filesystems besides removing the reliance on page->mapping,
which is what most of my patches are about.
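
To make the shape of that idea concrete, here is a purely illustrative sketch;
the names below are made up and this is not the actual patchset:

/* Illustrative only: what page->mapping could point to instead of the
 * address_space, so the owner of the protection can be called back. */
struct page_protect_ops {
	/* break the write protection / sharing on @page */
	int (*break_protect)(struct page *page, void *private);
};

struct page_protect {
	const struct page_protect_ops *ops;
	void *private;			/* owner data, e.g. a KSM-like stable node */
	struct address_space *mapping;	/* one of the real mappings being aliased */
};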

Cheers,
Jérôme


Re: [Lsf-pc] [LSF/MM TOPIC] The end of the DAX experiment

2019-02-14 Thread Jerome Glisse
On Thu, Feb 14, 2019 at 10:25:07AM -0800, Dan Williams wrote:
> On Thu, Feb 14, 2019 at 5:46 AM Michal Hocko  wrote:
> >
> > On Wed 06-02-19 13:12:59, Dan Williams wrote:
> > [...]
> > > * Userfaultfd for file-backed mappings and DAX
> >
> > I assume that other topics are meant to be FS track but this one is MM,
> > right?
> 
> Yes, but I think it is the lowest priority of all the noted sub-topics
> in this proposal. The DAX-reflink discussion, where a given
> physical-page may need to be mapped into multiple inodes at different
> offsets, might be more fruitful to have as a joint discussion with MM.

Note that my generic page write protection work can be used for that, i.e.
having a single page correspond to multiple different mappings, each with a
different offset within the mapping. While in my patchset I only solve
the mapping aliasing issue, the index can be solved in much the same way
because the same thinking applies. Namely, when you work on a file you
know the mapping and file offset and thus the index, and when you work on
the vma you know the mapping and the offset within the vma, which translates
to an offset within the file. There are only a few places that do not have
this information available and those do not care about it.
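
To illustrate the "you always know the index" point, the vma-side computation
is essentially what linear_page_index() already does for a linearly mapped vma
(the file side is just the byte offset shifted right by PAGE_SHIFT); the
wrapper name below is made up:

/* Page index of @addr within the file backing a linearly mapped @vma. */
static pgoff_t index_from_vma(struct vm_area_struct *vma, unsigned long addr)
{
	return vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
}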

I am just again working on my struct page mapping patchset as well as
the generic page write protection that sits on top. I hope to be able
to post the v2 in a couple of weeks. You can always look at my posting
from last year for more details.

Cheers,
Jérôme


Re: [PATCH 0/5] [v4] Allow persistent memory to be used like normal RAM

2019-01-25 Thread Jerome Glisse
On Thu, Jan 24, 2019 at 03:14:41PM -0800, Dave Hansen wrote:
> v3 spurred a bunch of really good discussion.  Thanks to everybody
> that made comments and suggestions!
> 
> I would still love some Acks on this from the folks on cc, even if it
> is on just the patch touching your area.
> 
> Note: these are based on commit d2f33c19644 in:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git 
> libnvdimm-pending
> 
> Changes since v3:
>  * Move HMM-related resource warning instead of removing it
>  * Use __request_resource() directly instead of devm.
>  * Create a separate DAX_PMEM Kconfig option, complete with help text
>  * Update patch descriptions and cover letter to give a better
>overview of use-cases and hardware where this might be useful.

This one looks good to me; I will give it a go on Monday to
test against nouveau and HMM.

Cheers,
Jérôme


Re: [PATCH 2/5] mm/resource: move HMM pr_debug() deeper into resource code

2019-01-25 Thread Jerome Glisse
On Thu, Jan 24, 2019 at 03:14:44PM -0800, Dave Hansen wrote:
> 
> From: Dave Hansen 
> 
> HMM consumes physical address space for its own use, even
> though nothing is mapped or accessible there.  It uses a
> special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY)
> to uniquely identify these areas.
> 
> When HMM consumes address space, it makes a best guess about
> what to consume.  However, it is possible that a future memory
> or device hotplug can collide with the reserved area.  In the
> case of these conflicts, there is an error message in
> register_memory_resource().
> 
> Later patches in this series move register_memory_resource()
> from using request_resource_conflict() to __request_region().
> Unfortunately, __request_region() does not return the conflict
> like the previous function did, which makes it impossible to
> check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting
> resource.
> 
> Instead of warning in register_memory_resource(), move the
> check into the core resource code itself (__request_region())
> where the conflicting resource _is_ available.  This has the
> added bonus of producing a warning in case of HMM conflicts
> with devices *or* RAM address space, as opposed to the RAM-
> only warnings that were there previously.
> 
> Signed-off-by: Dave Hansen 
> Cc: Dan Williams 
> Cc: Dave Jiang 
> Cc: Ross Zwisler 
> Cc: Vishal Verma 
> Cc: Tom Lendacky 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: linux-nvdimm@lists.01.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux...@kvack.org
> Cc: Huang Ying 
> Cc: Fengguang Wu 

Reviewed-by: Jerome Glisse 

> ---
> 
>  b/kernel/resource.c   |   10 ++
>  b/mm/memory_hotplug.c |5 -
>  2 files changed, 10 insertions(+), 5 deletions(-)
> 
> diff -puN kernel/resource.c~move-request_region-check kernel/resource.c
> --- a/kernel/resource.c~move-request_region-check 2019-01-24 
> 15:13:14.453199539 -0800
> +++ b/kernel/resource.c   2019-01-24 15:13:14.458199539 -0800
> @@ -1123,6 +1123,16 @@ struct resource * __request_region(struc
>   conflict = __request_resource(parent, res);
>   if (!conflict)
>   break;
> + /*
> +  * mm/hmm.c reserves physical addresses which then
> +  * become unavailable to other users.  Conflicts are
> +  * not expected.  Be verbose if one is encountered.
> +  */
> + if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
> + pr_debug("Resource conflict with unaddressable "
> +  "device memory at %#010llx !\n",
> +  (unsigned long long)start);
> + }
>   if (conflict != parent) {
>   if (!(conflict->flags & IORESOURCE_BUSY)) {
>   parent = conflict;
> diff -puN mm/memory_hotplug.c~move-request_region-check mm/memory_hotplug.c
> --- a/mm/memory_hotplug.c~move-request_region-check   2019-01-24 
> 15:13:14.455199539 -0800
> +++ b/mm/memory_hotplug.c 2019-01-24 15:13:14.459199539 -0800
> @@ -109,11 +109,6 @@ static struct resource *register_memory_
>   res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
>   conflict =  request_resource_conflict(&iomem_resource, res);
>   if (conflict) {
> - if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
> - pr_debug("Device unaddressable memory block "
> -  "memory hotplug at %#010llx !\n",
> -  (unsigned long long)start);
> - }
>   pr_debug("System RAM resource %pR cannot be added\n", res);
>   kfree(res);
>   return ERR_PTR(-EEXIST);
> _


Re: [PATCH 2/4] mm/memory-hotplug: allow memory resources to be children

2019-01-23 Thread Jerome Glisse
On Wed, Jan 23, 2019 at 12:03:54PM -0800, Dave Hansen wrote:
> On 1/16/19 3:38 PM, Jerome Glisse wrote:
> > So right now I would rather that we keep properly reporting this
> > hazard so that at least we know it failed because of that. This
> > also includes making sure that we can not register private memory
> > as a child of an un-busy resource that does exist but might not
> > yet have been claimed by its rightful owner.
> 
> I can definitely keep the warning in.  But, I don't think there's a
> chance of HMM registering a IORES_DESC_DEVICE_PRIVATE_MEMORY region as
> the child of another.  The region_intersects() check *should* find that:

Sounds fine to me (just keep the warning).

Cheers,
Jérôme

> 
> > for (; addr > size && addr >= iomem_resource.start; addr -= size) {
> > ret = region_intersects(addr, size, 0, IORES_DESC_NONE);
> > if (ret != REGION_DISJOINT)
> > continue;
> 


Re: [PATCH 2/4] mm/memory-hotplug: allow memory resources to be children

2019-01-18 Thread Jerome Glisse
On Fri, Jan 18, 2019 at 11:58:54AM -0800, Dave Hansen wrote:
> On 1/16/19 11:16 AM, Jerome Glisse wrote:
> >> We *could* also simply truncate the existing top-level
> >> "Persistent Memory" resource and take over the released address
> >> space.  But, this means that if we ever decide to hot-unplug the
> >> "RAM" and give it back, we need to recreate the original setup,
> >> which may mean going back to the BIOS tables.
> >>
> >> This should have no real effect on the existing collision
> >> detection because the areas that truly conflict should be marked
> >> IORESOURCE_BUSY.
> > 
> > Still, I am worried that this might allow device private to register
> > itself as a child of some un-busy resource, as this patch obviously
> > changes the behavior of register_memory_resource().
> > 
> > What about instead explicitly providing the parent resource to add_memory()
> > and then to register_memory_resource(), so that if it is provided as an
> > argument (!NULL) you can __request_region(arg_res, ...), otherwise
> > you keep the existing code intact?
> 
> We don't have the locking to do this, do we?  For instance, all the
> locking is done below register_memory_resource(), so any previous
> resource lookup is invalid by the time we get to register_memory_resource().

Yeah, you are right. Maybe just a bool then? bool as_child

Cheers,
Jérôme


Re: [PATCH 2/4] mm/memory-hotplug: allow memory resources to be children

2019-01-16 Thread Jerome Glisse
On Wed, Jan 16, 2019 at 03:01:39PM -0800, Dave Hansen wrote:
> On 1/16/19 11:16 AM, Jerome Glisse wrote:
> >> We also rework the old error message a bit since we do not get
> >> the conflicting entry back: only an indication that we *had* a
> >> conflict.
> > We should keep the device private check (moving it into __request_region),
> > as device private can try to register unused physical addresses (unused
> > at the time of device private registration) that can later block valid
> > physical addresses; the error message you are removing reports such an event.
> 
> If a resource can't support having a child, shouldn't it just be marked
> IORESOURCE_BUSY, rather than trying to somehow special-case
> IORES_DESC_DEVICE_PRIVATE_MEMORY behavior?

So the thing about IORES_DESC_DEVICE_PRIVATE_MEMORY is that it is
not necessarily linked to any real resource, i.e. it can just be a
random range of physical addresses that, at the time of registration,
had no resource.

Now you can later hotplug some memory that conflicts with
this IORES_DESC_DEVICE_PRIVATE_MEMORY, and if that happens we want
to tell that to the user, i.e.:
"Sorry, we registered some fake memory at a fake physical address
 and now you have hotplugged something that conflicts with that."


Why no existing resource? Well, it depends on the platform. In some
cases memory for HMM is just not accessible by the CPU _at_ all, so
there is obviously no physical address from the CPU point of view for
this kind of memory. The other case is PCIE and BAR size. If we
had PCIE BAR resizing working everywhere we could potentially
use the resized PCIE BAR (though I think some devices have bugs on
that front so I need to check the device side too). So since HMM was
designed without PCIE resize and with totally inaccessible memory,
the only option was to pick some unused physical address range, as
the memory we are hotplugging is not CPU accessible anyway.

It has been on my TODO to try to find a better way to reserve a
physical range but this is highly platform specific. I need to
investigate whether I can tell ACPI on x86 that I want to make
sure the system never assigns some physical address range.

Checking PCIE BAR resize is also on my TODO (on the device side, as
I think some devices are just buggy there and won't accept a BAR
bigger than 256MB and freak out if you try).

So right now I would rather that we keep properly reporting this
hazard so that at least we know it failed because of that. This
also includes making sure that we can not register private memory
as a child of an un-busy resource that does exist but might not
yet have been claimed by its rightful owner.

Existing code makes sure of that; with your change this is a case
that I would not be able to stop. Well, I would have to hot unplug
and try a different physical address, I guess.

Cheers,
Jérôme


Re: [PATCH 2/4] mm/memory-hotplug: allow memory resources to be children

2019-01-16 Thread Jerome Glisse
On Wed, Jan 16, 2019 at 10:19:02AM -0800, Dave Hansen wrote:
> 
> From: Dave Hansen 
> 
> The mm/resource.c code is used to manage the physical address
> space.  We can view the current resource configuration in
> /proc/iomem.  An example of this is at the bottom of this
> description.
> 
> The nvdimm subsystem "owns" the physical address resources which
> map to persistent memory and has resources inserted for them as
> "Persistent Memory".  We want to use this persistent memory, but
> as volatile memory, just like RAM.  The best way to do this is
> to leave the existing resource in place, but add a "System RAM"
> resource underneath it. This clearly communicates the ownership
> relationship of this memory.
> 
> The request_resource_conflict() API only deals with the
> top-level resources.  Replace it with __request_region() which
> will search for !IORESOURCE_BUSY areas lower in the resource
> tree than the top level.
> 
> We also rework the old error message a bit since we do not get
> the conflicting entry back: only an indication that we *had* a
> conflict.

We should keep the device private check (moving it into __request_region),
as device private can try to register unused physical addresses (unused
at the time of device private registration) that can later block valid
physical addresses; the error message you are removing reports such an event.


> 
> We *could* also simply truncate the existing top-level
> "Persistent Memory" resource and take over the released address
> space.  But, this means that if we ever decide to hot-unplug the
> "RAM" and give it back, we need to recreate the original setup,
> which may mean going back to the BIOS tables.
> 
> This should have no real effect on the existing collision
> detection because the areas that truly conflict should be marked
> IORESOURCE_BUSY.

Still, I am worried that this might allow device private to register
itself as a child of some un-busy resource, as this patch obviously
changes the behavior of register_memory_resource().

What about instead explicitly providing the parent resource to add_memory()
and then to register_memory_resource(), so that if it is provided as an
argument (!NULL) you can __request_region(arg_res, ...), otherwise
you keep the existing code intact?
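
A rough sketch of what I mean; this is hypothetical and untested, with error
handling mostly elided:

/* Hypothetical: let the caller pass an explicit parent resource. */
static struct resource *register_memory_resource(u64 start, u64 size,
						 struct resource *parent)
{
	unsigned long flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
	struct resource *res;

	if (!parent)
		parent = &iomem_resource;	/* existing top-level behavior */

	res = __request_region(parent, start, size, "System RAM", flags);
	if (!res) {
		pr_debug("System RAM resource at %#010llx cannot be added\n",
			 (unsigned long long)start);
		return ERR_PTR(-EEXIST);
	}
	return res;
}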

Cheers,
Jérôme


> 
> -0fff : Reserved
> 1000-0009fbff : System RAM
> 0009fc00-0009 : Reserved
> 000a-000b : PCI Bus :00
> 000c-000c97ff : Video ROM
> 000c9800-000ca5ff : Adapter ROM
> 000f-000f : Reserved
>   000f-000f : System ROM
> 0010-9fff : System RAM
>   0100-01e071d0 : Kernel code
>   01e071d1-027dfdff : Kernel data
>   02dc6000-0305dfff : Kernel bss
> a000-afff : Persistent Memory (legacy)
>   a000-a7ff : System RAM
> b000-bffd : System RAM
> bffe-bfff : Reserved
> c000-febf : PCI Bus :00
> 
> Cc: Dan Williams 
> Cc: Dave Jiang 
> Cc: Ross Zwisler 
> Cc: Vishal Verma 
> Cc: Tom Lendacky 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: linux-nvdimm@lists.01.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux...@kvack.org
> Cc: Huang Ying 
> Cc: Fengguang Wu 
> 
> Signed-off-by: Dave Hansen 
> ---
> 
>  b/mm/memory_hotplug.c |   31 ++-
>  1 file changed, 14 insertions(+), 17 deletions(-)
> 
> diff -puN 
> mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child 
> mm/memory_hotplug.c
> --- a/mm/memory_hotplug.c~mm-memory-hotplug-allow-memory-resource-to-be-child 
> 2018-12-20 11:48:42.317771933 -0800
> +++ b/mm/memory_hotplug.c 2018-12-20 11:48:42.322771933 -0800
> @@ -98,24 +98,21 @@ void mem_hotplug_done(void)
>  /* add this memory to iomem resource */
>  static struct resource *register_memory_resource(u64 start, u64 size)
>  {
> - struct resource *res, *conflict;
> - res = kzalloc(sizeof(struct resource), GFP_KERNEL);
> - if (!res)
> - return ERR_PTR(-ENOMEM);
> + struct resource *res;
> + unsigned long flags =  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> + char *resource_name = "System RAM";
>  
> - res->name = "System RAM";
> - res->start = start;
> - res->end = start + size - 1;
> - res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> - conflict =  request_resource_conflict(&iomem_resource, res);
> - if (conflict) {
> - if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
> - pr_debug("Device unaddressable memory block "
> -  "memory hotplug at %#010llx !\n",
> -  (unsigned long long)start);
> - }
> - pr_debug("System RAM resource %pR cannot be added\n", res);
> - kfree(res);
> + /*
> +  * Request ownership of the new memory range.  This might be
> +  * a child of an existing resource that was present but
> +  * not marked as busy.
> +  */
> + res = __request_region(&iomem_resource, start, size,
> +

Re: [PATCH V3 3/4] mm: add a function to differentiate the pages is from DAX device memory

2018-08-13 Thread Jerome Glisse
On Tue, Aug 14, 2018 at 01:41:40AM +0800, Zhang,Yi wrote:
> 
> 
> > On 2018-08-09 17:23, Pankaj Gupta wrote:
> >> The DAX driver hotplugs the device memory and moves it to a memory zone;
> >> these pages will be marked with the reserved flag. However, some other
> >> kernel components will misconceive these pages as reserved mmio (e.g. we
> >> map these dev_dax or fs_dax pages to kvm for a DIMM/NVDIMM backend).
> >> Together with the type MEMORY_DEVICE_FS_DAX, we can use is_dax_page() to
> >> differentiate whether the pages are DAX device memory or not.
> >>
> >> Signed-off-by: Zhang Yi 
> >> Signed-off-by: Zhang Yu 
> >> ---
> >>  include/linux/mm.h | 12 
> >>  1 file changed, 12 insertions(+)
> >>
> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index 68a5121..de5cbc3 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -889,6 +889,13 @@ static inline bool is_device_public_page(const struct
> >> page *page)
> >>page->pgmap->type == MEMORY_DEVICE_PUBLIC;
> >>  }
> >>  
> >> +static inline bool is_dax_page(const struct page *page)
> >> +{
> >> +  return is_zone_device_page(page) &&
> >> +  (page->pgmap->type == MEMORY_DEVICE_FS_DAX ||
> >> +  page->pgmap->type == MEMORY_DEVICE_DEV_DAX);
> >> +}
> > I think question from Dan for KVM VM with 'MEMORY_DEVICE_PUBLIC' still 
> > holds?
> > I am also interested to know if there is any use-case.
> >
> > Thanks,
> > Pankaj
> Yes, it is, thanks for your reminder, Pankaj.
> Adding Jerome for Dan's questions on V1:
> [Dan]:
> 
> Jerome, might there be any use case to pass MEMORY_DEVICE_PUBLIC
> memory to a guest vm?

Yes and no; I am not sure how we are going to do it. But being able to
share a GPU among multiple VMs is on the TODO list and those GPUs will have
MEMORY_DEVICE_PUBLIC|PRIVATE depending on the platform. So either we
pass down the real underlying resource to the guest, or we will pass
down a fake one and have the guest and host drivers talk to each other so
that the host driver can do overall resource management across multiple
guests.

So I would say that for now you can ignore MEMORY_DEVICE_PUBLIC, and when
we get to KVM guest sharing of those and decide how we want to do
it, then we can update kvm to properly interpret those.

Cheers,
Jérôme


Re: [PATCH v6 06/13] mm, dev_pagemap: Do not clear ->mapping on final put

2018-07-23 Thread Jerome Glisse
On Mon, Jul 23, 2018 at 09:12:06AM -0700, Dave Jiang wrote:
> Jerome,
> Is it possible to get an ack for this? Thanks!
> 
> On 07/13/2018 09:50 PM, Dan Williams wrote:
> > MEMORY_DEVICE_FS_DAX relies on typical page semantics whereby ->mapping
> > is only ever cleared by truncation, not final put.
> > 
> > Without this fix dax pages may forget their mapping association at the
> > end of every page pin event.
> > 
> > Move this atypical behavior that HMM wants into the HMM ->page_free()
> > callback.
> > 
> > Cc: 
> > Cc: Jan Kara 
> > Cc: Andrew Morton 
> > Cc: Ross Zwisler 
> > Fixes: d2c997c0f145 ("fs, dax: use page->mapping...")
> > Signed-off-by: Dan Williams 

Acked-by: Jérôme Glisse 

> > ---
> >  kernel/memremap.c |1 -
> >  mm/hmm.c  |2 ++
> >  2 files changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/memremap.c b/kernel/memremap.c
> > index 5857267a4af5..62603634a1d2 100644
> > --- a/kernel/memremap.c
> > +++ b/kernel/memremap.c
> > @@ -339,7 +339,6 @@ void __put_devmap_managed_page(struct page *page)
> > __ClearPageActive(page);
> > __ClearPageWaiters(page);
> >  
> > -   page->mapping = NULL;
> > mem_cgroup_uncharge(page);
> >  
> > page->pgmap->page_free(page, page->pgmap->data);
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index de7b6bf77201..f9d1d89dec4d 100644
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -963,6 +963,8 @@ static void hmm_devmem_free(struct page *page, void 
> > *data)
> >  {
> > struct hmm_devmem *devmem = data;
> >  
> > +   page->mapping = NULL;
> > +
> > devmem->ops->free(devmem, page);
> >  }
> >  
> > 


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Jerome Glisse
On Thu, May 10, 2018 at 01:10:15PM -0600, Alex Williamson wrote:
> On Thu, 10 May 2018 18:41:09 +
> "Stephen  Bates"  wrote:
> > >The reason is that GPUs are giving up on PCIe (see all the specialized links
> > >like NVLink that are popping up in GPU space). So for fast GPU interconnect
> > >we have these new links.
> > 
> > I look forward to Nvidia open-licensing NVLink to anyone who wants to use 
> > it ;-).
> 
> No doubt, the marketing for it is quick to point out the mesh topology
> of NVLink, but I haven't seen any technical documents that describe the
> isolation capabilities or IOMMU interaction.  Whether this is included
> or an afterthought, I have no idea.

AFAIK there is no IOMMU on NVLink between devices; walking a page table and
being able to sustain 80GB/s or 160GB/s is hard to achieve :) I think the idea
behind those interconnects is that devices in the mesh are inherently secure,
i.e. each single device is supposed to make sure that no one can abuse it.

GPUs, with their virtual address spaces and contextualized program execution
units, are supposed to be secure (a Spectre-like bug might be lurking in those
but I doubt it).

So for those interconnects you program physical addresses directly into the
page tables of the devices, and those physical addresses are untranslated from
the hardware perspective.

Note that the kernel driver that does the actual GPU page table programming
can do sanity checks on the values it is setting. So checks can also happen at
setup time. But after that the assumption is that the hardware is secure and
no one can abuse it, AFAICT.

> 
> > >Also the IOMMU isolation does matter a lot to us. Think of someone using
> > >this peer to peer to gain control of a server in the cloud.
> 
> From that perspective, do we have any idea what NVLink means for
> topology and IOMMU provided isolation and translation?  I've seen a
> device assignment user report that seems to suggest it might pretend to
> be PCIe compatible, but the assigned GPU ultimately doesn't work
> correctly in a VM, so perhaps the software compatibility is only so
> deep. Thanks,

Note that each single GPU (in the configurations I am aware of) also has a
PCIE link with the CPU/main memory. So from that point of view they very
much behave like regular PCIE devices. It is just that the GPUs in
the mesh can access each other's memory through a high bandwidth interconnect.

I am not sure how much is public beyond that; I will ask NVidia to try to
have someone chime in on this thread and shed light on this, if possible.

Cheers,
Jérôme


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Jerome Glisse
On Thu, May 10, 2018 at 04:29:44PM +0200, Christian König wrote:
> Am 10.05.2018 um 16:20 schrieb Stephen Bates:
> > Hi Jerome
> > 
> > > As it is tied to PASID, this is done using the IOMMU, so look for callers
> > > of amd_iommu_bind_pasid() or intel_svm_bind_mm(); in GPU land the existing
> > > user is the AMD GPU driver, see:
> > Ah thanks. This cleared things up for me. A quick search shows there are 
> > still no users of intel_svm_bind_mm() but I see the AMD version used in 
> > that GPU driver.
> 
> Just FYI: There is also another effort ongoing to give both the AMD, Intel
> as well as ARM IOMMUs a common interface so that drivers can use whatever
> the platform offers fro SVM support.
> 
> > One thing I could not grok from the code how the GPU driver indicates which 
> > DMA events require ATS translations and which do not. I am assuming the 
> > driver implements someway of indicating that and its not just a global ON 
> > or OFF for all DMAs? The reason I ask is that I looking at if NVMe was to 
> > support ATS what would need to be added in the NVMe spec above and beyond 
> > what we have in PCI ATS to support efficient use of ATS (for example would 
> > we need a flag in the submission queue entries to indicate a particular 
> > IO's SGL/PRP should undergo ATS).
> 
> Oh, well that is complicated at best.
> 
> On very old hardware it wasn't a window, but instead you had to use special
> commands in your shader which indicated that you want to use an ATS
> transaction instead of a normal PCIe transaction for your read/write/atomic.
> 
> As Jerome explained on most hardware we have a window inside the internal
> GPU address space which when accessed issues a ATS transaction with a
> configurable PASID.
> 
> But on very newer hardware that window became a bit in the GPUVM page
> tables, so in theory we now can control it on a 4K granularity basis for the
> internal 48bit GPU address space.
> 

To complete this, a 50-line primer on GPUs:

GPUVA - GPU virtual address
GPUPA - GPU physical address

GPUs run programs very much like CPU programs, except a program will have
many thousands of threads running concurrently. There is a hierarchy of
groups for a given program, i.e. threads are grouped together; the lowest
hierarchy level has a group size of <= 64 threads on most GPUs.

Those programs (called shaders for graphics programs, think OpenGL or Vulkan,
or compute for GPGPU, think OpenCL or CUDA) are submitted by userspace
against a given address space. In the "old" days (a couple of years back,
when dinosaurs were still roaming the earth) this address space was
specific to the GPU and each userspace program could create multiple
GPU address spaces. All the memory operations done by the program were
against this address space. Hence all PCIE transactions are spawned from
a program + address space.

GPUs use a page table + window aperture (the window aperture is going away
so you can focus on the page table) to translate a GPU virtual address into
a physical address. The physical address can point to GPU local memory,
to system memory, or to another PCIE device's memory (i.e. some PCIE BAR).

So all PCIE transactions are spawned through this process of GPUVA to GPUPA;
the GPUPA is then handled by the GPU mmu unit, which either spawns a PCIE
transaction for a non-local GPUPA or accesses local memory otherwise.


So per se the kernel driver does not configure which transaction is
using ATS or peer to peer. The userspace program creates a GPU virtual address
space and binds objects into it. Such an object can be system memory or some
other PCIE device's memory, in which case we would do a peer to peer.


So you won't find any such logic in the kernel. What you find is creating
virtual address spaces and binding objects.


Above I talked about the old days; nowadays we want the GPU virtual address
space to be exactly the same as the CPU virtual address space of the
process which initiated the GPU program. This is where we use PASID and
ATS. So here userspace creates a special "GPU context" that says
that the GPU virtual address space will be the same as that of the program
which created the GPU context. A process ID is then allocated and the
mm_struct is bound to this process ID in the IOMMU driver. Then all programs
executed on the GPU use the process ID to identify the address space against
which they are running.


In all of the above I did not talk about the DMA engines which sit on the
"side" of the GPU to copy memory around. GPUs have multiple DMA engines with
different capabilities; some of those DMA engines use the same GPU address
space as described above, others use GPUPA directly.


Hope this helps understanding the big picture. I over-simplified things and
the devil is in the details.
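
A tiny bit of pseudo-code to summarize the routing described above; this is
conceptual only, none of these helpers exist anywhere, and real hardware does
this inside the GPU MMU rather than in C:

enum gpu_route { GPU_LOCAL_VRAM, GPU_PCIE_TO_SYSTEM_OR_PEER, GPU_ATS_PASID };

/* Conceptual: what the GPU MMU decides after looking up the page table
 * entry for a GPUVA. Both helpers below are hypothetical. */
static enum gpu_route route_gpu_access(u64 pte_target, bool pte_ats_bit)
{
	if (pte_ats_bit)			/* newer hw: per-entry ATS+PASID bit */
		return GPU_ATS_PASID;		/* request goes out via ATS with the context's PASID */
	if (target_is_local_vram(pte_target))	/* hypothetical helper */
		return GPU_LOCAL_VRAM;
	return GPU_PCIE_TO_SYSTEM_OR_PEER;	/* GPUPA -> PCIE transaction (system memory or peer BAR) */
}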

Cheers,
Jérôme


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-10 Thread Jerome Glisse
On Thu, May 10, 2018 at 02:16:25PM +, Stephen  Bates wrote:
> Hi Christian
> 
> > Why would a switch not identify that as a peer address? We use the PASID 
> >together with ATS to identify the address space which a transaction 
> >should use.
> 
> I think you are conflating two types of TLPs here. If the device supports ATS 
> then it will issue a TR TLP to obtain a translated address from the IOMMU. 
> This TR TLP will be addressed to the RP and so regardless of ACS it is going 
> up to the Root Port. When it gets the response it gets the physical address 
> and can use that with the TA bit set for the p2pdma. In the case of ATS 
> support we also have more control over ACS as we can disable it just for TA 
> addresses (as per 7.7.7.7.2 of the spec).
> 
>  >   If I'm not completely mistaken when you disable ACS it is perfectly 
>  >   possible that a bridge identifies a transaction as belonging to a peer 
>  >   address, which isn't what we want here.
>
> You are right here and I think this illustrates a problem for using the IOMMU 
> at all when P2PDMA devices do not support ATS. Let me explain:
> 
> If we want to do a P2PDMA and the DMA device does not support ATS then I 
> think we have to disable the IOMMU (something Mike suggested earlier). The 
> reason is that since ATS is not an option the EP must initiate the DMA using 
> the addresses passed down to it. If the IOMMU is on then this is an IOVA that 
> could (with some non-zero probability) point to an IO Memory address in the 
> same PCI domain. So if we disable ACS we are in trouble as we might MemWr to 
> the wrong place but if we enable ACS we lose much of the benefit of P2PDMA. 
> Disabling the IOMMU removes the IOVA risk and ironically also resolves the 
> IOMMU grouping issues.
> 
> So I think if we want to support performant P2PDMA for devices that don't 
> have ATS (and no NVMe SSDs today support ATS) then we have to disable the 
> IOMMU. I know this is problematic for AMDs use case so perhaps we also need 
> to consider a mode for P2PDMA for devices that DO support ATS where we can 
> enable the IOMMU (but in this case EPs without ATS cannot participate as 
> P2PDMA DMA iniators).
> 
> Make sense?
> 

Note that on GPUs we would not rely on ATS for peer to peer. Some parts
of the GPU (DMA engines) do not necessarily support ATS. Yet those
are the parts likely to be used in peer to peer.

However there is a distinction in objectives here that I believe is lost.
We (aka GPU people, aka the good guys ;)) do not want to do peer to peer
for performance reasons, i.e. we do not care about our transactions going
to the root complex and back down to the destination. At least in the use
case I am working on this is fine.

The reason is that GPUs are giving up on PCIe (see all the specialized links
like NVLink that are popping up in GPU space). So for fast GPU interconnect
we have these new links. Yet for legacy and interoperability we would
like to do peer to peer with other devices like RDMA ... going through
the root complex would be fine from a performance point of view. Worst
case is that it is slower than the existing design where system memory is
used as a bounce buffer.

Also the IOMMU isolation does matter a lot to us. Think of someone using this
peer to peer to gain control of a server in the cloud.

Cheers,
Jérôme


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-09 Thread Jerome Glisse
On Wed, May 09, 2018 at 04:30:32PM +, Stephen  Bates wrote:
> Hi Jerome
> 
> > Now inside that page table you can point a GPU virtual address
> > at GPU memory or at system memory. Those system memory entries can
> > also be marked as ATS against a given PASID.
> 
> Thanks. This all makes sense. 
> 
> But do you have examples of this in a kernel driver (if so can you point me 
> too it) or is this all done via user-space? Based on my grepping of the 
> kernel code I see zero EP drivers using in-kernel ATS functionality right 
> now...
> 

As it is tied to PASID, this is done using the IOMMU, so look for callers
of amd_iommu_bind_pasid() or intel_svm_bind_mm(); in GPU land the existing
user is the AMD GPU driver, see:

drivers/gpu/drm/amd/
drivers/gpu/drm/amd/amdkfd/
drivers/gpu/drm/amd/amdgpu/

Lots of code there. The GPU code details do not really matter for
this discussion though. You do not need to do much to use PASID.
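
For completeness, the bind step itself is tiny from the caller's side;
something along these lines, where the wrapper name is made up and the
amd_iommu_bind_pasid() signature is from the kernels of that era (double-check
your tree):

#include <linux/amd-iommu.h>
#include <linux/sched.h>

/* Bind the calling process's address space to @pasid on @pdev, so GPU
 * ATS translation requests tagged with this PASID walk the CPU page
 * tables of the current task. */
static int bind_current_task_pasid(struct pci_dev *pdev, int pasid)
{
	return amd_iommu_bind_pasid(pdev, pasid, current);
}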

Cheers,
Jérôme


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-09 Thread Jerome Glisse
On Wed, May 09, 2018 at 03:41:44PM +, Stephen  Bates wrote:
> Christian
> 
> >Interesting point, give me a moment to check that. That finally makes 
> >all the hardware I have standing around here valuable :)
> 
> Yes. At the very least it provides an initial standards based path
> for P2P DMAs across RPs which is something we have discussed on this
> list in the past as being desirable.
> 
> BTW I am trying to understand how an ATS capable EP function determines
> when to perform an ATS Translation Request (ATS TR). Is there an
> upstream example of the driver for your APU that uses ATS? If so, can
> you provide a pointer to it. Do you provide some type of entry in the
> submission queues for commands going to the APU to indicate if the
> address associated with a specific command should be translated using
> ATS or not? Or do you simply enable ATS and then all addresses passed
> to your APU that miss the local cache result in a ATS TR?

On GPUs, ATS is always tied to a PASID. You do not do the former without
the latter (AFAICT this is not doable, maybe through some JTAG but not
in normal operation).

GPUs are like CPUs, so you have GPU threads that run against an address
space. This address space uses a page table (very much like the CPU page
table). Now inside that page table you can point a GPU virtual address
at GPU memory or at system memory. Those system memory entries can
also be marked as ATS against a given PASID.

On some GPUs you define a window of GPU virtual addresses that goes through
PASID & ATS (so accesses in that window do not go through the page table
but directly through PASID & ATS).

Jérôme


Re: [PATCH v4 04/14] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-05-08 Thread Jerome Glisse
On Tue, May 08, 2018 at 02:19:05PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 08/05/18 02:13 PM, Alex Williamson wrote:
> > Well, I'm a bit confused, this patch series is specifically disabling
> > ACS on switches, but per the spec downstream switch ports implementing
> > ACS MUST implement direct translated P2P.  So it seems the only
> > potential gap here is the endpoint, which must support ATS or else
> > there's nothing for direct translated P2P to do.  The switch port plays
> > no part in the actual translation of the request, ATS on the endpoint
> > has already cached the translation and is now attempting to use it.
> > For the switch port, this only becomes a routing decision, the request
> > is already translated, therefore ACS RR and EC can be ignored to
> > perform "normal" (direct) routing, as if ACS were not present.  It would
> > be a shame to go to all the trouble of creating this no-ACS mode to find
> > out the target hardware supports ATS and should have simply used it, or
> > we should have disabled the IOMMU altogether, which leaves ACS disabled.
> 
> Ah, ok, I didn't think it was the endpoint that had to implement ATS.
> But in that case, for our application, we need NVMe cards and RDMA NICs
> to all have ATS support and I expect that is just as unlikely. At least
> none of the endpoints on my system support it. Maybe only certain GPUs
> have this support.

I think there is confusion here; Alex properly explained the scheme.
The PCIE device does an ATS request to the IOMMU, which returns a valid
translation for a virtual address. The device can then use that address
directly without going through the IOMMU for translation.

ATS is implemented by the IOMMU, not by the device (well, the device
implements the client side of it). Also ATS is meaningless without something
like PASID, as far as I know.

Cheers,
Jérôme


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-02 Thread Jerome Glisse
On Fri, Mar 02, 2018 at 09:38:43PM +, Stephen  Bates wrote:
> > It seems people misunderstand HMM :( 
> 
> Hi Jerome
> 
> Your unhappy face emoticon made me sad so I went off to (re)read up
> on HMM. Along the way I came up with a couple of things.
> 
> While hmm.txt is really nice to read it makes no mention of
> DEVICE_PRIVATE and DEVICE_PUBLIC. It also gives no indication when
> one might choose to use one over the other. Would it be possible to
> update hmm.txt to include some discussion on this? I understand
> that DEVICE_PUBLIC creates a mapping in the kernel's linear address
> space for the device memory and DEVICE_PRIVATE does not. However,
> like I said, I am not sure when you would use either one and the
> pros and cons of doing so. I actually ended up finding some useful
> information in memremap.h but I don't think it is fair to expect
> people to dig *that* deep to find this information ;-).

Yes, I need to document that some more in hmm.txt. PRIVATE is for devices
that have memory that does not fit regular memory expectations, i.e. cachable,
so PCIe device memory fits under that category. So if all you need is
struct page for such memory then this is a perfect fit. On top of that
you can use more HMM features, like using this memory transparently
inside a process address space.

PUBLIC is for memory that belongs to a device but can still be accessed by
the CPU in a cache coherent way (CAPI, CCIX, ...). Again, if you have such
memory and just want struct page you can use that, and again if you want
to use it inside a process address space HMM provides more helpers to
do so.
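
For reference, the existing helpers that encode this distinction look roughly
like this (as in include/linux/mm.h of that era; shown only to illustrate,
check your tree for the exact code):

static inline bool is_device_private_page(const struct page *page)
{
	return is_zone_device_page(page) &&
	       page->pgmap->type == MEMORY_DEVICE_PRIVATE;	/* CPU can not access it */
}

static inline bool is_device_public_page(const struct page *page)
{
	return is_zone_device_page(page) &&
	       page->pgmap->type == MEMORY_DEVICE_PUBLIC;	/* cache coherent (CAPI, CCIX, ...) */
}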


> A quick grep shows no drivers using the HMM API in the upstream code
> today. Is this correct? Are there any examples of out of tree drivers
> that use HMM you can point me too? As a driver developer what
> resources exist to help me write a HMM aware driver?

I am about to send the RFC for nouveau; I am still working out some bugs.
I was hoping to be done today but I am still fighting with the hardware.
There are other drivers being worked on with HMM. I do not know exactly
when they will be made public (I expect in the coming months).

How you use HMM is under the control of the device driver, as is
how you expose it to userspace. They use it how they want to use it.
There is no pattern or requirement imposed by HMM. All the drivers being
worked on so far are for GPU-like hardware, i.e. a big chunk of on-board
memory (several gigabytes), and they want to use that memory inside a process
address space in a fashion transparent to the program and CPU.

Each has its own API exposed to userspace and while there is a lot of
similarity among them, many details of the userspace API are hardware
specific. In the GPU world most of the driver is in userspace; applications
target high level APIs such as OpenGL, Vulkan, OpenCL or CUDA. Those
APIs then have a hardware specific userspace driver that talks to hardware
specific IOCTLs. So this is not like a network or block device.


> The (very nice) hmm.txt document is not references in the MAINTAINERS
> file? You might want to fix that when you have a moment.

I have a couple of small fixes/typo patches that I need to clean up and send;
I will fix MAINTAINERS as part of those.

Cheers,
Jérôme


Re: [PATCH v2 04/10] PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches

2018-03-01 Thread Jerome Glisse
On Thu, Mar 01, 2018 at 09:32:20PM +, Stephen  Bates wrote:
> > your kernel provider needs to decide whether they favor device assignment 
> > or p2p
> 
> Thanks Alex! The hardware requirements for P2P (switch, high performance EPs) 
> are such that we really only expect CONFIG_P2P_DMA to be enabled in specific 
> instances and in those instances the users have made a decision to favor P2P 
> over IOMMU isolation. Or they have setup their PCIe topology in a way that 
> gives them IOMMU isolation where they want it and P2P where they want it.
> 
> 

Note that they are usecase for P2P where IOMMU isolation matter and
the traffic through root complex isn't see as an issue. For instance
for GPU the idea is that you want to allow the RDMA device to directly
read or write from GPU memory to avoid having to migrate memory to
system memory. This isn't so much for performance than for ease of
use.

Cheers,
Jérôme


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Jerome Glisse
On Thu, Mar 01, 2018 at 02:15:01PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 01/03/18 02:10 PM, Jerome Glisse wrote:
> > It seems people misunderstand HMM :( You do not have to use all of
> > its features. If all you care about is having struct page then just
> > use that; for instance, in your case, only use the following 3 functions:
> > 
> > hmm_devmem_add() or hmm_devmem_add_resource(), and hmm_devmem_remove()
> > for cleanup.
> 
> To what benefit over just using devm_memremap_pages()? If I'm using the hmm
> interface and disabling all the features, I don't see the point. We've also
> cleaned up the devm_memremap_pages() interface to be more usefully generic
> in such a way that I'd hope HMM starts using it too and gets rid of the code
> duplication.
> 

The first HMM variant finds a hole and does not require a resource as an
input parameter. Beside that, internally for PCIE device memory
devm_memremap_pages() does not do the right thing; last time I checked it
always creates a linear mapping of the range, i.e. HMM calls add_pages()
while devm_memremap_pages() calls arch_add_memory().

When I upstreamed HMM, Dan didn't want me to touch devm_memremap_pages()
to match my needs. I am more than happy to modify devm_memremap_pages() to
also handle HMM's needs.

Note that the intention of HMM is to be a middle layer between low level
infrastructure and device drivers. The idea is that such an impedance layer
should make it easier down the road to change how things are handled down
below without having to touch many device drivers.

Cheers,
Jérôme


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Jerome Glisse
On Thu, Mar 01, 2018 at 02:11:34PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 01/03/18 02:03 PM, Benjamin Herrenschmidt wrote:
> > However, what happens if anything calls page_address() on them ? Some
> > DMA ops do that for example, or some devices might ...
> 
> Although we could probably work around it with some pain, we rely on
> page_address() and virt_to_phys(), etc to work on these pages. So on x86,
> yes, it makes it into the linear mapping.

This is pretty easy to do with HMM:

unsigned long hmm_page_to_phys_pfn(struct page *page)
{
	struct hmm_devmem *devmem;
	unsigned long ppfn;

	/* Sanity test, maybe BUG_ON() */
	if (!is_device_private_page(page))
		return -1UL;

	devmem = page->pgmap->data;
	ppfn = page_to_pfn(page) - devmem->pfn_first;
	return ppfn + devmem->device_phys_base_pfn;
}

Note that the last field does not exist in today's HMM because I did not need
such a helper so far, but this can be added.

Cheers,
Jérôme


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Jerome Glisse
On Thu, Mar 01, 2018 at 02:03:26PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 01/03/18 01:55 PM, Jerome Glisse wrote:
> > Well this again a new user of struct page for device memory just for
> > one usecase. I wanted HMM to be more versatile so that it could be use
> > for this kind of thing too. I guess the message didn't go through. I
> > will take some cycles tomorrow to look into this patchset to ascertain
> > how struct page is use in this context.
> 
> We looked at it but didn't see how any of it was applicable to our needs.
> 

It seems people misunderstand HMM :( you do not have to use all of its
features. If all you care about is having struct page, then just use that;
in your case, only use the following three functions:

hmm_devmem_add() or hmm_devmem_add_resource(), and hmm_devmem_remove() for
cleanup.

You can set the fault callback to an empty stub that always returns
VM_FAULT_SIGBUS, or add a patch to allow a NULL callback inside HMM.

You don't have to use the free callback if you don't care, and if there is
something that doesn't quite match what you want, HMM can always be adjusted
to address it.
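
For illustration, a rough sketch of such a minimal user, assuming the
~4.14-era HMM callback signatures (the p2p_devmem_* names are hypothetical
and the exact prototypes may differ between kernel versions):

static int p2p_devmem_fault(struct hmm_devmem *devmem,
			    struct vm_area_struct *vma,
			    unsigned long addr,
			    const struct page *page,
			    unsigned int flags,
			    pmd_t *pmdp)
{
	/* CPU faults on this memory are never expected to succeed. */
	return VM_FAULT_SIGBUS;
}

static void p2p_devmem_free(struct hmm_devmem *devmem, struct page *page)
{
	/* Nothing to do: pages are never handed back for migration. */
}

static const struct hmm_devmem_ops p2p_devmem_ops = {
	.fault = p2p_devmem_fault,
	.free  = p2p_devmem_free,
};

/* 'res' would describe the device memory (e.g. PCIe BAR) region. */
static struct hmm_devmem *p2p_devmem_register(struct device *dev,
					      struct resource *res)
{
	return hmm_devmem_add_resource(&p2p_devmem_ops, dev, res);
}

All this buys you is struct page for the range; none of the migration or
mirroring machinery is involved.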

The intention of HMM is to be useful for all device memory that wishes to
have struct page, for various reasons.

Cheers,
Jérôme


Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory

2018-03-01 Thread Jerome Glisse
On Fri, Mar 02, 2018 at 07:29:55AM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-01 at 11:04 -0700, Logan Gunthorpe wrote:
> > 
> > On 28/02/18 08:56 PM, Benjamin Herrenschmidt wrote:
> > > On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
> > > > The problem is that according to him (I didn't double check the latest
> > > > patches) you effectively hotplug the PCIe memory into the system when
> > > > creating struct pages.
> > > > 
> > > > This cannot possibly work for us. First we cannot map PCIe memory as
> > > > cachable. (Note that doing so is a bad idea if you are behind a PLX
> > > > switch anyway since you'd have to manage cache coherency in SW).
> > > 
> > > Note: I think the above means it won't work behind a switch on x86
> > > either, will it ?
> > 
> > This works perfectly fine on x86 behind a switch and we've tested it on 
> > multiple machines. We've never had an issue of running out of virtual 
> > space despite our PCI bars typically being located with an offset of 
> > 56TB or more. The arch code on x86 also somehow figures out not to map 
> > the memory as cachable so that's not an issue (though, at this point, 
> > the CPU never accesses the memory so even if it were, it wouldn't affect 
> > anything).
> 
> Oliver can you look into this ? You said the memory was effectively
> hotplug'ed into the system when creating the struct pages. That would
> mean to me that it's a) mapped (which for us is cachable, maybe x86 has
> tricks to avoid that) and b) potentially used to populate userspace
> pages (that will definitely be cachable). Unless there's something in
> there you didn't see that prevents it.
> 
> > We also had this working on ARM64 a while back but it required some out 
> > of tree ZONE_DEVICE patches and some truly horrid hacks to it's arch 
> > code to ioremap the memory into the page map.
> > 
> > You didn't mention what architecture you were trying this on.
> 
> ppc64.
> 
> > It may make sense at this point to make this feature dependent on x86 
> > until more work is done to make it properly portable. Something like 
> > arch functions that allow adding IO memory pages to with a specific 
> > cache setting. Though, if an arch has such restrictive limits on the map 
> > size it would probably need to address that too somehow.
> 
> Not fan of that approach.
> 
> So there are two issues to consider here:
> 
>  - Our MMIO space is very far away from memory (high bits set in the
> address) which causes problem with things like vmmemmap, page_address,
> virt_to_page etc... Do you have similar issues on arm64 ?

HMM private (HMM public is different) works around that by looking for a
"hole" in the address space and using it for the hotplug (i.e. page_to_pfn()
!= physical pfn of the memory). This is OK for HMM because the memory is
never mapped by the CPU, and we can find the physical pfn with a little bit
of math (page_to_pfn() - page->pgmap->res->start +
page->pgmap->dev->physical_base_address).

To avoid anything going bad I actually do not populate the kernel linear
mapping for the range, hence there is definitely no CPU access at all through
those struct pages. The CPU can still access the PCIe BAR through the usual
MMIO map.

> 
>  - We need to ensure that the mechanism (which I'm not familiar with)
> that you use to create the struct page's for the device don't end up
> turning those device pages into normal "general use" pages for the
> system. Oliver thinks it does, you say it doesn't, ... 
> 
> Jerome (Glisse), what's your take on this ? Smells like something that
> could be covered by HMM...

Well, this is again a new user of struct page for device memory, just for
one use case. I wanted HMM to be more versatile so that it could be used for
this kind of thing too. I guess the message didn't go through. I will take
some cycles tomorrow to look into this patchset to ascertain how struct page
is used in this context.

Note that I also want peer to peer for HMM users, but with ACS and using the
IOMMU, i.e. having to populate the IOMMU page table of one device to point to
the BAR of another device. I need to test on how many platforms this works;
hardware engineers are unable/unwilling to commit on whether it works or not.
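
To make the intent concrete, a minimal sketch of that idea, assuming
dma_map_resource() is available and leaving out how 'bar_phys' is discovered
(this is not what the patchset under discussion does, just an illustration):

static dma_addr_t map_peer_bar(struct pci_dev *initiator,
			       phys_addr_t bar_phys, size_t size)
{
	dma_addr_t dma;

	/* Populate the initiator's IOMMU mapping so it can reach the
	 * peer's BAR, just like we would DMA-map a regular page. */
	dma = dma_map_resource(&initiator->dev, bar_phys, size,
			       DMA_BIDIRECTIONAL, 0);
	if (dma_mapping_error(&initiator->dev, dma))
		return 0;

	return dma;	/* bus address the initiator uses for P2P DMA */
}

Whether this actually routes through the IOMMU (or works at all behind a
given bridge) is exactly the per-platform question above.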


> Logan, the only reason you need struct page's to begin with is for the
> DMA API right ? Or am I missing something here ?

If it is only needed for that, this sounds like a waste of memory for
struct page. Though I understand this allows the new API to match the
previous one.

Cheers,
Jérôme


Re: [LSF/MM TOPIC] Filesystem-DAX, page-pinning, and RDMA

2018-01-29 Thread Jerome Glisse
On Wed, Jan 24, 2018 at 07:56:02PM -0800, Dan Williams wrote:
> The get_user_pages_longterm() api was recently added as a stop-gap
> measure to prevent applications from growing dependencies on the
> ability to pin DAX-mapped filesystem blocks for RDMA indefinitely
> with no ongoing coordination with the filesystem. This 'longterm'
> pinning is also problematic for the non-DAX VMA case where the core-mm
> needs a time bounded way to revoke a pin and manipulate the physical
> pages. While existing RDMA applications have already grown the
> assumption that they can pin page-cache pages indefinitely, the fact
> that we are breaking this assumption for filesystem-dax presents an
> opportunity to deprecate the 'indefinite pin' mechanisms and move to a
> general interface that supports pin revocation.
> 
> While RDMA may grow an explicit Infiniband-verb for this 'memory
> registration with lease' semantic, it seems that this problem is
> bigger than just RDMA. At LSF/MM it would be useful to have a
> discussion between fs, mm, dax, and RDMA folks about addressing this
> problem at the core level.
> 
> Particular people that would be useful to have in attendance are
> Michal Hocko, Christoph Hellwig, and Jason Gunthorpe (cc'd).
> 

By the way, I would also like to participate. In my view the burden should
be on GUP users, so if the hardware is not ODP capable then you should at
least be able to kill the mapping/GUP and force the hardware to redo a GUP
if it gets any more transactions on the affected umem. Can non-ODP hardware
do that? Or is it out of the question?

Cheers,
Jérôme


Re: [PATCH 04/12] pci-p2p: Clear ACS P2P flags for all client devices

2018-01-04 Thread Jerome Glisse
On Thu, Jan 04, 2018 at 08:33:00PM -0700, Alex Williamson wrote:
> On Thu, 4 Jan 2018 17:00:47 -0700
> Logan Gunthorpe  wrote:
> 
> > On 04/01/18 03:35 PM, Alex Williamson wrote:
> > > Yep, flipping these ACS bits invalidates any IOMMU groups that depend
> > > on the isolation of that downstream port and I suspect also any peers
> > > within the same PCI slot of that port and their downstream devices.  The
> > > entire sub-hierarchy grouping needs to be re-evaluated.  This
> > > potentially affects running devices that depend on that isolation, so
> > > I'm not sure how that happens dynamically.  A boot option might be
> > > easier.  Thanks,  
> > 
> > I don't see how this is the case in current kernel code. It appears to 
> > only enable ACS globally if the IOMMU requests it.
> 
> IOMMU groups don't exist unless the IOMMU is enabled and x86 and ARM
> both request ACS be enabled if an IOMMU is present, so I'm not sure
> what you're getting at here.  Also, in reply to your other email, if
> the IOMMU is enabled, every device handled by the IOMMU is a member of
> an IOMMU group, see struct device.iommu_group.  There's an
> iommu_group_get() accessor to get a reference to it.
>  
> > I also don't see how turning off ACS isolation for a specific device is 
> > going to hurt anything. The IOMMU should still be able to keep going on 
> > unaware that anything has changed. The only worry is that a security 
> > hole may now be created if a user was relying on the isolation between 
> > two devices that are in different VMs or something. However, if a user 
> > was relying on this, they probably shouldn't have turned on P2P in the 
> > first place.
> 
> That's exactly what IOMMU groups represent, the smallest set of devices
> which have DMA isolation from other devices.  By poking this hole, the
> IOMMU group is invalid.  We cannot turn off ACS only for a specific
> device, in order to enable p2p it needs to be disabled at every
> downstream port between the devices where we want to enable p2p.
> Depending on the topology, that could mean we're also enabling p2p for
> unrelated devices.  Those unrelated devices might be in active use and
> the p2p IOVAs now have a different destination which is no longer IOMMU
> translated.
>  
> > We started with a fairly unintelligent choice to simply disable ACS on 
> > any kernel that had CONFIG_PCI_P2P set. However, this did not seem like 
> > a good idea going forward. Instead, we now selectively disable the ACS 
> > bit only on the downstream ports that are involved in P2P transactions. 
> > This seems like the safest choice and still allows people to (carefully) 
> > use P2P adjacent to other devices that need to be isolated.
> 
> I don't see that the code is doing much checking that adjacent devices
> are also affected by the p2p change and of course the IOMMU group is
> entirely invalid once the p2p holes start getting poked.
> 
> > I don't think anyone wants another boot option that must be set in order 
> > to use this functionality (and only some hardware would require this). 
> > That's just a huge pain for users.
> 
> No, but nor do we need IOMMU groups that no longer represent what
> they're intended to describe or runtime, unchecked routing changes
> through the topology for devices that might already be using
> conflicting IOVA ranges.  Maybe soft hotplugs are another possibility,
> designate a sub-hierarchy to be removed and re-scanned with ACS
> disabled.  Otherwise it seems like disabling and re-enabling ACS needs
> to also handle merging and splitting groups dynamically.  Thanks,
> 

Dumb question: can we put a PCI BAR address of one device into the IOMMU
page table of another device, i.e. like we would DMA-map a regular system
page?

It would be much better in my view to go down that path if it is at all
possible from a hardware point of view (I am not sure where to dig in the
specification to answer my question above).

Cheers,
Jérôme


Re: [RFC v2 0/5] surface heterogeneous memory performance information

2017-07-06 Thread Jerome Glisse
On Thu, Jul 06, 2017 at 03:52:28PM -0600, Ross Zwisler wrote:

[...]

> 
>  Next steps 
> 
> There is still a lot of work to be done on this series, but the overall
> goal of this RFC is to gather feedback on which of the two options we
> should pursue, or whether some third option is preferred.  After that is
> done and we have a solid direction we can add support for ACPI hot add,
> test more complex configurations, etc.
> 
> So, for applications that need to differentiate between memory ranges based
> on their performance, what option would work best for you?  Is the local
> (initiator,target) performance provided by patch 5 enough, or do you
> require performance information for all possible (initiator,target)
> pairings?

Am I right in assuming that HBM or any faster memory will be relatively
small (1 GB - 8 GB, maybe 16 GB?) and of fixed amount (i.e. the size will
depend on the exact CPU model you have)?

If so, I am wondering whether we should restrict NUMA placement policy for
such nodes to VMAs only, and forbid any policy that would prefer those nodes
globally at the thread/process level. This would avoid a process-wide policy
exhausting this smaller pool of memory.
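
As a userspace illustration of that distinction (node 1 here is a made-up
stand-in for an HBM node): mbind() attaches the policy to a single mapping,
whereas set_mempolicy() would make the whole thread prefer that node.

#include <numaif.h>
#include <sys/mman.h>

static void *alloc_on_hbm_node(size_t len)
{
	unsigned long nodemask = 1UL << 1;	/* assumed HBM node id 1 */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	/* Per-VMA policy: only this mapping is bound to the small node. */
	if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0)) {
		munmap(p, len);
		return NULL;
	}
	return p;
}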

The drawback of doing so would be that existing applications would not
benefit from it, so workloads where it is acceptable to exhaust such memory
wouldn't benefit until their applications are updated.


This is definitely not something impacting this patchset. I am just thinking
about this at large, and I believe that NUMA might need to evolve slightly
to better handle memory hierarchy.

Cheers,
Jérôme


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-19 Thread Jerome Glisse
On Wed, Apr 19, 2017 at 10:01:23AM -0700, Dan Williams wrote:
> On Wed, Apr 19, 2017 at 9:48 AM, Logan Gunthorpe  wrote:
> >
> >
> > On 19/04/17 09:55 AM, Jason Gunthorpe wrote:
> >> I was thinking only this one would be supported with a core code
> >> helper..
> >
> > Pivoting slightly: I was looking at how HMM uses ZONE_DEVICE. They add a
> > type flag to the dev_pagemap structure which would be very useful to us.
> > We could add another MEMORY_DEVICE_P2P type to distinguish p2p pages.
> > Then, potentially, we could add a dma_map callback to the structure
> > (possibly unioned with an hmm field). The dev_ops providers would then
> > just need to do something like this (enclosed in a helper):
> >
> >     if (is_zone_device_page(page)) {
> >             pgmap = get_dev_pagemap(page_to_pfn(page));
> >             if (!pgmap || pgmap->type != MEMORY_DEVICE_P2P ||
> >                 !pgmap->dma_map)
> >                     return 0;
> >
> >             dma_addr = pgmap->dma_map(dev, pgmap->dev, page);
> >             put_dev_pagemap(pgmap);
> >             if (!dma_addr)
> >                     return 0;
> >             ...
> >     }
> >
> > The pci_enable_p2p_bar function would then just need to call
> > devm_memremap_pages with the dma_map callback set to a function that
> > does the segment check and the offset calculation.
> >
> > Thoughts?
> >
> > @Jerome: my feedback to you would be that your patch assumes all users
> > of devm_memremap_pages are MEMORY_DEVICE_PERSISTENT. It would be more
> > useful if it was generic. My suggestion would be to have the caller
> > allocate the dev_pagemap structure, populate it and pass it into
> > devm_memremap_pages. Given that pretty much everything in that structure
> > are already arguments to that function, I feel like this makes sense.
> > This should also help to unify hmm_devmem_pages_create and
> > devm_memremap_pages which look very similar to each other.
> 
> I like that change. Also the types should describe the memory relative
> to its relationship to struct page, not whether it is persistent or
> not. I would consider volatile and persistent memory that is attached
> to the cpu memory controller and i/o coherent as the same type of
> memory. DMA incoherent ranges like P2P and HMM should get their own
> types.

Dan, you asked me not to use devm_memremap_pages() because you didn't want
to have HMM memory in the pgmap_radix; did you change your opinion on
that? :)

Note I won't make any change now on that front, but if it makes sense I am
happy to do it as a separate patchset on top of HMM.

Also, I don't want p2pmem to be exclusive with HMM; we will want GPUs to do
peer-to-peer DMA and thus HMM ZONE_DEVICE pages to support this too.

I do believe it is easier to special-case ZONE_DEVICE in the existing
dma_ops of each architecture. For x86 I think there are only 3 different
sets of dma_ops to modify; for other arches I guess there are even fewer.
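
A rough sketch of what that special-casing could look like inside one such
dma_ops implementation (p2p_page_to_bus_addr() is hypothetical, and this is
not code from any posted patchset):

static const struct dma_map_ops *real_ops;

static dma_addr_t p2p_aware_map_page(struct device *dev, struct page *page,
				     unsigned long offset, size_t size,
				     enum dma_data_direction dir,
				     unsigned long attrs)
{
	/* ZONE_DEVICE pages translate to a peer bus address ... */
	if (is_zone_device_page(page))
		return p2p_page_to_bus_addr(dev, page) + offset;

	/* ... everything else goes through the original dma_ops. */
	return real_ops->map_page(dev, page, offset, size, dir, attrs);
}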

But in all cases I think p2pmem should stay out of the memory management
business. If some set of devices does not have memory management, it is
better to propose helpers for that as part of the subsystem to which those
devices belong. Just wanted to reiterate that point.

Cheers,
Jérôme


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Jerome Glisse
> On Tue, Apr 18, 2017 at 12:35 PM, Logan Gunthorpe 
> wrote:
> >
> >
> > On 18/04/17 01:01 PM, Jason Gunthorpe wrote:
> >> Ultimately every dma_ops will need special code to support P2P with
> >> the special hardware that ops is controlling, so it makes some sense
> >> to start by pushing the check down there in the first place. This
> >> advice is partially motivated by how dma_map_sg is just a small
> >> wrapper around the function pointer call...
> >
> > Yes, I noticed this problem too and that makes sense. It just means
> > every dma_ops will probably need to be modified to either support p2p
> > pages or fail on them. Though, the only real difficulty there is that it
> > will be a lot of work.
> 
> I don't think you need to go touch all dma_ops, I think you can just
> arrange for devices that are going to do dma to get redirected to a
> p2p aware provider of operations that overrides the system default
> dma_ops. I.e. just touch get_dma_ops().

This would not work well for everyone; for instance, on GPUs we usually
have buffer objects with a mix of device memory and regular system memory,
but we call dma_map_sg() once for the whole list.

Cheers,
Jérôme


Re: [LSF/MM TOPIC] Memory hotplug, ZONE_DEVICE, and the future of struct page

2017-01-12 Thread Jerome Glisse
On Thu, Jan 12, 2017 at 02:43:03PM -0800, Dan Williams wrote:
> Back when we were first attempting to support DMA for DAX mappings of
> persistent memory the plan was to forgo 'struct page' completely and
> develop a pfn-to-scatterlist capability for the dma-mapping-api. That
> effort died in this thread:
> 
> https://lkml.org/lkml/2015/8/14/3
> 
> ...where we learned that the dependencies on struct page for dma
> mapping are deeper than a PFN_PHYS() conversion for some
> architectures. That was the moment we pivoted to ZONE_DEVICE and
> arranged for a 'struct page' to be available for any persistent memory
> range that needs to be the target of DMA. ZONE_DEVICE enables any
> device-driver that can target "System RAM" to also be able to target
> persistent memory through a DAX mapping.
> 
> Since that time the "page-less" DAX path has continued to mature [1]
> without growing new dependencies on struct page, but at the same time
> continuing to rely on ZONE_DEVICE to satisfy get_user_pages().
> 
> Peer-to-peer DMA appears to be evolving from a niche embedded use case
> to something general purpose platforms will need to comprehend. The
> "map_peer_resource" [2] approach looks to be headed to the same
> destination as the pfn-to-scatterlist effort. It's difficult to avoid
> 'struct page' for describing DMA operations without custom driver
> code.
> 
> With that background, a statement and a question to discuss at LSF/MM:
> 
> General purpose DMA, i.e. any DMA setup through the dma-mapping-api,
> requires pfn_to_page() support across the entire physical address
> range mapped.

Note that in my case it is even worse. The pfn of the page does not
correspond to anything, so it needs to go through a special function to find
out whether a page can be mapped for another device and to provide a valid
pfn at which the page can be accessed by the other device.

Basically the PCIe BAR is like a window into the device memory that is
dynamically remapped to specific pages of the device memory. Not all device
memory can be exposed through the PCIe BAR because of PCIe issues.

> 
> Is ZONE_DEVICE the proper vehicle for this? We've already seen that it
> collides with platform alignment assumptions [3], and if there's a
> wider effort to rework memory hotplug [4] it seems DMA support should
> be part of the discussion.

Obviously I would like to join this discussion :)

Cheers,
Jérôme


Re: Enabling peer to peer device transactions for PCIe devices

2017-01-06 Thread Jerome Glisse
On Fri, Jan 06, 2017 at 11:56:30AM -0500, Serguei Sagalovitch wrote:
> On 2017-01-05 08:58 PM, Jerome Glisse wrote:
> > On Thu, Jan 05, 2017 at 05:30:34PM -0700, Jason Gunthorpe wrote:
> > > On Thu, Jan 05, 2017 at 06:23:52PM -0500, Jerome Glisse wrote:
> > > 
> > > > > I still don't understand what you driving at - you've said in both
> > > > > cases a user VMA exists.
> > > > In the former case no, there is no VMA directly but if you want one than
> > > > a device can provide one. But such VMA is useless as CPU access is not
> > > > expected.
> > > I disagree it is useless, the VMA is going to be necessary to support
> > > upcoming things like CAPI, you need it to support O_DIRECT from the
> > > filesystem, DPDK, etc. This is why I am opposed to any model that is
> > > not VMA based for setting up RDMA - that is shorted sighted and does
> > > not seem to reflect where the industry is going.
> > > 
> > > So focus on having VMA backed by actual physical memory that covers
> > > your GPU objects and ask how do we wire up the '__user *' to the DMA
> > > API in the best way so the DMA API still has enough information to
> > > setup IOMMUs and whatnot.
> > I am talking about 2 different thing. Existing hardware and API where you
> > _do not_ have a vma and you do not need one. This is just existing stuff.
> I do not understand why you assume that existing API doesn't  need one.
> I would say that a lot of __existing__ user level API and their support in
> kernel (especially outside of graphics domain) assumes that we have vma and
> deal with __user * pointers.

Well, I am thinking of GPUDirect here. Some GPUDirect use cases do not have
a vma (struct vm_area_struct) associated with them; they directly apply to
GPU objects that aren't exposed to the CPU. Yes, some use cases have a vma
for shared buffers.

In the open source drivers it is true that we have a vma more often than not.

> > Some close driver provide a functionality on top of this design. Question
> > is do we want to do the same ? If yes and you insist on having a vma we
> > could provide one but this is does not apply and is useless for where we
> > are going with new hardware.
> > 
> > With new hardware you just use malloc or mmap to allocate memory and then
> > you use it directly with the device. Device driver can migrate any part of
> > the process address space to device memory. In this scheme you have your
> > usual VMAs but there is nothing special about them.
>
> Assuming that the whole device memory is CPU accessible and it looks
> like the direction where we are going:
> - You forgot about use case when we want or need to allocate memory
> directly on device (why we need to migrate anything if not needed?).
> - We may want to use CPU to access such memory on device to avoid
> any unnecessary migration back.
> - We may have more device memory than the system one.
> E.g. if you have 12 GPUs w/64GB each it will already give us ~0.7 TB
> not mentioning NVDIMM cards which could also be used as memory
> storage for other device access.
> - We also may want/need to share GPU memory between different
> processes.

Here I am talking about platforms where GPU memory is not accessible at all
by the CPU (because of PCIe restrictions, think CPU atomic operations on IO
memory).

So I really distinguish between CAPI/CCIX and PCIe. Some platforms will have
CAPI/CCIX, others won't. HMM applies mostly to the latter. Some HMM
functionality is still useful with CAPI/CCIX.

Note that HMM does support allocation on the GPU first. In the current
design this can happen when the GPU is the first to access an unpopulated
virtual address.


For platforms where GPU memory is accessible, the plan is either something
like CDM (Coherent Device Memory) or relying on ZONE_DEVICE. So all GPU
memory has struct page and those are like ordinary pages. CDM still wants
some restrictions, like avoiding CPU allocations landing on the GPU when
there is memory pressure... For all intents and purposes this will work
transparently with respect to RDMA, because we assume on those systems that
the RDMA device is CAPI/CCIX and can peer with the other device.


> > Now when you try to do get_user_page() on any page that is inside the
> > device it will fails because we do not allow any device memory to be pin.
> > There is various reasons for that and they are not going away in any hw
> > in the planing (so for next few years).
> > 
> > Still we do want to support peer to peer mapping. Plan is to only do so
> > with ODP capable hardware. Still we need to solve the IOMMU issue and
> > it needs special handling inside the RDMA device. The way it works is
> > that RDMA ask for a GPU page, 

Re: Enabling peer to peer device transactions for PCIe devices

2017-01-05 Thread Jerome Glisse
On Thu, Jan 05, 2017 at 05:30:34PM -0700, Jason Gunthorpe wrote:
> On Thu, Jan 05, 2017 at 06:23:52PM -0500, Jerome Glisse wrote:
> 
> > > I still don't understand what you driving at - you've said in both
> > > cases a user VMA exists.
> > 
> > In the former case no, there is no VMA directly but if you want one than
> > a device can provide one. But such VMA is useless as CPU access is not
> > expected.
> 
> I disagree it is useless, the VMA is going to be necessary to support
> upcoming things like CAPI, you need it to support O_DIRECT from the
> filesystem, DPDK, etc. This is why I am opposed to any model that is
> not VMA based for setting up RDMA - that is shorted sighted and does
> not seem to reflect where the industry is going.
> 
> So focus on having VMA backed by actual physical memory that covers
> your GPU objects and ask how do we wire up the '__user *' to the DMA
> API in the best way so the DMA API still has enough information to
> setup IOMMUs and whatnot.

I am talking about 2 different things: existing hardware and APIs where you
_do not_ have a vma and you do not need one. This is just existing stuff.
Some closed drivers provide functionality on top of this design. The
question is, do we want to do the same? If yes, and you insist on having a
vma, we could provide one, but it does not apply and is useless for where we
are going with new hardware.

With new hardware you just use malloc or mmap to allocate memory and then
you use it directly with the device. The device driver can migrate any part
of the process address space to device memory. In this scheme you have your
usual VMAs, but there is nothing special about them.

Now when you try to do get_user_pages() on any page that is inside the
device, it will fail because we do not allow any device memory to be pinned.
There are various reasons for that and they are not going away in any
hardware in the planning (so for the next few years).

Still, we do want to support peer-to-peer mapping. The plan is to only do so
with ODP-capable hardware. We still need to solve the IOMMU issue, and it
needs special handling inside the RDMA device. The way it works is that RDMA
asks for a GPU page, the GPU checks whether it has room inside its PCI BAR
to map this page for the device (this can fail), and if it succeeds you then
need the IOMMU to let the RDMA device access the GPU PCI BAR.

So here we have 2 orthogonal problems: the first one is how to make 2
drivers talk to each other to set up a mapping that allows peer to peer, and
the second is the IOMMU.
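
A sketch of that two-step flow (every rdma_*/gpu_* name here is hypothetical;
dma_map_resource() stands in for whatever the IOMMU side ends up being):

static int rdma_map_gpu_page(struct rdma_dev *rdma, struct gpu_dev *gpu,
			     struct page *page, dma_addr_t *out)
{
	phys_addr_t bar_phys;
	int ret;

	/* Step 1: ask the GPU to expose the page through its PCIe BAR.
	 * BAR space is limited, so this can fail. */
	ret = gpu_expose_page_in_bar(gpu, page, &bar_phys);
	if (ret)
		return ret;	/* caller falls back: migrate to system RAM */

	/* Step 2: let the RDMA device reach that BAR address through
	 * its IOMMU mapping. */
	*out = dma_map_resource(rdma->dma_dev, bar_phys, PAGE_SIZE,
				DMA_BIDIRECTIONAL, 0);
	if (dma_mapping_error(rdma->dma_dev, *out)) {
		gpu_unexpose_page(gpu, page);
		return -EIO;
	}
	return 0;
}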


> > What i was trying to get accross is that no matter what level you
> > consider in the end you still need something at the DMA API level.
> > And that the 2 different use case (device vma or regular vma) means
> > 2 differents API for the device driver.
> 
> I agree we need new stuff at the DMA API level, but I am opposed to
> the idea we need two API paths that the *driver* has to figure out.
> That is fundamentally not what I want as a driver developer.
> 
> Give me a common API to convert '__user *' to a scatter list and pin
> the pages. This needs to figure out your two cases. And Huge
> Pages. And ZONE_DIRECT.. (a better get_user_pages)

Pinning is not gonna happen; like I said, it would hinder the GPU to the
point it would become useless.


> Give me an API to take the scatter list and DMA map it, handling all
> the stuff associated with peer-peer. (a better dma_map_sg)
> 
> Give me a notifier scheme to rework my scatter list when physical
> pages need to change (mmu notifiers)
> 
> Use the scatter list memory to convey needed information from the
> first step to the second.
> 
> Do not bother the driver with distinctions on what kind of memory is
> behind that VMA. Don't ask me to use get_user_pages or
> gpu_get_user_pages, do not ask me to use dma_map_sg or
> dma_map_sg_peer_direct. The Driver Doesn't Need To Know.

I understand you want it easy, but some part must be aware, at the very
least the ODP logic. Creating a peer-to-peer mapping is a multi-step process
and some of those steps can fail. The fallback is always to migrate back to
system memory as a default path that cannot fail, except if we are out of
memory.


> IMHO this is why GPU direct is not mergable - it creates a crazy
> parallel mini-mm subsystem inside RDMA and uses that to connect to a
> GPU driver, everything is expected to have parallel paths for GPU
> direct and normal MM. No good at all.

Existing hardware and new hardware work differently. I am trying to explain
the two different designs needed for each one. You understandably dislike
the existing hardware, which has more stringent requirements, cannot be
supported transparently, and needs dedicated communication between the two
drivers.

New hardware has a completely different API in userspace. We can decide to
only support the latter and

Re: Enabling peer to peer device transactions for PCIe devices

2017-01-05 Thread Jerome Glisse
On Thu, Jan 05, 2017 at 03:42:15PM -0700, Jason Gunthorpe wrote:
> On Thu, Jan 05, 2017 at 03:19:36PM -0500, Jerome Glisse wrote:
> 
> > > Always having a VMA changes the discussion - the question is how to
> > > create a VMA that reprensents IO device memory, and how do DMA
> > > consumers extract the correct information from that VMA to pass to the
> > > kernel DMA API so it can setup peer-peer DMA.
> > 
> > Well my point is that it can't be. In HMM case inside a single VMA
> > you
> [..]
> 
> > In the GPUDirect case the idea is that you have a specific device vma
> > that you map for peer to peer.
> 
> [..]
> 
> I still don't understand what you driving at - you've said in both
> cases a user VMA exists.

In the former case no, there is no VMA directly but if you want one than
a device can provide one. But such VMA is useless as CPU access is not
expected.

> 
> From my perspective in RDMA, all I want is a core kernel flow to
> convert a '__user *' into a scatter list of DMA addresses, that works no
> matter what is backing that VMA, be it HMM, a 'hidden' GPU object, or
> struct page memory.
> 
> A '__user *' pointer is the only way to setup a RDMA MR, and I see no
> reason to have another API at this time.
> 
> The details of how to translate to a scatter list are a MM subject,
> and the MM folks need to get 
> 
> I just don't care if that routine works at a page level, or a whole
> VMA level, or some combination of both, that is up to the MM team to
> figure out :)

And that's what I am trying to get across. There are 2 cases here: what
exists on today's hardware, things like GPUDirect, which work at the VMA
level; versus where some new hardware is going, where we want to do things
at the page level. Both require different APIs at different levels.

What I was trying to get across is that no matter what level you consider,
in the end you still need something at the DMA API level. And the 2
different use cases (device vma or regular vma) mean 2 different APIs for
the device driver.

> 
> > a page level. Expectation here is that the GPU userspace expose a special
> > API to allow RDMA to directly happen on GPU object allocated through
> > GPU specific API (ie it is not regular memory and it is not accessible
> > by CPU).
> 
> So, how do you identify these GPU objects? How do you expect RDMA
> convert them to scatter lists? How will ODP work?

No ODP on those. If you want a vma, the GPU device driver can provide one.
GPU objects are disjoint from regular memory (which comes from some form of
mmap). They are created through ioctls and in many cases are never exposed
to the CPU; they only exist inside the GPU driver realm.

Nonetheless, there are use cases where exchanging those objects across
computers over a network makes sense. I am not an end user here :)


> > > We have MMU notifiers to handle this today in RDMA. Async RDMA MR
> > > Invalidate like you see in the above out of tree patches is totally
> > > crazy and shouldn't be in mainline. Use ODP capable RDMA hardware.
> > 
> > Well there is still a large base of hardware that do not have such
> > feature and some people would like to be able to keep using those.
> 
> Hopefully someone will figure out how to do that without the crazy
> async MR invalidation.

Personally I don't care too much about this old hardware, and thus I am
fine without supporting it. The open source userspace is playing catch-up,
and doing features for old hardware probably does not make sense.

Cheers,
Jérôme


Re: Enabling peer to peer device transactions for PCIe devices

2017-01-05 Thread Jerome Glisse
On Thu, Jan 05, 2017 at 01:07:19PM -0700, Jason Gunthorpe wrote:
> On Thu, Jan 05, 2017 at 02:54:24PM -0500, Jerome Glisse wrote:
> 
> > Mellanox and NVidia support peer to peer with what they market a
> > GPUDirect. It only works without IOMMU. It is probably not upstream :
> > 
> > https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg21402.html
> > 
> > I thought it was but it seems it require an out of tree driver to work.
> 
> Right, it is out of tree and not under consideration for mainline.
> 
> > Wether there is a vma or not isn't important to the issue anyway. If
> > you want to enforce VMA rule for RDMA it is an RDMA specific discussion
> > in which i don't want to be involve, it is not my turf :)
> 
> Always having a VMA changes the discussion - the question is how to
> create a VMA that reprensents IO device memory, and how do DMA
> consumers extract the correct information from that VMA to pass to the
> kernel DMA API so it can setup peer-peer DMA.

Well, my point is that it can't be. In the HMM case, inside a single VMA
you can have one page inside GPU memory at address A but the next page
inside regular memory at A+4k, so handling this at the VMA level does not
make sense. In this case you would get the device from the struct page and
query through a common API to determine whether you can do peer to peer. If
not, it would trigger a migration back to regular memory. If yes, then you
still have to solve the IOMMU issue, and hence the DMA API changes that were
proposed.
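
A per-page sketch of that decision (all the peer_*/migrate_* helpers here
are hypothetical; only is_zone_device_page() and the DMA API calls are real):

static int rdma_get_dma_addr(struct device *rdma_dev, struct page *page,
			     dma_addr_t *out)
{
	if (is_zone_device_page(page)) {
		struct device *owner = page_owner_device(page);

		/* No peer mapping possible: migrate back to system RAM. */
		if (!peer_to_peer_possible(rdma_dev, owner))
			return migrate_page_to_system_ram(page);

		/* Peer mapping possible: still needs the IOMMU/DMA API bits. */
		return peer_map_page(rdma_dev, owner, page, out);
	}

	/* Regular system memory page inside the same VMA. */
	*out = dma_map_page(rdma_dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
	return dma_mapping_error(rdma_dev, *out) ? -EIO : 0;
}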

In the GPUDirect case the idea is that you have a specific device vma that
you map for peer to peer. Here things can be at the vma level and not at a
page level. The expectation is that the GPU userspace exposes a special API
to allow RDMA to happen directly on GPU objects allocated through a
GPU-specific API (i.e. it is not regular memory and it is not accessible by
the CPU).


Both cases are disjoint. Both need to solve the IOMMU issue, which seems to
be best solved at the DMA API level.


> > What matter is the back channel API between peer-to-peer device. Like
> > the above patchset points out for GPU we need to be able to invalidate
> > a mapping at any point in time. Pining is not something we want to
> > live with.
> 
> We have MMU notifiers to handle this today in RDMA. Async RDMA MR
> Invalidate like you see in the above out of tree patches is totally
> crazy and shouldn't be in mainline. Use ODP capable RDMA hardware.

Well, there is still a large base of hardware that does not have such a
feature, and some people would like to be able to keep using it. I believe
allowing direct access to GPU objects that are otherwise hidden from regular
kernel memory management is still meaningful.

Cheers,
Jérôme



Re: Enabling peer to peer device transactions for PCIe devices

2017-01-05 Thread Jerome Glisse
Sorry to revive this thread, but it fell through my filters and I missed
it. I have been going through it and I think the discussion has been
hindered by the fact that distinct problems were merged while they should be
addressed separately.

First, for peer-to-peer we need to be clear on how this happens. Two cases
here:
  1) peer-to-peer because of a userspace-specific API like NVIDIA GPUDirect
(AMD is pushing its own similar API, I just can't remember the marketing
name). This does not happen through a vma; it happens through specific
device driver calls going through device-specific ioctls on both sides (GPU
and RDMA). So both kernel drivers are aware of each other.
  2) peer-to-peer because the RDMA device is trying to access a regular vma
(i.e. non-special: either private anonymous memory, shared memory, or an
mmap of a regular file, not a device file).

For 1) there is no need to over-complicate things. The device drivers must
have a back-channel between them and must be able to invalidate their
respective mappings (i.e. the GPU must be able to ask the RDMA device to
kill/stop its MR).

So the remaining issue for 1) is how to enable an effective peer-to-peer
mapping given that it might not work reliably on all platforms. Here Alex
was listing the existing proposals:
  A) P2P DMA - DMA-API/PCI map_peer_resource support for peer-to-peer
     http://www.spinics.net/lists/linux-pci/msg44560.html
  B) ZONE_DEVICE IO - Direct I/O and DMA for persistent memory
     https://lwn.net/Articles/672457/
  C) DMA-BUF - RDMA subsystem DMA-BUF support
     http://www.spinics.net/lists/linux-rdma/msg38748.html
  D) iopmem - a block device for PCIe memory
     https://lwn.net/Articles/703895/
  E) HMM (not interesting for case 1)
  F) Something new

Of the above, D is ill-suited for GPUs, as we do not want to pin GPU memory
and D is designed around long-lived objects that do not move. Also, I do not
think that exposing a device PCIe BAR through a new /dev/somefilename is a
good idea for GPUs. So I think this should be discarded.

HMM should be discarded with respect to case 1 too; it is useful for case 2.
I don't think dma-buf is the right path either.

So I think only A and B make sense. For use case 1 I think A is the best
solution: there is no need to have struct page, and it requires explicit
knowledge in the device driver that it is mapping another device's memory,
which is a given in use case 1.


If we look at case 2, the situation is a bit more complex. Here RDMA is
just trying to access a regular VMA, but it might happen that some memory
inside that VMA resides in device memory. When that happens we would like to
avoid moving that memory back to system memory, assuming that a peer mapping
is doable.

Use case 2 assumes that the GPU is either on a platform with CAPI or CCIX
(or something similar), in which case it is easy, as device memory will have
struct page, is always accessible by the CPU, and is transparent for
device-to-device access (AFAICT).

So we are left with platforms that do not have proper support for device
memory (i.e. the CPU cannot access it the same way as DDR, or only has
limited access), which applies to x86 for the foreseeable future.

This is the problem HMM addresses: allowing device memory to be used
transparently inside a process even when direct CPU access is not permitted.
I plan to support peer-to-peer with HMM because it is an important use case.
The idea is to have the device driver fault against the ZONE_DEVICE page and
communicate through a common API to establish the mapping. HMM will only
handle keeping track of device-to-device mappings and allowing such mappings
to be invalidated at any time so the memory can be migrated.

I do not intend to solve the IOMMU side of the problem, or even the PCI
hierarchy issue where you can't do peer-to-peer between devices across some
PCI bridges. I believe this is an orthogonal problem and that it is best
solved inside the DMA API, i.e. with solution A.


I do not think we should try to solve all the problems with one common
solution; they are too disparate in capabilities (what the hardware can and
can't do).

From my point of view there are a few takeaways:
  - a device should only access regular vmas
  - a device should never try to access a vma that points to another
device (mmap of any file in /dev)
  - peer-to-peer access through a dedicated userspace API must
involve a dedicated API between the kernel drivers taking part in
the peer-to-peer access
  - peer-to-peer on a regular vma must involve a common API for
drivers to interact so that no driver can block the other


So i think the DMA-API proposal is the one to pursue and others
problem relating to handling GPU memory and how to use it is a
different kind of problem. One with either an hardware solution
(CAPI, CCTX, ...) or a software solution (HMM so far).

I don't think we should conflict the 2 problems into one. Anyway
i think this should be something worth discussing face to face
with interested party to flesh out a solution (can be at LSF/MM
or in another forum).

Cheers,
Jérôme