Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Logan Gunthorpe


On 23/11/16 02:55 PM, Jason Gunthorpe wrote:
>>> Only ODP hardware allows changing the DMA address on the fly, and it
>>> works at the page table level. We do not need special handling for
>>> RDMA.
>>
>> I am aware of ODP but, as noted by others, it doesn't provide a general
>> solution to the points above.
> 
> How do you mean?

I was only saying it wasn't general in that it wouldn't work for IB
hardware that doesn't support ODP or for other hardware that doesn't do
similar things (like an NVMe drive).

It makes sense for hardware that supports ODP to allow MRs to not pin
the underlying memory and provide for migrations that the hardware can
follow. But most DMA engines will require the memory to be pinned and
any complex allocators (GPU or otherwise) should respect that. And that
seems like it should be the default way most of this works -- and I
think it wouldn't actually take too much effort to make it all work now
as is. (Our iopmem work is actually quite small and simple.)

>> It's also worth noting that #4 makes use of ZONE_DEVICE (#2) so they are
>> really the same option. iopmem is really just one way to get BAR
>> addresses to user-space while inside the kernel it's ZONE_DEVICE.
> 
> Seems fine for RDMA?

Yeah, we've had RDMA and O_DIRECT transfers to PCIe-backed ZONE_DEVICE
memory working for some time. I'd say it's a good fit. The main question
we've had is how to expose PCIe BARs to userspace to be used as MRs and
such.


Logan


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Sagalovitch, Serguei
On Wed, Nov 23, 2016 at 02:11:29PM -0700, Logan Gunthorpe wrote:

> Perhaps I am not following what Serguei is asking for, but I
> understood the desire was for a complex GPU allocator that could
> migrate pages between GPU and CPU memory under control of the GPU
> driver, among other things. The desire is for DMA to continue to work
> even after these migrations happen.

The main issue is how to solve use cases where p2p is
requested/initiated via CPU pointers that could point to a non-system
memory location, e.g. VRAM.

This would provide a consistent working model for the user, who deals
only with pointers (HSA, CUDA, OpenCL 2.0 SVM), as well as a
performance optimization that avoids double-buffering and extra
special-case code when dealing with PCIe device memory.

Examples are:

- RDMA network operations: RDMA MRs where the registered memory could
  be e.g. VRAM. Currently this is solved using the so-called PeerDirect
  interface, which is out-of-tree and provided as part of OFED. (See
  the sketch after this list.)
- File operations (fread/fwrite) when the user wants to transfer file
  data directly to/from e.g. VRAM.
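
For reference, the user-visible pattern PeerDirect enables is plain
verbs registration on a CPU pointer that happens to map device memory.
A minimal userspace sketch (gpu_alloc_mapped() is a hypothetical
allocator returning such a pointer; pd and len are assumed to exist):

  #include <infiniband/verbs.h>

  void *addr = gpu_alloc_mapped(len);  /* hypothetical: CPU ptr to VRAM */
  struct ibv_mr *mr = ibv_reg_mr(pd, addr, len,
                                 IBV_ACCESS_LOCAL_WRITE |
                                 IBV_ACCESS_REMOTE_READ);
  /* making this registration work when addr maps VRAM is exactly
   * the problem being discussed */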


Challenges are:

- Because the graphics sub-system must support overcommit (at least
  each application/process should independently see all resources),
  ideally such memory should be movable without changing the CPU
  pointer value, as well as able to be "paged out", supporting a "page
  fault" at least on access from the CPU.
- We must co-exist with the existing DRM infrastructure, as well as
  support sharing VRAM memory between different processes.
- We should be able to deal with large allocations: tens or hundreds
  of MBs, or maybe GBs.
- We may have PCIe devices where p2p does not work.
- Potentially any GPU memory should be supported, including memory
  carved out from system RAM (e.g. allocated via get_free_pages()).


Note:

- In the case of RDMA MRs, the life-span of the "pinning"
  (get_user_pages()/put_page()) may be defined/controlled by the
  application, not the kernel, which maybe should be treated
  differently as a special case.
  

The original proposal was to create "struct pages" for VRAM memory to
allow get_user_pages() to work transparently, similar to how it is/was
done for the "DAX device" case. Unfortunately, based on my
understanding, the "DAX device" implementation deals only with
permanently "locked" memory (fixed location), unrelated to the
get_user_pages()/put_page() scope, which doesn't satisfy the
requirements for "eviction"/"moving" of memory while keeping the CPU
address intact.

> The desire is for DMA to continue to work
> even after these migrations happen

At least some kind of mm notifier callback to inform about a change in
location (pre- and post-), similar to how it is done for system pages.
My understanding is that it will not solve the RDMA MR issue, where the
"lock" could last for the whole application lifetime, but (a) it will
not make the RDMA MR case worse and (b) it should be enough for all
other get_user_pages()/put_page() cases controlled by the kernel.
 
 


Re: [PATCH] x86: fix kaslr and memmap collision

2016-11-23 Thread Dave Chinner
On Tue, Nov 22, 2016 at 11:01:32AM -0800, Dan Williams wrote:
> On Tue, Nov 22, 2016 at 10:54 AM, Kees Cook  wrote:
> > On Tue, Nov 22, 2016 at 9:26 AM, Dan Williams wrote:
> >> No, you're right, we need to handle multiple ranges.  Since the
> >> mem_avoid array is statically allocated perhaps we can handle up to 4
> >> memmap= entries, but past that point disable kaslr for that boot?
> >
> > Yeah, that seems fine to me. I assume it's rare to have 4?
> >
> 
> It should be rare to have *one* since ACPI 6.0 added support for
> communicating persistent memory ranges.  However there are legacy
> nvdimm users that I know are doing at least 2, but I have hard time
> imagining they would ever do more than 4.

I doubt it's rare amongst the people using RAM to emulate pmem for
filesystem testing purposes. My "pmem" test VM always has at least 2
ranges set to give me two discrete pmem devices, and I have used 4
from time to time to do things like test multi-volume scratch XFS
filesystems in xfstests (i.e. data, log and realtime volumes) so I
didn't need to play games with partitioning or DM...
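
(For anyone reproducing such a setup: memmap=<size>!<start> marks RAM
as emulated pmem, and extra ranges are stacked by repeating the
parameter; the addresses below are only illustrative.)

  # two discrete emulated-pmem ranges
  memmap=2G!12G memmap=2G!14G

  # four ranges, e.g. for multi-volume xfstests setups
  memmap=2G!12G memmap=2G!14G memmap=2G!16G memmap=2G!18G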

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Jason Gunthorpe
On Wed, Nov 23, 2016 at 02:42:12PM -0800, Dan Williams wrote:
> > The crucial part for this discussion is the ability to fence and block
> > DMA for a specific range. This is the hardware capability that lets
> > page migration happen: fence DMA, migrate page, update page
> > table in HCA, unblock DMA.
> 
> Wait, ODP requires migratable pages, ZONE_DEVICE pages are not
> migratable.

Does it? I didn't think so. Does ZONE_DEVICE break MMU notifiers/etc
or something? There is certainly nothing about the hardware that cares
about ZONE_DEVICE vs system memory.

I used 'migration' in the broader sense of doing any transformation to
the page such that the DMA address changes - not the specific kernel
MM process...

> You can't replace a PCIe mapping with just any other System RAM
> physical address, right?

I thought that was exactly what HMM was trying to do? Migrate pages
between CPU and GPU memory as needed. As Serguei has said this process
needs to be driven by the GPU driver.

The peer-peer issue is how do you do that while RDMA is possible on
those pages, because when the page migrates to GPU memory you want the
RDMA to follow it seamlessly.

This is why page table mirroring is the best solution - use the
existing mm machinery to link the DMA driver and whatever is
controlling the VMA.

> At least not without a filesystem recording where things went, but
> at that point we're no longer talking about the base P2P-DMA mapping

In the filesystem/DAX case, it would be the filesystem that initiates
any change in the page physical address.

ODP *follows* changes in the VMA it does not cause any change in
address mapping. That has to be done by whoever is in charge of the
VMA.

> something like pnfs-rdma to a DAX filesystem.

Something in the kernel (ie nfs-rdma) would be entirely different. We
generally don't do long lived mappings in the kernel for RDMA
(certainly not for NFS), so it is much more like your basic every day
DMA operation: map, execute, unmap. We probably don't need to use page
table mirroring for this.
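
That is, the ordinary streaming-DMA pattern, roughly (a sketch; the
submit/wait helpers are hypothetical and device-specific):

  #include <linux/dma-mapping.h>

  dma_addr_t addr = dma_map_single(dev, buf, len, DMA_TO_DEVICE);

  if (dma_mapping_error(dev, addr))
          return -ENOMEM;
  issue_transfer(dev, addr, len);        /* hypothetical: kick the device */
  wait_for_transfer(dev);                /* hypothetical: completion wait */
  dma_unmap_single(dev, addr, len, DMA_TO_DEVICE);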

ODP comes in when userspace mmaps a DAX file and then tries to use it
for RDMA. Page table mirroring lets the DAX filesystem decide to move
the backing pages at any time. When it wants to do that it interacts
with the MM in the usual way which links to ODP and makes sure the
migration is seamless.

Jason


Re: [PATCH] ndctl: introduce 4k allocation support for creating namespace

2016-11-23 Thread Dan Williams
Some needed changes I noticed while trying to take this onto the
'pending' branch:

On Mon, Oct 24, 2016 at 4:21 PM, Dave Jiang  wrote:
> The existing implementation defaults to all pages being allocated as
> 2M superpages. For the nfit_test dax device we need 4k pages allocated
> to work properly. Add an --align,-a option to provide the alignment.
> It will accept 4k and 2M at the moment as proper parameters. No -a
> option still defaults to 2M.
>
> Signed-off-by: Dave Jiang 
> ---
>  ndctl/builtin-xaction-namespace.c |   22 --
>  util/size.h   |1 +
>  2 files changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/ndctl/builtin-xaction-namespace.c b/ndctl/builtin-xaction-namespace.c
> index 9b1702d..89ce6ce 100644
> --- a/ndctl/builtin-xaction-namespace.c
> +++ b/ndctl/builtin-xaction-namespace.c
> @@ -49,6 +49,7 @@ static struct parameters {
> const char *region;
> const char *reconfig;
> const char *sector_size;
> +   const char *align;
>  } param;
>
>  void builtin_xaction_namespace_reset(void)
> @@ -71,6 +72,7 @@ struct parsed_parameters {
> enum ndctl_namespace_mode mode;
> unsigned long long size;
> unsigned long sector_size;
> +   unsigned long align;
>  };
>
>  #define debug(fmt, ...) \
>  OPT_STRING('l', "sector-size", &param.sector_size, "lba-size", \
> "specify the logical sector size in bytes"), \
>  OPT_STRING('t', "type", &param.type, "type", \
> "specify the type of namespace to create 'pmem' or 'blk'"), \
> +OPT_STRING('a', "align", &param.align, "align", \
> +   "specify the namespace alignment in bytes (default: 0x200000 (2M))"), \
>  OPT_BOOLEAN('f', "force", &force, "reconfigure namespace even if currently active")
>
>  static const struct option base_options[] = {
> @@ -319,7 +323,7 @@ static int setup_namespace(struct ndctl_region *region,
>
> try(ndctl_pfn, set_uuid, pfn, uuid);
> try(ndctl_pfn, set_location, pfn, p->loc);
> -   try(ndctl_pfn, set_align, pfn, SZ_2M);
> +   try(ndctl_pfn, set_align, pfn, p->align);

This will now collide with the new "ndctl_pfn_has_align()" check that
got added to fix support for pre-4.5 kernels.

> try(ndctl_pfn, set_namespace, pfn, ndns);
> rc = ndctl_pfn_enable(pfn);
> } else if (p->mode == NDCTL_NS_MODE_DAX) {
> @@ -327,7 +331,7 @@ static int setup_namespace(struct ndctl_region *region,
>
> try(ndctl_dax, set_uuid, dax, uuid);
> try(ndctl_dax, set_location, dax, p->loc);
> -   try(ndctl_dax, set_align, dax, SZ_2M);
> +   try(ndctl_dax, set_align, dax, p->align);
> try(ndctl_dax, set_namespace, dax, ndns);
> rc = ndctl_dax_enable(dax);
> } else if (p->mode == NDCTL_NS_MODE_SAFE) {
> @@ -383,6 +387,20 @@ static int validate_namespace_options(struct ndctl_region *region,
>
> memset(p, 0, sizeof(*p));
>
> +   if (param.align) {
> +   p->align = parse_size64(param.align);
> +   switch (p->align) {
> +   case SZ_4K:
> +   case SZ_2M:
> +   break;
> +   case SZ_1G: /* unsupported yet... */
> +   default:
> +   debug("%s: invalid align\n", __func__);
> +   return -EINVAL;
> +   }
> +   } else
> +   p->align = SZ_2M;
> +

I think this check should come after we have determined that the mode
is either "memory" or "dax", and error out otherwise.  Also, when the
alignment is not 2M, we should check that the kernel has alignment
setting support with the new ndctl_pfn_has_align() api.  Note that
kernels that support device-dax implicitly support the alignment
property.
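
Something along these lines, perhaps (an untested sketch of the
suggested ordering; where the pfn handle for ndctl_pfn_has_align()
comes from is glossed over here):

  if (param.align) {
          if (p->mode != NDCTL_NS_MODE_MEMORY
                          && p->mode != NDCTL_NS_MODE_DAX) {
                  debug("%s: --align only valid for memory/dax mode\n",
                                  __func__);
                  return -EINVAL;
          }
          p->align = parse_size64(param.align);
          if (p->align != SZ_4K && p->align != SZ_2M) {
                  debug("%s: invalid align\n", __func__);
                  return -EINVAL;
          }
          /* pre-4.5 kernels only do the implicit 2M alignment */
          if (p->align != SZ_2M && !ndctl_pfn_has_align(pfn))
                  return -EOPNOTSUPP;
  } else {
          p->align = SZ_2M;
  }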


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Jason Gunthorpe
On Wed, Nov 23, 2016 at 02:11:29PM -0700, Logan Gunthorpe wrote:
> > As I said, there is no possible special handling. Standard IB hardware
> > does not support changing the DMA address once a MR is created. Forget
> > about doing that.
> 
> Yeah, that's essentially the point I was trying to make. Not to mention
> all the other unrelated hardware that can't DMA to an address that might
> disappear mid-transfer.

Right, it is impossible to ask for generic page migration with ongoing
DMA. That is simply not supported by any of the hardware at all.

> > Only ODP hardware allows changing the DMA address on the fly, and it
> > works at the page table level. We do not need special handling for
> > RDMA.
> 
> I am aware of ODP but, as noted by others, it doesn't provide a general
> solution to the points above.

How do you mean?

Perhaps I am not following what Serguei is asking for, but I
understood the desire was for a complex GPU allocator that could
migrate pages between GPU and CPU memory under control of the GPU
driver, among other things. The desire is for DMA to continue to work
even after these migrations happen.

Page table mirroring *is* the general solution for this problem. The
GPU driver controls the VMA and the DMA driver mirrors that VMA.

Do you know of another option that doesn't just degenerate to page
table mirroring??

Remember, there are two facets to the RDMA ODP implementation; I feel
there is some confusion here.

The crucial part for this discussion is the ability to fence and block
DMA for a specific range. This is the hardware capability that lets
page migration happen: fence DMA, migrate page, update page
table in HCA, unblock DMA.
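
In pseudo-code, with entirely hypothetical helper names:

  /* migrating a page that is covered by an ODP MR */
  hca_fence_dma(mr, addr);              /* block new DMA, drain in-flight */
  new_page = migrate_page_contents(old_page);
  hca_update_page_table(mr, addr, page_to_dma(new_page));
  hca_unblock_dma(mr, addr);            /* DMA resumes at the new address */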

Without that hardware support the DMA address must be unchanging, and
there is nothing we can do about it. This is why standard IB hardware
must have fixed MRs - it lacks the fence capability.

The other part is the page faulting implementation, but that is not
required and, to Serguei's point, is not desired for GPUs anyhow.

> > To me this means at least items #1 and #3 should be removed from
> > Alexander's list.
> 
> It's also worth noting that #4 makes use of ZONE_DEVICE (#2) so they are
> really the same option. iopmem is really just one way to get BAR
> addresses to user-space while inside the kernel it's ZONE_DEVICE.

Seems fine for RDMA?

Didn't we just strike off everything on the list except #2? :\

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Logan Gunthorpe


On 23/11/16 01:33 PM, Jason Gunthorpe wrote:
> On Wed, Nov 23, 2016 at 02:58:38PM -0500, Serguei Sagalovitch wrote:
> 
>> We do not want to have "highly" dynamic translation due to the
>> performance cost.  We need to support "overcommit" but would like to
>> minimize the impact.  To support RDMA MRs for GPU/VRAM/PCIe device
>> memory (which is a must) we need either to globally force pinning for
>> the scope of get_user_pages()/put_page() or to have special handling
>> for RDMA MRs and similar cases.
> 
> As I said, there is no possible special handling. Standard IB hardware
> does not support changing the DMA address once a MR is created. Forget
> about doing that.

Yeah, that's essentially the point I was trying to make. Not to mention
all the other unrelated hardware that can't DMA to an address that might
disappear mid-transfer.

> Only ODP hardware allows changing the DMA address on the fly, and it
> works at the page table level. We do not need special handling for
> RDMA.

I am aware of ODP but, as noted by others, it doesn't provide a general
solution to the points above.

> Like I said, this is the direction the industry seems to be moving in,
> so any solution here should focus on VMAs/page tables as the way to link
> the peer-peer devices.

Yes, this was the appeal to us of using ZONE_DEVICE.

> To me this means at least items #1 and #3 should be removed from
> Alexander's list.

It's also worth noting that #4 makes use of ZONE_DEVICE (#2) so they are
really the same option. iopmem is really just one way to get BAR
addresses to user-space while inside the kernel it's ZONE_DEVICE.
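
In kernel terms that is roughly the following (a simplified sketch, not
the actual iopmem code; a real user also wires up a percpu_ref/altmap
for teardown rather than passing NULL):

  /* give a PCI BAR struct pages via ZONE_DEVICE */
  struct resource *res = &pdev->resource[bar];
  void *base = devm_memremap_pages(&pdev->dev, res, NULL, NULL);

  if (IS_ERR(base))
          return PTR_ERR(base);
  /* pfn_to_page()/O_DIRECT/RDMA paths can now see these pages */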

Logan


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Jason Gunthorpe
On Wed, Nov 23, 2016 at 02:58:38PM -0500, Serguei Sagalovitch wrote:

> We do not want to have "highly" dynamic translation due to the
> performance cost.  We need to support "overcommit" but would like to
> minimize the impact.  To support RDMA MRs for GPU/VRAM/PCIe device
> memory (which is a must) we need either to globally force pinning for
> the scope of get_user_pages()/put_page() or to have special handling
> for RDMA MRs and similar cases.

As I said, there is no possible special handling. Standard IB hardware
does not support changing the DMA address once a MR is created. Forget
about doing that.

Only ODP hardware allows changing the DMA address on the fly, and it
works at the page table level. We do not need special handling for
RDMA.

> Generally it could be difficult to correctly handle "DMA in
> progress" due to the facts that (a) DMA could originate from
> numerous PCIe devices simultaneously, including requests to
> receive network data.

We handle all of this today in the kernel via the page pinning
mechanism. This needs to be copied into the peer-peer memory and GPU
memory schemes as well. A pinned page means the DMA address cannot be
changed and there is active non-CPU access to it.

Any hardware that does not support page table mirroring must go this
route.
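
I.e. the classic pattern (a sketch; real code also handles partial
pins and maps an sglist for the device):

  struct page *pages[NPAGES];
  int i, n;

  n = get_user_pages_fast(uaddr, NPAGES, 1 /* write */, pages);
  if (n <= 0)
          return n ? n : -EFAULT;
  /* ... dma_map_sg() and run the DMA; pages cannot move while pinned ... */
  for (i = 0; i < n; i++)
          put_page(pages[i]);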

> (b) in the HSA case DMA could originate from user space without
> kernel driver knowledge.  So without corresponding h/w support
> everywhere I do not see how it could be solved effectively.

All true user-triggered DMA must go through some kind of coherent page
table mirroring scheme (eg this is what CAPI does; presumably AMD's HSA
is similar). A page table mirroring scheme is basically the same as
what ODP does.

Like I said, this is the direction the industry seems to be moving in,
so any solution here should focus on VMAs/page tables as the way to link
the peer-peer devices.

To me this means at least items #1 and #3 should be removed from
Alexander's list.

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Serguei Sagalovitch



On 2016-11-23 02:32 PM, Jason Gunthorpe wrote:
> On Wed, Nov 23, 2016 at 02:14:40PM -0500, Serguei Sagalovitch wrote:
>> On 2016-11-23 02:05 PM, Jason Gunthorpe wrote:
>>> As Bart says, it would be best to be combined with something like
>>> Mellanox's ODP MRs, which allows a page to be evicted and then trigger
>>> a CPU interrupt if a DMA is attempted so it can be brought back.
>>
>> Please note that in the general case (including the MR one) we could
>> have a "page fault" from a different PCIe device, so all PCIe devices
>> must be synchronized.
>
> Standard RDMA MRs require pinned pages, the DMA address cannot change
> while the MR exists (there is no hardware support for this at all), so
> page faulting from any other device is out of the question while they
> exist. This is the same requirement as typical simple driver DMA which
> requires pages pinned until the simple device completes DMA.
>
> ODP RDMA MRs do not require that, they just page fault like the CPU or
> really anything and the kernel has to make sense of concurrent page
> faults from multiple sources.
>
> The upshot is that GPU scenarios that rely on highly dynamic
> virtual->physical translation cannot sanely be combined with standard
> long-life RDMA MRs.

We do not want to have "highly" dynamic translation due to the
performance cost. We need to support "overcommit" but would like to
minimize the impact.

To support RDMA MRs for GPU/VRAM/PCIe device memory (which is a must)
we need either to globally force pinning for the scope of
get_user_pages()/put_page() or to have special handling for RDMA MRs
and similar cases. Generally it could be difficult to correctly handle
"DMA in progress" due to the facts that (a) DMA could originate from
numerous PCIe devices simultaneously, including requests to receive
network data, and (b) in the HSA case DMA could originate from user
space without kernel driver knowledge. So without corresponding h/w
support everywhere I do not see how it could be solved effectively.

> Certainly, any solution for GPUs must follow the typical page pinning
> semantics, changing the DMA address of a page must be blocked while
> any DMA is in progress.
>
> Does HMM solve the peer-peer problem? Does it do it generically or
> only for drivers that are mirroring translation tables?

In its current form HMM doesn't solve the peer-peer problem. Currently
it allows "mirroring" of "malloc" memory on the GPU, which is not
always what is needed. Additionally there is a need to be able to share
VRAM allocations between different processes.

> Humm, so it can be removed from Alexander's list then :\

HMM is very useful for some types of scenarios, and it could
significantly simplify (for performance) the implementation of some
features, e.g. OpenCL 2.0 SVM.

> As Dan suggested, maybe we need to do both. Some kind of fix for
> get_user_pages() for smaller mappings (eg ZONE_DEVICE) and a mandatory
> API conversion to get_user_dma_sg() for other cases?
>
> Jason

Sincerely yours,
Serguei Sagalovitch



Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Jason Gunthorpe
On Wed, Nov 23, 2016 at 02:14:40PM -0500, Serguei Sagalovitch wrote:
> 
> On 2016-11-23 02:05 PM, Jason Gunthorpe wrote:

> >As Bart says, it would be best to be combined with something like
> >Mellanox's ODP MRs, which allows a page to be evicted and then trigger
> >a CPU interrupt if a DMA is attempted so it can be brought back.

> Please note that in the general case (including  MR one) we could have
> "page fault" from the different PCIe device. So all  PCIe device must
> be synchronized.

Standard RDMA MRs require pinned pages, the DMA address cannot change
while the MR exists (there is no hardware support for this at all), so
page faulting from any other device is out of the question while they
exist. This is the same requirement as typical simple driver DMA which
requires pages pinned until the simple device completes DMA.

ODP RDMA MRs do not require that, they just page fault like the CPU or
really anything and the kernel has to make sense of concurrent page
faults from multiple sources.

The upshot is that GPU scenarios that rely on highly dynamic
virtual->physical translation cannot sanely be combined with standard
long-life RDMA MRs.

Certainly, any solution for GPUs must follow the typical page pinning
semantics, changing the DMA address of a page must be blocked while
any DMA is in progress.

> >Does HMM solve the peer-peer problem? Does it do it generically or
> >only for drivers that are mirroring translation tables?

> In its current form HMM doesn't solve the peer-peer problem. Currently
> it allows "mirroring" of "malloc" memory on the GPU, which is not
> always what is needed. Additionally there is a need to be able to share
> VRAM allocations between different processes.

Humm, so it can be removed from Alexander's list then :\

As Dan suggested, maybe we need to do both. Some kind of fix for
get_user_pages() for smaller mappings (eg ZONE_DEVICE) and a mandatory
API conversion to get_user_dma_sg() for other cases?

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Serguei Sagalovitch


On 2016-11-23 03:51 AM, Christian König wrote:
> Am 23.11.2016 um 08:49 schrieb Daniel Vetter:
>> On Tue, Nov 22, 2016 at 01:21:03PM -0800, Dan Williams wrote:
>>> On Tue, Nov 22, 2016 at 1:03 PM, Daniel Vetter wrote:
>>>> On Tue, Nov 22, 2016 at 9:35 PM, Serguei Sagalovitch wrote:
>>>>> On 2016-11-22 03:10 PM, Daniel Vetter wrote:
>>>>>> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams wrote:
>>>>>>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch wrote:
>>>>>>>> I personally like the "device-DAX" idea but my concerns are:
>>>>>>>>
>>>>>>>> -  How well will it co-exist with the DRM infrastructure /
>>>>>>>>    implementations, in the part dealing with CPU pointers?
>>>>>>>
>>>>>>> Inside the kernel a device-DAX range is "just memory" in the sense
>>>>>>> that you can perform pfn_to_page() on it and issue I/O, but the
>>>>>>> vma is not migratable. To be honest I do not know how well that
>>>>>>> co-exists with drm infrastructure.
>>>>>>>
>>>>>>>> -  How well will we be able to handle the case when we need to
>>>>>>>>    "move"/"evict" memory/data to a new location, so the CPU
>>>>>>>>    pointer should point to the new physical location/address
>>>>>>>>    (which may not be in PCI device memory at all)?
>>>>>>>
>>>>>>> So, device-DAX deliberately avoids support for in-kernel migration
>>>>>>> or overcommit. Those cases are left to the core mm or drm. The
>>>>>>> device-dax interface is for cases where all that is needed is a
>>>>>>> direct-mapping to a statically-allocated physical-address range,
>>>>>>> be it persistent memory or some other special reserved memory
>>>>>>> range.
>>>>>>
>>>>>> For some of the fancy use-cases (e.g. to be comparable to what HMM
>>>>>> can pull off) I think we want all the magic in core mm, i.e.
>>>>>> migration and overcommit. At least that seems to be the very strong
>>>>>> drive in all general-purpose gpu abstractions and implementations,
>>>>>> where memory is allocated with malloc, and then mapped/moved into
>>>>>> vram/gpu address space through some magic,
>>>>>
>>>>> It is possible that it is the other way around: memory is requested
>>>>> to be allocated and should be kept in vram for performance reasons,
>>>>> but due to a possible overcommit case we need at least temporarily
>>>>> to "move" such an allocation to system memory.
>>>>
>>>> With migration I meant migrating both ways of course. And with stuff
>>>> like numactl we can also influence where exactly the malloc'ed memory
>>>> is allocated originally, at least if we'd expose the vram range as a
>>>> very special numa node that happens to be far away and not hold any
>>>> cpu cores.
>>>
>>> I don't think we should be using numa distance to reverse engineer a
>>> certain allocation behavior.  The latency data should be truthful, but
>>> you're right we'll need a mechanism to keep general purpose
>>> allocations out of that range by default. Btw, strict isolation is
>>> another design point of device-dax, but I think in this case we're
>>> describing something between the two extremes of full isolation and
>>> full compatibility with existing numactl apis.
>>
>> Yes, agreed. My idea with exposing vram sections using numa nodes
>> wasn't to reuse all the existing allocation policies directly, those
>> won't work. So at boot-up your default numa policy would exclude any
>> vram nodes.
>>
>> But I think (as an -mm layman) that numa gives us a lot of the tools
>> and policy interface that we need to implement what we want for gpus.
>
> Agree completely. From a ten mile high view our GPUs are just command
> processors with local memory as well.
>
> Basically this is also the whole idea of what AMD is pushing with HSA
> for a while.
>
> It's just that a lot of problems start to pop up when you look at all
> the nasty details. For example only part of the GPU memory is usually
> accessible by the CPU.
>
> So even when numa nodes expose a good foundation for this I think
> there is still a lot of code to write.
>
> BTW: I should probably start to read into the numa code of the kernel.
> Any good pointers for that?

I would assume that the "page" allocation logic itself should be inside
the graphics driver, due to possibly different requirements especially
from graphics: alignment, etc.

> Regards,
> Christian.
>
>> Wrt isolation: There's a sliding scale of what different users expect,
>> from full auto everything, including migrating pages around if needed
>> to full isolation all seems to be on the table. As long as we keep
>> vram nodes out of any default allocation numasets, full isolation
>> should be possible.
>>
>> -Daniel

Sincerely yours,
Serguei Sagalovitch



Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Serguei Sagalovitch


On 2016-11-23 02:05 PM, Jason Gunthorpe wrote:
> On Wed, Nov 23, 2016 at 10:13:03AM -0700, Logan Gunthorpe wrote:
>
>> an MR would be very tricky. The MR may be relied upon by another host
>> and the kernel would have to inform user-space the MR was invalid then
>> user-space would have to tell the remote application.
>
> As Bart says, it would be best to be combined with something like
> Mellanox's ODP MRs, which allows a page to be evicted and then trigger
> a CPU interrupt if a DMA is attempted so it can be brought back.

Please note that in the general case (including the MR one) we could
have a "page fault" from a different PCIe device, so all PCIe devices
must be synchronized.

> This includes the usual fencing mechanism so the CPU can block, flush,
> and then evict a page coherently.
>
> This is the general direction the industry is going in: Link PCI DMA
> directly to dynamic user page tables, including support for demand
> faulting and synchronicity.
>
> Mellanox ODP is a rough implementation of mirroring a process's page
> table via the kernel, while IBM's CAPI (and CCIX, PCI ATS?) is
> probably a good example of where this is ultimately headed.
>
> CAPI allows a PCI DMA to directly target an ASID associated with a
> user process and then use the usual CPU machinery to do the page
> translation for the DMA. This includes page faults for evicted pages,
> and obviously allows eviction and migration.
>
> So, of all the solutions in the original list, I would discard
> anything that isn't VMA focused. Emulating what CAPI does in hardware
> with software is probably the best choice, or we have to do it all
> again when CAPI style hardware broadly rolls out :(
>
> DAX and GPU allocators should create VMAs and manipulate them in the
> usual way to achieve migration, windowing, cache, movement or
> swap of the potentially peer-peer memory pages. They would have to
> respect the usual rules for a VMA, including pinning.
>
> DMA drivers would use the usual approaches for dealing with DMA from
> a VMA: short term pin or long term coherent translation mirror.
>
> So, to my view (looking from RDMA), the main problem with peer-peer is
> how do you DMA translate VMAs that point at non struct page memory?
>
> Does HMM solve the peer-peer problem? Does it do it generically or
> only for drivers that are mirroring translation tables?

In its current form HMM doesn't solve the peer-peer problem. Currently
it allows "mirroring" of "malloc" memory on the GPU, which is not
always what is needed. Additionally there is a need to be able to share
VRAM allocations between different processes.

> From a RDMA perspective we could use something other than
> get_user_pages() to pin and DMA translate a VMA if the core community
> could decide on an API. eg get_user_dma_sg() would probably be quite
> usable.
>
> Jason



Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Jason Gunthorpe
On Wed, Nov 23, 2016 at 10:40:47AM -0800, Dan Williams wrote:

> I don't think that was designed for the case where the backing memory
> is a special/static physical address range rather than anonymous
> "System RAM", right?

The hardware doesn't care where the memory is. ODP is just a generic
mechanism to provide demand-fault behavior for a mirrored page table.

ODP has the same issue as everything else, it needs to translate a
page table entry into a DMA address, and we have no API to do that
when the page table points to peer-peer memory.

Jason


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Jason Gunthorpe
On Wed, Nov 23, 2016 at 10:13:03AM -0700, Logan Gunthorpe wrote:

> an MR would be very tricky. The MR may be relied upon by another host
> and the kernel would have to inform user-space the MR was invalid then
> user-space would have to tell the remote application.

As Bart says, it would be best to be combined with something like
Mellanox's ODP MRs, which allows a page to be evicted and then trigger
a CPU interrupt if a DMA is attempted so it can be brought back. This
includes the usual fencing mechanism so the CPU can block, flush, and
then evict a page coherently.

This is the general direction the industry is going in: Link PCI DMA
directly to dynamic user page tables, including support for demand
faulting and synchronicity.

Mellanox ODP is a rough implementation of mirroring a process's page
table via the kernel, while IBM's CAPI (and CCIX, PCI ATS?) is
probably a good example of where this is ultimately headed.

CAPI allows a PCI DMA to directly target an ASID associated with a
user process and then use the usual CPU machinery to do the page
translation for the DMA. This includes page faults for evicted pages,
and obviously allows eviction and migration.

So, of all the solutions in the original list, I would discard
anything that isn't VMA focused. Emulating what CAPI does in hardware
with software is probably the best choice, or we have to do it all
again when CAPI style hardware broadly rolls out :(

DAX and GPU allocators should create VMAs and manipulate them in the
usual way to achieve migration, windowing, cache, movement or
swap of the potentially peer-peer memory pages. They would have to
respect the usual rules for a VMA, including pinning.

DMA drivers would use the usual approaches for dealing with DMA from
a VMA: short term pin or long term coherent translation mirror.

So, to my view (looking from RDMA), the main problem with peer-peer is
how do you DMA translate VMAs that point at non struct page memory?

Does HMM solve the peer-peer problem? Does it do it generically or
only for drivers that are mirroring translation tables?

From a RDMA perspective we could use something other than
get_user_pages() to pin and DMA translate a VMA if the core community
could decide on an API. eg get_user_dma_sg() would probably be quite
usable.
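
To be clear, get_user_dma_sg() does not exist today; the strawman is
something with roughly this shape:

  /* strawman only -- this API does not exist */
  struct sg_table *get_user_dma_sg(struct device *dev,
                                   unsigned long start,
                                   unsigned long nr_pages,
                                   enum dma_data_direction dir);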

Jason


[PATCH 3/6] dax: add tracepoint infrastructure, PMD tracing

2016-11-23 Thread Ross Zwisler
Tracepoints are the standard way to capture debugging and tracing
information in many parts of the kernel, including the XFS and ext4
filesystems.  Create a tracepoint header for FS DAX and add the first DAX
tracepoints to the PMD fault handler.  This allows the tracing for DAX to
be done in the same way as the filesystem tracing so that developers can
look at them together and get a coherent idea of what the system is doing.

I added both an entry and exit tracepoint because future patches will add
tracepoints to child functions of dax_iomap_pmd_fault() like
dax_pmd_load_hole() and dax_pmd_insert_mapping(). We want those messages to
be wrapped by the parent function tracepoints so the code flow is more
easily understood.  Having entry and exit tracepoints for faults also
allows us to easily see what filesystems functions were called during the
fault.  These filesystem functions get executed via iomap_begin() and
iomap_end() calls, for example, and will have their own tracepoints.

For PMD faults we primarily want to understand the faulting address and
whether it fell back to 4k faults.  If it fell back to 4k faults the
tracepoints should let us understand why.

I named the new tracepoint header file "fs_dax.h" to allow for device DAX
to have its own separate tracing header in the same directory at some
point.

Here is an example output for these events from a successful PMD fault:

big-2057  [000]    136.396855: dax_pmd_fault: shared mapping write
address 0x10505000 vm_start 0x1020 vm_end 0x1070 pgoff 0x200
max_pgoff 0x1400

big-2057  [000]    136.397943: dax_pmd_fault_done: shared mapping write
address 0x10505000 vm_start 0x1020 vm_end 0x1070 pgoff 0x200
max_pgoff 0x1400 NOPAGE
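
For reference, output like the above can be captured with the standard
ftrace interface once the patch is applied, e.g.:

  # echo 1 > /sys/kernel/debug/tracing/events/fs_dax/dax_pmd_fault/enable
  # echo 1 > /sys/kernel/debug/tracing/events/fs_dax/dax_pmd_fault_done/enable
  # cat /sys/kernel/debug/tracing/trace_pipe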

Signed-off-by: Ross Zwisler 
Suggested-by: Dave Chinner 
---
 fs/dax.c  | 29 +---
 include/linux/mm.h| 14 ++
 include/trace/events/fs_dax.h | 61 +++
 3 files changed, 94 insertions(+), 10 deletions(-)
 create mode 100644 include/trace/events/fs_dax.h

diff --git a/fs/dax.c b/fs/dax.c
index cc8a069..1aa7616 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -35,6 +35,9 @@
 #include 
 #include "internal.h"
 
+#define CREATE_TRACE_POINTS
+#include 
+
 /* We choose 4096 entries - same as per-zone page wait tables */
 #define DAX_WAIT_TABLE_BITS 12
 #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
@@ -1310,6 +1313,16 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
loff_t pos;
int error;
 
+   /*
+* Check whether offset isn't beyond end of file now. Caller is
+* supposed to hold locks serializing us with truncate / punch hole so
+* this is a reliable test.
+*/
+   pgoff = linear_page_index(vma, pmd_addr);
+   max_pgoff = (i_size_read(inode) - 1) >> PAGE_SHIFT;
+
+   trace_dax_pmd_fault(vma, address, flags, pgoff, max_pgoff, 0);
+
/* Fall back to PTEs if we're going to COW */
if (write && !(vma->vm_flags & VM_SHARED))
goto fallback;
@@ -1320,16 +1333,10 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
if ((pmd_addr + PMD_SIZE) > vma->vm_end)
goto fallback;
 
-   /*
-* Check whether offset isn't beyond end of file now. Caller is
-* supposed to hold locks serializing us with truncate / punch hole so
-* this is a reliable test.
-*/
-   pgoff = linear_page_index(vma, pmd_addr);
-   max_pgoff = (i_size_read(inode) - 1) >> PAGE_SHIFT;
-
-   if (pgoff > max_pgoff)
-   return VM_FAULT_SIGBUS;
+   if (pgoff > max_pgoff) {
+   result = VM_FAULT_SIGBUS;
+   goto out;
+   }
 
/* If the PMD would extend beyond the file size */
if ((pgoff | PG_PMD_COLOUR) > max_pgoff)
@@ -1400,6 +1407,8 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
split_huge_pmd(vma, pmd, address);
count_vm_event(THP_FAULT_FALLBACK);
}
+out:
+   trace_dax_pmd_fault_done(vma, address, flags, pgoff, max_pgoff, result);
return result;
 }
 EXPORT_SYMBOL_GPL(dax_iomap_pmd_fault);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a5f52c0..e373917 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1107,6 +1107,20 @@ static inline void clear_page_pfmemalloc(struct page *page)
 VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE | \
 VM_FAULT_FALLBACK)
 
+#define VM_FAULT_RESULT_TRACE \
+   { VM_FAULT_OOM, "OOM" }, \
+   { VM_FAULT_SIGBUS,  "SIGBUS" }, \
+   { VM_FAULT_MAJOR,   "MAJOR" }, \
+   { VM_FAULT_WRITE,   "WRITE" }, \
+   { VM_FAULT_HWPOISON,"HWPOISON" }, \
+   { VM_FAULT_HWPOISON_LARGE,  

[PATCH 6/6] dax: add tracepoints to dax_pmd_insert_mapping()

2016-11-23 Thread Ross Zwisler
Add tracepoints to dax_pmd_insert_mapping(), following the same logging
conventions as the tracepoints in dax_iomap_pmd_fault().

Here is an example PMD fault showing the new tracepoints:

big-1544  [006] 48.153479: dax_pmd_fault: shared mapping write
address 0x10505000 vm_start 0x1020 vm_end 0x1070 pgoff 0x200
max_pgoff 0x1400

big-1544  [006] 48.155230: dax_pmd_insert_mapping: shared mapping
write address 0x10505000 length 0x20 pfn 0x100600 DEV|MAP radix_entry
0xc000e

big-1544  [006] 48.155266: dax_pmd_fault_done: shared mapping write
address 0x10505000 vm_start 0x1020 vm_end 0x1070 pgoff 0x200
max_pgoff 0x1400 NOPAGE

Signed-off-by: Ross Zwisler 
---
 fs/dax.c  | 10 +++---
 include/linux/pfn_t.h |  6 ++
 include/trace/events/fs_dax.h | 42 ++
 3 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 2824414..d6ba4a3 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1236,10 +1236,10 @@ static int dax_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
.size = PMD_SIZE,
};
long length = dax_map_atomic(bdev, &dax);
-   void *ret;
+   void *ret = NULL;
 
if (length < 0) /* dax_map_atomic() failed */
-   return VM_FAULT_FALLBACK;
+   goto fallback;
if (length < PMD_SIZE)
goto unmap_fallback;
if (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)
@@ -1252,13 +1252,17 @@ static int dax_pmd_insert_mapping(struct vm_area_struct *vma, pmd_t *pmd,
ret = dax_insert_mapping_entry(mapping, vmf, *entryp, dax.sector,
RADIX_DAX_PMD);
if (IS_ERR(ret))
-   return VM_FAULT_FALLBACK;
+   goto fallback;
*entryp = ret;
 
+   trace_dax_pmd_insert_mapping(vma, address, write, length, dax.pfn, ret);
return vmf_insert_pfn_pmd(vma, address, pmd, dax.pfn, write);
 
 unmap_fallback:
dax_unmap_atomic(bdev, &dax);
+fallback:
+   trace_dax_pmd_insert_mapping_fallback(vma, address, write, length,
+   dax.pfn, ret);
return VM_FAULT_FALLBACK;
 }
 
diff --git a/include/linux/pfn_t.h b/include/linux/pfn_t.h
index a3d90b9..033fc7b 100644
--- a/include/linux/pfn_t.h
+++ b/include/linux/pfn_t.h
@@ -15,6 +15,12 @@
 #define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3))
 #define PFN_MAP (1ULL << (BITS_PER_LONG_LONG - 4))
 
+#define PFN_FLAGS_TRACE \
+   { PFN_SG_CHAIN, "SG_CHAIN" }, \
+   { PFN_SG_LAST,  "SG_LAST" }, \
+   { PFN_DEV,  "DEV" }, \
+   { PFN_MAP,  "MAP" }
+
 static inline pfn_t __pfn_to_pfn_t(unsigned long pfn, u64 flags)
 {
pfn_t pfn_t = { .val = pfn | (flags & PFN_FLAGS_MASK), };
diff --git a/include/trace/events/fs_dax.h b/include/trace/events/fs_dax.h
index 8814b1a..a03f820 100644
--- a/include/trace/events/fs_dax.h
+++ b/include/trace/events/fs_dax.h
@@ -87,6 +87,48 @@ DEFINE_EVENT(dax_pmd_load_hole_class, name, \
 DEFINE_PMD_LOAD_HOLE_EVENT(dax_pmd_load_hole);
 DEFINE_PMD_LOAD_HOLE_EVENT(dax_pmd_load_hole_fallback);
 
+DECLARE_EVENT_CLASS(dax_pmd_insert_mapping_class,
+   TP_PROTO(struct vm_area_struct *vma, unsigned long address, int write,
+   long length, pfn_t pfn, void *radix_entry),
+   TP_ARGS(vma, address, write, length, pfn, radix_entry),
+   TP_STRUCT__entry(
+   __field(unsigned long, vm_flags)
+   __field(unsigned long, address)
+   __field(int, write)
+   __field(long, length)
+   __field(u64, pfn_val)
+   __field(void *, radix_entry)
+   ),
+   TP_fast_assign(
+   __entry->vm_flags = vma->vm_flags;
+   __entry->address = address;
+   __entry->write = write;
+   __entry->length = length;
+   __entry->pfn_val = pfn.val;
+   __entry->radix_entry = radix_entry;
+   ),
+   TP_printk("%s mapping %s address %#lx length %#lx pfn %#llx %s"
+   " radix_entry %#lx",
+   __entry->vm_flags & VM_SHARED ? "shared" : "private",
+   __entry->write ? "write" : "read",
+   __entry->address,
+   __entry->length,
+   __entry->pfn_val & ~PFN_FLAGS_MASK,
+   __print_flags(__entry->pfn_val & PFN_FLAGS_MASK, "|",
+   PFN_FLAGS_TRACE),
+   (unsigned long)__entry->radix_entry
+   )
+)
+
+#define DEFINE_PMD_INSERT_MAPPING_EVENT(name) \
+DEFINE_EVENT(dax_pmd_insert_mapping_class, name, \
+   TP_PROTO(struct vm_area_struct *vma, unsigned long address, \
+   int write, long length, pfn_t pfn, void *radix_entry), \
+   TP_ARGS(vma, address, write, length, pfn, radix_entry))
+
+DEFINE_PMD_INSERT_MAPPING_EVENT(dax_pmd_insert_mapping);
+DEFINE_PMD_INSERT_MAPPING_EVENT(dax_pmd_insert_mapping_fallback);
+
 #endif /* 

[PATCH 4/6] dax: update MAINTAINERS entries for FS DAX

2016-11-23 Thread Ross Zwisler
Add the new include/trace/events/fs_dax.h tracepoint header, update
Matthew's email address and add myself as a maintainer for filesystem DAX.

Signed-off-by: Ross Zwisler 
Suggested-by: Matthew Wilcox 
---
 MAINTAINERS | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 2a58eea..8fef4bf 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3855,10 +3855,12 @@ S:  Maintained
 F: drivers/i2c/busses/i2c-diolan-u2c.c
 
 DIRECT ACCESS (DAX)
-M: Matthew Wilcox 
+M: Matthew Wilcox 
+M: Ross Zwisler 
 L: linux-fsde...@vger.kernel.org
 S: Supported
 F: fs/dax.c
+F: include/trace/events/fs_dax.h
 
 DIRECTORY NOTIFICATION (DNOTIFY)
 M: Eric Paris 
-- 
2.7.4



[PATCH 5/6] dax: add tracepoints to dax_pmd_load_hole()

2016-11-23 Thread Ross Zwisler
Add tracepoints to dax_pmd_load_hole(), following the same logging
conventions as the tracepoints in dax_iomap_pmd_fault().

Here is an example PMD fault showing the new tracepoints:

read_big-1393  [007] 32.133809: dax_pmd_fault: shared mapping read
address 0x1040 vm_start 0x1020 vm_end 0x1060 pgoff 0x200
max_pgoff 0x1400

read_big-1393  [007] 32.134067: dax_pmd_load_hole: shared mapping
read address 0x1040 zero_page ea0002b98000 radix_entry 0x1e

read_big-1393  [007] 32.134069: dax_pmd_fault_done: shared mapping
read address 0x1040 vm_start 0x1020 vm_end 0x1060 pgoff 0x200
max_pgoff 0x1400 NOPAGE

Signed-off-by: Ross Zwisler 
---
 fs/dax.c  | 13 +
 include/trace/events/fs_dax.h | 32 
 2 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 1aa7616..2824414 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1269,32 +1269,37 @@ static int dax_pmd_load_hole(struct vm_area_struct *vma, pmd_t *pmd,
struct address_space *mapping = vma->vm_file->f_mapping;
unsigned long pmd_addr = address & PMD_MASK;
struct page *zero_page;
+   void *ret = NULL;
spinlock_t *ptl;
pmd_t pmd_entry;
-   void *ret;
 
zero_page = mm_get_huge_zero_page(vma->vm_mm);
 
if (unlikely(!zero_page))
-   return VM_FAULT_FALLBACK;
+   goto fallback;
 
ret = dax_insert_mapping_entry(mapping, vmf, *entryp, 0,
RADIX_DAX_PMD | RADIX_DAX_HZP);
if (IS_ERR(ret))
-   return VM_FAULT_FALLBACK;
+   goto fallback;
*entryp = ret;
 
ptl = pmd_lock(vma->vm_mm, pmd);
if (!pmd_none(*pmd)) {
spin_unlock(ptl);
-   return VM_FAULT_FALLBACK;
+   goto fallback;
}
 
pmd_entry = mk_pmd(zero_page, vma->vm_page_prot);
pmd_entry = pmd_mkhuge(pmd_entry);
set_pmd_at(vma->vm_mm, pmd_addr, pmd, pmd_entry);
spin_unlock(ptl);
+   trace_dax_pmd_load_hole(vma, address, zero_page, ret);
return VM_FAULT_NOPAGE;
+
+fallback:
+   trace_dax_pmd_load_hole_fallback(vma, address, zero_page, ret);
+   return VM_FAULT_FALLBACK;
 }
 
 int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
diff --git a/include/trace/events/fs_dax.h b/include/trace/events/fs_dax.h
index f9ed4eb..8814b1a 100644
--- a/include/trace/events/fs_dax.h
+++ b/include/trace/events/fs_dax.h
@@ -54,6 +54,38 @@ DEFINE_EVENT(dax_pmd_fault_class, name, \
 DEFINE_PMD_FAULT_EVENT(dax_pmd_fault);
 DEFINE_PMD_FAULT_EVENT(dax_pmd_fault_done);
 
+DECLARE_EVENT_CLASS(dax_pmd_load_hole_class,
+   TP_PROTO(struct vm_area_struct *vma, unsigned long address,
+   struct page *zero_page, void *radix_entry),
+   TP_ARGS(vma, address, zero_page, radix_entry),
+   TP_STRUCT__entry(
+   __field(unsigned long, vm_flags)
+   __field(unsigned long, address)
+   __field(struct page *, zero_page)
+   __field(void *, radix_entry)
+   ),
+   TP_fast_assign(
+   __entry->vm_flags = vma->vm_flags;
+   __entry->address = address;
+   __entry->zero_page = zero_page;
+   __entry->radix_entry = radix_entry;
+   ),
+   TP_printk("%s mapping read address %#lx zero_page %p radix_entry %#lx",
+   __entry->vm_flags & VM_SHARED ? "shared" : "private",
+   __entry->address,
+   __entry->zero_page,
+   (unsigned long)__entry->radix_entry
+   )
+)
+
+#define DEFINE_PMD_LOAD_HOLE_EVENT(name) \
+DEFINE_EVENT(dax_pmd_load_hole_class, name, \
+   TP_PROTO(struct vm_area_struct *vma, unsigned long address, \
+   struct page *zero_page, void *radix_entry), \
+   TP_ARGS(vma, address, zero_page, radix_entry))
+
+DEFINE_PMD_LOAD_HOLE_EVENT(dax_pmd_load_hole);
+DEFINE_PMD_LOAD_HOLE_EVENT(dax_pmd_load_hole_fallback);
 
 #endif /* _TRACE_FS_DAX_H */
 
-- 
2.7.4



[PATCH 1/6] dax: fix build breakage with ext4, dax and !iomap

2016-11-23 Thread Ross Zwisler
With the current Kconfig setup it is possible to have the following:

CONFIG_EXT4_FS=y
CONFIG_FS_DAX=y
CONFIG_FS_IOMAP=n   # this is in fs/Kconfig & isn't user accessible

With this config we get build failures in ext4_dax_fault() because the
iomap functions in fs/dax.c are missing:

fs/built-in.o: In function `ext4_dax_fault':
file.c:(.text+0x7f3ac): undefined reference to `dax_iomap_fault'
file.c:(.text+0x7f404): undefined reference to `dax_iomap_fault'
fs/built-in.o: In function `ext4_file_read_iter':
file.c:(.text+0x7fc54): undefined reference to `dax_iomap_rw'
fs/built-in.o: In function `ext4_file_write_iter':
file.c:(.text+0x7fe9a): undefined reference to `dax_iomap_rw'
file.c:(.text+0x7feed): undefined reference to `dax_iomap_rw'
fs/built-in.o: In function `ext4_block_zero_page_range':
inode.c:(.text+0x85c0d): undefined reference to `iomap_zero_range'

Now that the struct buffer_head based DAX fault paths and I/O path have
been removed we really depend on iomap support being present for DAX.  Make
this explicit by selecting FS_IOMAP if we compile in DAX support.

Signed-off-by: Ross Zwisler 
---
 fs/Kconfig  | 1 +
 fs/dax.c| 2 --
 fs/ext2/Kconfig | 1 -
 3 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 8e9e5f41..18024bf 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -38,6 +38,7 @@ config FS_DAX
bool "Direct Access (DAX) support"
depends on MMU
depends on !(ARM || MIPS || SPARC)
+   select FS_IOMAP
help
  Direct Access (DAX) can be used on memory-backed block devices.
  If the block device supports DAX and the filesystem supports DAX,
diff --git a/fs/dax.c b/fs/dax.c
index be39633..d8fe3eb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -968,7 +968,6 @@ int __dax_zero_page_range(struct block_device *bdev, sector_t sector,
 }
 EXPORT_SYMBOL_GPL(__dax_zero_page_range);
 
-#ifdef CONFIG_FS_IOMAP
 static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
 {
return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
@@ -1405,4 +1404,3 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 }
 EXPORT_SYMBOL_GPL(dax_iomap_pmd_fault);
 #endif /* CONFIG_FS_DAX_PMD */
-#endif /* CONFIG_FS_IOMAP */
diff --git a/fs/ext2/Kconfig b/fs/ext2/Kconfig
index 36bea5a..c634874e 100644
--- a/fs/ext2/Kconfig
+++ b/fs/ext2/Kconfig
@@ -1,6 +1,5 @@
 config EXT2_FS
tristate "Second extended fs support"
-   select FS_IOMAP if FS_DAX
help
  Ext2 is a standard Linux file system for hard disks.
 
-- 
2.7.4



Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Dan Williams
On Wed, Nov 23, 2016 at 9:27 AM, Bart Van Assche wrote:
> On 11/23/2016 09:13 AM, Logan Gunthorpe wrote:
>>
>> IMO any memory that has been registered for a P2P transaction should be
>> locked from being evicted. So if there's a get_user_pages call it needs
>> to be pinned until the put_page. The main issue being with the RDMA
>> case: handling an eviction when a chunk of memory has been registered as
>> an MR would be very tricky. The MR may be relied upon by another host
>> and the kernel would have to inform user-space the MR was invalid then
>> user-space would have to tell the remote application.
>
>
> Hello Logan,
>
> Are you aware that the Linux kernel already supports ODP (On Demand Paging)?
> See also the output of git grep -nHi on.demand.paging. See also
> https://www.openfabrics.org/images/eventpresos/workshops2014/DevWorkshop/presos/Tuesday/pdf/04_ODP_update.pdf.
>

I don't think that was designed for the case where the backing memory
is a special/static physical address range rather than anonymous
"System RAM", right?

I think we should handle the graphics P2P concerns separately from the
general P2P-DMA case since the latter does not require the higher
order memory management facilities. Using ZONE_DEVICE/DAX mappings to
avoid changes to every driver that wants to support P2P-DMA separately
from typical DMA still seems the path of least resistance.


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Bart Van Assche

On 11/23/2016 09:13 AM, Logan Gunthorpe wrote:
> IMO any memory that has been registered for a P2P transaction should be
> locked from being evicted. So if there's a get_user_pages call it needs
> to be pinned until the put_page. The main issue being with the RDMA
> case: handling an eviction when a chunk of memory has been registered as
> an MR would be very tricky. The MR may be relied upon by another host
> and the kernel would have to inform user-space the MR was invalid then
> user-space would have to tell the remote application.

Hello Logan,

Are you aware that the Linux kernel already supports ODP (On Demand
Paging)? See the output of git grep -nHi on.demand.paging. See also
https://www.openfabrics.org/images/eventpresos/workshops2014/DevWorkshop/presos/Tuesday/pdf/04_ODP_update.pdf.

Bart.


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Dave Hansen
On 11/22/2016 11:49 PM, Daniel Vetter wrote:
> Yes, agreed. My idea with exposing vram sections using numa nodes wasn't
> to reuse all the existing allocation policies directly, those won't work.
> So at boot-up your default numa policy would exclude any vram nodes.
> 
> But I think (as an -mm layman) that numa gives us a lot of the tools and
> policy interface that we need to implement what we want for gpus.

Are you suggesting creating NUMA nodes for video RAM (I assume that's
what you mean by vram) where that RAM is not at all CPU-accessible?


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Christian König

Am 23.11.2016 um 08:49 schrieb Daniel Vetter:
> On Tue, Nov 22, 2016 at 01:21:03PM -0800, Dan Williams wrote:
>> On Tue, Nov 22, 2016 at 1:03 PM, Daniel Vetter wrote:
>>> On Tue, Nov 22, 2016 at 9:35 PM, Serguei Sagalovitch wrote:
>>>> On 2016-11-22 03:10 PM, Daniel Vetter wrote:
>>>>> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams wrote:
>>>>>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch wrote:
>>>>>>> I personally like the "device-DAX" idea but my concerns are:
>>>>>>>
>>>>>>> -  How well will it co-exist with the DRM infrastructure /
>>>>>>>    implementations, in the part dealing with CPU pointers?
>>>>>>
>>>>>> Inside the kernel a device-DAX range is "just memory" in the sense
>>>>>> that you can perform pfn_to_page() on it and issue I/O, but the
>>>>>> vma is not migratable. To be honest I do not know how well that
>>>>>> co-exists with drm infrastructure.
>>>>>>
>>>>>>> -  How well will we be able to handle the case when we need to
>>>>>>>    "move"/"evict" memory/data to a new location, so the CPU
>>>>>>>    pointer should point to the new physical location/address
>>>>>>>    (which may not be in PCI device memory at all)?
>>>>>>
>>>>>> So, device-DAX deliberately avoids support for in-kernel migration
>>>>>> or overcommit. Those cases are left to the core mm or drm. The
>>>>>> device-dax interface is for cases where all that is needed is a
>>>>>> direct-mapping to a statically-allocated physical-address range,
>>>>>> be it persistent memory or some other special reserved memory
>>>>>> range.
>>>>>
>>>>> For some of the fancy use-cases (e.g. to be comparable to what HMM
>>>>> can pull off) I think we want all the magic in core mm, i.e.
>>>>> migration and overcommit. At least that seems to be the very strong
>>>>> drive in all general-purpose gpu abstractions and implementations,
>>>>> where memory is allocated with malloc, and then mapped/moved into
>>>>> vram/gpu address space through some magic,
>>>>
>>>> It is possible that it is the other way around: memory is requested
>>>> to be allocated and should be kept in vram for performance reasons,
>>>> but due to a possible overcommit case we need at least temporarily
>>>> to "move" such an allocation to system memory.
>>>
>>> With migration I meant migrating both ways of course. And with stuff
>>> like numactl we can also influence where exactly the malloc'ed memory
>>> is allocated originally, at least if we'd expose the vram range as a
>>> very special numa node that happens to be far away and not hold any
>>> cpu cores.
>>
>> I don't think we should be using numa distance to reverse engineer a
>> certain allocation behavior.  The latency data should be truthful, but
>> you're right we'll need a mechanism to keep general purpose
>> allocations out of that range by default. Btw, strict isolation is
>> another design point of device-dax, but I think in this case we're
>> describing something between the two extremes of full isolation and
>> full compatibility with existing numactl apis.
>
> Yes, agreed. My idea with exposing vram sections using numa nodes
> wasn't to reuse all the existing allocation policies directly, those
> won't work. So at boot-up your default numa policy would exclude any
> vram nodes.
>
> But I think (as an -mm layman) that numa gives us a lot of the tools
> and policy interface that we need to implement what we want for gpus.

Agree completely. From a ten mile high view our GPUs are just command
processors with local memory as well.

Basically this is also the whole idea of what AMD is pushing with HSA
for a while.

It's just that a lot of problems start to pop up when you look at all
the nasty details. For example only part of the GPU memory is usually
accessible by the CPU.

So even when numa nodes expose a good foundation for this I think there
is still a lot of code to write.

BTW: I should probably start to read into the numa code of the kernel.
Any good pointers for that?

Regards,
Christian.

> Wrt isolation: There's a sliding scale of what different users expect,
> from full auto everything, including migrating pages around if needed
> to full isolation all seems to be on the table. As long as we keep
> vram nodes out of any default allocation numasets, full isolation
> should be possible.
>
> -Daniel


