Re: Enabling peer to peer device transactions for PCIe devices

2017-01-06 Thread Serguei Sagalovitch

On 2017-01-05 08:58 PM, Jerome Glisse wrote:

On Thu, Jan 05, 2017 at 05:30:34PM -0700, Jason Gunthorpe wrote:

On Thu, Jan 05, 2017 at 06:23:52PM -0500, Jerome Glisse wrote:


I still don't understand what you're driving at - you've said in both
cases a user VMA exists.

In the former case no, there is no VMA directly, but if you want one then
a device can provide one. But such a VMA is useless as CPU access is not
expected.

I disagree it is useless, the VMA is going to be necessary to support
upcoming things like CAPI, you need it to support O_DIRECT from the
filesystem, DPDK, etc. This is why I am opposed to any model that is
not VMA based for setting up RDMA - that is short-sighted and does
not seem to reflect where the industry is going.

So focus on having VMA backed by actual physical memory that covers
your GPU objects and ask how do we wire up the '__user *' to the DMA
API in the best way so the DMA API still has enough information to
setup IOMMUs and whatnot.

I am talking about 2 different things. Existing hardware and API where you
_do not_ have a vma and you do not need one. This is just existing stuff.

I do not understand why you assume that the existing APIs do not need one.
I would say that a lot of __existing__ user-level APIs, and their kernel
support (especially outside of the graphics domain), assume that we have a
vma and deal with __user * pointers.

Some closed drivers provide functionality on top of this design. The question
is do we want to do the same? If yes, and you insist on having a vma, we
could provide one, but this does not apply and is useless for where we
are going with new hardware.

With new hardware you just use malloc or mmap to allocate memory and then
you use it directly with the device. Device driver can migrate any part of
the process address space to device memory. In this scheme you have your
usual VMAs but there is nothing special about them.

Assuming that the whole device memory is CPU accessible (and it looks
like the direction where we are going):
- You forgot about the use case where we want or need to allocate memory
directly on the device (why migrate anything if it is not needed?).
- We may want to use the CPU to access such memory on the device to avoid
any unnecessary migration back.
- We may have more device memory than system memory.
E.g. 12 GPUs with 64 GB each already gives us ~0.7 TB, not
mentioning NVDIMM cards which could also be used as memory
storage for other devices to access.
- We also may want/need to share GPU memory between different
processes.

Now when you try to do get_user_pages() on any page that is inside the
device it will fail, because we do not allow any device memory to be pinned.
There are various reasons for that and they are not going away in any hw
in the planning (so for the next few years).

Still we do want to support peer to peer mapping. The plan is to only do so
with ODP capable hardware. Still we need to solve the IOMMU issue and
it needs special handling inside the RDMA device. The way it works is
that RDMA asks for a GPU page, the GPU checks if it has room inside its PCI
BAR to map this page for the device, and this can fail. If it succeeds then
you need the IOMMU to let the RDMA device access the GPU PCI BAR.

So here we have 2 orthogonal problems. The first one is how to make 2 drivers
talk to each other to set up a mapping to allow peer to peer, and the second
is about the IOMMU.


I think that there is a third problem: a lot of existing user-level APIs
(MPI, IB Verbs, file I/O, etc.) deal with pointers to buffers.
Ideally we would support use cases where those buffers are located in
device memory, avoiding any unnecessary migration / double-buffering.

Currently a lot of infrastructure in the kernel assumes that it has a user
pointer and calls "get_user_pages" to get an s/g list. What is your opinion
on how it should be changed to deal with cases where the "buffer" is in
device memory?
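
To make the current assumption concrete, below is a rough sketch of the pattern
much of that infrastructure follows today (simplified; the exact
get_user_pages*() signatures vary between kernel versions, and error unwinding
is elided). It only works when the __user pointer is backed by ordinary,
pinnable system memory:

    #include <linux/mm.h>
    #include <linux/slab.h>
    #include <linux/scatterlist.h>
    #include <linux/dma-mapping.h>

    /* Sketch: pin a user buffer and build an s/g list for DMA.  This is
     * exactly the step that breaks when 'uaddr' points at device memory
     * that cannot be pinned.
     */
    static int pin_user_buffer_for_dma(struct device *dev, unsigned long uaddr,
                                       size_t size, struct sg_table *sgt)
    {
        unsigned long nr = DIV_ROUND_UP(size + offset_in_page(uaddr), PAGE_SIZE);
        struct page **pages;
        int ret;

        pages = kmalloc_array(nr, sizeof(*pages), GFP_KERNEL);
        if (!pages)
            return -ENOMEM;

        /* Fails (or is simply wrong) for pages living in a PCIe BAR. */
        if (get_user_pages_fast(uaddr, nr, 1, pages) != nr) {
            ret = -EFAULT;
            goto out;
        }

        ret = sg_alloc_table_from_pages(sgt, pages, nr, offset_in_page(uaddr),
                                        size, GFP_KERNEL);
        if (ret)
            goto out;

        /* Obtain bus addresses / program the IOMMU for this device. */
        ret = dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL) ? 0 : -EIO;
    out:
        kfree(pages);
        return ret;
    }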





Re: Enabling peer to peer device transactions for PCIe devices

2017-01-05 Thread Serguei Sagalovitch

On 2017-01-05 07:30 PM, Jason Gunthorpe wrote:

 but I am opposed to
the idea we need two API paths that the *driver* has to figure out.
That is fundamentally not what I want as a driver developer.

Give me a common API to convert '__user *' to a scatter list and pin
the pages.

Completely agreed. IMHO there is no sense in duplicating the same logic
everywhere, nor in trying to find all the places where it is missing.

Sincerely yours,
Serguei Sagalovitch



Re: Enabling peer to peer device transactions for PCIe devices

2016-11-30 Thread Serguei Sagalovitch

On 2016-11-30 11:23 AM, Jason Gunthorpe wrote:

Yes, that sounds fine. Can we simply kill the process from the GPU driver?
Or do we need to extend the OOM killer to manage GPU pages?

I don't know. We could use send_sig_info() to send a signal from the kernel
to user space, so theoretically the GPU driver could issue a KILL signal to
some process.
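
For reference, a minimal sketch of that idea from the kernel side (illustrative
only; picking the right victim task and the locking around it are the hard
parts and are not shown, and in newer kernels the type is struct
kernel_siginfo):

    #include <linux/sched.h>
    #include <linux/signal.h>

    /* Illustrative: have the GPU driver ask the kernel to deliver SIGKILL
     * to the task owning an allocation it can no longer satisfy.
     */
    static void gpu_kill_offending_task(struct task_struct *task)
    {
        struct siginfo info = {};

        info.si_signo = SIGKILL;
        info.si_code  = SI_KERNEL;      /* signal originates in the kernel */

        send_sig_info(SIGKILL, &info, task);
    }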


On Wed, Nov 30, 2016 at 12:45:58PM +0200, Haggai Eran wrote:

I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API support
for peer to peer. I'm not sure we need vmap. We need a way to have a scatterlist
of MMIO pfns, and ZONE_DEVICE allows that.
I do not think that using the DMA-API as it is is the best solution (at
least in the current form):

- It deals with handles/fds for the whole allocation, but the client
could/will use sub-allocation, and it is theoretically possible to "merge"
several allocations into one from the GPU perspective.
- It requires knowledge to export, but because "sharing" is controlled
from user space it means that we must "export" all allocations by default.
- It deals with fds/handles, but the user application may work with
addresses/pointers.


Also the current DMA-API forces us to redo all the DMA table programming
each time, regardless of whether the location was changed or not. With a
vma / mmu we are able to install a notifier to intercept changes in location
and update the translation tables only as needed (we do not need to keep the
get_user_pages() lock).
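
As an illustration of that notifier idea, a rough sketch is below (the
mmu_notifier callback signature shown is the older one and has since changed
across kernel versions; the device-side invalidation is just a stub):

    #include <linux/mmu_notifier.h>

    struct my_mirror {
        struct mmu_notifier mn;
        /* device-private state: cached translations, etc. */
    };

    /* Called when part of the process address space is about to change;
     * only then do we need to re-program the device translation tables.
     */
    static void my_invalidate_range_start(struct mmu_notifier *mn,
                                          struct mm_struct *mm,
                                          unsigned long start, unsigned long end)
    {
        struct my_mirror *m = container_of(mn, struct my_mirror, mn);

        /* Stub: invalidate / rebuild device mappings covering [start, end). */
        (void)m;
    }

    static const struct mmu_notifier_ops my_mn_ops = {
        .invalidate_range_start = my_invalidate_range_start,
    };

    static int my_mirror_register(struct my_mirror *m, struct mm_struct *mm)
    {
        m->mn.ops = &my_mn_ops;
        /* No long-term get_user_pages() pin is held with this approach. */
        return mmu_notifier_register(&m->mn, mm);
    }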


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-28 Thread Serguei Sagalovitch


On 2016-11-28 04:36 PM, Logan Gunthorpe wrote:

On 28/11/16 12:35 PM, Serguei Sagalovitch wrote:

As soon as the PeerDirect mapping is called, the GPU must not "move"
such memory. It is by PeerDirect design. It is similar to how it works
with system memory and an RDMA MR: when "get_user_pages" is called, the
memory is pinned.

We haven't touched this in a long time and perhaps it changed, but there
definitely was a callback in the PeerDirect API to allow the GPU to
invalidate the mapping. That's what we don't want.

I assume that you are talking about the "invalidate_peer_memory()" callback?
I was told that it is the "last resort" because the HCA (and driver) is not
able to handle it in a safe manner, so it basically "aborts" everything.
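
To make the discussion concrete, an approximate sketch of that kind of
interface is below; the structure and member names are purely illustrative
and are not the actual PeerDirect definitions:

    /* Illustrative only -- not the real PeerDirect header. */
    struct peer_mem_ops {
        /* Pin / translate a GPU memory range so the HCA can DMA to it. */
        int  (*get_pages)(unsigned long addr, size_t size, void **context);
        void (*put_pages)(void *context);

        /*
         * "Last resort": the GPU driver notifies the RDMA core that a
         * previously pinned range is going away.  Non-ODP hardware cannot
         * handle this gracefully, so in practice the MR is torn down
         * ("abort everything").
         */
        int  (*invalidate_peer_memory)(void *core_context, void *context);
    };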




Re: Enabling peer to peer device transactions for PCIe devices

2016-11-28 Thread Serguei Sagalovitch

On 2016-11-28 01:20 PM, Logan Gunthorpe wrote:


On 28/11/16 09:57 AM, Jason Gunthorpe wrote:

On PeerDirect, we have some kind of a middle-ground solution for pinning
GPU memory. We create a non-ODP MR pointing to VRAM but rely on
user-space and the GPU not to migrate it. If they do, the MR gets
destroyed immediately.

That sounds horrible. How can that possibly work? What if the MR is
being used when the GPU decides to migrate? I would not support that
upstream without a lot more explanation..

Yup, this was our experience when playing around with PeerDirect. There
was nothing we could do if the GPU decided to invalidate the P2P
mapping.

As soon as the PeerDirect mapping is called, the GPU must not "move"
such memory. It is by PeerDirect design. It is similar to how it works
with system memory and an RDMA MR: when "get_user_pages" is called, the
memory is pinned.



Re: Enabling peer to peer device transactions for PCIe devices

2016-11-28 Thread Serguei Sagalovitch

On 2016-11-27 09:02 AM, Haggai Eran wrote:

On PeerDirect, we have some kind of a middle-ground solution for pinning
GPU memory. We create a non-ODP MR pointing to VRAM but rely on
user-space and the GPU not to migrate it. If they do, the MR gets
destroyed immediately. This should work on legacy devices without ODP
support, and allows the system to safely terminate a process that
misbehaves. The downside of course is that it cannot transparently
migrate memory but I think for user-space RDMA doing that transparently
requires hardware support for paging, via something like HMM.

...

Maybe I am wrong, but my understanding is that the PeerDirect logic basically
follows the "RDMA register MR" logic, so basically nothing prevents us from
"terminating" a process in the "MMU notifier" case when we are very low on
memory, making it similar to (not worse than) the PeerDirect case.

I'm hearing most people say ZONE_DEVICE is the way to handle this,
which means the missing remaining piece for RDMA is some kind of DMA
core support for p2p address translation..

Yes, this is definitely something we need. I think Will Davis's patches
are a good start.

Another thing I think is that while HMM is good for user-space
applications, for kernel p2p use there is no need for that.

About HMM: I do not think that in its current form HMM would fit the
requirements of the generic P2P transfer case. My understanding is that at
the current stage HMM is good for "caching" system memory
in device memory for fast GPU access, but in the RDMA MR non-ODP case
it will not work, because the location of the memory must not be
changed, so the memory should be allocated directly in PCIe memory.

Using ZONE_DEVICE with or without something like DMA-BUF to pin and unpin
pages for the short duration as you wrote above could work fine for
kernel uses in which we can guarantee they are short.

Potentially there is another issue related to pin/unpin. If the memory will
be used many times, there is no sense in rebuilding and programming the
s/g tables each time if the location of the memory has not changed.
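
A rough sketch of the kind of caching that suggests (purely illustrative; the
generation counter would be bumped by whatever notifier/invalidation path
reports that the allocation has moved):

    #include <linux/scatterlist.h>
    #include <linux/dma-mapping.h>

    /* Illustrative cache: reuse an already-mapped s/g table for repeated
     * transfers and rebuild it only when the backing location has changed.
     */
    struct p2p_mapping_cache {
        struct sg_table sgt;
        bool            valid;
        u64             generation;     /* last location change observed */
    };

    static int p2p_get_mapping(struct device *dev, struct p2p_mapping_cache *c,
                               u64 current_generation)
    {
        if (c->valid && c->generation == current_generation)
            return 0;                   /* location unchanged: reuse mapping */

        if (c->valid) {
            dma_unmap_sg(dev, c->sgt.sgl, c->sgt.nents, DMA_BIDIRECTIONAL);
            sg_free_table(&c->sgt);
            c->valid = false;
        }

        /* Rebuild c->sgt for the new location and dma_map_sg() it here... */

        c->generation = current_generation;
        c->valid = true;
        return 0;
    }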




Re: Enabling peer to peer device transactions for PCIe devices

2016-11-25 Thread Serguei Sagalovitch


On 2016-11-25 03:26 PM, Felix Kuehling wrote:

On 16-11-25 12:20 PM, Serguei Sagalovitch wrote:

A white list may end up being rather complicated if it has to cover
different CPU generations and system architectures. I feel this is a
decision user space could easily make.

Logan

I agreed that it is better to leave it up to user space to check what is
working and what is not. I found that write is practically always working
but read very often is not. Also, sometimes a system BIOS update could fix
the issue.


But is user mode always aware that P2P is going on or even possible? For
example you may have a library reading a buffer from a file, but it
doesn't necessarily know where that buffer is located (system memory,
VRAM, ...) and it may not know what kind of device the file is on
(SATA drive, NVMe SSD, ...). The library will never know if all it gets
is a pointer and a file descriptor.

The library ends up calling a read system call. Then it would be up to
the kernel to figure out the most efficient way to read the buffer from
the file. If supported, it could use P2P between a GPU and NVMe where
the NVMe device performs a DMA write to VRAM.

If you put the burden of figuring out the P2P details on user mode code,
I think it will severely limit the use cases that actually take
advantage of it. You also risk a bunch of different implementations that
get it wrong half the time on half the systems out there.

Regards,
   Felix



I agree with you in theory, but I must admit that I do not know how the
kernel could effectively collect all this information without running
pretty complicated tests each time on boot-up (if any configuration
changed, including BIOS settings) and on PnP events. Also, to be efficient
the kernel needs to know the performance results (which could also depend
on clock / power mode) for reads/writes between each pair of devices; for
double-buffering it needs to know / detect on which NUMA node to allocate,
etc. Also, a device could be fully configured only on the first request
for access, so initialization sequences may need to change.



Re: Enabling peer to peer device transactions for PCIe devices

2016-11-25 Thread Serguei Sagalovitch

On 2016-11-25 02:34 PM, Jason Gunthorpe wrote:

On Fri, Nov 25, 2016 at 12:16:30PM -0500, Serguei Sagalovitch wrote:


b) Allocation may not  have CPU address  at all - only GPU one.

But you don't expect RDMA to work in the case, right?

GPU people need to stop doing this windowed memory stuff :)

The GPU can perfectly well access all of VRAM. It is only an issue for p2p
without a special interconnect, and for CPU access. Strictly speaking, as
long as we have a "bus address" we could do RDMA, but I agree that for
RDMA we could/should(?) always "request" a CPU address (I hope that we
can forget about 32-bit applications :-)).

BTW/FYI: about CPU access: some user-level APIs are mainly handle-based,
so there is no need for CPU access by default.

About "visible" / non-visible VRAM parts: I assume that going
forward we will be able to get rid of this completely as soon as support
for resizable PCI BARs is implemented and/or old/current h/w
becomes obsolete.


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-25 Thread Serguei Sagalovitch

On 2016-11-25 08:22 AM, Christian König wrote:



Serguei, what is your plan in GPU land for migration? Ie if I have a
CPU mapped page and the GPU moves it to VRAM, it becomes non-cacheable
- do you still allow the CPU to access it? Or do you swap it back to
cacheable memory if the CPU touches it?


Depends on the policy in command, but currently it's the other way 
around most of the time.


E.g. we allocate memory in VRAM, the CPU writes to it WC and avoids 
reading because that is slow, the GPU in turn can access it with full 
speed.


When we run out of VRAM we move those allocations to system memory and 
update both the CPU as well as the GPU page tables.


So that move is transparent for both userspace as well as shaders 
running on the GPU.

I would like to add more in relation to CPU access:

a) We could have a CPU-accessible part of VRAM ("inside" the PCIe BAR
aperture) and a non-CPU-accessible part. As a result, if the user needs
CPU access then the memory should be located in the CPU-accessible part
of VRAM or in system memory.

The application / user-mode driver could specify preferences/hints for
locations based on its assumptions / knowledge about access-pattern
requirements, game resolution, the size of VRAM, etc. So if CPU access
performance is critical, such memory should be allocated in system memory
as the first (and maybe only) choice.

b) An allocation may not have a CPU address at all - only a GPU one.
Also, we may not be able to have CPU addresses/accesses for all of VRAM,
but memory may still be migrated in any case, regardless of whether we
have a CPU address or not.

c) "VRAM, it becomes non-cacheable":
Strictly speaking, VRAM is mapped as WC (write-combining) to provide fast
CPU write access. It was also found that sometimes, if CPU access is not
performance-critical, it may be useful to allocate/program system memory
as WC too, to avoid the need for extra "snooping" to synchronize with CPU
caches during GPU access.
So potentially system memory could be WC too.
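
For illustration, roughly how a driver hands out such a WC CPU mapping of the
visible VRAM BAR (a simplified sketch; the BAR offset calculation and error
handling are omitted):

    #include <linux/mm.h>

    /* Sketch of an mmap() handler mapping part of the CPU-visible VRAM BAR
     * into a process as write-combined memory: fast CPU writes, slow reads.
     */
    static int vram_mmap(struct file *filp, struct vm_area_struct *vma)
    {
        unsigned long size = vma->vm_end - vma->vm_start;
        unsigned long pfn = 0;   /* placeholder: (BAR base + offset) >> PAGE_SHIFT */

        vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);

        /* BAR/MMIO space, so map by pfn rather than through struct pages. */
        return io_remap_pfn_range(vma, vma->vm_start, pfn, size,
                                  vma->vm_page_prot);
    }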




Re: Enabling peer to peer device transactions for PCIe devices

2016-11-25 Thread Serguei Sagalovitch



Well, I guess there's some consensus building to do. The existing
options are:

* Device DAX: which could work but the problem I see with it is that it
only allows one application to do these transfers. Or there would have
to be some user-space coordination to figure out which application gets what
memory.
About the one-application restriction: so it is per memory mapping? I assume
that it should not be a problem for one application to do transfers to
several devices simultaneously? Am I right?

Maybe we should follow the RDMA MR design and register memory for p2p
transfer from user space?

What about the following:

a) A device DAX is created.
b) A "normal" (movable, etc.) allocation is done in PCIe memory, and a
CPU pointer / CPU access is requested.
c) p2p_mr_register() is called and a CPU pointer (mmap() on the DAX device)
is returned. Accordingly, such memory is marked as "unmovable" by e.g. the
graphics driver.
d) When p2p is no longer needed, p2p_mr_unregister() is called.

What do you think? Will it work?
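
To sketch what steps c) and d) might look like as an interface (entirely
hypothetical; nothing like this exists in the kernel, the names simply follow
the proposal above):

    /* Hypothetical interface corresponding to steps c) and d) above. */
    struct p2p_mr;

    /*
     * Pin a range of device (PCIe/VRAM) memory for peer-to-peer DMA and
     * return a CPU mapping for it (e.g. an mmap() of the device-DAX node).
     * The exporting driver marks the range "unmovable" while registered.
     */
    struct p2p_mr *p2p_mr_register(struct device *exporter,
                                   unsigned long dev_offset, size_t size,
                                   void __user **cpu_addr);

    /* Drop the pin; the exporter may migrate the memory again afterwards. */
    void p2p_mr_unregister(struct p2p_mr *mr);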




Re: Enabling peer to peer device transactions for PCIe devices

2016-11-25 Thread Serguei Sagalovitch



A white list may end up being rather complicated if it has to cover
different CPU generations and system architectures. I feel this is a
decision user space could easily make.

Logan
I agreed that it is better to leave it up to user space to check what is
working and what is not. I found that write is practically always working
but read very often is not. Also, sometimes a system BIOS update could fix
the issue.



Re: Enabling peer to peer device transactions for PCIe devices

2016-11-24 Thread Serguei Sagalovitch


On 2016-11-24 11:26 AM, Jason Gunthorpe wrote:

On Thu, Nov 24, 2016 at 10:45:18AM +0100, Christian König wrote:

Am 24.11.2016 um 00:25 schrieb Jason Gunthorpe:

There is certainly nothing about the hardware that cares
about ZONE_DEVICE vs System memory.

Well that is clearly not so simple. When your ZONE_DEVICE pages describe a
PCI BAR and another PCI device initiates a DMA to this address the DMA
subsystem must be able to check if the interconnection really works.

I said the hardware doesn't care.. You are right, we still have an
outstanding problem in Linux of how to generically DMA map a P2P
address - which is a different issue from getting the P2P address from
a __user pointer...

Jason
I agree, but the problem is that one issue immediately introduces another
one to solve, and so on (if we do not want to cut corners). I would think
that a lot of them are interconnected, because the way one problem is
solved may impact the solution for another.

BTW, about "DMA map a p2p address": right now, to enable p2p between
devices it is required/recommended to disable IOMMU support (e.g. the Intel
IOMMU driver has special logic for graphics and the comment "Reserve all PCI
MMIO to avoid peer-to-peer access").


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Serguei Sagalovitch



On 2016-11-23 02:12 PM, Jason Gunthorpe wrote:

On Wed, Nov 23, 2016 at 10:40:47AM -0800, Dan Williams wrote:


I don't think that was designed for the case where the backing memory
is a special/static physical address range rather than anonymous
"System RAM", right?

The hardware doesn't care where the memory is. ODP is just a generic
mechanism to provide demand-fault behavior for a mirrored page table.

ODP has the same issue as everything else, it needs to translate a
page table entry into a DMA address, and we have no API to do that
when the page table points to peer-peer memory.

Jason
I would like to note that for graphics applications (especially for VR
support) we should avoid the ODP case at any cost during graphics command
execution, due to the requirement to have smooth and predictable playback.
We want to load / "pin" all required resources before the graphics processor
begins to touch them. This is not so critical for compute applications.
Because only the graphics / compute stack knows which resources will be in
use, as well as all the statistics, only the graphics stack is capable of
making the correct decision about when and _where_ to evict, as well as
when and _where_ to put memory back.



Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Serguei Sagalovitch


On 2016-11-23 03:51 AM, Christian König wrote:

Am 23.11.2016 um 08:49 schrieb Daniel Vetter:

On Tue, Nov 22, 2016 at 01:21:03PM -0800, Dan Williams wrote:

On Tue, Nov 22, 2016 at 1:03 PM, Daniel Vetter <dan...@ffwll.ch> wrote:

On Tue, Nov 22, 2016 at 9:35 PM, Serguei Sagalovitch
<serguei.sagalovi...@amd.com> wrote:

On 2016-11-22 03:10 PM, Daniel Vetter wrote:
On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams <dan.j.willi...@intel.com> wrote:
On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
<serguei.sagalovi...@amd.com> wrote:

I personally like the "device-DAX" idea but my concerns are:

-  How well will it co-exist with the DRM infrastructure / implementations
   in the part dealing with CPU pointers?

Inside the kernel a device-DAX range is "just memory" in the sense
that you can perform pfn_to_page() on it and issue I/O, but the vma is
not migratable. To be honest I do not know how well that co-exists
with drm infrastructure.

-  How well will we be able to handle the case when we need to "move"/"evict"
   memory/data to a new location, so the CPU pointer should point to the new
   physical location/address (and maybe not in PCI device memory at all)?

So, device-DAX deliberately avoids support for in-kernel migration or
overcommit. Those cases are left to the core mm or drm. The device-dax
interface is for cases where all that is needed is a direct mapping to
a statically-allocated physical-address range, be it persistent memory
or some other special reserved memory range.

For some of the fancy use-cases (e.g. to be comparable to what HMM can
pull off) I think we want all the magic in core mm, i.e. migration and
overcommit. At least that seems to be the very strong drive in all
general-purpose gpu abstractions and implementations, where memory is
allocated with malloc, and then mapped/moved into vram/gpu address
space through some magic,

It is possible that there is another way around: memory is requested to be
allocated and should be kept in vram for performance reasons, but due to a
possible overcommit case we need, at least temporarily, to "move" such an
allocation to system memory.

With migration I meant migrating both ways of course. And with stuff
like numactl we can also influence where exactly the malloc'ed memory
is allocated originally, at least if we'd expose the vram range as a
very special numa node that happens to be far away and not hold any
cpu cores.

I don't think we should be using numa distance to reverse engineer a
certain allocation behavior.  The latency data should be truthful, but
you're right we'll need a mechanism to keep general purpose
allocations out of that range by default. Btw, strict isolation is
another design point of device-dax, but I think in this case we're
describing something between the two extremes of full isolation and
full compatibility with existing numactl apis.

Yes, agreed. My idea with exposing vram sections using numa nodes wasn't
to reuse all the existing allocation policies directly, those won't 
work.

So at boot-up your default numa policy would exclude any vram nodes.

But I think (as an -mm layman) that numa gives us a lot of the tools and
policy interface that we need to implement what we want for gpus.


Agree completely. From a ten mile high view our GPUs are just command 
processors with local memory as well .


Basically this is also the whole idea of what AMD is pushing with HSA 
for a while.


It's just that a lot of problems start to pop up when you look at all 
the nasty details. For example only part of the GPU memory is usually 
accessible by the CPU.


So even when numa nodes expose a good foundation for this I think 
there is still a lot of code to write.


BTW: I should probably start to read into the numa code of the kernel.
Any good pointers for that?

I would assume that the "page" allocation logic itself should be inside the
graphics driver, due to possibly different requirements, especially for
graphics: alignment, etc.




Regards,
Christian.


Wrt isolation: There's a sliding scale of what different users expect,
from full auto everything, including migrating pages around if needed to
full isolation all seems to be on the table. As long as we keep vram 
nodes
out of any default allocation numasets, full isolation should be 
possible.

-Daniel





Sincerely yours,
Serguei Sagalovitch



Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Serguei Sagalovitch

On 2016-11-23 12:27 PM, Bart Van Assche wrote:

On 11/23/2016 09:13 AM, Logan Gunthorpe wrote:

IMO any memory that has been registered for a P2P transaction should be
locked from being evicted. So if there's a get_user_pages call it needs
to be pinned until the put_page. The main issue being with the RDMA
case: handling an eviction when a chunk of memory has been registered as
an MR would be very tricky. The MR may be relied upon by another host
and the kernel would have to inform user-space the MR was invalid then
user-space would have to tell the remote application.


Hello Logan,

Are you aware that the Linux kernel already supports ODP (On Demand 
Paging)? See also the output of git grep -nHi on.demand.paging. See 
also 
https://www.openfabrics.org/images/eventpresos/workshops2014/DevWorkshop/presos/Tuesday/pdf/04_ODP_update.pdf.


Bart.
My understanding is that the main problems are (a) h/w support and (b)
compatibility with IB Verbs semantics.
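
For context, ODP is something the application opts into when registering the
MR from user space, roughly as below (assuming libibverbs and an HCA/driver
with ODP support; on hardware without it the registration simply fails):

    #include <infiniband/verbs.h>

    /* Register an on-demand-paging MR: pages are not pinned up front and can
     * be faulted in by the HCA -- exactly what non-ODP hardware cannot do.
     */
    struct ibv_mr *register_odp_mr(struct ibv_pd *pd, void *buf, size_t len)
    {
        return ibv_reg_mr(pd, buf, len,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_READ |
                          IBV_ACCESS_REMOTE_WRITE |
                          IBV_ACCESS_ON_DEMAND);
    }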





Re: Enabling peer to peer device transactions for PCIe devices

2016-11-23 Thread Serguei Sagalovitch


On 2016-11-23 02:05 PM, Jason Gunthorpe wrote:

On Wed, Nov 23, 2016 at 10:13:03AM -0700, Logan Gunthorpe wrote:


an MR would be very tricky. The MR may be relied upon by another host
and the kernel would have to inform user-space the MR was invalid then
user-space would have to tell the remote application.

As Bart says, it would be best to be combined with something like
Mellanox's ODP MRs, which allows a page to be evicted and then trigger
a CPU interrupt if a DMA is attempted so it can be brought back.

Please note that in the general case (including the MR one) we could have a
"page fault" from a different PCIe device, so all PCIe devices must be
synchronized.

This includes the usual fencing mechanism so the CPU can block, flush, and
then evict a page coherently.

This is the general direction the industry is going in: Link PCI DMA
directly to dynamic user page tables, including support for demand
faulting and synchronicity.

Mellanox ODP is a rough implementation of mirroring a process's page
table via the kernel, while IBM's CAPI (and CCIX, PCI ATS?) is
probably a good example of where this is ultimately headed.

CAPI allows a PCI DMA to directly target an ASID associated with a
user process and then use the usual CPU machinery to do the page
translation for the DMA. This includes page faults for evicted pages,
and obviously allows eviction and migration..

So, of all the solutions in the original list, I would discard
anything that isn't VMA focused. Emulating what CAPI does in hardware
with software is probably the best choice, or we have to do it all
again when CAPI style hardware broadly rolls out :(

DAX and GPU allocators should create VMAs and manipulate them in the
usual way to achieve migration, windowing, cache, movement or
swap of the potentially peer-peer memory pages. They would have to
respect the usual rules for a VMA, including pinning.

DMA drivers would use the usual approaches for dealing with DMA from
a VMA: short term pin or long term coherent translation mirror.

So, to my view (looking from RDMA), the main problem with peer-peer is
how do you DMA translate VMA's that point at non struct page memory?

Does HMM solve the peer-peer problem? Does it do it generically or
only for drivers that are mirroring translation tables?

In its current form HMM doesn't solve the peer-peer problem. Currently it
allows "mirroring" of "malloc" memory on the GPU, which is not always what
is needed. Additionally, there is a need to be able to share VRAM
allocations between different processes.

From an RDMA perspective we could use something other than
get_user_pages() to pin and DMA translate a VMA if the core community
could decide on an API. eg get_user_dma_sg() would probably be quite
usable.

Jason
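
To make the suggestion concrete, a possible shape for such an API (entirely
hypothetical; get_user_dma_sg() does not exist, the name is just the example
given above, and put_user_dma_sg() is invented here for symmetry):

    #include <linux/scatterlist.h>
    #include <linux/dma-mapping.h>

    /*
     * Hypothetical: take whatever backs the user VMA at [uaddr, uaddr + size)
     * -- ordinary pages or peer-to-peer/BAR memory -- hold it, and return a
     * DMA-mapped scatterlist usable by 'dev'.
     */
    int get_user_dma_sg(struct device *dev, unsigned long uaddr, size_t size,
                        enum dma_data_direction dir, struct sg_table *sgt);

    /* Matching release once the DMA has completed. */
    void put_user_dma_sg(struct device *dev, struct sg_table *sgt,
                         enum dma_data_direction dir);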




Re: Enabling peer to peer device transactions for PCIe devices

2016-11-22 Thread Serguei Sagalovitch



On 2016-11-22 03:10 PM, Daniel Vetter wrote:

On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams <dan.j.willi...@intel.com> wrote:

On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch
<serguei.sagalovi...@amd.com> wrote:

I personally like the "device-DAX" idea but my concerns are:

-  How well will it co-exist with the DRM infrastructure / implementations
in the part dealing with CPU pointers?

Inside the kernel a device-DAX range is "just memory" in the sense
that you can perform pfn_to_page() on it and issue I/O, but the vma is
not migratable. To be honest I do not know how well that co-exists
with drm infrastructure.


-  How well will we be able to handle the case when we need to "move"/"evict"
memory/data to a new location, so the CPU pointer should point to the new
physical location/address
 (and maybe not in PCI device memory at all)?

So, device-DAX deliberately avoids support for in-kernel migration or
overcommit. Those cases are left to the core mm or drm. The device-dax
interface is for cases where all that is needed is a direct-mapping to
a statically-allocated physical-address range be it persistent memory
or some other special reserved memory range.

For some of the fancy use-cases (e.g. to be comparable to what HMM can
pull off) I think we want all the magic in core mm, i.e. migration and
overcommit. At least that seems to be the very strong drive in all
general-purpose gpu abstractions and implementations, where memory is
allocated with malloc, and then mapped/moved into vram/gpu address
space through some magic,

It is possible that there is another way around: memory is requested to be
allocated and should be kept in vram for performance reasons, but due to a
possible overcommit case we need, at least temporarily, to "move" such an
allocation to system memory.

  but still visible on both the cpu and gpu
side in some form. A special device to allocate memory, and not being
able to migrate stuff around, sounds like misfeatures from that pov.
-Daniel
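
For reference, the way a device-DAX range as Dan describes is consumed from
user space is essentially a plain mmap() of the DAX character device (a
minimal sketch, assuming an instance exposed as e.g. /dev/dax0.0 and a length
matching the device's alignment):

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Map a statically-allocated physical range exposed via device-DAX. */
    static void *map_device_dax(size_t len)
    {
        int fd = open("/dev/dax0.0", O_RDWR);   /* example device node */
        void *p;

        if (fd < 0)
            return NULL;

        /* The resulting vma is a direct mapping and is not migratable. */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return p == MAP_FAILED ? NULL : p;
    }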



