Re: Enabling peer to peer device transactions for PCIe devices
On 2017-01-05 08:58 PM, Jerome Glisse wrote:
> On Thu, Jan 05, 2017 at 05:30:34PM -0700, Jason Gunthorpe wrote:
>> On Thu, Jan 05, 2017 at 06:23:52PM -0500, Jerome Glisse wrote:
>>>> I still don't understand what you are driving at - you've said in both cases a user VMA exists.
>>> In the former case no, there is no VMA directly, but if you want one then a device can provide one. But such a VMA is useless, as CPU access is not expected.
>> I disagree that it is useless: the VMA is going to be necessary to support upcoming things like CAPI, and you need it to support O_DIRECT from the filesystem, DPDK, etc. This is why I am opposed to any model that is not VMA-based for setting up RDMA - that is short-sighted and does not seem to reflect where the industry is going. So focus on having VMAs backed by actual physical memory that cover your GPU objects, and ask how we wire up the '__user *' to the DMA API in the best way, so that the DMA API still has enough information to set up IOMMUs and whatnot.
> I am talking about two different things. The first is existing hardware and APIs where you _do not_ have a vma and you do not need one. This is just existing stuff.

I do not understand why you assume that the existing APIs don't need one. I would say that a lot of __existing__ user-level APIs, and their support in the kernel (especially outside of the graphics domain), assume that we have a vma and deal with '__user *' pointers.

> Some closed drivers provide such functionality on top of this design. The question is: do we want to do the same? If yes, and you insist on having a vma, we could provide one, but this does not apply and is useless for where we are going with new hardware. With new hardware you just use malloc or mmap to allocate memory and then you use it directly with the device. The device driver can migrate any part of the process address space to device memory. In this scheme you have your usual VMAs, but there is nothing special about them.

Assuming that the whole device memory is CPU accessible - and that looks like the direction we are going:
- You forgot about the use case where we want or need to allocate memory directly on the device (why migrate anything if it is not needed?).
- We may want to use the CPU to access such memory on the device, to avoid any unnecessary migration back.
- We may have more device memory than system memory. E.g. 12 GPUs with 64 GB each already gives us ~0.7 TB, not counting NVDIMM cards, which could also be used as memory storage for other devices to access.
- We also may want/need to share GPU memory between different processes.

> Now when you try to do get_user_pages() on any page that is inside the device, it will fail, because we do not allow any device memory to be pinned. There are various reasons for that, and they are not going away in any hardware in the planning (so for the next few years). Still, we do want to support peer-to-peer mapping. The plan is to only do so with ODP-capable hardware. We still need to solve the IOMMU issue, and it needs special handling inside the RDMA device. The way it works is that RDMA asks for a GPU page, and the GPU checks whether it has room inside its PCI BAR to map this page for the device; this can fail. If it succeeds, then you need the IOMMU to let the RDMA device access the GPU PCI BAR. So here we have two orthogonal problems: the first is how to make two drivers talk to each other to set up a mapping that allows peer to peer, and the second is about the IOMMU.

I think there is a third problem: a lot of existing user-level APIs (MPI, IB Verbs, file I/O, etc.) deal with pointers to buffers. Ideally we would support use cases where those buffers are located in device memory, avoiding any unnecessary migration / double-buffering. Currently a lot of infrastructure in the kernel assumes a user pointer and calls get_user_pages() to get an s/g list. What is your opinion on how this should be changed to deal with cases where the "buffer" is in device memory?
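For reference, the kernel path under discussion looks roughly like the following - a minimal sketch of the usual get_user_pages()-to-scatterlist flow for system memory, written against circa-2016 kernel signatures (get_user_pages_fast() changed later); "dev", "addr" and "len" are placeholders, not any proposed API. It is exactly the pinning step that fails for device memory:

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>
#include <linux/slab.h>

/* Sketch only: pin a __user buffer and build a DMA-mapped s/g table,
 * as most non-ODP drivers do today. */
static int pin_user_buffer(struct device *dev, unsigned long addr,
                           size_t len, struct sg_table *sgt)
{
        unsigned int nr_pages = DIV_ROUND_UP(offset_in_page(addr) + len,
                                             PAGE_SIZE);
        struct page **pages;
        int pinned, ret, i;

        pages = kmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
        if (!pages)
                return -ENOMEM;

        /* The step under discussion: this fails (or must be forbidden)
         * when the VMA covers device memory that may not be pinned. */
        pinned = get_user_pages_fast(addr, nr_pages, 1 /* write */, pages);
        if (pinned < 0) {
                ret = pinned;
                goto free_pages;
        }
        if (pinned != nr_pages) {
                ret = -EFAULT;
                goto put_pages;
        }

        ret = sg_alloc_table_from_pages(sgt, pages, nr_pages,
                                        offset_in_page(addr), len, GFP_KERNEL);
        if (ret)
                goto put_pages;

        /* IOMMU programming happens here, one device at a time. */
        if (!dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL)) {
                ret = -EIO;
                sg_free_table(sgt);
                goto put_pages;
        }

        kfree(pages);           /* the sg table holds the page pointers now */
        return 0;

put_pages:
        for (i = 0; i < pinned; i++)
                put_page(pages[i]);
free_pages:
        kfree(pages);
        return ret;
}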
Re: Enabling peer to peer device transactions for PCIe devices
On 2017-01-05 07:30 PM, Jason Gunthorpe wrote:
> but I am opposed to the idea we need two API paths that the *driver* has to figure out. That is fundamentally not what I want as a driver developer. Give me a common API to convert '__user *' to a scatter list and pin the pages.

Completely agreed. IMHO there is no sense in duplicating the same logic everywhere, or in trying to find all the places where it is missing.

Sincerely yours,
Serguei Sagalovitch
Re: Enabling peer to peer device transactions for PCIe devices
On 2016-11-30 11:23 AM, Jason Gunthorpe wrote:
> Yes, that sounds fine. Can we simply kill the process from the GPU driver? Or do we need to extend the OOM killer to manage GPU pages? I don't know..

We could use send_sig_info() to send a signal from the kernel to user space, so theoretically the GPU driver could issue a KILL signal to some process.

On Wed, Nov 30, 2016 at 12:45:58PM +0200, Haggai Eran wrote:
> I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API support for peer to peer. I'm not sure we need vmap. We need a way to have a scatterlist of MMIO pfns, and ZONE_DEVICE allows that.

I do not think that using the DMA-API as it is (at least in its current form) is the best solution:
- It deals with handles/fds for the whole allocation, but a client could/will use sub-allocation, and it is theoretically possible to "merge" several allocations into one from the GPU's perspective.
- It requires knowing in advance what to export, but because "sharing" is controlled from user space, it means that we must "export" every allocation by default.
- It deals with fds/handles, but a user application may work with addresses/pointers.

Also, the current DMA-API forces us to redo all the DMA table programming each time, regardless of whether the location changed or not. With a vma / mmu notifier we are able to intercept changes in location and update the translation tables only as needed (we do not need to hold the get_user_pages() pin).
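On the "kill from the GPU driver" point, a minimal sketch of what that could look like on 4.x-era kernels, assuming the driver already holds a valid reference to the offending task_struct (whether this is acceptable policy is exactly the open question above):

#include <linux/sched.h>
#include <linux/signal.h>

/* Sketch only: forcibly kill the process owning an unreclaimable GPU
 * allocation. "task" is assumed to be a valid, referenced task. */
static void gpu_oom_kill_task(struct task_struct *task)
{
        /* SEND_SIG_PRIV marks the signal as originating in the kernel
         * rather than in another user process. */
        send_sig_info(SIGKILL, SEND_SIG_PRIV, task);
}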
Re: Enabling peer to peer device transactions for PCIe devices
On 2016-11-28 04:36 PM, Logan Gunthorpe wrote:
> On 28/11/16 12:35 PM, Serguei Sagalovitch wrote:
>> As soon as the PeerDirect mapping is created, the GPU must not "move" such memory. That is by PeerDirect design. It is similar to how it works with system memory and an RDMA MR: when get_user_pages() is called, the memory is pinned.
> We haven't touched this in a long time and perhaps it changed, but there definitely was a callback in the PeerDirect API to allow the GPU to invalidate the mapping. That's what we don't want.

I assume that you are talking about the invalidate_peer_memory() callback? I was told that it is the "last resort", because the HCA (and driver) cannot handle it in a safe manner, so it basically "aborts" everything.
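For readers without the OFED sources at hand, the PeerDirect client interface being discussed looks roughly like the following. This is a paraphrase from memory of Mellanox's peer_mem.h; the exact field list and signatures vary between releases, so treat it as illustrative only:

/* Illustrative paraphrase of the PeerDirect (peer_mem) interface --
 * check the actual OFED headers before relying on any of this. */
typedef int (*invalidate_peer_memory)(void *reg_handle, u64 core_context);

struct peer_memory_client {
        char name[IB_PEER_MEMORY_NAME_MAX];
        char version[IB_PEER_MEMORY_VER_MAX];
        /* Claim ownership of a user address range. */
        int (*acquire)(unsigned long addr, size_t size,
                       void *peer_mem_private_data, char *peer_mem_name,
                       void **client_context);
        /* Pin the range and fill an s/g table -- after this the GPU
         * must not move the memory; see the discussion above. */
        int (*get_pages)(unsigned long addr, size_t size, int write,
                         int force, struct sg_table *sg_head,
                         void *client_context, u64 core_context);
        int (*dma_map)(struct sg_table *sg_head, void *client_context,
                       struct device *dma_device, int dmasync, int *nmap);
        int (*dma_unmap)(struct sg_table *sg_head, void *client_context,
                         struct device *dma_device);
        void (*put_pages)(struct sg_table *sg_head, void *client_context);
        unsigned long (*get_page_size)(void *client_context);
        void (*release)(void *client_context);
};

/* Registration hands back the "last resort" invalidate callback
 * discussed above; calling it tears down the MR on the HCA side. */
void *ib_register_peer_memory_client(const struct peer_memory_client *peer_client,
                                     invalidate_peer_memory *invalidate_callback);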
Re: Enabling peer to peer device transactions for PCIe devices
On 2016-11-28 01:20 PM, Logan Gunthorpe wrote:
> On 28/11/16 09:57 AM, Jason Gunthorpe wrote:
>>> On PeerDirect, we have some kind of a middle-ground solution for pinning GPU memory. We create a non-ODP MR pointing to VRAM but rely on user-space and the GPU not to migrate it. If they do, the MR gets destroyed immediately.
>> That sounds horrible. How can that possibly work? What if the MR is being used when the GPU decides to migrate? I would not support that upstream without a lot more explanation..
> Yup, this was our experience when playing around with PeerDirect. There was nothing we could do if the GPU decided to invalidate the P2P mapping.

As soon as the PeerDirect mapping is created, the GPU must not "move" such memory. That is by PeerDirect design. It is similar to how it works with system memory and an RDMA MR: when get_user_pages() is called, the memory is pinned.
Re: Enabling peer to peer device transactions for PCIe devices
On 2016-11-27 09:02 AM, Haggai Eran wrote:
> On PeerDirect, we have some kind of a middle-ground solution for pinning GPU memory. We create a non-ODP MR pointing to VRAM but rely on user-space and the GPU not to migrate it. If they do, the MR gets destroyed immediately. This should work on legacy devices without ODP support, and allows the system to safely terminate a process that misbehaves. The downside of course is that it cannot transparently migrate memory, but I think for user-space RDMA doing that transparently requires hardware support for paging, via something like HMM.
> ...

Maybe I am wrong, but my understanding is that the PeerDirect logic basically follows the "RDMA register MR" logic, so nothing prevents us from "terminating" a process in the "MMU notifier" case when we are very low on memory, making it similar to (and not worse than) the PeerDirect case.

>> I'm hearing most people say ZONE_DEVICE is the way to handle this, which means the missing remaining piece for RDMA is some kind of DMA core support for p2p address translation..
> Yes, this is definitely something we need. I think Will Davis's patches are a good start. Another thing I think is that while HMM is good for user-space applications, for kernel p2p use there is no need for it.

About HMM: I do not think that HMM in its current form fits the requirements of the generic P2P transfer case. My understanding is that at the current stage HMM is good for "caching" system memory in device memory for fast GPU access, but in the RDMA MR non-ODP case it will not work, because the location of the memory must not change, so the memory should be allocated directly in PCIe memory.

> Using ZONE_DEVICE, with or without something like DMA-BUF, to pin and unpin pages for a short duration as you wrote above could work fine for kernel uses in which we can guarantee they are short.

Potentially there is another issue related to pin/unpin: if the memory is used many times, there is no sense in rebuilding and reprogramming the s/g tables each time if the location of the memory has not changed.
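To make the ZONE_DEVICE option concrete: giving a PCI BAR struct pages, so that pfn_to_page() and scatterlists work on it, looks roughly like this on ~4.9-era kernels. The devm_memremap_pages() signature has changed across kernel versions, and percpu_ref setup and error handling are omitted; "pdev" and "bar" are placeholders:

#include <linux/memremap.h>
#include <linux/pci.h>

/* Sketch: create struct pages (ZONE_DEVICE) covering a device BAR so
 * the MMIO range can appear in a scatterlist. */
static void *bar_to_zone_device(struct pci_dev *pdev, int bar,
                                struct percpu_ref *ref)
{
        struct resource *res = &pdev->resource[bar];

        return devm_memremap_pages(&pdev->dev, res, ref, NULL);
}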
Re: Enabling peer to peer device transactions for PCIe devices
On 2016-11-25 03:26 PM, Felix Kuehling wrote:
> On 16-11-25 12:20 PM, Serguei Sagalovitch wrote:
>>> A white list may end up being rather complicated if it has to cover different CPU generations and system architectures. I feel this is a decision user space could easily make.
>>> Logan
>> I agree that it is better to leave it up to user space to check what is working and what is not. I found that writes practically always work, but reads very often do not. Also, sometimes a system BIOS update can fix the issue.
> But is user mode always aware that P2P is going on, or even possible? For example, you may have a library reading a buffer from a file, but it doesn't necessarily know where that buffer is located (system memory, VRAM, ...), and it may not know what kind of device the file is on (SATA drive, NVMe SSD, ...). The library will never know, if all it gets is a pointer and a file descriptor. The library ends up calling a read system call. Then it would be up to the kernel to figure out the most efficient way to read the buffer from the file. If supported, it could use P2P between a GPU and an NVMe drive, where the NVMe device performs a DMA write to VRAM. If you put the burden of figuring out the P2P details on user-mode code, I think it will severely limit the use cases that actually take advantage of it. You also risk a bunch of different implementations that get it wrong half the time on half the systems out there.
> Regards,
>   Felix

I agree with you in theory, but I must admit that I do not know how the kernel could effectively collect all this information without running pretty complicated tests on each boot-up (whenever any configuration changes, including BIOS settings) and on PnP events. Also, to choose the most efficient path, the kernel needs to know the performance results (which may also depend on clock / power mode) for reads/writes between each pair of devices; for double-buffering it needs to know / detect which NUMA node to allocate on, etc. etc. Also, a device may be fully configured only on the first access request, so initialization sequences may need to change.
Re: Enabling peer to peer device transactions for PCIe devices
On 2016-11-25 02:34 PM, Jason Gunthorpe wrote:
> On Fri, Nov 25, 2016 at 12:16:30PM -0500, Serguei Sagalovitch wrote:
>> b) Allocation may not have CPU address at all - only GPU one.
> But you don't expect RDMA to work in that case, right? GPU people need to stop doing this windowed memory stuff :)

The GPU itself can perfectly well access all of VRAM; the window is only an issue for p2p (without a special interconnect) and for CPU access. Strictly speaking, as long as we have a "bus address" we could do RDMA, but I agree that for RDMA we could/should(?) always "request" a CPU address (I hope that we can forget about 32-bit applications :-)).

BTW/FYI, about CPU access: some user-level APIs are mainly handle-based, so there is no need for CPU access by default.

About "visible" / non-visible parts of VRAM: I assume that going forward we will be able to get rid of the split completely, as soon as support for resizable PCI BARs is implemented and/or old/current hardware becomes obsolete.
Re: Enabling peer to peer device transactions for PCIe devices
On 2016-11-25 08:22 AM, Christian König wrote:
>> Serguei, what is your plan in GPU land for migration? Ie if I have a CPU mapped page and the GPU moves it to VRAM, it becomes non-cachable - do you still allow the CPU to access it? Or do you swap it back to cachable memory if the CPU touches it?
> Depends on the policy in command, but currently it's the other way around most of the time. E.g. we allocate memory in VRAM, the CPU writes to it WC and avoids reading because that is slow, and the GPU in turn can access it at full speed. When we run out of VRAM we move those allocations to system memory and update both the CPU and the GPU page tables, so the move is transparent for both userspace and shaders running on the GPU.

I would like to add more regarding CPU access:

a) We can have a CPU-accessible part of VRAM ("inside" the PCIe BAR window) and a non-CPU-accessible part. As a result, if the user needs CPU access, the memory should be located either in the CPU-accessible part of VRAM or in system memory. The application / user-mode driver can specify location preferences/hints based on its assumptions / knowledge about access-pattern requirements, game resolution, the size of VRAM, etc. So if CPU access performance is critical, such memory should be allocated in system memory as the first (and maybe only) choice.

b) An allocation may not have a CPU address at all - only a GPU one. Also, we may not be able to have CPU addresses/accesses for all of VRAM, but the memory may still be migrated regardless of whether we have a CPU address or not.

c) Regarding "VRAM, it becomes non-cachable": strictly speaking, VRAM is configured as WC (write-combined memory) to provide fast CPU write access. It was also found that when CPU access is not performance-critical, it can be useful to allocate/program system memory as WC as well, to avoid the extra "snooping" needed to synchronize with CPU caches during GPU access. So potentially system memory could be WC too.
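For point (c), the WC mapping of the CPU-visible BAR is the standard ioremap_wc() path - a minimal sketch, with "pdev" and "bar" as placeholders:

#include <linux/io.h>
#include <linux/pci.h>

/* Sketch: map the CPU-visible VRAM aperture write-combined. Writes are
 * buffered and combined (fast); reads bypass the cache (slow) -- hence
 * the "CPU writes, avoids reading" policy described above. */
static void __iomem *map_vram_aperture_wc(struct pci_dev *pdev, int bar)
{
        return ioremap_wc(pci_resource_start(pdev, bar),
                          pci_resource_len(pdev, bar));
}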
Re: Enabling peer to peer device transactions for PCIe devices
> Well, I guess there's some consensus building to do. The existing options are:
> * Device DAX: which could work, but the problem I see with it is that it only allows one application to do these transfers. Or there would have to be some user-space coordination to figure out which application gets what memory.

About the one-application restriction: is it per memory mapping? I assume it should not be a problem for one application to do transfers to several devices simultaneously - am I right? Maybe we should follow the RDMA MR design and register memory for p2p transfers from user space? What about the following (a rough sketch of such an interface follows below):

a) A device-DAX instance is created.
b) A "normal" (movable, etc.) allocation is done for PCIe memory, and a CPU pointer / CPU access is requested.
c) p2p_mr_register() is called and the CPU pointer (from mmap() on the DAX device) is returned. Accordingly, such memory is marked as "unmovable" by e.g. the graphics driver.
d) When p2p is no longer needed, p2p_mr_unregister() is called.

What do you think? Will it work?
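To make the proposal concrete, the hypothetical interface could look like this. Note that p2p_mr_register() / p2p_mr_unregister() do not exist anywhere today; this only sketches the shape of steps (b)-(d):

/* Hypothetical -- a sketch of the interface proposed above, no more. */
struct p2p_mr;

/* Pin the device pages backing [addr, addr + size), where addr is the
 * CPU pointer obtained by mmap()ing the device-DAX node; the exporting
 * (e.g. graphics) driver marks the range unmovable until unregister. */
struct p2p_mr *p2p_mr_register(void __user *addr, size_t size);

/* Drop the pin, so the exporting driver may migrate the memory again. */
void p2p_mr_unregister(struct p2p_mr *mr);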
Re: Enabling peer to peer device transactions for PCIe devices
> A white list may end up being rather complicated if it has to cover different CPU generations and system architectures. I feel this is a decision user space could easily make.
> Logan

I agree that it is better to leave it up to user space to check what is working and what is not. I found that writes practically always work, but reads very often do not. Also, sometimes a system BIOS update can fix the issue.
Re: Enabling peer to peer device transactions for PCIe devices
On 2016-11-24 11:26 AM, Jason Gunthorpe wrote:
> On Thu, Nov 24, 2016 at 10:45:18AM +0100, Christian König wrote:
>> On 24.11.2016 at 00:25, Jason Gunthorpe wrote:
>>> There is certainly nothing about the hardware that cares about ZONE_DEVICE vs System memory.
>> Well, that is clearly not so simple. When your ZONE_DEVICE pages describe a PCI BAR and another PCI device initiates a DMA to this address, the DMA subsystem must be able to check whether the interconnection really works.
> I said the hardware doesn't care.. You are right, we still have an outstanding problem in Linux of how to generically DMA map a P2P address - which is a different issue from getting the P2P address from a __user pointer...
> Jason

I agree, but the problem is that each issue immediately introduces another one to solve, and so on (if we do not want to cut corners). I would think that many of them are interconnected, because the way one problem is solved may affect the solution to another.

BTW, about "DMA map a p2p address": right now, to enable p2p between devices it is required/recommended to disable IOMMU support (e.g. the Intel IOMMU driver has special logic for graphics, with the comment "Reserve all PCI MMIO to avoid peer-to-peer access").
Re: Enabling peer to peer device transactions for PCIe devices
On 2016-11-23 02:12 PM, Jason Gunthorpe wrote:
> On Wed, Nov 23, 2016 at 10:40:47AM -0800, Dan Williams wrote:
>> I don't think that was designed for the case where the backing memory is a special/static physical address range rather than anonymous "System RAM", right?
> The hardware doesn't care where the memory is. ODP is just a generic mechanism to provide demand-fault behavior for a mirrored page table. ODP has the same issue as everything else: it needs to translate a page table entry into a DMA address, and we have no API to do that when the page table points to peer-peer memory.
> Jason

I would like to note that for graphics applications (especially for VR support) we should avoid the ODP case at any cost during graphics command execution, due to the requirement for smooth and predictable playback. We want to load / "pin" all required resources before the graphics processor begins to touch them. This is not as critical for compute applications. Because only the graphics / compute stack knows which resources will be in use, as well as all the statistics, only the graphics stack is capable of making the correct decision about when and _where_ to evict, as well as when and _where_ to put memory back.
Re: Enabling peer to peer device transactions for PCIe devices
On 2016-11-23 03:51 AM, Christian König wrote, quoting an exchange between Daniel Vetter, Dan Williams and Serguei Sagalovitch (2016-11-22/23):

> [Serguei Sagalovitch] I personally like the "device-DAX" idea, but my concerns are:
> - How well will it co-exist with the DRM infrastructure / implementations, in the part dealing with CPU pointers?

> [Dan Williams] Inside the kernel a device-DAX range is "just memory" in the sense that you can perform pfn_to_page() on it and issue I/O, but the vma is not migratable. To be honest I do not know how well that co-exists with drm infrastructure.

> [Serguei Sagalovitch] - How well will we be able to handle the case when we need to "move"/"evict" memory/data to a new location, so that the CPU pointer points to the new physical location/address (which may not be in PCI device memory at all)?

> [Dan Williams] So, device-DAX deliberately avoids support for in-kernel migration or overcommit. Those cases are left to the core mm or drm. The device-dax interface is for cases where all that is needed is a direct mapping to a statically-allocated physical-address range, be it persistent memory or some other special reserved memory range.

> [Daniel Vetter] For some of the fancy use-cases (e.g. to be comparable to what HMM can pull off) I think we want all the magic in core mm, i.e. migration and overcommit. At least that seems to be the very strong drive in all general-purpose gpu abstractions and implementations, where memory is allocated with malloc, and then mapped/moved into vram/gpu address space through some magic,

> [Serguei Sagalovitch] It is possible that it is the other way around: memory is requested to be allocated and should be kept in vram for performance reasons, but due to a possible overcommit case we need to, at least temporarily, "move" such an allocation to system memory.

> [Daniel Vetter] With migration I meant migrating both ways of course. And with stuff like numactl we can also influence where exactly the malloc'ed memory is allocated originally, at least if we'd expose the vram range as a very special numa node that happens to be far away and not hold any cpu cores.

> [Dan Williams] I don't think we should be using numa distance to reverse engineer a certain allocation behavior. The latency data should be truthful, but you're right we'll need a mechanism to keep general-purpose allocations out of that range by default. Btw, strict isolation is another design point of device-dax, but I think in this case we're describing something between the two extremes of full isolation and full compatibility with existing numactl apis.

> [Daniel Vetter] Yes, agreed. My idea with exposing vram sections using numa nodes wasn't to reuse all the existing allocation policies directly, those won't work. So at boot-up your default numa policy would exclude any vram nodes. But I think (as an -mm layman) that numa gives us a lot of the tools and policy interface that we need to implement what we want for gpus.

> [Christian König] Agree completely. From a ten-mile-high view our GPUs are just command processors with local memory as well. Basically this is also the whole idea of what AMD has been pushing with HSA for a while. It's just that a lot of problems start to pop up when you look at all the nasty details. For example, only part of the GPU memory is usually accessible by the CPU. So even though numa nodes provide a good foundation for this, I think there is still a lot of code to write. BTW: I should probably start to read into the numa code of the kernel. Any good pointers for that?

I would assume that the "page" allocation logic itself should be inside the graphics driver, due to possibly different requirements, especially from graphics: alignment, etc.

> [Daniel Vetter] Wrt isolation: there's a sliding scale of what different users expect, from full-auto everything (including migrating pages around if needed) to full isolation; all of it seems to be on the table. As long as we keep vram nodes out of any default allocation numasets, full isolation should be possible.

Sincerely yours,
Serguei Sagalovitch
Re: Enabling peer to peer device transactions for PCIe devices
On 2016-11-23 12:27 PM, Bart Van Assche wrote:
> On 11/23/2016 09:13 AM, Logan Gunthorpe wrote:
>> IMO any memory that has been registered for a P2P transaction should be locked from being evicted. So if there's a get_user_pages call, it needs to stay pinned until the put_page. The main issue is with the RDMA case: handling an eviction when a chunk of memory has been registered as an MR would be very tricky. The MR may be relied upon by another host, and the kernel would have to inform user-space that the MR was invalid, and then user-space would have to tell the remote application.
> Hello Logan,
> Are you aware that the Linux kernel already supports ODP (On Demand Paging)? See also the output of git grep -nHi on.demand.paging. See also https://www.openfabrics.org/images/eventpresos/workshops2014/DevWorkshop/presos/Tuesday/pdf/04_ODP_update.pdf.
> Bart.

My understanding is that the main problems are (a) hardware support and (b) compatibility with IB Verbs semantics.
Re: Enabling peer to peer device transactions for PCIe devices
On 2016-11-23 02:05 PM, Jason Gunthorpe wrote:
> On Wed, Nov 23, 2016 at 10:13:03AM -0700, Logan Gunthorpe wrote:
>> ... an MR would be very tricky. The MR may be relied upon by another host, and the kernel would have to inform user-space the MR was invalid, then user-space would have to tell the remote application.
> As Bart says, it would be best to combine this with something like Mellanox's ODP MRs, which allow a page to be evicted and then trigger a CPU interrupt if a DMA is attempted, so it can be brought back.

Please note that in the general case (including the MR one) we could get a "page fault" from a different PCIe device, so all PCIe devices must be synchronized.

> This includes the usual fencing mechanism, so the CPU can block, flush, and then evict a page coherently. This is the general direction the industry is going in: link PCI DMA directly to dynamic user page tables, including support for demand faulting and synchronicity. Mellanox ODP is a rough implementation of mirroring a process's page table via the kernel, while IBM's CAPI (and CCIX, PCI ATS?) is probably a good example of where this is ultimately headed. CAPI allows a PCI DMA to directly target an ASID associated with a user process and then use the usual CPU machinery to do the page translation for the DMA. This includes page faults for evicted pages, and obviously allows eviction and migration..
> So, of all the solutions in the original list, I would discard anything that isn't VMA focused. Emulating what CAPI does in hardware with software is probably the best choice, or we have to do it all again when CAPI-style hardware broadly rolls out :(
> DAX and GPU allocators should create VMAs and manipulate them in the usual way to achieve migration, windowing, cache, movement or swap of the potentially peer-peer memory pages. They would have to respect the usual rules for a VMA, including pinning. DMA drivers would use the usual approaches for dealing with DMA from a VMA: short-term pin or long-term coherent translation mirror.
> So, to my view (looking from RDMA), the main problem with peer-peer is: how do you DMA translate VMAs that point at non-struct-page memory? Does HMM solve the peer-peer problem? Does it do it generically, or only for drivers that are mirroring translation tables?

In its current form HMM doesn't solve the peer-peer problem. Currently it allows "mirroring" of "malloc" memory on the GPU, which is not always what is needed. Additionally, there is a need to be able to share VRAM allocations between different processes.

> From a RDMA perspective we could use something other than get_user_pages() to pin and DMA translate a VMA, if the core community could decide on an API. E.g. get_user_dma_sg() would probably be quite usable.
> Jason
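A possible shape for the get_user_dma_sg() idea - hypothetical, as this function does not exist; it only illustrates what such an API might take and return:

/* Hypothetical, per the suggestion above: like get_user_pages(), but
 * resolves a VMA -- including one backed by non-struct-page / peer
 * memory -- directly into a DMA-mapped s/g table for dma_dev. */
int get_user_dma_sg(struct device *dma_dev, unsigned long start,
                    size_t len, int write, struct sg_table *sgt);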
Re: Enabling peer to peer device transactions for PCIe devices
On 2016-11-22 03:10 PM, Daniel Vetter wrote:
> On Tue, Nov 22, 2016 at 9:01 PM, Dan Williams wrote:
>> On Tue, Nov 22, 2016 at 10:59 AM, Serguei Sagalovitch wrote:
>>> I personally like the "device-DAX" idea, but my concerns are:
>>> - How well will it co-exist with the DRM infrastructure / implementations, in the part dealing with CPU pointers?
>> Inside the kernel a device-DAX range is "just memory" in the sense that you can perform pfn_to_page() on it and issue I/O, but the vma is not migratable. To be honest I do not know how well that co-exists with drm infrastructure.
>>> - How well will we be able to handle the case when we need to "move"/"evict" memory/data to a new location, so that the CPU pointer points to the new physical location/address (which may not be in PCI device memory at all)?
>> So, device-DAX deliberately avoids support for in-kernel migration or overcommit. Those cases are left to the core mm or drm. The device-dax interface is for cases where all that is needed is a direct mapping to a statically-allocated physical-address range, be it persistent memory or some other special reserved memory range.
> For some of the fancy use-cases (e.g. to be comparable to what HMM can pull off) I think we want all the magic in core mm, i.e. migration and overcommit. At least that seems to be the very strong drive in all general-purpose gpu abstractions and implementations, where memory is allocated with malloc, and then mapped/moved into vram/gpu address space through some magic,

It is possible that it is the other way around: memory is requested to be allocated and should be kept in vram for performance reasons, but due to a possible overcommit case we need to, at least temporarily, "move" such an allocation to system memory.

> but still visible on both the cpu and gpu side in some form. Special device to allocate memory, and not being able to migrate stuff around, sound like misfeatures from that pov.
> -Daniel
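For readers unfamiliar with device-DAX, user-space consumption is just open() plus mmap() of the character device - a minimal sketch, where /dev/dax0.0 is the typical node name and the required alignment depends on the configured fault granularity:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t len = 2UL << 20;   /* device-dax mappings must be aligned */
        int fd = open("/dev/dax0.0", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        /* Direct mapping of the statically reserved range; the kernel
         * will not migrate these pages -- the "not migratable" property
         * discussed above. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        /* ... hand p to a driver that pins it for P2P ... */

        munmap(p, len);
        close(fd);
        return 0;
}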