Re: Enabling peer to peer device transactions for PCIe devices

2016-12-04 Thread Haggai Eran
On 11/30/2016 6:23 PM, Jason Gunthorpe wrote:
>> and O_DIRECT operations that access GPU memory.
> This goes through user space so there is still a VMA..
> 
>> Also, HMM's migration between two GPUs could use peer to peer in the
>> kernel, although that is intended to be handled by the GPU driver if
>> I understand correctly.
> Hum, presumably these migrations are VMA backed as well...
I guess so.

>>> Presumably in-kernel could use a vmap or something and the same basic
>>> flow?
>> I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API
>> support for peer to peer. I'm not sure we need vmap. We need a way to have
>> a scatterlist of MMIO pfns, and ZONE_DEVICE allows that.
> Well, if there is no virtual map then we are back to how do you do
> migrations and other things people seem to want to do on these
> pages. Maybe the loose 'struct page' flow is not for those users.
I was thinking that kernel use cases would disallow migration, similar to how
non-ODP MRs would work. Either they are short-lived (like an O_DIRECT transfer)
or they can be long-lived but non-migratable (like perhaps a CMB staging
buffer).

> But I think if you want kGPU or similar then you probably need vmaps
> or something similar to represent the GPU pages in kernel memory.
Right, although sometimes the GPU pages are simply inaccessible to the CPU.
In any case, I haven't thought about kGPU as a use-case.


Re: Enabling peer to peer device transactions for PCIe devices

2016-12-04 Thread Haggai Eran
On 11/30/2016 8:01 PM, Logan Gunthorpe wrote:
> 
> 
> On 30/11/16 09:23 AM, Jason Gunthorpe wrote:
>>> Two cases I can think of are RDMA access to an NVMe device's controller
>>> memory buffer,
>>
>> I'm not sure on the use model there..
> 
> The NVMe fabrics stuff could probably make use of this. It's an
> in-kernel system to allow remote access to an NVMe device over RDMA. So
> they ought to be able to optimize their transfers by DMAing directly to
> the NVMe's CMB -- no userspace interface would be required but there
> would need to be some kernel infrastructure.

Yes, that's what I was thinking. The NVMe/f driver needs to map the CMB for
RDMA. I guess if it used ZONE_DEVICE like in the iopmem patches it would be
relatively easy to do.
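
A rough sketch of what that could look like, along the lines of the iopmem
patches; devm_memremap_pages() is real, but its signature has changed across
kernel versions, and the CMB BAR/offset plumbing here is made up for
illustration:

#include <linux/memremap.h>
#include <linux/pci.h>

/*
 * Give an NVMe controller memory buffer ZONE_DEVICE struct pages so it
 * can be described by scatterlists. 'bar', 'offset' and 'size' stand in
 * for whatever the driver reads out of CMBLOC/CMBSZ; 'ref' tracks page
 * lifetime as devm_memremap_pages() requires.
 */
static void *map_cmb_as_zone_device(struct pci_dev *pdev, int bar,
                                    resource_size_t offset,
                                    resource_size_t size,
                                    struct percpu_ref *ref)
{
        struct resource res = {
                .start = pci_resource_start(pdev, bar) + offset,
                .end   = pci_resource_start(pdev, bar) + offset + size - 1,
                .flags = IORESOURCE_MEM,
                .name  = "nvme-cmb",
        };

        /* ~4.9-era signature: (dev, res, percpu_ref, altmap). */
        return devm_memremap_pages(&pdev->dev, &res, ref, NULL);
}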



Re: Enabling peer to peer device transactions for PCIe devices

2016-12-04 Thread Haggai Eran
On 11/30/2016 7:28 PM, Serguei Sagalovitch wrote:
> On 2016-11-30 11:23 AM, Jason Gunthorpe wrote:
>>> Yes, that sounds fine. Can we simply kill the process from the GPU driver?
>>> Or do we need to extend the OOM killer to manage GPU pages?
>> I don't know..
> We could use send_sig_info to send signal from kernel to user space. So
> theoretically GPU driver could issue KILL signal to some process.
> 
>> On Wed, Nov 30, 2016 at 12:45:58PM +0200, Haggai Eran wrote:
>>> I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API
>>> support for peer to peer. I'm not sure we need vmap. We need a way to have
>>> a scatterlist of MMIO pfns, and ZONE_DEVICE allows that.
> I do not think that using DMA-API as it is is the best solution (at least in 
> the current form):
> 
> - It deals with handles/fd for the whole allocation, but a client could/will
>   use sub-allocation, and it is theoretically possible to "merge" several
>   allocations into one from the GPU perspective.
> - It requires knowledge to export, but because "sharing" is controlled from
>   user space it means that we must "export" all allocations by default.
> - It deals with 'fd'/handles but the user application may work with
>   addresses/pointers.

Aren't you confusing DMABUF and DMA-API? DMA-API is how you program the IOMMU 
(dma_map_page/dma_map_sg/etc.).
The comment above is just about the need to extend this API to allow mapping 
peer device pages to bus addresses.
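
To make the distinction concrete, a minimal DMA-API sequence looks like the
following; nothing GPU-specific, just the normal struct device / struct page
interface with no fds or export handles involved:

#include <linux/mm.h>
#include <linux/dma-mapping.h>

/* Map a single page for device access and get back a bus/DMA address. */
static int dma_one_page(struct device *dev, struct page *page)
{
        dma_addr_t dma;

        dma = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_TO_DEVICE);
        if (dma_mapping_error(dev, dma))
                return -ENOMEM;

        /* ... program the device with 'dma' and run the transfer ... */

        dma_unmap_page(dev, dma, PAGE_SIZE, DMA_TO_DEVICE);
        return 0;
}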

In the past I sent an RFC for using DMABUF for peer to peer. I think it had some
advantages for legacy devices. I agree that working with addresses and pointers
through something like HMM/ODP is much more flexible and easier to program from
user-space. For legacy, DMABUF would have allowed you a way to pin the pages so
the GPU knows not to move them. However, that can probably also be achieved
simply via the reference count on ZONE_DEVICE pages. The other nice thing about
DMABUF is that it migrates the buffer itself during attachment according to the
requirements of the device that is attaching, so you can automatically decide in
the exporter whether to use p2p or a staging buffer.
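
For reference, the importer side of that DMABUF flow is roughly the sketch
below; the exporter's choice between handing out p2p addresses and migrating to
a staging copy happens behind dma_buf_map_attachment(), and error handling is
trimmed:

#include <linux/dma-buf.h>
#include <linux/err.h>

/*
 * Attach an importing device to a dma-buf and ask for a mapping. The
 * exporter sees which device is attaching and can place the buffer
 * accordingly before returning the sg_table of DMA addresses.
 */
static struct sg_table *import_peer_buffer(struct dma_buf *dmabuf,
                                           struct device *dev,
                                           struct dma_buf_attachment **attach)
{
        *attach = dma_buf_attach(dmabuf, dev);
        if (IS_ERR(*attach))
                return ERR_CAST(*attach);

        return dma_buf_map_attachment(*attach, DMA_BIDIRECTIONAL);
}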

> 
> Also the current DMA-API forces all the DMA table programming to be redone
> each time, regardless of whether the location was changed or not. With vma /
> mmu we are able to install a notifier to intercept changes in location and
> update translation tables only as needed (we do not need to keep the
> get_user_pages() lock).
I agree.


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-30 Thread Haggai Eran
On 11/28/2016 9:02 PM, Jason Gunthorpe wrote:
> On Mon, Nov 28, 2016 at 06:19:40PM +0000, Haggai Eran wrote:
>>>> GPU memory. We create a non-ODP MR pointing to VRAM but rely on
>>>> user-space and the GPU not to migrate it. If they do, the MR gets
>>>> destroyed immediately.
>>> That sounds horrible. How can that possibly work? What if the MR is
>>> being used when the GPU decides to migrate? 
>> Naturally this doesn't support migration. The GPU is expected to pin
>> these pages as long as the MR lives. The MR invalidation is done only as
>> a last resort to keep system correctness.
> 
> That just forces applications to handle horrible unexpected
> failures. If this sort of thing is needed for correctness then OOM
> kill the offending process, don't corrupt its operation.
Yes, that sounds fine. Can we simply kill the process from the GPU driver?
Or do we need to extend the OOM killer to manage GPU pages?
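
If killing from the driver is acceptable, a sketch of how it could be done (the
task lookup is whatever the GPU driver already tracks per context; the
send_sig_info() signature shown is the ~4.9-era one):

#include <linux/sched.h>
#include <linux/signal.h>

/* Ask the kernel to deliver SIGKILL to the owner of a misbehaving context. */
static void kill_offending_process(struct task_struct *task)
{
        struct siginfo info = {
                .si_signo = SIGKILL,
                .si_code  = SI_KERNEL,
        };

        send_sig_info(SIGKILL, &info, task);
}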

> 
>> I think it is similar to how non-ODP MRs rely on user-space today to
>> keep them correct. If you do something like madvise(MADV_DONTNEED) on a
>> non-ODP MR's pages, you can still get yourself into a data corruption
>> situation (HCA sees one page and the process sees another for the same
>> virtual address). The pinning that we use only guarantees the HCA's page
>> won't be reused.
> 
> That is not really data corruption - the data still goes where it was
> originally destined. That is an application violating the
> requirements of a MR. 
I guess it is a matter of terminology. If you compare it to the ODP case
or the CPU case then you usually expect a single virtual address to map to
a single physical page. Violating this causes some of your writes to be
dropped, which is data corruption in my book, even if the application caused it.

> An application cannot munmap/mremap a VMA
> while a non ODP MR points to it and then keep using the MR.
Right. And it is perfectly fine to have some similar requirements from the
application when doing peer to peer with a non-ODP MR.

> That is totally different from a GPU driver wanting to mess with
> translation to physical pages.
> 
>>> From what I understand we are not really talking about kernel p2p,
>>> everything proposed so far is being mediated by a userspace VMA, so
>>> I'd focus on making that work.
> 
>> Fair enough, although we will need both eventually, and I hope the
>> infrastructure can be shared to some degree.
> 
> What use case do you see for in kernel?
Two cases I can think of are RDMA access to an NVMe device's controller 
memory buffer, and O_DIRECT operations that access GPU memory.
Also, HMM's migration between two GPUs could use peer to peer in the kernel,
although that is intended to be handled by the GPU driver if I understand
correctly.

> Presumably in-kernel could use a vmap or something and the same basic
> flow?
I think we can achieve the kernel's needs with ZONE_DEVICE and DMA-API support
for peer to peer. I'm not sure we need vmap. We need a way to have a scatterlist
of MMIO pfns, and ZONE_DEVICE allows that.
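
A sketch of that flow, assuming the MMIO region already has ZONE_DEVICE struct
pages and assuming dma_map_sg() has been taught to produce valid bus addresses
for peer memory, which is exactly the missing DMA-core piece:

#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>

/*
 * Describe peer MMIO memory with an ordinary scatterlist of its
 * ZONE_DEVICE pages and hand it to the DMA-API.
 */
static int map_mmio_pages_for_dma(struct device *dma_dev,
                                  struct page **pages, int npages,
                                  struct sg_table *sgt)
{
        struct scatterlist *sg;
        int i, ret;

        ret = sg_alloc_table(sgt, npages, GFP_KERNEL);
        if (ret)
                return ret;

        for_each_sg(sgt->sgl, sg, npages, i)
                sg_set_page(sg, pages[i], PAGE_SIZE, 0);

        if (!dma_map_sg(dma_dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL)) {
                sg_free_table(sgt);
                return -EIO;
        }
        return 0;
}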

Haggai


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-28 Thread Haggai Eran
On Mon, 2016-11-28 at 09:57 -0700, Jason Gunthorpe wrote:
> On Sun, Nov 27, 2016 at 04:02:16PM +0200, Haggai Eran wrote:
> > I think blocking mmu notifiers against something that is basically
> > controlled by user-space can be problematic. This can block things like
> > memory reclaim. If you have user-space access to the device's queues,
> > user-space can block the mmu notifier forever.
> Right, I mentioned that..
Sorry, I must have missed it.

> > On PeerDirect, we have some kind of a middle-ground solution for pinning
> > GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> > user-space and the GPU not to migrate it. If they do, the MR gets
> > destroyed immediately.
> That sounds horrible. How can that possibly work? What if the MR is
> being used when the GPU decides to migrate? 
Naturally this doesn't support migration. The GPU is expected to pin
these pages as long as the MR lives. The MR invalidation is done only as
a last resort to keep system correctness.

I think it is similar to how non-ODP MRs rely on user-space today to
keep them correct. If you do something like madvise(MADV_DONTNEED) on a
non-ODP MR's pages, you can still get yourself into a data corruption
situation (HCA sees one page and the process sees another for the same
virtual address). The pinning that we use only guarantees the HCA's page
won't be reused.
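
To spell the scenario out, a contrived user-space sequence that gets you there
(ordinary host memory, nothing GPU-specific):

#include <infiniband/verbs.h>
#include <sys/mman.h>
#include <string.h>

/*
 * After MADV_DONTNEED the next CPU access faults in a fresh page, while
 * the HCA keeps using the page pinned at ibv_reg_mr() time, so CPU and
 * HCA now see different pages for the same virtual address.
 */
static void corrupt_yourself(struct ibv_pd *pd, size_t len)
{
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);

        madvise(buf, len, MADV_DONTNEED);  /* application breaks its own MR */
        memset(buf, 0, len);               /* faults in new pages */

        /* Remote writes through 'mr' still target the originally pinned pages. */
        ibv_dereg_mr(mr);
        munmap(buf, len);
}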

> I would not support that
> upstream without a lot more explanation..
> 
> I know people don't like requiring new hardware, but in this case we
> really do need ODP hardware to get all the semantics people want..
> 
> > 
> > Another thing I think is that while HMM is good for user-space
> > applications, for kernel p2p use there is no need for that. Using
> From what I understand we are not really talking about kernel p2p,
> everything proposed so far is being mediated by a userspace VMA, so
> I'd focus on making that work.
Fair enough, although we will need both eventually, and I hope the
infrastructure can be shared to some degree.


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-28 Thread Haggai Eran
On Mon, 2016-11-28 at 09:48 -0500, Serguei Sagalovitch wrote:
> On 2016-11-27 09:02 AM, Haggai Eran wrote
> > 
> > On PeerDirect, we have some kind of a middle-ground solution for pinning
> > GPU memory. We create a non-ODP MR pointing to VRAM but rely on
> > user-space and the GPU not to migrate it. If they do, the MR gets
> > destroyed immediately. This should work on legacy devices without ODP
> > support, and allows the system to safely terminate a process that
> > misbehaves. The downside of course is that it cannot transparently
> > migrate memory but I think for user-space RDMA doing that transparently
> > requires hardware support for paging, via something like HMM.
> > 
> > ...
> Maybe I am wrong but my understanding is that the PeerDirect logic
> basically follows the "RDMA register MR" logic
Yes. The only difference from regular MRs is the invalidation process I
mentioned, and the fact that we get the addresses not from
get_user_pages but from a peer driver.

> so basically nothing prevents us from "terminating" the
> process in the "MMU notifier" case when we are very low on memory,
> making it similar to (not worse than) the PeerDirect case.
I'm not sure I understand. I don't think any solution prevents
terminating an application. The paragraph above is just trying to
explain how a non-ODP device/MR can handle an invalidation.

> > > I'm hearing most people say ZONE_DEVICE is the way to handle this,
> > > which means the missing remaining piece for RDMA is some kind of DMA
> > > core support for p2p address translation..
> > Yes, this is definitely something we need. I think Will Davis's patches
> > are a good start.
> > 
> > Another thing I think is that while HMM is good for user-space
> > applications, for kernel p2p use there is no need for that.
> About HMM: I do not think that in the current form HMM would  fit in
> requirement for generic P2P transfer case. My understanding is that at
> the current stage HMM is good for "caching" system memory
> in device memory for fast GPU access but in RDMA MR non-ODP case
> it will not work because  the location of memory should not be
> changed so memory should be allocated directly in PCIe memory.
The way I see it there are two ways to handle non-ODP MRs. Either you
prevent the GPU from migrating / reusing the MR's VRAM pages for as long
as the MR is alive (if I understand correctly you didn't like this
solution), or you allow the GPU to somehow notify the HCA to invalidate
the MR. If you do that, you can use mmu notifiers or HMM or something
else, but HMM provides a nice framework to facilitate that notification.

> > 
> > Using ZONE_DEVICE with or without something like DMA-BUF to pin and unpin
> > pages for the short duration as you wrote above could work fine for
> > kernel uses in which we can guarantee they are short.
> Potentially there is another issue related to pin/unpin. If memory could
> be used a lot of time then there is no sense to rebuild and program s/g
> tables each time if location of memory was not changed.
Is this about the kernel use or user-space? In user-space I think the MR
concept captures a long-lived s/g table so you don't need to rebuild it
(unless the mapping changes).
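
That is just the normal verbs model: register once, then every work request
simply references the MR's lkey, with no re-pinning or s/g rebuild per
transfer. A short sketch (queue-pair setup omitted):

#include <infiniband/verbs.h>
#include <stdint.h>

/* Post a send that reuses an already-registered, long-lived MR. */
static int post_one_send(struct ibv_qp *qp, struct ibv_mr *mr,
                         void *buf, size_t len)
{
        struct ibv_sge sge = {
                .addr   = (uintptr_t)buf,
                .length = (uint32_t)len,
                .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
                .sg_list = &sge,
                .num_sge = 1,
                .opcode  = IBV_WR_SEND,
        }, *bad_wr;

        return ibv_post_send(qp, &wr, &bad_wr);
}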

Haggai


Re: Enabling peer to peer device transactions for PCIe devices

2016-11-27 Thread Haggai Eran
On 11/25/2016 9:32 PM, Jason Gunthorpe wrote:
> On Fri, Nov 25, 2016 at 02:22:17PM +0100, Christian König wrote:
> 
>>> Like you say below we have to handle short lived in the usual way, and
>>> that covers basically every device except IB MRs, including the
>>> command queue on a NVMe drive.
>>
>> Well a problem which wasn't mentioned so far is that while GPUs do have a
>> page table to mirror the CPU page table, they usually can't recover from
>> page faults.
> 
>> So what we do is making sure that all memory accessed by the GPU Jobs stays
>> in place while those jobs run (pretty much the same pinning you do for the
>> DMA).
> 
> Yes, it is DMA, so this is a valid approach.
> 
> But, you don't need page faults from the GPU to do proper coherent
> page table mirroring. Basically when the driver submits the work to
> the GPU it 'faults' the pages into the CPU and mirror translation
> table (instead of pinning).
> 
> Like in ODP, MMU notifiers/HMM are used to monitor for translation
> changes. If a change comes in the GPU driver checks if an executing
> command is touching those pages and blocks the MMU notifier until the
> command flushes, then unfaults the page (blocking future commands) and
> unblocks the mmu notifier.
I think blocking mmu notifiers against something that is basically
controlled by user-space can be problematic. This can block things like
memory reclaim. If you have user-space access to the device's queues,
user-space can block the mmu notifier forever.
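
To make the concern concrete, the mirroring scheme described above boils down
to something like the sketch below (~4.9-era mmu_notifier callbacks; gpu_ctx
and the GPU helpers are hypothetical names). The wait is the problem:
invalidate_range_start() runs in paths like memory reclaim, so if user-space
can keep commands in flight it can stall the notifier indefinitely.

#include <linux/kernel.h>
#include <linux/mmu_notifier.h>

struct gpu_ctx {
        struct mmu_notifier mn;
        /* ... command queues, GPU mirror page table, etc. ... */
};

/* Hypothetical driver internals. */
void gpu_wait_for_commands_touching(struct gpu_ctx *ctx,
                                    unsigned long start, unsigned long end);
void gpu_unmap_range(struct gpu_ctx *ctx,
                     unsigned long start, unsigned long end);

static void gpu_invalidate_range_start(struct mmu_notifier *mn,
                                       struct mm_struct *mm,
                                       unsigned long start, unsigned long end)
{
        struct gpu_ctx *ctx = container_of(mn, struct gpu_ctx, mn);

        /* Block until no executing command touches [start, end)... */
        gpu_wait_for_commands_touching(ctx, start, end);
        /* ...then drop the range from the GPU's mirror page table. */
        gpu_unmap_range(ctx, start, end);
}

static const struct mmu_notifier_ops gpu_mn_ops = {
        .invalidate_range_start = gpu_invalidate_range_start,
};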

On PeerDirect, we have some kind of a middle-ground solution for pinning
GPU memory. We create a non-ODP MR pointing to VRAM but rely on
user-space and the GPU not to migrate it. If they do, the MR gets
destroyed immediately. This should work on legacy devices without ODP
support, and allows the system to safely terminate a process that
misbehaves. The downside of course is that it cannot transparently
migrate memory but I think for user-space RDMA doing that transparently
requires hardware support for paging, via something like HMM.
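
For anyone not familiar with PeerDirect, the shape of the interface is roughly
the sketch below; the names are simplified from the out-of-tree
peer_memory_client API, so treat them as illustrative rather than exact:

#include <linux/scatterlist.h>
#include <linux/device.h>

/*
 * A PeerDirect-style peer memory client: the peer (GPU) driver, rather
 * than get_user_pages(), supplies and maps the pages behind an MR. The
 * invalidate callback passed at registration is the last-resort MR
 * teardown described above.
 */
struct peer_mem_client {
        /* Does this client own the address range being registered? */
        int (*acquire)(unsigned long addr, size_t size, void **client_ctx);
        /* Pin the peer memory and describe it as an sg_table. */
        int (*get_pages)(void *client_ctx, struct sg_table *sgt);
        /* Map/unmap it for DMA by the HCA. */
        int (*dma_map)(void *client_ctx, struct device *hca_dev,
                       struct sg_table *sgt);
        void (*dma_unmap)(void *client_ctx, struct device *hca_dev,
                          struct sg_table *sgt);
        void (*put_pages)(void *client_ctx, struct sg_table *sgt);
};

/* Registration hands the peer driver a callback for invalidating the MR. */
int register_peer_mem_client(const struct peer_mem_client *client,
                             void (*invalidate)(void *core_ctx));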

...

> I'm hearing most people say ZONE_DEVICE is the way to handle this,
> which means the missing remaining piece for RDMA is some kind of DMA
> core support for p2p address translation..

Yes, this is definitely something we need. I think Will Davis's patches
are a good start.

Another thing I think is that while HMM is good for user-space
applications, for kernel p2p use there is no need for that. Using
ZONE_DEVICE with or without something like DMA-BUF to pin and unpin
pages for the short duration as you wrote above could work fine for
kernel uses in which we can guarantee they are short.

Haggai