Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support

2020-07-08 Thread Alex Deucher
On Wed, Jul 8, 2020 at 10:20 AM Christian König
 wrote:
>
> > On 08.07.20 at 11:49, Daniel Vetter wrote:
> > On Wed, Jul 08, 2020 at 11:38:31AM +0200, Christian König wrote:
> >> On 07.07.20 at 23:58, Xiong, Jianxin wrote:
>  -Original Message-
>  From: Christian König 
>  On 03.07.20 at 15:14, Jason Gunthorpe wrote:
> > On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:
> >
> >> So maybe I'm just totally confused about the rdma model. I thought:
> >> - you bind a pile of memory for various transactions, that might
> >> happen whenever. Kernel driver doesn't have much if any insight into
> >> when memory isn't needed anymore. I think in the rdma world that's
> >> called registering memory, but not sure.
> > Sure, but once registered the memory is able to be used at any moment
> > with no visibility from the kernel.
> >
> > Unlike a GPU, the transactions that trigger memory access do not go
> > through the kernel - so there is no ability to interrupt a command
> > flow and fiddle with mappings.
>  This is the same for GPUs with user space queues as well.
> 
>  But we can still tell, for a given process, that it is using a 
>  DMA-buf which has been moved out and so can't run any more until the DMA-buf 
>  is accessible again.
> 
>  In other words you somehow need to make sure that the hardware is not 
>  accessing a piece of memory any more when you want to move it.
> 
> >>> While a process can be easily suspended, there is no way to tell the RDMA 
> >>> NIC not to process posted work requests that use specific memory regions 
> >>> (or under any other conditions).
> >>>
> >>> So far it appears to me that DMA-buf dynamic mapping for RDMA is only 
> >>> viable with ODP support. For NICs without ODP, a way to allow pinning the 
> >>> device memory is still needed.
> >> And that's exactly the reason why I introduced explicit pin()/unpin()
> >> functions into the DMA-buf API:
> >> https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L811
> >>
> >> It's just that at least our device drivers currently prevent P2P with
> >> pinned DMA-bufs for two main reasons:
> >>
> >> a) To prevent denial of service attacks because P2P BARs are a rather rare
> >> resource.
> >>
> >> b) To prevent failures in configurations where P2P is not always possible
> >> between all devices which want to access a buffer.
> > So the above is more or less the question in the cover letter (which
> > didn't make it to dri-devel). Can we somehow throw that limitation out, or
> > is that simply not a good idea?
>
> At least for the AMD graphics drivers that's most certainly not a good idea.
>
> We do have a use case where buffers need to be in system memory because
> P2P doesn't work.
>
> And by pinning them to VRAM you can create a really nice denial of service
> attack against the X system.
>

On the other hand, on modern platforms with large or resizable BARs,
you may end up with systems with more vram than system ram.

Alex


> > Simply moving buffers to system memory when they're pinned does simplify a
> > lot of headaches. For a specific custom built system we can avoid that
> > maybe, but I think upstream is kinda a different thing.
>
> Yes, agree completely on that. Customers who are willing to take the
> risk can easily do this themselves.
>
> But that is probably not something we should do for upstream.
>
> Regards,
> Christian.
>
> >
> > Cheers, Daniel
> >
> >> Regards,
> >> Christian.
> >>
> >>> Jianxin
> >>>
>  Christian.
> 
> > Jason
>
> ___
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support

2020-07-08 Thread Christian König

On 08.07.20 at 11:49, Daniel Vetter wrote:

On Wed, Jul 08, 2020 at 11:38:31AM +0200, Christian König wrote:

On 07.07.20 at 23:58, Xiong, Jianxin wrote:

-Original Message-
From: Christian König 
On 03.07.20 at 15:14, Jason Gunthorpe wrote:

On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:


So maybe I'm just totally confused about the rdma model. I thought:
- you bind a pile of memory for various transactions, that might
happen whenever. Kernel driver doesn't have much if any insight into
when memory isn't needed anymore. I think in the rdma world that's
called registering memory, but not sure.

Sure, but once registered the memory is able to be used at any moment
with no visibility from the kernel.

Unlike a GPU, the transactions that trigger memory access do not go
through the kernel - so there is no ability to interrupt a command
flow and fiddle with mappings.

This is the same for GPUs with user space queues as well.

But we can still tell, for a given process, that it is using a DMA-buf 
which has been moved out and so can't run any more until the DMA-buf is
accessible again.

In other words you somehow need to make sure that the hardware is not accessing 
a piece of memory any more when you want to move it.


While a process can be easily suspended, there is no way to tell the RDMA NIC 
not to process posted work requests that use specific memory regions (or under 
any other conditions).

So far it appears to me that DMA-buf dynamic mapping for RDMA is only viable 
with ODP support. For NICs without ODP, a way to allow pinning the device 
memory is still needed.

And that's exactly the reason why I introduced explicit pin()/unpin()
functions into the DMA-buf API:
https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L811

It's just that at least our device drivers currently prevent P2P with
pinned DMA-bufs for two main reasons:

a) To prevent denial of service attacks because P2P BARs are a rather rare
resource.

b) To prevent failures in configurations where P2P is not always possible
between all devices which want to access a buffer.

So the above is more or less the question in the cover letter (which
didn't make it to dri-devel). Can we somehow throw that limitation out, or
is that simply not a good idea?


At least for the AMD graphics drivers that's most certainly not a good idea.

We do have a use case where buffers need to be in system memory because 
P2P doesn't work.


And by pinning them to VRAM you can create a really nice denial of service 
attack against the X system.



Simply moving buffers to system memory when they're pinned does simplify a
lot of headaches. For a specific custom built system we can avoid that
maybe, but I think upstream is kinda a different thing.


Yes, agree completely on that. Customers who are willing to take the 
risk can easily do this themselves.


But that is probably not something we should do for upstream.

Regards,
Christian.



Cheers, Daniel


Regards,
Christian.


Jianxin


Christian.


Jason


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support

2020-07-08 Thread Daniel Vetter
On Wed, Jul 08, 2020 at 11:38:31AM +0200, Christian König wrote:
> On 07.07.20 at 23:58, Xiong, Jianxin wrote:
> > > -Original Message-
> > > From: Christian König 
> > > On 03.07.20 at 15:14, Jason Gunthorpe wrote:
> > > > On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:
> > > > 
> > > > > So maybe I'm just totally confused about the rdma model. I thought:
> > > > > - you bind a pile of memory for various transactions, that might
> > > > > happen whenever. Kernel driver doesn't have much if any insight into
> > > > > when memory isn't needed anymore. I think in the rdma world that's
> > > > > called registering memory, but not sure.
> > > > Sure, but once registered the memory is able to be used at any moment
> > > > with no visibility from the kernel.
> > > > 
> > > > Unlike a GPU, the transactions that trigger memory access do not go
> > > > through the kernel - so there is no ability to interrupt a command
> > > > flow and fiddle with mappings.
> > > This is the same for GPUs with user space queues as well.
> > > 
> > > But we can still tell, for a given process, that it is using a 
> > > DMA-buf which has been moved out and so can't run any more until the DMA-buf is
> > > accessible again.
> > > 
> > > In other words you somehow need to make sure that the hardware is not 
> > > accessing a piece of memory any more when you want to move it.
> > > 
> > While a process can be easily suspended, there is no way to tell the RDMA 
> > NIC not to process posted work requests that use specific memory regions 
> > (or under any other conditions).
> > 
> > So far it appears to me that DMA-buf dynamic mapping for RDMA is only 
> > viable with ODP support. For NICs without ODP, a way to allow pinning the 
> > device memory is still needed.
> 
> And that's exactly the reason why I introduced explicit pin()/unpin()
> functions into the DMA-buf API:
> https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L811
> 
> It's just that at least our device drivers currently prevent P2P with
> pinned DMA-bufs for two main reasons:
> 
> a) To prevent denial of service attacks because P2P BARs are a rather rare
> resource.
> 
> b) To prevent failures in configurations where P2P is not always possible
> between all devices which want to access a buffer.

So the above is more or less the question in the cover letter (which
didn't make it to dri-devel). Can we somehow throw that limitation out, or
is that simply not a good idea?

Simply moving buffers to system memory when they're pinned does simplify a
lot of headaches. For a specific custom built system we can avoid that
maybe, but I think upstream is kinda a different thing.

Cheers, Daniel

> Regards,
> Christian.
> 
> > 
> > Jianxin
> > 
> > > Christian.
> > > 
> > > > Jason
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support

2020-07-08 Thread Christian König

On 07.07.20 at 23:58, Xiong, Jianxin wrote:

-Original Message-
From: Christian König 
On 03.07.20 at 15:14, Jason Gunthorpe wrote:

On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:


So maybe I'm just totally confused about the rdma model. I thought:
- you bind a pile of memory for various transactions, that might
happen whenever. Kernel driver doesn't have much if any insight into
when memory isn't needed anymore. I think in the rdma world that's
called registering memory, but not sure.

Sure, but once registered the memory is able to be used at any moment
with no visibility from the kernel.

Unlike a GPU, the transactions that trigger memory access do not go
through the kernel - so there is no ability to interrupt a command
flow and fiddle with mappings.

This is the same for GPUs with user space queues as well.

But we can still tell, for a given process, that it is using a DMA-buf 
which has been moved out and so can't run any more until the DMA-buf is
accessible again.

In other words you somehow need to make sure that the hardware is not accessing 
a piece of memory any more when you want to move it.


While a process can be easily suspended, there is no way to tell the RDMA NIC 
not to process posted work requests that use specific memory regions (or under 
any other conditions).

So far it appears to me that DMA-buf dynamic mapping for RDMA is only viable 
with ODP support. For NICs without ODP, a way to allow pinning the device 
memory is still needed.


And that's exactly the reason why I introduced explicit pin()/unpin() 
functions into the DMA-buf API: 
https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L811


It's just that at least our device drivers currently prevent P2P with 
pinned DMA-bufs for two main reasons:


a) To prevent denial of service attacks because P2P BARs are a rather rare 
resource.


b) To prevent failures in configurations where P2P is not always possible 
between all devices which want to access a buffer.
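
For reference, a rough sketch of how a non-ODP importer could sit on top of
this interface. The function, the trivial importer ops and the teardown
comment are invented for illustration, error handling is trimmed, and the
exact attach/locking details vary a bit between kernel versions:

  #include <linux/dma-buf.h>
  #include <linux/dma-resv.h>
  #include <linux/dma-direction.h>
  #include <linux/err.h>

  /* Pinned buffers must never move, so move_notify should never fire. */
  static void pinned_move_notify(struct dma_buf_attachment *attach)
  {
          WARN_ON_ONCE(1);
  }

  static const struct dma_buf_attach_ops pinned_importer_ops = {
          .move_notify = pinned_move_notify,
  };

  /* Attach to the dma-buf, pin it so the exporter keeps the backing storage
   * in place, and map it for DMA - roughly what a NIC without ODP needs. */
  static struct sg_table *nic_pin_and_map(struct dma_buf *dmabuf,
                                          struct device *dev,
                                          struct dma_buf_attachment **out)
  {
          struct dma_buf_attachment *attach;
          struct sg_table *sgt;
          int ret;

          attach = dma_buf_dynamic_attach(dmabuf, dev, &pinned_importer_ops, NULL);
          if (IS_ERR(attach))
                  return ERR_CAST(attach);

          dma_resv_lock(dmabuf->resv, NULL);
          ret = dma_buf_pin(attach);
          if (ret)
                  sgt = ERR_PTR(ret);
          else
                  sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
          dma_resv_unlock(dmabuf->resv);

          *out = attach;
          return sgt;  /* dma_buf_unmap_attachment()/dma_buf_unpin() on teardown */
  }

That is the "pin everything up front" model, which is exactly what a) and b)
above are worried about.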


Regards,
Christian.



Jianxin


Christian.


Jason


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


RE: [RFC PATCH v2 0/3] RDMA: add dma-buf support

2020-07-07 Thread Xiong, Jianxin

> -Original Message-
> From: Christian König 
> On 03.07.20 at 15:14, Jason Gunthorpe wrote:
> > On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:
> >
> >> So maybe I'm just totally confused about the rdma model. I thought:
> >> - you bind a pile of memory for various transactions, that might
> >> happen whenever. Kernel driver doesn't have much if any insight into
> >> when memory isn't needed anymore. I think in the rdma world that's
> >> called registering memory, but not sure.
> > Sure, but once registered the memory is able to be used at any moment
> > with no visibility from the kernel.
> >
> > Unlike a GPU, the transactions that trigger memory access do not go
> > through the kernel - so there is no ability to interrupt a command
> > flow and fiddle with mappings.
> 
> This is the same for GPUs with user space queues as well.
> 
> But we can still tell, for a given process, that it is using a DMA-buf 
> which has been moved out and so can't run any more until the DMA-buf is
> accessible again.
> 
> In other words you somehow need to make sure that the hardware is not 
> accessing a piece of memory any more when you want to move it.
> 

While a process can be easily suspended, there is no way to tell the RDMA NIC 
not to process posted work requests that use specific memory regions (or under 
any other conditions).

So far it appears to me that DMA-buf dynamic mapping for RDMA is only viable 
with ODP support. For NICs without ODP, a way to allow pinning the device 
memory is still needed.
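
(For the GPU folks: ODP is a per-registration property on the verbs side,
i.e. something the application asks for when creating the MR. A minimal
sketch with plain libibverbs; the wrapper function is invented here:)

  #include <infiniband/verbs.h>

  /* ODP-capable registration: the HCA faults pages in on access instead of
   * requiring them to be pinned up front.  NICs without the capability
   * fail the registration when IBV_ACCESS_ON_DEMAND is requested. */
  static struct ibv_mr *register_odp(struct ibv_pd *pd, void *addr, size_t len)
  {
          return ibv_reg_mr(pd, addr, len,
                            IBV_ACCESS_ON_DEMAND | IBV_ACCESS_LOCAL_WRITE);
  }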

Jianxin

> Christian.
> 
> >
> > Jason

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support

2020-07-06 Thread Jason Gunthorpe
On Thu, Jul 02, 2020 at 08:15:40PM +0200, Daniel Vetter wrote:
> > > > 3. rdma driver worker gets busy to restart rx:
> > > > 1. lock all dma-buf that are currently in use (dma_resv_lock).
> > > > thanks to ww_mutex deadlock avoidance this is possible
> > > Why all? Why not just lock the one that was invalidated to restore the
> > > mappings? That is some artifact of the GPU approach?
> > 
> > No, but you must make sure that mapping one doesn't invalidate others you
> > need.
> > 
> > Otherwise you can end up in a nice live lock :)
> 
> Also if you don't have pagefaults, but have to track busy memory at a
> context level, you do need to grab all locks of all buffers you need, or
> you'd race. There's nothing stopping a concurrent ->notify_move on some
> other buffer you'll need otherwise, and if you try to be clever and roll
> > your own locking, you'll anger lockdep - your own lock will have to be
> > on both sides of ww_mutex or it won't work, and that deadlocks.

So you are worried about atomically building some multi-buffer
transaction? I don't think this applies to RDMA which isn't going to
be transactional here..

> > > And why is this done with work queues and locking instead of a
> > > callback saying the buffer is valid again?
> > 
> > You can do this as well, but a work queue is usually easier to handle than a
> > notification in an interrupt context of a foreign driver.
> 
> Yeah you can just install a dma_fence callback but
> - that's hardirq context
> - if you don't have per-page hw faults you need the multi-lock ww_mutex
>   dance anyway to avoid races.

It is still best to avoid the per-page faults and preload the new
mapping once it is ready.

Jason
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support

2020-07-06 Thread Jason Gunthorpe
On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:

> So maybe I'm just totally confused about the rdma model. I thought:
> - you bind a pile of memory for various transactions, that might
> happen whenever. Kernel driver doesn't have much if any insight into
> when memory isn't needed anymore. I think in the rdma world that's
> called registering memory, but not sure.

Sure, but once registered the memory is able to be used at any moment with
no visibility from the kernel.

Unlike a GPU, the transactions that trigger memory access do not go
through the kernel - so there is no ability to interrupt a command
flow and fiddle with mappings.
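
For readers coming from the GPU side, a minimal user space illustration of
that model (standard libibverbs calls; the wrapper function, buffer and
access flags are just an example): once ibv_reg_mr() returns, any posted
work request can make the HCA touch this memory, and the kernel never sees
the individual accesses.

  #include <stdlib.h>
  #include <infiniband/verbs.h>

  /* Register a buffer once; afterwards the NIC uses mr->lkey/mr->rkey from
   * posted work requests without any further kernel involvement. */
  static struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
  {
          void *buf = malloc(len);

          if (!buf)
                  return NULL;

          return ibv_reg_mr(pd, buf, len,
                            IBV_ACCESS_LOCAL_WRITE |
                            IBV_ACCESS_REMOTE_READ |
                            IBV_ACCESS_REMOTE_WRITE);
  }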

Jason
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support

2020-07-03 Thread Christian König

On 03.07.20 at 15:14, Jason Gunthorpe wrote:

On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:


So maybe I'm just totally confused about the rdma model. I thought:
- you bind a pile of memory for various transactions, that might
happen whenever. Kernel driver doesn't have much if any insight into
when memory isn't needed anymore. I think in the rdma world that's
called registering memory, but not sure.

Sure, but once registered the memory is able to be used at any moment with
no visibility from the kernel.

Unlike a GPU, the transactions that trigger memory access do not go
through the kernel - so there is no ability to interrupt a command
flow and fiddle with mappings.


This is the same for GPUs with user space queues as well.

But we can still tell, for a given process, that it is using a 
DMA-buf which has been moved out and so can't run any more until the DMA-buf 
is accessible again.


In other words you somehow need to make sure that the hardware is not 
accessing a piece of memory any more when you want to move it.


Christian.



Jason


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support

2020-07-03 Thread Daniel Vetter
On Fri, Jul 3, 2020 at 2:03 PM Jason Gunthorpe  wrote:
>
> On Thu, Jul 02, 2020 at 08:15:40PM +0200, Daniel Vetter wrote:
> > > > > 3. rdma driver worker gets busy to restart rx:
> > > > > 1. lock all dma-buf that are currently in use (dma_resv_lock).
> > > > > thanks to ww_mutex deadlock avoidance this is possible
> > > > Why all? Why not just lock the one that was invalidated to restore the
> > > > mappings? That is some artifact of the GPU approach?
> > >
> > > No, but you must make sure that mapping one doesn't invalidate others you
> > > need.
> > >
> > > Otherwise you can end up in a nice live lock :)
> >
> > Also if you don't have pagefaults, but have to track busy memory at a
> > context level, you do need to grab all locks of all buffers you need, or
> > you'd race. There's nothing stopping a concurrent ->notify_move on some
> > other buffer you'll need otherwise, and if you try to be clever and roll
> > your own locking, you'll anger lockdep - your own lock will have to be
> > on both sides of ww_mutex or it won't work, and that deadlocks.
>
> So you are worried about atomically building some multi-buffer
> transaction? I don't think this applies to RDMA which isn't going to
> be transactional here..

So maybe I'm just totally confused about the rdma model. I thought:
- you bind a pile of memory for various transactions, that might
happen whenever. Kernel driver doesn't have much if any insight into
when memory isn't needed anymore. I think in the rdma world that's
called registering memory, but not sure.

- for hw with hw faults you can pull in the memory when it's needed,
and if concurrently another cpu is taking away pages and invalidating
rdma hw mappings, then that's no problem.

So far so good, zero need for atomic transactions or anything.

But the answer I gave here is for when you don't have per-page hw
faulting on the rdma nic, but something that works at a much larger
level. For a gpu it would be a compute context, no idea what the
equivalent for rdma is. This could go up to and including the entire
nic stalling all rx with link level back pressure, but not
necessarily.

Once you go above a single page, or well, at least a single dma-buf
object, you need some way to know when you have all buffers and memory
mappings present again, because you can't restart with only partial
memory.

Now if the rdma memory programming is more like traditional
networking, where you have a single rx or tx buffer, then yeah you
don't need fancy multi-buffer locking. Like I said I have no idea how
many of the buffers you need to restart the rdma stuff for hw which
doesn't have per-page hw faulting.

> > > > And why is this done with work queues and locking instead of a
> > > > callback saying the buffer is valid again?
> > >
> > > You can do this as well, but a work queue is usually easier to handle 
> > > than a
> > > notification in an interrupt context of a foreign driver.
> >
> > Yeah you can just install a dma_fence callback but
> > - that's hardirq context
> > - if you don't have per-page hw faults you need the multi-lock ww_mutex
> >   dance anyway to avoid races.
>
> It is still best to avoid the per-page faults and preload the new
> mapping once it is ready.

Sure, but I think that's entirely orthogonal.

Also even if you don't need the multi-buffer dance (either because hw
faults, or because the rdma rx/tx can be stopped at least at a
per-buffer level through some other means) then you still need the
dma_resv_lock to serialize with the exporter. So needs a sleeping
context either way. Hence some worker is needed.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support

2020-07-03 Thread Jason Gunthorpe
On Tue, Jun 30, 2020 at 08:08:46PM +, Xiong, Jianxin wrote:
> > From: Jason Gunthorpe 
> > Sent: Tuesday, June 30, 2020 12:17 PM
> > To: Xiong, Jianxin 
> > Cc: linux-r...@vger.kernel.org; Doug Ledford ; Sumit 
> > Semwal ; Leon Romanovsky
> > ; Vetter, Daniel ; Christian 
> > Koenig ; dri-
> > de...@lists.freedesktop.org
> > Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> > 
> > > >
> > > > On Tue, Jun 30, 2020 at 05:21:33PM +, Xiong, Jianxin wrote:
> > > > > > > Heterogeneous Memory Management (HMM) utilizes
> > > > > > > mmu_interval_notifier and ZONE_DEVICE to support shared
> > > > > > > virtual address space and page migration between system memory
> > > > > > > and device memory. HMM doesn't support pinning device memory
> > > > > > > because pages located on device must be able to migrate to
> > > > > > > system memory when accessed by CPU. Peer-to-peer access is
> > > > > > > possible if the peer can handle page fault. For RDMA, that means 
> > > > > > > the NIC must support on-demand paging.
> > > > > >
> > > > > > peer-peer access is currently not possible with hmm_range_fault().
> > > > >
> > > > > Currently hmm_range_fault() always sets the cpu access flag and
> > > > > device private pages are migrated to the system RAM in the fault 
> > > > > handler.
> > > > > However, it's possible to have a modified code flow to keep the
> > > > > device private page info for use with peer to peer access.
> > > >
> > > > Sort of, but only within the same device, RDMA or anything else generic 
> > > > can't reach inside a DEVICE_PRIVATE and extract anything
> > useful.
> > >
> > > But pfn is supposed to be all that is needed.
> > 
> > Needed for what? The PFN of the DEVICE_PRIVATE pages is useless for 
> > anything.
> 
> Hmm. I thought the pfn corresponds to the address in the BAR range. I could be
> wrong here. 

No, DEVICE_PRIVATE is a dummy pfn to empty address space.

Jason
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support

2020-07-03 Thread Jason Gunthorpe
On Thu, Jul 02, 2020 at 03:10:00PM +0200, Daniel Vetter wrote:
> On Wed, Jul 01, 2020 at 02:15:24PM -0300, Jason Gunthorpe wrote:
> > On Wed, Jul 01, 2020 at 05:42:21PM +0200, Daniel Vetter wrote:
> > > > >> All you need is the ability to stop and wait for ongoing accesses to end 
> > > > >> and
> > > > >> make sure that new ones grab a new mapping.
> > > > > Swap and flush isn't a general HW ability either..
> > > > >
> > > > > I'm unclear how this could be useful, it is guaranteed to corrupt
> > > > > in-progress writes?
> > > > >
> > > > > Did you mean pause, swap and resume? That's ODP.
> > > >
> > > > Yes, something like this. And good to know, never heard of ODP.
> > > 
> > > Hm I thought ODP was full hw page faults at an individual page
> > > level,
> > 
> > Yes
> > 
> > > and this stop is for the entire nic. Under the hood both apply
> > > back-pressure on the network if a transmission can't be received,
> > > but
> > 
> > NICs don't do stop and resume; blocking the Rx pipe is very
> > problematic and performance-destroying.
> > 
> > The strategy for something like ODP is more complex, and so far no NIC
> > has deployed it at any granularity larger than per-page.
> > 
> > > So since Jason really doesn't like dma_fence much I think for rdma
> > > synchronous it is. And it shouldn't really matter, since waiting for a
> > > small transaction to complete at rdma wire speed isn't really that
> > > long an operation. 
> > 
> > Even if DMA fence were to somehow be involved, how would it look?
> 
> Well above you're saying it would be performance destroying, but let's
> pretend that's not a problem :-) Also, I have no clue about rdma, so this
> is really just the flow we have on the gpu side.

I see, no, this is not workable, the command flow in RDMA is not at
all like a GPU - what you are proposing is a global 'stop the whole
chip' of the Tx and Rx flows for an undetermined time. Not feasible.

What we can do is use ODP techniques and pause only the MR attached to
the DMA buf with the process you outline below. This is not so hard to
implement.

> 3. rdma driver worker gets busy to restart rx:
>   1. lock all dma-buf that are currently in use (dma_resv_lock).
>   thanks to ww_mutex deadlock avoidance this is possible

Why all? Why not just lock the one that was invalidated to restore the
mappings? That is some artifact of the GPU approach?

And why is this done with work queues and locking instead of a
callback saying the buffer is valid again?

Jason
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support

2020-07-02 Thread Daniel Vetter
On Thu, Jul 02, 2020 at 04:50:32PM +0200, Christian König wrote:
> On 02.07.20 at 15:29, Jason Gunthorpe wrote:
> > On Thu, Jul 02, 2020 at 03:10:00PM +0200, Daniel Vetter wrote:
> > > On Wed, Jul 01, 2020 at 02:15:24PM -0300, Jason Gunthorpe wrote:
> > > > On Wed, Jul 01, 2020 at 05:42:21PM +0200, Daniel Vetter wrote:
> > > > > > > > All you need is the ability to stop and wait for ongoing accesses 
> > > > > > > > to end and
> > > > > > > > make sure that new ones grab a new mapping.
> > > > > > > Swap and flush isn't a general HW ability either..
> > > > > > > 
> > > > > > > I'm unclear how this could be useful, it is guaranteed to corrupt
> > > > > > > in-progress writes?
> > > > > > > 
> > > > > > > Did you mean pause, swap and resume? That's ODP.
> > > > > > Yes, something like this. And good to know, never heard of ODP.
> > > > > Hm I thought ODP was full hw page faults at an individual page
> > > > > level,
> > > > Yes
> > > > 
> > > > > and this stop is for the entire nic. Under the hood both apply
> > > > > back-pressure on the network if a transmission can't be received,
> > > > > but
> > > > NICs don't do stop and resume; blocking the Rx pipe is very
> > > > problematic and performance-destroying.
> > > > 
> > > > The strategy for something like ODP is more complex, and so far no NIC
> > > > has deployed it at any granularity larger than per-page.
> > > > 
> > > > > So since Jason really doesn't like dma_fence much I think for rdma
> > > > > synchronous it is. And it shouldn't really matter, since waiting for a
> > > > > small transaction to complete at rdma wire speed isn't really that
> > > > > long an operation.
> > > > Even if DMA fence were to somehow be involved, how would it look?
> > > Well above you're saying it would be performance destroying, but let's
> > > pretend that's not a problem :-) Also, I have no clue about rdma, so this
> > > is really just the flow we have on the gpu side.
> > I see, no, this is not workable, the command flow in RDMA is not at
> > all like a GPU - what you are proposing is a global 'stop the whole
> > chip' of the Tx and Rx flows for an undetermined time. Not feasible.

Yeah, I said I have no clue about rdma :-)

> > What we can do is use ODP techniques and pause only the MR attached to
> > the DMA buf with the process you outline below. This is not so hard to
> > implement.
> 
> Well it boils down to only two requirements:
> 
> 1. You can stop accessing the memory or addresses exported by the DMA-buf.
> 
> 2. Before the next access you need to acquire a new mapping.
> 
> How you do this is perfectly up to you. E.g. you can stop everything, just
> prevent access to this DMA-buf, or just pause the users of this DMA-buf

Yeah in a gpu we also don't stop the entire world, only the context that
needs the buffer. If there's other stuff to run, we do keep running that.
Usually the reason for a buffer move is that we do actually have other
stuff that needs to be run, and which needs more vram for itself, so we
might have to throw out a few buffers from vram that can also be placed in
system memory.

Note that a modern gpu has multiple engines and most have or are gaining
hw scheduling of some sort, so there's a pile of concurrent gpu context
execution going on at any moment. The days where we just push gpu work
into a fifo are (mostly) long gone.

> > > 3. rdma driver worker gets busy to restart rx:
> > >   1. lock all dma-buf that are currently in use (dma_resv_lock).
> > >   thanks to ww_mutex deadlock avoidance this is possible
> > Why all? Why not just lock the one that was invalidated to restore the
> > mappings? That is some artifact of the GPU approach?
> 
> No, but you must make sure that mapping one doesn't invalidate others you
> need.
> 
> Otherwise you can end up in a nice live lock :)

Also if you don't have pagefaults, but have to track busy memory at a
context level, you do need to grab all locks of all buffers you need, or
you'd race. There's nothing stopping a concurrent ->notify_move on some
other buffer you'll need otherwise, and if you try to be clever and roll
your own locking, you'll anger lockdep - your own lock will have to be
on both sides of ww_mutex or it won't work, and that deadlocks.

So ww_mutex multi-lock dance or bust.
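
That dance looks roughly like the pattern the GEM helpers use; a sketch,
assuming the importer keeps an array of the dma_resv objects it needs
(function name invented, only the -EDEADLK backoff shown):

  #include <linux/dma-resv.h>
  #include <linux/ww_mutex.h>

  /* Lock every reservation object we need, restarting with the contended
   * one taken via the slow path whenever ww_mutex reports -EDEADLK. */
  static int lock_all_resvs(struct dma_resv **resvs, int count,
                            struct ww_acquire_ctx *ctx)
  {
          int contended = -1;
          int i, ret;

          ww_acquire_init(ctx, &reservation_ww_class);
  retry:
          if (contended != -1)
                  dma_resv_lock_slow(resvs[contended], ctx);

          for (i = 0; i < count; i++) {
                  if (i == contended)
                          continue;

                  ret = dma_resv_lock(resvs[i], ctx);
                  if (ret) {
                          int j;

                          /* drop everything we already hold */
                          for (j = 0; j < i; j++)
                                  dma_resv_unlock(resvs[j]);
                          if (contended > i)
                                  dma_resv_unlock(resvs[contended]);

                          if (ret == -EDEADLK) {
                                  contended = i;
                                  goto retry;
                          }
                          ww_acquire_fini(ctx);
                          return ret;
                  }
          }
          ww_acquire_done(ctx);
          return 0;  /* dma_resv_unlock() each + ww_acquire_fini() when done */
  }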

> > And why is this done with work queues and locking instead of a
> > callback saying the buffer is valid again?
> 
> You can do this as well, but a work queue is usually easier to handle than a
> notification in an interrupt context of a foreign driver.

Yeah you can just install a dma_fence callback but
- that's hardirq context
- if you don't have per-page hw faults you need the multi-lock ww_mutex
  dance anyway to avoid races.
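
i.e. the fence callback can really only bounce to a worker, something like
this (all names invented, just showing the hardirq-to-sleeping handoff):

  #include <linux/kernel.h>
  #include <linux/dma-fence.h>
  #include <linux/workqueue.h>

  struct remap_work {
          struct work_struct work;
          struct dma_fence_cb cb;
  };

  /* Runs with sleeping allowed: dma_resv_lock(), remapping, restarting rx. */
  static void remap_worker(struct work_struct *work)
  {
          /* ... the actual multi-lock dance and remap go here ... */
  }

  /* dma_fence callback: may run in hardirq context, so only kick the worker. */
  static void move_done_cb(struct dma_fence *fence, struct dma_fence_cb *cb)
  {
          struct remap_work *rw = container_of(cb, struct remap_work, cb);

          schedule_work(&rw->work);
  }

  static void queue_remap_after(struct dma_fence *fence, struct remap_work *rw)
  {
          INIT_WORK(&rw->work, remap_worker);
          /* If the fence has already signalled, add_callback fails and we
           * just schedule the worker directly. */
          if (dma_fence_add_callback(fence, &rw->cb, move_done_cb))
                  schedule_work(&rw->work);
  }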

On top of that the exporter has no idea whether the importer still really
needs the mapping, or whether it was just kept around opportunistically.
pte setup isn't free, so by default gpus keep everything mapped. At least
for the classic gl/vk execution model, compute like rocm/amdkfd is 

Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support

2020-07-02 Thread Christian König

On 02.07.20 at 15:29, Jason Gunthorpe wrote:

On Thu, Jul 02, 2020 at 03:10:00PM +0200, Daniel Vetter wrote:

On Wed, Jul 01, 2020 at 02:15:24PM -0300, Jason Gunthorpe wrote:

On Wed, Jul 01, 2020 at 05:42:21PM +0200, Daniel Vetter wrote:

All you need is the ability to stop and wait for ongoing accesses to end and
make sure that new ones grab a new mapping.

Swap and flush isn't a general HW ability either..

I'm unclear how this could be useful, it is guaranteed to corrupt
in-progress writes?

Did you mean pause, swap and resume? That's ODP.

Yes, something like this. And good to know, never heard of ODP.

Hm I thought ODP was full hw page faults at an individual page
level,

Yes


and this stop is for the entire nic. Under the hood both apply
back-pressure on the network if a transmission can't be received,
but

NICs don't do stop and resume; blocking the Rx pipe is very
problematic and performance-destroying.

The strategy for something like ODP is more complex, and so far no NIC
has deployed it at any granularity larger than per-page.


So since Jason really doesn't like dma_fence much I think for rdma
synchronous it is. And it shouldn't really matter, since waiting for a
small transaction to complete at rdma wire speed isn't really that
long an operation.

Even if DMA fence were to somehow be involved, how would it look?

Well above you're saying it would be performance destroying, but let's
pretend that's not a problem :-) Also, I have no clue about rdma, so this
is really just the flow we have on the gpu side.

I see, no, this is not workable, the command flow in RDMA is not at
all like a GPU - what you are proposing is a global 'stop the whole
chip' of the Tx and Rx flows for an undetermined time. Not feasible.

What we can do is use ODP techniques and pause only the MR attached to
the DMA buf with the process you outline below. This is not so hard to
implement.


Well it boils down to only two requirements:

1. You can stop accessing the memory or addresses exported by the DMA-buf.

2. Before the next access you need to acquire a new mapping.

How you do this is perfectly up to you. E.g. you can stop everything, 
just prevent access to this DMA-buf, or just pause the users of this 
DMA-buf





3. rdma driver worker gets busy to restart rx:
1. lock all dma-buf that are currently in use (dma_resv_lock).
thanks to ww_mutex deadlock avoidance this is possible

Why all? Why not just lock the one that was invalidated to restore the
mappings? That is some artifact of the GPU approach?


No, but you must make sure that mapping one doesn't invalidate others 
you need.


Otherwise you can end up in a nice live lock :)


And why is this done with work queues and locking instead of a
callback saying the buffer is valid again?


You can do this as well, but a work queue is usually easier to handle 
than a notification in an interrupt context of a foreign driver.


Regards,
Christian.



Jason


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support

2020-07-02 Thread Daniel Vetter
On Wed, Jul 01, 2020 at 02:15:24PM -0300, Jason Gunthorpe wrote:
> On Wed, Jul 01, 2020 at 05:42:21PM +0200, Daniel Vetter wrote:
> > > >> All you need is the ability to stop and wait for ongoing accesses to end 
> > > >> and
> > > >> make sure that new ones grab a new mapping.
> > > > Swap and flush isn't a general HW ability either..
> > > >
> > > > I'm unclear how this could be useful, it is guaranteed to corrupt
> > > > in-progress writes?
> > > >
> > > > Did you mean pause, swap and resume? That's ODP.
> > >
> > > Yes, something like this. And good to know, never heard of ODP.
> > 
> > Hm I thought ODP was full hw page faults at an individual page
> > level,
> 
> Yes
> 
> > and this stop is for the entire nic. Under the hood both apply
> > back-pressure on the network if a transmission can't be received,
> > but
> 
> NICs don't do stop and resume; blocking the Rx pipe is very
> problematic and performance-destroying.
> 
> The strategy for something like ODP is more complex, and so far no NIC
> has deployed it at any granularity larger than per-page.
> 
> > So since Jason really doesn't like dma_fence much I think for rdma
> > synchronous it is. And it shouldn't really matter, since waiting for a
> > small transaction to complete at rdma wire speed isn't really that
> > long an operation. 
> 
> Even if DMA fence were to somehow be involved, how would it look?

Well above you're saying it would be performance destroying, but let's
pretend that's not a problem :-) Also, I have no clue about rdma, so this
is really just the flow we have on the gpu side.

0. rdma driver maintains a list of all dma-buf that it has mapped
somewhere and is currently using for transactions

1. rdma driver gets a dma_buf->notify_move callback on one of these
buffers. To handle that it:
1. stops hw access somehow at the rx
2. flushes caches and whatever else is needed
3. moves the unmapped buffer on a special list or marks it in some
different way as unavailable
4. launch the kthread/work_struct to fix everything back up

2. dma-buf exporter (gpu driver) can now issue the commands to move the
buffer around

3. rdma driver worker gets busy to restart rx:
1. lock all dma-buf that are currently in use (dma_resv_lock).
thanks to ww_mutex deadlock avoidance this is possible
2. for any buffers which have been marked as unavailable in 1.3
grab a new mapping (which might now be in system memory, or again
peer2peer but different address)
3. restart hw and rx
4. unlock all dma-buf locks (dma_resv_unlock)

There is a minor problem because step 2 only queues up the buffer
moves behind a pile of dma_fences, and atm we haven't made it absolutely
clear who's responsible for waiting for those to complete. For gpu drivers
it's the importer since gpu drivers don't have big qualms about
dma_fences, so 3.2. would perhaps also include a dma_fence_wait to make
sure the buffer move has actually completed.
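
A condensed sketch of steps 1 and 3 for a single buffer, to make the above
concrete (structure and function names are invented, error handling and the
multi-buffer locking from the other subthread are omitted):

  #include <linux/kernel.h>
  #include <linux/dma-buf.h>
  #include <linux/dma-resv.h>
  #include <linux/dma-direction.h>
  #include <linux/workqueue.h>

  struct nic_umem {
          struct dma_buf_attachment *attach;
          struct sg_table *sgt;
          bool invalidated;
          struct work_struct remap_work;
  };

  /* Step 1: the exporter calls this with the dma_resv lock already held. */
  static void nic_move_notify(struct dma_buf_attachment *attach)
  {
          struct nic_umem *umem = attach->importer_priv;

          /* 1.1/1.2: stop hw access to this region and flush (driver specific) */
          umem->invalidated = true;               /* 1.3 */
          schedule_work(&umem->remap_work);       /* 1.4 */
  }

  /* Step 3: sleeping context, grab the new mapping and restart rx. */
  static void nic_remap_worker(struct work_struct *work)
  {
          struct nic_umem *umem = container_of(work, struct nic_umem, remap_work);
          struct dma_buf *dmabuf = umem->attach->dmabuf;

          dma_resv_lock(dmabuf->resv, NULL);                      /* 3.1 */
          umem->sgt = dma_buf_map_attachment(umem->attach,        /* 3.2 */
                                             DMA_BIDIRECTIONAL);
          /* 3.2 cont.: wait for the exporter's move fences here before
           * letting the hw loose again (the dma_fence question above) */
          umem->invalidated = false;
          /* 3.3: reprogram the NIC with the new sgt and restart rx */
          dma_resv_unlock(dmabuf->resv);                          /* 3.4 */
  }

  static const struct dma_buf_attach_ops nic_attach_ops = {
          .move_notify = nic_move_notify,
  };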

Above flow is more or less exactly what happens for gpu workloads where we
can preempt running computations. Instead of stopping rx we preempt the
compute job and remove it from the scheduler queues, and instead of
restarting rx we just put the compute job back onto the scheduler queue as
eligible for gpu time. Otherwise it's exactly the same stuff.

Of course if you only have a single compute job and too many such
interruptions, then performance is going to tank. Don't do that, instead
make sure you have enough vram or system memory or whatever :-)

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support

2020-07-02 Thread Jason Gunthorpe
On Wed, Jul 01, 2020 at 11:03:06AM +0200, Christian König wrote:
> On 30.06.20 at 20:46, Xiong, Jianxin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Tuesday, June 30, 2020 10:35 AM
> > > To: Xiong, Jianxin 
> > > Cc: linux-r...@vger.kernel.org; Doug Ledford ; Sumit 
> > > Semwal ; Leon Romanovsky
> > > ; Vetter, Daniel ; Christian 
> > > Koenig 
> > > Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> > > 
> > > On Tue, Jun 30, 2020 at 05:21:33PM +, Xiong, Jianxin wrote:
> > > > > > Heterogeneous Memory Management (HMM) utilizes
> > > > > > mmu_interval_notifier and ZONE_DEVICE to support shared virtual
> > > > > > address space and page migration between system memory and device
> > > > > > memory. HMM doesn't support pinning device memory because pages
> > > > > > located on device must be able to migrate to system memory when
> > > > > > accessed by CPU. Peer-to-peer access is possible if the peer can
> > > > > > handle page fault. For RDMA, that means the NIC must support 
> > > > > > on-demand paging.
> > > > > peer-peer access is currently not possible with hmm_range_fault().
> > > > Currently hmm_range_fault() always sets the cpu access flag and device
> > > > private pages are migrated to the system RAM in the fault handler.
> > > > However, it's possible to have a modified code flow to keep the device
> > > > private page info for use with peer to peer access.
> > > Sort of, but only within the same device, RDMA or anything else generic 
> > > can't reach inside a DEVICE_PRIVATE and extract anything useful.
> > But pfn is supposed to be all that is needed.
> > 
> > > > > So.. this patch doesn't really do anything new? We could just make a 
> > > > > MR against the DMA buf mmap and get to the same place?
> > > > That's right, the patch alone is just half of the story. The
> > > > functionality depends on availability of dma-buf exporter that can pin
> > > > the device memory.
> > > Well, what do you want to happen here? The RDMA parts are reasonable, but 
> > > I don't want to add new functionality without a purpose - the
> > > other parts need to be settled out first.
> > At the RDMA side, we mainly want to check if the changes are acceptable. 
> > For example,
> > the part about adding 'fd' to the device ops and the ioctl interface. All 
> > the previous
> > comments are very helpful for us to refine the patch so that we can be 
> > ready when
> > GPU side support becomes available.
> > 
> > > The need for the dynamic mapping support for even the current DMA Buf 
> > > hacky P2P users is really too bad. Can you get any GPU driver to
> > > support non-dynamic mapping?
> > We are working in that direction.
> > 
> > > > > > migrate to system RAM. This is due to the lack of knowledge about
> > > > > > whether the importer can perform peer-to-peer access and the lack
> > > > > > of resource limit control measure for GPU. For the first part, the
> > > > > > latest dma-buf driver has a peer-to-peer flag for the importer,
> > > > > > but the flag is currently tied to dynamic mapping support, which
> > > > > > requires on-demand paging support from the NIC to work.
> > > > > ODP for DMA buf?
> > > > Right.
> > > Hum. This is not actually so hard to do. The whole dma buf proposal would 
> > > make a lot more sense if the 'dma buf MR' had to be the
> > > dynamic kind and the driver had to provide the faulting. It would not be 
> > > so hard to change mlx5 to be able to work like this, perhaps. (the
> > > locking might be a bit tricky though)
> > The main issue is that not all NICs support ODP.
> 
> You don't need on-demand paging support from the NIC for dynamic mapping to
> work.
> 
> All you need is the ability to stop and wait for ongoing accesses to end and
> make sure that new ones grab a new mapping.

Swap and flush isn't a general HW ability either..

I'm unclear how this could be useful, it is guaranteed to corrupt
in-progress writes?

Did you mean pause, swap and resume? That's ODP.

Jason
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support

2020-07-02 Thread Jason Gunthorpe
On Wed, Jul 01, 2020 at 05:42:21PM +0200, Daniel Vetter wrote:
> > >> All you need is the ability to stop and wait for ongoing accesses to end and
> > >> make sure that new ones grab a new mapping.
> > > Swap and flush isn't a general HW ability either..
> > >
> > > I'm unclear how this could be useful, it is guaranteed to corrupt
> > > in-progress writes?
> > >
> > > Did you mean pause, swap and resume? That's ODP.
> >
> > Yes, something like this. And good to know, never heard of ODP.
> 
> Hm I thought ODP was full hw page faults at an individual page
> level,

Yes

> and this stop is for the entire nic. Under the hood both apply
> back-pressure on the network if a transmission can't be received,
> but

NICs don't do stop and resume; blocking the Rx pipe is very
problematic and performance-destroying.

The strategy for something like ODP is more complex, and so far no NIC
has deployed it at any granularity larger than per-page.

> So since Jason really doesn't like dma_fence much I think for rdma
> synchronous it is. And it shouldn't really matter, since waiting for a
> small transaction to complete at rdma wire speed isn't really that
> long an operation. 

Even if DMA fence were to somehow be involved, how would it look?

Jason
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support

2020-07-01 Thread Daniel Vetter
On Wed, Jul 1, 2020 at 2:56 PM Christian König  wrote:
>
> > On 01.07.20 at 14:39, Jason Gunthorpe wrote:
> > On Wed, Jul 01, 2020 at 11:03:06AM +0200, Christian König wrote:
> >> On 30.06.20 at 20:46, Xiong, Jianxin wrote:
> >>>> From: Jason Gunthorpe 
> >>>> Sent: Tuesday, June 30, 2020 10:35 AM
> >>>> To: Xiong, Jianxin 
> >>>> Cc: linux-r...@vger.kernel.org; Doug Ledford ; 
> >>>> Sumit Semwal ; Leon Romanovsky
> >>>> ; Vetter, Daniel ; Christian 
> >>>> Koenig 
> >>>> Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> >>>>
> >>>> On Tue, Jun 30, 2020 at 05:21:33PM +, Xiong, Jianxin wrote:
> >>>>>>> Heterogeneous Memory Management (HMM) utilizes
> >>>>>>> mmu_interval_notifier and ZONE_DEVICE to support shared virtual
> >>>>>>> address space and page migration between system memory and device
> >>>>>>> memory. HMM doesn't support pinning device memory because pages
> >>>>>>> located on device must be able to migrate to system memory when
> >>>>>>> accessed by CPU. Peer-to-peer access is possible if the peer can
> >>>>>>> handle page fault. For RDMA, that means the NIC must support 
> >>>>>>> on-demand paging.
> >>>>>> peer-peer access is currently not possible with hmm_range_fault().
> >>>>> Currently hmm_range_fault() always sets the cpu access flag and device
> >>>>> private pages are migrated to the system RAM in the fault handler.
> >>>>> However, it's possible to have a modified code flow to keep the device
> >>>>> private page info for use with peer to peer access.
> >>>> Sort of, but only within the same device, RDMA or anything else generic 
> >>>> can't reach inside a DEVICE_PRIVATE and extract anything useful.
> >>> But pfn is supposed to be all that is needed.
> >>>
> >>>>>> So.. this patch doesn't really do anything new? We could just make a 
> >>>>>> MR against the DMA buf mmap and get to the same place?
> >>>>> That's right, the patch alone is just half of the story. The
> >>>>> functionality depends on availability of dma-buf exporter that can pin
> >>>>> the device memory.
> >>>> Well, what do you want to happen here? The RDMA parts are reasonable, 
> >>>> but I don't want to add new functionality without a purpose - the
> >>>> other parts need to be settled out first.
> >>> At the RDMA side, we mainly want to check if the changes are acceptable. 
> >>> For example,
> >>> the part about adding 'fd' to the device ops and the ioctl interface. All 
> >>> the previous
> >>> comments are very helpful for us to refine the patch so that we can be 
> >>> ready when
> >>> GPU side support becomes available.
> >>>
> >>>> The need for the dynamic mapping support for even the current DMA Buf 
> >>>> hacky P2P users is really too bad. Can you get any GPU driver to
> >>>> support non-dynamic mapping?
> >>> We are working in that direction.
> >>>
> >>>>>>> migrate to system RAM. This is due to the lack of knowledge about
> >>>>>>> whether the importer can perform peer-to-peer access and the lack
> >>>>>>> of resource limit control measure for GPU. For the first part, the
> >>>>>>> latest dma-buf driver has a peer-to-peer flag for the importer,
> >>>>>>> but the flag is currently tied to dynamic mapping support, which
> >>>>>>> requires on-demand paging support from the NIC to work.
> >>>>>> ODP for DMA buf?
> >>>>> Right.
> >>>> Hum. This is not actually so hard to do. The whole dma buf proposal 
> >>>> would make a lot more sense if the 'dma buf MR' had to be the
> >>>> dynamic kind and the driver had to provide the faulting. It would not be 
> >>>> so hard to change mlx5 to be able to work like this, perhaps. (the
> >>>> locking might be a bit tricky though)
> >>> The main issue is that not all NICs support ODP.
> >> You don't need on-demand paging support from the NIC for dynamic mapping to
> >> work.
> >>
> >> All you need is the ability to stop wait for ongoing accesses t

Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support

2020-07-01 Thread Christian König

On 01.07.20 at 14:39, Jason Gunthorpe wrote:

On Wed, Jul 01, 2020 at 11:03:06AM +0200, Christian König wrote:

On 30.06.20 at 20:46, Xiong, Jianxin wrote:

From: Jason Gunthorpe 
Sent: Tuesday, June 30, 2020 10:35 AM
To: Xiong, Jianxin 
Cc: linux-r...@vger.kernel.org; Doug Ledford ; Sumit Semwal 
; Leon Romanovsky
; Vetter, Daniel ; Christian Koenig 

Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support

On Tue, Jun 30, 2020 at 05:21:33PM +, Xiong, Jianxin wrote:

Heterogeneous Memory Management (HMM) utilizes
mmu_interval_notifier and ZONE_DEVICE to support shared virtual
address space and page migration between system memory and device
memory. HMM doesn't support pinning device memory because pages
located on device must be able to migrate to system memory when
accessed by CPU. Peer-to-peer access is possible if the peer can
handle page fault. For RDMA, that means the NIC must support on-demand paging.

peer-peer access is currently not possible with hmm_range_fault().

Currently hmm_range_fault() always sets the cpu access flag and device
private pages are migrated to the system RAM in the fault handler.
However, it's possible to have a modified code flow to keep the device
private page info for use with peer to peer access.

Sort of, but only within the same device, RDMA or anything else generic can't 
reach inside a DEVICE_PRIVATE and extract anything useful.

But pfn is supposed to be all that is needed.


So.. this patch doesn't really do anything new? We could just make a MR against 
the DMA buf mmap and get to the same place?

That's right, the patch alone is just half of the story. The
functionality depends on availability of dma-buf exporter that can pin
the device memory.

Well, what do you want to happen here? The RDMA parts are reasonable, but I 
don't want to add new functionality without a purpose - the
other parts need to be settled out first.

At the RDMA side, we mainly want to check if the changes are acceptable. For 
example,
the part about adding 'fd' to the device ops and the ioctl interface. All the 
previous
comments are very helpful for us to refine the patch so that we can be ready 
when
GPU side support becomes available.


The need for the dynamic mapping support for even the current DMA Buf hacky P2P 
users is really too bad. Can you get any GPU driver to
support non-dynamic mapping?

We are working in that direction.


migrate to system RAM. This is due to the lack of knowledge about
whether the importer can perform peer-to-peer access and the lack
of resource limit control measure for GPU. For the first part, the
latest dma-buf driver has a peer-to-peer flag for the importer,
but the flag is currently tied to dynamic mapping support, which
requires on-demand paging support from the NIC to work.

ODP for DMA buf?

Right.

Hum. This is not actually so hard to do. The whole dma buf proposal would make 
a lot more sense if the 'dma buf MR' had to be the
dynamic kind and the driver had to provide the faulting. It would not be so 
hard to change mlx5 to be able to work like this, perhaps. (the
locking might be a bit tricky though)

The main issue is that not all NICs support ODP.

You don't need on-demand paging support from the NIC for dynamic mapping to
work.

All you need is the ability to stop and wait for ongoing accesses to end and
make sure that new ones grab a new mapping.

Swap and flush isn't a general HW ability either..

I'm unclear how this could be useful, it is guaranteed to corrupt
in-progress writes?

Did you mean pause, swap and resume? That's ODP.


Yes, something like this. And good to know, never heard of ODP.


On the GPU side we can pipeline things, e.g. you can program the 
hardware so that page tables are changed at a certain point in time.


So what we do when we get a notification that a buffer will move 
around is to mark this buffer in our structures as invalid and return a 
fence so that the exporter is able to wait for ongoing stuff to finish.


The actual move then happens only after the ongoing operations on the 
GPU are finished and on the next operation we grab the new location of 
the buffer and re-program the page tables to it.


This way all the CPU does is really just planning asynchronous page 
table changes which are executed on the GPU later on.


You can of course do it synchronously as well, but this would hurt our 
performance pretty badly.


Regards,
Christian.



Jason


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support

2020-07-01 Thread Daniel Vetter
On Wed, Jul 01, 2020 at 02:07:44PM +0200, Daniel Vetter wrote:
> Either my mailer ate half the thread or it's still stuck somewhere, so
> jumping in the middle a bit.
> 
> On Wed, Jul 01, 2020 at 11:03:06AM +0200, Christian König wrote:
> > On 30.06.20 at 20:46, Xiong, Jianxin wrote:
> > > > -Original Message-
> > > > From: Jason Gunthorpe 
> > > > Sent: Tuesday, June 30, 2020 10:35 AM
> > > > To: Xiong, Jianxin 
> > > > Cc: linux-r...@vger.kernel.org; Doug Ledford ; 
> > > > Sumit Semwal ; Leon Romanovsky
> > > > ; Vetter, Daniel ; Christian 
> > > > Koenig 
> > > > Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> > > > 
> > > > On Tue, Jun 30, 2020 at 05:21:33PM +, Xiong, Jianxin wrote:
> > > > > > > Heterogeneous Memory Management (HMM) utilizes
> > > > > > > mmu_interval_notifier and ZONE_DEVICE to support shared virtual
> > > > > > > address space and page migration between system memory and device
> > > > > > > memory. HMM doesn't support pinning device memory because pages
> > > > > > > located on device must be able to migrate to system memory when
> > > > > > > accessed by CPU. Peer-to-peer access is possible if the peer can
> > > > > > > handle page fault. For RDMA, that means the NIC must support 
> > > > > > > on-demand paging.
> > > > > > peer-peer access is currently not possible with hmm_range_fault().
> > > > > Currently hmm_range_fault() always sets the cpu access flag and device
> > > > > private pages are migrated to the system RAM in the fault handler.
> > > > > However, it's possible to have a modified code flow to keep the device
> > > > > private page info for use with peer to peer access.
> > > > Sort of, but only within the same device, RDMA or anything else generic 
> > > > can't reach inside a DEVICE_PRIVATE and extract anything useful.
> > > But pfn is supposed to be all that is needed.
> > > 
> > > > > > So.. this patch doesn't really do anything new? We could just make 
> > > > > > a MR against the DMA buf mmap and get to the same place?
> > > > > That's right, the patch alone is just half of the story. The
> > > > > functionality depends on availability of dma-buf exporter that can pin
> > > > > the device memory.
> > > > Well, what do you want to happen here? The RDMA parts are reasonable, 
> > > > but I don't want to add new functionality without a purpose - the
> > > > other parts need to be settled out first.
> > > At the RDMA side, we mainly want to check if the changes are acceptable. 
> > > For example,
> > > the part about adding 'fd' to the device ops and the ioctl interface. All 
> > > the previous
> > > comments are very helpful for us to refine the patch so that we can be 
> > > ready when
> > > GPU side support becomes available.
> > > 
> > > > The need for the dynamic mapping support for even the current DMA Buf 
> > > > hacky P2P users is really too bad. Can you get any GPU driver to
> > > > support non-dynamic mapping?
> > > We are working in that direction.
> > > 
> > > > > > > migrate to system RAM. This is due to the lack of knowledge about
> > > > > > > whether the importer can perform peer-to-peer access and the lack
> > > > > > > of resource limit control measure for GPU. For the first part, the
> > > > > > > latest dma-buf driver has a peer-to-peer flag for the importer,
> > > > > > > but the flag is currently tied to dynamic mapping support, which
> > > > > > > requires on-demand paging support from the NIC to work.
> > > > > > ODP for DMA buf?
> > > > > Right.
> > > > Hum. This is not actually so hard to do. The whole dma buf proposal 
> > > > would make a lot more sense if the 'dma buf MR' had to be the
> > > > dynamic kind and the driver had to provide the faulting. It would not 
> > > > be so hard to change mlx5 to be able to work like this, perhaps. (the
> > > > locking might be a bit tricky though)
> > > The main issue is that not all NICs support ODP.
> > 
> > You don't need on-demand paging support from the NIC for dynamic mapping to
> > work.
> > 
> > All you need is the ability to stop wait for

Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support

2020-07-01 Thread Daniel Vetter
Either my mailer ate half the thread or it's still stuck somewhere, so
jumping in the middle a bit.

On Wed, Jul 01, 2020 at 11:03:06AM +0200, Christian König wrote:
> Am 30.06.20 um 20:46 schrieb Xiong, Jianxin:
> > > -Original Message-
> > > From: Jason Gunthorpe 
> > > Sent: Tuesday, June 30, 2020 10:35 AM
> > > To: Xiong, Jianxin 
> > > Cc: linux-r...@vger.kernel.org; Doug Ledford ; Sumit 
> > > Semwal ; Leon Romanovsky
> > > ; Vetter, Daniel ; Christian 
> > > Koenig 
> > > Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> > > 
> > > On Tue, Jun 30, 2020 at 05:21:33PM +, Xiong, Jianxin wrote:
> > > > > > Heterogeneous Memory Management (HMM) utilizes
> > > > > > mmu_interval_notifier and ZONE_DEVICE to support shared virtual
> > > > > > address space and page migration between system memory and device
> > > > > > memory. HMM doesn't support pinning device memory because pages
> > > > > > located on device must be able to migrate to system memory when
> > > > > > accessed by CPU. Peer-to-peer access is possible if the peer can
> > > > > > handle page fault. For RDMA, that means the NIC must support 
> > > > > > on-demand paging.
> > > > > peer-peer access is currently not possible with hmm_range_fault().
> > > > Currently hmm_range_fault() always sets the cpu access flag and device
> > > > private pages are migrated to the system RAM in the fault handler.
> > > > However, it's possible to have a modified code flow to keep the device
> > > > private page info for use with peer to peer access.
> > > Sort of, but only within the same device, RDMA or anything else generic 
> > > can't reach inside a DEVICE_PRIVATE and extract anything useful.
> > But pfn is supposed to be all that is needed.
> > 
> > > > > So.. this patch doesn't really do anything new? We could just make a 
> > > > > MR against the DMA buf mmap and get to the same place?
> > > > That's right, the patch alone is just half of the story. The
> > > > functionality depends on availability of dma-buf exporter that can pin
> > > > the device memory.
> > > Well, what do you want to happen here? The RDMA parts are reasonable, but 
> > > I don't want to add new functionality without a purpose - the
> > > other parts need to be settled out first.
> > At the RDMA side, we mainly want to check if the changes are acceptable. 
> > For example,
> > the part about adding 'fd' to the device ops and the ioctl interface. All 
> > the previous
> > comments are very helpful for us to refine the patch so that we can be 
> > ready when
> > GPU side support becomes available.
> > 
> > > The need for the dynamic mapping support for even the current DMA Buf 
> > > hacky P2P users is really too bad. Can you get any GPU driver to
> > > support non-dynamic mapping?
> > We are working in that direction.
> > 
> > > > > > migrate to system RAM. This is due to the lack of knowledge about
> > > > > > whether the importer can perform peer-to-peer access and the lack
> > > > > > of resource limit control measure for GPU. For the first part, the
> > > > > > latest dma-buf driver has a peer-to-peer flag for the importer,
> > > > > > but the flag is currently tied to dynamic mapping support, which
> > > > > > requires on-demand paging support from the NIC to work.
> > > > > ODP for DMA buf?
> > > > Right.
> > > Hum. This is not actually so hard to do. The whole dma buf proposal would 
> > > make a lot more sense if the 'dma buf MR' had to be the
> > > dynamic kind and the driver had to provide the faulting. It would not be 
> > > so hard to change mlx5 to be able to work like this, perhaps. (the
> > > locking might be a bit tricky though)
> > The main issue is that not all NICs support ODP.
> 
> You don't need on-demand paging support from the NIC for dynamic mapping to
> work.
> 
> All you need is the ability to stop and wait for ongoing accesses to end and
> make sure that new ones grab a new mapping.

So, having not much of a clue about rdma myself, this sounds rather interesting.
Sure it would result in immediately re-acquiring the pages, but that's
also really all we need to be able to move buffers around on the gpu side.
And with dma_resv_lock there's no livelock risk if the NIC immediately
starts a kthread/work_struct which reacquires all the dma
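As a rough sketch of the flow Christian and Daniel are describing (stop the hardware, wait, re-map under dma_resv_lock from a worker), under the assumption of a dma-buf backed MR; the nic_mr structure and the nic_* helpers are hypothetical, only the dma-buf/dma-resv calls are the real API.

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>
#include <linux/scatterlist.h>
#include <linux/workqueue.h>
#include <linux/err.h>

/* Hypothetical importer state for one registered region. */
struct nic_mr {
        struct dma_buf_attachment *attach;
        struct sg_table *sgt;
        struct work_struct remap_work; /* INIT_WORK()'d to nic_mr_remap_worker */
};

/* Hypothetical driver-specific helpers. */
static void nic_mr_block_and_drain(struct nic_mr *mr);
static void nic_mr_update_translation(struct nic_mr *mr);

static void nic_mr_move_notify(struct dma_buf_attachment *attach)
{
        struct nic_mr *mr = attach->importer_priv;

        /* "Stop and wait": block new accesses from the HW and drain the
         * in-flight ones. */
        nic_mr_block_and_drain(mr);

        /* The exporter holds the dma_resv lock here, so unmapping is safe. */
        if (mr->sgt) {
                dma_buf_unmap_attachment(attach, mr->sgt, DMA_BIDIRECTIONAL);
                mr->sgt = NULL;
        }

        /* "Grab a new mapping": re-acquire from process context. */
        schedule_work(&mr->remap_work);
}

static void nic_mr_remap_worker(struct work_struct *work)
{
        struct nic_mr *mr = container_of(work, struct nic_mr, remap_work);
        struct dma_buf *dmabuf = mr->attach->dmabuf;

        dma_resv_lock(dmabuf->resv, NULL); /* cannot fail without a ww ctx */
        mr->sgt = dma_buf_map_attachment(mr->attach, DMA_BIDIRECTIONAL);
        if (!IS_ERR(mr->sgt))
                nic_mr_update_translation(mr); /* reprogram and unblock the MR */
        else
                mr->sgt = NULL;
        dma_resv_unlock(dmabuf->resv);
}

static const struct dma_buf_attach_ops nic_mr_attach_ops = {
        .allow_peer2peer = true, /* today this rides on being dynamic */
        .move_notify     = nic_mr_move_notify,
};

The attachment itself would come from dma_buf_dynamic_attach(dmabuf, dev, &nic_mr_attach_ops, mr), with the initial mapping obtained under dma_resv_lock() just like in the worker; this is also where the peer2peer flag currently being tied to dynamic attachment shows up.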

Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support

2020-07-01 Thread Christian König

Am 30.06.20 um 20:46 schrieb Xiong, Jianxin:
> > -Original Message-
> > From: Jason Gunthorpe 
> > Sent: Tuesday, June 30, 2020 10:35 AM
> > To: Xiong, Jianxin 
> > Cc: linux-r...@vger.kernel.org; Doug Ledford ; Sumit Semwal
> > ; Leon Romanovsky
> > ; Vetter, Daniel ; Christian Koenig
> > 
> > Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> > 
> > On Tue, Jun 30, 2020 at 05:21:33PM +, Xiong, Jianxin wrote:
> > > > > Heterogeneous Memory Management (HMM) utilizes
> > > > > mmu_interval_notifier and ZONE_DEVICE to support shared virtual
> > > > > address space and page migration between system memory and device
> > > > > memory. HMM doesn't support pinning device memory because pages
> > > > > located on device must be able to migrate to system memory when
> > > > > accessed by CPU. Peer-to-peer access is possible if the peer can
> > > > > handle page fault. For RDMA, that means the NIC must support
> > > > > on-demand paging.
> > > > 
> > > > peer-peer access is currently not possible with hmm_range_fault().
> > > 
> > > Currently hmm_range_fault() always sets the cpu access flag and device
> > > private pages are migrated to the system RAM in the fault handler.
> > > However, it's possible to have a modified code flow to keep the device
> > > private page info for use with peer to peer access.
> > 
> > Sort of, but only within the same device, RDMA or anything else generic
> > can't reach inside a DEVICE_PRIVATE and extract anything useful.
> 
> But pfn is supposed to be all that is needed.
> 
> > > > So.. this patch doesn't really do anything new? We could just make a MR
> > > > against the DMA buf mmap and get to the same place?
> > > 
> > > That's right, the patch alone is just half of the story. The
> > > functionality depends on availability of dma-buf exporter that can pin
> > > the device memory.
> > 
> > Well, what do you want to happen here? The RDMA parts are reasonable, but
> > I don't want to add new functionality without a purpose - the
> > other parts need to be settled out first.
> 
> At the RDMA side, we mainly want to check if the changes are acceptable.
> For example, the part about adding 'fd' to the device ops and the ioctl
> interface. All the previous comments are very helpful for us to refine the
> patch so that we can be ready when GPU side support becomes available.
> 
> > The need for the dynamic mapping support for even the current DMA Buf
> > hacky P2P users is really too bad. Can you get any GPU driver to
> > support non-dynamic mapping?
> 
> We are working in that direction.
> 
> > > > > migrate to system RAM. This is due to the lack of knowledge about
> > > > > whether the importer can perform peer-to-peer access and the lack
> > > > > of resource limit control measure for GPU. For the first part, the
> > > > > latest dma-buf driver has a peer-to-peer flag for the importer,
> > > > > but the flag is currently tied to dynamic mapping support, which
> > > > > requires on-demand paging support from the NIC to work.
> > > > 
> > > > ODP for DMA buf?
> > > 
> > > Right.
> > 
> > Hum. This is not actually so hard to do. The whole dma buf proposal would
> > make a lot more sense if the 'dma buf MR' had to be the dynamic kind and
> > the driver had to provide the faulting. It would not be so hard to change
> > mlx5 to be able to work like this, perhaps. (the locking might be a bit
> > tricky though)
> 
> The main issue is that not all NICs support ODP.

You don't need on-demand paging support from the NIC for dynamic mapping
to work.

All you need is the ability to stop and wait for ongoing accesses to end
and make sure that new ones grab a new mapping.

Apart from that this is a rather interesting work.

Regards,
Christian.


> > > > > There are a few possible ways to address these issues, such as
> > > > > decoupling peer-to-peer flag from dynamic mapping, allowing more
> > > > > leeway for individual drivers to make the pinning decision and
> > > > > adding GPU resource limit control via cgroup. We would like to get
> > > > > comments on this patch series with the assumption that device
> > > > > memory pinning via dma-buf is supported by some GPU drivers, and
> > > > > at the same time welcome open discussions on how to address the
> > > > > aforementioned issues as well as GPU-NIC peer-to-peer access
> > > > > solutions in general.
> > > > 
> > > > These seem like DMA buf problems, not RDMA problems, why are you
> > > > asking these questions with a RDMA patch set? The usual DMA buf people
> > > > are not even Cc'd here.
> > > 
> > > The intention is to have people from both RDMA and DMA buffer side to
> > > comment. Sumit Semwal is the DMA buffer maintainer according to the
> > > MAINTAINERS file. I agree more people could be invited to the discussion.
> > > Just added Christian Koenig to the cc-list.
> > 
> > Would be good to have added the drm lists too
> 
> Thanks, cc'd dri-devel here, and will also do the same for the previous
> part of the thread.
> 
> > > If the umem_description you mentioned is for information used to
> > > create the umem (e.g. a structure for all the parameters), then this
> > > would work better.
> > 
> > It would make some more sense, and avoid all these weird EOPNOTSUPPS.
> 
> Good, thanks for the suggestion.
> 
> > Jason




Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support

2020-07-01 Thread Jason Gunthorpe
On Tue, Jun 30, 2020 at 06:46:17PM +, Xiong, Jianxin wrote:
> > From: Jason Gunthorpe 
> > Sent: Tuesday, June 30, 2020 10:35 AM
> > To: Xiong, Jianxin 
> > Cc: linux-r...@vger.kernel.org; Doug Ledford ; Sumit 
> > Semwal ; Leon Romanovsky
> > ; Vetter, Daniel ; Christian 
> > Koenig 
> > Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> > 
> > On Tue, Jun 30, 2020 at 05:21:33PM +, Xiong, Jianxin wrote:
> > > > > Heterogeneous Memory Management (HMM) utilizes
> > > > > mmu_interval_notifier and ZONE_DEVICE to support shared virtual
> > > > > address space and page migration between system memory and device
> > > > > memory. HMM doesn't support pinning device memory because pages
> > > > > located on device must be able to migrate to system memory when
> > > > > accessed by CPU. Peer-to-peer access is possible if the peer can
> > > > > handle page fault. For RDMA, that means the NIC must support 
> > > > > on-demand paging.
> > > >
> > > > peer-peer access is currently not possible with hmm_range_fault().
> > >
> > > Currently hmm_range_fault() always sets the cpu access flag and device
> > > private pages are migrated to the system RAM in the fault handler.
> > > However, it's possible to have a modified code flow to keep the device
> > > private page info for use with peer to peer access.
> > 
> > Sort of, but only within the same device, RDMA or anything else generic 
> > can't reach inside a DEVICE_PRIVATE and extract anything useful.
> 
> But pfn is supposed to be all that is needed.

Needed for what? The PFN of the DEVICE_PRIVATE pages is useless for
anything.

> > Well, what do you want to happen here? The RDMA parts are
> > reasonable, but I don't want to add new functionality without a
> > purpose - the other parts need to be settled out first.
> 
> At the RDMA side, we mainly want to check if the changes are
> acceptable. For example, the part about adding 'fd' to the device
> ops and the ioctl interface. All the previous comments are very
> helpful for us to refine the patch so that we can be ready when GPU
> side support becomes available.

Well, I'm not totally happy with the way the umem and the fd are
handled so roughly and incompletely...

> > Hum. This is not actually so hard to do. The whole dma buf
> > proposal would make a lot more sense if the 'dma buf MR' had to be
> > the dynamic kind and the driver had to provide the faulting. It
> > would not be so hard to change mlx5 to be able to work like this,
> > perhaps. (the locking might be a bit tricky though)
> 
> The main issue is that not all NICs support ODP.

Sure, but there is lots of infrastructure work here to be done on dma
buf, having a correct consumer in the form of ODP might be helpful to
advance it.

Jason
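For reference, a minimal sketch of the "driver provides the faulting" shape this would take, modeled on what ODP already does with an mmu_interval_notifier; the nic_* names are hypothetical, and a dma-buf backed dynamic MR would hook the equivalent logic into move_notify() instead of an mmu notifier.

#include <linux/kernel.h>
#include <linux/mmu_notifier.h>

struct nic_odp_mr {
        struct mmu_interval_notifier mni;
        /* ... device translation state for this MR ... */
};

/* Hypothetical helper: invalidate the NIC's translation for the range so
 * the next access from the wire faults and triggers a re-map. */
static void nic_zap_translation(struct nic_odp_mr *mr,
                                unsigned long start, unsigned long end);

static bool nic_odp_invalidate(struct mmu_interval_notifier *mni,
                               const struct mmu_notifier_range *range,
                               unsigned long cur_seq)
{
        struct nic_odp_mr *mr = container_of(mni, struct nic_odp_mr, mni);

        if (!mmu_notifier_range_blockable(range))
                return false;

        /* A real driver serializes this against its fault path with its
         * own lock before bumping the sequence. */
        mmu_interval_set_seq(mni, cur_seq);
        nic_zap_translation(mr, range->start, range->end);
        return true;
}

static const struct mmu_interval_notifier_ops nic_odp_ops = {
        .invalidate = nic_odp_invalidate,
};

Registration would go through mmu_interval_notifier_insert(&mr->mni, current->mm, start, length, &nic_odp_ops), with the fault path using mmu_interval_read_begin() and hmm_range_fault() in the usual way; this is exactly the infrastructure that NICs without ODP lack, which is the sticking point in the thread.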


RE: [RFC PATCH v2 0/3] RDMA: add dma-buf support

2020-06-30 Thread Xiong, Jianxin
> -Original Message-
> From: Jason Gunthorpe 
> Sent: Tuesday, June 30, 2020 12:17 PM
> To: Xiong, Jianxin 
> Cc: linux-r...@vger.kernel.org; Doug Ledford ; Sumit 
> Semwal ; Leon Romanovsky
> ; Vetter, Daniel ; Christian Koenig 
> ; dri-
> de...@lists.freedesktop.org
> Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> 
> > >
> > > On Tue, Jun 30, 2020 at 05:21:33PM +, Xiong, Jianxin wrote:
> > > > > > Heterogeneous Memory Management (HMM) utilizes
> > > > > > mmu_interval_notifier and ZONE_DEVICE to support shared
> > > > > > virtual address space and page migration between system memory
> > > > > > and device memory. HMM doesn't support pinning device memory
> > > > > > because pages located on device must be able to migrate to
> > > > > > system memory when accessed by CPU. Peer-to-peer access is
> > > > > > possible if the peer can handle page fault. For RDMA, that means 
> > > > > > the NIC must support on-demand paging.
> > > > >
> > > > > peer-peer access is currently not possible with hmm_range_fault().
> > > >
> > > > Currently hmm_range_fault() always sets the cpu access flag and
> > > > device private pages are migrated to the system RAM in the fault 
> > > > handler.
> > > > However, it's possible to have a modified code flow to keep the
> > > > device private page info for use with peer to peer access.
> > >
> > > Sort of, but only within the same device, RDMA or anything else generic 
> > > can't reach inside a DEVICE_PRIVATE and extract anything
> useful.
> >
> > But pfn is supposed to be all that is needed.
> 
> Needed for what? The PFN of the DEVICE_PRIVATE pages is useless for anything.

Hmm. I thought the pfn corresponds to the address in the BAR range. I could be
wrong here. 

> 
> > > Well, what do you want to happen here? The RDMA parts are
> > > reasonable, but I don't want to add new functionality without a
> > > purpose - the other parts need to be settled out first.
> >
> > At the RDMA side, we mainly want to check if the changes are
> > acceptable. For example, the part about adding 'fd' to the device ops
> > and the ioctl interface. All the previous comments are very helpful
> > for us to refine the patch so that we can be ready when GPU side
> > support becomes available.
> 
> Well, I'm not totally happy with the way the umem and the fd are handled so
> roughly and incompletely...

Yes, this feedback is very helpful. Will work on improving the code.

> 
> > > Hum. This is not actually so hard to do. The whole dma buf proposal
> > > would make a lot more sense if the 'dma buf MR' had to be the
> > > dynamic kind and the driver had to provide the faulting. It would
> > > not be so hard to change mlx5 to be able to work like this, perhaps.
> > > (the locking might be a bit tricky though)
> >
> > The main issue is that not all NICs support ODP.
> 
> Sure, but there is lots of infrastructure work here to be done on dma buf, 
> having a correct consumer in the form of ODP might be helpful to
> advance it.

Good point. Thanks.

> 
> Jason


RE: [RFC PATCH v2 0/3] RDMA: add dma-buf support

2020-06-30 Thread Xiong, Jianxin
Added to cc-list:
Christian Koenig 
dri-devel@lists.freedesktop.org

> -Original Message-
> From: Xiong, Jianxin 
> Sent: Monday, June 29, 2020 10:32 AM
> To: linux-r...@vger.kernel.org
> Cc: Xiong, Jianxin ; Doug Ledford 
> ; Jason Gunthorpe ; Sumit Semwal
> ; Leon Romanovsky ; Vetter, Daniel 
> 
> Subject: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> 
> When enabled, an RDMA capable NIC can perform peer-to-peer transactions
> over PCIe to access the local memory located on another device. This can
> often lead to better performance than using a system memory buffer for
> RDMA and copying data between the buffer and device memory.
> 
> Current kernel RDMA stack uses get_user_pages() to pin the physical
> pages backing the user buffer and uses dma_map_sg_attrs() to get the
> dma addresses for memory access. This usually doesn't work for peer
> device memory due to the lack of associated page structures.
> 
> Several mechanisms exist today to facilitate device memory access.
> 
> ZONE_DEVICE is a new zone for device memory in the memory management
> subsystem. It allows pages from device memory to be described with
> specialized page structures. As a result, calls like get_user_pages()
> can succeed, but what can be done with these page structures may be
> different from system memory. It is further specialized into multiple
> memory types, such as one type for PCI p2pmem/p2pdma and one type for
> HMM.
> 
> PCI p2pmem/p2pdma uses ZONE_DEVICE to represent device memory residing
> in a PCI BAR and provides a set of calls to publish, discover, allocate,
> and map such memory for peer-to-peer transactions. One feature of the
> API is that the buffer is allocated by the side that does the DMA
> transfer. This works well with the storage usage case, but is awkward
> with GPU-NIC communication, where typically the buffer is allocated by
> the GPU driver rather than the NIC driver.
> 
> Heterogeneous Memory Management (HMM) utilizes mmu_interval_notifier
> and ZONE_DEVICE to support shared virtual address space and page
> migration between system memory and device memory. HMM doesn't support
> pinning device memory because pages located on device must be able to
> migrate to system memory when accessed by CPU. Peer-to-peer access
> is possible if the peer can handle page fault. For RDMA, that means
> the NIC must support on-demand paging.
> 
> Dma-buf is a standard mechanism for sharing buffers among different
> device drivers. The buffer to be shared is exported by the owning
> driver and imported by the driver that wants to use it. The exporter
> provides a set of ops that the importer can call to pin and map the
> buffer. In addition, a file descriptor can be associated with a dma-
> buf object as the handle that can be passed to user space.
> 
> This patch series adds a dma-buf importer role to the RDMA driver in an
> attempt to support RDMA using device memory such as GPU VRAM. Dma-buf is
> chosen for a few reasons: first, the API is relatively simple and allows
> a lot of flexibility in implementing the buffer manipulation ops.
> Second, it doesn't require page structure. Third, dma-buf is already
> supported in many GPU drivers. However, we are aware that existing GPU
> drivers don't allow pinning device memory via the dma-buf interface.
> Pinning and mapping a dma-buf would cause the backing storage to migrate
> to system RAM. This is due to the lack of knowledge about whether the
> importer can perform peer-to-peer access and the lack of resource limit
> control measure for GPU. For the first part, the latest dma-buf driver
> has a peer-to-peer flag for the importer, but the flag is currently tied
> to dynamic mapping support, which requires on-demand paging support from
> the NIC to work. There are a few possible ways to address these issues,
> such as decoupling peer-to-peer flag from dynamic mapping, allowing more
> leeway for individual drivers to make the pinning decision and adding
> GPU resource limit control via cgroup. We would like to get comments on
> this patch series with the assumption that device memory pinning via
> dma-buf is supported by some GPU drivers, and at the same time welcome
> open discussions on how to address the aforementioned issues as well as
> GPU-NIC peer-to-peer access solutions in general.
> 
> This is the second version of the patch series. Here are the changes
> from the previous version:
> * The Kconfig option is removed. There is no dependency issue since the
> dma-buf driver is always enabled.
> * The declaration of new data structure and functions is reorganized to
> minimize the visibility of the changes.
> * The new uverbs command now goes through ioctl() instead of write().
> * The rereg functionality is 

RE: [RFC PATCH v2 0/3] RDMA: add dma-buf support

2020-06-30 Thread Xiong, Jianxin
> -Original Message-
> From: Jason Gunthorpe 
> Sent: Tuesday, June 30, 2020 10:35 AM
> To: Xiong, Jianxin 
> Cc: linux-r...@vger.kernel.org; Doug Ledford ; Sumit 
> Semwal ; Leon Romanovsky
> ; Vetter, Daniel ; Christian Koenig 
> 
> Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
> 
> On Tue, Jun 30, 2020 at 05:21:33PM +, Xiong, Jianxin wrote:
> > > > Heterogeneous Memory Management (HMM) utilizes
> > > > mmu_interval_notifier and ZONE_DEVICE to support shared virtual
> > > > address space and page migration between system memory and device
> > > > memory. HMM doesn't support pinning device memory because pages
> > > > located on device must be able to migrate to system memory when
> > > > accessed by CPU. Peer-to-peer access is possible if the peer can
> > > > handle page fault. For RDMA, that means the NIC must support on-demand 
> > > > paging.
> > >
> > > peer-peer access is currently not possible with hmm_range_fault().
> >
> > Currently hmm_range_fault() always sets the cpu access flag and device
> > private pages are migrated to the system RAM in the fault handler.
> > However, it's possible to have a modified code flow to keep the device
> > private page info for use with peer to peer access.
> 
> Sort of, but only within the same device, RDMA or anything else generic can't 
> reach inside a DEVICE_PRIVATE and extract anything useful.

But pfn is supposed to be all that is needed.

> 
> > > So.. this patch doesn't really do anything new? We could just make a MR 
> > > against the DMA buf mmap and get to the same place?
> >
> > That's right, the patch alone is just half of the story. The
> > functionality depends on availability of dma-buf exporter that can pin
> > the device memory.
> 
> Well, what do you want to happen here? The RDMA parts are reasonable, but I 
> don't want to add new functionality without a purpose - the
> other parts need to be settled out first.

At the RDMA side, we mainly want to check if the changes are acceptable. For 
example,
the part about adding 'fd' to the device ops and the ioctl interface. All the 
previous
comments are very helpful for us to refine the patch so that we can be ready 
when
GPU side support becomes available.

> 
> The need for the dynamic mapping support for even the current DMA Buf hacky 
> P2P users is really too bad. Can you get any GPU driver to
> support non-dynamic mapping?

We are working in that direction.

> 
> > > > migrate to system RAM. This is due to the lack of knowledge about
> > > > whether the importer can perform peer-to-peer access and the lack
> > > > of resource limit control measure for GPU. For the first part, the
> > > > latest dma-buf driver has a peer-to-peer flag for the importer,
> > > > but the flag is currently tied to dynamic mapping support, which
> > > > requires on-demand paging support from the NIC to work.
> > >
> > > ODP for DMA buf?
> >
> > Right.
> 
> Hum. This is not actually so hard to do. The whole dma buf proposal would 
> make a lot more sense if the 'dma buf MR' had to be the
> dynamic kind and the driver had to provide the faulting. It would not be so 
> hard to change mlx5 to be able to work like this, perhaps. (the
> locking might be a bit tricky though)

The main issue is that not all NICs support ODP.

> 
> > > > There are a few possible ways to address these issues, such as
> > > > decoupling peer-to-peer flag from dynamic mapping, allowing more
> > > > leeway for individual drivers to make the pinning decision and
> > > > adding GPU resource limit control via cgroup. We would like to get
> > > > comments on this patch series with the assumption that device
> > > > memory pinning via dma-buf is supported by some GPU drivers, and
> > > > at the same time welcome open discussions on how to address the
> > > > aforementioned issues as well as GPU-NIC peer-to-peer access solutions 
> > > > in general.
> > >
> > > These seem like DMA buf problems, not RDMA problems, why are you
> > > asking these questions with a RDMA patch set? The usual DMA buf people 
> > > are not even Cc'd here.
> >
> > The intention is to have people from both RDMA and DMA buffer side to
> > comment. Sumit Semwal is the DMA buffer maintainer according to the
> > MAINTAINERS file. I agree more people could be invited to the discussion.
> > Just added Christian Koenig to the cc-list.
> 
> Would be good to have added the drm lists too

Thanks, cc'd dri-devel here, and will also do the same for the previous part of 
the thread.

> 
> > If the umem_description you mentioned is for information used to
> > create the umem (e.g. a structure for all the parameters), then this would 
> > work better.
> 
> It would make some more sense, and avoid all these weird EOPNOTSUPPS.

Good, thanks for the suggestion.

> 
> Jason
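
To make the "structure for all the parameters" idea above concrete, a purely illustrative sketch; none of these names exist in the tree, they are placeholders for whatever the next revision actually defines.

#include <rdma/ib_verbs.h>
#include <rdma/ib_umem.h>

/* Hypothetical parameter bundle for creating a umem; the dma-buf fd is
 * just one more field instead of a separate code path per driver op. */
struct ib_umem_description {
        u64     addr;           /* user VA, or offset into the dma-buf */
        u64     length;         /* size of the region in bytes */
        int     access;         /* IB_ACCESS_* flags */
        int     dmabuf_fd;      /* -1 for ordinary user memory */
};

/* Hypothetical constructor: drivers that cannot handle dma-buf reject
 * dmabuf_fd >= 0 here, instead of sprinkling EOPNOTSUPP over the
 * individual device ops. */
struct ib_umem *ib_umem_get_desc(struct ib_device *device,
                                 const struct ib_umem_description *desc);

The 'fd' discussed for the ioctl interface would then presumably just populate dmabuf_fd.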