Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-27 Thread Daniel Vetter
On Thu, Feb 27, 2020 at 10:20 AM Christian König
 wrote:
> Am 26.02.20 um 17:32 schrieb Daniel Vetter:
> > On Tue, Feb 25, 2020 at 6:16 PM Daniel Vetter  wrote:
> >> On Mon, Feb 24, 2020 at 07:46:59PM +0100, Christian König wrote:
> >>> Am 23.02.20 um 17:54 schrieb Thomas Hellström (VMware):
>  On 2/23/20 4:45 PM, Christian König wrote:
> > Am 21.02.20 um 18:12 schrieb Daniel Vetter:
> >> [SNIP]
> >> Yeah the Great Plan (tm) is to fully rely on ww_mutex slowly
> >> degenerating
> >> into essentially a global lock. But only when there's actual contention
> >> and thrashing.
> > Yes exactly. A really big problem in TTM is currently that we drop
> > the lock after evicting BOs because they tend to move in again
> > directly after that.
> >
> >  From practice I can also confirm that there is exactly zero benefit
> > from dropping locks early and reacquiring them, for example for the VM
> > page tables. That just makes it more likely that somebody needs to
> > roll back, and this is what we need to avoid in the first place.
>  If you have a benchmarking setup available it would be very interesting
>  for future reference to see how changing from WD to WW mutexes affects
>  the rollback frequency. WW is known to cause rollbacks much less
>  frequently but there is more work associated with each rollback.
> >>> Not off hand. To be honest I still have a hard time getting a grip on the
> >>> difference between WD and WW from the algorithm point of view, so I can't
> >>> judge that difference at all.
> >>>
> > Contention on BO locks during command submission is perfectly fine
> > as long as this is as lightweight as possible while we don't have
> > thrashing. When we have thrashing, multi-submission performance is best
> > achieved by just favoring a single process to finish its business and
> > blocking everybody else.
>  Hmm. Sounds like we need a per-manager ww_rwsem protecting manager
>  allocation, taken in write mode when there's thrashing, in read mode
>  otherwise. That would limit the amount of "unnecessary" locks we'd have
>  to keep and reduce unwanted side-effects (see below):
> >>> Well, per-manager (you mean per domain here, don't you?) doesn't sound
> >>> that useful because we rarely use only one domain, but I've actually been
> >>> questioning for quite a while whether the per-BO lock scheme was the right
> >>> approach.
> >>>
> >>> See, from the performance aspect the closest-to-ideal solution I can
> >>> think of would be a ww_rwsem per user of a resource.
> >>>
> >>> In other words we don't lock BOs, but instead a list of all their users,
> >>> and when you want to evict a BO you need to walk that list and inform
> >>> all users that the BO will be moving.
> >>>
> >>> During command submission you then have the fast path, which just grabs
> >>> the read side of the user lock and checks whether all BOs are still in
> >>> the expected place.
> >>>
> >>> If some BOs were evicted you back off and start the slow path, e.g. maybe
> >>> even copy additional data from userspace then grab the write side of the
> >>> lock, etc.
> >>>
> >>> That approach is similar to what we use in amdgpu with the per-VM BOs, but
> >>> goes a step further. Problem is that we are so used to per BO locks in the
> >>> kernel that this is probably not doable any more.
> >> Yeah I think it'd be nice to have the same approach for shared bo too. I
> >> guess what we could do is something like this (spinning your ww_rwmutex
> >> idea a bit further):
> >>
> >> dma_buf_read_lock(buf, vm)
> >> {
> >>  if (enabled(CONFIG_DEBUG_WW_MUTEX_SLOWPATH))
> >>  {
> >>  check that vm is indeed listed in buf and splat if not
> >>  }
> >>
> >>  /* for a buf that's not shared in multiple vm we'd have buf->resv
> >>   * == vm->resv here */
> >>  return ww_mutex_lock(vm->resv);
> >> }
> >>
> >> dma_buf_write_lock(buf)
> >> {
> >>  for_each_vm_in_buf(buf, vm) {
> >>  ww_mutex_lock(vm->resv);
> >>  }
> >> }
> >>
> >> Ideally we'd track all these vms with something slightly less shoddy than
> >> a linked list :-) Resizeable array is probably pretty good, I think we
> >> only ever need to go from buf -> vm list, not the other way round. At
> >> least in dma_resv/dma_buf code, driver code ofc needs to keep a list of
> >> all bo bound to a vm somewhere. But that's probably a much bigger
> >> datastructure for tracking vma offsets and mappings and other things on
> >> top.
> >>
> >> Ofc to even just get there we'd need something like the sublock list to
> >> keep track of all the additional locks we'd need for the writer lock. And
> >> we'd need the release callback for backoff, so that we could also go
> >> through the slowpath on a vm object that we're not holding a full
> >> reference on. That also means vms need to be refcounted.
> >>

Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-27 Thread Christian König

Am 26.02.20 um 17:32 schrieb Daniel Vetter:

On Tue, Feb 25, 2020 at 6:16 PM Daniel Vetter  wrote:

On Mon, Feb 24, 2020 at 07:46:59PM +0100, Christian König wrote:

Am 23.02.20 um 17:54 schrieb Thomas Hellström (VMware):

On 2/23/20 4:45 PM, Christian König wrote:

Am 21.02.20 um 18:12 schrieb Daniel Vetter:

[SNIP]
Yeah the Great Plan (tm) is to fully rely on ww_mutex slowly
degenerating
into essentially a global lock. But only when there's actual contention
and thrashing.

Yes exactly. A really big problem in TTM is currently that we drop
the lock after evicting BOs because they tend to move in again
directly after that.

 From practice I can also confirm that there is exactly zero benefit
from dropping locks early and reacquiring them, for example for the VM
page tables. That just makes it more likely that somebody needs to
roll back, and this is what we need to avoid in the first place.

If you have a benchmarking setup available it would be very interesting
for future reference to see how changing from WD to WW mutexes affects
the rollback frequency. WW is known to cause rollbacks much less
frequently but there is more work associated with each rollback.

Not off hand. To be honest I still have a hard time getting a grip on the
difference between WD and WW from the algorithm point of view, so I can't
judge that difference at all.


Contention on BO locks during command submission is perfectly fine
as long as this is as lightweight as possible while we don't have
thrashing. When we have thrashing, multi-submission performance is best
achieved by just favoring a single process to finish its business and
blocking everybody else.

Hmm. Sounds like we need a per-manager ww_rwsem protecting manager
allocation, taken in write mode when there's thrashing, in read mode
otherwise. That would limit the amount of "unnecessary" locks we'd have
to keep and reduce unwanted side-effects (see below):

Well, per-manager (you mean per domain here, don't you?) doesn't sound
that useful because we rarely use only one domain, but I've actually been
questioning for quite a while whether the per-BO lock scheme was the right
approach.

See, from the performance aspect the closest-to-ideal solution I can think of
would be a ww_rwsem per user of a resource.

In other words we don't lock BOs, but instead a list of all their users, and
when you want to evict a BO you need to walk that list and inform all users
that the BO will be moving.

During command submission you then have the fast path, which just
grabs the read side of the user lock and checks whether all BOs are still in the
expected place.

If some BOs were evicted you back off and start the slow path, e.g. maybe
even copy additional data from userspace then grab the write side of the
lock, etc.

That approach is similar to what we use in amdgpu with the per-VM BOs, but
goes a step further. Problem is that we are so used to per BO locks in the
kernel that this is probably not doable any more.

Yeah I think it'd be nice to have the same approach for shared bo too. I
guess what we could do is something like this (spinning your ww_rwmutex
idea a bit further):

dma_buf_read_lock(buf, vm)
{
 if (enabled(CONFIG_DEBUG_WW_MUTEX_SLOWPATH))
 {
 check that vm is indeed listed in buf and splat if not
 }

 /* for a buf that's not shared in multiple vm we'd have buf->resv
  * == vm->resv here */
 return ww_mutex_lock(vm->resv);
}

dma_buf_write_lock(buf)
{
 for_each_vm_in_buf(buf, vm) {
 ww_mutex_lock(vm->resv);
 }
}

Ideally we'd track all these vms with something slightly less shoddy than
a linked list :-) Resizeable array is probably pretty good, I think we
only ever need to go from buf -> vm list, not the other way round. At
least in dma_resv/dma_buf code, driver code ofc needs to keep a list of
all bo bound to a vm somewhere. But that's probably a much bigger
datastructure for tracking vma offsets and mappings and other things on
top.

Ofc to even just get there we'd need something like the sublock list to
keep track of all the additional locks we'd need for the writer lock. And
we'd need the release callback for backoff, so that we could also go
through the slowpath on a vm object that we're not holding a full
reference on. That also means vms need to be refcounted.

And the list of vms on a buffer needs to be protected with some lock and
the usual kref_get_unless_zero trickery.

But with all this I think we can make dma_buf_write_lock behave 100%
like the old per-buffer lock for everyone. And execbuf could switch over
to dma_buf_read_lock for shared buffers. Bonus points when the gpu context
just keeps track of a list of shared vm used by buffers in that context
...

That way we could make vm fastpath locking a la amdgpu opt-in, while
keeping everyone else on the per-object locking juices.

Thoughts?


At least to me that sounds like a plan.


One thing I just realized, which is nasty: The full 

Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-26 Thread Daniel Vetter
On Tue, Feb 25, 2020 at 6:16 PM Daniel Vetter  wrote:
>
> On Mon, Feb 24, 2020 at 07:46:59PM +0100, Christian König wrote:
> > Am 23.02.20 um 17:54 schrieb Thomas Hellström (VMware):
> > > On 2/23/20 4:45 PM, Christian König wrote:
> > > > Am 21.02.20 um 18:12 schrieb Daniel Vetter:
> > > > > [SNIP]
> > > > > Yeah the Great Plan (tm) is to fully rely on ww_mutex slowly
> > > > > degenerating
> > > > > into essentially a global lock. But only when there's actual 
> > > > > contention
> > > > > and thrashing.
> > > >
> > > > Yes exactly. A really big problem in TTM is currently that we drop
> > > > the lock after evicting BOs because they tend to move in again
> > > > directly after that.
> > > >
> > > > From practice I can also confirm that there is exactly zero benefit
> > > > from dropping locks early and reacquiring them, for example for the VM
> > > > page tables. That just makes it more likely that somebody needs to
> > > > roll back, and this is what we need to avoid in the first place.
> > >
> > > If you have a benchmarking setup available it would be very interesting
> > > for future reference to see how changing from WD to WW mutexes affects
> > > the rollback frequency. WW is known to cause rollbacks much less
> > > frequently but there is more work associated with each rollback.
> >
> > Not off hand. To be honest I still have a hard time getting a grip on the
> > difference between WD and WW from the algorithm point of view, so I can't
> > judge that difference at all.
> >
> > > > Contention on BO locks during command submission is perfectly fine
> > > > as long as this is as lightweight as possible while we don't have
> > > > thrashing. When we have thrashing, multi-submission performance is best
> > > > achieved by just favoring a single process to finish its business and
> > > > blocking everybody else.
> > >
> > > Hmm. Sounds like we need a per-manager ww_rwsem protecting manager
> > > allocation, taken in write mode when there's thrashing, in read mode
> > > otherwise. That would limit the amount of "unnecessary" locks we'd have
> > > to keep and reduce unwanted side-effects (see below):
> >
> > Well, per-manager (you mean per domain here, don't you?) doesn't sound
> > that useful because we rarely use only one domain, but I've actually been
> > questioning for quite a while whether the per-BO lock scheme was the right
> > approach.
> >
> > See, from the performance aspect the closest-to-ideal solution I can think of
> > would be a ww_rwsem per user of a resource.
> >
> > In other words we don't lock BOs, but instead a list of all their users, and
> > when you want to evict a BO you need to walk that list and inform all users
> > that the BO will be moving.
> >
> > During command submission you then have the fast path, which just
> > grabs the read side of the user lock and checks whether all BOs are still in the
> > expected place.
> >
> > If some BOs were evicted you back off and start the slow path, e.g. maybe
> > even copy additional data from userspace then grab the write side of the
> > lock, etc.
> >
> > That approach is similar to what we use in amdgpu with the per-VM BOs, but
> > goes a step further. Problem is that we are so used to per BO locks in the
> > kernel that this is probably not doable any more.
>
> Yeah I think it'd be nice to have the same approach for shared bo too. I
> guess what we could do is something like this (spinning your ww_rwmutex
> idea a bit further):
>
> dma_buf_read_lock(buf, vm)
> {
> if (enabled(CONFIG_DEBUG_WW_MUTEX_SLOWPATH))
> {
> check that vm is indeed listed in buf and splat if not
> }
>
> /* for a buf that's not shared in multiple vm we'd have buf->resv
>  * == vm->resv here */
> return ww_mutex_lock(vm->resv);
> }
>
> dma_buf_write_lock(buf)
> {
> for_each_vm_in_buf(buf, vm) {
> ww_mutex_lock(vm->resv);
> }
> }
>
> Ideally we'd track all these vms with something slightly less shoddy than
> a linked list :-) Resizeable array is probably pretty good, I think we
> only ever need to go from buf -> vm list, not the other way round. At
> least in dma_resv/dma_buf code, driver code ofc needs to keep a list of
> all bo bound to a vm somewhere. But that's probably a much bigger
> datastructure for tracking vma offsets and mappings and other things on
> top.
>
> Ofc to even just get there we'd need something like the sublock list to
> keep track of all the additional locks we'd need for the writer lock. And
> we'd need the release callback for backoff, so that we could also go
> through the slowpath on a vm object that we're not holding a full
> reference on. That also means vms need to be refcounted.
>
> And the list of vms on a buffer needs to be protected with some lock and
> the usual kref_get_unless_zero trickery.
>
> But with all this I think we can make dma_buf_write_lock behave 100%
> like the old per-buffer lock for everyone. And execbuf could switch over
> to 

Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-25 Thread Daniel Vetter
On Mon, Feb 24, 2020 at 07:46:59PM +0100, Christian König wrote:
> Am 23.02.20 um 17:54 schrieb Thomas Hellström (VMware):
> > On 2/23/20 4:45 PM, Christian König wrote:
> > > Am 21.02.20 um 18:12 schrieb Daniel Vetter:
> > > > [SNIP]
> > > > Yeah the Great Plan (tm) is to fully rely on ww_mutex slowly
> > > > degenerating
> > > > into essentially a global lock. But only when there's actual contention
> > > > and thrashing.
> > > 
> > > Yes exactly. A really big problem in TTM is currently that we drop
> > > the lock after evicting BOs because they tend to move in again
> > > directly after that.
> > > 
> > > From practice I can also confirm that there is exactly zero benefit
> > > from dropping locks early and reacquiring them, for example for the VM
> > > page tables. That just makes it more likely that somebody needs to
> > > roll back, and this is what we need to avoid in the first place.
> > 
> > If you have a benchmarking setup available it would be very interesting
> > for future reference to see how changing from WD to WW mutexes affects
> > the rollback frequency. WW is known to cause rollbacks much less
> > frequently but there is more work associated with each rollback.
> 
> Not off hand. To be honest I still have a hard time getting a grip on the
> difference between WD and WW from the algorithm point of view, so I can't
> judge that difference at all.
> 
> > > Contention on BO locks during command submission is perfectly fine
> > > as long as this is as lightweight as possible while we don't have
> > > thrashing. When we have thrashing, multi-submission performance is best
> > > achieved by just favoring a single process to finish its business and
> > > blocking everybody else.
> > 
> > Hmm. Sounds like we need a per-manager ww_rwsem protecting manager
> > allocation, taken in write mode when there's thrashing, in read mode
> > otherwise. That would limit the amount of "unnecessary" locks we'd have
> > to keep and reduce unwanted side-effects (see below):
> 
> Well, per-manager (you mean per domain here, don't you?) doesn't sound
> that useful because we rarely use only one domain, but I've actually been
> questioning for quite a while whether the per-BO lock scheme was the right
> approach.
> 
> See, from the performance aspect the closest-to-ideal solution I can think of
> would be a ww_rwsem per user of a resource.
> 
> In other words we don't lock BOs, but instead a list of all their users, and
> when you want to evict a BO you need to walk that list and inform all users
> that the BO will be moving.
> 
> During command submission you then have the fast path, which just
> grabs the read side of the user lock and checks whether all BOs are still in the
> expected place.
> 
> If some BOs were evicted you back off and start the slow path, e.g. maybe
> even copy additional data from userspace then grab the write side of the
> lock, etc.
> 
> That approach is similar to what we use in amdgpu with the per-VM BOs, but
> goes a step further. Problem is that we are so used to per BO locks in the
> kernel that this is probably not doable any more.

Yeah I think it'd be nice to have the same approach for shared bo too. I
guess what we could do is something like this (spinning your ww_rwmutex
idea a bit further):

dma_buf_read_lock(buf, vm)
{
if (enabled(CONFIG_DEBUG_WW_MUTEX_SLOWPATH))
{
check that vm is indeed listed in buf and splat if not
}

/* for a buf that's not shared in multiple vm we'd have buf->resv
 * == vm->resv here */
return ww_mutex_lock(vm->resv);
}

dma_buf_write_lock(buf)
{
for_each_vm_in_buf(buf, vm) {
ww_mutex_lock(vm->resv);
}
}
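
As a rough translation of the above onto the in-tree ww_mutex API it could
look like the sketch below; struct my_vm, the buf->vms array and
dma_buf_contains_vm() are invented names for illustration, and error
unwinding is left out:

static int dma_buf_read_lock(struct dma_buf *buf, struct my_vm *vm,
                             struct ww_acquire_ctx *ctx)
{
        /* splat if vm is not actually listed in buf */
        if (IS_ENABLED(CONFIG_DEBUG_WW_MUTEX_SLOWPATH))
                WARN_ON(!dma_buf_contains_vm(buf, vm));

        /* for a buf that's not shared between vms, buf->resv == vm->resv */
        return ww_mutex_lock(&vm->resv->lock, ctx);
}

static int dma_buf_write_lock(struct dma_buf *buf, struct ww_acquire_ctx *ctx)
{
        unsigned int i;
        int ret;

        for (i = 0; i < buf->num_vms; i++) {
                ret = ww_mutex_lock(&buf->vms[i]->resv->lock, ctx);
                if (ret)        /* including -EDEADLK; caller must back off */
                        return ret;
        }
        return 0;
}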

Ideally we'd track all these vms with something slightly less shoddy than
a linked list :-) Resizeable array is probably pretty good, I think we
only ever need to go from buf -> vm list, not the other way round. At
least in dma_resv/dma_buf code, driver code ofc needs to keep a list of
all bo bound to a vm somewhere. But that's probably a much bigger
datastructure for tracking vma offsets and mappings and other things on
top.

Ofc to even just get there we'd need something like the sublock list to
keep track of all the additional locks we'd need for the writer lock. And
we'd need the release callback for backoff, so that we could also go
through the slowpath on a vm object that we're not holding a full
reference on. That also means vms need to be refcounted.

And the list of vms on a buffer needs to be protected with some lock and
the usual kref_get_unless_zero trickery.
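
The usual shape of that trickery, again with invented names (vms_lock
protects the list; kref_get_unless_zero() is the real helper and refuses
to take a reference once the refcount has already hit zero):

static struct my_vm *dma_buf_get_vm(struct dma_buf *buf, unsigned int idx)
{
        struct my_vm *vm;

        spin_lock(&buf->vms_lock);
        vm = buf->vms[idx];
        if (vm && !kref_get_unless_zero(&vm->ref))
                vm = NULL;      /* vm is being torn down, treat as gone */
        spin_unlock(&buf->vms_lock);

        return vm;      /* caller eventually drops the ref with kref_put() */
}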

But with all this I think we can make dma_buf_write_lock behave 100%
like the old per-buffer lock for everyone. And execbuf could switch over
to dma_buf_read_lock for shared buffers. Bonus points when the gpu context
just keeps track of a list of shared vm used by buffers in that context
...

That way we could make vm fastpath locking a la amdgpu opt-in, while
keeping everyone else on the per-object 

Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-24 Thread VMware

On 2/24/20 7:46 PM, Christian König wrote:

Am 23.02.20 um 17:54 schrieb Thomas Hellström (VMware):

On 2/23/20 4:45 PM, Christian König wrote:

Am 21.02.20 um 18:12 schrieb Daniel Vetter:

[SNIP]
Yeah the Great Plan (tm) is to fully rely on ww_mutex slowly degenerating
into essentially a global lock. But only when there's actual contention
and thrashing.


Yes exactly. A really big problem in TTM is currently that we drop 
the lock after evicting BOs because they tend to move in again 
directly after that.


From practice I can also confirm that there is exactly zero benefit 
from dropping locks early and reacquiring them, for example for the VM 
page tables. That just makes it more likely that somebody needs to 
roll back, and this is what we need to avoid in the first place.


If you have a benchmarking setup available it would be very 
interesting for future reference to see how changing from WD to WW 
mutexes affects the rollback frequency. WW is known to cause 
rollbacks much less frequently but there is more work associated with 
each rollback.


Not off hand. To be honest I still have a hard time getting a grip on 
the difference between WD and WW from the algorithm point of view, so 
I can't judge that difference at all.


OK. I don't think a detailed understanding of the algorithms is strictly 
necessary, though. If we had had a good testcase we'd just have to 
change DEFINE_WD_CLASS in dma-resv.c to DEFINE_WW_CLASS and benchmark 
relevant figures.
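
Concretely, the whole experiment would be roughly this one-liner in 
drivers/dma-buf/dma-resv.c:

-DEFINE_WD_CLASS(reservation_ww_class);
+DEFINE_WW_CLASS(reservation_ww_class);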





Contention on BO locks during command submission is perfectly fine 
as long as this is as lightweight as possible while we don't have 
thrashing. When we have thrashing, multi-submission performance is best 
achieved by just favoring a single process to finish its business and 
blocking everybody else.


Hmm. Sounds like we need a per-manager ww_rwsem protecting manager 
allocation, taken in write mode when there's thrashing, in read mode 
otherwise. That would limit the amount of "unnecessary" locks we'd 
have to keep and reduce unwanted side-effects (see below):


Well, per-manager (you mean per domain here, don't you?)

yes.

doesn't sound that useful because we rarely use only one domain,


Well the difference to keeping all locks would boil down to:
"Instead of keeping all locks of all bos we evict from thrashing 
domains, keep locks of all thrashing domains in write mode". This means 
that for domains that are not thrashing, we'd just keep read locks.



but I've actually been questioning for quite a while whether the per-BO 
lock scheme was the right approach.


See, from the performance aspect the closest-to-ideal solution I can 
think of would be a ww_rwsem per user of a resource.


In other words we don't lock BOs, but instead a list of all their 
users, and when you want to evict a BO you need to walk that list and 
inform all users that the BO will be moving.


During command submission you then have the fast path, which just 
grabs the read side of the user lock and checks whether all BOs are 
still in the expected place.


If some BOs were evicted you back off and start the slow path, e.g. 
maybe even copy additional data from userspace then grab the write 
side of the lock, etc.


That approach is similar to what we use in amdgpu with the per-VM BOs, 
but goes a step further. Problem is that we are so used to per BO 
locks in the kernel that this is probably not doable any more.


I think we need some sort of per-bo lock to protect bo metadata. But 
yes, relying solely on them to resolve other resource (domain) 
contention may not be (or rather probably isn't) the right choice.




Because of this I would actually vote for forbidding the release of 
individual ww_mutex() locks in a context.


Yes, I see the problem.

But my first reaction is that this might have undesirable 
side-effects. Let's say somebody wanted to swap the evicted BOs out?


Please explain further, I off hand don't see the problem here.


Let's say thread A evicts a lot of thread B's bos, and keeps the locks of 
those bos for a prolonged time. Then thread C needs memory and wants to 
swap out thread B's bos. It can't, or at least not during a certain 
delay because thread A unnecessarily holds the locks.




In general I actually wanted to re-work TTM in a way that BOs in the 
SYSTEM/SWAPABLE domain are always backed by a shmem file instead of 
the struct page array we currently have.


That would probably work well if there are no SYSTEM+write-combined 
users anymore. Typically in the old AGP days, you wouldn't change 
caching mode when evicting from write-combine AGP to SYSTEM because of 
the dead slow wbinvd() operation.


Or cpu-writes to them causing faults, that might also block the 
mm_sem, which in turn blocks hugepaged?


Mhm, I also only have a higher-level view of how hugepaged works, so why 
does it grab the mm_sem on the write side?


If I understand it correctly, it's needed when collapsing PMD 
directories to huge PMD pages. But this was merely an example. For 

Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-24 Thread Christian König

Am 23.02.20 um 17:54 schrieb Thomas Hellström (VMware):

On 2/23/20 4:45 PM, Christian König wrote:

Am 21.02.20 um 18:12 schrieb Daniel Vetter:

[SNIP]
Yeah the Great Plan (tm) is to fully rely on ww_mutex slowly degenerating
into essentially a global lock. But only when there's actual contention
and thrashing.


Yes exactly. A really big problem in TTM is currently that we drop 
the lock after evicting BOs because they tend to move in again 
directly after that.


From practice I can also confirm that there is exactly zero benefit 
from dropping locks early and reacquiring them, for example for the VM 
page tables. That just makes it more likely that somebody needs to 
roll back, and this is what we need to avoid in the first place.


If you have a benchmarking setup available it would be very 
interesting for future reference to see how changing from WD to WW 
mutexes affects the rollback frequency. WW is known to cause 
rollbacks much less frequently but there is more work associated with 
each rollback.


Not off hand. To be honest I still have a hard time getting a grip on the 
difference between WD and WW from the algorithm point of view, so I 
can't judge that difference at all.


Contention on BO locks during command submission is perfectly fine as 
long as this is as lightweight as possible while we don't have 
thrashing. When we have thrashing, multi-submission performance is best 
achieved by just favoring a single process to finish its business and 
blocking everybody else.


Hmm. Sounds like we need a per-manager ww_rwsem protecting manager 
allocation, taken in write mode when there's thrashing, in read mode 
otherwise. That would limit the amount of "unnecessary" locks we'd 
have to keep and reduce unwanted side-effects (see below):


Well, per-manager (you mean per domain here, don't you?) doesn't sound 
that useful because we rarely use only one domain, but I've actually been 
questioning for quite a while whether the per-BO lock scheme was the right 
approach.


See, from the performance aspect the closest-to-ideal solution I can 
think of would be a ww_rwsem per user of a resource.


In other words we don't lock BOs, but instead a list of all their users, 
and when you want to evict a BO you need to walk that list and inform 
all users that the BO will be moving.


During command submission you then have the fast path, which just 
grabs the read side of the user lock and checks whether all BOs are still in 
the expected place.


If some BOs were evicted you back off and start the slow path, e.g. 
maybe even copy additional data from userspace then grab the write side 
of the lock, etc.


That approach is similar to what we use in amdgpu with the per-VM BOs, 
but goes a step further. Problem is that we are so used to per BO locks 
in the kernel that this is probably not doable any more.


Because of this I would actually vote for forbidding the release of 
individual ww_mutex() locks in a context.


Yes, I see the problem.

But my first reaction is that this might have undesirable 
side-effects. Let's say somebody wanted to swap the evicted BOs out?


Please explain further, I off hand don't see the problem here.

In general I actually wanted to re-work TTM in a way that BOs in the 
SYSTEM/SWAPABLE domain are always backed by a shmem file instead of the 
struct page array we currently have.


Or cpu-writes to them causing faults, that might also block the 
mm_sem, which in turn blocks hugepaged?


Mhm, I also only have a higher-level view of how hugepaged works, so why 
does it grab the mm_sem on the write side?


Thanks,
Christian.



Still it's a fairly simple solution to a problem that seems otherwise 
hard to solve efficiently.


Thanks,
Thomas




Regards,
Christian.


-Daniel







Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-24 Thread Christian König

Am 21.02.20 um 18:12 schrieb Daniel Vetter:

[SNIP]
Yeah the Great Plan (tm) is to fully rely on ww_mutex slowly degenerating
into essentially a global lock. But only when there's actual contention
and thrashing.


Yes exactly. A really big problem in TTM is currently that we drop the 
lock after evicting BOs because they tend to move in again directly 
after that.


From practice I can also confirm that there is exactly zero benefit 
from dropping locks early and reacquiring them, for example for the VM page 
tables. That just makes it more likely that somebody needs to roll 
back, and this is what we need to avoid in the first place.


Contention on BO locks during command submission is perfectly fine as 
long as this is as lightweight as possible while we don't have thrashing. 
When we have thrashing, multi-submission performance is best achieved by 
just favoring a single process to finish its business and blocking everybody else.


Because of this I would actually vote for forbidding the release of 
individual ww_mutex() locks in a context.


Regards,
Christian.


-Daniel




Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-23 Thread VMware

On 2/23/20 4:45 PM, Christian König wrote:

Am 21.02.20 um 18:12 schrieb Daniel Vetter:

[SNIP]
Yeah the Great Plan (tm) is to fully rely on ww_mutex slowly degenerating
into essentially a global lock. But only when there's actual contention
and thrashing.


Yes exactly. A really big problem in TTM is currently that we drop the 
lock after evicting BOs because they tend to move in again directly 
after that.


From practice I can also confirm that there is exactly zero benefit 
from dropping locks early and reacquiring them, for example for the VM 
page tables. That just makes it more likely that somebody needs to 
roll back, and this is what we need to avoid in the first place.



If you have a benchmarking setup available it would be very interesting 
for future reference to see how changing from WD to WW mutexes affects 
the rollback frequency. WW is known to cause rollbacks much less 
frequently but there is more work associated with each rollback.




Contention on BO locks during command submission is perfectly fine as 
long as this is as lightweight as possible while we don't have 
thrashing. When we have thrashing, multi-submission performance is best 
achieved by just favoring a single process to finish its business and 
blocking everybody else.


Hmm. Sounds like we need a per-manager ww_rwsem protecting manager 
allocation, taken in write mode when there's thrashing, in read mode 
otherwise. That would limit the amount of "unnecessary" locks we'd have 
to keep and reduce unwanted side-effects (see below):




Because of this I would actually vote for forbidding the release of 
individual ww_mutex() locks in a context.


Yes, I see the problem.

But my first reaction is that this might have undesirable side-effects. 
Let's say somebody wanted to swap the evicted BOs out? Or cpu-writes to 
them causing faults, that might also block the mm_sem, which in turn 
blocks hugepaged?


Still it's a fairly simple solution to a problem that seems otherwise 
hard to solve efficiently.


Thanks,
Thomas




Regards,
Christian.


-Daniel





Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-21 Thread VMware

On 2/21/20 6:12 PM, Daniel Vetter wrote:

On Thu, Feb 20, 2020 at 11:51:07PM +0100, Thomas Hellström (VMware) wrote:

On 2/20/20 9:08 PM, Daniel Vetter wrote:

On Thu, Feb 20, 2020 at 08:46:27PM +0100, Thomas Hellström (VMware) wrote:

On 2/20/20 7:04 PM, Daniel Vetter wrote:

On Thu, Feb 20, 2020 at 10:39:06AM +0100, Thomas Hellström (VMware) wrote:

On 2/19/20 7:42 AM, Thomas Hellström (VMware) wrote:

On 2/18/20 10:01 PM, Daniel Vetter wrote:

On Tue, Feb 18, 2020 at 9:17 PM Thomas Hellström (VMware)
 wrote:

On 2/17/20 6:55 PM, Daniel Vetter wrote:

On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:

Implement the importer side of unpinned DMA-buf handling.

v2: update page tables immediately

Signed-off-by: Christian König 
---
      drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66
-
      drivers/gpu/drm/amd/amdgpu/amdgpu_object.c  |  6 ++
      2 files changed, 71 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
index 770baba621b3..48de7624d49c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct
drm_device *dev, struct dma_buf *dma_buf)
     return ERR_PTR(ret);
      }

+/**
+ * amdgpu_dma_buf_move_notify - &attach.move_notify implementation
+ *
+ * @attach: the DMA-buf attachment
+ *
+ * Invalidate the DMA-buf attachment, making sure that we re-create the
+ * mapping before the next use.
+ */
+static void
+amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach)
+{
+    struct drm_gem_object *obj = attach->importer_priv;
+    struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv);
+    struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
+    struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
+    struct ttm_operation_ctx ctx = { false, false };
+    struct ttm_placement placement = {};
+    struct amdgpu_vm_bo_base *bo_base;
+    int r;
+
+    if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM)
+    return;
+
+    r = ttm_bo_validate(&bo->tbo, &placement, &ctx);
+    if (r) {
+    DRM_ERROR("Failed to invalidate DMA-buf import (%d)\n", r);
+    return;
+    }
+
+    for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) {
+    struct amdgpu_vm *vm = bo_base->vm;
+    struct dma_resv *resv = vm->root.base.bo->tbo.base.resv;
+
+    if (ticket) {

Yeah so this is kinda why I've been a total pain about the
exact semantics
of the move_notify hook. I think we should flat-out require
that importers
_always_ have a ticket attached when they call this, and that
they can cope
with additional locks being taken (i.e. full EDEADLK handling).

Simplest way to force that contract is to add a dummy 2nd
ww_mutex lock to
the dma_resv object, which we then can take #ifdef
CONFIG_DEBUG_WW_MUTEX_SLOWPATH. Plus maybe a WARN_ON(!ticket).

Now the real disaster is how we handle deadlocks. Two issues:

- Ideally we'd keep any lock we've taken locked until the
end, it helps avoid
       needless backoffs. I've played around a bit with that
but not even poc
       level, just an idea:

https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d27be37331255ab7582


       Idea is essentially to track a list of objects we had to
lock as part of
       the ttm_bo_validate of the main object.

- Second one is if we get an EDEADLK on one of these
sublocks (like the
       one here). We need to pass that up the entire callchain,
including a
       temporary reference (we have to drop locks to do the
ww_mutex_lock_slow
       call), and need a custom callback to drop that temporary reference
       (since that's all driver specific, might even be
internal ww_mutex and
       not anything remotely looking like a normal dma_buf).
This probably
       needs the exec util helpers from ttm, but at the
dma_resv level, so that
       we can do something like this:

struct dma_resv_ticket {
      struct ww_acquire_ctx base;

      /* can be set by anyone (including other drivers)
that got hold of
       * this ticket and had to acquire some new lock. This
lock might
       * protect anything, including driver-internal stuff, and isn't
       * required to be a dma_buf or even just a dma_resv. */
      struct ww_mutex *contended_lock;

      /* callback which the driver (which might be a dma-buf exporter
       * and not matching the driver that started this
locking ticket)
       * sets together with @contended_lock, for the main
driver to drop
       * when it calls dma_resv_unlock on the contended_lock. */
      void (*drop_ref)(struct ww_mutex *contended_lock);
};

This is all supremely nasty (also ttm_bo_validate would need to be
improved to handle these sublocks and random new objects
that could force
a ww_mutex_lock_slow).
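
For context, the stock acquire/backoff loop that all of this would have to
extend looks roughly like the example in
Documentation/locking/ww-mutex-design.rst (struct obj and struct obj_entry
are placeholder types):

static int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
        struct obj *res_obj = NULL;
        struct obj_entry *contended_entry = NULL;
        struct obj_entry *entry;
        int ret;

        ww_acquire_init(ctx, &reservation_ww_class);

retry:
        list_for_each_entry(entry, list, head) {
                if (entry->obj == res_obj) {
                        res_obj = NULL; /* already locked via lock_slow below */
                        continue;
                }
                ret = ww_mutex_lock(&entry->obj->lock, ctx);
                if (ret < 0) {
                        contended_entry = entry;
                        goto err;
                }
        }
        ww_acquire_done(ctx);
        return 0;

err:
        list_for_each_entry_continue_reverse(entry, list, head)
                ww_mutex_unlock(&entry->obj->lock);
        if (res_obj)
                ww_mutex_unlock(&res_obj->lock);
        if (ret == -EDEADLK) {
                /* lost a stamp race: sleep on the contended lock, keep it
                 * locked, and retry the whole list */
                ww_mutex_lock_slow(&contended_entry->obj->lock, ctx);
                res_obj = contended_entry->obj;
                goto retry;
        }
        ww_acquire_fini(ctx);
        return ret;
}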


Just a short comment on this:

Neither the currently used wait-die nor the wound-wait algorithm

Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-21 Thread Daniel Vetter
On Thu, Feb 20, 2020 at 11:51:07PM +0100, Thomas Hellström (VMware) wrote:
> On 2/20/20 9:08 PM, Daniel Vetter wrote:
> > On Thu, Feb 20, 2020 at 08:46:27PM +0100, Thomas Hellström (VMware) wrote:
> > > On 2/20/20 7:04 PM, Daniel Vetter wrote:
> > > > On Thu, Feb 20, 2020 at 10:39:06AM +0100, Thomas Hellström (VMware) 
> > > > wrote:
> > > > > On 2/19/20 7:42 AM, Thomas Hellström (VMware) wrote:
> > > > > > On 2/18/20 10:01 PM, Daniel Vetter wrote:
> > > > > > > On Tue, Feb 18, 2020 at 9:17 PM Thomas Hellström (VMware)
> > > > > > >  wrote:
> > > > > > > > On 2/17/20 6:55 PM, Daniel Vetter wrote:
> > > > > > > > > On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König 
> > > > > > > > > wrote:
> > > > > > > > > > Implement the importer side of unpinned DMA-buf handling.
> > > > > > > > > > 
> > > > > > > > > > v2: update page tables immediately
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Christian König 
> > > > > > > > > > ---
> > > > > > > > > >      drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66
> > > > > > > > > > -
> > > > > > > > > >      drivers/gpu/drm/amd/amdgpu/amdgpu_object.c  |  6 ++
> > > > > > > > > >      2 files changed, 71 insertions(+), 1 deletion(-)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
> > > > > > > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
> > > > > > > > > > index 770baba621b3..48de7624d49c 100644
> > > > > > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
> > > > > > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
> > > > > > > > > > @@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct
> > > > > > > > > > drm_device *dev, struct dma_buf *dma_buf)
> > > > > > > > > >     return ERR_PTR(ret);
> > > > > > > > > >      }
> > > > > > > > > > 
> > > > > > > > > > +/**
> > > > > > > > > > + * amdgpu_dma_buf_move_notify - &attach.move_notify implementation
> > > > > > > > > > + *
> > > > > > > > > > + * @attach: the DMA-buf attachment
> > > > > > > > > > + *
> > > > > > > > > > + * Invalidate the DMA-buf attachment, making sure that we
> > > > > > > > > > + * re-create the mapping before the next use.
> > > > > > > > > > + */
> > > > > > > > > > +static void
> > > > > > > > > > +amdgpu_dma_buf_move_notify(struct dma_buf_attachment 
> > > > > > > > > > *attach)
> > > > > > > > > > +{
> > > > > > > > > > +    struct drm_gem_object *obj = attach->importer_priv;
> > > > > > > > > > +    struct ww_acquire_ctx *ticket = 
> > > > > > > > > > dma_resv_locking_ctx(obj->resv);
> > > > > > > > > > +    struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
> > > > > > > > > > +    struct amdgpu_device *adev = 
> > > > > > > > > > amdgpu_ttm_adev(bo->tbo.bdev);
> > > > > > > > > > +    struct ttm_operation_ctx ctx = { false, false };
> > > > > > > > > > +    struct ttm_placement placement = {};
> > > > > > > > > > +    struct amdgpu_vm_bo_base *bo_base;
> > > > > > > > > > +    int r;
> > > > > > > > > > +
> > > > > > > > > > +    if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM)
> > > > > > > > > > +    return;
> > > > > > > > > > +
> > > > > > > > > > +    r = ttm_bo_validate(&bo->tbo, &placement, &ctx);
> > > > > > > > > > +    if (r) {
> > > > > > > > > > +    DRM_ERROR("Failed to invalidate DMA-buf import (%d)\n", r);
> > > > > > > > > > +    return;
> > > > > > > > > > +    }
> > > > > > > > > > +
> > > > > > > > > > +    for (bo_base = bo->vm_bo; bo_base; bo_base = 
> > > > > > > > > > bo_base->next) {
> > > > > > > > > > +    struct amdgpu_vm *vm = bo_base->vm;
> > > > > > > > > > +    struct dma_resv *resv = 
> > > > > > > > > > vm->root.base.bo->tbo.base.resv;
> > > > > > > > > > +
> > > > > > > > > > +    if (ticket) {
> > > > > > > > > Yeah so this is kinda why I've been a total pain about the
> > > > > > > > > exact semantics
> > > > > > > > > of the move_notify hook. I think we should flat-out require
> > > > > > > > > that importers
> > > > > > > > > _always_ have a ticket attached when they call this, and that
> > > > > > > > > they can cope
> > > > > > > > > with additional locks being taken (i.e. full EDEADLK handling).
> > > > > > > > > 
> > > > > > > > > Simplest way to force that contract is to add a dummy 2nd
> > > > > > > > > ww_mutex lock to
> > > > > > > > > the dma_resv object, which we then can take #ifdef
> > > > > > > > > CONFIG_DEBUG_WW_MUTEX_SLOWPATH. Plus maybe a WARN_ON(!ticket).
> > > > > > > > > 
> > > > > > > > > Now the real disaster is how we handle deadlocks. Two issues:
> > > > > > > > > 
> > > > > > > > > - Ideally we'd keep any lock we've taken locked until the
> > > > > > > > > end, it helps avoid
> > > > > > > > >       needless backoffs. I've played around a bit with that
> > > > > > > > > but not even poc
> > > > > > > > >       level, just an idea:
> > > > > > > > > 
> > > > > > > > > 

Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-20 Thread VMware

On 2/20/20 9:08 PM, Daniel Vetter wrote:

On Thu, Feb 20, 2020 at 08:46:27PM +0100, Thomas Hellström (VMware) wrote:

On 2/20/20 7:04 PM, Daniel Vetter wrote:

On Thu, Feb 20, 2020 at 10:39:06AM +0100, Thomas Hellström (VMware) wrote:

On 2/19/20 7:42 AM, Thomas Hellström (VMware) wrote:

On 2/18/20 10:01 PM, Daniel Vetter wrote:

On Tue, Feb 18, 2020 at 9:17 PM Thomas Hellström (VMware)
 wrote:

On 2/17/20 6:55 PM, Daniel Vetter wrote:

On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:

Implement the importer side of unpinned DMA-buf handling.

v2: update page tables immediately

Signed-off-by: Christian König 
---
     drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66
-
     drivers/gpu/drm/amd/amdgpu/amdgpu_object.c  |  6 ++
     2 files changed, 71 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
index 770baba621b3..48de7624d49c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct
drm_device *dev, struct dma_buf *dma_buf)
    return ERR_PTR(ret);
     }

+/**
+ * amdgpu_dma_buf_move_notify - &attach.move_notify implementation
+ *
+ * @attach: the DMA-buf attachment
+ *
+ * Invalidate the DMA-buf attachment, making sure that we re-create the
+ * mapping before the next use.
+ */
+static void
+amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach)
+{
+    struct drm_gem_object *obj = attach->importer_priv;
+    struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv);
+    struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
+    struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
+    struct ttm_operation_ctx ctx = { false, false };
+    struct ttm_placement placement = {};
+    struct amdgpu_vm_bo_base *bo_base;
+    int r;
+
+    if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM)
+    return;
+
+    r = ttm_bo_validate(&bo->tbo, &placement, &ctx);
+    if (r) {
+    DRM_ERROR("Failed to invalidate DMA-buf import (%d)\n", r);
+    return;
+    }
+
+    for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) {
+    struct amdgpu_vm *vm = bo_base->vm;
+    struct dma_resv *resv = vm->root.base.bo->tbo.base.resv;
+
+    if (ticket) {

Yeah so this is kinda why I've been a total pain about the
exact semantics
of the move_notify hook. I think we should flat-out require
that importers
_always_ have a ticket attached when they call this, and that
they can cope
with additional locks being taken (i.e. full EDEADLK handling).

Simplest way to force that contract is to add a dummy 2nd
ww_mutex lock to
the dma_resv object, which we then can take #ifdef
CONFIG_DEBUG_WW_MUTEX_SLOWPATH. Plus maybe a WARN_ON(!ticket).

Now the real disaster is how we handle deadlocks. Two issues:

- Ideally we'd keep any lock we've taken locked until the
end, it helps avoid
      needless backoffs. I've played around a bit with that
but not even poc
      level, just an idea:

https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d27be37331255ab7582


      Idea is essentially to track a list of objects we had to
lock as part of
      the ttm_bo_validate of the main object.

- Second one is if we get an EDEADLK on one of these
sublocks (like the
      one here). We need to pass that up the entire callchain,
including a
      temporary reference (we have to drop locks to do the
ww_mutex_lock_slow
      call), and need a custom callback to drop that temporary reference
      (since that's all driver specific, might even be
internal ww_mutex and
      not anything remotely looking like a normal dma_buf).
This probably
      needs the exec util helpers from ttm, but at the
dma_resv level, so that
      we can do something like this:

struct dma_resv_ticket {
     struct ww_acquire_ctx base;

     /* can be set by anyone (including other drivers)
that got hold of
      * this ticket and had to acquire some new lock. This
lock might
      * protect anything, including driver-internal stuff, and isn't
      * required to be a dma_buf or even just a dma_resv. */
     struct ww_mutex *contended_lock;

     /* callback which the driver (which might be a dma-buf exporter
      * and not matching the driver that started this
locking ticket)
      * sets together with @contended_lock, for the main
driver to drop
      * when it calls dma_resv_unlock on the contended_lock. */
     void (*drop_ref)(struct ww_mutex *contended_lock);
};

This is all supremely nasty (also ttm_bo_validate would need to be
improved to handle these sublocks and random new objects
that could force
a ww_mutex_lock_slow).


Just a short comment on this:

Neither the currently used wait-die nor the wound-wait algorithm
*strictly* requires a slow lock on the contended lock. For
wait-die it's
just very convenient since it makes us sleep instead of spinning with

Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-20 Thread Daniel Vetter
On Thu, Feb 20, 2020 at 08:46:27PM +0100, Thomas Hellström (VMware) wrote:
> On 2/20/20 7:04 PM, Daniel Vetter wrote:
> > On Thu, Feb 20, 2020 at 10:39:06AM +0100, Thomas Hellström (VMware) wrote:
> > > On 2/19/20 7:42 AM, Thomas Hellström (VMware) wrote:
> > > > On 2/18/20 10:01 PM, Daniel Vetter wrote:
> > > > > On Tue, Feb 18, 2020 at 9:17 PM Thomas Hellström (VMware)
> > > > >  wrote:
> > > > > > On 2/17/20 6:55 PM, Daniel Vetter wrote:
> > > > > > > On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:
> > > > > > > > Implement the importer side of unpinned DMA-buf handling.
> > > > > > > > 
> > > > > > > > v2: update page tables immediately
> > > > > > > > 
> > > > > > > > Signed-off-by: Christian König 
> > > > > > > > ---
> > > > > > > >     drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66
> > > > > > > > -
> > > > > > > >     drivers/gpu/drm/amd/amdgpu/amdgpu_object.c  |  6 ++
> > > > > > > >     2 files changed, 71 insertions(+), 1 deletion(-)
> > > > > > > > 
> > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
> > > > > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
> > > > > > > > index 770baba621b3..48de7624d49c 100644
> > > > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
> > > > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
> > > > > > > > @@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct
> > > > > > > > drm_device *dev, struct dma_buf *dma_buf)
> > > > > > > >    return ERR_PTR(ret);
> > > > > > > >     }
> > > > > > > > 
> > > > > > > > +/**
> > > > > > > > + * amdgpu_dma_buf_move_notify - &attach.move_notify implementation
> > > > > > > > + *
> > > > > > > > + * @attach: the DMA-buf attachment
> > > > > > > > + *
> > > > > > > > + * Invalidate the DMA-buf attachment, making sure that we
> > > > > > > > + * re-create the mapping before the next use.
> > > > > > > > + */
> > > > > > > > +static void
> > > > > > > > +amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach)
> > > > > > > > +{
> > > > > > > > +    struct drm_gem_object *obj = attach->importer_priv;
> > > > > > > > +    struct ww_acquire_ctx *ticket = 
> > > > > > > > dma_resv_locking_ctx(obj->resv);
> > > > > > > > +    struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
> > > > > > > > +    struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
> > > > > > > > +    struct ttm_operation_ctx ctx = { false, false };
> > > > > > > > +    struct ttm_placement placement = {};
> > > > > > > > +    struct amdgpu_vm_bo_base *bo_base;
> > > > > > > > +    int r;
> > > > > > > > +
> > > > > > > > +    if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM)
> > > > > > > > +    return;
> > > > > > > > +
> > > > > > > > +    r = ttm_bo_validate(&bo->tbo, &placement, &ctx);
> > > > > > > > +    if (r) {
> > > > > > > > +    DRM_ERROR("Failed to invalidate DMA-buf import (%d)\n", r);
> > > > > > > > +    return;
> > > > > > > > +    }
> > > > > > > > +
> > > > > > > > +    for (bo_base = bo->vm_bo; bo_base; bo_base = 
> > > > > > > > bo_base->next) {
> > > > > > > > +    struct amdgpu_vm *vm = bo_base->vm;
> > > > > > > > +    struct dma_resv *resv = 
> > > > > > > > vm->root.base.bo->tbo.base.resv;
> > > > > > > > +
> > > > > > > > +    if (ticket) {
> > > > > > > Yeah so this is kinda why I've been a total pain about the
> > > > > > > exact semantics
> > > > > > > of the move_notify hook. I think we should flat-out require
> > > > > > > that importers
> > > > > > > _always_ have a ticket attached when they call this, and that
> > > > > > > they can cope
> > > > > > > with additional locks being taken (i.e. full EDEADLK handling).
> > > > > > > 
> > > > > > > Simplest way to force that contract is to add a dummy 2nd
> > > > > > > ww_mutex lock to
> > > > > > > the dma_resv object, which we then can take #ifdef
> > > > > > > CONFIG_DEBUG_WW_MUTEX_SLOWPATH. Plus maybe a WARN_ON(!ticket).
> > > > > > > 
> > > > > > > Now the real disaster is how we handle deadlocks. Two issues:
> > > > > > > 
> > > > > > > - Ideally we'd keep any lock we've taken locked until the
> > > > > > > end, it helps avoid
> > > > > > >      needless backoffs. I've played around a bit with that
> > > > > > > but not even poc
> > > > > > >      level, just an idea:
> > > > > > > 
> > > > > > > https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d27be37331255ab7582
> > > > > > > 
> > > > > > > 
> > > > > > >      Idea is essentially to track a list of objects we had to
> > > > > > > lock as part of
> > > > > > >      the ttm_bo_validate of the main object.
> > > > > > > 
> > > > > > > - Second one is if we get an EDEADLK on one of these
> > > > > > > sublocks (like the
> > > > > > >      one here). We need to pass that up the entire callchain,
> > > > > > > including a
> > > > > > >      temporary reference (we have to drop locks to do the
> > > > > > > ww_mutex_lock_slow
> > > > > > >      call), and need a 

Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-20 Thread VMware

On 2/20/20 7:04 PM, Daniel Vetter wrote:

On Thu, Feb 20, 2020 at 10:39:06AM +0100, Thomas Hellström (VMware) wrote:

On 2/19/20 7:42 AM, Thomas Hellström (VMware) wrote:

On 2/18/20 10:01 PM, Daniel Vetter wrote:

On Tue, Feb 18, 2020 at 9:17 PM Thomas Hellström (VMware)
 wrote:

On 2/17/20 6:55 PM, Daniel Vetter wrote:

On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:

Implement the importer side of unpinned DMA-buf handling.

v2: update page tables immediately

Signed-off-by: Christian König 
---
    drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66
-
    drivers/gpu/drm/amd/amdgpu/amdgpu_object.c  |  6 ++
    2 files changed, 71 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
index 770baba621b3..48de7624d49c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct
drm_device *dev, struct dma_buf *dma_buf)
   return ERR_PTR(ret);
    }

+/**
+ * amdgpu_dma_buf_move_notify - &attach.move_notify implementation
+ *
+ * @attach: the DMA-buf attachment
+ *
+ * Invalidate the DMA-buf attachment, making sure that we re-create the
+ * mapping before the next use.
+ */
+static void
+amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach)
+{
+    struct drm_gem_object *obj = attach->importer_priv;
+    struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv);
+    struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
+    struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
+    struct ttm_operation_ctx ctx = { false, false };
+    struct ttm_placement placement = {};
+    struct amdgpu_vm_bo_base *bo_base;
+    int r;
+
+    if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM)
+    return;
+
+    r = ttm_bo_validate(&bo->tbo, &placement, &ctx);
+    if (r) {
+    DRM_ERROR("Failed to invalidate DMA-buf import (%d)\n", r);
+    return;
+    }
+
+    for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) {
+    struct amdgpu_vm *vm = bo_base->vm;
+    struct dma_resv *resv = vm->root.base.bo->tbo.base.resv;
+
+    if (ticket) {

Yeah so this is kinda why I've been a total pain about the
exact semantics
of the move_notify hook. I think we should flat-out require
that importers
_always_ have a ticket attached when they call this, and that
they can cope
with additional locks being taken (i.e. full EDEADLK handling).

Simplest way to force that contract is to add a dummy 2nd
ww_mutex lock to
the dma_resv object, which we then can take #ifdef
CONFIG_DEBUG_WW_MUTEX_SLOWPATH. Plus maybe a WARN_ON(!ticket).

Now the real disaster is how we handle deadlocks. Two issues:

- Ideally we'd keep any lock we've taken locked until the
end, it helps avoid
     needless backoffs. I've played around a bit with that
but not even poc
     level, just an idea:

https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d27be37331255ab7582


     Idea is essentially to track a list of objects we had to
lock as part of
     the ttm_bo_validate of the main object.

- Second one is if we get an EDEADLK on one of these
sublocks (like the
     one here). We need to pass that up the entire callchain,
including a
     temporary reference (we have to drop locks to do the
ww_mutex_lock_slow
     call), and need a custom callback to drop that temporary reference
     (since that's all driver specific, might even be
internal ww_mutex and
     not anything remotely looking like a normal dma_buf).
This probably
     needs the exec util helpers from ttm, but at the
dma_resv level, so that
     we can do something like this:

struct dma_resv_ticket {
    struct ww_acquire_ctx base;

    /* can be set by anyone (including other drivers)
that got hold of
     * this ticket and had to acquire some new lock. This
lock might
     * protect anything, including driver-internal stuff, and isn't
     * required to be a dma_buf or even just a dma_resv. */
    struct ww_mutex *contended_lock;

    /* callback which the driver (which might be a dma-buf exporter
     * and not matching the driver that started this
locking ticket)
     * sets together with @contended_lock, for the main
driver to drop
     * when it calls dma_resv_unlock on the contended_lock. */
    void (drop_ref*)(struct ww_mutex *contended_lock);
};

This is all supremely nasty (also ttm_bo_validate would need to be
improved to handle these sublocks and random new objects
that could force
a ww_mutex_lock_slow).


Just a short comment on this:

Neither the currently used wait-die or the wound-wait algorithm
*strictly* requires a slow lock on the contended lock. For
wait-die it's
just very convenient since it makes us sleep instead of spinning with
-EDEADLK on the contended lock. For wound-wait IIRC one could just
immediately restart the whole locking transaction after an
-EDEADLK, and
the 

Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-20 Thread Daniel Vetter
On Thu, Feb 20, 2020 at 10:39:06AM +0100, Thomas Hellström (VMware) wrote:
> On 2/19/20 7:42 AM, Thomas Hellström (VMware) wrote:
> > On 2/18/20 10:01 PM, Daniel Vetter wrote:
> > > On Tue, Feb 18, 2020 at 9:17 PM Thomas Hellström (VMware)
> > >  wrote:
> > > > On 2/17/20 6:55 PM, Daniel Vetter wrote:
> > > > > On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:
> > > > > > [SNIP]

Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-20 Thread VMware

On 2/19/20 7:42 AM, Thomas Hellström (VMware) wrote:

On 2/18/20 10:01 PM, Daniel Vetter wrote:

On Tue, Feb 18, 2020 at 9:17 PM Thomas Hellström (VMware)
 wrote:

On 2/17/20 6:55 PM, Daniel Vetter wrote:

On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:

[SNIP]

Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-18 Thread VMware

On 2/18/20 10:01 PM, Daniel Vetter wrote:

On Tue, Feb 18, 2020 at 9:17 PM Thomas Hellström (VMware)
 wrote:

On 2/17/20 6:55 PM, Daniel Vetter wrote:

On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:

[SNIP]

Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-18 Thread Daniel Vetter
On Tue, Feb 18, 2020 at 9:17 PM Thomas Hellström (VMware)
 wrote:
>
> On 2/17/20 6:55 PM, Daniel Vetter wrote:
> > On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:
> >> [SNIP]

Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-18 Thread VMware

On 2/17/20 6:55 PM, Daniel Vetter wrote:

On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:

[SNIP]


Just a short comment on this:

Neither the currently used wait-die nor the wound-wait algorithm
*strictly* requires a slow lock on the contended lock. For wait-die it's
just very convenient since it makes us sleep instead of spinning with
-EDEADLK on the contended lock. For wound-wait IIRC one could just
immediately restart the whole locking transaction after an -EDEADLK, and
the transaction would automatically end up waiting on the contended
lock, provided mutex lock stealing is not allowed. There is however
a possibility that the transaction will be wounded again on another
lock, taken before the contended lock, 
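
(For reference, the canonical wait-die pattern with the slow-path lock
looks roughly like this for two mutexes of one class; purely
illustrative, not the dma_resv code under discussion:)

#include <linux/ww_mutex.h>

static DEFINE_WW_CLASS(demo_ww_class);

static void lock_two(struct ww_mutex *m1, struct ww_mutex *m2)
{
	struct ww_acquire_ctx ctx;

	ww_acquire_init(&ctx, &demo_ww_class);

	/* A context that holds nothing yet cannot deadlock, so the
	 * first lock never returns -EDEADLK. */
	ww_mutex_lock(m1, &ctx);

	if (ww_mutex_lock(m2, &ctx) == -EDEADLK) {
		/* We are the younger transaction and must back off.
		 * Sleeping on the contended lock is the wait-die
		 * convenience mentioned above; a bare retry would spin. */
		ww_mutex_unlock(m1);
		ww_mutex_lock_slow(m2, &ctx);
		/* m2 is now held; retaking m1 can die again, so real
		 * code loops here, elided to keep the sketch short. */
		ww_mutex_lock(m1, &ctx);
	}
	ww_acquire_done(&ctx);

	/* ... critical section ... */

	ww_mutex_unlock(m2);
	ww_mutex_unlock(m1);
	ww_acquire_fini(&ctx);
}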

Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-18 Thread Christian König

On 17.02.20 at 18:55, Daniel Vetter wrote:

On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:

[SNIP]

Yeah so this is kinda why I've been a total pain about the exact semantics
of the move_notify hook. I think we should flat-out require that importers
_always_ have a ticket attached when they call this, and that they can cope
with additional locks being taken (i.e. full EDEADLK handling).


That is pretty much exactly my thinking as well.

And it is also the sole reason why I started looking into the ww_mutex 
cursor handling a while back (e.g. the initial version with the horrible 
macro hack).


But this is really really hard to get right. So my thinking for now is 
to push this series upstream to at least unblock my ongoing P2P work.



[SNIP]


My initial thinking was to make all of this part of the core ww_mutex 
implementation, but then I quickly found that this won't work.



This is all supremely nasty (also ttm_bo_validate would need to be
improved to handle these sublocks and random new objects that could force
a ww_mutex_lock_slow).


The next idea was to have it based on dma_resv objects, but as you also 
figured out you then need to drop the reference to the contended 

[Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-18 Thread Christian König
Implement the importer side of unpinned DMA-buf handling.

v2: update page tables immediately

Signed-off-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c  |  6 ++
 2 files changed, 71 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
index 770baba621b3..48de7624d49c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct drm_device *dev, struct 
dma_buf *dma_buf)
return ERR_PTR(ret);
 }
 
+/**
+ * amdgpu_dma_buf_move_notify - &dma_buf_attach_ops.move_notify implementation
+ *
+ * @attach: the DMA-buf attachment
+ *
+ * Invalidate the DMA-buf attachment, making sure that we re-create the
+ * mapping before the next use.
+ */
+static void
+amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach)
+{
+   struct drm_gem_object *obj = attach->importer_priv;
+   struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv);
+   struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
+   struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
+   struct ttm_operation_ctx ctx = { false, false };
+   struct ttm_placement placement = {};
+   struct amdgpu_vm_bo_base *bo_base;
+   int r;
+
+   if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM)
+   return;
+
+   r = ttm_bo_validate(&bo->tbo, &placement, &ctx);
+   if (r) {
+   DRM_ERROR("Failed to invalidate DMA-buf import (%d)\n", r);
+   return;
+   }
+
+   for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) {
+   struct amdgpu_vm *vm = bo_base->vm;
+   struct dma_resv *resv = vm->root.base.bo->tbo.base.resv;
+
+   if (ticket) {
+   /* When we get an error here it means that somebody
+* else is holding the VM lock and updating page tables,
+* so we can just continue here.
+*/
+   r = dma_resv_lock(resv, ticket);
+   if (r)
+   continue;
+
+   } else {
+   /* TODO: This is more problematic and we actually need
+* to allow page table updates without holding the
+* lock.
+*/
+   if (!dma_resv_trylock(resv))
+   continue;
+   }
+
+   r = amdgpu_vm_clear_freed(adev, vm, NULL);
+   if (!r)
+   r = amdgpu_vm_handle_moved(adev, vm);
+
+   if (r && r != -EBUSY)
+   DRM_ERROR("Failed to invalidate VM page tables (%d))\n",
+ r);
+
+   dma_resv_unlock(resv);
+   }
+}
+
 static const struct dma_buf_attach_ops amdgpu_dma_buf_attach_ops = {
+   .move_notify = amdgpu_dma_buf_move_notify
 };
 
 /**
@@ -489,7 +553,7 @@ struct drm_gem_object *amdgpu_gem_prime_import(struct 
drm_device *dev,
return obj;
 
attach = dma_buf_dynamic_attach(dma_buf, dev->dev,
-   &amdgpu_dma_buf_attach_ops, NULL);
+   &amdgpu_dma_buf_attach_ops, obj);
if (IS_ERR(attach)) {
drm_gem_object_put(obj);
return ERR_CAST(attach);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 8ae260822908..8c480c898b0d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -926,6 +926,9 @@ int amdgpu_bo_pin_restricted(struct amdgpu_bo *bo, u32 
domain,
return 0;
}
 
+   if (bo->tbo.base.import_attach)
+   dma_buf_pin(bo->tbo.base.import_attach);
+
bo->flags |= AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS;
/* force to pin into visible video ram */
if (!(bo->flags & AMDGPU_GEM_CREATE_NO_CPU_ACCESS))
@@ -1009,6 +1012,9 @@ int amdgpu_bo_unpin(struct amdgpu_bo *bo)
 
amdgpu_bo_subtract_pin_size(bo);
 
+   if (bo->tbo.base.import_attach)
+   dma_buf_unpin(bo->tbo.base.import_attach);
+
for (i = 0; i < bo->placement.num_placement; i++) {
bo->placements[i].lpfn = 0;
bo->placements[i].flags &= ~TTM_PL_FLAG_NO_EVICT;
-- 
2.17.1

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
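
For context, the exporter side that ends up calling the hook added above
is roughly the following, condensed from the dynamic DMA-buf work; the
exporter already holds the buffer's dma_resv lock here, which is where
dma_resv_locking_ctx() in the importer gets its ticket:

/* Condensed sketch of dma_buf_move_notify(); see dma-buf.c for the
 * real thing. Called by the exporter with dmabuf->resv held. */
void dma_buf_move_notify(struct dma_buf *dmabuf)
{
	struct dma_buf_attachment *attach;

	dma_resv_assert_held(dmabuf->resv);

	list_for_each_entry(attach, &dmabuf->attachments, node)
		if (attach->importer_ops)
			attach->importer_ops->move_notify(attach);
}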


Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-17 Thread Daniel Vetter
On Mon, Feb 17, 2020 at 7:58 PM Christian König
 wrote:
>
> On 17.02.20 at 18:55, Daniel Vetter wrote:
> > On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:
> >> [SNIP]
> > Yeah so this is kinda why I've been a total pain about the exact semantics
> > of the move_notify hook. I think we should flat-out require that importers
> > _always_ have a ticket attached when they call this, and that they can cope
> > with additional locks being taken (i.e. full EDEADLK handling).
>
> That is pretty much exactly my thinking as well.
>
> And it is also the sole reason why I started looking into the ww_mutex
> cursor handling a while back (e.g. the initial version with the horrible
> macro hack).
>
> But this is really really hard to get right. So my thinking for now is
> to push this series upstream to at least unblock my ongoing P2P work.

Hm, but at least the move_notify stuff and the locking nightmare
around it feel rushed if we just push that. Otoh it's indeed
getting painful, and we'll probably have another few rounds of
headaches to sort this all out. What about a

config EXPERIMENTAL_DYNAMIC_DMA_BUF
	depends on BROKEN

wrapped around the new ->move_notify hook, plus all relevant code?
That way you can land at least something, and in-tree refactoring might
become easier with at least some example of what it needs to achieve.
But we're also not tricking anyone into believing that this is
production ready.
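
Spelled out, such a gate could look like the following; the option name
is from the suggestion above, the rest is a sketch:

config EXPERIMENTAL_DYNAMIC_DMA_BUF
	bool "Experimental dynamic DMA-buf importer support"
	depends on BROKEN
	help
	  Guards the new ->move_notify importer path while the locking
	  contract around it is still being settled. Not for production
	  use.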

> > [SNIP]

Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-17 Thread Daniel Vetter
On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:
> Implement the importer side of unpinned DMA-buf handling.
> 
> v2: update page tables immediately
> 
> Signed-off-by: Christian König 
> ---
> [SNIP]
> +
> + for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) {
> + struct amdgpu_vm *vm = bo_base->vm;
> + struct dma_resv *resv = vm->root.base.bo->tbo.base.resv;
> +
> + if (ticket) {

Yeah so this is kinda why I've been a total pain about the exact semantics
of the move_notify hook. I think we should flat-out require that importers
_always_ have a ticket attached when they call this, and that they can cope
with additional locks being taken (i.e. full EDEADLK handling).

Simplest way to force that contract is to add a dummy 2nd ww_mutex lock to
the dma_resv object, which we then can take #ifdef
CONFIG_DEBUG_WW_MUTEX_SLOWPATH. Plus maybe a WARN_ON(!ticket).
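
(For illustration, a minimal sketch of what that contract check could
look like; the dummy_lock field and the helper name are made up here,
not existing dma_resv API:)

/* Hypothetical helper the exporter would run before invoking
 * ->move_notify; dma_resv has no dummy_lock field today. */
static int dma_resv_check_move_notify_contract(struct dma_resv *obj)
{
	struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj);
	int ret = 0;

	WARN_ON(!ticket);

#ifdef CONFIG_DEBUG_WW_MUTEX_SLOWPATH
	/* Taking a second lock with the caller's ticket lets the debug
	 * slowpath inject -EDEADLK, so importers that lack the full
	 * backoff path get caught in testing. */
	if (ticket) {
		ret = ww_mutex_lock(&obj->dummy_lock, ticket);
		if (!ret)
			ww_mutex_unlock(&obj->dummy_lock);
	}
#endif
	return ret;	/* a -EDEADLK here must be propagated by the caller */
}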

Now the real disaster is how we handle deadlocks. Two issues:

- Ideally we'd keep any lock we've taken locked until the end; it helps
  avoid needless backoffs. I've played around a bit with that but not even PoC
  level, just an idea:

https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d27be37331255ab7582

  Idea is essentially to track a list of objects we had to lock as part of
  the ttm_bo_validate of the main object.

- Second one is if we get an EDEADLK on one of these sublocks (like the
  one here). We need to pass that up the entire callchain, including a
  temporary reference (we have to drop locks to do the ww_mutex_lock_slow
  call), and need a custom callback to drop that temporary reference
  (since that's all driver specific, might even be internal ww_mutex and
  not anything remotely looking like a normal dma_buf). This probably
  needs the exec util helpers from ttm, but at the dma_resv level, so that
  we can do something like this:

struct dma_resv_ticket {
struct ww_acquire_ctx base;

/* can be set by anyone (including other drivers) that got hold of
 * this ticket and had to acquire some new lock. This lock might
 * protect anything, including driver-internal stuff, and isn't
 * required to be a dma_buf or even just a dma_resv. */
struct ww_mutex *contended_lock;

/* callback which the driver (which might be a dma-buf exporter
 * and not matching the driver that started this locking ticket)
 * sets together with @contended_lock, for the main driver to drop
 * when it calls dma_resv_unlock on the contended_lock. */
void (*drop_ref)(struct ww_mutex *contended_lock);
};

This is all supremely nasty (also ttm_bo_validate would need to be
improved to handle these sublocks and random new objects that could force
a ww_mutex_lock_slow).
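
To make the intended use concrete, the retry loop a driver could wrap
around its validation might look like this. A sketch only:
dma_resv_ticket is the hypothetical struct above, and
do_validate_and_submit() is a stand-in for the driver's real work,
which may lock further objects and fill in contended_lock/drop_ref:

int submit_with_backoff(struct dma_resv *main_resv)
{
	struct dma_resv_ticket ticket;
	int ret;

	ww_acquire_init(&ticket.base, &reservation_ww_class);
retry:
	ticket.contended_lock = NULL;
	ret = dma_resv_lock(main_resv, &ticket.base);
	if (!ret) {
		ret = do_validate_and_submit(main_resv, &ticket);
		dma_resv_unlock(main_resv);
	}
	if (ret == -EDEADLK) {
		if (ticket.contended_lock) {
			/* Sleep on whichever lock lost us the race, no
			 * matter which driver it belongs to, then let
			 * that driver drop its temporary reference. */
			ww_mutex_lock_slow(ticket.contended_lock,
					   &ticket.base);
			ww_mutex_unlock(ticket.contended_lock);
			ticket.drop_ref(ticket.contended_lock);
		}
		/* A real version would keep the slow-locked mutex held
		 * across the retry, and would slow-lock main_resv itself
		 * when that was the contended one. */
		goto retry;
	}
	ww_acquire_fini(&ticket.base);
	return ret;
}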

Plan B would be to throw our hands in the air and declare that "move_notify is
best effort only and can fail for any reason". Exactly like ttm eviction
currently does, even with all your hacks to do at least some dma_resv_lock
(but not the full slowpath).

Given how much "fun" you have with all the low memory handling and ttm
fundamentally being best-effort only (even though dma_resv would allow us
to do this right, with some work), I'm not sure that's a good idea to
extend to a cross-driver interface. Personally I'd lean towards fixing
this first fully (in ttm/amdgpu), and then using that to implement
move_notify correctly.