Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-24 Thread VMware

On 2/24/20 7:46 PM, Christian König wrote:

Am 23.02.20 um 17:54 schrieb Thomas Hellström (VMware):

On 2/23/20 4:45 PM, Christian König wrote:

Am 21.02.20 um 18:12 schrieb Daniel Vetter:

[SNIP]
Yeah the Great Plan (tm) is to fully rely on ww_mutex slowly degenerating
into essentially a global lock. But only when there's actual contention
and thrashing.


Yes exactly. A really big problem in TTM is currently that we drop 
the lock after evicting BOs because they tend to move in again 
directly after that.


From practice I can also confirm that there is exactly zero benefit 
from dropping locks early and reacquiring them, for example for the VM 
page tables. That just makes it more likely that somebody needs to 
roll back, and that is what we need to avoid in the first place.


If you have a benchmarking setup available it would be very 
interesting for future reference to see how changing from WD to WW 
mutexes affects the roll back frequency. WW is known to cause 
rollbacks much less frequently but there is more work associated with 
each rollback.


Not off hand. To be honest I still have a hard time getting a grip on 
the difference between WD and WW from the algorithm point of view. So 
I can't judge that difference at all.


OK. I don't think a detailed understanding of the algorithms is strictly 
necessary, though. If we had a good testcase we'd just have to 
change DEFINE_WD_CLASS in dma-resv.c to DEFINE_WW_CLASS and benchmark 
the relevant figures.
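For reference, the change being discussed amounts to a one-line switch in
drivers/dma-buf/dma-resv.c (a sketch; the surrounding lines are assumed from
the mainline file of that time):

--- a/drivers/dma-buf/dma-resv.c
+++ b/drivers/dma-buf/dma-resv.c
-DEFINE_WD_CLASS(reservation_ww_class);
+DEFINE_WW_CLASS(reservation_ww_class);
 EXPORT_SYMBOL(reservation_ww_class);

Both macros come from include/linux/ww_mutex.h and the rest of the ww_mutex
API is unchanged, which is what would make the comparison cheap to benchmark.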





Contention on BO locks during command submission is perfectly fine 
as long as this is as lightweight as possible while we don't have 
thrashing. When we have thrashing, multi-submission performance is best 
achieved by just favoring a single process to finish its business and 
blocking everybody else.


Hmm. Sounds like we need a per-manager ww_rwsem protecting manager 
allocation, taken in write mode when there's thrashing and in read mode 
otherwise. That would limit the number of "unnecessary" locks we'd 
have to keep and reduce unwanted side-effects (see below):


Well per-manager (you mean per domain here don't you?)

yes.

doesn't sound that useful because we rarely use only one domain,


Well, the difference compared to keeping all locks would boil down to: 
"Instead of keeping the locks of all bos we evict from thrashing 
domains, keep the locks of all thrashing domains in write mode." This means 
that for domains that are not thrashing, we'd just keep read locks.
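A purely hypothetical sketch of that idea, with a plain rw_semaphore standing
in for the "ww_rwsem" being discussed (no such per-domain lock exists in TTM;
all names below are made up):

#include <linux/rwsem.h>

/* Hypothetical: one of these per memory domain (manager). */
struct domain_thrash_lock {
	struct rw_semaphore sem;	/* stand-in for a wait/wound-aware rwsem */
	bool thrashing;			/* set by the eviction path under pressure */
};

/* Command-submission path: shared access normally, exclusive while thrashing. */
static void domain_lock_for_cs(struct domain_thrash_lock *d)
{
	if (READ_ONCE(d->thrashing))
		down_write(&d->sem);	/* serialize submitters, favor one process */
	else
		down_read(&d->sem);	/* cheap shared path when memory is plentiful */
}

The per-BO dma_resv locks would still be taken as today; the point is only that
locks of BOs evicted from a non-thrashing domain would not need to be kept for
the rest of the transaction.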



but I've actually been questioning for quite a while whether the per-BO 
lock scheme was the right approach.


See, from the performance aspect the closest-to-ideal solution I can 
think of would be a ww_rwsem per user of a resource.


In other words we don't lock BOs, but instead a list of all their 
users and when you want to evict a BO you need to walk that list and 
inform all users that the BO will be moving.


During command submission you then have the fast path, which just 
grabs the read side of the user lock and checks if all BOs are 
still in the expected place.


If some BOs were evicted you back off and start the slow path, e.g. 
maybe even copy additional data from userspace then grab the write 
side of the lock etc.. etc...


That approach is similar to what we use in amdgpu with the per-VM BOs, 
but goes a step further. Problem is that we are so used to per BO 
locks in the kernel that this is probably not doable any more.


I think we need some sort of per-bo lock to protect bo metadata. But 
yes, relying solely on them to resolve other resource (domain) 
contention may not be (or rather probably isn't) the right choice.
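As a thought experiment only, the fast/slow path sketched above might look
roughly like the following (nothing like it exists in the kernel; every name
below is invented):

/* Hypothetical per-user lock; BOs are not locked individually on the fast path. */
struct gpu_user_ctx {
	struct rw_semaphore lock;
	/* list of BOs this user references, with their expected placements */
};

static int submit_fast_path(struct gpu_user_ctx *user)
{
	down_read(&user->lock);
	if (!all_bos_in_expected_place(user)) {	/* invented helper */
		up_read(&user->lock);
		return -EAGAIN;		/* somebody moved a BO: take the slow path */
	}
	/* ... queue the job while holding the read side ... */
	up_read(&user->lock);
	return 0;
}

static int submit_slow_path(struct gpu_user_ctx *user)
{
	/* possibly re-copy data from userspace first, then ... */
	down_write(&user->lock);
	/* ... revalidate / move BOs back and update the expected placements ... */
	up_write(&user->lock);
	return 0;
}

An evicting thread would take the write side of every affected user's lock,
which is what "walk that list and inform all users" amounts to.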




Because of this I would actually vote for forbidding the release of 
individual ww_mutex() locks within a context.


Yes, I see the problem.

But my first reaction is that this might have undesirable 
side-effects. Let's say somebody wanted to swap the evicted BOs out?


Please explain further; offhand I don't see the problem here.


Let's say thread A evicts a lot of thread B's bos and keeps the locks of 
those bos for a prolonged time. Then thread C needs memory and wants to 
swap out thread B's bos. It can't, or at least not without a certain 
delay, because thread A unnecessarily holds the locks.




In general I actually wanted to re-work TTM in a way that BOs in the 
SYSTEM/SWAPABLE domain are always backed by a shmem file instead of 
the struct page array we currently have.


That would probably work well if there are no SYSTEM+write-combined 
users anymore. Typically in the old AGP days, you wouldn't change 
caching mode when evicting from write-combine AGP to SYSTEM because of 
the dead slow wbinvd() operation.


Or cpu-writes to them causing faults, that might also block the 
mm_sem, which in turn blocks hugepaged?


Mhm, I also only have a high-level view of how hugepaged works, so why 
does it grab the mm_sem on the write side?


If I understand it correctly, it's needed when collapsing PMD 
directories to huge PMD pages. But this w

Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-23 Thread VMware

On 2/23/20 4:45 PM, Christian König wrote:

Am 21.02.20 um 18:12 schrieb Daniel Vetter:

[SNIP]
Yeah the Great Plan (tm) is to fully rely on ww_mutex slowly degenerating
into essentially a global lock. But only when there's actual contention
and thrashing.


Yes exactly. A really big problem in TTM is currently that we drop the 
lock after evicting BOs because they tend to move in again directly 
after that.


From practice I can also confirm that there is exactly zero benefit 
from dropping locks early and reacquiring them, for example for the VM 
page tables. That just makes it more likely that somebody needs to 
roll back, and that is what we need to avoid in the first place.



If you have a benchmarking setup available it would be very interesting 
for future reference to see how changing from WD to WW mutexes affects 
the roll back frequency. WW is known to cause rollbacks much less 
frequently but there is more work associated with each rollback.




Contention on BO locks during command submission is perfectly fine as 
long as this is as lightweight as possible while we don't have 
thrashing. When we have thrashing, multi-submission performance is best 
achieved by just favoring a single process to finish its business and 
blocking everybody else.


Hmm. Sounds like we need a per-manager ww_rwsem protecting manager 
allocation, taken in write mode when there's thrashing and in read mode 
otherwise. That would limit the number of "unnecessary" locks we'd have 
to keep and reduce unwanted side-effects (see below):




Because of this I would actually vote for forbidding the release of 
individual ww_mutex() locks within a context.


Yes, I see the problem.

But my first reaction is that this might have undesirable side-effects. 
Let's say somebody wanted to swap the evicted BOs out? Or cpu-writes to 
them causing faults, that might also block the mm_sem, which in turn 
blocks hugepaged?


Still it's a fairly simple solution to a problem that seems otherwise 
hard to solve efficiently.


Thanks,
Thomas




Regards,
Christian.


-Daniel





Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-21 Thread VMware

On 2/21/20 6:12 PM, Daniel Vetter wrote:

On Thu, Feb 20, 2020 at 11:51:07PM +0100, Thomas Hellström (VMware) wrote:

On 2/20/20 9:08 PM, Daniel Vetter wrote:

On Thu, Feb 20, 2020 at 08:46:27PM +0100, Thomas Hellström (VMware) wrote:

On 2/20/20 7:04 PM, Daniel Vetter wrote:

On Thu, Feb 20, 2020 at 10:39:06AM +0100, Thomas Hellström (VMware) wrote:

On 2/19/20 7:42 AM, Thomas Hellström (VMware) wrote:

On 2/18/20 10:01 PM, Daniel Vetter wrote:

On Tue, Feb 18, 2020 at 9:17 PM Thomas Hellström (VMware)
 wrote:

On 2/17/20 6:55 PM, Daniel Vetter wrote:

On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:

Implement the importer side of unpinned DMA-buf handling.

v2: update page tables immediately

Signed-off-by: Christian König 
---
      drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66
-
      drivers/gpu/drm/amd/amdgpu/amdgpu_object.c  |  6 ++
      2 files changed, 71 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
index 770baba621b3..48de7624d49c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct
drm_device *dev, struct dma_buf *dma_buf)
     return ERR_PTR(ret);
      }

+/**
+ * amdgpu_dma_buf_move_notify - _notify implementation
+ *
+ * @attach: the DMA-buf attachment
+ *
+ * Invalidate the DMA-buf attachment, making sure that
the we re-create the
+ * mapping before the next use.
+ */
+static void
+amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach)
+{
+    struct drm_gem_object *obj = attach->importer_priv;
+    struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv);
+    struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
+    struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
+    struct ttm_operation_ctx ctx = { false, false };
+    struct ttm_placement placement = {};
+    struct amdgpu_vm_bo_base *bo_base;
+    int r;
+
+    if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM)
+    return;
+
+    r = ttm_bo_validate(&bo->tbo, &placement, &ctx);
+    if (r) {
+    DRM_ERROR("Failed to invalidate DMA-buf
import (%d))\n", r);
+    return;
+    }
+
+    for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) {
+    struct amdgpu_vm *vm = bo_base->vm;
+    struct dma_resv *resv = vm->root.base.bo->tbo.base.resv;
+
+    if (ticket) {

Yeah so this is kinda why I've been a total pain about the
exact semantics
of the move_notify hook. I think we should flat-out require
that importers
_always_ have a ticket attach when they call this, and that
they can cope
with additional locks being taken (i.e. full EDEADLCK) handling.

Simplest way to force that contract is to add a dummy 2nd
ww_mutex lock to
the dma_resv object, which we then can take #ifdef
CONFIG_WW_MUTEX_SLOWPATH_DEBUG. Plus maybe a WARN_ON(!ticket).

Now the real disaster is how we handle deadlocks. Two issues:

- Ideally we'd keep any lock we've taken locked until the
end, it helps
       needless backoffs. I've played around a bit with that
but not even poc
       level, just an idea:

https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d27be37331255ab7582


       Idea is essentially to track a list of objects we had to
lock as part of
       the ttm_bo_validate of the main object.

- Second one is if we get a EDEADLCK on one of these
sublocks (like the
       one here). We need to pass that up the entire callchain,
including a
       temporary reference (we have to drop locks to do the
ww_mutex_lock_slow
       call), and need a custom callback to drop that temporary reference
       (since that's all driver specific, might even be
internal ww_mutex and
       not anything remotely looking like a normal dma_buf).
This probably
       needs the exec util helpers from ttm, but at the
dma_resv level, so that
       we can do something like this:

struct dma_resv_ticket {
      struct ww_acquire_ctx base;

      /* can be set by anyone (including other drivers)
that got hold of
       * this ticket and had to acquire some new lock. This
lock might
       * protect anything, including driver-internal stuff, and isn't
       * required to be a dma_buf or even just a dma_resv. */
      struct ww_mutex *contended_lock;

      /* callback which the driver (which might be a dma-buf exporter
       * and not matching the driver that started this
locking ticket)
       * sets together with @contended_lock, for the main
driver to drop
       * when it calls dma_resv_unlock on the contended_lock. */
      void (*drop_ref)(struct ww_mutex *contended_lock);
};

This is all supremely nasty (also ttm_bo_validate would need to be
improved to handle these sublocks and random new objects
that could force
a ww_mutex_lock_slow).


Just a short comment on this:

Neither the currently u

Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-20 Thread VMware

On 2/20/20 9:08 PM, Daniel Vetter wrote:

On Thu, Feb 20, 2020 at 08:46:27PM +0100, Thomas Hellström (VMware) wrote:

On 2/20/20 7:04 PM, Daniel Vetter wrote:

On Thu, Feb 20, 2020 at 10:39:06AM +0100, Thomas Hellström (VMware) wrote:

On 2/19/20 7:42 AM, Thomas Hellström (VMware) wrote:

On 2/18/20 10:01 PM, Daniel Vetter wrote:

On Tue, Feb 18, 2020 at 9:17 PM Thomas Hellström (VMware)
 wrote:

On 2/17/20 6:55 PM, Daniel Vetter wrote:

On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:

Implement the importer side of unpinned DMA-buf handling.

v2: update page tables immediately

Signed-off-by: Christian König 
---
     drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66
-
     drivers/gpu/drm/amd/amdgpu/amdgpu_object.c  |  6 ++
     2 files changed, 71 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
index 770baba621b3..48de7624d49c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct
drm_device *dev, struct dma_buf *dma_buf)
    return ERR_PTR(ret);
     }

+/**
+ * amdgpu_dma_buf_move_notify - _notify implementation
+ *
+ * @attach: the DMA-buf attachment
+ *
+ * Invalidate the DMA-buf attachment, making sure that
the we re-create the
+ * mapping before the next use.
+ */
+static void
+amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach)
+{
+    struct drm_gem_object *obj = attach->importer_priv;
+    struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv);
+    struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
+    struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
+    struct ttm_operation_ctx ctx = { false, false };
+    struct ttm_placement placement = {};
+    struct amdgpu_vm_bo_base *bo_base;
+    int r;
+
+    if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM)
+    return;
+
+    r = ttm_bo_validate(&bo->tbo, &placement, &ctx);
+    if (r) {
+    DRM_ERROR("Failed to invalidate DMA-buf
import (%d))\n", r);
+    return;
+    }
+
+    for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) {
+    struct amdgpu_vm *vm = bo_base->vm;
+    struct dma_resv *resv = vm->root.base.bo->tbo.base.resv;
+
+    if (ticket) {

Yeah so this is kinda why I've been a total pain about the
exact semantics
of the move_notify hook. I think we should flat-out require
that importers
_always_ have a ticket attach when they call this, and that
they can cope
with additional locks being taken (i.e. full EDEADLCK) handling.

Simplest way to force that contract is to add a dummy 2nd
ww_mutex lock to
the dma_resv object, which we then can take #ifdef
CONFIG_WW_MUTEX_SLOWPATH_DEBUG. Plus maybe a WARN_ON(!ticket).

Now the real disaster is how we handle deadlocks. Two issues:

- Ideally we'd keep any lock we've taken locked until the
end, it helps
      needless backoffs. I've played around a bit with that
but not even poc
      level, just an idea:

https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d27be37331255ab7582


      Idea is essentially to track a list of objects we had to
lock as part of
      the ttm_bo_validate of the main object.

- Second one is if we get a EDEADLCK on one of these
sublocks (like the
      one here). We need to pass that up the entire callchain,
including a
      temporary reference (we have to drop locks to do the
ww_mutex_lock_slow
      call), and need a custom callback to drop that temporary reference
      (since that's all driver specific, might even be
internal ww_mutex and
      not anything remotely looking like a normal dma_buf).
This probably
      needs the exec util helpers from ttm, but at the
dma_resv level, so that
      we can do something like this:

struct dma_resv_ticket {
     struct ww_acquire_ctx base;

     /* can be set by anyone (including other drivers)
that got hold of
      * this ticket and had to acquire some new lock. This
lock might
      * protect anything, including driver-internal stuff, and isn't
      * required to be a dma_buf or even just a dma_resv. */
     struct ww_mutex *contended_lock;

     /* callback which the driver (which might be a dma-buf exporter
      * and not matching the driver that started this
locking ticket)
      * sets together with @contended_lock, for the main
driver to drop
      * when it calls dma_resv_unlock on the contended_lock. */
     void (*drop_ref)(struct ww_mutex *contended_lock);
};

This is all supremely nasty (also ttm_bo_validate would need to be
improved to handle these sublocks and random new objects
that could force
a ww_mutex_lock_slow).


Just a short comment on this:

Neither the currently used wait-die nor the wound-wait algorithm
*strictly* requires a slow lock on the contended lock. For
wait-die it's
just very convenient sin

Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-20 Thread VMware

On 2/20/20 7:04 PM, Daniel Vetter wrote:

On Thu, Feb 20, 2020 at 10:39:06AM +0100, Thomas Hellström (VMware) wrote:

On 2/19/20 7:42 AM, Thomas Hellström (VMware) wrote:

On 2/18/20 10:01 PM, Daniel Vetter wrote:

On Tue, Feb 18, 2020 at 9:17 PM Thomas Hellström (VMware)
 wrote:

On 2/17/20 6:55 PM, Daniel Vetter wrote:

On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:

Implement the importer side of unpinned DMA-buf handling.

v2: update page tables immediately

Signed-off-by: Christian König 
---
    drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66
-
    drivers/gpu/drm/amd/amdgpu/amdgpu_object.c  |  6 ++
    2 files changed, 71 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
index 770baba621b3..48de7624d49c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct
drm_device *dev, struct dma_buf *dma_buf)
   return ERR_PTR(ret);
    }

+/**
+ * amdgpu_dma_buf_move_notify - _notify implementation
+ *
+ * @attach: the DMA-buf attachment
+ *
+ * Invalidate the DMA-buf attachment, making sure that
the we re-create the
+ * mapping before the next use.
+ */
+static void
+amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach)
+{
+    struct drm_gem_object *obj = attach->importer_priv;
+    struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv);
+    struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
+    struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
+    struct ttm_operation_ctx ctx = { false, false };
+    struct ttm_placement placement = {};
+    struct amdgpu_vm_bo_base *bo_base;
+    int r;
+
+    if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM)
+    return;
+
+    r = ttm_bo_validate(&bo->tbo, &placement, &ctx);
+    if (r) {
+    DRM_ERROR("Failed to invalidate DMA-buf
import (%d))\n", r);
+    return;
+    }
+
+    for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) {
+    struct amdgpu_vm *vm = bo_base->vm;
+    struct dma_resv *resv = vm->root.base.bo->tbo.base.resv;
+
+    if (ticket) {

Yeah so this is kinda why I've been a total pain about the
exact semantics
of the move_notify hook. I think we should flat-out require
that importers
_always_ have a ticket attach when they call this, and that
they can cope
with additional locks being taken (i.e. full EDEADLCK) handling.

Simplest way to force that contract is to add a dummy 2nd
ww_mutex lock to
the dma_resv object, which we then can take #ifdef
CONFIG_WW_MUTEX_SLOWPATH_DEBUG. Plus maybe a WARN_ON(!ticket).

Now the real disaster is how we handle deadlocks. Two issues:

- Ideally we'd keep any lock we've taken locked until the
end, it helps
     needless backoffs. I've played around a bit with that
but not even poc
     level, just an idea:

https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d27be37331255ab7582


     Idea is essentially to track a list of objects we had to
lock as part of
     the ttm_bo_validate of the main object.

- Second one is if we get a EDEADLCK on one of these
sublocks (like the
     one here). We need to pass that up the entire callchain,
including a
     temporary reference (we have to drop locks to do the
ww_mutex_lock_slow
     call), and need a custom callback to drop that temporary reference
     (since that's all driver specific, might even be
internal ww_mutex and
     not anything remotely looking like a normal dma_buf).
This probably
     needs the exec util helpers from ttm, but at the
dma_resv level, so that
     we can do something like this:

struct dma_resv_ticket {
    struct ww_acquire_ctx base;

    /* can be set by anyone (including other drivers)
that got hold of
     * this ticket and had to acquire some new lock. This
lock might
     * protect anything, including driver-internal stuff, and isn't
     * required to be a dma_buf or even just a dma_resv. */
    struct ww_mutex *contended_lock;

    /* callback which the driver (which might be a dma-buf exporter
     * and not matching the driver that started this
locking ticket)
     * sets together with @contended_lock, for the main
driver to drop
     * when it calls dma_resv_unlock on the contended_lock. */
    void (*drop_ref)(struct ww_mutex *contended_lock);
};

This is all supremely nasty (also ttm_bo_validate would need to be
improved to handle these sublocks and random new objects
that could force
a ww_mutex_lock_slow).


Just a short comment on this:

Neither the currently used wait-die nor the wound-wait algorithm
*strictly* requires a slow lock on the contended lock. For
wait-die it's
just very convenient since it makes us sleep instead of spinning with
-EDEADLK on the contended lock. For wound-wait IIRC one could just
immediately restart the who

Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-20 Thread VMware

On 2/19/20 7:42 AM, Thomas Hellström (VMware) wrote:

On 2/18/20 10:01 PM, Daniel Vetter wrote:

On Tue, Feb 18, 2020 at 9:17 PM Thomas Hellström (VMware)
 wrote:

On 2/17/20 6:55 PM, Daniel Vetter wrote:

On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:

Implement the importer side of unpinned DMA-buf handling.

v2: update page tables immediately

Signed-off-by: Christian König 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66 
-

   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c  |  6 ++
   2 files changed, 71 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c

index 770baba621b3..48de7624d49c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct drm_device 
*dev, struct dma_buf *dma_buf)

  return ERR_PTR(ret);
   }

+/**
+ * amdgpu_dma_buf_move_notify - _notify implementation
+ *
+ * @attach: the DMA-buf attachment
+ *
+ * Invalidate the DMA-buf attachment, making sure that the we 
re-create the

+ * mapping before the next use.
+ */
+static void
+amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach)
+{
+    struct drm_gem_object *obj = attach->importer_priv;
+    struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv);
+    struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
+    struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
+    struct ttm_operation_ctx ctx = { false, false };
+    struct ttm_placement placement = {};
+    struct amdgpu_vm_bo_base *bo_base;
+    int r;
+
+    if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM)
+    return;
+
+    r = ttm_bo_validate(&bo->tbo, &placement, &ctx);
+    if (r) {
+    DRM_ERROR("Failed to invalidate DMA-buf import 
(%d))\n", r);

+    return;
+    }
+
+    for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) {
+    struct amdgpu_vm *vm = bo_base->vm;
+    struct dma_resv *resv = vm->root.base.bo->tbo.base.resv;
+
+    if (ticket) {
Yeah so this is kinda why I've been a total pain about the exact 
semantics
of the move_notify hook. I think we should flat-out require that 
importers
_always_ have a ticket attach when they call this, and that they 
can cope

with additional locks being taken (i.e. full EDEADLCK) handling.

Simplest way to force that contract is to add a dummy 2nd ww_mutex 
lock to

the dma_resv object, which we then can take #ifdef
CONFIG_WW_MUTEX_SLOWPATH_DEBUG. Plus maybe a WARN_ON(!ticket).

Now the real disaster is how we handle deadlocks. Two issues:

- Ideally we'd keep any lock we've taken locked until the end, it 
helps
    needless backoffs. I've played around a bit with that but not 
even poc

    level, just an idea:

https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d27be37331255ab7582 



    Idea is essentially to track a list of objects we had to lock 
as part of

    the ttm_bo_validate of the main object.

- Second one is if we get a EDEADLCK on one of these sublocks (like 
the
    one here). We need to pass that up the entire callchain, 
including a
    temporary reference (we have to drop locks to do the 
ww_mutex_lock_slow

    call), and need a custom callback to drop that temporary reference
    (since that's all driver specific, might even be internal 
ww_mutex and
    not anything remotely looking like a normal dma_buf). This 
probably
    needs the exec util helpers from ttm, but at the dma_resv 
level, so that

    we can do something like this:

struct dma_resv_ticket {
   struct ww_acquire_ctx base;

   /* can be set by anyone (including other drivers) that got 
hold of
    * this ticket and had to acquire some new lock. This lock 
might

    * protect anything, including driver-internal stuff, and isn't
    * required to be a dma_buf or even just a dma_resv. */
   struct ww_mutex *contended_lock;

   /* callback which the driver (which might be a dma-buf exporter
    * and not matching the driver that started this locking 
ticket)
    * sets together with @contended_lock, for the main driver 
to drop

    * when it calls dma_resv_unlock on the contended_lock. */
   void (*drop_ref)(struct ww_mutex *contended_lock);
};

This is all supremely nasty (also ttm_bo_validate would need to be
improved to handle these sublocks and random new objects that could 
force

a ww_mutex_lock_slow).


Just a short comment on this:

Neither the currently used wait-die nor the wound-wait algorithm
*strictly* requires a slow lock on the contended lock. For wait-die 
it's

just very convenient since it makes us sleep instead of spinning with
-EDEADLK on the contended lock. For wound-wait IIRC one could just
immediately restart the whole locking transaction after an -EDEADLK, 
and

the transaction would automatically end up waiting on the contende

Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-18 Thread VMware

On 2/18/20 10:01 PM, Daniel Vetter wrote:

On Tue, Feb 18, 2020 at 9:17 PM Thomas Hellström (VMware)
 wrote:

On 2/17/20 6:55 PM, Daniel Vetter wrote:

On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:

Implement the importer side of unpinned DMA-buf handling.

v2: update page tables immediately

Signed-off-by: Christian König 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66 -
   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c  |  6 ++
   2 files changed, 71 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
index 770baba621b3..48de7624d49c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct drm_device *dev, struct 
dma_buf *dma_buf)
  return ERR_PTR(ret);
   }

+/**
+ * amdgpu_dma_buf_move_notify - _notify implementation
+ *
+ * @attach: the DMA-buf attachment
+ *
+ * Invalidate the DMA-buf attachment, making sure that the we re-create the
+ * mapping before the next use.
+ */
+static void
+amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach)
+{
+struct drm_gem_object *obj = attach->importer_priv;
+struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv);
+struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
+struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
+struct ttm_operation_ctx ctx = { false, false };
+struct ttm_placement placement = {};
+struct amdgpu_vm_bo_base *bo_base;
+int r;
+
+if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM)
+return;
+
+r = ttm_bo_validate(&bo->tbo, &placement, &ctx);
+if (r) {
+DRM_ERROR("Failed to invalidate DMA-buf import (%d))\n", r);
+return;
+}
+
+for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) {
+struct amdgpu_vm *vm = bo_base->vm;
+struct dma_resv *resv = vm->root.base.bo->tbo.base.resv;
+
+if (ticket) {

Yeah so this is kinda why I've been a total pain about the exact semantics
of the move_notify hook. I think we should flat-out require that importers
_always_ have a ticket attach when they call this, and that they can cope
with additional locks being taken (i.e. full EDEADLCK) handling.

Simplest way to force that contract is to add a dummy 2nd ww_mutex lock to
the dma_resv object, which we then can take #ifdef
CONFIG_WW_MUTEX_SLOWPATH_DEBUG. Plus maybe a WARN_ON(!ticket).

Now the real disaster is how we handle deadlocks. Two issues:

- Ideally we'd keep any lock we've taken locked until the end, it helps
needless backoffs. I've played around a bit with that but not even poc
level, just an idea:

https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d27be37331255ab7582

Idea is essentially to track a list of objects we had to lock as part of
the ttm_bo_validate of the main object.

- Second one is if we get a EDEADLCK on one of these sublocks (like the
one here). We need to pass that up the entire callchain, including a
temporary reference (we have to drop locks to do the ww_mutex_lock_slow
call), and need a custom callback to drop that temporary reference
(since that's all driver specific, might even be internal ww_mutex and
not anything remotely looking like a normal dma_buf). This probably
needs the exec util helpers from ttm, but at the dma_resv level, so that
we can do something like this:

struct dma_resv_ticket {
   struct ww_acquire_ctx base;

   /* can be set by anyone (including other drivers) that got hold of
* this ticket and had to acquire some new lock. This lock might
* protect anything, including driver-internal stuff, and isn't
* required to be a dma_buf or even just a dma_resv. */
   struct ww_mutex *contended_lock;

   /* callback which the driver (which might be a dma-buf exporter
* and not matching the driver that started this locking ticket)
* sets together with @contended_lock, for the main driver to drop
* when it calls dma_resv_unlock on the contended_lock. */
   void (*drop_ref)(struct ww_mutex *contended_lock);
};

This is all supremely nasty (also ttm_bo_validate would need to be
improved to handle these sublocks and random new objects that could force
a ww_mutex_lock_slow).


Just a short comment on this:

Neither the currently used wait-die nor the wound-wait algorithm
*strictly* requires a slow lock on the contended lock. For wait-die it's
just very convenient since it makes us sleep instead of spinning with
-EDEADLK on the contended lock. For wound-wait IIRC one could just
immediately restart the whole locking transaction after an -EDEADLK, and
the transaction would automatically end up waiting on the contended
lock, provided the mutex lock stealing is not allowed. There is however
a possibility th

Re: [Intel-gfx] [PATCH 5/5] drm/amdgpu: implement amdgpu_gem_prime_move_notify v2

2020-02-18 Thread VMware

On 2/17/20 6:55 PM, Daniel Vetter wrote:

On Mon, Feb 17, 2020 at 04:45:09PM +0100, Christian König wrote:

Implement the importer side of unpinned DMA-buf handling.

v2: update page tables immediately

Signed-off-by: Christian König 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 66 -
  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c  |  6 ++
  2 files changed, 71 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
index 770baba621b3..48de7624d49c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -453,7 +453,71 @@ amdgpu_dma_buf_create_obj(struct drm_device *dev, struct 
dma_buf *dma_buf)
return ERR_PTR(ret);
  }
  
+/**

+ * amdgpu_dma_buf_move_notify - _notify implementation
+ *
+ * @attach: the DMA-buf attachment
+ *
+ * Invalidate the DMA-buf attachment, making sure that the we re-create the
+ * mapping before the next use.
+ */
+static void
+amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach)
+{
+   struct drm_gem_object *obj = attach->importer_priv;
+   struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv);
+   struct amdgpu_bo *bo = gem_to_amdgpu_bo(obj);
+   struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
+   struct ttm_operation_ctx ctx = { false, false };
+   struct ttm_placement placement = {};
+   struct amdgpu_vm_bo_base *bo_base;
+   int r;
+
+   if (bo->tbo.mem.mem_type == TTM_PL_SYSTEM)
+   return;
+
+   r = ttm_bo_validate(&bo->tbo, &placement, &ctx);
+   if (r) {
+   DRM_ERROR("Failed to invalidate DMA-buf import (%d))\n", r);
+   return;
+   }
+
+   for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next) {
+   struct amdgpu_vm *vm = bo_base->vm;
+   struct dma_resv *resv = vm->root.base.bo->tbo.base.resv;
+
+   if (ticket) {

Yeah so this is kinda why I've been a total pain about the exact semantics
of the move_notify hook. I think we should flat-out require that importers
_always_ have a ticket attach when they call this, and that they can cope
with additional locks being taken (i.e. full EDEADLCK) handling.

Simplest way to force that contract is to add a dummy 2nd ww_mutex lock to
the dma_resv object, which we then can take #ifdef
CONFIG_WW_MUTEX_SLOWPATH_DEBUG. Plus maybe a WARN_ON(!ticket).
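(Purely illustrative sketch of that contract check; note the mainline config
symbol is CONFIG_DEBUG_WW_MUTEX_SLOWPATH, and the dummy second ww_mutex is
omitted here:)

static void
amdgpu_dma_buf_move_notify(struct dma_buf_attachment *attach)
{
	struct drm_gem_object *obj = attach->importer_priv;
	struct ww_acquire_ctx *ticket = dma_resv_locking_ctx(obj->resv);

	/* importers must call move_notify with a locking context (ticket) */
	WARN_ON(IS_ENABLED(CONFIG_DEBUG_WW_MUTEX_SLOWPATH) && !ticket);
	...
}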

Now the real disaster is how we handle deadlocks. Two issues:

- Ideally we'd keep any lock we've taken locked until the end, it helps
   needless backoffs. I've played around a bit with that but not even poc
   level, just an idea:

https://cgit.freedesktop.org/~danvet/drm/commit/?id=b1799c5a0f02df9e1bb08d27be37331255ab7582

   Idea is essentially to track a list of objects we had to lock as part of
   the ttm_bo_validate of the main object.

- Second one is if we get a EDEADLCK on one of these sublocks (like the
   one here). We need to pass that up the entire callchain, including a
   temporary reference (we have to drop locks to do the ww_mutex_lock_slow
   call), and need a custom callback to drop that temporary reference
   (since that's all driver specific, might even be internal ww_mutex and
   not anything remotely looking like a normal dma_buf). This probably
   needs the exec util helpers from ttm, but at the dma_resv level, so that
   we can do something like this:

struct dma_resv_ticket {
struct ww_acquire_ctx base;

/* can be set by anyone (including other drivers) that got hold of
 * this ticket and had to acquire some new lock. This lock might
 * protect anything, including driver-internal stuff, and isn't
 * required to be a dma_buf or even just a dma_resv. */
struct ww_mutex *contended_lock;

/* callback which the driver (which might be a dma-buf exporter
 * and not matching the driver that started this locking ticket)
 * sets together with @contended_lock, for the main driver to drop
 * when it calls dma_resv_unlock on the contended_lock. */
void (*drop_ref)(struct ww_mutex *contended_lock);
};

This is all supremely nasty (also ttm_bo_validate would need to be
improved to handle these sublocks and random new objects that could force
a ww_mutex_lock_slow).
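(For readers following along, the backoff that ww_mutex_lock_slow() belongs to
looks roughly like the sketch below, condensed from the ww_mutex documentation;
the two helpers are invented placeholders and error paths are trimmed. The
comment that follows discusses whether the slow relock is strictly required.)

struct ww_acquire_ctx ctx;
struct ww_mutex *contended = NULL;
int ret;

ww_acquire_init(&ctx, &reservation_ww_class);
retry:
	if (contended) {
		/* sleep on the lock we lost to instead of spinning on -EDEADLK */
		ww_mutex_lock_slow(contended, &ctx);
	}
	/* lock every object with ww_mutex_lock(&obj->lock, &ctx), skipping
	 * 'contended' if it is already held; invented helper for brevity */
	ret = lock_all_objects(&ctx, &contended);
	if (ret == -EDEADLK) {
		unlock_everything_held();	/* invented helper */
		goto retry;
	}
	ww_acquire_done(&ctx);
	/* ... use the objects, then unlock them all and ww_acquire_fini(&ctx) ... */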


Just a short comment on this:

Neither the currently used wait-die nor the wound-wait algorithm 
*strictly* requires a slow lock on the contended lock. For wait-die it's 
just very convenient since it makes us sleep instead of spinning with 
-EDEADLK on the contended lock. For wound-wait IIRC one could just 
immediately restart the whole locking transaction after an -EDEADLK, and 
the transaction would automatically end up waiting on the contended 
lock, provided the mutex lock stealing is not allowed. There is however 
a possibility that the transaction will be wounded again on another 
lock, taken before the contended lock, 

Re: [Intel-gfx] [PATCH] drm/auth: Drop master_create/destroy hooks

2020-01-27 Thread VMware

On 1/27/20 11:02 AM, Daniel Vetter wrote:

vmwgfx stopped using them.

With the drm device model that we've slowly evolved over the past few
years master status essentially controls access to display resources,
and nothing else. Since that's a pure access permission check drivers
should have no need at all to track additional state on a per file
basis.

Aside: For cleanup and restoring kernel-internal clients the grand
plan is to move everyone over to drm_client and
drm_master_internal_acquire/release, like the generic fbdev code
already does. That should get rid of most ->lastclose implementations,
and I think also subsumes any processing vmwgfx does in
master_set/drop.

Cc: "Thomas Hellström (VMware)" 
Signed-off-by: Daniel Vetter 


Reviewed-by: Thomas Hellstrom 




Re: [Intel-gfx] [PATCH] drm: Avoid drm_global_mutex for simple inc/dec of dev->open_count

2020-01-24 Thread VMware

On 1/24/20 7:39 PM, Chris Wilson wrote:

Quoting Thomas Hellström (VMware) (2020-01-24 13:37:47)

On 1/24/20 2:01 PM, Chris Wilson wrote:

Since drm_global_mutex is a true global mutex across devices, we don't
want to acquire it unless absolutely necessary. For maintaining the
device local open_count, we can use atomic operations on the counter
itself, except when making the transition to/from 0. Here, we tackle the
easy portion of delaying acquiring the drm_global_mutex for the final
release by using atomic_dec_and_mutex_lock(), leaving the global
serialisation across the device opens.

Signed-off-by: Chris Wilson 
Cc: Thomas Hellström (VMware) 

For the series:

Reviewed-by: Thomas Hellström 

Now being opt-in, it is fairly limited in scope and will not randomly
break others (touch wood) and the close() racing in BAT didn't throw
anything up, so pushed to drm-misc-next. Thanks for the review and
suggestions,

Next task is to suggest others might like to use it as well.
-Chris


Thanks. I'll look at doing the same for those drivers I audited.

/Thomas




Re: [Intel-gfx] [PATCH] drm: Avoid drm_global_mutex for simple inc/dec of dev->open_count

2020-01-24 Thread VMware

On 1/24/20 2:01 PM, Chris Wilson wrote:

Since drm_global_mutex is a true global mutex across devices, we don't
want to acquire it unless absolutely necessary. For maintaining the
device local open_count, we can use atomic operations on the counter
itself, except when making the transition to/from 0. Here, we tackle the
easy portion of delaying acquiring the drm_global_mutex for the final
release by using atomic_dec_and_mutex_lock(), leaving the global
serialisation across the device opens.

Signed-off-by: Chris Wilson 
Cc: Thomas Hellström (VMware) 


For the series:

Reviewed-by: Thomas Hellström 

Now the only remaining (though pre-existing) problem I can see is that 
there is no corresponding mutex lock in drm_open() so that firstopen 
might race with lastclose.. Or I might be missing something..
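For reference, the final-release path being described boils down to something
like this (a sketch only, assuming dev->open_count becomes an atomic_t as in
the patch):

	/* only the 1 -> 0 transition takes the global mutex */
	if (atomic_dec_and_mutex_lock(&dev->open_count, &drm_global_mutex)) {
		drm_lastclose(dev);
		mutex_unlock(&drm_global_mutex);
	}

All other closes are a plain atomic decrement, which is what removes the
cross-device serialization for the common case.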


/Thomas




Re: [Intel-gfx] [PATCH v2] drm: Release filp before global lock

2020-01-24 Thread VMware

On 1/23/20 11:21 PM, Chris Wilson wrote:

The file is not part of the global drm resource and can be released
prior to take the global mutex to drop the open_count (and potentially
close) the drm device. As the global mutex is indeed global, not only
within the device but across devices, a slow file release mechanism can
bottleneck the entire system.

However, inside drm_close_helper() there are a number of dev->driver
callbacks that take the drm_device as the first parameter... Worryingly
some of those callbacks may be (implicitly) depending on the global
mutex.


From a quick audit, via, sis and vmwgfx are safe, so for those

Acked-by: Thomas Hellstrom 

Savage appears to be unsafe, due to unprotected access in the dma device 
member. Haven't audited i810 or potential other drivers affected. 
Perhaps it makes sense to enable lockfree filp release on a per-driver 
basis to begin with?


/Thomas




Re: [Intel-gfx] [PATCH] drm: Release filp before global lock

2020-01-22 Thread VMware

On 1/22/20 11:00 PM, Chris Wilson wrote:

Quoting Thomas Hellström (VMware) (2020-01-22 21:52:23)

Hi, Chris,

On 1/22/20 4:56 PM, Chris Wilson wrote:

The file is not part of the global drm resource and can be released
prior to take the global mutex to drop the open_count (and potentially
close) the drm device.

However, inside drm_close_helper() there are a number of dev->driver
callbacks that take the drm_device as the first parameter... Worryingly
some of those callbacks may be (implicitly) depending on the global
mutex.

I read this as you suspect that there are driver callbacks inside
drm_close_helper() that might need the global mutex held? But then it
wouldn't be safe to move the lock? Is there a strong motivation for
moving the locking in the first place? Also a minor nit below:

The number of processes stuck on 'D' due to mutex_lock() caught my
attention while they were cleaning up files. I think everyone else will
be less impressed if their driver was stuck because i915 was freeing a
user's filp.


Understood. Perhaps a short motivation in the log message?


Signed-off-by: Chris Wilson 
---
   drivers/gpu/drm/drm_file.c | 4 ++--
   1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
index 92d16724f949..84ed313ee2e9 100644
--- a/drivers/gpu/drm/drm_file.c
+++ b/drivers/gpu/drm/drm_file.c
@@ -438,12 +438,12 @@ int drm_release(struct inode *inode, struct file *filp)
   struct drm_minor *minor = file_priv->minor;
   struct drm_device *dev = minor->dev;
   
- mutex_lock(&drm_global_mutex);

-
   DRM_DEBUG("open_count = %d\n", dev->open_count);

The read of dev->open_count should still be inside the lock to be
consistent with the value that is decremented below. Perhaps move the
DRM_DEBUG()?

Sure. Is it even worth a debug?


Probably an old relic. I'm fine with letting it go.

Thanks,

Thomas




-Chris





Re: [Intel-gfx] [PATCH] drm: Release filp before global lock

2020-01-22 Thread VMware

Hi, Chris,

On 1/22/20 4:56 PM, Chris Wilson wrote:

The file is not part of the global drm resource and can be released
prior to take the global mutex to drop the open_count (and potentially
close) the drm device.

However, inside drm_close_helper() there are a number of dev->driver
callbacks that take the drm_device as the first parameter... Worryingly
some of those callbacks may be (implicitly) depending on the global
mutex.


I read this as you suspect that there are driver callbacks inside 
drm_close_helper() that might need the global mutex held? But then it 
wouldn't be safe to move the lock? Is there a strong motivation for 
moving the locking in the first place? Also a minor nit below:




Signed-off-by: Chris Wilson 
---
  drivers/gpu/drm/drm_file.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
index 92d16724f949..84ed313ee2e9 100644
--- a/drivers/gpu/drm/drm_file.c
+++ b/drivers/gpu/drm/drm_file.c
@@ -438,12 +438,12 @@ int drm_release(struct inode *inode, struct file *filp)
struct drm_minor *minor = file_priv->minor;
struct drm_device *dev = minor->dev;
  
-	mutex_lock(&drm_global_mutex);

-
DRM_DEBUG("open_count = %d\n", dev->open_count);


The read of dev->open_count should still be inside the lock to be 
consistent with the value that is decremented below. Perhaps move the 
DRM_DEBUG()?


  
  	drm_close_helper(filp);
  
+	mutex_lock(&drm_global_mutex);

+
if (!--dev->open_count)
drm_lastclose(dev);
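Concretely, the suggested ordering would be something like (sketch only):

	drm_close_helper(filp);

	mutex_lock(&drm_global_mutex);

	/* read open_count under the same lock that protects the decrement */
	DRM_DEBUG("open_count = %d\n", dev->open_count);

	if (!--dev->open_count)
		drm_lastclose(dev);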
  


Thanks,

Thomas




Re: [Intel-gfx] [PATCH] drm/ttm: remove ttm_bo_wait_unreserved

2019-08-22 Thread VMware

On 8/22/19 4:24 PM, Thomas Hellström (VMware) wrote:

On 8/22/19 4:02 PM, Koenig, Christian wrote:

Am 22.08.19 um 15:06 schrieb Daniel Vetter:

On Thu, Aug 22, 2019 at 07:56:56AM +, Koenig, Christian wrote:

Am 22.08.19 um 08:49 schrieb Daniel Vetter:

With nouveau fixed all ttm-using drivers have the correct nesting of
mmap_sem vs dma_resv, and we can just lock the buffer.

Assuming I didn't screw up anything with my audit of course.

v2:
- Dont forget wu_mutex (Christian König)
- Keep the mmap_sem-less wait optimization (Thomas)
- Use _lock_interruptible to be good citizens (Thomas)

Reviewed-by: Christian König 

btw I realized I didn't remove your r-b, since v1 was broken.

For formality, can you pls reaffirm, or still something broken?

My r-b is still valid.

Only problem I see is that neither of us seems to have a good idea about
the different VM_FAULT_* replies.


I took a look in mm/gup.c. It seems like when using get_user_pages, 
VM_FAULT_RETRY will retry 


s/retry/return/


to a requesting caller telling it that a long wait was expected and 
not performed, whereas VM_FAULT_NOPAGE will just keep get_user_pages 
to spin. So the proposed patch should be correct from my understanding.


If the fault originates from user-space, I guess either is fine.

/Thomas



Re: [Intel-gfx] [PATCH] drm/ttm: remove ttm_bo_wait_unreserved

2019-08-22 Thread VMware

On 8/22/19 4:02 PM, Koenig, Christian wrote:

Am 22.08.19 um 15:06 schrieb Daniel Vetter:

On Thu, Aug 22, 2019 at 07:56:56AM +, Koenig, Christian wrote:

Am 22.08.19 um 08:49 schrieb Daniel Vetter:

With nouveau fixed all ttm-using drivers have the correct nesting of
mmap_sem vs dma_resv, and we can just lock the buffer.

Assuming I didn't screw up anything with my audit of course.

v2:
- Dont forget wu_mutex (Christian König)
- Keep the mmap_sem-less wait optimization (Thomas)
- Use _lock_interruptible to be good citizens (Thomas)

Reviewed-by: Christian König 

btw I realized I didn't remove your r-b, since v1 was broken.

For formality, can you pls reaffirm, or still something broken?

My r-b is still valid.

Only problem I see is that neither of us seems to have a good idea about
the different VM_FAULT_* replies.


I took a look in mm/gup.c. It seems like when using get_user_pages, 
VM_FAULT_RETRY will retry to a requesting caller telling it that a long 
wait was expected and not performed, whereas VM_FAULT_NOPAGE will just 
keep get_user_pages to spin. So the proposed patch should be correct 
from my understanding.
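For reference, the retry path under discussion looks roughly like the following
sketch (based on the TTM fault-handler pattern, not copied verbatim from the
patch):

	if (unlikely(!dma_resv_trylock(bo->base.resv))) {
		if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
			if (!(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
				ttm_bo_get(bo);
				up_read(&vmf->vma->vm_mm->mmap_sem);
				/* wait for the reservation without mmap_sem held */
				if (!dma_resv_lock_interruptible(bo->base.resv, NULL))
					dma_resv_unlock(bo->base.resv);
				ttm_bo_put(bo);
			}
			return VM_FAULT_RETRY;	/* caller knows a long wait happened */
		}
		/* retries not allowed: let the caller simply fault again */
		return VM_FAULT_NOPAGE;
	}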


If the fault originates from user-space, I guess either is fine.

/Thomas




But that worked before so it should still work now,
Christian.


Also from the other thread: Reviewed-by: Thomas Hellström 


Thanks, Daniel


Signed-off-by: Daniel Vetter 
Cc: Christian Koenig 
Cc: Huang Rui 
Cc: Gerd Hoffmann 
Cc: "VMware Graphics" 
Cc: Thomas Hellstrom 
---
drivers/gpu/drm/ttm/ttm_bo.c  | 36 ---
drivers/gpu/drm/ttm/ttm_bo_util.c |  1 -
drivers/gpu/drm/ttm/ttm_bo_vm.c   | 18 +---
include/drm/ttm/ttm_bo_api.h  |  4 
4 files changed, 5 insertions(+), 54 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 20ff56f27aa4..d1ce5d315d5b 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -162,7 +162,6 @@ static void ttm_bo_release_list(struct kref *list_kref)
dma_fence_put(bo->moving);
if (!ttm_bo_uses_embedded_gem_object(bo))
 		dma_resv_fini(&bo->base._resv);
-   mutex_destroy(&bo->wu_mutex);
bo->destroy(bo);
ttm_mem_global_free(bdev->glob->mem_glob, acc_size);
}
@@ -1319,7 +1318,6 @@ int ttm_bo_init_reserved(struct ttm_bo_device *bdev,
 	INIT_LIST_HEAD(&bo->ddestroy);
 	INIT_LIST_HEAD(&bo->swap);
 	INIT_LIST_HEAD(&bo->io_reserve_lru);
-   mutex_init(&bo->wu_mutex);
bo->bdev = bdev;
bo->type = type;
bo->num_pages = num_pages;
@@ -1954,37 +1952,3 @@ void ttm_bo_swapout_all(struct ttm_bo_device *bdev)
;
}
EXPORT_SYMBOL(ttm_bo_swapout_all);
-
-/**
- * ttm_bo_wait_unreserved - interruptible wait for a buffer object to become
- * unreserved
- *
- * @bo: Pointer to buffer
- */
-int ttm_bo_wait_unreserved(struct ttm_buffer_object *bo)
-{
-   int ret;
-
-   /*
-* In the absense of a wait_unlocked API,
-* Use the bo::wu_mutex to avoid triggering livelocks due to
-* concurrent use of this function. Note that this use of
-* bo::wu_mutex can go away if we change locking order to
-* mmap_sem -> bo::reserve.
-*/
-   ret = mutex_lock_interruptible(&bo->wu_mutex);
-   if (unlikely(ret != 0))
-   return -ERESTARTSYS;
-   if (!dma_resv_is_locked(bo->base.resv))
-   goto out_unlock;
-   ret = dma_resv_lock_interruptible(bo->base.resv, NULL);
-   if (ret == -EINTR)
-   ret = -ERESTARTSYS;
-   if (unlikely(ret != 0))
-   goto out_unlock;
-   dma_resv_unlock(bo->base.resv);
-
-out_unlock:
-   mutex_unlock(&bo->wu_mutex);
-   return ret;
-}
diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c 
b/drivers/gpu/drm/ttm/ttm_bo_util.c
index fe81c565e7ef..82ea26a49959 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -508,7 +508,6 @@ static int ttm_buffer_object_transfer(struct 
ttm_buffer_object *bo,
 	INIT_LIST_HEAD(&fbo->base.lru);
 	INIT_LIST_HEAD(&fbo->base.swap);
 	INIT_LIST_HEAD(&fbo->base.io_reserve_lru);
-   mutex_init(&fbo->base.wu_mutex);
 	fbo->base.moving = NULL;
 	drm_vma_node_reset(&fbo->base.base.vma_node);
 	atomic_set(&fbo->base.cpu_writers, 0);
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 76eedb963693..a61a35e57d1c 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -125,30 +125,22 @@ static vm_fault_t ttm_bo_vm_fault(struct vm_fault *vmf)
 		&bdev->man[bo->mem.mem_type];
struct vm_area_struct cvma;

-	/*

-* Work around locking order reversal in fault / nopfn
-* between mmap_sem and bo_reserve: Perform a trylock operation
-* for reserve, and if it fails, retry the fault after waiting
-   

Re: [Intel-gfx] [PATCH 1/3] dma_resv: prime lockdep annotations

2019-08-22 Thread VMware

On 8/21/19 9:51 PM, Daniel Vetter wrote:

On Wed, Aug 21, 2019 at 08:27:59PM +0200, Thomas Hellström (VMware) wrote:

On 8/21/19 8:11 PM, Daniel Vetter wrote:

On Wed, Aug 21, 2019 at 7:06 PM Thomas Hellström (VMware)
 wrote:

On 8/21/19 6:34 PM, Daniel Vetter wrote:

On Wed, Aug 21, 2019 at 05:54:27PM +0200, Thomas Hellström (VMware) wrote:

On 8/20/19 4:53 PM, Daniel Vetter wrote:

Full audit of everyone:

- i915, radeon, amdgpu should be clean per their maintainers.

- vram helpers should be fine, they don't do command submission, so
  really no business holding struct_mutex while doing copy_*_user. But
  I haven't checked them all.

- panfrost seems to dma_resv_lock only in panfrost_job_push, which
  looks clean.

- v3d holds dma_resv locks in the tail of its v3d_submit_cl_ioctl(),
  copying from/to userspace happens all in v3d_lookup_bos which is
  outside of the critical section.

- vmwgfx has a bunch of ioctls that do their own copy_*_user:
  - vmw_execbuf_process: First this does some copies in
vmw_execbuf_cmdbuf() and also in the vmw_execbuf_process() itself.
Then comes the usual ttm reserve/validate sequence, then actual
submission/fencing, then unreserving, and finally some more
copy_to_user in vmw_execbuf_copy_fence_user. Glossing over tons of
details, but looks all safe.
  - vmw_fence_event_ioctl: No ttm_reserve/dma_resv_lock anywhere to be
seen, seems to only create a fence and copy it out.
  - a pile of smaller ioctl in vmwgfx_ioctl.c, no reservations to be
found there.
  Summary: vmwgfx seems to be fine too.

- virtio: There's virtio_gpu_execbuffer_ioctl, which does all the
  copying from userspace before even looking up objects through their
  handles, so safe. Plus the getparam/getcaps ioctl, also both safe.

- qxl only has qxl_execbuffer_ioctl, which calls into
  qxl_process_single_command. There's a lovely comment before the
  __copy_from_user_inatomic that the slowpath should be copied from
  i915, but I guess that never happened. Try not to be unlucky and get
  your CS data evicted between when it's written and the kernel tries
  to read it. The only other copy_from_user is for relocs, but those
  are done before qxl_release_reserve_list(), which seems to be the
  only thing reserving buffers (in the ttm/dma_resv sense) in that
  code. So looks safe.

- A debugfs file in nouveau_debugfs_pstate_set() and the usif ioctl in
  usif_ioctl() look safe. nouveau_gem_ioctl_pushbuf() otoh breaks this
  everywhere and needs to be fixed up.

Cc: Alex Deucher 
Cc: Christian König 
Cc: Chris Wilson 
Cc: Thomas Zimmermann 
Cc: Rob Herring 
Cc: Tomeu Vizoso 
Cc: Eric Anholt 
Cc: Dave Airlie 
Cc: Gerd Hoffmann 
Cc: Ben Skeggs 
Cc: "VMware Graphics" 
Cc: Thomas Hellstrom 
Signed-off-by: Daniel Vetter 
---
 drivers/dma-buf/dma-resv.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
index 42a8f3f11681..3edca10d3faf 100644
--- a/drivers/dma-buf/dma-resv.c
+++ b/drivers/dma-buf/dma-resv.c
@@ -34,6 +34,7 @@
 #include 
 #include 
+#include 
 /**
  * DOC: Reservation Object Overview
@@ -107,6 +108,17 @@ void dma_resv_init(struct dma_resv *obj)
   &reservation_seqcount_class);
  RCU_INIT_POINTER(obj->fence, NULL);
  RCU_INIT_POINTER(obj->fence_excl, NULL);
+
+	if (IS_ENABLED(CONFIG_LOCKDEP)) {
+		if (current->mm)
+			down_read(&current->mm->mmap_sem);
+		ww_mutex_lock(&obj->lock, NULL);
+		fs_reclaim_acquire(GFP_KERNEL);
+		fs_reclaim_release(GFP_KERNEL);
+		ww_mutex_unlock(&obj->lock);
+		if (current->mm)
+			up_read(&current->mm->mmap_sem);
+	}
 }
 EXPORT_SYMBOL(dma_resv_init);

I assume if this would have been easily done and maintainable using only
lockdep annotation instead of actually acquiring the locks, that would have
been done?

There's might_lock(), plus a pile of macros, but they don't map obviously,
so pretty good chances I accidentally end up with the wrong type of
annotation. Easier to just take the locks quickly, and stuff that all into
a lockdep-only section to avoid overhead.


Otherwise LGTM.

Reviewed-by: Thomas Hellström 

Will test this and let you know if it trips on vmwgfx, but it really
shouldn't.

Thanks, Daniel

One thing that strikes me is that this puts restrictions on where you
can actually initialize a dma_resv, even if locking orders are otherwise
obeyed. But that might not be a big problem.

Hm yeah ... the trouble is I need a non-kthread thread so that I have
a current->mm. Otherwise I'd have put it into some init section with a
temp dma_buf. And I kinda don't want to create a fake ->mm just for
lockdep priming. I don't expect this to be a real problem in practice,
since before you've called dma_resv_init the res

Re: [Intel-gfx] [PATCH] drm/ttm: remove ttm_bo_wait_unreserved

2019-08-22 Thread VMware

On 8/22/19 10:47 AM, Daniel Vetter wrote:

On Thu, Aug 22, 2019 at 9:56 AM Koenig, Christian
 wrote:

Am 22.08.19 um 08:49 schrieb Daniel Vetter:

With nouveau fixed all ttm-using drivers have the correct nesting of
mmap_sem vs dma_resv, and we can just lock the buffer.

Assuming I didn't screw up anything with my audit of course.

v2:
- Dont forget wu_mutex (Christian König)
- Keep the mmap_sem-less wait optimization (Thomas)
- Use _lock_interruptible to be good citizens (Thomas)

Reviewed-by: Christian König 
Signed-off-by: Daniel Vetter 
Cc: Christian Koenig 
Cc: Huang Rui 
Cc: Gerd Hoffmann 
Cc: "VMware Graphics" 
Cc: Thomas Hellstrom 
---
   drivers/gpu/drm/ttm/ttm_bo.c  | 36 ---
   drivers/gpu/drm/ttm/ttm_bo_util.c |  1 -
   drivers/gpu/drm/ttm/ttm_bo_vm.c   | 18 +---
   include/drm/ttm/ttm_bo_api.h  |  4 
   4 files changed, 5 insertions(+), 54 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 20ff56f27aa4..d1ce5d315d5b 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -162,7 +162,6 @@ static void ttm_bo_release_list(struct kref *list_kref)
   dma_fence_put(bo->moving);
   if (!ttm_bo_uses_embedded_gem_object(bo))
   dma_resv_fini(>base._resv);
- mutex_destroy(>wu_mutex);
   bo->destroy(bo);
   ttm_mem_global_free(bdev->glob->mem_glob, acc_size);
   }
@@ -1319,7 +1318,6 @@ int ttm_bo_init_reserved(struct ttm_bo_device *bdev,
   INIT_LIST_HEAD(>ddestroy);
   INIT_LIST_HEAD(>swap);
   INIT_LIST_HEAD(>io_reserve_lru);
- mutex_init(>wu_mutex);
   bo->bdev = bdev;
   bo->type = type;
   bo->num_pages = num_pages;
@@ -1954,37 +1952,3 @@ void ttm_bo_swapout_all(struct ttm_bo_device *bdev)
   ;
   }
   EXPORT_SYMBOL(ttm_bo_swapout_all);
-
-/**
- * ttm_bo_wait_unreserved - interruptible wait for a buffer object to become
- * unreserved
- *
- * @bo: Pointer to buffer
- */
-int ttm_bo_wait_unreserved(struct ttm_buffer_object *bo)
-{
- int ret;
-
- /*
-  * In the absense of a wait_unlocked API,
-  * Use the bo::wu_mutex to avoid triggering livelocks due to
-  * concurrent use of this function. Note that this use of
-  * bo::wu_mutex can go away if we change locking order to
-  * mmap_sem -> bo::reserve.
-  */
- ret = mutex_lock_interruptible(>wu_mutex);
- if (unlikely(ret != 0))
- return -ERESTARTSYS;
- if (!dma_resv_is_locked(bo->base.resv))
- goto out_unlock;
- ret = dma_resv_lock_interruptible(bo->base.resv, NULL);
- if (ret == -EINTR)
- ret = -ERESTARTSYS;
- if (unlikely(ret != 0))
- goto out_unlock;
- dma_resv_unlock(bo->base.resv);
-
-out_unlock:
- mutex_unlock(>wu_mutex);
- return ret;
-}
diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c 
b/drivers/gpu/drm/ttm/ttm_bo_util.c
index fe81c565e7ef..82ea26a49959 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -508,7 +508,6 @@ static int ttm_buffer_object_transfer(struct 
ttm_buffer_object *bo,
   INIT_LIST_HEAD(>base.lru);
   INIT_LIST_HEAD(>base.swap);
   INIT_LIST_HEAD(>base.io_reserve_lru);
- mutex_init(>base.wu_mutex);
   fbo->base.moving = NULL;
   drm_vma_node_reset(>base.base.vma_node);
   atomic_set(>base.cpu_writers, 0);
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 76eedb963693..a61a35e57d1c 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -125,30 +125,22 @@ static vm_fault_t ttm_bo_vm_fault(struct vm_fault *vmf)
   >man[bo->mem.mem_type];
   struct vm_area_struct cvma;

- /*
-  * Work around locking order reversal in fault / nopfn
-  * between mmap_sem and bo_reserve: Perform a trylock operation
-  * for reserve, and if it fails, retry the fault after waiting
-  * for the buffer to become unreserved.
-  */
   if (unlikely(!dma_resv_trylock(bo->base.resv))) {
   if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
   if (!(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {

Not an expert on fault handling, but shouldn't this now be one if?

E.g. if FAULT_FLAG_RETRY_NOWAIT is set we should return VM_FAULT_NOPAGE
instead of VM_FAULT_RETRY.

Honestly I have no idea at all about this stuff. I just learned about
the mmap_sem-less retry now that Thomas pointed it out, and I have no
idea how anything else here works. My approach has always been to just
throw ridiculous amounts of really nasty tests at fault handling
(including handling our own gtt mmaps to copy*user in relocs or gup
for userptr and all that), and leave it at that :-)


But really take that with a grain of salt,

No idea either. It should be functionally equivalent to wha
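
For reference, the shape Christian seems to be hinting at above, as a
hypothetical sketch rather than the actual patch: with the locking order
fixed, FAULT_FLAG_RETRY_NOWAIT just means the handler may neither drop
mmap_sem nor block, so the trylock-failure path could collapse to a single
check:

    /* Sketch only: if we may not retry at all, or may not drop mmap_sem
     * while waiting (FAULT_FLAG_RETRY_NOWAIT), return VM_FAULT_NOPAGE and
     * let the fault be re-run instead of signalling VM_FAULT_RETRY.
     */
    if (!(vmf->flags & FAULT_FLAG_ALLOW_RETRY) ||
        (vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
        return VM_FAULT_NOPAGE;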

Re: [Intel-gfx] [PATCH 3/3] drm/ttm: remove ttm_bo_wait_unreserved

2019-08-21 Thread VMware

On 8/21/19 4:47 PM, Daniel Vetter wrote:

On Wed, Aug 21, 2019 at 4:27 PM Thomas Hellström (VMware)
 wrote:

On 8/21/19 4:09 PM, Daniel Vetter wrote:

On Wed, Aug 21, 2019 at 2:47 PM Thomas Hellström (VMware)
 wrote:

On 8/21/19 2:40 PM, Thomas Hellström (VMware) wrote:

On 8/20/19 4:53 PM, Daniel Vetter wrote:

With nouveau fixed, all ttm-using drivers have the correct nesting of
mmap_sem vs dma_resv, and we can just lock the buffer.

Assuming I didn't screw up anything with my audit of course.

Signed-off-by: Daniel Vetter 
Cc: Christian Koenig 
Cc: Huang Rui 
Cc: Gerd Hoffmann 
Cc: "VMware Graphics" 
Cc: Thomas Hellstrom 
---
drivers/gpu/drm/ttm/ttm_bo.c| 34 -
drivers/gpu/drm/ttm/ttm_bo_vm.c | 26 +
include/drm/ttm/ttm_bo_api.h|  1 -
3 files changed, 1 insertion(+), 60 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 20ff56f27aa4..a952dd624b06 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -1954,37 +1954,3 @@ void ttm_bo_swapout_all(struct ttm_bo_device
*bdev)
;
}
EXPORT_SYMBOL(ttm_bo_swapout_all);
-
-/**
- * ttm_bo_wait_unreserved - interruptible wait for a buffer object
to become
- * unreserved
- *
- * @bo: Pointer to buffer
- */
-int ttm_bo_wait_unreserved(struct ttm_buffer_object *bo)
-{
-int ret;
-
-/*
- * In the absense of a wait_unlocked API,
- * Use the bo::wu_mutex to avoid triggering livelocks due to
- * concurrent use of this function. Note that this use of
- * bo::wu_mutex can go away if we change locking order to
- * mmap_sem -> bo::reserve.
- */
-ret = mutex_lock_interruptible(>wu_mutex);
-if (unlikely(ret != 0))
-return -ERESTARTSYS;
-if (!dma_resv_is_locked(bo->base.resv))
-goto out_unlock;
-ret = dma_resv_lock_interruptible(bo->base.resv, NULL);
-if (ret == -EINTR)
-ret = -ERESTARTSYS;
-if (unlikely(ret != 0))
-goto out_unlock;
-dma_resv_unlock(bo->base.resv);
-
-out_unlock:
-mutex_unlock(>wu_mutex);
-return ret;
-}
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c
b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 76eedb963693..505e1787aeea 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -125,31 +125,7 @@ static vm_fault_t ttm_bo_vm_fault(struct
vm_fault *vmf)
>man[bo->mem.mem_type];
struct vm_area_struct cvma;
-/*
- * Work around locking order reversal in fault / nopfn
- * between mmap_sem and bo_reserve: Perform a trylock operation
- * for reserve, and if it fails, retry the fault after waiting
- * for the buffer to become unreserved.
- */
-if (unlikely(!dma_resv_trylock(bo->base.resv))) {
-if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
-if (!(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
-ttm_bo_get(bo);
- up_read(>vma->vm_mm->mmap_sem);
-(void) ttm_bo_wait_unreserved(bo);
-ttm_bo_put(bo);
-}
-
-return VM_FAULT_RETRY;
-}
-
-/*
- * If we'd want to change locking order to
- * mmap_sem -> bo::reserve, we'd use a blocking reserve here
- * instead of retrying the fault...
- */

I think you should justify why the above code is removed, since the
comments actually outline what to do if we change locking order.

The code that's removed above is not for adjusting locking orders but
to decrease the mm latency by releasing the mmap_sem while waiting for
bo reserve which in turn might be waiting for GPU. At a minimum we
should have a separate patch with justification.

Note that the caller here ensures locking progress by adjusting the
RETRY flags after a retry.

That would be patches 1&2 in this series.


Hmm? Those seem to touch only dma-buf and nouveau, not ttm? I mean this
patch should look something like the following (based on an older tree) to
implement the new locking order mmap_sem -> reservation,

Only nouveau was breaking this, doing copy_*_user or get_user_pages
while holding dma_resv locks; no one else. So nothing else needed to
be changed. But patch 1 contains the full audit. I might have missed
something.


but to keep the mm latency optimization using the RETRY functionality:

Still no idea why this is needed? All the comments here and the code
and history seem like they've been about the mmap_sem vs dma_resv
inversion between driver ioctls and fault handling here. Once that's
officially fixed there's no reason to play games with retry loops here
- previously that was necessary because the old ttm_bo_vm_fault had a
busy spin and that's definitely not nice. If it's needed I think it
should be a second patch on top, to keep this all clear. I had to
audit an enormous amount of code, I'd like to make sure I didn't miss
anything before we start to mak

Re: [Intel-gfx] [PATCH 1/3] dma_resv: prime lockdep annotations

2019-08-21 Thread VMware

On 8/21/19 6:34 PM, Daniel Vetter wrote:

On Wed, Aug 21, 2019 at 05:54:27PM +0200, Thomas Hellström (VMware) wrote:

On 8/20/19 4:53 PM, Daniel Vetter wrote:

Full audit of everyone:

- i915, radeon, amdgpu should be clean per their maintainers.

- vram helpers should be fine, they don't do command submission, so
really no business holding struct_mutex while doing copy_*_user. But
I haven't checked them all.

- panfrost seems to dma_resv_lock only in panfrost_job_push, which
looks clean.

- v3d holds dma_resv locks in the tail of its v3d_submit_cl_ioctl(),
copying from/to userspace happens all in v3d_lookup_bos which is
outside of the critical section.

- vmwgfx has a bunch of ioctls that do their own copy_*_user:
- vmw_execbuf_process: First this does some copies in
  vmw_execbuf_cmdbuf() and also in the vmw_execbuf_process() itself.
  Then comes the usual ttm reserve/validate sequence, then actual
  submission/fencing, then unreserving, and finally some more
  copy_to_user in vmw_execbuf_copy_fence_user. Glossing over tons of
  details, but looks all safe.
- vmw_fence_event_ioctl: No ttm_reserve/dma_resv_lock anywhere to be
  seen, seems to only create a fence and copy it out.
- a pile of smaller ioctl in vmwgfx_ioctl.c, no reservations to be
  found there.
Summary: vmwgfx seems to be fine too.

- virtio: There's virtio_gpu_execbuffer_ioctl, which does all the
copying from userspace before even looking up objects through their
handles, so safe. Plus the getparam/getcaps ioctl, also both safe.

- qxl only has qxl_execbuffer_ioctl, which calls into
qxl_process_single_command. There's a lovely comment before the
__copy_from_user_inatomic that the slowpath should be copied from
i915, but I guess that never happened. Try not to be unlucky and get
your CS data evicted between when it's written and the kernel tries
to read it. The only other copy_from_user is for relocs, but those
are done before qxl_release_reserve_list(), which seems to be the
only thing reserving buffers (in the ttm/dma_resv sense) in that
code. So looks safe.

- A debugfs file in nouveau_debugfs_pstate_set() and the usif ioctl in
usif_ioctl() look safe. nouveau_gem_ioctl_pushbuf() otoh breaks this
everywhere and needs to be fixed up.

Cc: Alex Deucher 
Cc: Christian König 
Cc: Chris Wilson 
Cc: Thomas Zimmermann 
Cc: Rob Herring 
Cc: Tomeu Vizoso 
Cc: Eric Anholt 
Cc: Dave Airlie 
Cc: Gerd Hoffmann 
Cc: Ben Skeggs 
Cc: "VMware Graphics" 
Cc: Thomas Hellstrom 
Signed-off-by: Daniel Vetter 
---
   drivers/dma-buf/dma-resv.c | 12 
   1 file changed, 12 insertions(+)

diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
index 42a8f3f11681..3edca10d3faf 100644
--- a/drivers/dma-buf/dma-resv.c
+++ b/drivers/dma-buf/dma-resv.c
@@ -34,6 +34,7 @@
   #include 
   #include 
+#include 
   /**
* DOC: Reservation Object Overview
@@ -107,6 +108,17 @@ void dma_resv_init(struct dma_resv *obj)
_seqcount_class);
RCU_INIT_POINTER(obj->fence, NULL);
RCU_INIT_POINTER(obj->fence_excl, NULL);
+
+   if (IS_ENABLED(CONFIG_LOCKDEP)) {
+   if (current->mm)
+   down_read(&current->mm->mmap_sem);
+   ww_mutex_lock(&obj->lock, NULL);
+   fs_reclaim_acquire(GFP_KERNEL);
+   fs_reclaim_release(GFP_KERNEL);
+   ww_mutex_unlock(&obj->lock);
+   if (current->mm)
+   up_read(&current->mm->mmap_sem);
+   }
   }
   EXPORT_SYMBOL(dma_resv_init);

I assume if this would have been easily done and maintainable using only
lockdep annotation instead of actually acquiring the locks, that would have
been done?

There's might_lock(), plus a pile of macros, but they don't map obviously,
so there's a pretty good chance I'd accidentally end up with the wrong type of
annotation. Easier to just take the locks quickly, and stuff that all into
a lockdep-only section to avoid overhead.


Otherwise LGTM.

Reviewed-by: Thomas Hellström 

Will test this and let you know if it trips on vmwgfx, but it really
shouldn't.

Thanks, Daniel


One thing that strikes me is that this puts restrictions on where you 
can actually initialize a dma_resv, even if locking orders are otherwise 
obeyed. But that might not be a big problem.


/Thomas
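
One concrete, hypothetical example of such a restriction (the parent/child
names are made up): dma_resv_init() now briefly acquires the new object's
ww_mutex, and mmap_sem for non-kthreads, so initializing a reservation
object while already holding another reservation of the same ww class will
look like a nesting problem to lockdep, even though the freshly initialized
lock cannot actually be contended yet:

    /* Hypothetical: init an embedded reservation object while a related
     * one is held.  The priming takes child->_resv.lock (same ww class as
     * the parent's) and mmap_sem, so lockdep complains about the nesting
     * although no real deadlock is possible here.
     */
    dma_resv_lock(parent->base.resv, NULL);
    dma_resv_init(&child->_resv);       /* lockdep priming runs in here */
    dma_resv_unlock(parent->base.resv);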





Re: [Intel-gfx] [PATCH 3/3] drm/ttm: remove ttm_bo_wait_unreserved

2019-08-21 Thread VMware

On 8/21/19 5:14 PM, Daniel Vetter wrote:

On Wed, Aug 21, 2019 at 5:03 PM Thomas Hellström (VMware)
 wrote:

On 8/21/19 4:47 PM, Daniel Vetter wrote:

On Wed, Aug 21, 2019 at 4:27 PM Thomas Hellström (VMware)
 wrote:

On 8/21/19 4:09 PM, Daniel Vetter wrote:

On Wed, Aug 21, 2019 at 2:47 PM Thomas Hellström (VMware)
 wrote:

On 8/21/19 2:40 PM, Thomas Hellström (VMware) wrote:

On 8/20/19 4:53 PM, Daniel Vetter wrote:

With nouveau fixed, all ttm-using drivers have the correct nesting of
mmap_sem vs dma_resv, and we can just lock the buffer.

Assuming I didn't screw up anything with my audit of course.

Signed-off-by: Daniel Vetter 
Cc: Christian Koenig 
Cc: Huang Rui 
Cc: Gerd Hoffmann 
Cc: "VMware Graphics" 
Cc: Thomas Hellstrom 
---
 drivers/gpu/drm/ttm/ttm_bo.c| 34 -
 drivers/gpu/drm/ttm/ttm_bo_vm.c | 26 +
 include/drm/ttm/ttm_bo_api.h|  1 -
 3 files changed, 1 insertion(+), 60 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 20ff56f27aa4..a952dd624b06 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -1954,37 +1954,3 @@ void ttm_bo_swapout_all(struct ttm_bo_device
*bdev)
 ;
 }
 EXPORT_SYMBOL(ttm_bo_swapout_all);
-
-/**
- * ttm_bo_wait_unreserved - interruptible wait for a buffer object
to become
- * unreserved
- *
- * @bo: Pointer to buffer
- */
-int ttm_bo_wait_unreserved(struct ttm_buffer_object *bo)
-{
-int ret;
-
-/*
- * In the absense of a wait_unlocked API,
- * Use the bo::wu_mutex to avoid triggering livelocks due to
- * concurrent use of this function. Note that this use of
- * bo::wu_mutex can go away if we change locking order to
- * mmap_sem -> bo::reserve.
- */
-ret = mutex_lock_interruptible(>wu_mutex);
-if (unlikely(ret != 0))
-return -ERESTARTSYS;
-if (!dma_resv_is_locked(bo->base.resv))
-goto out_unlock;
-ret = dma_resv_lock_interruptible(bo->base.resv, NULL);
-if (ret == -EINTR)
-ret = -ERESTARTSYS;
-if (unlikely(ret != 0))
-goto out_unlock;
-dma_resv_unlock(bo->base.resv);
-
-out_unlock:
-mutex_unlock(>wu_mutex);
-return ret;
-}
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c
b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 76eedb963693..505e1787aeea 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -125,31 +125,7 @@ static vm_fault_t ttm_bo_vm_fault(struct
vm_fault *vmf)
 >man[bo->mem.mem_type];
 struct vm_area_struct cvma;
 -/*
- * Work around locking order reversal in fault / nopfn
- * between mmap_sem and bo_reserve: Perform a trylock operation
- * for reserve, and if it fails, retry the fault after waiting
- * for the buffer to become unreserved.
- */
-if (unlikely(!dma_resv_trylock(bo->base.resv))) {
-if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
-if (!(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
-ttm_bo_get(bo);
- up_read(>vma->vm_mm->mmap_sem);
-(void) ttm_bo_wait_unreserved(bo);
-ttm_bo_put(bo);
-}
-
-return VM_FAULT_RETRY;
-}
-
-/*
- * If we'd want to change locking order to
- * mmap_sem -> bo::reserve, we'd use a blocking reserve here
- * instead of retrying the fault...
- */

I think you should justify why the above code is removed, since the
comments actually outline what to do if we change locking order.

The code that's removed above is not for adjusting locking orders but
to decrease the mm latency by releasing the mmap_sem while waiting for
bo reserve which in turn might be waiting for GPU. At a minimum we
should have a separate patch with justification.

Note that the caller here ensures locking progress by adjusting the
RETRY flags after a retry.

That would be patches 1&2 in this series.


Hmm? Those seem to touch only dma-buf and nouveau, not ttm? I mean this
patch should look something like the following (based on an older tree) to
implement the new locking order mmap_sem -> reservation,

Only nouveau was breaking this, doing copy_*_user or get_user_pages
while holding dma_resv locks; no one else. So nothing else needed to
be changed. But patch 1 contains the full audit. I might have missed
something.


but to keep the mm latency optimization using the RETRY functionality:

Still no idea why this is needed? All the comments here and the code
and history seem like they've been about the mmap_sem vs dma_resv
inversion between driver ioctls and fault handling here. Once that's
officially fixed there's no reason to play games with retry loops here
- previously that was necessary because the old ttm_bo_vm_fault had a
busy spin and that's definitely not nice. If it's needed I think it
should be a second patch on top, to keep this all clear.

Re: [Intel-gfx] [PATCH 1/3] dma_resv: prime lockdep annotations

2019-08-21 Thread VMware

On 8/21/19 8:11 PM, Daniel Vetter wrote:

On Wed, Aug 21, 2019 at 7:06 PM Thomas Hellström (VMware)
 wrote:

On 8/21/19 6:34 PM, Daniel Vetter wrote:

On Wed, Aug 21, 2019 at 05:54:27PM +0200, Thomas Hellström (VMware) wrote:

On 8/20/19 4:53 PM, Daniel Vetter wrote:

Full audit of everyone:

- i915, radeon, amdgpu should be clean per their maintainers.

- vram helpers should be fine, they don't do command submission, so
 really no business holding struct_mutex while doing copy_*_user. But
 I haven't checked them all.

- panfrost seems to dma_resv_lock only in panfrost_job_push, which
 looks clean.

- v3d holds dma_resv locks in the tail of its v3d_submit_cl_ioctl(),
 copying from/to userspace happens all in v3d_lookup_bos which is
 outside of the critical section.

- vmwgfx has a bunch of ioctls that do their own copy_*_user:
 - vmw_execbuf_process: First this does some copies in
   vmw_execbuf_cmdbuf() and also in the vmw_execbuf_process() itself.
   Then comes the usual ttm reserve/validate sequence, then actual
   submission/fencing, then unreserving, and finally some more
   copy_to_user in vmw_execbuf_copy_fence_user. Glossing over tons of
   details, but looks all safe.
 - vmw_fence_event_ioctl: No ttm_reserve/dma_resv_lock anywhere to be
   seen, seems to only create a fence and copy it out.
 - a pile of smaller ioctl in vmwgfx_ioctl.c, no reservations to be
   found there.
 Summary: vmwgfx seems to be fine too.

- virtio: There's virtio_gpu_execbuffer_ioctl, which does all the
 copying from userspace before even looking up objects through their
 handles, so safe. Plus the getparam/getcaps ioctl, also both safe.

- qxl only has qxl_execbuffer_ioctl, which calls into
 qxl_process_single_command. There's a lovely comment before the
 __copy_from_user_inatomic that the slowpath should be copied from
 i915, but I guess that never happened. Try not to be unlucky and get
 your CS data evicted between when it's written and the kernel tries
 to read it. The only other copy_from_user is for relocs, but those
 are done before qxl_release_reserve_list(), which seems to be the
 only thing reserving buffers (in the ttm/dma_resv sense) in that
 code. So looks safe.

- A debugfs file in nouveau_debugfs_pstate_set() and the usif ioctl in
 usif_ioctl() look safe. nouveau_gem_ioctl_pushbuf() otoh breaks this
 everywhere and needs to be fixed up.

Cc: Alex Deucher 
Cc: Christian König 
Cc: Chris Wilson 
Cc: Thomas Zimmermann 
Cc: Rob Herring 
Cc: Tomeu Vizoso 
Cc: Eric Anholt 
Cc: Dave Airlie 
Cc: Gerd Hoffmann 
Cc: Ben Skeggs 
Cc: "VMware Graphics" 
Cc: Thomas Hellstrom 
Signed-off-by: Daniel Vetter 
---
drivers/dma-buf/dma-resv.c | 12 
1 file changed, 12 insertions(+)

diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
index 42a8f3f11681..3edca10d3faf 100644
--- a/drivers/dma-buf/dma-resv.c
+++ b/drivers/dma-buf/dma-resv.c
@@ -34,6 +34,7 @@
#include 
#include 
+#include 
/**
 * DOC: Reservation Object Overview
@@ -107,6 +108,17 @@ void dma_resv_init(struct dma_resv *obj)
 _seqcount_class);
 RCU_INIT_POINTER(obj->fence, NULL);
 RCU_INIT_POINTER(obj->fence_excl, NULL);
+
+   if (IS_ENABLED(CONFIG_LOCKDEP)) {
+   if (current->mm)
+   down_read(&current->mm->mmap_sem);
+   ww_mutex_lock(&obj->lock, NULL);
+   fs_reclaim_acquire(GFP_KERNEL);
+   fs_reclaim_release(GFP_KERNEL);
+   ww_mutex_unlock(&obj->lock);
+   if (current->mm)
+   up_read(&current->mm->mmap_sem);
+   }
}
EXPORT_SYMBOL(dma_resv_init);

I assume if this would have been easily done and maintainable using only
lockdep annotation instead of actually acquiring the locks, that would have
been done?

There's might_lock(), plus a pile of macros, but they don't map obviously,
so there's a pretty good chance I'd accidentally end up with the wrong type of
annotation. Easier to just take the locks quickly, and stuff that all into
a lockdep-only section to avoid overhead.


Otherwise LGTM.

Reviewed-by: Thomas Hellström 

Will test this and let you know if it trips on vmwgfx, but it really
shouldn't.

Thanks, Daniel

One thing that strikes me is that this puts restrictions on where you
can actually initialize a dma_resv, even if locking orders are otherwise
obeyed. But that might not be a big problem.

Hm yeah ... the trouble is I need a non-kthread thread so that I have
a current->mm. Otherwise I'd have put it into some init section with a
temp dma_buf. And I kinda don't want to create a fake ->mm just for
lockdep priming. I don't expect this to be a real problem in practice,
since before you've called dma_resv_init the reservation lock doesn't
exist, so you can't hold it. And you've probably just allocated it, so
fs_reclaim is going to be fine. And if you allocate dma_resv obj

Re: [Intel-gfx] [PATCH 3/3] drm/ttm: remove ttm_bo_wait_unreserved

2019-08-21 Thread VMware

On 8/20/19 4:53 PM, Daniel Vetter wrote:

With nouveau fixed, all ttm-using drivers have the correct nesting of
mmap_sem vs dma_resv, and we can just lock the buffer.

Assuming I didn't screw up anything with my audit of course.

Signed-off-by: Daniel Vetter 
Cc: Christian Koenig 
Cc: Huang Rui 
Cc: Gerd Hoffmann 
Cc: "VMware Graphics" 
Cc: Thomas Hellstrom 
---
  drivers/gpu/drm/ttm/ttm_bo.c| 34 -
  drivers/gpu/drm/ttm/ttm_bo_vm.c | 26 +
  include/drm/ttm/ttm_bo_api.h|  1 -
  3 files changed, 1 insertion(+), 60 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 20ff56f27aa4..a952dd624b06 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -1954,37 +1954,3 @@ void ttm_bo_swapout_all(struct ttm_bo_device *bdev)
;
  }
  EXPORT_SYMBOL(ttm_bo_swapout_all);
-
-/**
- * ttm_bo_wait_unreserved - interruptible wait for a buffer object to become
- * unreserved
- *
- * @bo: Pointer to buffer
- */
-int ttm_bo_wait_unreserved(struct ttm_buffer_object *bo)
-{
-   int ret;
-
-   /*
-* In the absense of a wait_unlocked API,
-* Use the bo::wu_mutex to avoid triggering livelocks due to
-* concurrent use of this function. Note that this use of
-* bo::wu_mutex can go away if we change locking order to
-* mmap_sem -> bo::reserve.
-*/
-   ret = mutex_lock_interruptible(>wu_mutex);
-   if (unlikely(ret != 0))
-   return -ERESTARTSYS;
-   if (!dma_resv_is_locked(bo->base.resv))
-   goto out_unlock;
-   ret = dma_resv_lock_interruptible(bo->base.resv, NULL);
-   if (ret == -EINTR)
-   ret = -ERESTARTSYS;
-   if (unlikely(ret != 0))
-   goto out_unlock;
-   dma_resv_unlock(bo->base.resv);
-
-out_unlock:
-   mutex_unlock(>wu_mutex);
-   return ret;
-}
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 76eedb963693..505e1787aeea 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -125,31 +125,7 @@ static vm_fault_t ttm_bo_vm_fault(struct vm_fault *vmf)
>man[bo->mem.mem_type];
struct vm_area_struct cvma;
  
-	/*

-* Work around locking order reversal in fault / nopfn
-* between mmap_sem and bo_reserve: Perform a trylock operation
-* for reserve, and if it fails, retry the fault after waiting
-* for the buffer to become unreserved.
-*/
-   if (unlikely(!dma_resv_trylock(bo->base.resv))) {
-   if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
-   if (!(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
-   ttm_bo_get(bo);
-   up_read(>vma->vm_mm->mmap_sem);
-   (void) ttm_bo_wait_unreserved(bo);
-   ttm_bo_put(bo);
-   }
-
-   return VM_FAULT_RETRY;
-   }
-
-   /*
-* If we'd want to change locking order to
-* mmap_sem -> bo::reserve, we'd use a blocking reserve here
-* instead of retrying the fault...
-*/


I think you should justify why the above code is removed, since the 
comments actually outline what to do if we change locking order.


The code that's removed above is not for adjusting locking orders but to 
decrease the mm latency by releasing the mmap_sem while waiting for bo 
reserve which in turn might be waiting for GPU. At a minimum we should 
have a separate patch with justification.


Note that the caller here ensures locking progress by adjusting the 
RETRY flags after a retry.


/Thomas



-   return VM_FAULT_NOPAGE;
-   }
+   dma_resv_lock(bo->base.resv, NULL);
  
  	/*

 * Refuse to fault imported pages. This should be handled
diff --git a/include/drm/ttm/ttm_bo_api.h b/include/drm/ttm/ttm_bo_api.h
index 43c4929a2171..6b50e624e3e2 100644
--- a/include/drm/ttm/ttm_bo_api.h
+++ b/include/drm/ttm/ttm_bo_api.h
@@ -765,7 +765,6 @@ ssize_t ttm_bo_io(struct ttm_bo_device *bdev, struct file 
*filp,
  int ttm_bo_swapout(struct ttm_bo_global *glob,
struct ttm_operation_ctx *ctx);
  void ttm_bo_swapout_all(struct ttm_bo_device *bdev);
-int ttm_bo_wait_unreserved(struct ttm_buffer_object *bo);
  
  /**

   * ttm_bo_uses_embedded_gem_object - check if the given bo uses the




Re: [Intel-gfx] [PATCH 3/3] drm/ttm: remove ttm_bo_wait_unreserved

2019-08-21 Thread VMware

On 8/21/19 2:40 PM, Thomas Hellström (VMware) wrote:

On 8/20/19 4:53 PM, Daniel Vetter wrote:

With nouveau fixed, all ttm-using drivers have the correct nesting of
mmap_sem vs dma_resv, and we can just lock the buffer.

Assuming I didn't screw up anything with my audit of course.

Signed-off-by: Daniel Vetter 
Cc: Christian Koenig 
Cc: Huang Rui 
Cc: Gerd Hoffmann 
Cc: "VMware Graphics" 
Cc: Thomas Hellstrom 
---
  drivers/gpu/drm/ttm/ttm_bo.c    | 34 -
  drivers/gpu/drm/ttm/ttm_bo_vm.c | 26 +
  include/drm/ttm/ttm_bo_api.h    |  1 -
  3 files changed, 1 insertion(+), 60 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 20ff56f27aa4..a952dd624b06 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -1954,37 +1954,3 @@ void ttm_bo_swapout_all(struct ttm_bo_device 
*bdev)

  ;
  }
  EXPORT_SYMBOL(ttm_bo_swapout_all);
-
-/**
- * ttm_bo_wait_unreserved - interruptible wait for a buffer object 
to become

- * unreserved
- *
- * @bo: Pointer to buffer
- */
-int ttm_bo_wait_unreserved(struct ttm_buffer_object *bo)
-{
-    int ret;
-
-    /*
- * In the absense of a wait_unlocked API,
- * Use the bo::wu_mutex to avoid triggering livelocks due to
- * concurrent use of this function. Note that this use of
- * bo::wu_mutex can go away if we change locking order to
- * mmap_sem -> bo::reserve.
- */
-    ret = mutex_lock_interruptible(>wu_mutex);
-    if (unlikely(ret != 0))
-    return -ERESTARTSYS;
-    if (!dma_resv_is_locked(bo->base.resv))
-    goto out_unlock;
-    ret = dma_resv_lock_interruptible(bo->base.resv, NULL);
-    if (ret == -EINTR)
-    ret = -ERESTARTSYS;
-    if (unlikely(ret != 0))
-    goto out_unlock;
-    dma_resv_unlock(bo->base.resv);
-
-out_unlock:
-    mutex_unlock(>wu_mutex);
-    return ret;
-}
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c 
b/drivers/gpu/drm/ttm/ttm_bo_vm.c

index 76eedb963693..505e1787aeea 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -125,31 +125,7 @@ static vm_fault_t ttm_bo_vm_fault(struct 
vm_fault *vmf)

  >man[bo->mem.mem_type];
  struct vm_area_struct cvma;
  -    /*
- * Work around locking order reversal in fault / nopfn
- * between mmap_sem and bo_reserve: Perform a trylock operation
- * for reserve, and if it fails, retry the fault after waiting
- * for the buffer to become unreserved.
- */
-    if (unlikely(!dma_resv_trylock(bo->base.resv))) {
-    if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
-    if (!(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
-    ttm_bo_get(bo);
- up_read(>vma->vm_mm->mmap_sem);
-    (void) ttm_bo_wait_unreserved(bo);
-    ttm_bo_put(bo);
-    }
-
-    return VM_FAULT_RETRY;
-    }
-
-    /*
- * If we'd want to change locking order to
- * mmap_sem -> bo::reserve, we'd use a blocking reserve here
- * instead of retrying the fault...
- */


I think you should justify why the above code is removed, since the 
comments actually outline what to do if we change locking order.


The code that's removed above is not for adjusting locking orders but 
to decrease the mm latency by releasing the mmap_sem while waiting for 
bo reserve which in turn might be waiting for GPU. At a minimum we 
should have a separate patch with justification.


Note that the caller here ensures locking progress by adjusting the 
RETRY flags after a retry.


And this last sentence also means we can still remove the wu_mutex when
the locking order is changed, since the chance of a livelock goes away.
IIRC only a single RETRY spin is allowed by the mm core.


/Thomas



/Thomas



-    return VM_FAULT_NOPAGE;
-    }
+    dma_resv_lock(bo->base.resv, NULL);
    /*
   * Refuse to fault imported pages. This should be handled
diff --git a/include/drm/ttm/ttm_bo_api.h b/include/drm/ttm/ttm_bo_api.h
index 43c4929a2171..6b50e624e3e2 100644
--- a/include/drm/ttm/ttm_bo_api.h
+++ b/include/drm/ttm/ttm_bo_api.h
@@ -765,7 +765,6 @@ ssize_t ttm_bo_io(struct ttm_bo_device *bdev, 
struct file *filp,

  int ttm_bo_swapout(struct ttm_bo_global *glob,
  struct ttm_operation_ctx *ctx);
  void ttm_bo_swapout_all(struct ttm_bo_device *bdev);
-int ttm_bo_wait_unreserved(struct ttm_buffer_object *bo);
    /**
   * ttm_bo_uses_embedded_gem_object - check if the given bo uses the






Re: [Intel-gfx] [PATCH 3/3] drm/ttm: remove ttm_bo_wait_unreserved

2019-08-21 Thread VMware

On 8/21/19 4:09 PM, Daniel Vetter wrote:

On Wed, Aug 21, 2019 at 2:47 PM Thomas Hellström (VMware)
 wrote:

On 8/21/19 2:40 PM, Thomas Hellström (VMware) wrote:

On 8/20/19 4:53 PM, Daniel Vetter wrote:

With nouveau fixed, all ttm-using drivers have the correct nesting of
mmap_sem vs dma_resv, and we can just lock the buffer.

Assuming I didn't screw up anything with my audit of course.

Signed-off-by: Daniel Vetter 
Cc: Christian Koenig 
Cc: Huang Rui 
Cc: Gerd Hoffmann 
Cc: "VMware Graphics" 
Cc: Thomas Hellstrom 
---
   drivers/gpu/drm/ttm/ttm_bo.c| 34 -
   drivers/gpu/drm/ttm/ttm_bo_vm.c | 26 +
   include/drm/ttm/ttm_bo_api.h|  1 -
   3 files changed, 1 insertion(+), 60 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 20ff56f27aa4..a952dd624b06 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -1954,37 +1954,3 @@ void ttm_bo_swapout_all(struct ttm_bo_device
*bdev)
   ;
   }
   EXPORT_SYMBOL(ttm_bo_swapout_all);
-
-/**
- * ttm_bo_wait_unreserved - interruptible wait for a buffer object
to become
- * unreserved
- *
- * @bo: Pointer to buffer
- */
-int ttm_bo_wait_unreserved(struct ttm_buffer_object *bo)
-{
-int ret;
-
-/*
- * In the absense of a wait_unlocked API,
- * Use the bo::wu_mutex to avoid triggering livelocks due to
- * concurrent use of this function. Note that this use of
- * bo::wu_mutex can go away if we change locking order to
- * mmap_sem -> bo::reserve.
- */
-ret = mutex_lock_interruptible(>wu_mutex);
-if (unlikely(ret != 0))
-return -ERESTARTSYS;
-if (!dma_resv_is_locked(bo->base.resv))
-goto out_unlock;
-ret = dma_resv_lock_interruptible(bo->base.resv, NULL);
-if (ret == -EINTR)
-ret = -ERESTARTSYS;
-if (unlikely(ret != 0))
-goto out_unlock;
-dma_resv_unlock(bo->base.resv);
-
-out_unlock:
-mutex_unlock(>wu_mutex);
-return ret;
-}
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c
b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 76eedb963693..505e1787aeea 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -125,31 +125,7 @@ static vm_fault_t ttm_bo_vm_fault(struct
vm_fault *vmf)
   >man[bo->mem.mem_type];
   struct vm_area_struct cvma;
   -/*
- * Work around locking order reversal in fault / nopfn
- * between mmap_sem and bo_reserve: Perform a trylock operation
- * for reserve, and if it fails, retry the fault after waiting
- * for the buffer to become unreserved.
- */
-if (unlikely(!dma_resv_trylock(bo->base.resv))) {
-if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
-if (!(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
-ttm_bo_get(bo);
- up_read(>vma->vm_mm->mmap_sem);
-(void) ttm_bo_wait_unreserved(bo);
-ttm_bo_put(bo);
-}
-
-return VM_FAULT_RETRY;
-}
-
-/*
- * If we'd want to change locking order to
- * mmap_sem -> bo::reserve, we'd use a blocking reserve here
- * instead of retrying the fault...
- */

I think you should justify why the above code is removed, since the
comments actually outline what to do if we change locking order.

The code that's removed above is not for adjusting locking orders but
to decrease the mm latency by releasing the mmap_sem while waiting for
bo reserve which in turn might be waiting for GPU. At a minimum we
should have a separate patch with justification.

Note that the caller here ensures locking progress by adjusting the
RETRY flags after a retry.

That would be patches 1&2 in this series.

Hmm? Those seem to touch only dma-buf and nouveau, not ttm? I mean this
patch should look something like the following (based on an older tree) to
implement the new locking order mmap_sem -> reservation,


but to keep the mm latency optimization using the RETRY functionality:


Thanks,
Thomas


diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 85f5bcbe0c76..68482c67b9f7 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -125,30 +125,20 @@ static vm_fault_t ttm_bo_vm_fault(struct vm_fault *vmf)
>man[bo->mem.mem_type];
struct vm_area_struct cvma;
 
-   /*

-* Work around locking order reversal in fault / nopfn
-* between mmap_sem and bo_reserve: Perform a trylock operation
-* for reserve, and if it fails, retry the fault after waiting
-* for the buffer to become unreserved.
-*/
+   /* Avoid blocking on reservation with mmap_sem held, if possible */
if (unlikely(!reservation_object_trylock(bo->base.resv))) {
-   if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
-   if (!(
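
The quoted diff is cut off by the archive. Roughly, the idea is that under
the new mmap_sem -> dma_resv order the trylock-miss path can still drop
mmap_sem, do a blocking (interruptible) reserve, and then ask the mm core
for a retry. A rough sketch of that shape, based on this thread rather than
on any final upstream code:

    /* Keep the mm latency optimization: never sleep on the reservation
     * with mmap_sem held.  bo and vmf as in ttm_bo_vm_fault().
     */
    if (unlikely(!dma_resv_trylock(bo->base.resv))) {
        if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
            if (!(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
                ttm_bo_get(bo);
                up_read(&vmf->vma->vm_mm->mmap_sem);
                if (!dma_resv_lock_interruptible(bo->base.resv, NULL))
                    dma_resv_unlock(bo->base.resv);
                ttm_bo_put(bo);
            }
            return VM_FAULT_RETRY;
        }
        if (dma_resv_lock_interruptible(bo->base.resv, NULL))
            return VM_FAULT_NOPAGE;
    }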

Re: [Intel-gfx] [PATCH 1/3] dma_resv: prime lockdep annotations

2019-08-21 Thread VMware

On 8/20/19 4:53 PM, Daniel Vetter wrote:

Full audit of everyone:

- i915, radeon, amdgpu should be clean per their maintainers.

- vram helpers should be fine, they don't do command submission, so
   really no business holding struct_mutex while doing copy_*_user. But
   I haven't checked them all.

- panfrost seems to dma_resv_lock only in panfrost_job_push, which
   looks clean.

- v3d holds dma_resv locks in the tail of its v3d_submit_cl_ioctl(),
   copying from/to userspace happens all in v3d_lookup_bos which is
   outside of the critical section.

- vmwgfx has a bunch of ioctls that do their own copy_*_user:
   - vmw_execbuf_process: First this does some copies in
 vmw_execbuf_cmdbuf() and also in the vmw_execbuf_process() itself.
 Then comes the usual ttm reserve/validate sequence, then actual
 submission/fencing, then unreserving, and finally some more
 copy_to_user in vmw_execbuf_copy_fence_user. Glossing over tons of
 details, but looks all safe.
   - vmw_fence_event_ioctl: No ttm_reserve/dma_resv_lock anywhere to be
 seen, seems to only create a fence and copy it out.
   - a pile of smaller ioctl in vmwgfx_ioctl.c, no reservations to be
 found there.
   Summary: vmwgfx seems to be fine too.

- virtio: There's virtio_gpu_execbuffer_ioctl, which does all the
   copying from userspace before even looking up objects through their
   handles, so safe. Plus the getparam/getcaps ioctl, also both safe.

- qxl only has qxl_execbuffer_ioctl, which calls into
   qxl_process_single_command. There's a lovely comment before the
   __copy_from_user_inatomic that the slowpath should be copied from
   i915, but I guess that never happened. Try not to be unlucky and get
   your CS data evicted between when it's written and the kernel tries
   to read it. The only other copy_from_user is for relocs, but those
   are done before qxl_release_reserve_list(), which seems to be the
   only thing reserving buffers (in the ttm/dma_resv sense) in that
   code. So looks safe.

- A debugfs file in nouveau_debugfs_pstate_set() and the usif ioctl in
   usif_ioctl() look safe. nouveau_gem_ioctl_pushbuf() otoh breaks this
   everywhere and needs to be fixed up.

Cc: Alex Deucher 
Cc: Christian König 
Cc: Chris Wilson 
Cc: Thomas Zimmermann 
Cc: Rob Herring 
Cc: Tomeu Vizoso 
Cc: Eric Anholt 
Cc: Dave Airlie 
Cc: Gerd Hoffmann 
Cc: Ben Skeggs 
Cc: "VMware Graphics" 
Cc: Thomas Hellstrom 
Signed-off-by: Daniel Vetter 
---
  drivers/dma-buf/dma-resv.c | 12 
  1 file changed, 12 insertions(+)

diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c
index 42a8f3f11681..3edca10d3faf 100644
--- a/drivers/dma-buf/dma-resv.c
+++ b/drivers/dma-buf/dma-resv.c
@@ -34,6 +34,7 @@
  
  #include 

  #include 
+#include 
  
  /**

   * DOC: Reservation Object Overview
@@ -107,6 +108,17 @@ void dma_resv_init(struct dma_resv *obj)
_seqcount_class);
RCU_INIT_POINTER(obj->fence, NULL);
RCU_INIT_POINTER(obj->fence_excl, NULL);
+
+   if (IS_ENABLED(CONFIG_LOCKDEP)) {
+   if (current->mm)
+   down_read(&current->mm->mmap_sem);
+   ww_mutex_lock(&obj->lock, NULL);
+   fs_reclaim_acquire(GFP_KERNEL);
+   fs_reclaim_release(GFP_KERNEL);
+   ww_mutex_unlock(&obj->lock);
+   if (current->mm)
+   up_read(&current->mm->mmap_sem);
+   }
  }
  EXPORT_SYMBOL(dma_resv_init);
  


I assume if this would have been easily done and maintainable using only 
lockdep annotation instead of actually acquiring the locks, that would 
have been done?


Otherwise LGTM.

Reviewed-by: Thomas Hellström 

Will test this and let you know if it trips on vmwgfx, but it really 
shouldn't.


/Thomas


Re: [Intel-gfx] [PATCH 3/3] drm/ttm: remove ttm_bo_wait_unreserved

2019-08-21 Thread VMware

On 8/20/19 4:53 PM, Daniel Vetter wrote:

With nouveau fixed, all ttm-using drivers have the correct nesting of
mmap_sem vs dma_resv, and we can just lock the buffer.

Assuming I didn't screw up anything with my audit of course.

Signed-off-by: Daniel Vetter 
Cc: Christian Koenig 
Cc: Huang Rui 
Cc: Gerd Hoffmann 
Cc: "VMware Graphics" 
Cc: Thomas Hellstrom 
---
  drivers/gpu/drm/ttm/ttm_bo.c| 34 -
  drivers/gpu/drm/ttm/ttm_bo_vm.c | 26 +
  include/drm/ttm/ttm_bo_api.h|  1 -
  3 files changed, 1 insertion(+), 60 deletions(-)


+   dma_resv_lock(bo->base.resv, NULL);


interruptible, or at least killable. (IIRC killable is the right
choice in fault code, even if TTM initially implemented interruptible to
get reasonable Xorg "silken mouse" latency.)


Thanks,
/Thomas



Re: [Intel-gfx] [PATCH 3/3] drm/ttm: remove ttm_bo_wait_unreserved

2019-08-21 Thread VMware

On 8/21/19 4:10 PM, Daniel Vetter wrote:

On Wed, Aug 21, 2019 at 3:16 PM Thomas Hellström (VMware)
 wrote:

On 8/20/19 4:53 PM, Daniel Vetter wrote:

With nouveau fixed, all ttm-using drivers have the correct nesting of
mmap_sem vs dma_resv, and we can just lock the buffer.

Assuming I didn't screw up anything with my audit of course.

Signed-off-by: Daniel Vetter 
Cc: Christian Koenig 
Cc: Huang Rui 
Cc: Gerd Hoffmann 
Cc: "VMware Graphics" 
Cc: Thomas Hellstrom 
---
   drivers/gpu/drm/ttm/ttm_bo.c| 34 -
   drivers/gpu/drm/ttm/ttm_bo_vm.c | 26 +
   include/drm/ttm/ttm_bo_api.h|  1 -
   3 files changed, 1 insertion(+), 60 deletions(-)


+ dma_resv_lock(bo->base.resv, NULL);

interruptible, or at least killable. (IIRC killable is the right
choice in fault code, even if TTM initially implemented interruptible to
get reasonable Xorg "silken mouse" latency.)

I think interruptible is fine. I chickened out of that for v1 because I
always mix up the return code for _lock_interruptible() :-)


:). IIRC the in-kernel users of fault() were unhappy with
interruptible (GUP?), but I guess it's better to do a bulk change at
some point if necessary.


/Thomas


I'll add that for the next version too.
-Daniel
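
For reference, the interruptible variant in the fault path would roughly be
the following sketch; the return-code detail Daniel mentions mixing up is
that dma_resv_lock_interruptible() wraps ww_mutex_lock_interruptible() and
returns -EINTR when a signal is pending:

    /* Sketch only: on a pending signal, back out with VM_FAULT_NOPAGE so
     * the fault is simply retried once the signal has been handled.
     */
    if (dma_resv_lock_interruptible(bo->base.resv, NULL))
        return VM_FAULT_NOPAGE;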




Re: [Intel-gfx] [PATCH 3/3] drm/ttm: remove ttm_bo_wait_unreserved

2019-08-21 Thread VMware

On 8/21/19 5:22 PM, Daniel Vetter wrote:

On Wed, Aug 21, 2019 at 5:19 PM Thomas Hellström (VMware)
 wrote:

On 8/21/19 5:14 PM, Daniel Vetter wrote:

On Wed, Aug 21, 2019 at 5:03 PM Thomas Hellström (VMware)
 wrote:

On 8/21/19 4:47 PM, Daniel Vetter wrote:

On Wed, Aug 21, 2019 at 4:27 PM Thomas Hellström (VMware)
 wrote:

On 8/21/19 4:09 PM, Daniel Vetter wrote:

On Wed, Aug 21, 2019 at 2:47 PM Thomas Hellström (VMware)
 wrote:

On 8/21/19 2:40 PM, Thomas Hellström (VMware) wrote:

On 8/20/19 4:53 PM, Daniel Vetter wrote:

With nouveau fixed, all ttm-using drivers have the correct nesting of
mmap_sem vs dma_resv, and we can just lock the buffer.

Assuming I didn't screw up anything with my audit of course.

Signed-off-by: Daniel Vetter 
Cc: Christian Koenig 
Cc: Huang Rui 
Cc: Gerd Hoffmann 
Cc: "VMware Graphics" 
Cc: Thomas Hellstrom 
---
  drivers/gpu/drm/ttm/ttm_bo.c| 34 -
  drivers/gpu/drm/ttm/ttm_bo_vm.c | 26 +
  include/drm/ttm/ttm_bo_api.h|  1 -
  3 files changed, 1 insertion(+), 60 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 20ff56f27aa4..a952dd624b06 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -1954,37 +1954,3 @@ void ttm_bo_swapout_all(struct ttm_bo_device
*bdev)
  ;
  }
  EXPORT_SYMBOL(ttm_bo_swapout_all);
-
-/**
- * ttm_bo_wait_unreserved - interruptible wait for a buffer object
to become
- * unreserved
- *
- * @bo: Pointer to buffer
- */
-int ttm_bo_wait_unreserved(struct ttm_buffer_object *bo)
-{
-int ret;
-
-/*
- * In the absense of a wait_unlocked API,
- * Use the bo::wu_mutex to avoid triggering livelocks due to
- * concurrent use of this function. Note that this use of
- * bo::wu_mutex can go away if we change locking order to
- * mmap_sem -> bo::reserve.
- */
-ret = mutex_lock_interruptible(>wu_mutex);
-if (unlikely(ret != 0))
-return -ERESTARTSYS;
-if (!dma_resv_is_locked(bo->base.resv))
-goto out_unlock;
-ret = dma_resv_lock_interruptible(bo->base.resv, NULL);
-if (ret == -EINTR)
-ret = -ERESTARTSYS;
-if (unlikely(ret != 0))
-goto out_unlock;
-dma_resv_unlock(bo->base.resv);
-
-out_unlock:
-mutex_unlock(>wu_mutex);
-return ret;
-}
diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c
b/drivers/gpu/drm/ttm/ttm_bo_vm.c
index 76eedb963693..505e1787aeea 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -125,31 +125,7 @@ static vm_fault_t ttm_bo_vm_fault(struct
vm_fault *vmf)
  >man[bo->mem.mem_type];
  struct vm_area_struct cvma;
  -/*
- * Work around locking order reversal in fault / nopfn
- * between mmap_sem and bo_reserve: Perform a trylock operation
- * for reserve, and if it fails, retry the fault after waiting
- * for the buffer to become unreserved.
- */
-if (unlikely(!dma_resv_trylock(bo->base.resv))) {
-if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
-if (!(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
-ttm_bo_get(bo);
- up_read(>vma->vm_mm->mmap_sem);
-(void) ttm_bo_wait_unreserved(bo);
-ttm_bo_put(bo);
-}
-
-return VM_FAULT_RETRY;
-}
-
-/*
- * If we'd want to change locking order to
- * mmap_sem -> bo::reserve, we'd use a blocking reserve here
- * instead of retrying the fault...
- */

I think you should justify why the above code is removed, since the
comments actually outline what to do if we change locking order.

The code that's removed above is not for adjusting locking orders but
to decrease the mm latency by releasing the mmap_sem while waiting for
bo reserve which in turn might be waiting for GPU. At a minimum we
should have a separate patch with justification.

Note that the caller here ensures locking progress by adjusting the
RETRY flags after a retry.

That would be patches 1&2 in this series.


Hmm? Those seem to touch only dma-buf and nouveau, not ttm? I mean this
patch should look something like the following (based on an older tree) to
implement the new locking order mmap_sem -> reservation,

Only nouveau was breaking this, doing copy_*_user or get_user_pages
while holding dma_resv locks; no one else. So nothing else needed to
be changed. But patch 1 contains the full audit. I might have missed
something.


but to keep the mm latency optimization using the RETRY functionality:

Still no idea why this is needed? All the comments here and the code
and history seem like they've been about the mmap_sem vs dma_resv
inversion between driver ioctls and fault handling here. Once that's
officially fixed there's no reason to play games with retry loops here
- previously that was necessary because the old ttm_bo_vm_fault