Re: [Intel-gfx] [PATCH] drm/i915/gt: Trace placement of timeline HWSP

2020-07-15 Thread Mika Kuoppala
Chris Wilson  writes:

> Track the position of the HWSP for each timeline.
>
> References: https://gitlab.freedesktop.org/drm/intel/-/issues/2169
> Signed-off-by: Chris Wilson 
> Cc: Mika Kuoppala 
> ---
>  drivers/gpu/drm/i915/gt/intel_timeline.c|  7 +++
>  drivers/gpu/drm/i915/gt/selftest_timeline.c | 13 -
>  2 files changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_timeline.c 
> b/drivers/gpu/drm/i915/gt/intel_timeline.c
> index 4546284fede1..46d20f5f3ddc 100644
> --- a/drivers/gpu/drm/i915/gt/intel_timeline.c
> +++ b/drivers/gpu/drm/i915/gt/intel_timeline.c
> @@ -73,6 +73,8 @@ hwsp_alloc(struct intel_timeline *timeline, unsigned int 
> *cacheline)
>   return vma;
>   }
>  
> + GT_TRACE(timeline->gt, "new HWSP allocated\n");
> +
>   vma->private = hwsp;
>   hwsp->gt = timeline->gt;
>   hwsp->vma = vma;
> @@ -327,6 +329,8 @@ int intel_timeline_pin(struct intel_timeline *tl)
>   tl->hwsp_offset =
>   i915_ggtt_offset(tl->hwsp_ggtt) +
>   offset_in_page(tl->hwsp_offset);
> + GT_TRACE(tl->gt, "timeline:%llx using HWSP offset:%x\n",
> +  tl->fence_context, tl->hwsp_offset);

Regardless of the coffee starting to sink in,
I am somewhat uneasy with hwsp_offset being
used both as an offset within the page and then as an offset in the ggtt.

Further, my paranoia is heightened because, if we end up
on the losing side of the race to pin, hwsp_offset lingers
at a stale(ish) offset.
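
As a toy illustration of that dual meaning (plain userspace C rather
than driver code, with made-up constants):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE	4096u
#define CACHELINE_BYTES	64u

static uint32_t offset_in_page(uint32_t v) { return v & (PAGE_SIZE - 1); }

int main(void)
{
	const uint32_t ggtt_base = 0x00100000;	/* made-up pin address */
	uint32_t hwsp_offset = 3 * CACHELINE_BYTES; /* unpinned: offset within the HWSP page */

	printf("page-relative: 0x%x\n", hwsp_offset);

	/* after pinning: the same field now holds an absolute GGTT address */
	hwsp_offset = ggtt_base + offset_in_page(hwsp_offset);
	printf("ggtt-absolute: 0x%x\n", hwsp_offset);

	/* nothing in the value itself tells a reader which meaning applies */
	return 0;
}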

But for tracing,
Reviewed-by: Mika Kuoppala 
-Mika

>  
>   cacheline_acquire(tl->hwsp_cacheline);
>   if (atomic_fetch_inc(&tl->pin_count)) {
> @@ -434,6 +438,7 @@ __intel_timeline_get_seqno(struct intel_timeline *tl,
>   int err;
>  
>   might_lock(&tl->gt->ggtt->vm.mutex);
> + GT_TRACE(tl->gt, "timeline:%llx wrapped\n", tl->fence_context);
>  
>   /*
>* If there is an outstanding GPU reference to this cacheline,
> @@ -497,6 +502,8 @@ __intel_timeline_get_seqno(struct intel_timeline *tl,
>   memset(vaddr + tl->hwsp_offset, 0, CACHELINE_BYTES);
>  
>   tl->hwsp_offset += i915_ggtt_offset(vma);
> + GT_TRACE(tl->gt, "timeline:%llx using HWSP offset:%x\n",
> +  tl->fence_context, tl->hwsp_offset);
>  
>   cacheline_acquire(cl);
>   tl->hwsp_cacheline = cl;
> diff --git a/drivers/gpu/drm/i915/gt/selftest_timeline.c 
> b/drivers/gpu/drm/i915/gt/selftest_timeline.c
> index fcdee951579b..fb5b7d3498a6 100644
> --- a/drivers/gpu/drm/i915/gt/selftest_timeline.c
> +++ b/drivers/gpu/drm/i915/gt/selftest_timeline.c
> @@ -562,8 +562,9 @@ static int live_hwsp_engine(void *arg)
>   struct intel_timeline *tl = timelines[n];
>  
>   if (!err && *tl->hwsp_seqno != n) {
> - pr_err("Invalid seqno stored in timeline %lu, found 
> 0x%x\n",
> -n, *tl->hwsp_seqno);
> + pr_err("Invalid seqno stored in timeline %lu @ %x, 
> found 0x%x\n",
> +n, tl->hwsp_offset, *tl->hwsp_seqno);
> + GEM_TRACE_DUMP();
>   err = -EINVAL;
>   }
>   intel_timeline_put(tl);
> @@ -633,8 +634,9 @@ static int live_hwsp_alternate(void *arg)
>   struct intel_timeline *tl = timelines[n];
>  
>   if (!err && *tl->hwsp_seqno != n) {
> - pr_err("Invalid seqno stored in timeline %lu, found 
> 0x%x\n",
> -n, *tl->hwsp_seqno);
> + pr_err("Invalid seqno stored in timeline %lu @ %x, 
> found 0x%x\n",
> +n, tl->hwsp_offset, *tl->hwsp_seqno);
> + GEM_TRACE_DUMP();
>   err = -EINVAL;
>   }
>   intel_timeline_put(tl);
> @@ -965,8 +967,9 @@ static int live_hwsp_recycle(void *arg)
>   }
>  
>   if (*tl->hwsp_seqno != count) {
> - pr_err("Invalid seqno stored in timeline %lu, 
> found 0x%x\n",
> + pr_err("Invalid seqno stored in timeline %lu @ 
> tl->hwsp_offset, found 0x%x\n",
>  count, *tl->hwsp_seqno);
> + GEM_TRACE_DUMP();
>   err = -EINVAL;
>   }
>  
> -- 
> 2.20.1
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH 1/3] dma-buf/sw_sync: Avoid recursive lock during fence signal.

2020-07-15 Thread Christian König

On 14.07.20 22:06, Chris Wilson wrote:

From: Bas Nieuwenhuizen 

Calltree:
   timeline_fence_release
   drm_sched_entity_wakeup
   dma_fence_signal_locked
   sync_timeline_signal
   sw_sync_ioctl

Releasing the reference to the fence in the fence signal callback
seems reasonable to me, so this patch avoids the locking issue in
sw_sync.

d3862e44daa7 ("dma-buf/sw-sync: Fix locking around sync_timeline lists")
fixed the recursive locking issue but caused a use-after-free. Later
d3c6dd1fb30d ("dma-buf/sw_sync: Synchronize signal vs syncpt free")
fixed the use-after-free but reintroduced the recursive locking issue.

This attempt still avoids the use-after-free, because the release
function still always takes the lock, and outside of the locked region
in the signal function we hold properly refcounted references.

We furthermore avoid the recursive lock by making sure that either:

1) We have a properly refcounted reference, preventing the signal from
triggering the release function inside the locked region.
2) The refcount was already zero, and hence nobody will be able to trigger
the release function from the signal function.

v2: Move dma_fence_signal() into the second loop in preparation for
moving the callback out of the timeline obj->lock.

Fixes: d3c6dd1fb30d ("dma-buf/sw_sync: Synchronize signal vs syncpt free")
Cc: Sumit Semwal 
Cc: Chris Wilson 
Cc: Gustavo Padovan 
Cc: Christian König 
Cc: 
Signed-off-by: Bas Nieuwenhuizen 
Signed-off-by: Chris Wilson 


Looks reasonable to me, but I'm not an expert on this container.

So patch is Acked-by: Christian König 

Regards,
Christian.


---
  drivers/dma-buf/sw_sync.c | 32 ++--
  1 file changed, 22 insertions(+), 10 deletions(-)

diff --git a/drivers/dma-buf/sw_sync.c b/drivers/dma-buf/sw_sync.c
index 348b3a9170fa..807c82148062 100644
--- a/drivers/dma-buf/sw_sync.c
+++ b/drivers/dma-buf/sw_sync.c
@@ -192,6 +192,7 @@ static const struct dma_fence_ops timeline_fence_ops = {
  static void sync_timeline_signal(struct sync_timeline *obj, unsigned int inc)
  {
struct sync_pt *pt, *next;
+   LIST_HEAD(signal);
  
  	trace_sync_timeline(obj);
  
@@ -203,21 +204,32 @@ static void sync_timeline_signal(struct sync_timeline *obj, unsigned int inc)

if (!timeline_fence_signaled(&pt->base))
break;
  
-		list_del_init(&pt->link);

-   rb_erase(&pt->node, &obj->pt_tree);
-
/*
-* A signal callback may release the last reference to this
-* fence, causing it to be freed. That operation has to be
-* last to avoid a use after free inside this loop, and must
-* be after we remove the fence from the timeline in order to
-* prevent deadlocking on timeline->lock inside
-* timeline_fence_release().
+* We need to take a reference to avoid a release during
+* signalling (which can cause a recursive lock of obj->lock).
+* If refcount was already zero, another thread is already
+* taking care of destroying the fence.
 */
-   dma_fence_signal_locked(&pt->base);
+   if (!dma_fence_get_rcu(&pt->base))
+   continue;
+
+   list_move_tail(&pt->link, &signal);
+   rb_erase(&pt->node, &obj->pt_tree);
}
  
  	spin_unlock_irq(&obj->lock);

+
+   list_for_each_entry_safe(pt, next, &signal, link) {
+   /*
+* This needs to be cleared before release, otherwise the
+* timeline_fence_release function gets confused about also
+* removing the fence from the pt_tree.
+*/
+   list_del_init(&pt->link);
+
+   dma_fence_signal(&pt->base);
+   dma_fence_put(&pt->base);
+   }
  }
  
  /**


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH -next] drm/i915: Remove unused inline function drain_delayed_work()

2020-07-15 Thread Chris Wilson
Quoting YueHaibing (2020-07-15 04:21:04)
> It is not used since commit 058179e72e09 ("drm/i915/gt: Replace
> hangcheck by heartbeats")
> 
> Signed-off-by: YueHaibing 

Indeed, it is no more.
Reviewed-by: Chris Wilson 
-Chris
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH v2 2/3] dma-buf/sw_sync: Separate signal/timeline locks

2020-07-15 Thread Bas Nieuwenhuizen
Still Reviewed-by: Bas Nieuwenhuizen 

On Tue, Jul 14, 2020 at 11:24 PM Chris Wilson  wrote:
>
> Since we decouple the sync_pt from the timeline tree upon release, in
> order to allow releasing the sync_pt from a signal callback we need to
> separate the sync_pt signaling lock from the timeline tree lock.
>
> v2: Mark up the unlocked read of the current timeline value.
> v3: Store a timeline pointer in the sync_pt as we cannot use the common
> fence->lock trick to find our parent anymore.
>
> Suggested-by: Bas Nieuwenhuizen 
> Signed-off-by: Chris Wilson 
> Cc: Bas Nieuwenhuizen 
> ---
>  drivers/dma-buf/sw_sync.c| 40 +---
>  drivers/dma-buf/sync_debug.c |  2 +-
>  drivers/dma-buf/sync_debug.h | 13 +++-
>  3 files changed, 32 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/dma-buf/sw_sync.c b/drivers/dma-buf/sw_sync.c
> index 807c82148062..17a5c1a3b7ce 100644
> --- a/drivers/dma-buf/sw_sync.c
> +++ b/drivers/dma-buf/sw_sync.c
> @@ -123,33 +123,39 @@ static const char 
> *timeline_fence_get_driver_name(struct dma_fence *fence)
>
>  static const char *timeline_fence_get_timeline_name(struct dma_fence *fence)
>  {
> -   struct sync_timeline *parent = dma_fence_parent(fence);
> -
> -   return parent->name;
> +   return sync_timeline(fence)->name;
>  }
>
>  static void timeline_fence_release(struct dma_fence *fence)
>  {
> struct sync_pt *pt = dma_fence_to_sync_pt(fence);
> -   struct sync_timeline *parent = dma_fence_parent(fence);
> -   unsigned long flags;
> +   struct sync_timeline *parent = pt->timeline;
>
> -   spin_lock_irqsave(fence->lock, flags);
> if (!list_empty(&pt->link)) {
> -   list_del(&pt->link);
> -   rb_erase(&pt->node, &parent->pt_tree);
> +   unsigned long flags;
> +
> +   spin_lock_irqsave(&parent->lock, flags);
> +   if (!list_empty(&pt->link)) {
> +   list_del(&pt->link);
> +   rb_erase(&pt->node, &parent->pt_tree);
> +   }
> +   spin_unlock_irqrestore(&parent->lock, flags);
> }
> -   spin_unlock_irqrestore(fence->lock, flags);
>
> sync_timeline_put(parent);
> dma_fence_free(fence);
>  }
>
> -static bool timeline_fence_signaled(struct dma_fence *fence)
> +static int timeline_value(struct dma_fence *fence)
>  {
> -   struct sync_timeline *parent = dma_fence_parent(fence);
> +   return READ_ONCE(sync_timeline(fence)->value);
> +}
>
> -   return !__dma_fence_is_later(fence->seqno, parent->value, fence->ops);
> +static bool timeline_fence_signaled(struct dma_fence *fence)
> +{
> +   return !__dma_fence_is_later(fence->seqno,
> +timeline_value(fence),
> +fence->ops);
>  }
>
>  static bool timeline_fence_enable_signaling(struct dma_fence *fence)
> @@ -166,9 +172,7 @@ static void timeline_fence_value_str(struct dma_fence 
> *fence,
>  static void timeline_fence_timeline_value_str(struct dma_fence *fence,
>  char *str, int size)
>  {
> -   struct sync_timeline *parent = dma_fence_parent(fence);
> -
> -   snprintf(str, size, "%d", parent->value);
> +   snprintf(str, size, "%d", timeline_value(fence));
>  }
>
>  static const struct dma_fence_ops timeline_fence_ops = {
> @@ -252,12 +256,14 @@ static struct sync_pt *sync_pt_create(struct 
> sync_timeline *obj,
> return NULL;
>
> sync_timeline_get(obj);
> -   dma_fence_init(&pt->base, &timeline_fence_ops, &obj->lock,
> +   spin_lock_init(&pt->lock);
> +   dma_fence_init(&pt->base, &timeline_fence_ops, &pt->lock,
>obj->context, value);
> INIT_LIST_HEAD(&pt->link);
> +   pt->timeline = obj;
>
> spin_lock_irq(&obj->lock);
> -   if (!dma_fence_is_signaled_locked(&pt->base)) {
> +   if (!dma_fence_is_signaled(&pt->base)) {
> struct rb_node **p = &obj->pt_tree.rb_node;
> struct rb_node *parent = NULL;
>
> diff --git a/drivers/dma-buf/sync_debug.c b/drivers/dma-buf/sync_debug.c
> index 101394f16930..2188ee17e889 100644
> --- a/drivers/dma-buf/sync_debug.c
> +++ b/drivers/dma-buf/sync_debug.c
> @@ -65,7 +65,7 @@ static const char *sync_status_str(int status)
>  static void sync_print_fence(struct seq_file *s,
>  struct dma_fence *fence, bool show)
>  {
> -   struct sync_timeline *parent = dma_fence_parent(fence);
> +   struct sync_timeline *parent = sync_timeline(fence);
> int status;
>
> status = dma_fence_get_status_locked(fence);
> diff --git a/drivers/dma-buf/sync_debug.h b/drivers/dma-buf/sync_debug.h
> index 6176e52ba2d7..56589dae2159 100644
> --- a/drivers/dma-buf/sync_debug.h
> +++ b/drivers/dma-buf/sync_debug.h
> @@ -45,23 +45,26 @@ struct sync_timeline {
> struct list_headsync_ti

[Intel-gfx] sw_sync deadlock avoidance, take 3

2020-07-15 Thread Chris Wilson
dma_fence_release() objects to a fence being freed before it is
signaled, so instead of playing fancy tricks to avoid handling dying
requests, let's keep the syncpt alive until signaled. This neatly
removes the issue with having to decouple the syncpt from the timeline
upon fence release.
-Chris


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


[Intel-gfx] [PATCH 1/2] dma-buf/sw_sync: Avoid recursive lock during fence signal

2020-07-15 Thread Chris Wilson
If a signal callback releases the sw_sync fence, that will trigger a
deadlock as the timeline_fence_release recurses onto the fence->lock
(used both for signaling and the timeline tree).

If we always hold a reference for an unsignaled fence held by the
timeline, we no longer need to detach the fence from the timeline upon
release. This is only possible since commit ea4d5a270b57
("dma-buf/sw_sync: force signal all unsignaled fences on dying timeline")
where we introduced decoupling of the fences from the timeline upon release.

Reported-by: Bas Nieuwenhuizen 
Fixes: d3c6dd1fb30d ("dma-buf/sw_sync: Synchronize signal vs syncpt free")
Signed-off-by: Chris Wilson 
Cc: Sumit Semwal 
Cc: Chris Wilson 
Cc: Gustavo Padovan 
Cc: Christian König 
Cc: 
---
 drivers/dma-buf/sw_sync.c | 32 +++-
 1 file changed, 7 insertions(+), 25 deletions(-)

diff --git a/drivers/dma-buf/sw_sync.c b/drivers/dma-buf/sw_sync.c
index 348b3a9170fa..4cc2ac03a84a 100644
--- a/drivers/dma-buf/sw_sync.c
+++ b/drivers/dma-buf/sw_sync.c
@@ -130,16 +130,7 @@ static const char *timeline_fence_get_timeline_name(struct 
dma_fence *fence)
 
 static void timeline_fence_release(struct dma_fence *fence)
 {
-   struct sync_pt *pt = dma_fence_to_sync_pt(fence);
struct sync_timeline *parent = dma_fence_parent(fence);
-   unsigned long flags;
-
-   spin_lock_irqsave(fence->lock, flags);
-   if (!list_empty(&pt->link)) {
-   list_del(&pt->link);
-   rb_erase(&pt->node, &parent->pt_tree);
-   }
-   spin_unlock_irqrestore(fence->lock, flags);
 
sync_timeline_put(parent);
dma_fence_free(fence);
@@ -203,18 +194,11 @@ static void sync_timeline_signal(struct sync_timeline 
*obj, unsigned int inc)
if (!timeline_fence_signaled(&pt->base))
break;
 
-   list_del_init(&pt->link);
+   list_del(&pt->link);
rb_erase(&pt->node, &obj->pt_tree);
 
-   /*
-* A signal callback may release the last reference to this
-* fence, causing it to be freed. That operation has to be
-* last to avoid a use after free inside this loop, and must
-* be after we remove the fence from the timeline in order to
-* prevent deadlocking on timeline->lock inside
-* timeline_fence_release().
-*/
dma_fence_signal_locked(&pt->base);
+   dma_fence_put(&pt->base);
}
 
spin_unlock_irq(&obj->lock);
@@ -261,13 +245,9 @@ static struct sync_pt *sync_pt_create(struct sync_timeline 
*obj,
} else if (cmp < 0) {
p = &parent->rb_left;
} else {
-   if (dma_fence_get_rcu(&other->base)) {
-   sync_timeline_put(obj);
-   kfree(pt);
-   pt = other;
-   goto unlock;
-   }
-   p = &parent->rb_left;
+   dma_fence_put(&pt->base);
+   pt = other;
+   goto unlock;
}
}
rb_link_node(&pt->node, parent, p);
@@ -278,6 +258,7 @@ static struct sync_pt *sync_pt_create(struct sync_timeline 
*obj,
  parent ? &rb_entry(parent, typeof(*pt), 
node)->link : &obj->pt_list);
}
 unlock:
+   dma_fence_get(&pt->base); /* keep a ref for the timeline */
spin_unlock_irq(&obj->lock);
 
return pt;
@@ -316,6 +297,7 @@ static int sw_sync_debugfs_release(struct inode *inode, 
struct file *file)
list_for_each_entry_safe(pt, next, &obj->pt_list, link) {
dma_fence_set_error(&pt->base, -ENOENT);
dma_fence_signal_locked(&pt->base);
+   dma_fence_put(&pt->base);
}
 
spin_unlock_irq(&obj->lock);
-- 
2.20.1

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


[Intel-gfx] [PATCH 2/2] dma-buf/selftests: Add locking selftests for sw_sync

2020-07-15 Thread Chris Wilson
Although sw_sync is purely a debug facility for userspace to create fences
and timelines it can control, it nevertheless has some tricky locking
semantics of its own. In particular, Bas Nieuwenhuizen reported that we
had reintroduced a deadlock if a signal callback attempted to destroy
the fence. So let's add a few trivial selftests to make sure that once
fixed again, it stays fixed.

Signed-off-by: Chris Wilson 
Cc: Bas Nieuwenhuizen 
Reviewed-by: Bas Nieuwenhuizen 
---
 drivers/dma-buf/Makefile |   3 +-
 drivers/dma-buf/selftests.h  |   1 +
 drivers/dma-buf/st-sw_sync.c | 279 +++
 drivers/dma-buf/sw_sync.c|  39 +
 drivers/dma-buf/sync_debug.h |   8 +
 5 files changed, 329 insertions(+), 1 deletion(-)
 create mode 100644 drivers/dma-buf/st-sw_sync.c

diff --git a/drivers/dma-buf/Makefile b/drivers/dma-buf/Makefile
index 995e05f609ff..9be4d4611609 100644
--- a/drivers/dma-buf/Makefile
+++ b/drivers/dma-buf/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_UDMABUF) += udmabuf.o
 dmabuf_selftests-y := \
selftest.o \
st-dma-fence.o \
-   st-dma-fence-chain.o
+   st-dma-fence-chain.o \
+   st-sw_sync.o
 
 obj-$(CONFIG_DMABUF_SELFTESTS) += dmabuf_selftests.o
diff --git a/drivers/dma-buf/selftests.h b/drivers/dma-buf/selftests.h
index bc8cea67bf1e..232499a24872 100644
--- a/drivers/dma-buf/selftests.h
+++ b/drivers/dma-buf/selftests.h
@@ -12,3 +12,4 @@
 selftest(sanitycheck, __sanitycheck__) /* keep first (igt selfcheck) */
 selftest(dma_fence, dma_fence)
 selftest(dma_fence_chain, dma_fence_chain)
+selftest(sw_sync, sw_sync)
diff --git a/drivers/dma-buf/st-sw_sync.c b/drivers/dma-buf/st-sw_sync.c
new file mode 100644
index ..145fd330f1c6
--- /dev/null
+++ b/drivers/dma-buf/st-sw_sync.c
@@ -0,0 +1,279 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2020 Intel Corporation
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "sync_debug.h"
+#include "selftest.h"
+
+static int sanitycheck(void *arg)
+{
+   struct sync_timeline *tl;
+   struct dma_fence *f;
+   int err = -ENOMEM;
+
+   /* Quick check we can create the timeline and syncpt */
+
+   tl = st_sync_timeline_create("mock");
+   if (!tl)
+   return -ENOMEM;
+
+   f = st_sync_pt_create(tl, 1);
+   if (!f)
+   goto out;
+
+   dma_fence_signal(f);
+   dma_fence_put(f);
+
+   err = 0;
+out:
+   st_sync_timeline_put(tl);
+   return err;
+}
+
+static int signal(void *arg)
+{
+   struct sync_timeline *tl;
+   struct dma_fence *f;
+   int err = -EINVAL;
+
+   /* Check that the syncpt fence is signaled when the timeline advances */
+
+   tl = st_sync_timeline_create("mock");
+   if (!tl)
+   return -ENOMEM;
+
+   f = st_sync_pt_create(tl, 1);
+   if (!f) {
+   err = -ENOMEM;
+   goto out;
+   }
+
+   if (dma_fence_is_signaled(f)) {
+   pr_err("syncpt:%lld signaled too early\n", f->seqno);
+   goto out_fence;
+   }
+
+   st_sync_timeline_signal(tl, 1);
+
+   if (!dma_fence_is_signaled(f)) {
+   pr_err("syncpt:%lld not signaled after increment\n", f->seqno);
+   goto out_fence;
+   }
+
+   err = 0;
+out_fence:
+   dma_fence_signal(f);
+   dma_fence_put(f);
+out:
+   st_sync_timeline_put(tl);
+   return err;
+}
+
+struct cb_destroy {
+   struct dma_fence_cb cb;
+   struct dma_fence *f;
+};
+
+static void cb_destroy(struct dma_fence *fence, struct dma_fence_cb *_cb)
+{
+   struct cb_destroy *cb = container_of(_cb, typeof(*cb), cb);
+
+   pr_info("syncpt:%llx destroying syncpt:%llx\n",
+   fence->seqno, cb->f->seqno);
+   dma_fence_put(cb->f);
+   cb->f = NULL;
+}
+
+static int cb_autodestroy(void *arg)
+{
+   struct sync_timeline *tl;
+   struct cb_destroy cb;
+   int err = -EINVAL;
+
+   /* Check that we can drop the final syncpt reference from a callback */
+
+   tl = st_sync_timeline_create("mock");
+   if (!tl)
+   return -ENOMEM;
+
+   cb.f = st_sync_pt_create(tl, 1);
+   if (!cb.f) {
+   err = -ENOMEM;
+   goto out;
+   }
+
+   if (dma_fence_add_callback(cb.f, &cb.cb, cb_destroy)) {
+   pr_err("syncpt:%lld signaled before increment\n", cb.f->seqno);
+   goto out;
+   }
+
+   st_sync_timeline_signal(tl, 1);
+   if (cb.f) {
+   pr_err("syncpt:%lld callback not run\n", cb.f->seqno);
+   dma_fence_put(cb.f);
+   goto out;
+   }
+
+   err = 0;
+out:
+   st_sync_timeline_put(tl);
+   return err;
+}
+
+static int cb_destroy_12(void *arg)
+{
+   struct sync_timeline *tl;
+   struct cb_destroy cb;
+   struct dma_fence *f;
+   int err = -EINVAL;
+
+   /* Check that we can drop some other syncpt refe

[Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for series starting with [1/2] dma-buf/sw_sync: Avoid recursive lock during fence signal

2020-07-15 Thread Patchwork
== Series Details ==

Series: series starting with [1/2] dma-buf/sw_sync: Avoid recursive lock during 
fence signal
URL   : https://patchwork.freedesktop.org/series/79510/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
f09f86114c26 dma-buf/sw_sync: Avoid recursive lock during fence signal
-:17: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description 
(prefer a maximum 75 chars per line)
#17: 
where we introduced decoupling of the fences from the timeline upon release.

total: 0 errors, 1 warnings, 0 checks, 66 lines checked
7ff4c9562004 dma-buf/selftests: Add locking selftests for sw_sync
-:40: WARNING:FILE_PATH_CHANGES: added, moved or deleted file(s), does 
MAINTAINERS need updating?
#40: 
new file mode 100644

total: 0 errors, 1 warnings, 0 checks, 345 lines checked


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] sw_sync deadlock avoidance, take 3

2020-07-15 Thread Bas Nieuwenhuizen
Hi Chris,

My concern with going in this direction was that we potentially allow
an application to allocate a lot of kernel memory but not a lot of fds
by creating lots of fences and then closing the fds but never
signaling them. Is that not an issue?
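
Roughly, the pattern in question would look like this from userspace
(a rough sketch only; the ioctl definitions below are copied by hand to
mirror sw_sync.c since there is no uapi header, and the loop bound is
arbitrary):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/types.h>

struct sw_sync_create_fence_data {
	__u32 value;
	char name[32];
	__s32 fence;
};
#define SW_SYNC_IOC_MAGIC		'W'
#define SW_SYNC_IOC_CREATE_FENCE	_IOWR(SW_SYNC_IOC_MAGIC, 0, \
					      struct sw_sync_create_fence_data)

int main(void)
{
	int timeline = open("/sys/kernel/debug/sync/sw_sync", O_RDWR);

	if (timeline < 0) {
		perror("open sw_sync");
		return 1;
	}

	for (unsigned int i = 0; i < 1u << 20; i++) {
		struct sw_sync_create_fence_data data = { .value = i + 1 };

		if (ioctl(timeline, SW_SYNC_IOC_CREATE_FENCE, &data))
			break;
		close(data.fence);	/* fd released, syncpt never signaled */
	}

	pause();	/* keep the timeline alive, never SW_SYNC_IOC_INC */
	return 0;
}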

- Bas

On Wed, Jul 15, 2020 at 12:04 PM Chris Wilson  wrote:
>
> dma_fence_release() objects to a fence being freed before it is
> signaled, so instead of playing fancy tricks to avoid handling dying
> requests, let's keep the syncpt alive until signaled. This neatly
> removes the issue with having to decouple the syncpt from the timeline
> upon fence release.
> -Chris
>
>
> ___
> dri-devel mailing list
> dri-de...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] sw_sync deadlock avoidance, take 3

2020-07-15 Thread Daniel Stone
Hi,

On Wed, 15 Jul 2020 at 11:23, Bas Nieuwenhuizen  
wrote:
> My concern with going in this direction was that we potentially allow
> an application to allocate a lot of kernel memory but not a lot of fds
> by creating lots of fences and then closing the fds but never
> signaling them. Is that not an issue?

sw_sync is a userspace DoS mechanism by design - if someone wants to
enable and use it, they have bigger problems than unbounded memory
allocations.

Cheers,
Daniel
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


[Intel-gfx] ✓ Fi.CI.BAT: success for series starting with [1/2] dma-buf/sw_sync: Avoid recursive lock during fence signal

2020-07-15 Thread Patchwork
== Series Details ==

Series: series starting with [1/2] dma-buf/sw_sync: Avoid recursive lock during 
fence signal
URL   : https://patchwork.freedesktop.org/series/79510/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_8748 -> Patchwork_18173


Summary
---

  **SUCCESS**

  No regressions found.

  External URL: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18173/index.html

New tests
-

  New tests have been introduced between CI_DRM_8748 and Patchwork_18173:

### New IGT tests (1) ###

  * igt@dmabuf@all@sw_sync:
- Statuses : 38 pass(s)
- Exec time: [0.02, 0.05] s

  

Known issues


  Here are the changes found in Patchwork_18173 that come from known issues:

### IGT changes ###

 Issues hit 

  * igt@gem_render_linear_blits@basic:
- fi-tgl-y:   [PASS][1] -> [DMESG-WARN][2] ([i915#402])
   [1]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-tgl-y/igt@gem_render_linear_bl...@basic.html
   [2]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18173/fi-tgl-y/igt@gem_render_linear_bl...@basic.html

  * igt@i915_pm_rpm@basic-pci-d3-state:
- fi-byt-j1900:   [PASS][3] -> [DMESG-WARN][4] ([i915#1982])
   [3]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-byt-j1900/igt@i915_pm_...@basic-pci-d3-state.html
   [4]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18173/fi-byt-j1900/igt@i915_pm_...@basic-pci-d3-state.html

  * igt@kms_flip@basic-flip-vs-dpms@a-dp1:
- fi-apl-guc: [PASS][5] -> [INCOMPLETE][6] ([i915#1635])
   [5]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-apl-guc/igt@kms_flip@basic-flip-vs-d...@a-dp1.html
   [6]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18173/fi-apl-guc/igt@kms_flip@basic-flip-vs-d...@a-dp1.html

  * igt@kms_pipe_crc_basic@read-crc-pipe-a-frame-sequence:
- fi-tgl-u2:  [PASS][7] -> [DMESG-WARN][8] ([i915#402]) +1 similar 
issue
   [7]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-tgl-u2/igt@kms_pipe_crc_ba...@read-crc-pipe-a-frame-sequence.html
   [8]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18173/fi-tgl-u2/igt@kms_pipe_crc_ba...@read-crc-pipe-a-frame-sequence.html

  
 Possible fixes 

  * igt@i915_pm_rpm@basic-pci-d3-state:
- fi-bsw-kefka:   [DMESG-WARN][9] ([i915#1982]) -> [PASS][10]
   [9]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-bsw-kefka/igt@i915_pm_...@basic-pci-d3-state.html
   [10]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18173/fi-bsw-kefka/igt@i915_pm_...@basic-pci-d3-state.html

  * igt@kms_busy@basic@flip:
- fi-tgl-y:   [DMESG-WARN][11] ([i915#1982]) -> [PASS][12]
   [11]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-tgl-y/igt@kms_busy@ba...@flip.html
   [12]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18173/fi-tgl-y/igt@kms_busy@ba...@flip.html

  * igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic:
- {fi-kbl-7560u}: [DMESG-WARN][13] ([i915#1982]) -> [PASS][14]
   [13]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-kbl-7560u/igt@kms_cursor_leg...@basic-busy-flip-before-cursor-atomic.html
   [14]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18173/fi-kbl-7560u/igt@kms_cursor_leg...@basic-busy-flip-before-cursor-atomic.html

  * igt@kms_cursor_legacy@basic-flip-after-cursor-legacy:
- fi-icl-u2:  [DMESG-WARN][15] ([i915#1982]) -> [PASS][16]
   [15]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-icl-u2/igt@kms_cursor_leg...@basic-flip-after-cursor-legacy.html
   [16]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18173/fi-icl-u2/igt@kms_cursor_leg...@basic-flip-after-cursor-legacy.html

  * igt@vgem_basic@setversion:
- fi-tgl-y:   [DMESG-WARN][17] ([i915#402]) -> [PASS][18] +1 
similar issue
   [17]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-tgl-y/igt@vgem_ba...@setversion.html
   [18]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18173/fi-tgl-y/igt@vgem_ba...@setversion.html

  
 Warnings 

  * igt@debugfs_test@read_all_entries:
- fi-kbl-x1275:   [DMESG-WARN][19] ([i915#62] / [i915#92] / [i915#95]) 
-> [DMESG-WARN][20] ([i915#62] / [i915#92]) +1 similar issue
   [19]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-kbl-x1275/igt@debugfs_test@read_all_entries.html
   [20]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18173/fi-kbl-x1275/igt@debugfs_test@read_all_entries.html

  * igt@gem_exec_suspend@basic-s0:
- fi-kbl-x1275:   [DMESG-WARN][21] ([i915#62] / [i915#92]) -> 
[DMESG-WARN][22] ([i915#1982] / [i915#62] / [i915#92])
   [21]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-kbl-x1275/igt@gem_exec_susp...@basic-s0.html
   [22]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18173/fi-kbl-x1275/igt@gem_exec_susp...@basic-s0.html

  * igt@kms_force_connector_basic@force-connector-state:
- fi-kbl-x1275:   [DM

Re: [Intel-gfx] sw_sync deadlock avoidance, take 3

2020-07-15 Thread Chris Wilson
Quoting Bas Nieuwenhuizen (2020-07-15 11:23:35)
> Hi Chris,
> 
> My concern with going in this direction was that we potentially allow
> an application to allocate a lot of kernel memory but not a lot of fds
> by creating lots of fences and then closing the fds but never
> signaling them. Is that not an issue?

I did look to see if there was a quick way we could couple into the
sync_file release itself to remove the syncpt from the timeline, but
decided that for a debug feature, it wasn't a pressing concern.

Maybe now is the time to ask: are you using sw_sync outside of
validation?
-Chris
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


[Intel-gfx] [PATCH 1/2] dma-buf/dma-fence: Trim dma_fence_add_callback()

2020-07-15 Thread Chris Wilson
Rearrange the code to pull the operations before the fence->lock critical
section, and remove a small amount of redundancy:

Function old new   delta
dma_fence_add_callback   156 145 -11

Signed-off-by: Chris Wilson 
---
 drivers/dma-buf/dma-fence.c | 26 +++---
 1 file changed, 11 insertions(+), 15 deletions(-)

diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 656e9ac2d028..8d5bdfce638e 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -348,29 +348,25 @@ EXPORT_SYMBOL(dma_fence_enable_sw_signaling);
 int dma_fence_add_callback(struct dma_fence *fence, struct dma_fence_cb *cb,
   dma_fence_func_t func)
 {
-   unsigned long flags;
-   int ret = 0;
+   int ret = -ENOENT;
 
if (WARN_ON(!fence || !func))
return -EINVAL;
 
-   if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {
-   INIT_LIST_HEAD(&cb->node);
-   return -ENOENT;
-   }
+   cb->func = func;
+   INIT_LIST_HEAD(&cb->node);
 
-   spin_lock_irqsave(fence->lock, flags);
+   if (!test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {
+   unsigned long flags;
 
-   if (__dma_fence_enable_signaling(fence)) {
-   cb->func = func;
-   list_add_tail(&cb->node, &fence->cb_list);
-   } else {
-   INIT_LIST_HEAD(&cb->node);
-   ret = -ENOENT;
+   spin_lock_irqsave(fence->lock, flags);
+   if (__dma_fence_enable_signaling(fence)) {
+   list_add_tail(&cb->node, &fence->cb_list);
+   ret = 0;
+   }
+   spin_unlock_irqrestore(fence->lock, flags);
}
 
-   spin_unlock_irqrestore(fence->lock, flags);
-
return ret;
 }
 EXPORT_SYMBOL(dma_fence_add_callback);
-- 
2.20.1

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


[Intel-gfx] [PATCH 2/2] dma-buf/dma-fence: Add quick tests before dma_fence_remove_callback

2020-07-15 Thread Chris Wilson
When waiting with a callback on the stack, we must remove the callback
upon wait completion. Since this will be notified by the fence signal
callback, the removal often contends with the fence->lock being held by
the signaler. We can look at the list entry to see if the callback was
already signaled before we take the contended lock.

Signed-off-by: Chris Wilson 
---
 drivers/dma-buf/dma-fence.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
index 8d5bdfce638e..b910d7bc0854 100644
--- a/drivers/dma-buf/dma-fence.c
+++ b/drivers/dma-buf/dma-fence.c
@@ -420,6 +420,9 @@ dma_fence_remove_callback(struct dma_fence *fence, struct 
dma_fence_cb *cb)
unsigned long flags;
bool ret;
 
+   if (list_empty(&cb->node))
+   return false;
+
spin_lock_irqsave(fence->lock, flags);
 
ret = !list_empty(&cb->node);
-- 
2.20.1

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


[Intel-gfx] [PATCH] drm/i915: Move i915_vma_lock in the selftests to avoid lock inversion, v3.

2020-07-15 Thread Maarten Lankhorst
Make sure vma_lock is not used as inner lock when kernel context is used,
and add ww handling where appropriate.

Ensure that execbuf selftests keep passing by using ww handling.

Changes since v2:
- Fix i915_gem_context finally.

Signed-off-by: Maarten Lankhorst 
---
 .../i915/gem/selftests/i915_gem_coherency.c   |  26 +++--
 .../drm/i915/gem/selftests/i915_gem_context.c | 106 +-
 .../drm/i915/gem/selftests/i915_gem_mman.c|  41 +--
 drivers/gpu/drm/i915/gt/selftest_rps.c|  30 +++--
 drivers/gpu/drm/i915/selftests/i915_request.c |  18 ++-
 5 files changed, 125 insertions(+), 96 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c 
b/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
index dcdfc396f2f8..7049a6bbc03d 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_coherency.c
@@ -201,25 +201,25 @@ static int gpu_set(struct context *ctx, unsigned long 
offset, u32 v)
 
i915_gem_object_lock(ctx->obj, NULL);
err = i915_gem_object_set_to_gtt_domain(ctx->obj, true);
-   i915_gem_object_unlock(ctx->obj);
if (err)
-   return err;
+   goto out_unlock;
 
vma = i915_gem_object_ggtt_pin(ctx->obj, NULL, 0, 0, 0);
-   if (IS_ERR(vma))
-   return PTR_ERR(vma);
+   if (IS_ERR(vma)) {
+   err = PTR_ERR(vma);
+   goto out_unlock;
+   }
 
rq = intel_engine_create_kernel_request(ctx->engine);
if (IS_ERR(rq)) {
-   i915_vma_unpin(vma);
-   return PTR_ERR(rq);
+   err = PTR_ERR(rq);
+   goto out_unpin;
}
 
cs = intel_ring_begin(rq, 4);
if (IS_ERR(cs)) {
-   i915_request_add(rq);
-   i915_vma_unpin(vma);
-   return PTR_ERR(cs);
+   err = PTR_ERR(cs);
+   goto out_rq;
}
 
if (INTEL_GEN(ctx->engine->i915) >= 8) {
@@ -240,14 +240,16 @@ static int gpu_set(struct context *ctx, unsigned long 
offset, u32 v)
}
intel_ring_advance(rq, cs);
 
-   i915_vma_lock(vma);
err = i915_request_await_object(rq, vma->obj, true);
if (err == 0)
err = i915_vma_move_to_active(vma, rq, EXEC_OBJECT_WRITE);
-   i915_vma_unlock(vma);
-   i915_vma_unpin(vma);
 
+out_rq:
i915_request_add(rq);
+out_unpin:
+   i915_vma_unpin(vma);
+out_unlock:
+   i915_gem_object_unlock(ctx->obj);
 
return err;
 }
diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c 
b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
index b93fd16a539d..fd49fe57ca53 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
@@ -893,24 +893,15 @@ static int igt_shared_ctx_exec(void *arg)
return err;
 }
 
-static struct i915_vma *rpcs_query_batch(struct i915_vma *vma)
+static int rpcs_query_batch(struct drm_i915_gem_object *rpcs, struct i915_vma 
*vma)
 {
-   struct drm_i915_gem_object *obj;
u32 *cmd;
-   int err;
 
-   if (INTEL_GEN(vma->vm->i915) < 8)
-   return ERR_PTR(-EINVAL);
+   GEM_BUG_ON(INTEL_GEN(vma->vm->i915) < 8);
 
-   obj = i915_gem_object_create_internal(vma->vm->i915, PAGE_SIZE);
-   if (IS_ERR(obj))
-   return ERR_CAST(obj);
-
-   cmd = i915_gem_object_pin_map(obj, I915_MAP_WB);
-   if (IS_ERR(cmd)) {
-   err = PTR_ERR(cmd);
-   goto err;
-   }
+   cmd = i915_gem_object_pin_map(rpcs, I915_MAP_WB);
+   if (IS_ERR(cmd))
+   return PTR_ERR(cmd);
 
*cmd++ = MI_STORE_REGISTER_MEM_GEN8;
*cmd++ = i915_mmio_reg_offset(GEN8_R_PWR_CLK_STATE);
@@ -918,26 +909,12 @@ static struct i915_vma *rpcs_query_batch(struct i915_vma 
*vma)
*cmd++ = upper_32_bits(vma->node.start);
*cmd = MI_BATCH_BUFFER_END;
 
-   __i915_gem_object_flush_map(obj, 0, 64);
-   i915_gem_object_unpin_map(obj);
+   __i915_gem_object_flush_map(rpcs, 0, 64);
+   i915_gem_object_unpin_map(rpcs);
 
intel_gt_chipset_flush(vma->vm->gt);
 
-   vma = i915_vma_instance(obj, vma->vm, NULL);
-   if (IS_ERR(vma)) {
-   err = PTR_ERR(vma);
-   goto err;
-   }
-
-   err = i915_vma_pin(vma, 0, 0, PIN_USER);
-   if (err)
-   goto err;
-
-   return vma;
-
-err:
-   i915_gem_object_put(obj);
-   return ERR_PTR(err);
+   return 0;
 }
 
 static int
@@ -945,52 +922,68 @@ emit_rpcs_query(struct drm_i915_gem_object *obj,
struct intel_context *ce,
struct i915_request **rq_out)
 {
+   struct drm_i915_private *i915 = to_i915(obj->base.dev);
struct i915_request *rq;
+   struct i915_gem_ww_ctx ww;
struct i915_vma *batch;
struct i915_vma *vma;
+   struct drm_i915_gem_object

[Intel-gfx] [PATCH] drm/i915: Reduce i915_request.lock contention for i915_request_wait

2020-07-15 Thread Chris Wilson
Currently, we use i915_request_completed() directly in
i915_request_wait() and follow up with a manual invocation of
dma_fence_signal(). This appears to cause a large amount of contention
on i915_request.lock: when the process is woken up after the fence is
signaled by an interrupt, we then try to call dma_fence_signal()
ourselves while the signaler is still holding the lock.
dma_fence_is_signaled() has the benefit of checking the
DMA_FENCE_FLAG_SIGNALED_BIT prior to calling dma_fence_signal() and so
avoids most of that contention.

Signed-off-by: Chris Wilson 
Cc: Matthew Auld 
Cc: Tvrtko Ursulin 
---
 drivers/gpu/drm/i915/i915_request.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_request.c 
b/drivers/gpu/drm/i915/i915_request.c
index 0b2fe55e6194..bb4eb1a8780e 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1640,7 +1640,7 @@ static bool busywait_stop(unsigned long timeout, unsigned 
int cpu)
return this_cpu != cpu;
 }
 
-static bool __i915_spin_request(const struct i915_request * const rq, int 
state)
+static bool __i915_spin_request(struct i915_request * const rq, int state)
 {
unsigned long timeout_ns;
unsigned int cpu;
@@ -1673,7 +1673,7 @@ static bool __i915_spin_request(const struct i915_request 
* const rq, int state)
timeout_ns = READ_ONCE(rq->engine->props.max_busywait_duration_ns);
timeout_ns += local_clock_ns(&cpu);
do {
-   if (i915_request_completed(rq))
+   if (dma_fence_is_signaled(&rq->fence))
return true;
 
if (signal_pending_state(state, current))
@@ -1766,10 +1766,8 @@ long i915_request_wait(struct i915_request *rq,
 * duration, which we currently lack.
 */
if (IS_ACTIVE(CONFIG_DRM_I915_MAX_REQUEST_BUSYWAIT) &&
-   __i915_spin_request(rq, state)) {
-   dma_fence_signal(&rq->fence);
+   __i915_spin_request(rq, state))
goto out;
-   }
 
/*
 * This client is about to stall waiting for the GPU. In many cases
@@ -1796,10 +1794,8 @@ long i915_request_wait(struct i915_request *rq,
for (;;) {
set_current_state(state);
 
-   if (i915_request_completed(rq)) {
-   dma_fence_signal(&rq->fence);
+   if (dma_fence_is_signaled(&rq->fence))
break;
-   }
 
intel_engine_flush_submission(rq->engine);
 
-- 
2.20.1

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] sw_sync deadlock avoidance, take 3

2020-07-15 Thread Bas Nieuwenhuizen
On Wed, Jul 15, 2020 at 12:34 PM Chris Wilson  wrote:
>
> Quoting Bas Nieuwenhuizen (2020-07-15 11:23:35)
> > Hi Chris,
> >
> > My concern with going in this direction was that we potentially allow
> > an application to allocate a lot of kernel memory but not a lot of fds
> > by creating lots of fences and then closing the fds but never
> > signaling them. Is that not an issue?
>
> I did look to see if there was a quick way we could couple into the
> sync_file release itself to remove the syncpt from the timeline, but
> decided that for a debug feature, it wasn't a pressing concern.
>
> Maybe now is the time to ask: are you using sw_sync outside of
> validation?

Yes, this is used as part of the Android stack on Chrome OS (I need to
check whether it is Chrome OS specific, but
https://source.android.com/devices/graphics/sync#sync_timeline
suggests not).

> -Chris
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH v12 0/3] drm/i915: timeline semaphore support

2020-07-15 Thread Lionel Landwerlin

Ping?

On 08/07/2020 16:17, Lionel Landwerlin wrote:

Hi all,

This is resuming the work on trying to get timeline semaphore support
for i915 upstream, now that some selftests have been added to
dma-fence-chain.

There are a few fix from the last iteration and a rebase following the
changes in the upstream execbuf code.

Cheers,

Lionel Landwerlin (3):
   drm/i915: introduce a mechanism to extend execbuf2
   drm/i915: add syncobj timeline support
   drm/i915: peel dma-fence-chains wait fences

  .../gpu/drm/i915/gem/i915_gem_execbuffer.c| 333 +++---
  drivers/gpu/drm/i915/i915_drv.c   |   3 +-
  drivers/gpu/drm/i915/i915_getparam.c  |   1 +
  include/uapi/drm/i915_drm.h   |  65 +++-
  4 files changed, 342 insertions(+), 60 deletions(-)

--
2.27.0



___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


[Intel-gfx] ✓ Fi.CI.BAT: success for series starting with [1/2] dma-buf/dma-fence: Trim dma_fence_add_callback()

2020-07-15 Thread Patchwork
== Series Details ==

Series: series starting with [1/2] dma-buf/dma-fence: Trim 
dma_fence_add_callback()
URL   : https://patchwork.freedesktop.org/series/79513/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_8748 -> Patchwork_18174


Summary
---

  **SUCCESS**

  No regressions found.

  External URL: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18174/index.html

Known issues


  Here are the changes found in Patchwork_18174 that come from known issues:

### IGT changes ###

 Issues hit 

  * igt@gem_flink_basic@basic:
- fi-tgl-y:   [PASS][1] -> [DMESG-WARN][2] ([i915#402])
   [1]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-tgl-y/igt@gem_flink_ba...@basic.html
   [2]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18174/fi-tgl-y/igt@gem_flink_ba...@basic.html

  * igt@i915_pm_rpm@module-reload:
- fi-byt-j1900:   [PASS][3] -> [DMESG-WARN][4] ([i915#1982]) +1 similar 
issue
   [3]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-byt-j1900/igt@i915_pm_...@module-reload.html
   [4]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18174/fi-byt-j1900/igt@i915_pm_...@module-reload.html

  * igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic:
- fi-icl-u2:  [PASS][5] -> [DMESG-WARN][6] ([i915#1982])
   [5]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-icl-u2/igt@kms_cursor_leg...@basic-busy-flip-before-cursor-atomic.html
   [6]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18174/fi-icl-u2/igt@kms_cursor_leg...@basic-busy-flip-before-cursor-atomic.html

  
 Possible fixes 

  * igt@gem_exec_suspend@basic-s0:
- fi-tgl-u2:  [FAIL][7] ([i915#1888]) -> [PASS][8]
   [7]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-tgl-u2/igt@gem_exec_susp...@basic-s0.html
   [8]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18174/fi-tgl-u2/igt@gem_exec_susp...@basic-s0.html

  * igt@i915_pm_rpm@basic-pci-d3-state:
- fi-bsw-kefka:   [DMESG-WARN][9] ([i915#1982]) -> [PASS][10]
   [9]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-bsw-kefka/igt@i915_pm_...@basic-pci-d3-state.html
   [10]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18174/fi-bsw-kefka/igt@i915_pm_...@basic-pci-d3-state.html

  * igt@kms_busy@basic@flip:
- fi-tgl-y:   [DMESG-WARN][11] ([i915#1982]) -> [PASS][12]
   [11]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-tgl-y/igt@kms_busy@ba...@flip.html
   [12]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18174/fi-tgl-y/igt@kms_busy@ba...@flip.html

  * igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic:
- {fi-kbl-7560u}: [DMESG-WARN][13] ([i915#1982]) -> [PASS][14]
   [13]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-kbl-7560u/igt@kms_cursor_leg...@basic-busy-flip-before-cursor-atomic.html
   [14]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18174/fi-kbl-7560u/igt@kms_cursor_leg...@basic-busy-flip-before-cursor-atomic.html

  * igt@kms_flip@basic-flip-vs-wf_vblank@c-edp1:
- fi-icl-u2:  [DMESG-WARN][15] ([i915#1982]) -> [PASS][16] +1 
similar issue
   [15]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-icl-u2/igt@kms_flip@basic-flip-vs-wf_vbl...@c-edp1.html
   [16]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18174/fi-icl-u2/igt@kms_flip@basic-flip-vs-wf_vbl...@c-edp1.html

  * igt@vgem_basic@setversion:
- fi-tgl-y:   [DMESG-WARN][17] ([i915#402]) -> [PASS][18] +1 
similar issue
   [17]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-tgl-y/igt@vgem_ba...@setversion.html
   [18]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18174/fi-tgl-y/igt@vgem_ba...@setversion.html

  
 Warnings 

  * igt@debugfs_test@read_all_entries:
- fi-kbl-x1275:   [DMESG-WARN][19] ([i915#62] / [i915#92] / [i915#95]) 
-> [DMESG-WARN][20] ([i915#62] / [i915#92]) +2 similar issues
   [19]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-kbl-x1275/igt@debugfs_test@read_all_entries.html
   [20]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18174/fi-kbl-x1275/igt@debugfs_test@read_all_entries.html

  * igt@kms_flip@basic-flip-vs-modeset@a-dp1:
- fi-kbl-x1275:   [DMESG-WARN][21] ([i915#62] / [i915#92]) -> 
[DMESG-WARN][22] ([i915#62] / [i915#92] / [i915#95]) +2 similar issues
   [21]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-kbl-x1275/igt@kms_flip@basic-flip-vs-mode...@a-dp1.html
   [22]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18174/fi-kbl-x1275/igt@kms_flip@basic-flip-vs-mode...@a-dp1.html

  
  {name}: This element is suppressed. This means it is ignored when computing
  the status of the difference (SUCCESS, WARNING, or FAILURE).

  [i915#1888]: https://gitlab.freedesktop.org/drm/intel/issues/1888
  [i915#1982]: https://gitlab.freedesktop.org/drm/intel/issues/1982
  [i915#402]: https://gitlab.freedes

[Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for series starting with [01/23] Revert "drm/i915/gem: Async GPU relocations only" (rev3)

2020-07-15 Thread Patchwork
== Series Details ==

Series: series starting with [01/23] Revert "drm/i915/gem: Async GPU 
relocations only" (rev3)
URL   : https://patchwork.freedesktop.org/series/79470/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
814752a982cc Revert "drm/i915/gem: Async GPU relocations only"
-:113: WARNING:MEMORY_BARRIER: memory barrier without comment
#113: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1109:
+   mb();

-:161: WARNING:MEMORY_BARRIER: memory barrier without comment
#161: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1157:
+   mb();

-:181: CHECK:SPACING: No space is necessary after a cast
#181: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1177:
+   io_mapping_unmap_atomic((void __force __iomem *) 
unmask_page(cache->vaddr));

-:260: WARNING:MEMORY_BARRIER: memory barrier without comment
#260: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1256:
+   mb();

-:274: CHECK:BRACES: Unbalanced braces around else statement
#274: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1270:
+   } else

total: 0 errors, 3 warnings, 2 checks, 455 lines checked
6392eb4a8da1 drm/i915: Revert relocation chaining commits.
-:6: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description 
(prefer a maximum 75 chars per line)
#6: 
This reverts commit 964a9b0f611ee ("drm/i915/gem: Use chained reloc batches")

-:221: CHECK:SPACING: spaces preferred around that '/' (ctx:VxV)
#221: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1313:
+   if (cache->rq_size > PAGE_SIZE/sizeof(u32) - (len + 1))
  ^

total: 0 errors, 1 warnings, 1 checks, 281 lines checked
d7d81d2155e7 Revert "drm/i915/gem: Drop relocation slowpath".
-:131: WARNING:LINE_SPACING: Missing a blank line after declarations
#131: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1705:
+   int err = __get_user(c, addr);
+   if (err)

total: 0 errors, 1 warnings, 0 checks, 320 lines checked
b7e72e95ad06 drm/i915: Add an implementation for i915_gem_ww_ctx locking, v2.
-:445: WARNING:LONG_LINE: line length of 103 exceeds 100 columns
#445: FILE: drivers/gpu/drm/i915/i915_gem.c:1359:
+   while ((obj = list_first_entry_or_null(&ww->obj_list, struct 
drm_i915_gem_object, obj_link))) {

total: 0 errors, 1 warnings, 0 checks, 441 lines checked
768f25a145c6 drm/i915: Remove locking from i915_gem_object_prepare_read/write
b874980be3e5 drm/i915: Parse command buffer earlier in eb_relocate(slow)
f9c42a704f73 Revert "drm/i915/gem: Split eb_vma into its own allocation"
fbcda5740e01 drm/i915: Use per object locking in execbuf, v12.
-:457: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#457: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1410:
+static int __reloc_entry_gpu(struct i915_execbuffer *eb,
  struct i915_vma *vma,

-:477: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#477: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1483:
+static int reloc_entry_gpu(struct i915_execbuffer *eb,
struct i915_vma *vma,

-:489: ERROR:TRAILING_WHITESPACE: trailing whitespace
#489: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:1508:
+^I$

-:759: CHECK:MULTIPLE_ASSIGNMENTS: multiple assignments should be avoided
#759: FILE: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c:2878:
+   eb.reloc_pool = eb.batch_pool = NULL;

total: 1 errors, 0 warnings, 3 checks, 865 lines checked
0f2665b24b41 drm/i915: Use ww locking in intel_renderstate.
-:10: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description 
(prefer a maximum 75 chars per line)
#10: 
Convert to using ww-waiting, and make sure we always pin intel_context_state,

total: 0 errors, 1 warnings, 0 checks, 190 lines checked
3f3703db8b84 drm/i915: Add ww context handling to context_barrier_task
-:19: WARNING:LONG_LINE: line length of 109 exceeds 100 columns
#19: FILE: drivers/gpu/drm/i915/gem/i915_gem_context.c:1097:
+   int (*pin)(struct intel_context *ce, struct 
i915_gem_ww_ctx *ww, void *data),

total: 0 errors, 1 warnings, 0 checks, 146 lines checked
7cd4ce6f551a drm/i915: Nuke arguments to eb_pin_engine
47814daad689 drm/i915: Pin engine before pinning all objects, v5.
66a29b323d43 drm/i915: Rework intel_context pinning to do everything outside of 
pin_mutex
-:125: CHECK:LINE_SPACING: Please don't use multiple blank lines
#125: FILE: drivers/gpu/drm/i915/gt/intel_context.c:176:
+
+

-:338: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#338: FILE: drivers/gpu/drm/i915/gt/intel_lrc.c:3483:
+   *vaddr = i915_gem_object_pin_map(ce->state->obj,
+   
i915_coherent_map_type(ce->engine->i915) |

total: 0 errors, 0 warnings, 2 checks, 434 lines checked
be73070bd265 drm/i915: Make sure execbuffer always passes ww state to 
i915_vma

[Intel-gfx] ✗ Fi.CI.SPARSE: warning for series starting with [01/23] Revert "drm/i915/gem: Async GPU relocations only" (rev3)

2020-07-15 Thread Patchwork
== Series Details ==

Series: series starting with [01/23] Revert "drm/i915/gem: Async GPU 
relocations only" (rev3)
URL   : https://patchwork.freedesktop.org/series/79470/
State : warning

== Summary ==

$ dim sparse --fast origin/drm-tip
Sparse version: v0.6.0
Fast mode used, each commit won't be checked separately.


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH 1/2] dma-buf/sw_sync: Avoid recursive lock during fence signal

2020-07-15 Thread Bas Nieuwenhuizen
Reviewed-by: Bas Nieuwenhuizen 

On Wed, Jul 15, 2020 at 12:04 PM Chris Wilson  wrote:
>
> If a signal callback releases the sw_sync fence, that will trigger a
> deadlock as the timeline_fence_release recurses onto the fence->lock
> (used both for signaling and the timeline tree).
>
> If we always hold a reference for an unsignaled fence held by the
> timeline, we no longer need to detach the fence from the timeline upon
> release. This is only possible since commit ea4d5a270b57
> ("dma-buf/sw_sync: force signal all unsignaled fences on dying timeline")
> where we introduced decoupling of the fences from the timeline upon release.
>
> Reported-by: Bas Nieuwenhuizen 
> Fixes: d3c6dd1fb30d ("dma-buf/sw_sync: Synchronize signal vs syncpt free")
> Signed-off-by: Chris Wilson 
> Cc: Sumit Semwal 
> Cc: Chris Wilson 
> Cc: Gustavo Padovan 
> Cc: Christian König 
> Cc: 
> ---
>  drivers/dma-buf/sw_sync.c | 32 +++-
>  1 file changed, 7 insertions(+), 25 deletions(-)
>
> diff --git a/drivers/dma-buf/sw_sync.c b/drivers/dma-buf/sw_sync.c
> index 348b3a9170fa..4cc2ac03a84a 100644
> --- a/drivers/dma-buf/sw_sync.c
> +++ b/drivers/dma-buf/sw_sync.c
> @@ -130,16 +130,7 @@ static const char 
> *timeline_fence_get_timeline_name(struct dma_fence *fence)
>
>  static void timeline_fence_release(struct dma_fence *fence)
>  {
> -   struct sync_pt *pt = dma_fence_to_sync_pt(fence);
> struct sync_timeline *parent = dma_fence_parent(fence);
> -   unsigned long flags;
> -
> -   spin_lock_irqsave(fence->lock, flags);
> -   if (!list_empty(&pt->link)) {
> -   list_del(&pt->link);
> -   rb_erase(&pt->node, &parent->pt_tree);
> -   }
> -   spin_unlock_irqrestore(fence->lock, flags);
>
> sync_timeline_put(parent);
> dma_fence_free(fence);
> @@ -203,18 +194,11 @@ static void sync_timeline_signal(struct sync_timeline 
> *obj, unsigned int inc)
> if (!timeline_fence_signaled(&pt->base))
> break;
>
> -   list_del_init(&pt->link);
> +   list_del(&pt->link);
> rb_erase(&pt->node, &obj->pt_tree);
>
> -   /*
> -* A signal callback may release the last reference to this
> -* fence, causing it to be freed. That operation has to be
> -* last to avoid a use after free inside this loop, and must
> -* be after we remove the fence from the timeline in order to
> -* prevent deadlocking on timeline->lock inside
> -* timeline_fence_release().
> -*/
> dma_fence_signal_locked(&pt->base);
> +   dma_fence_put(&pt->base);
> }
>
> spin_unlock_irq(&obj->lock);
> @@ -261,13 +245,9 @@ static struct sync_pt *sync_pt_create(struct 
> sync_timeline *obj,
> } else if (cmp < 0) {
> p = &parent->rb_left;
> } else {
> -   if (dma_fence_get_rcu(&other->base)) {
> -   sync_timeline_put(obj);
> -   kfree(pt);
> -   pt = other;
> -   goto unlock;
> -   }
> -   p = &parent->rb_left;
> +   dma_fence_put(&pt->base);
> +   pt = other;
> +   goto unlock;
> }
> }
> rb_link_node(&pt->node, parent, p);
> @@ -278,6 +258,7 @@ static struct sync_pt *sync_pt_create(struct 
> sync_timeline *obj,
>   parent ? &rb_entry(parent, typeof(*pt), 
> node)->link : &obj->pt_list);
> }
>  unlock:
> +   dma_fence_get(&pt->base); /* keep a ref for the timeline */
> spin_unlock_irq(&obj->lock);
>
> return pt;
> @@ -316,6 +297,7 @@ static int sw_sync_debugfs_release(struct inode *inode, 
> struct file *file)
> list_for_each_entry_safe(pt, next, &obj->pt_list, link) {
> dma_fence_set_error(&pt->base, -ENOENT);
> dma_fence_signal_locked(&pt->base);
> +   dma_fence_put(&pt->base);
> }
>
> spin_unlock_irq(&obj->lock);
> --
> 2.20.1
>
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH 19/25] drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler code

2020-07-15 Thread Christian König

On 14.07.20 16:31, Daniel Vetter wrote:

On Tue, Jul 14, 2020 at 01:40:11PM +0200, Christian König wrote:

On 14.07.20 12:49, Daniel Vetter wrote:

On Tue, Jul 07, 2020 at 10:12:23PM +0200, Daniel Vetter wrote:

My dma-fence lockdep annotations caught an inversion because we
allocate memory where we really shouldn't:

kmem_cache_alloc+0x2b/0x6d0
amdgpu_fence_emit+0x30/0x330 [amdgpu]
amdgpu_ib_schedule+0x306/0x550 [amdgpu]
amdgpu_job_run+0x10f/0x260 [amdgpu]
drm_sched_main+0x1b9/0x490 [gpu_sched]
kthread+0x12e/0x150

Trouble right now is that lockdep only validates against GFP_FS, which
would be good enough for shrinkers. But for mmu_notifiers we actually
need !GFP_ATOMIC, since they can be called from any page laundering,
even if GFP_NOFS or GFP_NOIO are set.

I guess we should improve the lockdep annotations for
fs_reclaim_acquire/release.

Ofc real fix is to properly preallocate this fence and stuff it into
the amdgpu job structure. But GFP_ATOMIC gets the lockdep splat out of
the way.

v2: Two more allocations in scheduler paths.

Frist one:

__kmalloc+0x58/0x720
amdgpu_vmid_grab+0x100/0xca0 [amdgpu]
amdgpu_job_dependency+0xf9/0x120 [amdgpu]
drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
drm_sched_main+0xf9/0x490 [gpu_sched]

Second one:

kmem_cache_alloc+0x2b/0x6d0
amdgpu_sync_fence+0x7e/0x110 [amdgpu]
amdgpu_vmid_grab+0x86b/0xca0 [amdgpu]
amdgpu_job_dependency+0xf9/0x120 [amdgpu]
drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
drm_sched_main+0xf9/0x490 [gpu_sched]

Cc: linux-me...@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
Cc: linux-r...@vger.kernel.org
Cc: amd-...@lists.freedesktop.org
Cc: intel-gfx@lists.freedesktop.org
Cc: Chris Wilson 
Cc: Maarten Lankhorst 
Cc: Christian König 
Signed-off-by: Daniel Vetter 
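
A rough sketch of the preallocation idea mentioned above (allocate the fence
together with the job in process context, so the emit/run path never
allocates). The my_job/my_fence names are placeholders, not the actual amdgpu
structures:

#include <linux/dma-fence.h>
#include <linux/slab.h>

struct my_fence {
        struct dma_fence base;
        /* driver-specific fields */
};

struct my_job {
        struct my_fence *fence; /* preallocated at job creation */
};

/* Process context: GFP_KERNEL is safe here. */
static struct my_job *my_job_create(void)
{
        struct my_job *job = kzalloc(sizeof(*job), GFP_KERNEL);

        if (!job)
                return NULL;

        job->fence = kzalloc(sizeof(*job->fence), GFP_KERNEL);
        if (!job->fence) {
                kfree(job);
                return NULL;
        }

        return job;
}

/* Scheduler/run path: no allocation, so it cannot fail and cannot recurse
 * into reclaim while fences are being signalled. */
static struct dma_fence *my_job_emit_fence(struct my_job *job)
{
        /* dma_fence_init(&job->fence->base, ...) would go here */
        return &job->fence->base;
}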

Has anyone from amd side started looking into how to fix this properly?

Yeah, I checked both and neither is a real problem.

I'm confused ... do you mean "no real problem fixing them" or "not
actually a real problem"?


Both, at least the VMID stuff is trivial to avoid.

And the fence allocation is extremely unlikely: by the time we allocate a
new one, we have most likely just freed one already.





I looked a bit into fixing this with a mempool, and the big guarantee we
need is that:
- there's a hard upper limit on how many allocations we minimally need to
guarantee forward progress. The entire vmid allocation and
amdgpu_sync_fence stuff kinda makes me question whether that's a valid
assumption.

We do have hard upper limits for those.

The VMID allocation could as well just return the fence instead of putting
it into the sync object IIRC. So that just needs some cleanup and can avoid
the allocation entirely.

Yeah, embedding should be the simplest solution of all.


The hardware fence is limited by the number of submissions we can have
concurrently on the ring buffers, so also not a problem at all.

Ok that sounds good. Wrt releasing the memory again, is that also done
without any of the allocation-side locks held? I've seen some vmid manager
somewhere ...


Well that's the issue. We can't guarantee that for the hardware fence 
memory since it could be that we hold another reference during debugging 
IIRC.


Still looking into if and how we could fix this. But as I said, this problem
is extremely unlikely.


Christian.


-Daniel


Regards,
Christian.


- mempool_free must be called without holding any of the locks that are held
while we call mempool_alloc. Otherwise we again have a nice deadlock
with no forward progress. I tried auditing that, but got lost in amdgpu
and scheduler code. Some lockdep annotations for mempool.c might help,
but they're not going to catch everything. Plus it would again be manual
annotation because this is yet another cross-release issue. So I'm not sure
that helps at all.

iow, not sure what to do here. Ideas?

Cheers, Daniel
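
For reference, a bare-bones sketch of the mempool shape under discussion. The
two preconditions above map onto the reserve size and onto where mempool_free()
may be called from; this does not claim to solve the reclaim-annotation
problem, and the names and reserve size are made up:

#include <linux/dma-fence.h>
#include <linux/errno.h>
#include <linux/mempool.h>

struct my_fence {
        struct dma_fence base;
};

/* Must be a genuine upper bound on fences alive at any one time. */
#define FENCE_RESERVE 16

static mempool_t *fence_pool;

static int fence_pool_init(void)
{
        fence_pool = mempool_create_kmalloc_pool(FENCE_RESERVE,
                                                 sizeof(struct my_fence));
        return fence_pool ? 0 : -ENOMEM;
}

static struct my_fence *fence_alloc(void)
{
        /*
         * With a reclaiming gfp mask this does not return NULL: once the
         * reserve is exhausted it waits for an element to come back via
         * fence_free().  That is only forward progress if fence_free() can
         * be reached without any lock held around this call.
         */
        return mempool_alloc(fence_pool, GFP_KERNEL);
}

static void fence_free(struct my_fence *f)
{
        mempool_free(f, fence_pool);
}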


---
   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c   | 2 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu_sync.c  | 2 +-
   3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 8d84975885cd..a089a827fdfe 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -143,7 +143,7 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct 
dma_fence **f,
uint32_t seq;
int r;
-   fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_KERNEL);
+   fence = kmem_cache_alloc(amdgpu_fence_slab, GFP_ATOMIC);
if (fence == NULL)
return -ENOMEM;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
index 267fa45ddb66..a333ca2d4ddd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ids.c
+++ b/drivers/gpu/drm/amd/amdg

Re: [Intel-gfx] [PATCH v8 00/12] Introduce CAP_PERFMON to secure system performance monitoring and observability

2020-07-15 Thread Arnaldo Carvalho de Melo
Em Tue, Jul 14, 2020 at 12:59:34PM +0200, Peter Zijlstra escreveu:
> On Mon, Jul 13, 2020 at 03:51:52PM -0300, Arnaldo Carvalho de Melo wrote:
> 
> > > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > > index 856d98c36f56..a2397f724c10 100644
> > > > --- a/kernel/events/core.c
> > > > +++ b/kernel/events/core.c
> > > > @@ -11595,7 +11595,7 @@ SYSCALL_DEFINE5(perf_event_open,
> > > >  * perf_event_exit_task() that could imply).
> > > >  */
> > > > err = -EACCES;
> > > > -   if (!ptrace_may_access(task, 
> > > > PTRACE_MODE_READ_REALCREDS))
> > > > +   if (!perfmon_capable() && !ptrace_may_access(task, 
> > > > PTRACE_MODE_READ_REALCREDS))
> > > > goto err_cred;
> > > > }

> > > >> makes monitoring simpler and even more secure to use since Perf tool 
> > > >> need
> > > >> not to start/stop/single-step and read/write registers and memory and 
> > > >> so on
> > > >> like a debugger or strace-like tool. What do you think?

> > > > I tend to agree, Peter?

> So this basically says that if CAP_PERFMON, we don't care about the
> ptrace() permissions? Just like how CAP_SYS_PTRACE would always allow
> the ptrace checks?

> I suppose that makes sense.

Yeah, it in fact addresses the comment right above it:

if (task) {
err = 
mutex_lock_interruptible(&task->signal->exec_update_mutex);
if (err)
goto err_task;

/*
 * Reuse ptrace permission checks for now.
 *
 * We must hold exec_update_mutex across this and any potential
 * perf_install_in_context() call for this new event to
 * serialize against exec() altering our credentials (and the
 * perf_event_exit_task() that could imply).
 */
err = -EACCES;
if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS))
goto err_cred;
}


that "for now" part :-)

The idea is to not require CAP_SYS_PTRACE for that, i.e. the attack surface of the
perf binary is reduced.

- Arnaldo


[Intel-gfx] ✓ Fi.CI.BAT: success for series starting with [01/23] Revert "drm/i915/gem: Async GPU relocations only" (rev3)

2020-07-15 Thread Patchwork
== Series Details ==

Series: series starting with [01/23] Revert "drm/i915/gem: Async GPU 
relocations only" (rev3)
URL   : https://patchwork.freedesktop.org/series/79470/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_8748 -> Patchwork_18175


Summary
---

  **SUCCESS**

  No regressions found.

  External URL: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18175/index.html

Known issues


  Here are the changes found in Patchwork_18175 that come from known issues:

### IGT changes ###

 Issues hit 

  * igt@gem_tiled_pread_basic:
- fi-tgl-y:   [PASS][1] -> [DMESG-WARN][2] ([i915#402])
   [1]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-tgl-y/igt@gem_tiled_pread_basic.html
   [2]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18175/fi-tgl-y/igt@gem_tiled_pread_basic.html

  * igt@i915_module_load@reload:
- fi-tgl-u2:  [PASS][3] -> [DMESG-WARN][4] ([i915#402])
   [3]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-tgl-u2/igt@i915_module_l...@reload.html
   [4]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18175/fi-tgl-u2/igt@i915_module_l...@reload.html

  * igt@i915_pm_rpm@basic-pci-d3-state:
- fi-byt-j1900:   [PASS][5] -> [DMESG-WARN][6] ([i915#1982])
   [5]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-byt-j1900/igt@i915_pm_...@basic-pci-d3-state.html
   [6]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18175/fi-byt-j1900/igt@i915_pm_...@basic-pci-d3-state.html

  * igt@kms_busy@basic@flip:
- fi-kbl-x1275:   [PASS][7] -> [DMESG-WARN][8] ([i915#62] / [i915#92] / 
[i915#95])
   [7]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-kbl-x1275/igt@kms_busy@ba...@flip.html
   [8]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18175/fi-kbl-x1275/igt@kms_busy@ba...@flip.html

  * igt@kms_cursor_legacy@basic-busy-flip-before-cursor-legacy:
- fi-icl-u2:  [PASS][9] -> [DMESG-WARN][10] ([i915#1982]) +1 
similar issue
   [9]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-icl-u2/igt@kms_cursor_leg...@basic-busy-flip-before-cursor-legacy.html
   [10]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18175/fi-icl-u2/igt@kms_cursor_leg...@basic-busy-flip-before-cursor-legacy.html

  
 Possible fixes 

  * igt@gem_exec_suspend@basic-s0:
- fi-tgl-u2:  [FAIL][11] ([i915#1888]) -> [PASS][12]
   [11]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-tgl-u2/igt@gem_exec_susp...@basic-s0.html
   [12]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18175/fi-tgl-u2/igt@gem_exec_susp...@basic-s0.html

  * igt@i915_pm_rpm@basic-pci-d3-state:
- fi-bsw-kefka:   [DMESG-WARN][13] ([i915#1982]) -> [PASS][14]
   [13]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-bsw-kefka/igt@i915_pm_...@basic-pci-d3-state.html
   [14]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18175/fi-bsw-kefka/igt@i915_pm_...@basic-pci-d3-state.html

  * igt@kms_addfb_basic@bad-pitch-256:
- fi-tgl-y:   [DMESG-WARN][15] ([i915#402]) -> [PASS][16]
   [15]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-tgl-y/igt@kms_addfb_ba...@bad-pitch-256.html
   [16]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18175/fi-tgl-y/igt@kms_addfb_ba...@bad-pitch-256.html

  * igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic:
- fi-kbl-soraka:  [DMESG-WARN][17] ([i915#1982]) -> [PASS][18]
   [17]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-kbl-soraka/igt@kms_cursor_leg...@basic-busy-flip-before-cursor-atomic.html
   [18]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18175/fi-kbl-soraka/igt@kms_cursor_leg...@basic-busy-flip-before-cursor-atomic.html

  * igt@kms_flip@basic-flip-vs-wf_vblank@c-edp1:
- fi-icl-u2:  [DMESG-WARN][19] ([i915#1982]) -> [PASS][20] +1 
similar issue
   [19]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-icl-u2/igt@kms_flip@basic-flip-vs-wf_vbl...@c-edp1.html
   [20]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18175/fi-icl-u2/igt@kms_flip@basic-flip-vs-wf_vbl...@c-edp1.html

  
 Warnings 

  * igt@kms_force_connector_basic@force-connector-state:
- fi-kbl-x1275:   [DMESG-WARN][21] ([i915#62] / [i915#92]) -> 
[DMESG-WARN][22] ([i915#62] / [i915#92] / [i915#95]) +3 similar issues
   [21]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-kbl-x1275/igt@kms_force_connector_ba...@force-connector-state.html
   [22]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18175/fi-kbl-x1275/igt@kms_force_connector_ba...@force-connector-state.html

  * igt@kms_force_connector_basic@force-edid:
- fi-kbl-x1275:   [DMESG-WARN][23] ([i915#62] / [i915#92] / [i915#95]) 
-> [DMESG-WARN][24] ([i915#62] / [i915#92]) +3 similar issues
   [23]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8748/fi-kbl-x1275/igt@kms_force_connector_ba...@force-edid.html
   [24]: 
http

Re: [Intel-gfx] sw_sync deadlock avoidance, take 3

2020-07-15 Thread Daniel Stone
Hi,

On Wed, 15 Jul 2020 at 12:05, Bas Nieuwenhuizen  
wrote:
> On Wed, Jul 15, 2020 at 12:34 PM Chris Wilson  
> wrote:
> > Maybe now is the time to ask: are you using sw_sync outside of
> > validation?
>
> Yes, this is used as part of the Android stack on Chrome OS (need to
> see if ChromeOS specific, but
> https://source.android.com/devices/graphics/sync#sync_timeline
> suggests not)

Android used to mandate it for their earlier iteration of release
fences, which was an empty/future fence having no guarantee of
eventual forward progress until someone committed work later on. For
example, when you committed a buffer to SF, it would give you an empty
'release fence' for that buffer which would only be tied to work to
signal it when you committed your _next_ buffer, which might never
happen. They removed that because a) future fences were a bad idea,
and b) it was only ever useful if you assumed strictly
FIFO/round-robin return order which wasn't always true.

So now it's been watered down to 'use this if you don't have a
hardware timeline', but why don't we work with Android people to get
that removed entirely?

Cheers,
Daniel


[Intel-gfx] [PATCH 37/66] drm/i915/gt: Free stale request on destroying the virtual engine

2020-07-15 Thread Chris Wilson
Since preempt-to-busy, we may unsubmit a request while it is still on
the HW and it completes asynchronously. That means it may be retired and in
the process destroy the virtual engine (as the user has closed their
context), but that engine may still be holding onto the unsubmitted,
completed request. Therefore we need to potentially clean up the old
request on destroying the virtual engine. We also have to keep the
virtual_engine alive until after the siblings' execlists_dequeue() have
finished peeking into the virtual engines, for which we serialise with
RCU.

Signed-off-by: Chris Wilson 
Cc: Tvrtko Ursulin 
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 22 +++---
 1 file changed, 19 insertions(+), 3 deletions(-)
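
As an aside on the kfree_rcu() at the bottom of the diff: the generic shape of
that idiom is below. Readers peek under rcu_read_lock(), while the writer
unpublishes the pointer and only lets the memory be reused after a grace
period. Generic names, nothing i915-specific:

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct thing {
        struct rcu_head rcu;    /* storage used by kfree_rcu() */
        int payload;
};

static struct thing __rcu *slot;

/* Reader: may run concurrently with release_thing(). */
static int peek_thing(void)
{
        struct thing *t;
        int val = -1;

        rcu_read_lock();
        t = rcu_dereference(slot);
        if (t)
                val = t->payload;
        rcu_read_unlock();

        return val;
}

/* Writer: unpublish first, then free only after all readers are done. */
static void release_thing(struct thing *t)
{
        rcu_assign_pointer(slot, NULL);
        kfree_rcu(t, rcu);      /* a plain kfree() here could race with peek_thing() */
}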

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c 
b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 4e770274ea8f..fabb20a6800b 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -179,6 +179,7 @@
 #define EXECLISTS_REQUEST_SIZE 64 /* bytes */
 
 struct virtual_engine {
+   struct rcu_head rcu;
struct intel_engine_cs base;
struct intel_context context;
 
@@ -5319,10 +5320,25 @@ static void virtual_context_destroy(struct kref *kref)
container_of(kref, typeof(*ve), context.ref);
unsigned int n;
 
-   GEM_BUG_ON(!list_empty(virtual_queue(ve)));
-   GEM_BUG_ON(ve->request);
GEM_BUG_ON(ve->context.inflight);
 
+   if (unlikely(ve->request)) {
+   struct i915_request *old;
+   unsigned long flags;
+
+   spin_lock_irqsave(&ve->base.active.lock, flags);
+
+   old = fetch_and_zero(&ve->request);
+   if (old) {
+   GEM_BUG_ON(!i915_request_completed(old));
+   __i915_request_submit(old);
+   i915_request_put(old);
+   }
+
+   spin_unlock_irqrestore(&ve->base.active.lock, flags);
+   }
+   GEM_BUG_ON(!list_empty(virtual_queue(ve)));
+
for (n = 0; n < ve->num_siblings; n++) {
struct intel_engine_cs *sibling = ve->siblings[n];
struct rb_node *node = &ve->nodes[sibling->id].rb;
@@ -5348,7 +5364,7 @@ static void virtual_context_destroy(struct kref *kref)
intel_engine_free_request_pool(&ve->base);
 
kfree(ve->bonds);
-   kfree(ve);
+   kfree_rcu(ve, rcu);
 }
 
 static void virtual_engine_initial_hint(struct virtual_engine *ve)
-- 
2.20.1



[Intel-gfx] [PATCH 61/66] drm/i915/gt: Support creation of 'internal' rings

2020-07-15 Thread Chris Wilson
To support legacy ring buffer scheduling, we want a virtual ringbuffer
for each client. These rings are purely for holding the requests as they
are being constructed on the CPU and never accessed by the GPU, so they
should not be bound into the GGTT, and we can use plain old WB mapped
pages.

As they are not bound, we need to nerf a few assumptions that a rq->ring
is in the GGTT.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gt/intel_context.c|  2 +-
 drivers/gpu/drm/i915/gt/intel_engine_cs.c  | 17 ++
 drivers/gpu/drm/i915/gt/intel_ring.c   | 63 ++
 drivers/gpu/drm/i915/gt/intel_ring.h   | 12 -
 drivers/gpu/drm/i915/gt/intel_ring_types.h |  2 +
 5 files changed, 57 insertions(+), 39 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c 
b/drivers/gpu/drm/i915/gt/intel_context.c
index 9ba1c15114d7..fb32b6c92f29 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -129,7 +129,7 @@ int __intel_context_do_pin(struct intel_context *ce)
goto err_active;
 
CE_TRACE(ce, "pin ring:{start:%08x, head:%04x, tail:%04x}\n",
-i915_ggtt_offset(ce->ring->vma),
+intel_ring_address(ce->ring),
 ce->ring->head, ce->ring->tail);
 
smp_mb__before_atomic(); /* flush pin before it is visible */
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index af9cc42d3061..df234ce10907 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1342,7 +1342,7 @@ static int print_ring(char *buf, int sz, struct 
i915_request *rq)
 
len = scnprintf(buf, sz,
"ring:{start:%08x, hwsp:%08x, seqno:%08x, 
runtime:%llums}, ",
-   i915_ggtt_offset(rq->ring->vma),
+   intel_ring_address(rq->ring),
tl ? tl->hwsp_offset : 0,
hwsp_seqno(rq),

DIV_ROUND_CLOSEST_ULL(intel_context_get_total_runtime_ns(rq->context),
@@ -1634,7 +1634,7 @@ void intel_engine_dump(struct intel_engine_cs *engine,
print_request(m, rq, "\t\tactive ");
 
drm_printf(m, "\t\tring->start:  0x%08x\n",
-  i915_ggtt_offset(rq->ring->vma));
+  intel_ring_address(rq->ring));
drm_printf(m, "\t\tring->head:   0x%08x\n",
   rq->ring->head);
drm_printf(m, "\t\tring->tail:   0x%08x\n",
@@ -1715,13 +1715,6 @@ ktime_t intel_engine_get_busy_time(struct 
intel_engine_cs *engine, ktime_t *now)
return total;
 }
 
-static bool match_ring(struct i915_request *rq)
-{
-   u32 ring = ENGINE_READ(rq->engine, RING_START);
-
-   return ring == i915_ggtt_offset(rq->ring->vma);
-}
-
 struct i915_request *
 intel_engine_find_active_request(struct intel_engine_cs *engine)
 {
@@ -1761,11 +1754,7 @@ intel_engine_find_active_request(struct intel_engine_cs 
*engine)
continue;
 
if (!i915_request_started(request))
-   continue;
-
-   /* More than one preemptible request may match! */
-   if (!match_ring(request))
-   continue;
+   break;
 
active = request;
break;
diff --git a/drivers/gpu/drm/i915/gt/intel_ring.c 
b/drivers/gpu/drm/i915/gt/intel_ring.c
index 1c21f5725731..9aeb4025c485 100644
--- a/drivers/gpu/drm/i915/gt/intel_ring.c
+++ b/drivers/gpu/drm/i915/gt/intel_ring.c
@@ -33,33 +33,42 @@ int intel_ring_pin_locked(struct intel_ring *ring)
 {
struct i915_vma *vma = ring->vma;
enum i915_map_type type;
-   unsigned int flags;
void *addr;
int ret;
 
if (atomic_fetch_inc(&ring->pin_count))
return 0;
 
-   /* Ring wraparound at offset 0 sometimes hangs. No idea why. */
-   flags = PIN_OFFSET_BIAS | i915_ggtt_pin_bias(vma);
+   if (!(ring->flags & INTEL_RING_CREATE_INTERNAL)) {
+   unsigned int pin;
 
-   if (vma->obj->stolen)
-   flags |= PIN_MAPPABLE;
-   else
-   flags |= PIN_HIGH;
+   /* Ring wraparound at offset 0 sometimes hangs. No idea why. */
+   pin |= PIN_OFFSET_BIAS | i915_ggtt_pin_bias(vma);
 
-   ret = i915_ggtt_pin_locked(vma, 0, flags);
-   if (unlikely(ret))
-   goto err_unpin;
+   if (vma->obj->stolen)
+   pin |= PIN_MAPPABLE;
+   else
+   pin |= PIN_HIGH;
 
-   type = i915_coherent_map_type(vma->vm->i915);
-   if (i915_vma_is_map_and_fenceable(vma))
-   addr = (void __force *)i915_vma_pin_iomap(vma);
-   else
-   add

[Intel-gfx] [PATCH 15/66] drm/i915/gem: Break apart the early i915_vma_pin from execbuf object lookup

2020-07-15 Thread Chris Wilson
As a prelude to the next step where we want to perform all the object
allocations together under the same lock, we first must delay the
i915_vma_pin() as that implicitly does the allocations for us, one by
one. As it only does the allocations one by one, it is not allowed to
wait/evict, whereas by pulling all the allocations together, the entire set
can be scheduled as one.

Signed-off-by: Chris Wilson 
Reviewed-by: Tvrtko Ursulin 
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c| 74 ++-
 1 file changed, 41 insertions(+), 33 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c 
b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 40ee2718007e..28cf28fcf80a 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -33,6 +33,8 @@ struct eb_vma {
 
/** This vma's place in the execbuf reservation list */
struct drm_i915_gem_exec_object2 *exec;
+
+   struct list_head bind_link;
struct list_head unbound_link;
struct list_head reloc_link;
 
@@ -240,8 +242,8 @@ struct i915_execbuffer {
/** actual size of execobj[] as we may extend it for the cmdparser */
unsigned int buffer_count;
 
-   /** list of vma not yet bound during reservation phase */
-   struct list_head unbound;
+   /** list of all vma required to be bound for this execbuf */
+   struct list_head bind_list;
 
/** list of vma that have execobj.relocation_count */
struct list_head relocs;
@@ -565,6 +567,8 @@ eb_add_vma(struct i915_execbuffer *eb,
eb->lut_size)]);
}
 
+   list_add_tail(&ev->bind_link, &eb->bind_list);
+
if (entry->relocation_count)
list_add_tail(&ev->reloc_link, &eb->relocs);
 
@@ -586,16 +590,6 @@ eb_add_vma(struct i915_execbuffer *eb,
 
eb->batch = ev;
}
-
-   if (eb_pin_vma(eb, entry, ev)) {
-   if (entry->offset != vma->node.start) {
-   entry->offset = vma->node.start | UPDATE;
-   eb->args->flags |= __EXEC_HAS_RELOC;
-   }
-   } else {
-   eb_unreserve_vma(ev);
-   list_add_tail(&ev->unbound_link, &eb->unbound);
-   }
 }
 
 static int eb_reserve_vma(const struct i915_execbuffer *eb,
@@ -670,13 +664,31 @@ static int wait_for_timeline(struct intel_timeline *tl)
} while (1);
 }
 
-static int eb_reserve(struct i915_execbuffer *eb)
+static int eb_reserve_vm(struct i915_execbuffer *eb)
 {
-   const unsigned int count = eb->buffer_count;
unsigned int pin_flags = PIN_USER | PIN_NONBLOCK;
-   struct list_head last;
+   struct list_head last, unbound;
struct eb_vma *ev;
-   unsigned int i, pass;
+   unsigned int pass;
+
+   INIT_LIST_HEAD(&unbound);
+   list_for_each_entry(ev, &eb->bind_list, bind_link) {
+   struct drm_i915_gem_exec_object2 *entry = ev->exec;
+   struct i915_vma *vma = ev->vma;
+
+   if (eb_pin_vma(eb, entry, ev)) {
+   if (entry->offset != vma->node.start) {
+   entry->offset = vma->node.start | UPDATE;
+   eb->args->flags |= __EXEC_HAS_RELOC;
+   }
+   } else {
+   eb_unreserve_vma(ev);
+   list_add_tail(&ev->unbound_link, &unbound);
+   }
+   }
+
+   if (list_empty(&unbound))
+   return 0;
 
/*
 * Attempt to pin all of the buffers into the GTT.
@@ -714,7 +726,7 @@ static int eb_reserve(struct i915_execbuffer *eb)
if (mutex_lock_interruptible(&eb->i915->drm.struct_mutex))
return -EINTR;
 
-   list_for_each_entry(ev, &eb->unbound, unbound_link) {
+   list_for_each_entry(ev, &unbound, unbound_link) {
err = eb_reserve_vma(eb, ev, pin_flags);
if (err)
break;
@@ -725,13 +737,11 @@ static int eb_reserve(struct i915_execbuffer *eb)
}
 
/* Resort *all* the objects into priority order */
-   INIT_LIST_HEAD(&eb->unbound);
+   INIT_LIST_HEAD(&unbound);
INIT_LIST_HEAD(&last);
-   for (i = 0; i < count; i++) {
-   unsigned int flags;
+   list_for_each_entry(ev, &eb->bind_list, bind_link) {
+   unsigned int flags = ev->flags;
 
-   ev = &eb->vma[i];
-   flags = ev->flags;
if (flags & EXEC_OBJECT_PINNED &&
flags & __EXEC_OBJECT_HAS_PIN)
continue;
@@ -740,17 +750,17 @@ static int eb_reserve(struct i915_execbuffer *eb)
 
if (flags & EXEC_OBJECT_PINNED)
 

[Intel-gfx] [PATCH 40/66] drm/i915/gt: Defer schedule_out until after the next dequeue

2020-07-15 Thread Chris Wilson
Inside schedule_out, we do extra work upon idling the context, such as
updating the runtime, kicking off retires, kicking virtual engines.
However, if we are processing a series of single requests per
context, we may find ourselves scheduling out the context, only to
immediately schedule it back in during dequeue. This is just extra work
that we can avoid if we keep the context marked as inflight across the
dequeue. This becomes more significant later on for minimising virtual
engine misses.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gt/intel_context_types.h |   4 +-
 drivers/gpu/drm/i915/gt/intel_lrc.c   | 111 --
 2 files changed, 78 insertions(+), 37 deletions(-)
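
One note on the intel_context_inflight() hunk below: the inflight field packs a
small reference count into the low bits of an engine pointer, and this patch
widens that count from 2 to 3 bits. The general idiom, in a stand-alone form
with made-up names (not the i915 ptr_mask_bits/ptr_unmask_bits helpers):

#include <assert.h>
#include <stdint.h>

#define COUNT_BITS 3    /* requires pointee alignment >= (1 << COUNT_BITS) */
#define COUNT_MASK ((1u << COUNT_BITS) - 1)

static inline void *pack(void *ptr, unsigned int count)
{
        uintptr_t v = (uintptr_t)ptr;

        assert(!(v & COUNT_MASK));      /* pointer must be sufficiently aligned */
        assert(count <= COUNT_MASK);

        return (void *)(v | count);
}

static inline void *unpack_ptr(void *packed)
{
        return (void *)((uintptr_t)packed & ~(uintptr_t)COUNT_MASK);
}

static inline unsigned int unpack_count(void *packed)
{
        return (uintptr_t)packed & COUNT_MASK;
}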

diff --git a/drivers/gpu/drm/i915/gt/intel_context_types.h 
b/drivers/gpu/drm/i915/gt/intel_context_types.h
index 4954b0df4864..b63db45bab7b 100644
--- a/drivers/gpu/drm/i915/gt/intel_context_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_context_types.h
@@ -45,8 +45,8 @@ struct intel_context {
 
struct intel_engine_cs *engine;
struct intel_engine_cs *inflight;
-#define intel_context_inflight(ce) ptr_mask_bits(READ_ONCE((ce)->inflight), 2)
-#define intel_context_inflight_count(ce) 
ptr_unmask_bits(READ_ONCE((ce)->inflight), 2)
+#define intel_context_inflight(ce) ptr_mask_bits(READ_ONCE((ce)->inflight), 3)
+#define intel_context_inflight_count(ce) 
ptr_unmask_bits(READ_ONCE((ce)->inflight), 3)
 
struct i915_address_space *vm;
struct i915_gem_context __rcu *gem_context;
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c 
b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 2f35aceea778..aa3233702613 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1362,6 +1362,8 @@ __execlists_schedule_in(struct i915_request *rq)
execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_IN);
intel_engine_context_in(engine);
 
+   CE_TRACE(ce, "schedule-in, ccid:%x\n", ce->lrc.ccid);
+
return engine;
 }
 
@@ -1405,6 +1407,8 @@ __execlists_schedule_out(struct i915_request *rq,
 * refrain from doing non-trivial work here.
 */
 
+   CE_TRACE(ce, "schedule-out, ccid:%x\n", ccid);
+
/*
 * If we have just completed this context, the engine may now be
 * idle and we want to re-enter powersaving.
@@ -2037,11 +2041,6 @@ static void set_preempt_timeout(struct intel_engine_cs 
*engine,
 active_preempt_timeout(engine, rq));
 }
 
-static inline void clear_ports(struct i915_request **ports, int count)
-{
-   memset_p((void **)ports, NULL, count);
-}
-
 static void execlists_dequeue(struct intel_engine_cs *engine)
 {
struct intel_engine_execlists * const execlists = &engine->execlists;
@@ -2390,26 +2389,36 @@ static void execlists_dequeue(struct intel_engine_cs 
*engine)
start_timeslice(engine, execlists->queue_priority_hint);
 skip_submit:
ring_set_paused(engine, 0);
+   while (port-- != execlists->pending)
+   i915_request_put(*port);
*execlists->pending = NULL;
}
 }
 
-static void
-cancel_port_requests(struct intel_engine_execlists * const execlists)
+static inline void clear_ports(struct i915_request **ports, int count)
+{
+   memset_p((void **)ports, NULL, count);
+}
+
+static struct i915_request **
+cancel_port_requests(struct intel_engine_execlists * const execlists,
+struct i915_request **inactive)
 {
struct i915_request * const *port;
 
for (port = execlists->pending; *port; port++)
-   execlists_schedule_out(*port);
+   *inactive++ = *port;
clear_ports(execlists->pending, ARRAY_SIZE(execlists->pending));
 
/* Mark the end of active before we overwrite *active */
for (port = xchg(&execlists->active, execlists->pending); *port; port++)
-   execlists_schedule_out(*port);
+   *inactive++ = *port;
clear_ports(execlists->inflight, ARRAY_SIZE(execlists->inflight));
 
smp_wmb(); /* complete the seqlock for execlists_active() */
WRITE_ONCE(execlists->active, execlists->inflight);
+
+   return inactive;
 }
 
 static inline void
@@ -2481,7 +2490,8 @@ gen8_csb_parse(const struct intel_engine_execlists 
*execlists, const u32 *csb)
return *csb & (GEN8_CTX_STATUS_IDLE_ACTIVE | GEN8_CTX_STATUS_PREEMPTED);
 }
 
-static void process_csb(struct intel_engine_cs *engine)
+static struct i915_request **
+process_csb(struct intel_engine_cs *engine, struct i915_request **inactive)
 {
struct intel_engine_execlists * const execlists = &engine->execlists;
const u32 * const buf = execlists->csb_status;
@@ -2510,7 +2520,7 @@ static void process_csb(struct intel_engine_cs *engine)
head = execlists->csb_head;
tail = READ_ONCE(*execlists->csb_write);
if (unlikely(head == tail))
-   return;
+   return 

[Intel-gfx] [PATCH 47/66] drm/i915: Lift waiter/signaler iterators

2020-07-15 Thread Chris Wilson
Lift the list iteration defines for traversing the signaler/waiter lists
into i915_scheduler.h for reuse.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 10 --
 drivers/gpu/drm/i915/i915_scheduler_types.h | 10 ++
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c 
b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 534adfdc42fe..78dad751c187 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1819,16 +1819,6 @@ static void virtual_xfer_breadcrumbs(struct 
virtual_engine *ve)
intel_engine_transfer_stale_breadcrumbs(ve->siblings[0], &ve->context);
 }
 
-#define for_each_waiter(p__, rq__) \
-   list_for_each_entry_lockless(p__, \
-&(rq__)->sched.waiters_list, \
-wait_link)
-
-#define for_each_signaler(p__, rq__) \
-   list_for_each_entry_rcu(p__, \
-   &(rq__)->sched.signalers_list, \
-   signal_link)
-
 static void defer_request(struct i915_request *rq, struct list_head * const pl)
 {
LIST_HEAD(list);
diff --git a/drivers/gpu/drm/i915/i915_scheduler_types.h 
b/drivers/gpu/drm/i915/i915_scheduler_types.h
index f72e6c397b08..343ed44d5ed4 100644
--- a/drivers/gpu/drm/i915/i915_scheduler_types.h
+++ b/drivers/gpu/drm/i915/i915_scheduler_types.h
@@ -81,4 +81,14 @@ struct i915_dependency {
 #define I915_DEPENDENCY_WEAK   BIT(2)
 };
 
+#define for_each_waiter(p__, rq__) \
+   list_for_each_entry_lockless(p__, \
+&(rq__)->sched.waiters_list, \
+wait_link)
+
+#define for_each_signaler(p__, rq__) \
+   list_for_each_entry_rcu(p__, \
+   &(rq__)->sched.signalers_list, \
+   signal_link)
+
 #endif /* _I915_SCHEDULER_TYPES_H_ */
-- 
2.20.1



[Intel-gfx] [PATCH 19/66] drm/i915/gem: Assign context id for async work

2020-07-15 Thread Chris Wilson
Allocate a few dma fence context ids that we can use to associate with async work
[for the CPU] launched on behalf of this context. For extra fun, we allow
a configurable concurrency width.

A current example would be that we spawn an unbound worker for every
userptr get_pages. In the future, we wish to charge this work to the
context that initiated the async work and to impose concurrency limits
based on the context.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c   | 4 
 drivers/gpu/drm/i915/gem/i915_gem_context.h   | 6 ++
 drivers/gpu/drm/i915/gem/i915_gem_context_types.h | 6 ++
 3 files changed, 16 insertions(+)
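
A quick illustration of how the id selection below behaves: the width is
rounded down to a power of two and then decremented into a mask, so successive
calls cycle round-robin through the block of fence contexts reserved when the
gem context is created. Stand-alone, with made-up numbers:

#include <inttypes.h>
#include <stdatomic.h>
#include <stdio.h>

int main(void)
{
        uint64_t base = 1000;           /* pretend dma_fence_context_alloc(8) returned 1000 */
        unsigned int mask = 8 - 1;      /* width = rounddown_pow_of_two(ncpus), then width-- */
        atomic_uint cur = 0;

        for (int i = 0; i < 10; i++) {
                uint64_t id = base + (atomic_fetch_add(&cur, 1) & mask);

                printf("async work %d -> fence context %" PRIu64 "\n", i, id);
        }

        return 0;
}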

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c 
b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index d0bdb6d447ed..b5f6dc2333ab 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -714,6 +714,10 @@ __create_context(struct drm_i915_private *i915)
ctx->sched.priority = I915_USER_PRIORITY(I915_PRIORITY_NORMAL);
mutex_init(&ctx->mutex);
 
+   ctx->async.width = rounddown_pow_of_two(num_online_cpus());
+   ctx->async.context = dma_fence_context_alloc(ctx->async.width);
+   ctx->async.width--;
+
spin_lock_init(&ctx->stale.lock);
INIT_LIST_HEAD(&ctx->stale.engines);
 
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.h 
b/drivers/gpu/drm/i915/gem/i915_gem_context.h
index a133f92bbedb..f254458a795e 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.h
@@ -134,6 +134,12 @@ int i915_gem_context_setparam_ioctl(struct drm_device 
*dev, void *data,
 int i915_gem_context_reset_stats_ioctl(struct drm_device *dev, void *data,
   struct drm_file *file);
 
+static inline u64 i915_gem_context_async_id(struct i915_gem_context *ctx)
+{
+   return (ctx->async.context +
+   (atomic_fetch_inc(&ctx->async.cur) & ctx->async.width));
+}
+
 static inline struct i915_gem_context *
 i915_gem_context_get(struct i915_gem_context *ctx)
 {
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h 
b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
index ae14ca24a11f..52561f98000f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context_types.h
@@ -85,6 +85,12 @@ struct i915_gem_context {
 
struct intel_timeline *timeline;
 
+   struct {
+   u64 context;
+   atomic_t cur;
+   unsigned int width;
+   } async;
+
/**
 * @vm: unique address space (GTT)
 *
-- 
2.20.1



[Intel-gfx] [PATCH 50/66] drm/i915: Replace engine->schedule() with a known request operation

2020-07-15 Thread Chris Wilson
Looking to the future, we want to set the scheduling attributes
explicitly and so replace the generic engine->schedule() with the more
direct i915_request_set_priority().

What it loses in removing the 'schedule' name from the function, it
gains in having an explicit entry point with a stated goal.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/display/intel_display.c  |  9 +
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c|  2 +-
 drivers/gpu/drm/i915/gem/i915_gem_object.h|  2 +-
 drivers/gpu/drm/i915/gem/i915_gem_wait.c  | 27 +--
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |  3 --
 .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |  4 +--
 drivers/gpu/drm/i915/gt/intel_engine_types.h  | 29 
 drivers/gpu/drm/i915/gt/intel_engine_user.c   |  2 +-
 drivers/gpu/drm/i915/gt/intel_lrc.c   |  3 +-
 drivers/gpu/drm/i915/gt/selftest_hangcheck.c  | 11 +++
 drivers/gpu/drm/i915/gt/selftest_lrc.c| 33 +--
 drivers/gpu/drm/i915/i915_request.c   | 11 ---
 drivers/gpu/drm/i915/i915_scheduler.c | 15 +
 drivers/gpu/drm/i915/i915_scheduler.h |  3 +-
 14 files changed, 58 insertions(+), 96 deletions(-)

diff --git a/drivers/gpu/drm/i915/display/intel_display.c 
b/drivers/gpu/drm/i915/display/intel_display.c
index b1120d49d44e..c74e664a3759 100644
--- a/drivers/gpu/drm/i915/display/intel_display.c
+++ b/drivers/gpu/drm/i915/display/intel_display.c
@@ -15907,13 +15907,6 @@ static void intel_plane_unpin_fb(struct 
intel_plane_state *old_plane_state)
intel_unpin_fb_vma(vma, old_plane_state->flags);
 }
 
-static void fb_obj_bump_render_priority(struct drm_i915_gem_object *obj)
-{
-   struct i915_sched_attr attr = { .priority = I915_PRIORITY_DISPLAY };
-
-   i915_gem_object_wait_priority(obj, 0, &attr);
-}
-
 /**
  * intel_prepare_plane_fb - Prepare fb for usage on plane
  * @_plane: drm plane to prepare for
@@ -15990,7 +15983,7 @@ intel_prepare_plane_fb(struct drm_plane *_plane,
if (ret)
return ret;
 
-   fb_obj_bump_render_priority(obj);
+   i915_gem_object_wait_priority(obj, 0, I915_PRIORITY_DISPLAY);
i915_gem_object_flush_frontbuffer(obj, ORIGIN_DIRTYFB);
 
if (!new_plane_state->uapi.fence) { /* implicit fencing */
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c 
b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 7ad65612e4a0..d9f1403ddfa4 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -2973,7 +2973,7 @@ static int __eb_pin_reloc_engine(struct i915_execbuffer 
*eb)
return PTR_ERR(ce);
 
/* Reuse eb->context->timeline with scheduler! */
-   if (engine->schedule)
+   if (intel_engine_has_scheduler(engine))
ce->timeline = intel_timeline_get(eb->context->timeline);
 
i915_vm_put(ce->vm);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.h 
b/drivers/gpu/drm/i915/gem/i915_gem_object.h
index 26f53321443b..d916155b0c52 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object.h
@@ -459,7 +459,7 @@ int i915_gem_object_wait(struct drm_i915_gem_object *obj,
 long timeout);
 int i915_gem_object_wait_priority(struct drm_i915_gem_object *obj,
  unsigned int flags,
- const struct i915_sched_attr *attr);
+ int prio);
 
 void __i915_gem_object_flush_frontbuffer(struct drm_i915_gem_object *obj,
 enum fb_op_origin origin);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_wait.c 
b/drivers/gpu/drm/i915/gem/i915_gem_wait.c
index 8af55cd3e690..cefbbb3d9b52 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_wait.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_wait.c
@@ -93,28 +93,17 @@ i915_gem_object_wait_reservation(struct dma_resv *resv,
return timeout;
 }
 
-static void __fence_set_priority(struct dma_fence *fence,
-const struct i915_sched_attr *attr)
+static void __fence_set_priority(struct dma_fence *fence, int prio)
 {
-   struct i915_request *rq;
-   struct intel_engine_cs *engine;
-
if (dma_fence_is_signaled(fence) || !dma_fence_is_i915(fence))
return;
 
-   rq = to_request(fence);
-   engine = rq->engine;
-
local_bh_disable();
-   rcu_read_lock(); /* RCU serialisation for set-wedged protection */
-   if (engine->schedule)
-   engine->schedule(rq, attr);
-   rcu_read_unlock();
+   i915_request_set_priority(to_request(fence), prio);
local_bh_enable(); /* kick the tasklets if queues were reprioritised */
 }
 
-static void fence_set_priority(struct dma_fence *fence,
-  const struct i915_sched_attr *attr)
+static void fence_set_priority(struct dma_fence *fence, int prio)
 {

[Intel-gfx] [PATCH 14/66] drm/i915/gem: Rename execbuf.bind_link to unbound_link

2020-07-15 Thread Chris Wilson
Rename the current list of unbound objects so that we can keep track of all
objects that we need to bind, as well as the list of currently unbound
[unprocessed] objects.

Signed-off-by: Chris Wilson 
Reviewed-by: Tvrtko Ursulin 
---
 drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c 
b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index af3499aafd22..40ee2718007e 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -33,7 +33,7 @@ struct eb_vma {
 
/** This vma's place in the execbuf reservation list */
struct drm_i915_gem_exec_object2 *exec;
-   struct list_head bind_link;
+   struct list_head unbound_link;
struct list_head reloc_link;
 
struct hlist_node node;
@@ -594,7 +594,7 @@ eb_add_vma(struct i915_execbuffer *eb,
}
} else {
eb_unreserve_vma(ev);
-   list_add_tail(&ev->bind_link, &eb->unbound);
+   list_add_tail(&ev->unbound_link, &eb->unbound);
}
 }
 
@@ -714,7 +714,7 @@ static int eb_reserve(struct i915_execbuffer *eb)
if (mutex_lock_interruptible(&eb->i915->drm.struct_mutex))
return -EINTR;
 
-   list_for_each_entry(ev, &eb->unbound, bind_link) {
+   list_for_each_entry(ev, &eb->unbound, unbound_link) {
err = eb_reserve_vma(eb, ev, pin_flags);
if (err)
break;
@@ -740,15 +740,15 @@ static int eb_reserve(struct i915_execbuffer *eb)
 
if (flags & EXEC_OBJECT_PINNED)
/* Pinned must have their slot */
-   list_add(&ev->bind_link, &eb->unbound);
+   list_add(&ev->unbound_link, &eb->unbound);
else if (flags & __EXEC_OBJECT_NEEDS_MAP)
/* Map require the lowest 256MiB (aperture) */
-   list_add_tail(&ev->bind_link, &eb->unbound);
+   list_add_tail(&ev->unbound_link, &eb->unbound);
else if (!(flags & EXEC_OBJECT_SUPPORTS_48B_ADDRESS))
/* Prioritise 4GiB region for restricted bo */
-   list_add(&ev->bind_link, &last);
+   list_add(&ev->unbound_link, &last);
else
-   list_add_tail(&ev->bind_link, &last);
+   list_add_tail(&ev->unbound_link, &last);
}
list_splice_tail(&last, &eb->unbound);
mutex_unlock(&eb->i915->drm.struct_mutex);
-- 
2.20.1



[Intel-gfx] [PATCH 41/66] drm/i915/gt: Resubmit the virtual engine on schedule-out

2020-07-15 Thread Chris Wilson
Having recognised that we do not change the sibling until we schedule
out, we can then defer the decision to resubmit the virtual engine from
the unwind of the active queue to scheduling out of the virtual context.

By keeping the unwind order intact on the local engine, we can preserve
data dependency ordering while doing a preempt-to-busy pass until we
have determined the new ELSP. This means that if we try to timeslice
between a virtual engine and a data-dependent ordinary request, the pair
will maintain their relative ordering and we will avoid the
resubmission, cancelling the timeslicing until further change.

The dilemma though is that we then may end up in a situation where the
'demotion' of the virtual request to an ordinary request in the engine
queue results in filling the ELSP[] with virtual requests instead of
spreading the load across the engines. To compensate for this, we mark
each virtual request and refuse to resubmit a virtual request in the
secondary ELSP slots, thus forcing subsequent virtual requests to be
scheduled out after timeslicing. By delaying the decision until we
schedule out, we will avoid unnecessary resubmission.

Signed-off-by: Chris Wilson 
Cc: Tvrtko Ursulin 
---
 drivers/gpu/drm/i915/gt/intel_lrc.c| 92 +-
 drivers/gpu/drm/i915/gt/selftest_lrc.c |  2 +-
 2 files changed, 63 insertions(+), 31 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c 
b/drivers/gpu/drm/i915/gt/intel_lrc.c
index aa3233702613..062185116e13 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1110,39 +1110,23 @@ __unwind_incomplete_requests(struct intel_engine_cs 
*engine)
 
__i915_request_unsubmit(rq);
 
-   /*
-* Push the request back into the queue for later resubmission.
-* If this request is not native to this physical engine (i.e.
-* it came from a virtual source), push it back onto the virtual
-* engine so that it can be moved across onto another physical
-* engine as load dictates.
-*/
-   if (likely(rq->execution_mask == engine->mask)) {
-   GEM_BUG_ON(rq_prio(rq) == I915_PRIORITY_INVALID);
-   if (rq_prio(rq) != prio) {
-   prio = rq_prio(rq);
-   pl = i915_sched_lookup_priolist(engine, prio);
-   }
-   
GEM_BUG_ON(RB_EMPTY_ROOT(&engine->execlists.queue.rb_root));
-
-   list_move(&rq->sched.link, pl);
-   set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+   GEM_BUG_ON(rq_prio(rq) == I915_PRIORITY_INVALID);
+   if (rq_prio(rq) != prio) {
+   prio = rq_prio(rq);
+   pl = i915_sched_lookup_priolist(engine, prio);
+   }
+   GEM_BUG_ON(RB_EMPTY_ROOT(&engine->execlists.queue.rb_root));
 
-   /* Check in case we rollback so far we wrap [size/2] */
-   if (intel_ring_direction(rq->ring,
-intel_ring_wrap(rq->ring,
-rq->tail),
-rq->ring->tail) > 0)
-   rq->context->lrc.desc |= CTX_DESC_FORCE_RESTORE;
+   list_move(&rq->sched.link, pl);
+   set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
 
-   active = rq;
-   } else {
-   struct intel_engine_cs *owner = rq->context->engine;
+   /* Check in case we rollback so far we wrap [size/2] */
+   if (intel_ring_direction(rq->ring,
+intel_ring_wrap(rq->ring, rq->tail),
+rq->ring->tail) > 0)
+   rq->context->lrc.desc |= CTX_DESC_FORCE_RESTORE;
 
-   WRITE_ONCE(rq->engine, owner);
-   owner->submit_request(rq);
-   active = NULL;
-   }
+   active = rq;
}
 
return active;
@@ -1386,12 +1370,37 @@ static inline void execlists_schedule_in(struct 
i915_request *rq, int idx)
GEM_BUG_ON(intel_context_inflight(ce) != rq->engine);
 }
 
+static void
+resubmit_virtual_request(struct i915_request *rq, struct virtual_engine *ve)
+{
+   struct intel_engine_cs *engine = rq->engine;
+   unsigned long flags;
+
+   spin_lock_irqsave(&engine->active.lock, flags);
+
+   clear_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+   WRITE_ONCE(rq->engine, &ve->base);
+   ve->base.submit_request(rq);
+
+   spin_unlock_irqrestore(&engine->active.lock, flags);
+}
+
 static void kick_siblings(struct i915_request *rq, struct intel_context *ce)
 {
struct v

[Intel-gfx] [PATCH 48/66] drm/i915: Strip out internal priorities

2020-07-15 Thread Chris Wilson
Since we are not using any internal priority levels, and in the next few
patches will introduce a new index for which the optimisation is not so
clear cut, discard the small table within the priolist.

Signed-off-by: Chris Wilson 
---
 .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |  2 +-
 drivers/gpu/drm/i915/gt/intel_lrc.c   | 22 ++--
 drivers/gpu/drm/i915/gt/selftest_lrc.c|  2 -
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |  6 +--
 drivers/gpu/drm/i915/i915_priolist_types.h|  8 +--
 drivers/gpu/drm/i915/i915_scheduler.c | 51 +++
 drivers/gpu/drm/i915/i915_scheduler.h | 18 ++-
 7 files changed, 21 insertions(+), 88 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c 
b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index be5d78472f18..addab2d922b7 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -113,7 +113,7 @@ static void heartbeat(struct work_struct *wrk)
 * low latency and no jitter] the chance to naturally
 * complete before being preempted.
 */
-   attr.priority = I915_PRIORITY_MASK;
+   attr.priority = 0;
if (rq->sched.attr.priority >= attr.priority)
attr.priority |= 
I915_USER_PRIORITY(I915_PRIORITY_HEARTBEAT);
if (rq->sched.attr.priority >= attr.priority)
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c 
b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 78dad751c187..e3d7647a8514 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -436,22 +436,13 @@ static int effective_prio(const struct i915_request *rq)
 
 static int queue_prio(const struct intel_engine_execlists *execlists)
 {
-   struct i915_priolist *p;
struct rb_node *rb;
 
rb = rb_first_cached(&execlists->queue);
if (!rb)
return INT_MIN;
 
-   /*
-* As the priolist[] are inverted, with the highest priority in [0],
-* we have to flip the index value to become priority.
-*/
-   p = to_priolist(rb);
-   if (!I915_USER_PRIORITY_SHIFT)
-   return p->priority;
-
-   return ((p->priority + 1) << I915_USER_PRIORITY_SHIFT) - ffs(p->used);
+   return to_priolist(rb)->priority;
 }
 
 static int virtual_prio(const struct intel_engine_execlists *el)
@@ -2248,9 +2239,8 @@ static void execlists_dequeue(struct intel_engine_cs 
*engine)
while ((rb = rb_first_cached(&execlists->queue))) {
struct i915_priolist *p = to_priolist(rb);
struct i915_request *rq, *rn;
-   int i;
 
-   priolist_for_each_request_consume(rq, rn, p, i) {
+   priolist_for_each_request_consume(rq, rn, p) {
bool merge = true;
 
/*
@@ -4244,9 +4234,8 @@ static void execlists_reset_cancel(struct intel_engine_cs 
*engine)
/* Flush the queued requests to the timeline list (for retiring). */
while ((rb = rb_first_cached(&execlists->queue))) {
struct i915_priolist *p = to_priolist(rb);
-   int i;
 
-   priolist_for_each_request_consume(rq, rn, p, i) {
+   priolist_for_each_request_consume(rq, rn, p) {
mark_eio(rq);
__i915_request_submit(rq);
}
@@ -5270,7 +5259,7 @@ static int __execlists_context_alloc(struct intel_context 
*ce,
 
 static struct list_head *virtual_queue(struct virtual_engine *ve)
 {
-   return &ve->base.execlists.default_priolist.requests[0];
+   return &ve->base.execlists.default_priolist.requests;
 }
 
 static void virtual_context_destroy(struct kref *kref)
@@ -5835,9 +5824,8 @@ void intel_execlists_show_requests(struct intel_engine_cs 
*engine,
count = 0;
for (rb = rb_first_cached(&execlists->queue); rb; rb = rb_next(rb)) {
struct i915_priolist *p = rb_entry(rb, typeof(*p), node);
-   int i;
 
-   priolist_for_each_request(rq, p, i) {
+   priolist_for_each_request(rq, p) {
if (count++ < max - 1)
show_request(m, rq, "\t\tQ ");
else
diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c 
b/drivers/gpu/drm/i915/gt/selftest_lrc.c
index e05c750452be..3843c69ac8a3 100644
--- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
+++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
@@ -1102,7 +1102,6 @@ create_rewinder(struct intel_context *ce,
 
intel_ring_advance(rq, cs);
 
-   rq->sched.attr.priority = I915_PRIORITY_MASK;
err = 0;
 err:
i915_request_get(rq);
@@ -5363,7 +5362,6 @@ create_timestamp(struct intel_context *ce, void *slot, 
int idx)
 
intel_ring_advance(rq, cs);
 
-   rq-

[Intel-gfx] [PATCH 56/66] drm/i915/gt: Specify a deadline for the heartbeat

2020-07-15 Thread Chris Wilson
As we know when we expect the heartbeat to be checked for completion,
pass this information along as its deadline. We still do not complain if
the deadline is missed, at least until we have tried a few times, but it
will allow for quicker hang detection on systems where deadlines are
adhered to.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 13 +
 1 file changed, 13 insertions(+)
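
A side note on the 'interval << 20' in set_heartbeat_deadline() below:
heartbeat_interval_ms is in milliseconds while ktime_get() counts nanoseconds,
so the shift by 20 is presumably a cheap stand-in for multiplying by 1,000,000.
It multiplies by 1,048,576 instead, i.e. roughly 4.9% extra, always in the
direction of a later (more forgiving) deadline. A trivial stand-alone check:

#include <stdio.h>

int main(void)
{
        unsigned long long interval_ms = 2500;  /* e.g. a heartbeat_interval_ms value */
        unsigned long long exact_ns = interval_ms * 1000000ULL;
        unsigned long long approx_ns = interval_ms << 20;

        printf("exact %llu ns, shifted %llu ns (+%.1f%%)\n",
               exact_ns, approx_ns,
               100.0 * (approx_ns - exact_ns) / exact_ns);

        return 0;
}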

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c 
b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index 9fdc8223007f..41199254b2b5 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -54,6 +54,16 @@ static void heartbeat_commit(struct i915_request *rq,
local_bh_enable();
 }
 
+static void set_heartbeat_deadline(struct intel_engine_cs *engine,
+  struct i915_request *rq)
+{
+   unsigned long interval;
+
+   interval = READ_ONCE(engine->props.heartbeat_interval_ms);
+   if (interval)
+   i915_request_set_deadline(rq, ktime_get() + (interval << 20));
+}
+
 static void show_heartbeat(const struct i915_request *rq,
   struct intel_engine_cs *engine)
 {
@@ -119,6 +129,8 @@ static void heartbeat(struct work_struct *wrk)
 
local_bh_disable();
i915_request_set_priority(rq, attr.priority);
+   if (attr.priority == I915_PRIORITY_BARRIER)
+   i915_request_set_deadline(rq, 0);
local_bh_enable();
} else {
if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
@@ -155,6 +167,7 @@ static void heartbeat(struct work_struct *wrk)
if (engine->i915->params.enable_hangcheck)
engine->heartbeat.systole = i915_request_get(rq);
 
+   set_heartbeat_deadline(engine, rq);
heartbeat_commit(rq, &attr);
 
 unlock:
-- 
2.20.1



[Intel-gfx] [PATCH 45/66] drm/i915/gt: Extract busy-stats for ring-scheduler

2020-07-15 Thread Chris Wilson
Lift the busy-stats context-in/out implementation out of intel_lrc, so
that we can reuse it for other scheduler implementations.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gt/intel_engine_stats.h | 49 
 drivers/gpu/drm/i915/gt/intel_lrc.c  | 34 +-
 2 files changed, 50 insertions(+), 33 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gt/intel_engine_stats.h
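
The hunks below are the write side only; the matching read side (sampling busy
time without blocking writers) typically follows the usual seqlock retry
pattern, roughly as sketched here with generic names, not the actual i915
reader:

#include <linux/ktime.h>
#include <linux/seqlock.h>

struct engine_stats {
        seqlock_t lock;
        atomic_t active;        /* contexts currently executing */
        ktime_t start;          /* when the engine last became busy */
        ktime_t total;          /* busy time accumulated up to the last idle */
};

static ktime_t read_busy_time(const struct engine_stats *stats, ktime_t now)
{
        unsigned int seq;
        ktime_t total;

        do {
                seq = read_seqbegin(&stats->lock);

                total = stats->total;
                if (atomic_read(&stats->active))
                        total = ktime_add(total, ktime_sub(now, stats->start));
        } while (read_seqretry(&stats->lock, seq));

        return total;
}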

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_stats.h 
b/drivers/gpu/drm/i915/gt/intel_engine_stats.h
new file mode 100644
index ..58491eae3482
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/intel_engine_stats.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2020 Intel Corporation
+ */
+
+#ifndef __INTEL_ENGINE_STATS_H__
+#define __INTEL_ENGINE_STATS_H__
+
+#include 
+#include 
+#include 
+
+#include "i915_gem.h" /* GEM_BUG_ON */
+#include "intel_engine.h"
+
+static inline void intel_engine_context_in(struct intel_engine_cs *engine)
+{
+   unsigned long flags;
+
+   if (atomic_add_unless(&engine->stats.active, 1, 0))
+   return;
+
+   write_seqlock_irqsave(&engine->stats.lock, flags);
+   if (!atomic_add_unless(&engine->stats.active, 1, 0)) {
+   engine->stats.start = ktime_get();
+   atomic_inc(&engine->stats.active);
+   }
+   write_sequnlock_irqrestore(&engine->stats.lock, flags);
+}
+
+static inline void intel_engine_context_out(struct intel_engine_cs *engine)
+{
+   unsigned long flags;
+
+   GEM_BUG_ON(!atomic_read(&engine->stats.active));
+
+   if (atomic_add_unless(&engine->stats.active, -1, 1))
+   return;
+
+   write_seqlock_irqsave(&engine->stats.lock, flags);
+   if (atomic_dec_and_test(&engine->stats.active)) {
+   engine->stats.total =
+   ktime_add(engine->stats.total,
+ ktime_sub(ktime_get(), engine->stats.start));
+   }
+   write_sequnlock_irqrestore(&engine->stats.lock, flags);
+}
+
+#endif /* __INTEL_ENGINE_STATS_H__ */
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c 
b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 72b343242251..534adfdc42fe 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -139,6 +139,7 @@
 #include "i915_vgpu.h"
 #include "intel_context.h"
 #include "intel_engine_pm.h"
+#include "intel_engine_stats.h"
 #include "intel_gt.h"
 #include "intel_gt_pm.h"
 #include "intel_gt_requests.h"
@@ -1155,39 +1156,6 @@ execlists_context_status_change(struct i915_request *rq, 
unsigned long status)
   status, rq);
 }
 
-static void intel_engine_context_in(struct intel_engine_cs *engine)
-{
-   unsigned long flags;
-
-   if (atomic_add_unless(&engine->stats.active, 1, 0))
-   return;
-
-   write_seqlock_irqsave(&engine->stats.lock, flags);
-   if (!atomic_add_unless(&engine->stats.active, 1, 0)) {
-   engine->stats.start = ktime_get();
-   atomic_inc(&engine->stats.active);
-   }
-   write_sequnlock_irqrestore(&engine->stats.lock, flags);
-}
-
-static void intel_engine_context_out(struct intel_engine_cs *engine)
-{
-   unsigned long flags;
-
-   GEM_BUG_ON(!atomic_read(&engine->stats.active));
-
-   if (atomic_add_unless(&engine->stats.active, -1, 1))
-   return;
-
-   write_seqlock_irqsave(&engine->stats.lock, flags);
-   if (atomic_dec_and_test(&engine->stats.active)) {
-   engine->stats.total =
-   ktime_add(engine->stats.total,
- ktime_sub(ktime_get(), engine->stats.start));
-   }
-   write_sequnlock_irqrestore(&engine->stats.lock, flags);
-}
-
 static void
 execlists_check_context(const struct intel_context *ce,
const struct intel_engine_cs *engine)
-- 
2.20.1



[Intel-gfx] [PATCH 36/66] drm/i915/gt: Replace direct submit with direct call to tasklet

2020-07-15 Thread Chris Wilson
Rather than having special case code for opportunistically calling
process_csb() and performing a direct submit while holding the engine
spinlock for submitting the request, simply call the tasklet directly.
This allows us to retain the direct submission path, including the CS
draining to allow fast/immediate submissions, without requiring any
duplicated code paths.

The trickiest part here is to ensure that paired operations (such as
schedule_in/schedule_out) remain under consistent locking domains,
e.g. when pulled outside of the engine->active.lock.

v2: Use bh kicking, see commit 3c53776e29f8 ("Mark HI and TASKLET
softirq synchronous").
v3: Update engine-reset to be tasklet aware

Signed-off-by: Chris Wilson 
Reviewed-by: Mika Kuoppala 
---
 drivers/gpu/drm/i915/gem/i915_gem_context.c   |   4 +-
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c|   2 +
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |  35 +++--
 .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |  20 ++-
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |   3 +-
 drivers/gpu/drm/i915/gt/intel_lrc.c   | 120 ++
 drivers/gpu/drm/i915/gt/intel_reset.c |  60 +
 drivers/gpu/drm/i915/gt/intel_reset.h |   2 +
 drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |   7 +-
 drivers/gpu/drm/i915/gt/selftest_lrc.c|  27 ++--
 drivers/gpu/drm/i915/gt/selftest_reset.c  |   8 +-
 drivers/gpu/drm/i915/i915_request.c   |   2 +
 drivers/gpu/drm/i915/selftests/i915_request.c |   6 +-
 13 files changed, 152 insertions(+), 144 deletions(-)
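
On the 'bh kicking' mentioned in v2: the local_bh_disable()/local_bh_enable()
pairs added throughout the diff below rely on a tasklet scheduled while
softirqs are disabled on this CPU being run when local_bh_enable() drops the
count back to zero, so queueing a request still results in prompt submission
without an explicit direct call. Schematically, with placeholder names
(my_engine, my_request, my_request_queue are not real i915 symbols):

#include <linux/interrupt.h>    /* tasklets, local_bh_disable()/enable() */

struct my_engine {
        struct tasklet_struct tasklet;
};

struct my_request {
        int dummy;
};

/* Stand-in for the real queueing path, which ends up scheduling the
 * submission tasklet. */
static void my_request_queue(struct my_engine *engine, struct my_request *rq)
{
        tasklet_hi_schedule(&engine->tasklet);
}

static void queue_and_kick(struct my_engine *engine, struct my_request *rq)
{
        local_bh_disable();             /* hold off softirqs on this CPU */
        my_request_queue(engine, rq);   /* tasklet marked pending, not yet run */
        local_bh_enable();              /* pending tasklets are run here */
}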

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c 
b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index b5f6dc2333ab..901b2f5614ea 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -398,12 +398,14 @@ static bool __reset_engine(struct intel_engine_cs *engine)
if (!intel_has_reset_engine(gt))
return false;
 
+   local_bh_disable();
if (!test_and_set_bit(I915_RESET_ENGINE + engine->id,
  >->reset.flags)) {
-   success = intel_engine_reset(engine, NULL) == 0;
+   success = __intel_engine_reset_bh(engine, NULL) == 0;
clear_and_wake_up_bit(I915_RESET_ENGINE + engine->id,
  >->reset.flags);
}
+   local_bh_enable();
 
return success;
 }
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c 
b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index b07c508812ad..7ad65612e4a0 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -3327,7 +3327,9 @@ static void eb_request_add(struct i915_execbuffer *eb)
__i915_request_skip(rq);
}
 
+   local_bh_disable();
__i915_request_queue(rq, &attr);
+   local_bh_enable();
 }
 
 static int
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index dd1a42c4d344..c10521fdbbe4 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -978,32 +978,39 @@ static unsigned long stop_timeout(const struct 
intel_engine_cs *engine)
return READ_ONCE(engine->props.stop_timeout_ms);
 }
 
-int intel_engine_stop_cs(struct intel_engine_cs *engine)
+static int __intel_engine_stop_cs(struct intel_engine_cs *engine,
+ int fast_timeout_us,
+ int slow_timeout_ms)
 {
struct intel_uncore *uncore = engine->uncore;
-   const u32 base = engine->mmio_base;
-   const i915_reg_t mode = RING_MI_MODE(base);
+   const i915_reg_t mode = RING_MI_MODE(engine->mmio_base);
int err;
 
+   intel_uncore_write_fw(uncore, mode, _MASKED_BIT_ENABLE(STOP_RING));
+   err = __intel_wait_for_register_fw(engine->uncore, mode,
+  MODE_IDLE, MODE_IDLE,
+  fast_timeout_us,
+  slow_timeout_ms,
+  NULL);
+
+   /* A final mmio read to let GPU writes be hopefully flushed to memory */
+   intel_uncore_posting_read_fw(uncore, mode);
+   return err;
+}
+
+int intel_engine_stop_cs(struct intel_engine_cs *engine)
+{
+   int err = 0;
+
if (INTEL_GEN(engine->i915) < 3)
return -ENODEV;
 
ENGINE_TRACE(engine, "\n");
-
-   intel_uncore_write_fw(uncore, mode, _MASKED_BIT_ENABLE(STOP_RING));
-
-   err = 0;
-   if (__intel_wait_for_register_fw(uncore,
-mode, MODE_IDLE, MODE_IDLE,
-1000, stop_timeout(engine),
-NULL)) {
+   if (__intel_engine_stop_cs(engine, 1000, stop_timeout(engine))) {
ENGINE_TRACE(engine, "timed out on STOP_RING -> IDLE\

[Intel-gfx] [PATCH 38/66] drm/i915/gt: Use virtual_engine during execlists_dequeue

2020-07-15 Thread Chris Wilson
Rather than going back and forth between the rb_node entry and the
virtual_engine type, store the ve locally and reuse it. As the
container_of conversion from rb_node to virtual_engine requires a
variable offset, performing that conversion just once shaves off a bit
of code.

v2: Keep a single virtual engine lookup, for typical use.
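
As a rough illustration of why this helps (abridged sketch, not part of
the patch): the rb_node is embedded in a per-engine array inside the
virtual engine, so the rb_entry()/container_of() offset depends on
engine->id and is recomputed on every conversion.

        struct virtual_engine {
                /* ... */
                struct ve_node {
                        struct rb_node rb;
                        int prio;
                } nodes[I915_NUM_ENGINES];
        };

        /* Variable offset: nodes[engine->id].rb */
        struct virtual_engine *ve =
                rb_entry(rb, typeof(*ve), nodes[engine->id].rb);

Converting once and passing ve around avoids repeating that offset
arithmetic at each use.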

Signed-off-by: Chris Wilson 
Cc: Tvrtko Ursulin 
Cc:  # v5.4+
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 254 
 1 file changed, 111 insertions(+), 143 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c 
b/drivers/gpu/drm/i915/gt/intel_lrc.c
index fabb20a6800b..ec533dfe3be9 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -453,9 +453,15 @@ static int queue_prio(const struct intel_engine_execlists 
*execlists)
return ((p->priority + 1) << I915_USER_PRIORITY_SHIFT) - ffs(p->used);
 }
 
+static int virtual_prio(const struct intel_engine_execlists *el)
+{
+   struct rb_node *rb = rb_first_cached(&el->virtual);
+
+   return rb ? rb_entry(rb, struct ve_node, rb)->prio : INT_MIN;
+}
+
 static inline bool need_preempt(const struct intel_engine_cs *engine,
-   const struct i915_request *rq,
-   struct rb_node *rb)
+   const struct i915_request *rq)
 {
int last_prio;
 
@@ -492,25 +498,6 @@ static inline bool need_preempt(const struct 
intel_engine_cs *engine,
rq_prio(list_next_entry(rq, sched.link)) > last_prio)
return true;
 
-   if (rb) {
-   struct virtual_engine *ve =
-   rb_entry(rb, typeof(*ve), nodes[engine->id].rb);
-   bool preempt = false;
-
-   if (engine == ve->siblings[0]) { /* only preempt one sibling */
-   struct i915_request *next;
-
-   rcu_read_lock();
-   next = READ_ONCE(ve->request);
-   if (next)
-   preempt = rq_prio(next) > last_prio;
-   rcu_read_unlock();
-   }
-
-   if (preempt)
-   return preempt;
-   }
-
/*
 * If the inflight context did not trigger the preemption, then maybe
 * it was the set of queued requests? Pick the highest priority in
@@ -521,7 +508,8 @@ static inline bool need_preempt(const struct 
intel_engine_cs *engine,
 * ELSP[0] or ELSP[1] as, thanks again to PI, if it was the same
 * context, it's priority would not exceed ELSP[0] aka last_prio.
 */
-   return queue_prio(&engine->execlists) > last_prio;
+   return max(virtual_prio(&engine->execlists),
+  queue_prio(&engine->execlists)) > last_prio;
 }
 
 __maybe_unused static inline bool
@@ -1806,6 +1794,35 @@ static bool virtual_matches(const struct virtual_engine 
*ve,
return true;
 }
 
+static struct virtual_engine *
+first_virtual_engine(struct intel_engine_cs *engine)
+{
+   struct intel_engine_execlists *el = &engine->execlists;
+   struct rb_node *rb = rb_first_cached(&el->virtual);
+
+   while (rb) {
+   struct virtual_engine *ve =
+   rb_entry(rb, typeof(*ve), nodes[engine->id].rb);
+   struct i915_request *rq = READ_ONCE(ve->request);
+
+   /* lazily cleanup after another engine handled rq */
+   if (!rq) {
+   rb_erase_cached(rb, &el->virtual);
+   RB_CLEAR_NODE(rb);
+   rb = rb_first_cached(&el->virtual);
+   continue;
+   }
+
+   if (!virtual_matches(ve, rq, engine)) {
+   rb = rb_next(rb);
+   continue;
+   }
+   return ve;
+   }
+
+   return NULL;
+}
+
 static void virtual_xfer_breadcrumbs(struct virtual_engine *ve)
 {
/*
@@ -1889,32 +1906,15 @@ static void defer_active(struct intel_engine_cs *engine)
 
 static bool
 need_timeslice(const struct intel_engine_cs *engine,
-  const struct i915_request *rq,
-  const struct rb_node *rb)
+  const struct i915_request *rq)
 {
int hint;
 
if (!intel_engine_has_timeslices(engine))
return false;
 
-   hint = engine->execlists.queue_priority_hint;
-
-   if (rb) {
-   const struct virtual_engine *ve =
-   rb_entry(rb, typeof(*ve), nodes[engine->id].rb);
-   const struct intel_engine_cs *inflight =
-   intel_context_inflight(&ve->context);
-
-   if (!inflight || inflight == engine) {
-   struct i915_request *next;
-
-   rcu_read_lock();
-   next = READ_ONCE(ve->request);
-   if (next)
-   hint = max(hint, r

[Intel-gfx] [PATCH 30/66] drm/i915: Specialise GGTT binding

2020-07-15 Thread Chris Wilson
The Global GTT mappings do not require any backing storage for the page
directories and so do not need extensive support for preallocations, or
for handling multiple bindings en masse. The Global GTT bindings also
need to take into account an eviction strategy for pinned vma, which we
want to explicitly avoid for user bindings. It is easier to specialise
i915_ggtt_pin() to keep the pages/address alive while they are in use by
HW in its private GTT, as we deconstruct and rebuild i915_vma_pin().

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gt/gen6_ppgtt.c |   7 +-
 drivers/gpu/drm/i915/i915_vma.c  | 125 +++
 drivers/gpu/drm/i915/i915_vma.h  |   1 +
 3 files changed, 113 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/gen6_ppgtt.c 
b/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
index 3eab2cc751bc..308f7f4f7bd7 100644
--- a/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
+++ b/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
@@ -392,8 +392,11 @@ int gen6_ppgtt_pin(struct i915_ppgtt *base)
 * size. We allocate at the top of the GTT to avoid fragmentation.
 */
err = 0;
-   if (!atomic_read(&ppgtt->pin_count))
-   err = i915_ggtt_pin(ppgtt->vma, GEN6_PD_ALIGN, PIN_HIGH);
+   if (!atomic_read(&ppgtt->pin_count)) {
+   err = i915_ggtt_pin_locked(ppgtt->vma, GEN6_PD_ALIGN, PIN_HIGH);
+   if (err == 0)
+   err = i915_vma_wait_for_bind(ppgtt->vma);
+   }
if (!err)
atomic_inc(&ppgtt->pin_count);
mutex_unlock(&ppgtt->pin_mutex);
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index e584a3355911..4993fa99cb71 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -952,7 +952,7 @@ int i915_vma_pin(struct i915_vma *vma, u64 size, u64 
alignment, u64 flags)
return err;
 }
 
-static void flush_idle_contexts(struct intel_gt *gt)
+static void unpin_idle_contexts(struct intel_gt *gt)
 {
struct intel_engine_cs *engine;
enum intel_engine_id id;
@@ -963,31 +963,120 @@ static void flush_idle_contexts(struct intel_gt *gt)
intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
 }
 
+int i915_ggtt_pin_locked(struct i915_vma *vma, u32 align, unsigned int flags)
+{
+   struct i915_vma_work *work = NULL;
+   unsigned int bound;
+   int err;
+
+   GEM_BUG_ON(vma->vm->allocate_va_range);
+   GEM_BUG_ON(i915_vma_is_closed(vma));
+
+   /* First try and grab the pin without rebinding the vma */
+   if (i915_vma_pin_inplace(vma, I915_VMA_GLOBAL_BIND))
+   return 0;
+
+   work = i915_vma_work();
+   if (!work)
+   return -ENOMEM;
+   work->vm = i915_vm_get(vma->vm);
+
+   err = mutex_lock_interruptible(&vma->vm->mutex);
+   if (err)
+   goto err_fence;
+
+   /* No more allocations allowed now we hold vm->mutex */
+
+   bound = atomic_read(&vma->flags);
+   if (unlikely(bound & I915_VMA_ERROR)) {
+   err = -ENOMEM;
+   goto err_unlock;
+   }
+
+   if (unlikely(!((bound + 1) & I915_VMA_PIN_MASK))) {
+   err = -EAGAIN; /* pins are meant to be fairly temporary */
+   goto err_unlock;
+   }
+
+   if (unlikely(bound & I915_VMA_GLOBAL_BIND)) {
+   __i915_vma_pin(vma);
+   goto err_unlock;
+   }
+
+   err = i915_active_acquire(&vma->active);
+   if (err)
+   goto err_unlock;
+
+   if (!(bound & I915_VMA_BIND_MASK)) {
+   err = __wait_for_unbind(vma, flags);
+   if (err)
+   goto err_active;
+
+   err = i915_vma_insert(vma, 0, align,
+ flags | I915_VMA_GLOBAL_BIND);
+   if (err == -ENOSPC) {
+   unpin_idle_contexts(vma->vm->gt);
+   err = i915_vma_insert(vma, 0, align,
+ flags | I915_VMA_GLOBAL_BIND);
+   }
+   if (err)
+   goto err_active;
+
+   __i915_vma_set_map_and_fenceable(vma);
+   }
+
+   err = i915_vma_bind(vma,
+   vma->obj ? vma->obj->cache_level : 0,
+   I915_VMA_GLOBAL_BIND,
+   work);
+   if (err)
+   goto err_remove;
+   GEM_BUG_ON(!i915_vma_is_bound(vma, I915_VMA_GLOBAL_BIND));
+
+   list_move_tail(&vma->vm_link, &vma->vm->bound_list);
+   GEM_BUG_ON(!i915_vma_is_active(vma));
+
+   __i915_vma_pin(vma);
+
+err_remove:
+   if (!i915_vma_is_bound(vma, I915_VMA_BIND_MASK)) {
+   i915_vma_detach(vma);
+   drm_mm_remove_node(&vma->node);
+   }
+err_active:
+   i915_active_release(&vma->active);
+err_unlock:
+   mutex_unlock(&vma->vm->mutex);
+err_fence:
+   dma_fence_work_commit_imm(&work->base);

[Intel-gfx] [PATCH 64/66] drm/i915/gt: Implement ring scheduler for gen6/7

2020-07-15 Thread Chris Wilson
A key problem with legacy ring buffer submission is that it is inherently a
FIFO queue across all clients; if one blocks, they all block. A
scheduler allows us to avoid that limitation, and ensures that all
clients can submit in parallel, removing the resource contention of the
global ringbuffer.

Having built the ring scheduler infrastructure over top of the global
ringbuffer submission, we now need to provide the HW knowledge required
to build command packets and implement context switching.

Signed-off-by: Chris Wilson 
---
 .../gpu/drm/i915/gt/intel_ring_scheduler.c| 428 +-
 drivers/gpu/drm/i915/i915_reg.h   |   1 +
 2 files changed, 426 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_ring_scheduler.c 
b/drivers/gpu/drm/i915/gt/intel_ring_scheduler.c
index d3c22037f17d..2d26d62e0135 100644
--- a/drivers/gpu/drm/i915/gt/intel_ring_scheduler.c
+++ b/drivers/gpu/drm/i915/gt/intel_ring_scheduler.c
@@ -9,6 +9,10 @@
 
 #include "mm/i915_acquire_ctx.h"
 
+#include "gen2_engine_cs.h"
+#include "gen6_engine_cs.h"
+#include "gen6_ppgtt.h"
+#include "gen7_renderclear.h"
 #include "i915_drv.h"
 #include "intel_context.h"
 #include "intel_engine_stats.h"
@@ -136,8 +140,263 @@ static void ring_copy(struct intel_ring *dst,
memcpy(out, src->vaddr + start, end - start);
 }
 
+static void mi_set_context(struct intel_ring *ring,
+  struct intel_engine_cs *engine,
+  struct intel_context *ce,
+  u32 flags)
+{
+   struct drm_i915_private *i915 = engine->i915;
+   enum intel_engine_id id;
+   const int num_engines =
+   IS_HASWELL(i915) ? engine->gt->info.num_engines - 1 : 0;
+   int len;
+   u32 *cs;
+
+   len = 4;
+   if (IS_GEN(i915, 7))
+   len += 2 + (num_engines ? 4 * num_engines + 6 : 0);
+   else if (IS_GEN(i915, 5))
+   len += 2;
+
+   cs = ring_map_dw(ring, len);
+
+   /* WaProgramMiArbOnOffAroundMiSetContext:ivb,vlv,hsw,bdw,chv */
+   if (IS_GEN(i915, 7)) {
+   *cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
+   if (num_engines) {
+   struct intel_engine_cs *signaller;
+
+   *cs++ = MI_LOAD_REGISTER_IMM(num_engines);
+   for_each_engine(signaller, engine->gt, id) {
+   if (signaller == engine)
+   continue;
+
+   *cs++ = i915_mmio_reg_offset(
+  RING_PSMI_CTL(signaller->mmio_base));
+   *cs++ = _MASKED_BIT_ENABLE(
+   GEN6_PSMI_SLEEP_MSG_DISABLE);
+   }
+   }
+   } else if (IS_GEN(i915, 5)) {
+   /*
+* This w/a is only listed for pre-production ilk a/b steppings,
+* but is also mentioned for programming the powerctx. To be
+* safe, just apply the workaround; we do not use SyncFlush so
+* this should never take effect and so be a no-op!
+*/
+   *cs++ = MI_SUSPEND_FLUSH | MI_SUSPEND_FLUSH_EN;
+   }
+
+   *cs++ = MI_NOOP;
+   *cs++ = MI_SET_CONTEXT;
+   *cs++ = i915_ggtt_offset(ce->state) | flags;
+   /*
+* w/a: MI_SET_CONTEXT must always be followed by MI_NOOP
+* WaMiSetContext_Hang:snb,ivb,vlv
+*/
+   *cs++ = MI_NOOP;
+
+   if (IS_GEN(i915, 7)) {
+   if (num_engines) {
+   struct intel_engine_cs *signaller;
+   i915_reg_t last_reg = {}; /* keep gcc quiet */
+
+   *cs++ = MI_LOAD_REGISTER_IMM(num_engines);
+   for_each_engine(signaller, engine->gt, id) {
+   if (signaller == engine)
+   continue;
+
+   last_reg = RING_PSMI_CTL(signaller->mmio_base);
+   *cs++ = i915_mmio_reg_offset(last_reg);
+   *cs++ = _MASKED_BIT_DISABLE(
+   GEN6_PSMI_SLEEP_MSG_DISABLE);
+   }
+
+   /* Insert a delay before the next switch! */
+   *cs++ = MI_STORE_REGISTER_MEM | MI_SRM_LRM_GLOBAL_GTT;
+   *cs++ = i915_mmio_reg_offset(last_reg);
+   *cs++ = intel_gt_scratch_offset(engine->gt,
+   
INTEL_GT_SCRATCH_FIELD_DEFAULT);
+   *cs++ = MI_NOOP;
+   }
+   *cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
+   } else if (IS_GEN(i915, 5)) {
+   *cs++ = MI_SUSPEND_FLUSH;
+   }
+}
+
+static struct i915_address_space *vm_alias(struct i915_address_space *vm)
+{
+   if (i915_is_ggtt(vm))
+  

[Intel-gfx] [PATCH 57/66] drm/i915: Replace the priority boosting for the display with a deadline

2020-07-15 Thread Chris Wilson
For a modeset/pageflip, there is a very precise deadline by which the
frame must be completed in order to hit the vblank and be shown. While
we don't pass along that exact information, we can at least inform the
scheduler that this request-chain needs to be completed asap.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/display/intel_display.c |  2 +-
 drivers/gpu/drm/i915/gem/i915_gem_object.h   |  4 ++--
 drivers/gpu/drm/i915/gem/i915_gem_wait.c | 19 ++-
 drivers/gpu/drm/i915/i915_priolist_types.h   |  3 ---
 4 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/i915/display/intel_display.c 
b/drivers/gpu/drm/i915/display/intel_display.c
index c74e664a3759..1c644b613246 100644
--- a/drivers/gpu/drm/i915/display/intel_display.c
+++ b/drivers/gpu/drm/i915/display/intel_display.c
@@ -15983,7 +15983,7 @@ intel_prepare_plane_fb(struct drm_plane *_plane,
if (ret)
return ret;
 
-   i915_gem_object_wait_priority(obj, 0, I915_PRIORITY_DISPLAY);
+   i915_gem_object_wait_deadline(obj, 0, ktime_get() /* next vblank? */);
i915_gem_object_flush_frontbuffer(obj, ORIGIN_DIRTYFB);
 
if (!new_plane_state->uapi.fence) { /* implicit fencing */
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.h 
b/drivers/gpu/drm/i915/gem/i915_gem_object.h
index d916155b0c52..d2d7e6bd099d 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object.h
@@ -457,9 +457,9 @@ static inline void __start_cpu_write(struct 
drm_i915_gem_object *obj)
 int i915_gem_object_wait(struct drm_i915_gem_object *obj,
 unsigned int flags,
 long timeout);
-int i915_gem_object_wait_priority(struct drm_i915_gem_object *obj,
+int i915_gem_object_wait_deadline(struct drm_i915_gem_object *obj,
  unsigned int flags,
- int prio);
+ ktime_t deadline);
 
 void __i915_gem_object_flush_frontbuffer(struct drm_i915_gem_object *obj,
 enum fb_op_origin origin);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_wait.c 
b/drivers/gpu/drm/i915/gem/i915_gem_wait.c
index cefbbb3d9b52..3334817183f6 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_wait.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_wait.c
@@ -93,17 +93,18 @@ i915_gem_object_wait_reservation(struct dma_resv *resv,
return timeout;
 }
 
-static void __fence_set_priority(struct dma_fence *fence, int prio)
+static void __fence_set_deadline(struct dma_fence *fence, ktime_t deadline)
 {
if (dma_fence_is_signaled(fence) || !dma_fence_is_i915(fence))
return;
 
local_bh_disable();
-   i915_request_set_priority(to_request(fence), prio);
+   i915_request_set_deadline(to_request(fence),
+ i915_sched_to_ticks(deadline));
local_bh_enable(); /* kick the tasklets if queues were reprioritised */
 }
 
-static void fence_set_priority(struct dma_fence *fence, int prio)
+static void fence_set_deadline(struct dma_fence *fence, ktime_t deadline)
 {
/* Recurse once into a fence-array */
if (dma_fence_is_array(fence)) {
@@ -111,16 +112,16 @@ static void fence_set_priority(struct dma_fence *fence, 
int prio)
int i;
 
for (i = 0; i < array->num_fences; i++)
-   __fence_set_priority(array->fences[i], prio);
+   __fence_set_deadline(array->fences[i], deadline);
} else {
-   __fence_set_priority(fence, prio);
+   __fence_set_deadline(fence, deadline);
}
 }
 
 int
-i915_gem_object_wait_priority(struct drm_i915_gem_object *obj,
+i915_gem_object_wait_deadline(struct drm_i915_gem_object *obj,
  unsigned int flags,
- int prio)
+ ktime_t deadline)
 {
struct dma_fence *excl;
 
@@ -135,7 +136,7 @@ i915_gem_object_wait_priority(struct drm_i915_gem_object 
*obj,
return ret;
 
for (i = 0; i < count; i++) {
-   fence_set_priority(shared[i], prio);
+   fence_set_deadline(shared[i], deadline);
dma_fence_put(shared[i]);
}
 
@@ -145,7 +146,7 @@ i915_gem_object_wait_priority(struct drm_i915_gem_object 
*obj,
}
 
if (excl) {
-   fence_set_priority(excl, prio);
+   fence_set_deadline(excl, deadline);
dma_fence_put(excl);
}
return 0;
diff --git a/drivers/gpu/drm/i915/i915_priolist_types.h 
b/drivers/gpu/drm/i915/i915_priolist_types.h
index 43a0ac45295f..ac6d9614ea23 100644
--- a/drivers/gpu/drm/i915/i915_priolist_types.h
+++ b/drivers/gpu/drm/i915/i915_priolist_types.h
@@ -20,9 +20,6 @@ enum {
/* A preemptive pulse used to monitor the health of each en

[Intel-gfx] [PATCH 35/66] drm/i915/gt: Check for a completed last request once

2020-07-15 Thread Chris Wilson
Pull the repeated check for the last active request being completed to a
single spot, when deciding whether or not execlist preemption is
required.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 15 ---
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c 
b/drivers/gpu/drm/i915/gt/intel_lrc.c
index f52b52a7b1d3..0c478187f9ba 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2123,12 +2123,9 @@ static void execlists_dequeue(struct intel_engine_cs 
*engine)
 */
 
if ((last = *active)) {
-   if (need_preempt(engine, last, rb)) {
-   if (i915_request_completed(last)) {
-   tasklet_hi_schedule(&execlists->tasklet);
-   return;
-   }
-
+   if (i915_request_completed(last)) {
+   goto check_secondary;
+   } else if (need_preempt(engine, last, rb)) {
ENGINE_TRACE(engine,
 "preempting last=%llx:%lld, prio=%d, 
hint=%d\n",
 last->fence.context,
@@ -2156,11 +2153,6 @@ static void execlists_dequeue(struct intel_engine_cs 
*engine)
last = NULL;
} else if (need_timeslice(engine, last, rb) &&
   timeslice_expired(execlists, last)) {
-   if (i915_request_completed(last)) {
-   tasklet_hi_schedule(&execlists->tasklet);
-   return;
-   }
-
ENGINE_TRACE(engine,
 "expired last=%llx:%lld, prio=%d, hint=%d, 
yield?=%s\n",
 last->fence.context,
@@ -2196,6 +2188,7 @@ static void execlists_dequeue(struct intel_engine_cs 
*engine)
 * we hopefully coalesce several updates into a single
 * submission.
 */
+check_secondary:
if (!list_is_last(&last->sched.link,
  &engine->active.requests)) {
/*
-- 
2.20.1



[Intel-gfx] [PATCH 65/66] drm/i915/gt: Enable ring scheduling for gen6/7

2020-07-15 Thread Chris Wilson
Switch over from FIFO global submission to the priority-sorted
topological scheduler. At the cost of more busy work on the CPU to
keep the GPU supplied with the next packet of requests, this allows us
to reorder requests around submission stalls.

This also enables the timer-based RPS, with the exception of Valleyview,
whose PCU doesn't take kindly to our interference.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c | 2 +-
 drivers/gpu/drm/i915/gt/intel_engine_cs.c | 2 ++
 drivers/gpu/drm/i915/gt/intel_rps.c   | 6 ++
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c 
b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
index f2a307b4146e..55f09ab7136a 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
@@ -94,7 +94,7 @@ static int live_nop_switch(void *arg)
rq = i915_request_get(this);
i915_request_add(this);
}
-   if (i915_request_wait(rq, 0, HZ / 5) < 0) {
+   if (i915_request_wait(rq, 0, HZ) < 0) {
pr_err("Failed to populated %d contexts\n", nctx);
intel_gt_set_wedged(&i915->gt);
i915_request_put(rq);
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index df234ce10907..c9db59b9bacf 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -863,6 +863,8 @@ int intel_engines_init(struct intel_gt *gt)
 
if (HAS_EXECLISTS(gt->i915))
setup = intel_execlists_submission_setup;
+   else if (INTEL_GEN(gt->i915) >= 6)
+   setup = intel_ring_scheduler_setup;
else
setup = intel_ring_submission_setup;
 
diff --git a/drivers/gpu/drm/i915/gt/intel_rps.c 
b/drivers/gpu/drm/i915/gt/intel_rps.c
index 49910425e986..bf923df212d1 100644
--- a/drivers/gpu/drm/i915/gt/intel_rps.c
+++ b/drivers/gpu/drm/i915/gt/intel_rps.c
@@ -1052,9 +1052,7 @@ static bool gen6_rps_enable(struct intel_rps *rps)
intel_uncore_write_fw(uncore, GEN6_RP_DOWN_TIMEOUT, 5);
intel_uncore_write_fw(uncore, GEN6_RP_IDLE_HYSTERSIS, 10);
 
-   rps->pm_events = (GEN6_PM_RP_UP_THRESHOLD |
- GEN6_PM_RP_DOWN_THRESHOLD |
- GEN6_PM_RP_DOWN_TIMEOUT);
+   rps->pm_events = GEN6_PM_RP_UP_THRESHOLD | GEN6_PM_RP_DOWN_THRESHOLD;
 
return rps_reset(rps);
 }
@@ -1362,7 +1360,7 @@ void intel_rps_enable(struct intel_rps *rps)
GEM_BUG_ON(rps->efficient_freq < rps->min_freq);
GEM_BUG_ON(rps->efficient_freq > rps->max_freq);
 
-   if (has_busy_stats(rps))
+   if (has_busy_stats(rps) && !IS_VALLEYVIEW(i915))
intel_rps_set_timer(rps);
else if (INTEL_GEN(i915) >= 6)
intel_rps_set_interrupts(rps);
-- 
2.20.1



[Intel-gfx] [PATCH 06/66] drm/i915: Export a preallocate variant of i915_active_acquire()

2020-07-15 Thread Chris Wilson
Sometimes we have to be very careful not to allocate underneath a mutex
(or spinlock) and yet still want to track activity. Enter
i915_active_acquire_for_context(). This raises the activity counter on
i915_active prior to use and ensures that the fence-tree contains a slot
for the context.
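
A sketch of the intended split (illustrative only; the exact signature of
i915_active_acquire_for_context() is assumed here, and node/idx/fence/lock
are stand-ins):

        /* Outside any spinlock: may allocate, reserves a slot for idx. */
        err = i915_active_acquire_for_context(&node->active, idx);
        if (err)
                return err;

        /* Under the lock: no allocation, just fill the preallocated slot. */
        spin_lock(&lock);
        err = i915_active_ref(&node->active, idx, fence);
        spin_unlock(&lock);

        i915_active_release(&node->active);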

Signed-off-by: Chris Wilson 
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c|   2 +-
 drivers/gpu/drm/i915/gt/intel_timeline.c  |   4 +-
 drivers/gpu/drm/i915/i915_active.c| 136 +++---
 drivers/gpu/drm/i915/i915_active.h|  12 +-
 4 files changed, 126 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c 
b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 6b4ec66cb558..719ba9fe3e85 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -1729,7 +1729,7 @@ __parser_mark_active(struct i915_vma *vma,
 {
struct intel_gt_buffer_pool_node *node = vma->private;
 
-   return i915_active_ref(&node->active, tl, fence);
+   return i915_active_ref(&node->active, tl->fence_context, fence);
 }
 
 static int
diff --git a/drivers/gpu/drm/i915/gt/intel_timeline.c 
b/drivers/gpu/drm/i915/gt/intel_timeline.c
index 46d20f5f3ddc..acb43aebd669 100644
--- a/drivers/gpu/drm/i915/gt/intel_timeline.c
+++ b/drivers/gpu/drm/i915/gt/intel_timeline.c
@@ -484,7 +484,9 @@ __intel_timeline_get_seqno(struct intel_timeline *tl,
 * free it after the current request is retired, which ensures that
 * all writes into the cacheline from previous requests are complete.
 */
-   err = i915_active_ref(&tl->hwsp_cacheline->active, tl, &rq->fence);
+   err = i915_active_ref(&tl->hwsp_cacheline->active,
+ tl->fence_context,
+ &rq->fence);
if (err)
goto err_cacheline;
 
diff --git a/drivers/gpu/drm/i915/i915_active.c 
b/drivers/gpu/drm/i915/i915_active.c
index 841b5c30950a..799282fb1bb9 100644
--- a/drivers/gpu/drm/i915/i915_active.c
+++ b/drivers/gpu/drm/i915/i915_active.c
@@ -28,12 +28,14 @@ static struct i915_global_active {
 } global;
 
 struct active_node {
+   struct rb_node node;
struct i915_active_fence base;
struct i915_active *ref;
-   struct rb_node node;
u64 timeline;
 };
 
+#define fetch_node(x) rb_entry(READ_ONCE(x), typeof(struct active_node), node)
+
 static inline struct active_node *
 node_from_active(struct i915_active_fence *active)
 {
@@ -216,12 +218,40 @@ excl_retire(struct dma_fence *fence, struct dma_fence_cb 
*cb)
active_retire(container_of(cb, struct i915_active, excl.cb));
 }
 
+static struct active_node *__active_lookup(struct i915_active *ref, u64 idx)
+{
+   struct active_node *it;
+
+   it = READ_ONCE(ref->cache);
+   if (it && it->timeline == idx)
+   return it;
+
+   BUILD_BUG_ON(offsetof(typeof(*it), node));
+
+   /* While active, the tree can only be built; not destroyed */
+   GEM_BUG_ON(i915_active_is_idle(ref));
+
+   it = fetch_node(ref->tree.rb_node);
+   while (it) {
+   if (it->timeline < idx) {
+   it = fetch_node(it->node.rb_right);
+   } else if (it->timeline > idx) {
+   it = fetch_node(it->node.rb_left);
+   } else {
+   WRITE_ONCE(ref->cache, it);
+   break;
+   }
+   }
+
+   /* NB: If the tree rotated beneath us, we may miss our target. */
+   return it;
+}
+
 static struct i915_active_fence *
-active_instance(struct i915_active *ref, struct intel_timeline *tl)
+active_instance(struct i915_active *ref, u64 idx)
 {
struct active_node *node, *prealloc;
struct rb_node **p, *parent;
-   u64 idx = tl->fence_context;
 
/*
 * We track the most recently used timeline to skip a rbtree search
@@ -230,8 +260,8 @@ active_instance(struct i915_active *ref, struct 
intel_timeline *tl)
 * after the previous activity has been retired, or if it matches the
 * current timeline.
 */
-   node = READ_ONCE(ref->cache);
-   if (node && node->timeline == idx)
+   node = __active_lookup(ref, idx);
+   if (likely(node))
return &node->base;
 
/* Preallocate a replacement, just in case */
@@ -268,10 +298,9 @@ active_instance(struct i915_active *ref, struct 
intel_timeline *tl)
rb_insert_color(&node->node, &ref->tree);
 
 out:
-   ref->cache = node;
+   WRITE_ONCE(ref->cache, node);
spin_unlock_irq(&ref->tree_lock);
 
-   BUILD_BUG_ON(offsetof(typeof(*node), base));
return &node->base;
 }
 
@@ -353,21 +382,17 @@ __active_del_barrier(struct i915_active *ref, struct 
active_node *node)
return active_del_barrier(ref, node, barrier_to_engine(node));
 }
 
-int i915_active_ref(struct i915_active *ref,
-   

[Intel-gfx] [PATCH 66/66] drm/i915/gem: Remove timeline nesting from snb relocs

2020-07-15 Thread Chris Wilson
As snb is the only one to require an alternative engine for performing
relocations, we know that we can reuse a common timeline between
engines.

Signed-off-by: Chris Wilson 
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c| 22 +--
 1 file changed, 5 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c 
b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index d9f1403ddfa4..28f5c28a9449 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -1965,16 +1965,9 @@ nested_request_create(struct intel_context *ce, struct 
i915_execbuffer *eb)
 {
struct i915_request *rq;
 
-   /* XXX This only works once; replace with shared timeline */
-   if (ce->timeline != eb->context->timeline)
-   mutex_lock_nested(&ce->timeline->mutex, SINGLE_DEPTH_NESTING);
intel_context_enter(ce);
-
rq = __i915_request_create(ce, GFP_KERNEL);
-
intel_context_exit(ce);
-   if (IS_ERR(rq) && ce->timeline != eb->context->timeline)
-   mutex_unlock(&ce->timeline->mutex);
 
return rq;
 }
@@ -2021,9 +2014,6 @@ reloc_gpu_flush(struct i915_execbuffer *eb, struct 
i915_request *rq, int err)
intel_gt_chipset_flush(rq->engine->gt);
__i915_request_add(rq, &eb->gem_context->sched);
 
-   if (i915_request_timeline(rq) != eb->context->timeline)
-   mutex_unlock(&i915_request_timeline(rq)->mutex);
-
return err;
 }
 
@@ -2426,10 +2416,7 @@ static struct i915_request *reloc_gpu_alloc(struct 
i915_execbuffer *eb)
struct reloc_cache *cache = &eb->reloc_cache;
struct i915_request *rq;
 
-   if (cache->ce == eb->context)
-   rq = __i915_request_create(cache->ce, GFP_KERNEL);
-   else
-   rq = nested_request_create(cache->ce, eb);
+   rq = nested_request_create(cache->ce, eb);
if (IS_ERR(rq))
return rq;
 
@@ -2968,13 +2955,14 @@ static int __eb_pin_reloc_engine(struct i915_execbuffer 
*eb)
if (!engine)
return -ENODEV;
 
+   if (!intel_engine_has_scheduler(engine))
+   return -ENODEV;
+
ce = intel_context_create(engine);
if (IS_ERR(ce))
return PTR_ERR(ce);
 
-   /* Reuse eb->context->timeline with scheduler! */
-   if (intel_engine_has_scheduler(engine))
-   ce->timeline = intel_timeline_get(eb->context->timeline);
+   ce->timeline = intel_timeline_get(eb->context->timeline);
 
i915_vm_put(ce->vm);
ce->vm = i915_vm_get(eb->context->vm);
-- 
2.20.1



[Intel-gfx] [PATCH 49/66] drm/i915: Remove I915_USER_PRIORITY_SHIFT

2020-07-15 Thread Chris Wilson
As we do not have any internal priority levels, the priority can be set
directly from the user values.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/display/intel_display.c  |  4 +-
 drivers/gpu/drm/i915/gem/i915_gem_context.c   |  6 +--
 .../i915/gem/selftests/i915_gem_object_blt.c  |  4 +-
 .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |  6 +--
 drivers/gpu/drm/i915/gt/selftest_lrc.c| 44 +++
 drivers/gpu/drm/i915/i915_priolist_types.h|  3 --
 drivers/gpu/drm/i915/i915_scheduler.c |  1 -
 7 files changed, 23 insertions(+), 45 deletions(-)

diff --git a/drivers/gpu/drm/i915/display/intel_display.c 
b/drivers/gpu/drm/i915/display/intel_display.c
index 729ec6e0d43a..b1120d49d44e 100644
--- a/drivers/gpu/drm/i915/display/intel_display.c
+++ b/drivers/gpu/drm/i915/display/intel_display.c
@@ -15909,9 +15909,7 @@ static void intel_plane_unpin_fb(struct 
intel_plane_state *old_plane_state)
 
 static void fb_obj_bump_render_priority(struct drm_i915_gem_object *obj)
 {
-   struct i915_sched_attr attr = {
-   .priority = I915_USER_PRIORITY(I915_PRIORITY_DISPLAY),
-   };
+   struct i915_sched_attr attr = { .priority = I915_PRIORITY_DISPLAY };
 
i915_gem_object_wait_priority(obj, 0, &attr);
 }
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c 
b/drivers/gpu/drm/i915/gem/i915_gem_context.c
index 901b2f5614ea..e30f7dbc5700 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c
@@ -713,7 +713,7 @@ __create_context(struct drm_i915_private *i915)
 
kref_init(&ctx->ref);
ctx->i915 = i915;
-   ctx->sched.priority = I915_USER_PRIORITY(I915_PRIORITY_NORMAL);
+   ctx->sched.priority = I915_PRIORITY_NORMAL;
mutex_init(&ctx->mutex);
 
ctx->async.width = rounddown_pow_of_two(num_online_cpus());
@@ -2002,7 +2002,7 @@ static int set_priority(struct i915_gem_context *ctx,
!capable(CAP_SYS_NICE))
return -EPERM;
 
-   ctx->sched.priority = I915_USER_PRIORITY(priority);
+   ctx->sched.priority = priority;
context_apply_all(ctx, __apply_priority, ctx);
 
return 0;
@@ -2505,7 +2505,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device 
*dev, void *data,
 
case I915_CONTEXT_PARAM_PRIORITY:
args->size = 0;
-   args->value = ctx->sched.priority >> I915_USER_PRIORITY_SHIFT;
+   args->value = ctx->sched.priority;
break;
 
case I915_CONTEXT_PARAM_SSEU:
diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_object_blt.c 
b/drivers/gpu/drm/i915/gem/selftests/i915_gem_object_blt.c
index 23b6e11bbc3e..c4c04fb97d14 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_object_blt.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_object_blt.c
@@ -220,7 +220,7 @@ static int igt_fill_blt_thread(void *arg)
return PTR_ERR(ctx);
 
prio = i915_prandom_u32_max_state(I915_PRIORITY_MAX, prng);
-   ctx->sched.priority = I915_USER_PRIORITY(prio);
+   ctx->sched.priority = prio;
}
 
ce = i915_gem_context_get_engine(ctx, 0);
@@ -338,7 +338,7 @@ static int igt_copy_blt_thread(void *arg)
return PTR_ERR(ctx);
 
prio = i915_prandom_u32_max_state(I915_PRIORITY_MAX, prng);
-   ctx->sched.priority = I915_USER_PRIORITY(prio);
+   ctx->sched.priority = prio;
}
 
ce = i915_gem_context_get_engine(ctx, 0);
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c 
b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index addab2d922b7..58a5c43156f4 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -69,9 +69,7 @@ static void show_heartbeat(const struct i915_request *rq,
 
 static void heartbeat(struct work_struct *wrk)
 {
-   struct i915_sched_attr attr = {
-   .priority = I915_USER_PRIORITY(I915_PRIORITY_MIN),
-   };
+   struct i915_sched_attr attr = { .priority = I915_PRIORITY_MIN };
struct intel_engine_cs *engine =
container_of(wrk, typeof(*engine), heartbeat.work.work);
struct intel_context *ce = engine->kernel_context;
@@ -115,7 +113,7 @@ static void heartbeat(struct work_struct *wrk)
 */
attr.priority = 0;
if (rq->sched.attr.priority >= attr.priority)
-   attr.priority |= 
I915_USER_PRIORITY(I915_PRIORITY_HEARTBEAT);
+   attr.priority = I915_PRIORITY_HEARTBEAT;
if (rq->sched.attr.priority >= attr.priority)
attr.priority = I915_PRIORITY_BARRIER;
 
diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c 
b/drivers/gpu/drm/i915/gt/selftest_lrc.c
index 3843c69ac8a3..8a395b885b54 100644
--- a/drivers/gpu/drm/i

[Intel-gfx] [PATCH 32/66] drm/i915/gt: Push the wait for the context to bound to the request

2020-07-15 Thread Chris Wilson
Rather than synchronously waiting for the context to be bound within
intel_context_pin(), we can track the pending completion of the bind
fence and only submit requests along the context once it is signaled.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/Makefile  |  1 +
 drivers/gpu/drm/i915/gt/intel_context.c| 80 +-
 drivers/gpu/drm/i915/gt/intel_context.h|  6 ++
 drivers/gpu/drm/i915/i915_active.h |  1 -
 drivers/gpu/drm/i915/i915_request.c|  4 ++
 drivers/gpu/drm/i915/i915_sw_fence_await.c | 62 +
 drivers/gpu/drm/i915/i915_sw_fence_await.h | 19 +
 7 files changed, 140 insertions(+), 33 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/i915_sw_fence_await.c
 create mode 100644 drivers/gpu/drm/i915/i915_sw_fence_await.h

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index a3a4c8a555ec..2cf54db8b847 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -61,6 +61,7 @@ i915-y += \
i915_memcpy.o \
i915_mm.o \
i915_sw_fence.o \
+   i915_sw_fence_await.o \
i915_sw_fence_work.o \
i915_syncmap.o \
i915_user_extensions.o
diff --git a/drivers/gpu/drm/i915/gt/intel_context.c 
b/drivers/gpu/drm/i915/gt/intel_context.c
index 2f1606365f63..9ba1c15114d7 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -10,6 +10,7 @@
 
 #include "i915_drv.h"
 #include "i915_globals.h"
+#include "i915_sw_fence_await.h"
 
 #include "intel_context.h"
 #include "intel_engine.h"
@@ -94,27 +95,6 @@ static void intel_context_active_release(struct 
intel_context *ce)
i915_active_release(&ce->active);
 }
 
-static int __intel_context_sync(struct intel_context *ce)
-{
-   int err;
-
-   err = i915_vma_wait_for_bind(ce->ring->vma);
-   if (err)
-   return err;
-
-   err = i915_vma_wait_for_bind(ce->timeline->hwsp_ggtt);
-   if (err)
-   return err;
-
-   if (ce->state) {
-   err = i915_vma_wait_for_bind(ce->state);
-   if (err)
-   return err;
-   }
-
-   return 0;
-}
-
 int __intel_context_do_pin(struct intel_context *ce)
 {
int err;
@@ -140,10 +120,6 @@ int __intel_context_do_pin(struct intel_context *ce)
}
 
if (likely(!atomic_add_unless(&ce->pin_count, 1, 0))) {
-   err = __intel_context_sync(ce);
-   if (unlikely(err))
-   goto out_unlock;
-
err = intel_context_active_acquire(ce);
if (unlikely(err))
goto out_unlock;
@@ -301,31 +277,71 @@ intel_context_acquire_lock(struct intel_context *ce,
return 0;
 }
 
+static int await_bind(struct dma_fence_await *fence, struct i915_vma *vma)
+{
+   struct dma_fence *bind;
+   int err = 0;
+
+   bind = i915_active_fence_get(&vma->active.excl);
+   if (bind) {
+   err = i915_sw_fence_await_dma_fence(&fence->await, bind,
+   0, GFP_KERNEL);
+   dma_fence_put(bind);
+   }
+
+   return err;
+}
+
 static int intel_context_active_locked(struct intel_context *ce)
 {
+   struct dma_fence_await *fence;
int err;
 
+   fence = dma_fence_await_create(GFP_KERNEL);
+   if (!fence)
+   return -ENOMEM;
+
err = __ring_active_locked(ce->ring);
if (err)
-   return err;
+   goto out_fence;
+
+   err = await_bind(fence, ce->ring->vma);
+   if (err < 0)
+   goto err_ring;
 
err = intel_timeline_pin_locked(ce->timeline);
if (err)
goto err_ring;
 
-   if (!ce->state)
-   return 0;
-
-   err = __context_active_locked(ce->state);
-   if (err)
+   err = await_bind(fence, ce->timeline->hwsp_ggtt);
+   if (err < 0)
goto err_timeline;
 
-   return 0;
+   if (ce->state) {
+   err = __context_active_locked(ce->state);
+   if (err)
+   goto err_timeline;
+
+   err = await_bind(fence, ce->state);
+   if (err < 0)
+   goto err_state;
+   }
+
+   /* Must be the last action as it *releases* the ce->active */
+   if (atomic_read(&fence->await.pending) > 1)
+   i915_active_set_exclusive(&ce->active, &fence->dma);
+
+   err = 0;
+   goto out_fence;
 
+err_state:
+   __context_unpin_state(ce->state);
 err_timeline:
intel_timeline_unpin(ce->timeline);
 err_ring:
__ring_retire(ce->ring);
+out_fence:
+   i915_sw_fence_commit(&fence->await);
return err;
 }
 
diff --git a/drivers/gpu/drm/i915/gt/intel_context.h 
b/drivers/gpu/drm/i915/gt/intel_context.h
index 07be021882cc..f48df2784a6c 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gp

[Intel-gfx] [PATCH 10/66] drm/i915: Soften the tasklet flush frequency before waits

2020-07-15 Thread Chris Wilson
We include a tasklet flush before waiting on a request as a precaution
against the HW being lax in event signaling. We now have a precautionary
flush in the engine's heartbeat and so do not need to be quite so
zealous on every request wait. If we focus on the request, the only
tasklet flush that matters is one that addresses a delay in submitting
this request to HW; if the request is not yet ready to be executed,
running the tasklet cannot shorten this wait. And there is little point
in doing busy work for no result.

Signed-off-by: Chris Wilson 
Cc: Mika Kuoppala 
---
 drivers/gpu/drm/i915/i915_request.c | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_request.c 
b/drivers/gpu/drm/i915/i915_request.c
index 29b5e71307e3..f58beff5e859 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1760,14 +1760,30 @@ long i915_request_wait(struct i915_request *rq,
if (dma_fence_add_callback(&rq->fence, &wait.cb, request_wait_wake))
goto out;
 
+   /*
+* Flush the submission tasklet, but only if it may help this request.
+*
+* We sometimes experience some latency between the HW interrupts and
+* tasklet execution (mostly due to ksoftirqd latency, but it can also
+* be due to lazy CS events), so lets run the tasklet manually if there
+* is a chance it may submit this request. If the request is not ready
+* to run, as it is waiting for other fences to be signaled, flushing
+* the tasklet is busy work without any advantage for this client.
+*
+* If the HW is being lazy, this is the last chance before we go to
+* sleep to catch any pending events. We will check periodically in
+* the heartbeat to flush the submission tasklets as a last resort
+* for unhappy HW.
+*/
+   if (i915_request_is_ready(rq))
+   intel_engine_flush_submission(rq->engine);
+
for (;;) {
set_current_state(state);
 
if (dma_fence_is_signaled(&rq->fence))
break;
 
-   intel_engine_flush_submission(rq->engine);
-
if (signal_pending_state(state, current)) {
timeout = -ERESTARTSYS;
break;
-- 
2.20.1



[Intel-gfx] [PATCH 26/66] drm/i915: Add an implementation for i915_gem_ww_ctx locking, v2.

2020-07-15 Thread Chris Wilson
From: Maarten Lankhorst 

i915_gem_ww_ctx is used to lock all gem bo's for pinning and memory
eviction. We don't use it yet, but let's start adding the definition
first.

To use it, we have to pass a non-NULL ww to gem_object_lock, and must
not unlock directly; that is done in i915_gem_ww_ctx_fini.

Changes since v1:
- Change ww_ctx and obj order in locking functions (Jonas Lahtinen)
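
A minimal usage sketch of the acquire context introduced below (obj_a and
obj_b are stand-ins; i915_acquire_ctx_fini() is assumed to exist as the
counterpart of _init() and is not visible in the quoted hunks):

        struct i915_acquire_ctx acquire;
        int err;

        i915_acquire_ctx_init(&acquire);

        /*
         * Lock every object up front; on ww contention (-EDEADLK) the
         * helper backs off, relocks what it already held, and continues.
         */
        err = i915_acquire_ctx_lock(&acquire, obj_a);
        if (!err)
                err = i915_acquire_ctx_lock(&acquire, obj_b);

        if (!err) {
                /* ... pin/bind while all objects are locked ... */
        }

        i915_acquire_ctx_fini(&acquire); /* assumed: unlocks and drops refs */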

Signed-off-by: Maarten Lankhorst 
---
 drivers/gpu/drm/i915/Makefile |   4 +
 drivers/gpu/drm/i915/i915_globals.c   |   1 +
 drivers/gpu/drm/i915/i915_globals.h   |   1 +
 drivers/gpu/drm/i915/mm/i915_acquire_ctx.c| 139 ++
 drivers/gpu/drm/i915/mm/i915_acquire_ctx.h|  34 +++
 drivers/gpu/drm/i915/mm/st_acquire_ctx.c  | 242 ++
 .../drm/i915/selftests/i915_mock_selftests.h  |   1 +
 7 files changed, 422 insertions(+)
 create mode 100644 drivers/gpu/drm/i915/mm/i915_acquire_ctx.c
 create mode 100644 drivers/gpu/drm/i915/mm/i915_acquire_ctx.h
 create mode 100644 drivers/gpu/drm/i915/mm/st_acquire_ctx.c

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index bda4c0e408f8..a3a4c8a555ec 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -125,6 +125,10 @@ gt-y += \
gt/gen9_renderstate.o
 i915-y += $(gt-y)
 
+# Memory + DMA management
+i915-y += \
+   mm/i915_acquire_ctx.o
+
 # GEM (Graphics Execution Management) code
 gem-y += \
gem/i915_gem_busy.o \
diff --git a/drivers/gpu/drm/i915/i915_globals.c 
b/drivers/gpu/drm/i915/i915_globals.c
index 3aa213684293..51ec42a14694 100644
--- a/drivers/gpu/drm/i915/i915_globals.c
+++ b/drivers/gpu/drm/i915/i915_globals.c
@@ -87,6 +87,7 @@ static void __i915_globals_cleanup(void)
 
 static __initconst int (* const initfn[])(void) = {
i915_global_active_init,
+   i915_global_acquire_init,
i915_global_buddy_init,
i915_global_context_init,
i915_global_gem_context_init,
diff --git a/drivers/gpu/drm/i915/i915_globals.h 
b/drivers/gpu/drm/i915/i915_globals.h
index b2f5cd9b9b1a..11227abf2769 100644
--- a/drivers/gpu/drm/i915/i915_globals.h
+++ b/drivers/gpu/drm/i915/i915_globals.h
@@ -27,6 +27,7 @@ void i915_globals_exit(void);
 
 /* constructors */
 int i915_global_active_init(void);
+int i915_global_acquire_init(void);
 int i915_global_buddy_init(void);
 int i915_global_context_init(void);
 int i915_global_gem_context_init(void);
diff --git a/drivers/gpu/drm/i915/mm/i915_acquire_ctx.c 
b/drivers/gpu/drm/i915/mm/i915_acquire_ctx.c
new file mode 100644
index ..d1c3b958c15d
--- /dev/null
+++ b/drivers/gpu/drm/i915/mm/i915_acquire_ctx.c
@@ -0,0 +1,139 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2020 Intel Corporation
+ */
+
+#include 
+
+#include "i915_globals.h"
+#include "gem/i915_gem_object.h"
+
+#include "i915_acquire_ctx.h"
+
+static struct i915_global_acquire {
+   struct i915_global base;
+   struct kmem_cache *slab_acquires;
+} global;
+
+struct i915_acquire {
+   struct drm_i915_gem_object *obj;
+   struct i915_acquire *next;
+};
+
+static struct i915_acquire *i915_acquire_alloc(void)
+{
+   return kmem_cache_alloc(global.slab_acquires, GFP_KERNEL);
+}
+
+static void i915_acquire_free(struct i915_acquire *lnk)
+{
+   kmem_cache_free(global.slab_acquires, lnk);
+}
+
+void i915_acquire_ctx_init(struct i915_acquire_ctx *ctx)
+{
+   ww_acquire_init(&ctx->ctx, &reservation_ww_class);
+   ctx->locked = NULL;
+}
+
+int i915_acquire_ctx_lock(struct i915_acquire_ctx *ctx,
+ struct drm_i915_gem_object *obj)
+{
+   struct i915_acquire *lock, *lnk;
+   int err;
+
+   lock = i915_acquire_alloc();
+   if (!lock)
+   return -ENOMEM;
+
+   lock->obj = i915_gem_object_get(obj);
+   lock->next = NULL;
+
+   while ((lnk = lock)) {
+   obj = lnk->obj;
+   lock = lnk->next;
+
+   err = dma_resv_lock_interruptible(obj->base.resv, &ctx->ctx);
+   if (err == -EDEADLK) {
+   struct i915_acquire *old;
+
+   while ((old = ctx->locked)) {
+   i915_gem_object_unlock(old->obj);
+   ctx->locked = old->next;
+   old->next = lock;
+   lock = old;
+   }
+
+   err = dma_resv_lock_slow_interruptible(obj->base.resv,
+  &ctx->ctx);
+   }
+   if (!err) {
+   lnk->next = ctx->locked;
+   ctx->locked = lnk;
+   } else {
+   i915_gem_object_put(obj);
+   i915_acquire_free(lnk);
+   }
+   if (err == -EALREADY)
+   err = 0;
+   if (err)
+   break;
+   }
+
+   w

[Intel-gfx] [PATCH 43/66] drm/i915/gt: ce->inflight updates are now serialised

2020-07-15 Thread Chris Wilson
Since schedule-in and schedule-out are now both always under the tasklet
bitlock, we can reduce the individual atomic operations to simple
instructions and worry less.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 44 +
 1 file changed, 19 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c 
b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 0020fc77b3da..a59332f28cd3 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1332,7 +1332,7 @@ __execlists_schedule_in(struct i915_request *rq)
unsigned int tag = ffs(READ_ONCE(engine->context_tag));
 
GEM_BUG_ON(tag == 0 || tag >= BITS_PER_LONG);
-   clear_bit(tag - 1, &engine->context_tag);
+   __clear_bit(tag - 1, &engine->context_tag);
ce->lrc.ccid = tag << (GEN11_SW_CTX_ID_SHIFT - 32);
 
BUILD_BUG_ON(BITS_PER_LONG > GEN12_MAX_CONTEXT_HW_ID);
@@ -1359,13 +1359,10 @@ static inline void execlists_schedule_in(struct 
i915_request *rq, int idx)
GEM_BUG_ON(!intel_engine_pm_is_awake(rq->engine));
trace_i915_request_in(rq, idx);
 
-   old = READ_ONCE(ce->inflight);
-   do {
-   if (!old) {
-   WRITE_ONCE(ce->inflight, __execlists_schedule_in(rq));
-   break;
-   }
-   } while (!try_cmpxchg(&ce->inflight, &old, ptr_inc(old)));
+   old = ce->inflight;
+   if (!old)
+   old = __execlists_schedule_in(rq);
+   WRITE_ONCE(ce->inflight, ptr_inc(old));
 
GEM_BUG_ON(intel_context_inflight(ce) != rq->engine);
 }
@@ -1403,12 +1400,11 @@ static void kick_siblings(struct i915_request *rq, 
struct intel_context *ce)
resubmit_virtual_request(rq, ve);
 }
 
-static inline void
-__execlists_schedule_out(struct i915_request *rq,
-struct intel_engine_cs * const engine,
-unsigned int ccid)
+static inline void __execlists_schedule_out(struct i915_request *rq)
 {
struct intel_context * const ce = rq->context;
+   struct intel_engine_cs * const engine = rq->engine;
+   unsigned int ccid;
 
/*
 * NB process_csb() is not under the engine->active.lock and hence
@@ -1416,7 +1412,7 @@ __execlists_schedule_out(struct i915_request *rq,
 * refrain from doing non-trivial work here.
 */
 
-   CE_TRACE(ce, "schedule-out, ccid:%x\n", ccid);
+   CE_TRACE(ce, "schedule-out, ccid:%x\n", ce->lrc.ccid);
 
/*
 * If we have just completed this context, the engine may now be
@@ -1426,12 +1422,13 @@ __execlists_schedule_out(struct i915_request *rq,
i915_request_completed(rq))
intel_engine_add_retire(engine, ce->timeline);
 
+   ccid = ce->lrc.ccid;
ccid >>= GEN11_SW_CTX_ID_SHIFT - 32;
ccid &= GEN12_MAX_CONTEXT_HW_ID;
if (ccid < BITS_PER_LONG) {
GEM_BUG_ON(ccid == 0);
GEM_BUG_ON(test_bit(ccid - 1, &engine->context_tag));
-   set_bit(ccid - 1, &engine->context_tag);
+   __set_bit(ccid - 1, &engine->context_tag);
}
 
intel_context_update_runtime(ce);
@@ -1452,26 +1449,23 @@ __execlists_schedule_out(struct i915_request *rq,
 */
if (ce->engine != engine)
kick_siblings(rq, ce);
-
-   intel_context_put(ce);
 }
 
 static inline void
 execlists_schedule_out(struct i915_request *rq)
 {
struct intel_context * const ce = rq->context;
-   struct intel_engine_cs *cur, *old;
-   u32 ccid;
 
trace_i915_request_out(rq);
 
-   ccid = rq->context->lrc.ccid;
-   old = READ_ONCE(ce->inflight);
-   do
-   cur = ptr_unmask_bits(old, 2) ? ptr_dec(old) : NULL;
-   while (!try_cmpxchg(&ce->inflight, &old, cur));
-   if (!cur)
-   __execlists_schedule_out(rq, old, ccid);
+   GEM_BUG_ON(!ce->inflight);
+   ce->inflight = ptr_dec(ce->inflight);
+   if (!intel_context_inflight_count(ce)) {
+   GEM_BUG_ON(ce->inflight != rq->engine);
+   __execlists_schedule_out(rq);
+   WRITE_ONCE(ce->inflight, NULL);
+   intel_context_put(ce);
+   }
 
i915_request_put(rq);
 }
-- 
2.20.1



[Intel-gfx] [PATCH 53/66] drm/i915: Restructure priority inheritance

2020-07-15 Thread Chris Wilson
In anticipation of wanting to be able to call pi from underneath an
engine's active.lock, rework the priority inheritance to primarily work
along an engine's priority queue, delegating any other engine that the
chain may traverse to a worker. This reduces the global spinlock from
governing the entire priority inheritance depth-first search, to a small
lock around a single list.
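
The lock_engine_irqsave() helper added below is then used roughly like
this (sketch; the body of the critical section is illustrative):

        struct intel_engine_cs *engine;
        unsigned long flags;

        engine = lock_engine_irqsave(rq, flags);
        /*
         * rq->engine is now stable under engine->active.lock, so the
         * request can be moved along this engine's priority lists,
         * e.g. __i915_request_set_priority(rq, prio).
         */
        spin_unlock_irqrestore(&engine->active.lock, flags);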

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/i915_scheduler.c   | 277 +++-
 drivers/gpu/drm/i915/i915_scheduler_types.h |   6 +-
 2 files changed, 162 insertions(+), 121 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_scheduler.c 
b/drivers/gpu/drm/i915/i915_scheduler.c
index 2e4d512e61d8..3f261b4fee66 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.c
+++ b/drivers/gpu/drm/i915/i915_scheduler.c
@@ -17,7 +17,65 @@ static struct i915_global_scheduler {
struct kmem_cache *slab_priorities;
 } global;
 
-static DEFINE_SPINLOCK(schedule_lock);
+static DEFINE_SPINLOCK(ipi_lock);
+static LIST_HEAD(ipi_list);
+
+static inline int rq_prio(const struct i915_request *rq)
+{
+   return READ_ONCE(rq->sched.attr.priority);
+}
+
+static void ipi_schedule(struct irq_work *wrk)
+{
+   rcu_read_lock();
+   do {
+   struct i915_dependency *p;
+   struct i915_request *rq;
+   unsigned long flags;
+   int prio;
+
+   spin_lock_irqsave(&ipi_lock, flags);
+   p = list_first_entry_or_null(&ipi_list, typeof(*p), ipi_link);
+   if (p) {
+   rq = container_of(p->signaler, typeof(*rq), sched);
+   list_del_init(&p->ipi_link);
+
+   prio = p->ipi_priority;
+   p->ipi_priority = I915_PRIORITY_INVALID;
+   }
+   spin_unlock_irqrestore(&ipi_lock, flags);
+   if (!p)
+   break;
+
+   if (i915_request_completed(rq))
+   continue;
+
+   i915_request_set_priority(rq, prio);
+   } while (1);
+   rcu_read_unlock();
+}
+
+static DEFINE_IRQ_WORK(ipi_work, ipi_schedule);
+
+/*
+ * Virtual engines complicate acquiring the engine timeline lock,
+ * as their rq->engine pointer is not stable until under that
+ * engine lock. The simple ploy we use is to take the lock then
+ * check that the rq still belongs to the newly locked engine.
+ */
+#define lock_engine_irqsave(rq, flags) ({ \
+   struct i915_request * const rq__ = (rq); \
+   struct intel_engine_cs *engine__ = READ_ONCE(rq__->engine); \
+\
+   spin_lock_irqsave(&engine__->active.lock, (flags)); \
+   while (engine__ != READ_ONCE((rq__)->engine)) { \
+   spin_unlock(&engine__->active.lock); \
+   engine__ = READ_ONCE(rq__->engine); \
+   spin_lock(&engine__->active.lock); \
+   } \
+\
+   engine__; \
+})
 
 static const struct i915_request *
 node_to_request(const struct i915_sched_node *node)
@@ -126,42 +184,6 @@ void __i915_priolist_free(struct i915_priolist *p)
kmem_cache_free(global.slab_priorities, p);
 }
 
-struct sched_cache {
-   struct list_head *priolist;
-};
-
-static struct intel_engine_cs *
-sched_lock_engine(const struct i915_sched_node *node,
- struct intel_engine_cs *locked,
- struct sched_cache *cache)
-{
-   const struct i915_request *rq = node_to_request(node);
-   struct intel_engine_cs *engine;
-
-   GEM_BUG_ON(!locked);
-
-   /*
-* Virtual engines complicate acquiring the engine timeline lock,
-* as their rq->engine pointer is not stable until under that
-* engine lock. The simple ploy we use is to take the lock then
-* check that the rq still belongs to the newly locked engine.
-*/
-   while (locked != (engine = READ_ONCE(rq->engine))) {
-   spin_unlock(&locked->active.lock);
-   memset(cache, 0, sizeof(*cache));
-   spin_lock(&engine->active.lock);
-   locked = engine;
-   }
-
-   GEM_BUG_ON(locked != engine);
-   return locked;
-}
-
-static inline int rq_prio(const struct i915_request *rq)
-{
-   return rq->sched.attr.priority;
-}
-
 static inline bool need_preempt(int prio, int active)
 {
/*
@@ -216,25 +238,15 @@ static void kick_submission(struct intel_engine_cs 
*engine,
rcu_read_unlock();
 }
 
-static void __i915_schedule(struct i915_sched_node *node, int prio)
+static void __i915_request_set_priority(struct i915_request *rq, int prio)
 {
-   struct intel_engine_cs *engine;
-   struct i915_dependency *dep, *p;
-   struct i915_dependency stack;
-   struct sched_cache cache;
+   struct intel_engine_cs *engine = rq->engine;
+   struct i915_request *rn;
+   struct list_head *plist;
LIST_HEAD(dfs);
 
-   /* Needed in order to use the temporary link inside i915_dependency */
-   lockdep_assert

[Intel-gfx] [PATCH 17/66] drm/i915: Add list_for_each_entry_safe_continue_reverse

2020-07-15 Thread Chris Wilson
One more list iterator variant, for when we want to unwind from inside
one list iterator with the intention of restarting from the current
entry as the new head of the list.
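
For illustration only (submit() and unsubmit() are stand-ins), the
intended unwind pattern looks roughly like:

        struct i915_request *rq, *rn;

        list_for_each_entry(rq, &engine->active.requests, sched.link) {
                if (likely(submit(rq)))
                        continue;

                /*
                 * Unwind: walk back over the entries already handled,
                 * starting from the one before rq, so that rq becomes
                 * the effective head when we restart later.
                 */
                list_for_each_entry_safe_continue_reverse(rq, rn,
                                &engine->active.requests, sched.link)
                        unsubmit(rq);
                break;
        }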

Signed-off-by: Chris Wilson 
Reviewed-by: Tvrtko Ursulin 
---
 drivers/gpu/drm/i915/i915_utils.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_utils.h 
b/drivers/gpu/drm/i915/i915_utils.h
index 54773371e6bd..32cecd8583b1 100644
--- a/drivers/gpu/drm/i915/i915_utils.h
+++ b/drivers/gpu/drm/i915/i915_utils.h
@@ -266,6 +266,12 @@ static inline int list_is_last_rcu(const struct list_head 
*list,
return READ_ONCE(list->next) == head;
 }
 
+#define list_for_each_entry_safe_continue_reverse(pos, n, head, member)
\
+   for (pos = list_prev_entry(pos, member),\
+n = list_prev_entry(pos, member);  \
+&pos->member != (head);\
+pos = n, n = list_prev_entry(n, member))
+
 static inline unsigned long msecs_to_jiffies_timeout(const unsigned int m)
 {
unsigned long j = msecs_to_jiffies(m);
-- 
2.20.1



[Intel-gfx] [PATCH 62/66] drm/i915/gt: Use client timeline address for seqno writes

2020-07-15 Thread Chris Wilson
If we allow for per-client timelines, even with legacy ring submission,
we open the door to a world full of possibilities [scheduling and
semaphores].

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gt/gen6_engine_cs.c | 21 +
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/gen6_engine_cs.c 
b/drivers/gpu/drm/i915/gt/gen6_engine_cs.c
index ce38d1bcaba3..fa11174bb13b 100644
--- a/drivers/gpu/drm/i915/gt/gen6_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/gen6_engine_cs.c
@@ -373,11 +373,10 @@ u32 *gen7_emit_breadcrumb_rcs(struct i915_request *rq, 
u32 *cs)
 
 u32 *gen6_emit_breadcrumb_xcs(struct i915_request *rq, u32 *cs)
 {
-   GEM_BUG_ON(i915_request_active_timeline(rq)->hwsp_ggtt != 
rq->engine->status_page.vma);
-   
GEM_BUG_ON(offset_in_page(i915_request_active_timeline(rq)->hwsp_offset) != 
I915_GEM_HWS_SEQNO_ADDR);
+   u32 addr = i915_request_active_timeline(rq)->hwsp_offset;
 
-   *cs++ = MI_FLUSH_DW | MI_FLUSH_DW_OP_STOREDW | MI_FLUSH_DW_STORE_INDEX;
-   *cs++ = I915_GEM_HWS_SEQNO_ADDR | MI_FLUSH_DW_USE_GTT;
+   *cs++ = MI_FLUSH_DW | MI_FLUSH_DW_OP_STOREDW;
+   *cs++ = addr | MI_FLUSH_DW_USE_GTT;
*cs++ = rq->fence.seqno;
 
*cs++ = MI_USER_INTERRUPT;
@@ -391,19 +390,17 @@ u32 *gen6_emit_breadcrumb_xcs(struct i915_request *rq, 
u32 *cs)
 #define GEN7_XCS_WA 32
 u32 *gen7_emit_breadcrumb_xcs(struct i915_request *rq, u32 *cs)
 {
+   u32 addr = i915_request_active_timeline(rq)->hwsp_offset;
int i;
 
-   GEM_BUG_ON(i915_request_active_timeline(rq)->hwsp_ggtt != 
rq->engine->status_page.vma);
-   
GEM_BUG_ON(offset_in_page(i915_request_active_timeline(rq)->hwsp_offset) != 
I915_GEM_HWS_SEQNO_ADDR);
-
-   *cs++ = MI_FLUSH_DW | MI_INVALIDATE_TLB |
-   MI_FLUSH_DW_OP_STOREDW | MI_FLUSH_DW_STORE_INDEX;
-   *cs++ = I915_GEM_HWS_SEQNO_ADDR | MI_FLUSH_DW_USE_GTT;
+   *cs++ = MI_FLUSH_DW | MI_FLUSH_DW_OP_STOREDW;
+   *cs++ = addr | MI_FLUSH_DW_USE_GTT;
*cs++ = rq->fence.seqno;
 
for (i = 0; i < GEN7_XCS_WA; i++) {
-   *cs++ = MI_STORE_DWORD_INDEX;
-   *cs++ = I915_GEM_HWS_SEQNO_ADDR;
+   *cs++ = MI_STORE_DWORD_IMM_GEN4 | MI_USE_GGTT;
+   *cs++ = 0;
+   *cs++ = addr;
*cs++ = rq->fence.seqno;
}
 
-- 
2.20.1



[Intel-gfx] [PATCH 59/66] Restore "drm/i915: drop engine_pin/unpin_breadcrumbs_irq"

2020-07-15 Thread Chris Wilson
This was removed in commit 478ffad6d690 ("drm/i915: drop
engine_pin/unpin_breadcrumbs_irq") as the last user had been removed,
but now there is a promise of a new user in the next patch.

Signed-off-by: Chris Wilson 
Reviewed-by: Mika Kuoppala 
---
 drivers/gpu/drm/i915/gt/intel_breadcrumbs.c | 22 +
 drivers/gpu/drm/i915/gt/intel_engine.h  |  3 +++
 2 files changed, 25 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c 
b/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
index 87fd06d3eb3f..5a7a4853cbba 100644
--- a/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
+++ b/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
@@ -220,6 +220,28 @@ static void signal_irq_work(struct irq_work *work)
}
 }
 
+void intel_engine_pin_breadcrumbs_irq(struct intel_engine_cs *engine)
+{
+   struct intel_breadcrumbs *b = &engine->breadcrumbs;
+
+   spin_lock_irq(&b->irq_lock);
+   if (!b->irq_enabled++)
+   irq_enable(engine);
+   GEM_BUG_ON(!b->irq_enabled); /* no overflow! */
+   spin_unlock_irq(&b->irq_lock);
+}
+
+void intel_engine_unpin_breadcrumbs_irq(struct intel_engine_cs *engine)
+{
+   struct intel_breadcrumbs *b = &engine->breadcrumbs;
+
+   spin_lock_irq(&b->irq_lock);
+   GEM_BUG_ON(!b->irq_enabled); /* no underflow! */
+   if (!--b->irq_enabled)
+   irq_disable(engine);
+   spin_unlock_irq(&b->irq_lock);
+}
+
 static void __intel_breadcrumbs_arm_irq(struct intel_breadcrumbs *b)
 {
struct intel_engine_cs *engine =
diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h 
b/drivers/gpu/drm/i915/gt/intel_engine.h
index a9249a23903a..dcc2fc22ea37 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine.h
@@ -226,6 +226,9 @@ void intel_engine_init_execlists(struct intel_engine_cs 
*engine);
 void intel_engine_init_breadcrumbs(struct intel_engine_cs *engine);
 void intel_engine_fini_breadcrumbs(struct intel_engine_cs *engine);
 
+void intel_engine_pin_breadcrumbs_irq(struct intel_engine_cs *engine);
+void intel_engine_unpin_breadcrumbs_irq(struct intel_engine_cs *engine);
+
 void intel_engine_disarm_breadcrumbs(struct intel_engine_cs *engine);
 
 static inline void
-- 
2.20.1



[Intel-gfx] [PATCH 63/66] drm/i915/gt: Infrastructure for ring scheduling

2020-07-15 Thread Chris Wilson
Build a bare-bones scheduler to sit on top of the global legacy ringbuffer
submission. This virtual execlists scheme should be applicable to all
older platforms.

A key problem with the legacy ring buffer submission is that it only
allows for FIFO queuing. All clients share the global request queue
and must contend for its lock when submitting. As any client may need to
wait for external events, all clients must then wait. However, if we
stage each client into their own virtual ringbuffer with their own
timelines, we can copy the client requests into the global ringbuffer
only when they are ready, reordering the submission around stalls.
Furthermore, the ability to reorder gives us rudimentary priority
sorting -- although without preemption support, once something is on the
GPU it stays on the GPU, and so it is still possible for a hog to delay
a high priority request (such as updating the display). However, it does
mean that by keeping a short submission queue, the high priority
request will be next. This design resembles the old guc submission
scheduler, which reordered requests onto a global workqueue.

The implementation uses the MI_USER_INTERRUPT at the end of every
request to track completion, so it is more interrupt happy than execlists
[which has an interrupt for each context event, albeit two]. Our
interrupts on these systems are relatively heavy, and in the past we have
been able to completely starve Sandybridge with the interrupt traffic.
Our interrupt handlers are now much better (in part by offloading the
work to bottom halves, leaving the interrupt itself only dealing with
acking the registers), but we can still see the impact of starvation in
the uneven submission latency on a saturated system.

Overall though, the short submission queues and extra interrupts do not
appear to affect throughput (+-10%, some tasks even improve due to the
reduced request overheads) and do improve latency. [Which is a massive
improvement since the introduction of Sandybridge!]
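
As a rough illustration of the idea (not the implementation below, and all
names here are made up): each client appends commands into a private staging
ring, and only requests whose dependencies have signalled are copied into the
single global ring, so one stalled client no longer holds up everybody else.

struct client_ring {
        u32 *cmds;      /* the client's private staging ringbuffer */
        u32 size;       /* in dwords, a power of two in this sketch */
        u32 head;       /* next dword to copy into the global ring */
};

/* Called once the next request in this client's ring is ready to run. */
static void copy_ready_request(struct client_ring *client, u32 request_len,
                               u32 *global, u32 *global_tail, u32 global_size)
{
        while (request_len--) {
                global[*global_tail] = client->cmds[client->head];
                *global_tail = (*global_tail + 1) % global_size;
                client->head = (client->head + 1) % client->size;
        }
}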

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/Makefile |   1 +
 drivers/gpu/drm/i915/gt/intel_engine.h|   1 +
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |   1 +
 .../gpu/drm/i915/gt/intel_ring_scheduler.c| 762 ++
 .../gpu/drm/i915/gt/intel_ring_submission.c   |  13 +-
 .../gpu/drm/i915/gt/intel_ring_submission.h   |  16 +
 6 files changed, 788 insertions(+), 6 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gt/intel_ring_scheduler.c
 create mode 100644 drivers/gpu/drm/i915/gt/intel_ring_submission.h

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index 2cf54db8b847..e4eea4980129 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -110,6 +110,7 @@ gt-y += \
gt/intel_renderstate.o \
gt/intel_reset.o \
gt/intel_ring.o \
+   gt/intel_ring_scheduler.o \
gt/intel_ring_submission.o \
gt/intel_rps.o \
gt/intel_sseu.o \
diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h 
b/drivers/gpu/drm/i915/gt/intel_engine.h
index dcc2fc22ea37..b816581b95d3 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine.h
@@ -209,6 +209,7 @@ void intel_engine_cleanup_common(struct intel_engine_cs 
*engine);
 int intel_engine_resume(struct intel_engine_cs *engine);
 
 int intel_ring_submission_setup(struct intel_engine_cs *engine);
+int intel_ring_scheduler_setup(struct intel_engine_cs *engine);
 
 int intel_engine_stop_cs(struct intel_engine_cs *engine);
 void intel_engine_cancel_stop_cs(struct intel_engine_cs *engine);
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h 
b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index 8c502cf34de7..78a57879aef8 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -339,6 +339,7 @@ struct intel_engine_cs {
struct {
struct intel_ring *ring;
struct intel_timeline *timeline;
+   struct intel_context *context;
} legacy;
 
/*
diff --git a/drivers/gpu/drm/i915/gt/intel_ring_scheduler.c 
b/drivers/gpu/drm/i915/gt/intel_ring_scheduler.c
new file mode 100644
index ..d3c22037f17d
--- /dev/null
+++ b/drivers/gpu/drm/i915/gt/intel_ring_scheduler.c
@@ -0,0 +1,762 @@
+// SPDX-License-Identifier: MIT
+/*
+ * Copyright © 2020 Intel Corporation
+ */
+
+#include 
+
+#include 
+
+#include "mm/i915_acquire_ctx.h"
+
+#include "i915_drv.h"
+#include "intel_context.h"
+#include "intel_engine_stats.h"
+#include "intel_gt.h"
+#include "intel_gt_pm.h"
+#include "intel_gt_requests.h"
+#include "intel_reset.h"
+#include "intel_ring.h"
+#include "intel_ring_submission.h"
+#include "shmem_utils.h"
+
+/*
+ * Rough estimate of the typical request size, performing a flush,
+ * set-context and then emitting the batch.
+ */
+#define LEGACY_REQUEST_SIZE 200
+
+static inline int rq_prio(const struct i915_request *rq)
+{
+

[Intel-gfx] [PATCH 18/66] drm/i915: Always defer fenced work to the worker

2020-07-15 Thread Chris Wilson
Currently, if an error is raised we always call the cleanup locally
[and skip the main work callback]. However, some future users may need
to take a mutex to clean up, and so we cannot immediately execute the
cleanup as we may still be in interrupt context.

With the execute-immediate flag, for most cases this should result in
immediate cleanup of an error.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/i915_sw_fence_work.c | 25 +++
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_sw_fence_work.c 
b/drivers/gpu/drm/i915/i915_sw_fence_work.c
index a3a81bb8f2c3..29f63ebc24e8 100644
--- a/drivers/gpu/drm/i915/i915_sw_fence_work.c
+++ b/drivers/gpu/drm/i915/i915_sw_fence_work.c
@@ -16,11 +16,14 @@ static void fence_complete(struct dma_fence_work *f)
 static void fence_work(struct work_struct *work)
 {
struct dma_fence_work *f = container_of(work, typeof(*f), work);
-   int err;
 
-   err = f->ops->work(f);
-   if (err)
-   dma_fence_set_error(&f->dma, err);
+   if (!f->dma.error) {
+   int err;
+
+   err = f->ops->work(f);
+   if (err)
+   dma_fence_set_error(&f->dma, err);
+   }
 
fence_complete(f);
dma_fence_put(&f->dma);
@@ -36,15 +39,11 @@ fence_notify(struct i915_sw_fence *fence, enum 
i915_sw_fence_notify state)
if (fence->error)
dma_fence_set_error(&f->dma, fence->error);
 
-   if (!f->dma.error) {
-   dma_fence_get(&f->dma);
-   if (test_bit(DMA_FENCE_WORK_IMM, &f->dma.flags))
-   fence_work(&f->work);
-   else
-   queue_work(system_unbound_wq, &f->work);
-   } else {
-   fence_complete(f);
-   }
+   dma_fence_get(&f->dma);
+   if (test_bit(DMA_FENCE_WORK_IMM, &f->dma.flags))
+   fence_work(&f->work);
+   else
+   queue_work(system_unbound_wq, &f->work);
break;
 
case FENCE_FREE:
-- 
2.20.1



[Intel-gfx] [PATCH 16/66] drm/i915/gem: Remove the call for no-evict i915_vma_pin

2020-07-15 Thread Chris Wilson
Remove the stub i915_vma_pin() used for incrementally pinning objects for
execbuf (under the severe restriction that they must not wait on a
resource as we may have already pinned it) and replace it with an
i915_vma_pin_inplace() that is only allowed to reclaim the currently
bound location for the vma (and will never wait for a pinned resource).

Signed-off-by: Chris Wilson 
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c| 69 +++
 drivers/gpu/drm/i915/i915_vma.c   |  6 +-
 drivers/gpu/drm/i915/i915_vma.h   |  2 +
 3 files changed, 45 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c 
b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 28cf28fcf80a..0b8a26da26e5 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -452,49 +452,55 @@ static u64 eb_pin_flags(const struct 
drm_i915_gem_exec_object2 *entry,
return pin_flags;
 }
 
+static bool eb_pin_vma_fence_inplace(struct eb_vma *ev)
+{
+   struct i915_vma *vma = ev->vma;
+   struct i915_fence_reg *reg = vma->fence;
+
+   if (reg) {
+   if (READ_ONCE(reg->dirty))
+   return false;
+
+   atomic_inc(®->pin_count);
+   ev->flags |= __EXEC_OBJECT_HAS_FENCE;
+   } else {
+   if (i915_gem_object_is_tiled(vma->obj))
+   return false;
+   }
+
+   return true;
+}
+
 static inline bool
-eb_pin_vma(struct i915_execbuffer *eb,
-  const struct drm_i915_gem_exec_object2 *entry,
-  struct eb_vma *ev)
+eb_pin_vma_inplace(struct i915_execbuffer *eb,
+  const struct drm_i915_gem_exec_object2 *entry,
+  struct eb_vma *ev)
 {
struct i915_vma *vma = ev->vma;
-   u64 pin_flags;
+   unsigned int pin_flags;
 
-   if (vma->node.size)
-   pin_flags = vma->node.start;
-   else
-   pin_flags = entry->offset & PIN_OFFSET_MASK;
+   if (eb_vma_misplaced(entry, vma, ev->flags))
+   return false;
 
-   pin_flags |= PIN_USER | PIN_NOEVICT | PIN_OFFSET_FIXED;
+   pin_flags = PIN_USER;
if (unlikely(ev->flags & EXEC_OBJECT_NEEDS_GTT))
pin_flags |= PIN_GLOBAL;
 
/* Attempt to reuse the current location if available */
-   if (unlikely(i915_vma_pin(vma, 0, 0, pin_flags))) {
-   if (entry->flags & EXEC_OBJECT_PINNED)
-   return false;
-
-   /* Failing that pick any _free_ space if suitable */
-   if (unlikely(i915_vma_pin(vma,
- entry->pad_to_size,
- entry->alignment,
- eb_pin_flags(entry, ev->flags) |
- PIN_USER | PIN_NOEVICT)))
-   return false;
-   }
+   if (!i915_vma_pin_inplace(vma, pin_flags))
+   return false;
 
if (unlikely(ev->flags & EXEC_OBJECT_NEEDS_FENCE)) {
-   if (unlikely(i915_vma_pin_fence(vma))) {
-   i915_vma_unpin(vma);
+   if (!eb_pin_vma_fence_inplace(ev)) {
+   __i915_vma_unpin(vma);
return false;
}
-
-   if (vma->fence)
-   ev->flags |= __EXEC_OBJECT_HAS_FENCE;
}
 
+   GEM_BUG_ON(eb_vma_misplaced(entry, vma, ev->flags));
+
ev->flags |= __EXEC_OBJECT_HAS_PIN;
-   return !eb_vma_misplaced(entry, vma, ev->flags);
+   return true;
 }
 
 static int
@@ -676,14 +682,17 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
struct drm_i915_gem_exec_object2 *entry = ev->exec;
struct i915_vma *vma = ev->vma;
 
-   if (eb_pin_vma(eb, entry, ev)) {
+   if (eb_pin_vma_inplace(eb, entry, ev)) {
if (entry->offset != vma->node.start) {
entry->offset = vma->node.start | UPDATE;
eb->args->flags |= __EXEC_HAS_RELOC;
}
} else {
-   eb_unreserve_vma(ev);
-   list_add_tail(&ev->unbound_link, &unbound);
+   /* Lightly sort user placed objects to the fore */
+   if (ev->flags & EXEC_OBJECT_PINNED)
+   list_add(&ev->unbound_link, &unbound);
+   else
+   list_add_tail(&ev->unbound_link, &unbound);
}
}
 
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index c6bf04ca2032..dbe11b349175 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -740,11 +740,13 @@ i915_vma_detach(struct i915_vma *vma)
list_del(&vma->vm_link);
 }
 
-static bool t

[Intel-gfx] [PATCH 08/66] drm/i915: Make the stale cached active node available for any timeline

2020-07-15 Thread Chris Wilson
Rather than require the next timeline after idling to match the MRU
before idling, reset the index on the node and allow it to match the
first request. However, this requires cmpxchg(u64) and so is not trivial
on 32b, so for compatibility we just fall back to keeping the cached node
pointing to the MRU timeline.
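
For illustration, the claim step amounts to the following userspace sketch,
using C11 atomics in place of the kernel's cmpxchg() (the struct and function
names here are made up, not taken from the patch below):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct cached_node {
        _Atomic uint64_t timeline;      /* 0 == idle, free for any timeline */
};

/*
 * Reuse the cached node if it already belongs to 'idx', or claim it if
 * it was reset to 0 on idling; the compare-and-swap guarantees exactly
 * one timeline wins the claim.
 */
static bool reuse_cached_node(struct cached_node *node, uint64_t idx)
{
        uint64_t expected = 0;

        if (atomic_load(&node->timeline) == idx)
                return true;

        return atomic_compare_exchange_strong(&node->timeline, &expected, idx);
}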

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/i915_active.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_active.c 
b/drivers/gpu/drm/i915/i915_active.c
index 0854b1552bc1..6737b5615c0c 100644
--- a/drivers/gpu/drm/i915/i915_active.c
+++ b/drivers/gpu/drm/i915/i915_active.c
@@ -157,6 +157,10 @@ __active_retire(struct i915_active *ref)
rb_link_node(&ref->cache->node, NULL, &ref->tree.rb_node);
rb_insert_color(&ref->cache->node, &ref->tree);
GEM_BUG_ON(ref->tree.rb_node != &ref->cache->node);
+
+   /* Make the cached node available for reuse with any timeline */
+   if (IS_ENABLED(CONFIG_64BIT))
+   ref->cache->timeline = 0; /* needs cmpxchg(u64) */
}
 
spin_unlock_irqrestore(&ref->tree_lock, flags);
@@ -235,9 +239,22 @@ static struct active_node *__active_lookup(struct 
i915_active *ref, u64 idx)
 {
struct active_node *it;
 
+   GEM_BUG_ON(idx == 0); /* 0 is the unordered timeline, rsvd for cache */
+
it = READ_ONCE(ref->cache);
-   if (it && it->timeline == idx)
-   return it;
+   if (it) {
+   u64 cached = READ_ONCE(it->timeline);
+
+   if (cached == idx)
+   return it;
+
+#ifdef CONFIG_64BIT /* for cmpxchg(u64) */
+   if (!cached && !cmpxchg(&it->timeline, 0, idx)) {
+   GEM_BUG_ON(i915_active_fence_isset(&it->base));
+   return it;
+   }
+#endif
+   }
 
BUILD_BUG_ON(offsetof(typeof(*it), node));
 
-- 
2.20.1



[Intel-gfx] [PATCH 02/66] drm/i915: Remove i915_request.lock requirement for execution callbacks

2020-07-15 Thread Chris Wilson
We are using the i915_request.lock to serialise adding an execution
callback with __i915_request_submit. However, if we use an atomic
llist_add to serialise multiple waiters and then check to see if the
request is already executing, we can remove the irq-spinlock.
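
The resulting pattern looks roughly like the following simplified sketch. It
reuses the i915 names for readability but is not the exact code in the diff
below; in particular it ignores the optimisation of only notifying on the
first addition.

/*
 * Submit side: after marking the request as active, atomically take the
 * whole callback list and fire it. llist_del_all() may be called again
 * later and simply finds an empty list.
 */
static void notify_execute_cbs(struct i915_request *rq)
{
        struct execute_cb *cb, *cn;

        llist_for_each_entry_safe(cb, cn,
                                  llist_del_all(&rq->execute_cb), work.llnode)
                irq_work_queue(&cb->work);
}

/*
 * Waiter side: publish the callback first, then re-check whether the
 * request has already started executing. If both sides race, the worst
 * case is an extra (harmless) notification - no spinlock required.
 */
static void add_execute_cb(struct i915_request *signal, struct execute_cb *cb)
{
        llist_add(&cb->work.llnode, &signal->execute_cb);
        if (i915_request_is_active(signal))
                notify_execute_cbs(signal);
}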

Fixes: 1d9221e9d395 ("drm/i915: Skip signaling a signaled request")
Signed-off-by: Chris Wilson 
Cc: Tvrtko Ursulin 
---
 drivers/gpu/drm/i915/i915_request.c | 43 -
 1 file changed, 12 insertions(+), 31 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_request.c 
b/drivers/gpu/drm/i915/i915_request.c
index bb4eb1a8780e..d13dd013acb4 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -190,13 +190,11 @@ static void __notify_execute_cb(struct i915_request *rq)
 {
struct execute_cb *cb, *cn;
 
-   lockdep_assert_held(&rq->lock);
-
-   GEM_BUG_ON(!i915_request_is_active(rq));
if (llist_empty(&rq->execute_cb))
return;
 
-   llist_for_each_entry_safe(cb, cn, rq->execute_cb.first, work.llnode)
+   llist_for_each_entry_safe(cb, cn,
+ llist_del_all(&rq->execute_cb), work.llnode)
irq_work_queue(&cb->work);
 
/*
@@ -209,7 +207,6 @@ static void __notify_execute_cb(struct i915_request *rq)
 * preempt-to-idle cycle on the target engine, all the while the
 * master execute_cb may refire.
 */
-   init_llist_head(&rq->execute_cb);
 }
 
 static inline void
@@ -274,9 +271,11 @@ static void remove_from_engine(struct i915_request *rq)
locked = engine;
}
list_del_init(&rq->sched.link);
+   spin_unlock_irq(&locked->active.lock);
+
clear_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
clear_bit(I915_FENCE_FLAG_HOLD, &rq->fence.flags);
-   spin_unlock_irq(&locked->active.lock);
+   set_bit(I915_FENCE_FLAG_ACTIVE, &rq->fence.flags);
 }
 
 bool i915_request_retire(struct i915_request *rq)
@@ -288,6 +287,7 @@ bool i915_request_retire(struct i915_request *rq)
 
GEM_BUG_ON(!i915_sw_fence_signaled(&rq->submit));
trace_i915_request_retire(rq);
+   i915_request_mark_complete(rq);
 
/*
 * We know the GPU must have read the request to have
@@ -314,7 +314,6 @@ bool i915_request_retire(struct i915_request *rq)
remove_from_engine(rq);
 
spin_lock_irq(&rq->lock);
-   i915_request_mark_complete(rq);
if (!i915_request_signaled(rq))
dma_fence_signal_locked(&rq->fence);
if (test_bit(DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT, &rq->fence.flags))
@@ -323,12 +322,8 @@ bool i915_request_retire(struct i915_request *rq)
GEM_BUG_ON(!atomic_read(&rq->engine->gt->rps.num_waiters));
atomic_dec(&rq->engine->gt->rps.num_waiters);
}
-   if (!test_bit(I915_FENCE_FLAG_ACTIVE, &rq->fence.flags)) {
-   set_bit(I915_FENCE_FLAG_ACTIVE, &rq->fence.flags);
-   __notify_execute_cb(rq);
-   }
-   GEM_BUG_ON(!llist_empty(&rq->execute_cb));
spin_unlock_irq(&rq->lock);
+   __notify_execute_cb(rq);
 
remove_from_client(rq);
__list_del_entry(&rq->link); /* poison neither prev/next (RCU walks) */
@@ -357,12 +352,6 @@ void i915_request_retire_upto(struct i915_request *rq)
} while (i915_request_retire(tmp) && tmp != rq);
 }
 
-static void __llist_add(struct llist_node *node, struct llist_head *head)
-{
-   node->next = head->first;
-   head->first = node;
-}
-
 static struct i915_request * const *
 __engine_active(struct intel_engine_cs *engine)
 {
@@ -439,18 +428,11 @@ __await_execution(struct i915_request *rq,
cb->work.func = irq_execute_cb_hook;
}
 
-   spin_lock_irq(&signal->lock);
-   if (i915_request_is_active(signal) || __request_in_flight(signal)) {
-   if (hook) {
-   hook(rq, &signal->fence);
-   i915_request_put(signal);
-   }
-   i915_sw_fence_complete(cb->fence);
-   kmem_cache_free(global.slab_execute_cbs, cb);
-   } else {
-   __llist_add(&cb->work.llnode, &signal->execute_cb);
+   if (llist_add(&cb->work.llnode, &signal->execute_cb)) {
+   if (i915_request_is_active(signal) ||
+   __request_in_flight(signal))
+   __notify_execute_cb(signal);
}
-   spin_unlock_irq(&signal->lock);
 
return 0;
 }
@@ -565,19 +547,18 @@ bool __i915_request_submit(struct i915_request *request)
list_move_tail(&request->sched.link, &engine->active.requests);
clear_bit(I915_FENCE_FLAG_PQUEUE, &request->fence.flags);
}
+   __notify_execute_cb(request);
 
/* We may be recursing from the signal callback of another i915 fence */
if (!i915_request_signaled(request)) {
spin_lock_nested(&request->lock, 

[Intel-gfx] [PATCH 33/66] drm/i915: Remove unused i915_gem_evict_vm()

2020-07-15 Thread Chris Wilson
Obsolete, last user removed.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/i915_drv.h   |  1 -
 drivers/gpu/drm/i915/i915_gem_evict.c | 57 ---
 .../gpu/drm/i915/selftests/i915_gem_evict.c   | 40 -
 3 files changed, 98 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index bd7ff2ad6514..2c1a9b74af8d 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1865,7 +1865,6 @@ int __must_check i915_gem_evict_something(struct 
i915_address_space *vm,
 int __must_check i915_gem_evict_for_node(struct i915_address_space *vm,
 struct drm_mm_node *node,
 unsigned int flags);
-int i915_gem_evict_vm(struct i915_address_space *vm);
 
 /* i915_gem_internal.c */
 struct drm_i915_gem_object *
diff --git a/drivers/gpu/drm/i915/i915_gem_evict.c 
b/drivers/gpu/drm/i915/i915_gem_evict.c
index 6501939929d5..e35f0ba5e245 100644
--- a/drivers/gpu/drm/i915/i915_gem_evict.c
+++ b/drivers/gpu/drm/i915/i915_gem_evict.c
@@ -343,63 +343,6 @@ int i915_gem_evict_for_node(struct i915_address_space *vm,
return ret;
 }
 
-/**
- * i915_gem_evict_vm - Evict all idle vmas from a vm
- * @vm: Address space to cleanse
- *
- * This function evicts all vmas from a vm.
- *
- * This is used by the execbuf code as a last-ditch effort to defragment the
- * address space.
- *
- * To clarify: This is for freeing up virtual address space, not for freeing
- * memory in e.g. the shrinker.
- */
-int i915_gem_evict_vm(struct i915_address_space *vm)
-{
-   int ret = 0;
-
-   lockdep_assert_held(&vm->mutex);
-   trace_i915_gem_evict_vm(vm);
-
-   /* Switch back to the default context in order to unpin
-* the existing context objects. However, such objects only
-* pin themselves inside the global GTT and performing the
-* switch otherwise is ineffective.
-*/
-   if (i915_is_ggtt(vm)) {
-   ret = ggtt_flush(vm->gt);
-   if (ret)
-   return ret;
-   }
-
-   do {
-   struct i915_vma *vma, *vn;
-   LIST_HEAD(eviction_list);
-
-   list_for_each_entry(vma, &vm->bound_list, vm_link) {
-   if (i915_vma_is_pinned(vma))
-   continue;
-
-   __i915_vma_pin(vma);
-   list_add(&vma->evict_link, &eviction_list);
-   }
-   if (list_empty(&eviction_list))
-   break;
-
-   ret = 0;
-   list_for_each_entry_safe(vma, vn, &eviction_list, evict_link) {
-   __i915_vma_unpin(vma);
-   if (ret == 0)
-   ret = __i915_vma_unbind(vma);
-   if (ret != -EINTR) /* "Get me out of here!" */
-   ret = 0;
-   }
-   } while (ret == 0);
-
-   return ret;
-}
-
 #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
 #include "selftests/i915_gem_evict.c"
 #endif
diff --git a/drivers/gpu/drm/i915/selftests/i915_gem_evict.c 
b/drivers/gpu/drm/i915/selftests/i915_gem_evict.c
index 028baae9631f..773cecacba82 100644
--- a/drivers/gpu/drm/i915/selftests/i915_gem_evict.c
+++ b/drivers/gpu/drm/i915/selftests/i915_gem_evict.c
@@ -327,45 +327,6 @@ static int igt_evict_for_cache_color(void *arg)
return err;
 }
 
-static int igt_evict_vm(void *arg)
-{
-   struct intel_gt *gt = arg;
-   struct i915_ggtt *ggtt = gt->ggtt;
-   LIST_HEAD(objects);
-   int err;
-
-   /* Fill the GGTT with pinned objects and try to evict everything. */
-
-   err = populate_ggtt(ggtt, &objects);
-   if (err)
-   goto cleanup;
-
-   /* Everything is pinned, nothing should happen */
-   mutex_lock(&ggtt->vm.mutex);
-   err = i915_gem_evict_vm(&ggtt->vm);
-   mutex_unlock(&ggtt->vm.mutex);
-   if (err) {
-   pr_err("i915_gem_evict_vm on a full GGTT returned err=%d]\n",
-  err);
-   goto cleanup;
-   }
-
-   unpin_ggtt(ggtt);
-
-   mutex_lock(&ggtt->vm.mutex);
-   err = i915_gem_evict_vm(&ggtt->vm);
-   mutex_unlock(&ggtt->vm.mutex);
-   if (err) {
-   pr_err("i915_gem_evict_vm on a full GGTT returned err=%d]\n",
-  err);
-   goto cleanup;
-   }
-
-cleanup:
-   cleanup_objects(ggtt, &objects);
-   return err;
-}
-
 static int igt_evict_contexts(void *arg)
 {
const u64 PRETEND_GGTT_SIZE = 16ull << 20;
@@ -522,7 +483,6 @@ int i915_gem_evict_mock_selftests(void)
SUBTEST(igt_evict_something),
SUBTEST(igt_evict_for_vma),
SUBTEST(igt_evict_for_cache_color),
-   SUBTEST(igt_evict_vm),
SUBTEST(igt_overcommit),
};
struct drm_i915_privat

[Intel-gfx] [PATCH 27/66] drm/i915/gem: Pull execbuf dma resv under a single critical section

2020-07-15 Thread Chris Wilson
Acquire all the objects, their backing storage and the page directories
used by execbuf under a single common ww_mutex. We do, however, have to
restart the critical section a few times in order to handle various
restrictions (such as avoiding copy_(from|to)_user and mmap_sem).
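
For reference, the i915_acquire_ctx helpers used below are built on the
kernel's ww_mutex machinery; stripped of the i915 specifics, the underlying
acquire/backoff loop looks roughly like this generic sketch (the struct and
function below are illustrative, kernel context assumed, not code from the
patch):

struct obj {
        struct ww_mutex lock;
        struct list_head link;
};

/*
 * Acquire every object on 'objects' under one ww acquire context. On
 * deadlock, release everything held, sleep on the contended lock, and
 * restart - the classic ww_mutex backoff loop. The caller unlocks the
 * objects and calls ww_acquire_fini() once it is done with them.
 */
static int lock_objects(struct ww_acquire_ctx *ctx, struct list_head *objects)
{
        struct obj *obj, *contended = NULL;
        int err;

        ww_acquire_init(ctx, &reservation_ww_class);
retry:
        list_for_each_entry(obj, objects, link) {
                if (obj == contended) {
                        contended = NULL; /* already locked by the slow path */
                        continue;
                }

                err = ww_mutex_lock_interruptible(&obj->lock, ctx);
                if (err) {
                        struct obj *unwind;

                        /* Unlock everything acquired so far */
                        list_for_each_entry(unwind, objects, link) {
                                if (unwind == obj)
                                        break;
                                if (unwind != contended)
                                        ww_mutex_unlock(&unwind->lock);
                        }
                        if (contended)
                                ww_mutex_unlock(&contended->lock);
                        contended = NULL;

                        if (err != -EDEADLK) {
                                ww_acquire_fini(ctx);
                                return err;
                        }

                        /* Sleep on the lock we lost, then start over */
                        contended = obj;
                        err = ww_mutex_lock_slow_interruptible(&obj->lock, ctx);
                        if (err) {
                                ww_acquire_fini(ctx);
                                return err;
                        }
                        goto retry;
                }
        }

        ww_acquire_done(ctx);
        return 0;
}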

Signed-off-by: Chris Wilson 
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c| 168 +-
 .../i915/gem/selftests/i915_gem_execbuffer.c  |   8 +-
 2 files changed, 87 insertions(+), 89 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c 
b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index ebabc0746d50..db433f3f18ec 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -20,6 +20,7 @@
 #include "gt/intel_gt_pm.h"
 #include "gt/intel_gt_requests.h"
 #include "gt/intel_ring.h"
+#include "mm/i915_acquire_ctx.h"
 
 #include "i915_drv.h"
 #include "i915_gem_clflush.h"
@@ -244,6 +245,8 @@ struct i915_execbuffer {
struct intel_context *context; /* logical state for the request */
struct i915_gem_context *gem_context; /** caller's context */
 
+   struct i915_acquire_ctx acquire; /** lock for _all_ DMA reservations */
+
struct i915_request *request; /** our request to build */
struct eb_vma *batch; /** identity of the batch obj/vma */
 
@@ -389,42 +392,6 @@ static void eb_vma_array_put(struct eb_vma_array *arr)
kref_put(&arr->kref, eb_vma_array_destroy);
 }
 
-static int
-eb_lock_vma(struct i915_execbuffer *eb, struct ww_acquire_ctx *acquire)
-{
-   struct eb_vma *ev;
-   int err = 0;
-
-   list_for_each_entry(ev, &eb->submit_list, submit_link) {
-   struct i915_vma *vma = ev->vma;
-
-   err = ww_mutex_lock_interruptible(&vma->resv->lock, acquire);
-   if (err == -EDEADLK) {
-   struct eb_vma *unlock = ev, *en;
-
-   list_for_each_entry_safe_continue_reverse(unlock, en,
- 
&eb->submit_list,
- submit_link) {
-   ww_mutex_unlock(&unlock->vma->resv->lock);
-   list_move_tail(&unlock->submit_link, 
&eb->submit_list);
-   }
-
-   GEM_BUG_ON(!list_is_first(&ev->submit_link, 
&eb->submit_list));
-   err = ww_mutex_lock_slow_interruptible(&vma->resv->lock,
-  acquire);
-   }
-   if (err) {
-   list_for_each_entry_continue_reverse(ev,
-&eb->submit_list,
-submit_link)
-   ww_mutex_unlock(&ev->vma->resv->lock);
-   break;
-   }
-   }
-
-   return err;
-}
-
 static int eb_create(struct i915_execbuffer *eb)
 {
/* Allocate an extra slot for use by the sentinel */
@@ -668,6 +635,25 @@ eb_add_vma(struct i915_execbuffer *eb,
}
 }
 
+static int eb_lock_mm(struct i915_execbuffer *eb)
+{
+   struct eb_vma *ev;
+   int err;
+
+   list_for_each_entry(ev, &eb->bind_list, bind_link) {
+   err = i915_acquire_ctx_lock(&eb->acquire, ev->vma->obj);
+   if (err)
+   return err;
+   }
+
+   return 0;
+}
+
+static int eb_acquire_mm(struct i915_execbuffer *eb)
+{
+   return i915_acquire_mm(&eb->acquire);
+}
+
 struct eb_vm_work {
struct dma_fence_work base;
struct eb_vma_array *array;
@@ -1390,7 +1376,15 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
unsigned long count;
struct eb_vma *ev;
unsigned int pass;
-   int err = 0;
+   int err;
+
+   err = eb_lock_mm(eb);
+   if (err)
+   return err;
+
+   err = eb_acquire_mm(eb);
+   if (err)
+   return err;
 
count = 0;
INIT_LIST_HEAD(&unbound);
@@ -1416,10 +1410,15 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
if (count == 0)
return 0;
 
+   /* We need to reserve page directories, release all, start over */
+   i915_acquire_ctx_fini(&eb->acquire);
+
pass = 0;
do {
struct eb_vm_work *work;
 
+   i915_acquire_ctx_init(&eb->acquire);
+
/*
 * We need to hold one lock as we bind all the vma so that
 * we have a consistent view of the entire vm and can plan
@@ -1436,6 +1435,11 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
 * beneath it, so we have to stage and preallocate all the
 * resources we may require before taking the mutex.
 */
+
+   err = eb_lock_mm(eb);
+   if (err)

[Intel-gfx] [PATCH 23/66] drm/i915/gem: Include cmdparser in common execbuf pinning

2020-07-15 Thread Chris Wilson
Pull the cmdparser allocations into the reservation phase, so that
they are included in the common vma pinning pass.

Signed-off-by: Chris Wilson 
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c| 360 +++---
 drivers/gpu/drm/i915/gem/i915_gem_object.h|  10 +
 drivers/gpu/drm/i915/i915_cmd_parser.c|  21 +-
 3 files changed, 230 insertions(+), 161 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c 
b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index af2b4aeb6df0..8c1f3528b1e9 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -25,6 +25,7 @@
 #include "i915_gem_clflush.h"
 #include "i915_gem_context.h"
 #include "i915_gem_ioctls.h"
+#include "i915_memcpy.h"
 #include "i915_sw_fence_work.h"
 #include "i915_trace.h"
 
@@ -52,6 +53,7 @@ struct eb_bind_vma {
 
 struct eb_vma_array {
struct kref kref;
+   struct list_head aux_list;
struct eb_vma vma[];
 };
 
@@ -246,7 +248,6 @@ struct i915_execbuffer {
 
struct i915_request *request; /** our request to build */
struct eb_vma *batch; /** identity of the batch obj/vma */
-   struct i915_vma *trampoline; /** trampoline used for chaining */
 
/** actual size of execobj[] as we may extend it for the cmdparser */
unsigned int buffer_count;
@@ -281,6 +282,11 @@ struct i915_execbuffer {
unsigned int rq_size;
} reloc_cache;
 
+   struct eb_cmdparser {
+   struct eb_vma *shadow;
+   struct eb_vma *trampoline;
+   } parser;
+
u64 invalid_flags; /** Set of execobj.flags that are invalid */
u32 context_flags; /** Set of execobj.flags to insert from the ctx */
 
@@ -298,6 +304,10 @@ struct i915_execbuffer {
struct eb_vma_array *array;
 };
 
+static struct drm_i915_gem_exec_object2 no_entry = {
+   .offset = -1ull
+};
+
 static inline bool eb_use_cmdparser(const struct i915_execbuffer *eb)
 {
return intel_engine_requires_cmd_parser(eb->engine) ||
@@ -314,6 +324,7 @@ static struct eb_vma_array *eb_vma_array_create(unsigned 
int count)
return NULL;
 
kref_init(&arr->kref);
+   INIT_LIST_HEAD(&arr->aux_list);
arr->vma[0].vma = NULL;
 
return arr;
@@ -339,16 +350,31 @@ static inline void eb_unreserve_vma(struct eb_vma *ev)
   __EXEC_OBJECT_HAS_FENCE);
 }
 
+static void eb_vma_destroy(struct eb_vma *ev)
+{
+   eb_unreserve_vma(ev);
+   i915_vma_put(ev->vma);
+}
+
+static void eb_destroy_aux(struct eb_vma_array *arr)
+{
+   struct eb_vma *ev, *en;
+
+   list_for_each_entry_safe(ev, en, &arr->aux_list, reloc_link) {
+   eb_vma_destroy(ev);
+   kfree(ev);
+   }
+}
+
 static void eb_vma_array_destroy(struct kref *kref)
 {
struct eb_vma_array *arr = container_of(kref, typeof(*arr), kref);
-   struct eb_vma *ev = arr->vma;
+   struct eb_vma *ev;
 
-   while (ev->vma) {
-   eb_unreserve_vma(ev);
-   i915_vma_put(ev->vma);
-   ev++;
-   }
+   eb_destroy_aux(arr);
+
+   for (ev = arr->vma; ev->vma; ev++)
+   eb_vma_destroy(ev);
 
kvfree(arr);
 }
@@ -396,8 +422,8 @@ eb_lock_vma(struct i915_execbuffer *eb, struct 
ww_acquire_ctx *acquire)
 
 static int eb_create(struct i915_execbuffer *eb)
 {
-   /* Allocate an extra slot for use by the command parser + sentinel */
-   eb->array = eb_vma_array_create(eb->buffer_count + 2);
+   /* Allocate an extra slot for use by the sentinel */
+   eb->array = eb_vma_array_create(eb->buffer_count + 1);
if (!eb->array)
return -ENOMEM;
 
@@ -1078,7 +1104,7 @@ static int eb_reserve_vma(struct eb_vm_work *work, struct 
eb_bind_vma *bind)
GEM_BUG_ON(!(drm_mm_node_allocated(&vma->node) ^
 drm_mm_node_allocated(&bind->hole)));
 
-   if (entry->offset != vma->node.start) {
+   if (entry != &no_entry && entry->offset != vma->node.start) {
entry->offset = vma->node.start | UPDATE;
*work->p_flags |= __EXEC_HAS_RELOC;
}
@@ -1371,7 +1397,8 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
struct i915_vma *vma = ev->vma;
 
if (eb_pin_vma_inplace(eb, entry, ev)) {
-   if (entry->offset != vma->node.start) {
+   if (entry != &no_entry &&
+   entry->offset != vma->node.start) {
entry->offset = vma->node.start | UPDATE;
eb->args->flags |= __EXEC_HAS_RELOC;
}
@@ -1542,6 +1569,113 @@ static int eb_reserve_vm(struct i915_execbuffer *eb)
} while (1);
 }
 
+static int eb_alloc_cmdparser(struct i915_execbuffer *eb)
+{
+   struct intel_gt_buffer_pool_node *pool;
+   struct i915_vma *vma;
+   struct eb_vma *ev;

[Intel-gfx] [PATCH 20/66] drm/i915/gem: Separate the ww_mutex walker into its own list

2020-07-15 Thread Chris Wilson
In preparation for making eb_vma bigger and heavier to run in parallel,
we need to stop applying an in-place swap() to reorder around ww_mutex
deadlocks. Keep the array intact and reorder the locks using a dedicated
list.

Signed-off-by: Chris Wilson 
Reviewed-by: Tvrtko Ursulin 
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c| 83 ---
 1 file changed, 54 insertions(+), 29 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c 
b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 0b8a26da26e5..430b2d4dc747 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -37,6 +37,7 @@ struct eb_vma {
struct list_head bind_link;
struct list_head unbound_link;
struct list_head reloc_link;
+   struct list_head submit_link;
 
struct hlist_node node;
u32 handle;
@@ -248,6 +249,8 @@ struct i915_execbuffer {
/** list of vma that have execobj.relocation_count */
struct list_head relocs;
 
+   struct list_head submit_list;
+
/**
 * Track the most recently used object for relocations, as we
 * frequently have to perform multiple relocations within the same
@@ -341,6 +344,42 @@ static void eb_vma_array_put(struct eb_vma_array *arr)
kref_put(&arr->kref, eb_vma_array_destroy);
 }
 
+static int
+eb_lock_vma(struct i915_execbuffer *eb, struct ww_acquire_ctx *acquire)
+{
+   struct eb_vma *ev;
+   int err = 0;
+
+   list_for_each_entry(ev, &eb->submit_list, submit_link) {
+   struct i915_vma *vma = ev->vma;
+
+   err = ww_mutex_lock_interruptible(&vma->resv->lock, acquire);
+   if (err == -EDEADLK) {
+   struct eb_vma *unlock = ev, *en;
+
+   list_for_each_entry_safe_continue_reverse(unlock, en,
+ 
&eb->submit_list,
+ submit_link) {
+   ww_mutex_unlock(&unlock->vma->resv->lock);
+   list_move_tail(&unlock->submit_link, 
&eb->submit_list);
+   }
+
+   GEM_BUG_ON(!list_is_first(&ev->submit_link, 
&eb->submit_list));
+   err = ww_mutex_lock_slow_interruptible(&vma->resv->lock,
+  acquire);
+   }
+   if (err) {
+   list_for_each_entry_continue_reverse(ev,
+&eb->submit_list,
+submit_link)
+   ww_mutex_unlock(&ev->vma->resv->lock);
+   break;
+   }
+   }
+
+   return err;
+}
+
 static int eb_create(struct i915_execbuffer *eb)
 {
/* Allocate an extra slot for use by the command parser + sentinel */
@@ -393,6 +432,10 @@ static int eb_create(struct i915_execbuffer *eb)
eb->lut_size = -eb->buffer_count;
}
 
+   INIT_LIST_HEAD(&eb->bind_list);
+   INIT_LIST_HEAD(&eb->submit_list);
+   INIT_LIST_HEAD(&eb->relocs);
+
return 0;
 }
 
@@ -574,6 +617,7 @@ eb_add_vma(struct i915_execbuffer *eb,
}
 
list_add_tail(&ev->bind_link, &eb->bind_list);
+   list_add_tail(&ev->submit_link, &eb->submit_list);
 
if (entry->relocation_count)
list_add_tail(&ev->reloc_link, &eb->relocs);
@@ -940,9 +984,6 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
unsigned int i;
int err = 0;
 
-   INIT_LIST_HEAD(&eb->bind_list);
-   INIT_LIST_HEAD(&eb->relocs);
-
for (i = 0; i < eb->buffer_count; i++) {
struct i915_vma *vma;
 
@@ -1609,38 +1650,19 @@ static int eb_relocate(struct i915_execbuffer *eb)
 
 static int eb_move_to_gpu(struct i915_execbuffer *eb)
 {
-   const unsigned int count = eb->buffer_count;
struct ww_acquire_ctx acquire;
-   unsigned int i;
+   struct eb_vma *ev;
int err = 0;
 
ww_acquire_init(&acquire, &reservation_ww_class);
 
-   for (i = 0; i < count; i++) {
-   struct eb_vma *ev = &eb->vma[i];
-   struct i915_vma *vma = ev->vma;
-
-   err = ww_mutex_lock_interruptible(&vma->resv->lock, &acquire);
-   if (err == -EDEADLK) {
-   GEM_BUG_ON(i == 0);
-   do {
-   int j = i - 1;
-
-   ww_mutex_unlock(&eb->vma[j].vma->resv->lock);
-
-   swap(eb->vma[i],  eb->vma[j]);
-   } while (--i);
+   err = eb_lock_vma(eb, &acquire);
+   if (err)
+   goto err_fini;
 
-   err = ww_mutex_lock_slow_interruptible(&vma->resv->lock,
-

[Intel-gfx] [PATCH 39/66] drm/i915/gt: Decouple inflight virtual engines

2020-07-15 Thread Chris Wilson
Once a virtual engine has been bound to a sibling, it will remain bound
until we finally schedule out the last active request. We cannot rebind
the context to a new sibling while it is inflight, as the context save
would conflict, hence we wait. As we cannot then use any other sibling
while the context is inflight, only kick the bound sibling while it is
inflight, and upon scheduling out kick the rest (so that we can swap
engines on timeslicing if the previously bound engine becomes
oversubscribed).

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 28 
 1 file changed, 12 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c 
b/drivers/gpu/drm/i915/gt/intel_lrc.c
index ec533dfe3be9..2f35aceea778 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1387,9 +1387,8 @@ static inline void execlists_schedule_in(struct 
i915_request *rq, int idx)
 static void kick_siblings(struct i915_request *rq, struct intel_context *ce)
 {
struct virtual_engine *ve = container_of(ce, typeof(*ve), context);
-   struct i915_request *next = READ_ONCE(ve->request);
 
-   if (next == rq || (next && next->execution_mask & ~rq->execution_mask))
+   if (READ_ONCE(ve->request))
tasklet_hi_schedule(&ve->base.execlists.tasklet);
 }
 
@@ -1806,17 +1805,13 @@ first_virtual_engine(struct intel_engine_cs *engine)
struct i915_request *rq = READ_ONCE(ve->request);
 
/* lazily cleanup after another engine handled rq */
-   if (!rq) {
+   if (!rq || !virtual_matches(ve, rq, engine)) {
rb_erase_cached(rb, &el->virtual);
RB_CLEAR_NODE(rb);
rb = rb_first_cached(&el->virtual);
continue;
}
 
-   if (!virtual_matches(ve, rq, engine)) {
-   rb = rb_next(rb);
-   continue;
-   }
return ve;
}
 
@@ -5443,7 +5438,6 @@ static void virtual_submission_tasklet(unsigned long data)
if (unlikely(!mask))
return;
 
-   local_irq_disable();
for (n = 0; n < ve->num_siblings; n++) {
struct intel_engine_cs *sibling = READ_ONCE(ve->siblings[n]);
struct ve_node * const node = &ve->nodes[sibling->id];
@@ -5453,20 +5447,19 @@ static void virtual_submission_tasklet(unsigned long 
data)
if (!READ_ONCE(ve->request))
break; /* already handled by a sibling's tasklet */
 
+   spin_lock_irq(&sibling->active.lock);
+
if (unlikely(!(mask & sibling->mask))) {
if (!RB_EMPTY_NODE(&node->rb)) {
-   spin_lock(&sibling->active.lock);
rb_erase_cached(&node->rb,
&sibling->execlists.virtual);
RB_CLEAR_NODE(&node->rb);
-   spin_unlock(&sibling->active.lock);
}
-   continue;
-   }
 
-   spin_lock(&sibling->active.lock);
+   goto unlock_engine;
+   }
 
-   if (!RB_EMPTY_NODE(&node->rb)) {
+   if (unlikely(!RB_EMPTY_NODE(&node->rb))) {
/*
 * Cheat and avoid rebalancing the tree if we can
 * reuse this node in situ.
@@ -5506,9 +5499,12 @@ static void virtual_submission_tasklet(unsigned long 
data)
if (first && prio > sibling->execlists.queue_priority_hint)
tasklet_hi_schedule(&sibling->execlists.tasklet);
 
-   spin_unlock(&sibling->active.lock);
+unlock_engine:
+   spin_unlock_irq(&sibling->active.lock);
+
+   if (intel_context_inflight(&ve->context))
+   break;
}
-   local_irq_enable();
 }
 
 static void virtual_submit_request(struct i915_request *rq)
-- 
2.20.1



[Intel-gfx] [PATCH 51/66] drm/i915/gt: Do not suspend bonded requests if one hangs

2020-07-15 Thread Chris Wilson
Treat the dependency between bonded requests as weak and leave the
remainder of the pair on the GPU if one hangs.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c 
b/drivers/gpu/drm/i915/gt/intel_lrc.c
index dca6f8165ec7..fdeeed8b45d5 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2683,6 +2683,9 @@ static void __execlists_hold(struct i915_request *rq)
struct i915_request *w =
container_of(p->waiter, typeof(*w), sched);
 
+   if (p->flags & I915_DEPENDENCY_WEAK)
+   continue;
+
/* Leave semaphores spinning on the other engines */
if (w->engine != rq->engine)
continue;
@@ -2778,6 +2781,9 @@ static void __execlists_unhold(struct i915_request *rq)
struct i915_request *w =
container_of(p->waiter, typeof(*w), sched);
 
+   if (p->flags & I915_DEPENDENCY_WEAK)
+   continue;
+
/* Propagate any change in error status */
if (rq->fence.error)
i915_request_set_error_once(w, rq->fence.error);
-- 
2.20.1



[Intel-gfx] [PATCH 04/66] drm/i915: Add a couple of missing i915_active_fini()

2020-07-15 Thread Chris Wilson
We use i915_active_fini() as a debug check on the i915_active state
before freeing. If we forget to call it, we may end up angering the
debugobjects contained within.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/display/intel_frontbuffer.c| 2 ++
 drivers/gpu/drm/i915/gt/selftest_engine_heartbeat.c | 5 -
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/display/intel_frontbuffer.c 
b/drivers/gpu/drm/i915/display/intel_frontbuffer.c
index 2979ed2588eb..d898b370d7a4 100644
--- a/drivers/gpu/drm/i915/display/intel_frontbuffer.c
+++ b/drivers/gpu/drm/i915/display/intel_frontbuffer.c
@@ -232,6 +232,8 @@ static void frontbuffer_release(struct kref *ref)
RCU_INIT_POINTER(obj->frontbuffer, NULL);
spin_unlock(&to_i915(obj->base.dev)->fb_tracking.lock);
 
+   i915_active_fini(&front->write);
+
i915_gem_object_put(obj);
kfree_rcu(front, rcu);
 }
diff --git a/drivers/gpu/drm/i915/gt/selftest_engine_heartbeat.c 
b/drivers/gpu/drm/i915/gt/selftest_engine_heartbeat.c
index 73243ba59c7d..e73854dd2fe0 100644
--- a/drivers/gpu/drm/i915/gt/selftest_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/selftest_engine_heartbeat.c
@@ -47,7 +47,10 @@ static int pulse_active(struct i915_active *active)
 
 static void pulse_free(struct kref *kref)
 {
-   kfree(container_of(kref, struct pulse, kref));
+   struct pulse *p = container_of(kref, typeof(*p), kref);
+
+   i915_active_fini(&p->active);
+   kfree(p);
 }
 
 static void pulse_put(struct pulse *p)
-- 
2.20.1



[Intel-gfx] [PATCH 03/66] drm/i915: Remove requirement for holding i915_request.lock for breadcrumbs

2020-07-15 Thread Chris Wilson
Since the breadcrumb enabling/cancelling itself is serialised by the
breadcrumbs.irq_lock, with a bit of care we can remove the outer
serialisation with i915_request.lock for concurrent
dma_fence_enable_signaling(). This has the important side-effect of
eliminating the nested i915_request.lock within request submission.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gt/intel_breadcrumbs.c | 100 +++-
 drivers/gpu/drm/i915/gt/intel_lrc.c |  14 ---
 drivers/gpu/drm/i915/i915_request.c |  30 ++
 3 files changed, 63 insertions(+), 81 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c 
b/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
index 91786310c114..87fd06d3eb3f 100644
--- a/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
+++ b/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c
@@ -220,17 +220,17 @@ static void signal_irq_work(struct irq_work *work)
}
 }
 
-static bool __intel_breadcrumbs_arm_irq(struct intel_breadcrumbs *b)
+static void __intel_breadcrumbs_arm_irq(struct intel_breadcrumbs *b)
 {
struct intel_engine_cs *engine =
container_of(b, struct intel_engine_cs, breadcrumbs);
 
lockdep_assert_held(&b->irq_lock);
if (b->irq_armed)
-   return true;
+   return;
 
if (!intel_gt_pm_get_if_awake(engine->gt))
-   return false;
+   return;
 
/*
 * The breadcrumb irq will be disarmed on the interrupt after the
@@ -250,8 +250,6 @@ static bool __intel_breadcrumbs_arm_irq(struct 
intel_breadcrumbs *b)
 
if (!b->irq_enabled++)
irq_enable(engine);
-
-   return true;
 }
 
 void intel_engine_init_breadcrumbs(struct intel_engine_cs *engine)
@@ -310,57 +308,69 @@ void intel_engine_fini_breadcrumbs(struct intel_engine_cs 
*engine)
 {
 }
 
-bool i915_request_enable_breadcrumb(struct i915_request *rq)
+static void insert_breadcrumb(struct i915_request *rq,
+ struct intel_breadcrumbs *b)
 {
-   lockdep_assert_held(&rq->lock);
+   struct intel_context *ce = rq->context;
+   struct list_head *pos;
 
-   if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &rq->fence.flags))
-   return true;
+   if (test_bit(I915_FENCE_FLAG_SIGNAL, &rq->fence.flags))
+   return;
 
-   if (test_bit(I915_FENCE_FLAG_ACTIVE, &rq->fence.flags)) {
-   struct intel_breadcrumbs *b = &rq->engine->breadcrumbs;
-   struct intel_context *ce = rq->context;
-   struct list_head *pos;
+   __intel_breadcrumbs_arm_irq(b);
 
-   spin_lock(&b->irq_lock);
+   /*
+* We keep the seqno in retirement order, so we can break
+* inside intel_engine_signal_breadcrumbs as soon as we've
+* passed the last completed request (or seen a request that
+* hasn't event started). We could walk the timeline->requests,
+* but keeping a separate signalers_list has the advantage of
+* hopefully being much smaller than the full list and so
+* provides faster iteration and detection when there are no
+* more interrupts required for this context.
+*
+* We typically expect to add new signalers in order, so we
+* start looking for our insertion point from the tail of
+* the list.
+*/
+   list_for_each_prev(pos, &ce->signals) {
+   struct i915_request *it =
+   list_entry(pos, typeof(*it), signal_link);
+
+   if (i915_seqno_passed(rq->fence.seqno, it->fence.seqno))
+   break;
+   }
+   list_add(&rq->signal_link, pos);
+   if (pos == &ce->signals) /* catch transitions from empty list */
+   list_move_tail(&ce->signal_link, &b->signalers);
+   GEM_BUG_ON(!check_signal_order(ce, rq));
 
-   if (test_bit(I915_FENCE_FLAG_SIGNAL, &rq->fence.flags))
-   goto unlock;
+   set_bit(I915_FENCE_FLAG_SIGNAL, &rq->fence.flags);
+}
 
-   if (!__intel_breadcrumbs_arm_irq(b))
-   goto unlock;
+bool i915_request_enable_breadcrumb(struct i915_request *rq)
+{
+   struct intel_breadcrumbs *b;
 
-   /*
-* We keep the seqno in retirement order, so we can break
-* inside intel_engine_signal_breadcrumbs as soon as we've
-* passed the last completed request (or seen a request that
-* hasn't event started). We could walk the timeline->requests,
-* but keeping a separate signalers_list has the advantage of
-* hopefully being much smaller than the full list and so
-* provides faster iteration and detection when there are no
-* more interrupts required for this context.
-*
-* We typically expect to add new signalers in order, so we
-* start looking for our

[Intel-gfx] [PATCH 54/66] drm/i915/gt: Remove timeslice suppression

2020-07-15 Thread Chris Wilson
In the next patch, we remove the strict priority system and continuously
re-evaluate the relative priority of tasks. As such we need to enable
the timeslice whenever there is more than one context in the pipeline.
This simplifies the decision and removes some of the tweaks to suppress
timeslicing, allowing us to lift the timeslice enabling to a common spot
at the end of running the submission tasklet.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gt/intel_engine_types.h |  10 --
 drivers/gpu/drm/i915/gt/intel_lrc.c  | 146 +++
 2 files changed, 52 insertions(+), 104 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h 
b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index a0ed041cfab4..354e01c560f2 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -236,16 +236,6 @@ struct intel_engine_execlists {
 */
unsigned int port_mask;
 
-   /**
-* @switch_priority_hint: Second context priority.
-*
-* We submit multiple contexts to the HW simultaneously and would
-* like to occasionally switch between them to emulate timeslicing.
-* To know when timeslicing is suitable, we track the priority of
-* the context submitted second.
-*/
-   int switch_priority_hint;
-
/**
 * @queue_priority_hint: Highest pending priority.
 *
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c 
b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 2dd116c0d2a1..29072215635e 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1869,25 +1869,6 @@ static void defer_active(struct intel_engine_cs *engine)
defer_request(rq, i915_sched_lookup_priolist(engine, rq_prio(rq)));
 }
 
-static bool
-need_timeslice(const struct intel_engine_cs *engine,
-  const struct i915_request *rq)
-{
-   int hint;
-
-   if (!intel_engine_has_timeslices(engine))
-   return false;
-
-   hint = max(engine->execlists.queue_priority_hint,
-  virtual_prio(&engine->execlists));
-
-   if (!list_is_last(&rq->sched.link, &engine->active.requests))
-   hint = max(hint, rq_prio(list_next_entry(rq, sched.link)));
-
-   GEM_BUG_ON(hint >= I915_PRIORITY_UNPREEMPTABLE);
-   return hint >= effective_prio(rq);
-}
-
 static bool
 timeslice_yield(const struct intel_engine_execlists *el,
const struct i915_request *rq)
@@ -1907,76 +1888,63 @@ timeslice_yield(const struct intel_engine_execlists *el,
return rq->context->lrc.ccid == READ_ONCE(el->yield);
 }
 
-static bool
-timeslice_expired(const struct intel_engine_execlists *el,
- const struct i915_request *rq)
+static bool needs_timeslice(const struct intel_engine_cs *engine,
+   const struct i915_request *rq)
 {
-   return timer_expired(&el->timer) || timeslice_yield(el, rq);
-}
+   /* If not currently active, or about to switch, wait for next event */
+   if (!rq || i915_request_completed(rq))
+   return false;
 
-static int
-switch_prio(struct intel_engine_cs *engine, const struct i915_request *rq)
-{
-   if (list_is_last(&rq->sched.link, &engine->active.requests))
-   return engine->execlists.queue_priority_hint;
+   /* We do not need to start the timeslice until after the ACK */
+   if (READ_ONCE(engine->execlists.pending[0]))
+   return false;
 
-   return rq_prio(list_next_entry(rq, sched.link));
-}
+   /* If ELSP[1] is occupied, always check to see if worth slicing */
+   if (!list_is_last_rcu(&rq->sched.link, &engine->active.requests))
+   return true;
 
-static inline unsigned long
-timeslice(const struct intel_engine_cs *engine)
-{
-   return READ_ONCE(engine->props.timeslice_duration_ms);
+   /* Otherwise, ELSP[0] is by itself, but may be waiting in the queue */
+   if (!RB_EMPTY_ROOT(&engine->execlists.queue.rb_root))
+   return true;
+
+   return !RB_EMPTY_ROOT(&engine->execlists.virtual.rb_root);
 }
 
-static unsigned long active_timeslice(const struct intel_engine_cs *engine)
+static bool
+timeslice_expired(struct intel_engine_cs *engine, const struct i915_request 
*rq)
 {
-   const struct intel_engine_execlists *execlists = &engine->execlists;
-   const struct i915_request *rq = *execlists->active;
+   const struct intel_engine_execlists *el = &engine->execlists;
 
-   if (!rq || i915_request_completed(rq))
-   return 0;
+   if (!intel_engine_has_timeslices(engine))
+   return false;
 
-   if (READ_ONCE(execlists->switch_priority_hint) < effective_prio(rq))
-   return 0;
+   if (i915_request_has_nopreempt(rq) && i915_request_started(rq))
+   return false;
+
+   if (!needs_timeslice(engine, rq))
+   return false;
 
-   return timesli

[Intel-gfx] [PATCH 24/66] drm/i915/gem: Include secure batch in common execbuf pinning

2020-07-15 Thread Chris Wilson
Pull the GGTT binding for the secure batch dispatch into the common vma
pinning routine for execbuf, so that there is just a single central
place for all i915_vma_pin().

Signed-off-by: Chris Wilson 
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c| 88 +++
 1 file changed, 51 insertions(+), 37 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c 
b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 8c1f3528b1e9..b6290c2b99c8 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -1676,6 +1676,48 @@ static int eb_alloc_cmdparser(struct i915_execbuffer *eb)
return err;
 }
 
+static int eb_secure_batch(struct i915_execbuffer *eb)
+{
+   struct i915_vma *vma = eb->batch->vma;
+
+   /*
+* snb/ivb/vlv conflate the "batch in ppgtt" bit with the "non-secure
+* batch" bit. Hence we need to pin secure batches into the global gtt.
+* hsw should have this fixed, but bdw mucks it up again.
+*/
+   if (!(eb->batch_flags & I915_DISPATCH_SECURE))
+   return 0;
+
+   if (GEM_WARN_ON(vma->vm != &eb->engine->gt->ggtt->vm)) {
+   struct eb_vma *ev;
+
+   ev = kzalloc(sizeof(*ev), GFP_KERNEL);
+   if (!ev)
+   return -ENOMEM;
+
+   vma = i915_vma_instance(vma->obj,
+   &eb->engine->gt->ggtt->vm,
+   NULL);
+   if (IS_ERR(vma)) {
+   kfree(ev);
+   return PTR_ERR(vma);
+   }
+
+   ev->vma = i915_vma_get(vma);
+   ev->exec = &no_entry;
+
+   list_add(&ev->submit_link, &eb->submit_list);
+   list_add(&ev->reloc_link, &eb->array->aux_list);
+   list_add(&ev->bind_link, &eb->bind_list);
+
+   GEM_BUG_ON(eb->batch->vma->private);
+   eb->batch = ev;
+   }
+
+   eb->batch->flags |= EXEC_OBJECT_NEEDS_GTT;
+   return 0;
+}
+
 static unsigned int eb_batch_index(const struct i915_execbuffer *eb)
 {
if (eb->args->flags & I915_EXEC_BATCH_FIRST)
@@ -1825,6 +1867,10 @@ static int eb_lookup_vmas(struct i915_execbuffer *eb)
if (err)
return err;
 
+   err = eb_secure_batch(eb);
+   if (err)
+   return err;
+
return 0;
 }
 
@@ -2805,7 +2851,7 @@ add_to_client(struct i915_request *rq, struct drm_file 
*file)
spin_unlock(&file_priv->mm.lock);
 }
 
-static int eb_submit(struct i915_execbuffer *eb, struct i915_vma *batch)
+static int eb_submit(struct i915_execbuffer *eb)
 {
int err;
 
@@ -2832,7 +2878,7 @@ static int eb_submit(struct i915_execbuffer *eb, struct 
i915_vma *batch)
}
 
err = eb->engine->emit_bb_start(eb->request,
-   batch->node.start +
+   eb->batch->vma->node.start +
eb->batch_start_offset,
eb->batch_len,
eb->batch_flags);
@@ -3311,7 +3357,6 @@ i915_gem_do_execbuffer(struct drm_device *dev,
struct i915_execbuffer eb;
struct dma_fence *in_fence = NULL;
struct sync_file *out_fence = NULL;
-   struct i915_vma *batch;
int out_fence_fd = -1;
int err;
 
@@ -3412,34 +3457,6 @@ i915_gem_do_execbuffer(struct drm_device *dev,
if (err)
goto err_vma;
 
-   /*
-* snb/ivb/vlv conflate the "batch in ppgtt" bit with the "non-secure
-* batch" bit. Hence we need to pin secure batches into the global gtt.
-* hsw should have this fixed, but bdw mucks it up again. */
-   batch = i915_vma_get(eb.batch->vma);
-   if (eb.batch_flags & I915_DISPATCH_SECURE) {
-   struct i915_vma *vma;
-
-   /*
-* So on first glance it looks freaky that we pin the batch here
-* outside of the reservation loop. But:
-* - The batch is already pinned into the relevant ppgtt, so we
-*   already have the backing storage fully allocated.
-* - No other BO uses the global gtt (well contexts, but meh),
-*   so we don't really have issues with multiple objects not
-*   fitting due to fragmentation.
-* So this is actually safe.
-*/
-   vma = i915_gem_object_ggtt_pin(batch->obj, NULL, 0, 0, 0);
-   if (IS_ERR(vma)) {
-   err = PTR_ERR(vma);
-   goto err_vma;
-   }
-
-   GEM_BUG_ON(vma->obj != batch->obj);
-   batch = vma;
-   }
-
/* All GPU relocation batches must be submitted prior to the user rq */
GEM_BUG_ON(eb.reloc_cache.rq);
 
@@ -3447,7 +3464,7 @@ i915_gem

[Intel-gfx] [PATCH 46/66] drm/i915/gt: Convert stats.active to plain unsigned int

2020-07-15 Thread Chris Wilson
As context-in/out is now always serialised, we do not have to worry
about concurrent enabling/disabling of the busy-stats and can reduce the
atomic_t active to a plain unsigned int, and the seqlock to a seqcount.
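
The reader side then becomes the usual seqcount retry loop; a condensed
sketch of the arrangement (field and function names are simplified relative
to the patch below):

/*
 * The writer is serialised externally (context_in/out run from the
 * submission path), so a seqcount_t is sufficient for the PMU reader
 * to snapshot {active, start, total} consistently.
 */
struct busy_stats {
        seqcount_t lock;
        unsigned int active;    /* contexts currently on the HW */
        ktime_t start;          /* when 'active' became non-zero */
        ktime_t total;          /* busy time accumulated at last context-out */
};

static ktime_t busy_stats_sample(struct busy_stats *s)
{
        unsigned int seq;
        ktime_t total;

        do {
                seq = read_seqcount_begin(&s->lock);
                total = s->total;
                if (READ_ONCE(s->active))
                        total = ktime_add(total,
                                          ktime_sub(ktime_get(), s->start));
        } while (read_seqcount_retry(&s->lock, seq));

        return total;
}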

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c|  8 ++--
 drivers/gpu/drm/i915/gt/intel_engine_stats.h | 45 
 drivers/gpu/drm/i915/gt/intel_engine_types.h |  4 +-
 3 files changed, 34 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 10997cae5e41..fcdf336ebf43 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -338,7 +338,7 @@ static int intel_engine_setup(struct intel_gt *gt, enum 
intel_engine_id id)
engine->schedule = NULL;
 
ewma__engine_latency_init(&engine->latency);
-   seqlock_init(&engine->stats.lock);
+   seqcount_init(&engine->stats.lock);
 
ATOMIC_INIT_NOTIFIER_HEAD(&engine->context_status_notifier);
 
@@ -1692,7 +1692,7 @@ static ktime_t __intel_engine_get_busy_time(struct 
intel_engine_cs *engine,
 * add it to the total.
 */
*now = ktime_get();
-   if (atomic_read(&engine->stats.active))
+   if (READ_ONCE(engine->stats.active))
total = ktime_add(total, ktime_sub(*now, engine->stats.start));
 
return total;
@@ -1711,9 +1711,9 @@ ktime_t intel_engine_get_busy_time(struct intel_engine_cs 
*engine, ktime_t *now)
ktime_t total;
 
do {
-   seq = read_seqbegin(&engine->stats.lock);
+   seq = read_seqcount_begin(&engine->stats.lock);
total = __intel_engine_get_busy_time(engine, now);
-   } while (read_seqretry(&engine->stats.lock, seq));
+   } while (read_seqcount_retry(&engine->stats.lock, seq));
 
return total;
 }
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_stats.h 
b/drivers/gpu/drm/i915/gt/intel_engine_stats.h
index 58491eae3482..24fbdd94351a 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_stats.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_stats.h
@@ -17,33 +17,44 @@ static inline void intel_engine_context_in(struct 
intel_engine_cs *engine)
 {
unsigned long flags;
 
-   if (atomic_add_unless(&engine->stats.active, 1, 0))
+   if (engine->stats.active) {
+   engine->stats.active++;
return;
-
-   write_seqlock_irqsave(&engine->stats.lock, flags);
-   if (!atomic_add_unless(&engine->stats.active, 1, 0)) {
-   engine->stats.start = ktime_get();
-   atomic_inc(&engine->stats.active);
}
-   write_sequnlock_irqrestore(&engine->stats.lock, flags);
+
+   /* The writer is serialised; but the pmu reader may be from hardirq */
+   local_irq_save(flags);
+   write_seqcount_begin(&engine->stats.lock);
+
+   engine->stats.start = ktime_get();
+   engine->stats.active++;
+
+   write_seqcount_end(&engine->stats.lock);
+   local_irq_restore(flags);
+
+   GEM_BUG_ON(!engine->stats.active);
 }
 
 static inline void intel_engine_context_out(struct intel_engine_cs *engine)
 {
unsigned long flags;
 
-   GEM_BUG_ON(!atomic_read(&engine->stats.active));
-
-   if (atomic_add_unless(&engine->stats.active, -1, 1))
+   GEM_BUG_ON(!engine->stats.active);
+   if (engine->stats.active > 1) {
+   engine->stats.active--;
return;
-
-   write_seqlock_irqsave(&engine->stats.lock, flags);
-   if (atomic_dec_and_test(&engine->stats.active)) {
-   engine->stats.total =
-   ktime_add(engine->stats.total,
- ktime_sub(ktime_get(), engine->stats.start));
}
-   write_sequnlock_irqrestore(&engine->stats.lock, flags);
+
+   local_irq_save(flags);
+   write_seqcount_begin(&engine->stats.lock);
+
+   engine->stats.active--;
+   engine->stats.total =
+   ktime_add(engine->stats.total,
+ ktime_sub(ktime_get(), engine->stats.start));
+
+   write_seqcount_end(&engine->stats.lock);
+   local_irq_restore(flags);
 }
 
 #endif /* __INTEL_ENGINE_STATS_H__ */
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h 
b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index f86efafd385f..7be475315fa9 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -550,12 +550,12 @@ struct intel_engine_cs {
/**
 * @active: Number of contexts currently scheduled in.
 */
-   atomic_t active;
+   unsigned int active;
 
/**
 * @lock: Lock protecting the below fields.
 */
-   seqlock_t lock;
+   seqcount_t lock;
 
/**
 * @total: Total time this engine was busy.
-- 
2.20.1

[Intel-gfx] [PATCH 21/66] drm/i915/gem: Asynchronous GTT unbinding

2020-07-15 Thread Chris Wilson
It is reasonably common for userspace (even modern drivers like iris) to
reuse an active address for a new buffer. This would cause the
application to stall under its mutex (originally struct_mutex) until the
old batches were idle and it could synchronously remove the stale PTE.
However, we can queue up a job that waits on the signals for the old
nodes to complete and, upon those signals, removes the old nodes,
replacing them with the new ones for the batch. This is still CPU driven, but in
theory we can do the GTT patching from the GPU. The job itself has a
completion signal allowing the execbuf to wait upon the rebinding, and
also other observers to coordinate with the common VM activity.

Letting userspace queue up more work lets it do more without blocking
other clients. In turn, we take care not to let it queue up too much
concurrent work, creating a small number of queues for each context to
limit the number of concurrent tasks.

The implementation relies on only scheduling one unbind operation per
vma as we use the unbound vma->node location to track the stale PTE.

Closes: https://gitlab.freedesktop.org/drm/intel/issues/1402
Signed-off-by: Chris Wilson 
Cc: Matthew Auld 
Cc: Andi Shyti 
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c| 919 --
 drivers/gpu/drm/i915/gt/gen6_ppgtt.c  |   1 +
 drivers/gpu/drm/i915/gt/intel_gtt.c   |   4 +
 drivers/gpu/drm/i915/gt/intel_gtt.h   |   2 +
 drivers/gpu/drm/i915/i915_gem.c   |   7 +
 drivers/gpu/drm/i915/i915_gem_gtt.c   |   5 +
 drivers/gpu/drm/i915/i915_vma.c   |  71 +-
 drivers/gpu/drm/i915/i915_vma.h   |   4 +
 8 files changed, 883 insertions(+), 130 deletions(-)
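
As a rough analogy only (a userspace toy, not the i915_sw_fence/dma_fence
machinery the patch actually uses), the unbind job is a deferred callback
with its own completion: it rewrites the stale PTEs only once the fences
covering the old users have signalled, and the execbuf in turn waits on the
job's completion.

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for a fence: a signalled flag plus one chained callback. */
struct fence {
        bool signalled;
        void (*cb)(void *data);
        void *data;
};

static void fence_signal(struct fence *f)
{
        f->signalled = true;
        if (f->cb)
                f->cb(f->data);
}

struct unbind_job {
        struct fence *old_users;   /* last fence of the vma being evicted */
        struct fence done;         /* the execbuf (and others) wait on this */
        const char *node;          /* the stale range to rewrite */
};

static void unbind_job_run(void *data)
{
        struct unbind_job *job = data;

        printf("rewriting PTEs for %s\n", job->node);  /* still CPU driven */
        fence_signal(&job->done);                      /* unblock the execbuf */
}

static void queue_unbind(struct unbind_job *job)
{
        /* Run immediately if idle, otherwise when the old users complete. */
        if (job->old_users->signalled) {
                unbind_job_run(job);
        } else {
                job->old_users->cb = unbind_job_run;
                job->old_users->data = job;
        }
}

int main(void)
{
        struct fence old = { .signalled = false };
        struct unbind_job job = { .old_users = &old, .node = "0x100000-0x200000" };

        queue_unbind(&job);        /* execbuf keeps building its request */
        fence_signal(&old);        /* old batch retires -> PTEs rewritten */
        printf("done: %d\n", job.done.signalled);
        return 0;
}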

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c 
b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 430b2d4dc747..bdcbb82bfc3d 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -18,6 +18,7 @@
 #include "gt/intel_gt.h"
 #include "gt/intel_gt_buffer_pool.h"
 #include "gt/intel_gt_pm.h"
+#include "gt/intel_gt_requests.h"
 #include "gt/intel_ring.h"
 
 #include "i915_drv.h"
@@ -43,6 +44,12 @@ struct eb_vma {
u32 handle;
 };
 
+struct eb_bind_vma {
+   struct eb_vma *ev;
+   struct drm_mm_node hole;
+   unsigned int bind_flags;
+};
+
 struct eb_vma_array {
struct kref kref;
struct eb_vma vma[];
@@ -66,11 +73,12 @@ struct eb_vma_array {
 I915_EXEC_RESOURCE_STREAMER)
 
 /* Catch emission of unexpected errors for CI! */
+#define __EINVAL__ 22
 #if IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM)
 #undef EINVAL
 #define EINVAL ({ \
DRM_DEBUG_DRIVER("EINVAL at %s:%d\n", __func__, __LINE__); \
-   22; \
+   __EINVAL__; \
 })
 #endif
 
@@ -311,6 +319,12 @@ static struct eb_vma_array *eb_vma_array_create(unsigned 
int count)
return arr;
 }
 
+static struct eb_vma_array *eb_vma_array_get(struct eb_vma_array *arr)
+{
+   kref_get(&arr->kref);
+   return arr;
+}
+
 static inline void eb_unreserve_vma(struct eb_vma *ev)
 {
struct i915_vma *vma = ev->vma;
@@ -444,7 +458,10 @@ eb_vma_misplaced(const struct drm_i915_gem_exec_object2 
*entry,
 const struct i915_vma *vma,
 unsigned int flags)
 {
-   if (vma->node.size < entry->pad_to_size)
+   if (test_bit(I915_VMA_ERROR_BIT, __i915_vma_flags(vma)))
+   return true;
+
+   if (vma->node.size < max(vma->size, entry->pad_to_size))
return true;
 
if (entry->alignment && !IS_ALIGNED(vma->node.start, entry->alignment))
@@ -469,32 +486,6 @@ eb_vma_misplaced(const struct drm_i915_gem_exec_object2 
*entry,
return false;
 }
 
-static u64 eb_pin_flags(const struct drm_i915_gem_exec_object2 *entry,
-   unsigned int exec_flags)
-{
-   u64 pin_flags = 0;
-
-   if (exec_flags & EXEC_OBJECT_NEEDS_GTT)
-   pin_flags |= PIN_GLOBAL;
-
-   /*
-* Wa32bitGeneralStateOffset & Wa32bitInstructionBaseOffset,
-* limit address to the first 4GBs for unflagged objects.
-*/
-   if (!(exec_flags & EXEC_OBJECT_SUPPORTS_48B_ADDRESS))
-   pin_flags |= PIN_ZONE_4G;
-
-   if (exec_flags & __EXEC_OBJECT_NEEDS_MAP)
-   pin_flags |= PIN_MAPPABLE;
-
-   if (exec_flags & EXEC_OBJECT_PINNED)
-   pin_flags |= entry->offset | PIN_OFFSET_FIXED;
-   else if (exec_flags & __EXEC_OBJECT_NEEDS_BIAS)
-   pin_flags |= BATCH_OFFSET_BIAS | PIN_OFFSET_BIAS;
-
-   return pin_flags;
-}
-
 static bool eb_pin_vma_fence_inplace(struct eb_vma *ev)
 {
struct i915_vma *vma = ev->vma;
@@ -522,6 +513,10 @@ eb_pin_vma_inplace(struct i915_execbuffer *eb,
struct i915_vma *vma = ev->vma;
unsigned int pin_flags;
 
+   /* Concurrent async binds in progress, get in the queue */
+   if (!i915_active_is_idle(&vma->vm->binding))
+   return false

[Intel-gfx] [PATCH 25/66] drm/i915/gem: Reintroduce multiple passes for reloc processing

2020-07-15 Thread Chris Wilson
The prospect of locking the entire submission sequence under a wide
ww_mutex re-imposes some key restrictions, in particular that we must
not call copy_(from|to)_user underneath the mutex (as the faulthandlers
themselves may need to take the ww_mutex). To satisfy this requirement,
we need to split the relocation handling into multiple phases again.
After dropping the reservations, we need to allocate enough buffer space
to both copy the relocations from userspace into, and serve as the
relocation command buffer. Once we have finished copying the
relocations, we can then re-acquire all the objects for the execbuf and
rebind them, including our new relocations objects. After we have bound
all the new and old objects into their final locations, we can then
convert the relocation entries into the GPU commands to update the
relocated vma. Finally, once it is all over and we have dropped the
ww_mutex for the last time, we can then complete the update of the user
relocation entries.

Signed-off-by: Chris Wilson 
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c| 887 +-
 .../i915/gem/selftests/i915_gem_execbuffer.c  | 201 ++--
 2 files changed, 564 insertions(+), 524 deletions(-)
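
The resulting flow, reduced to a stubbed-out sketch (the helper names here
are invented; only the placement of the user copies relative to the ww_mutex
region matters):

#include <stdio.h>

struct eb { int unused; };

/* Invented, stubbed-out helpers -- they only print which phase runs. */
static int eb_relocs_copy_user(struct eb *eb)   { puts("1: copy_from_user() all reloc entries, no locks held"); return 0; }
static int eb_reserve_vm(struct eb *eb)         { puts("2: take the ww_mutex, bind objects + reloc buffers"); return 0; }
static int eb_relocs_gpu(struct eb *eb)         { puts("3: emit GPU commands from the copied relocs"); return 0; }
static int eb_relocs_update_user(struct eb *eb) { puts("4: ww_mutex dropped, copy_to_user() updated offsets"); return 0; }

static int execbuf_relocations(struct eb *eb)
{
        int err;

        err = eb_relocs_copy_user(eb);
        if (err)
                return err;

        err = eb_reserve_vm(eb);
        if (err)
                return err;

        err = eb_relocs_gpu(eb);
        if (err)
                return err;

        return eb_relocs_update_user(eb);
}

int main(void)
{
        struct eb eb = { 0 };

        return execbuf_relocations(&eb);
}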

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c 
b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index b6290c2b99c8..ebabc0746d50 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -35,6 +35,7 @@ struct eb_vma {
 
/** This vma's place in the execbuf reservation list */
struct drm_i915_gem_exec_object2 *exec;
+   u32 bias;
 
struct list_head bind_link;
struct list_head unbound_link;
@@ -60,15 +61,12 @@ struct eb_vma_array {
 #define __EXEC_OBJECT_HAS_PIN  BIT(31)
 #define __EXEC_OBJECT_HAS_FENCEBIT(30)
 #define __EXEC_OBJECT_NEEDS_MAPBIT(29)
-#define __EXEC_OBJECT_NEEDS_BIAS   BIT(28)
-#define __EXEC_OBJECT_INTERNAL_FLAGS   (~0u << 28) /* all of the above */
+#define __EXEC_OBJECT_INTERNAL_FLAGS   (~0u << 29) /* all of the above */
 
 #define __EXEC_HAS_RELOC   BIT(31)
 #define __EXEC_INTERNAL_FLAGS  (~0u << 31)
 #define UPDATE PIN_OFFSET_FIXED
 
-#define BATCH_OFFSET_BIAS (256*1024)
-
 #define __I915_EXEC_ILLEGAL_FLAGS \
(__I915_EXEC_UNKNOWN_FLAGS | \
 I915_EXEC_CONSTANTS_MASK  | \
@@ -266,20 +264,21 @@ struct i915_execbuffer {
 * obj/page
 */
struct reloc_cache {
-   struct drm_mm_node node; /** temporary GTT binding */
unsigned int gen; /** Cached value of INTEL_GEN */
bool use_64bit_reloc : 1;
-   bool has_llc : 1;
bool has_fence : 1;
bool needs_unfenced : 1;
 
struct intel_context *ce;
 
-   struct i915_vma *target;
-   struct i915_request *rq;
-   struct i915_vma *rq_vma;
-   u32 *rq_cmd;
-   unsigned int rq_size;
+   struct eb_relocs_link {
+   struct i915_vma *vma;
+   } head;
+   struct drm_i915_gem_relocation_entry *map;
+   unsigned int pos;
+   unsigned int max;
+
+   unsigned long bufsz;
} reloc_cache;
 
struct eb_cmdparser {
@@ -288,7 +287,7 @@ struct i915_execbuffer {
} parser;
 
u64 invalid_flags; /** Set of execobj.flags that are invalid */
-   u32 context_flags; /** Set of execobj.flags to insert from the ctx */
+   u32 context_bias;
 
u32 batch_start_offset; /** Location within object of batch */
u32 batch_len; /** Length of batch within object */
@@ -308,6 +307,12 @@ static struct drm_i915_gem_exec_object2 no_entry = {
.offset = -1ull
 };
 
+static u64 noncanonical_addr(u64 addr, const struct i915_address_space *vm)
+{
+   GEM_BUG_ON(!is_power_of_2(vm->total));
+   return addr & (vm->total - 1);
+}
+
 static inline bool eb_use_cmdparser(const struct i915_execbuffer *eb)
 {
return intel_engine_requires_cmd_parser(eb->engine) ||
@@ -479,11 +484,12 @@ static int eb_create(struct i915_execbuffer *eb)
return 0;
 }
 
-static bool
-eb_vma_misplaced(const struct drm_i915_gem_exec_object2 *entry,
-const struct i915_vma *vma,
-unsigned int flags)
+static bool eb_vma_misplaced(const struct eb_vma *ev)
 {
+   const struct drm_i915_gem_exec_object2 *entry = ev->exec;
+   const struct i915_vma *vma = ev->vma;
+   unsigned int flags = ev->flags;
+
if (test_bit(I915_VMA_ERROR_BIT, __i915_vma_flags(vma)))
return true;
 
@@ -497,8 +503,7 @@ eb_vma_misplaced(const struct drm_i915_gem_exec_object2 
*entry,
vma->node.start != entry->offset)
return true;
 
-   if (flags & __EXEC_OBJECT_NEEDS_BIAS &&
-   vma->node.start < BATCH_OFFSET_BIAS)
+   if (vma->node.start

[Intel-gfx] [PATCH 42/66] drm/i915/gt: Simplify virtual engine handling for execlists_hold()

2020-07-15 Thread Chris Wilson
Now that the tasklet completely controls scheduling of the requests, and
we postpone scheduling out the old requests, we can keep a hanging
virtual request bound to the engine on which it hung, and remove it from
the queue. On release, it will be returned to the same engine and remain
in its queue until it is scheduled, after which point it will become
eligible for transfer to a sibling. Instead, we could opt to resubmit the
request along the virtual engine on unhold, making it eligible for load
balancing immediately -- but that seems like a pointless optimisation
for a hanging context.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 29 -
 1 file changed, 29 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c 
b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 062185116e13..0020fc77b3da 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2771,35 +2771,6 @@ static bool execlists_hold(struct intel_engine_cs 
*engine,
goto unlock;
}
 
-   if (rq->engine != engine) { /* preempted virtual engine */
-   struct virtual_engine *ve = to_virtual_engine(rq->engine);
-
-   /*
-* intel_context_inflight() is only protected by virtue
-* of process_csb() being called only by the tasklet (or
-* directly from inside reset while the tasklet is suspended).
-* Assert that neither of those are allowed to run while we
-* poke at the request queues.
-*/
-   GEM_BUG_ON(!reset_in_progress(&engine->execlists));
-
-   /*
-* An unsubmitted request along a virtual engine will
-* remain on the active (this) engine until we are able
-* to process the context switch away (and so mark the
-* context as no longer in flight). That cannot have happened
-* yet, otherwise we would not be hanging!
-*/
-   spin_lock(&ve->base.active.lock);
-   GEM_BUG_ON(intel_context_inflight(rq->context) != engine);
-   GEM_BUG_ON(ve->request != rq);
-   ve->request = NULL;
-   spin_unlock(&ve->base.active.lock);
-   i915_request_put(rq);
-
-   rq->engine = engine;
-   }
-
/*
 * Transfer this request onto the hold queue to prevent it
 * being resumbitted to HW (and potentially completed) before we have
-- 
2.20.1


[Intel-gfx] [PATCH 34/66] drm/i915/gt: Decouple completed requests on unwind

2020-07-15 Thread Chris Wilson
Since the introduction of preempt-to-busy, requests can complete in the
background, even while they are not on the engine->active.requests list.
As such, the engine->active.requests list itself is not in strict
retirement order, and we have to scan the entire list while unwinding to
not miss any. However, if the request is completed we currently leave it
on the list [until retirement], but we could just as easily remove it
and stop treating it as active. We then only have to traverse it
once while unwinding in quick succession.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 6 --
 drivers/gpu/drm/i915/i915_request.c | 3 ++-
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c 
b/drivers/gpu/drm/i915/gt/intel_lrc.c
index aa7be7f05f8c..f52b52a7b1d3 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1114,8 +1114,10 @@ __unwind_incomplete_requests(struct intel_engine_cs 
*engine)
list_for_each_entry_safe_reverse(rq, rn,
 &engine->active.requests,
 sched.link) {
-   if (i915_request_completed(rq))
-   continue; /* XXX */
+   if (i915_request_completed(rq)) {
+   list_del_init(&rq->sched.link);
+   continue;
+   }
 
__i915_request_unsubmit(rq);
 
diff --git a/drivers/gpu/drm/i915/i915_request.c 
b/drivers/gpu/drm/i915/i915_request.c
index 83696955ddf7..31c60e6c5c7a 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -311,7 +311,8 @@ bool i915_request_retire(struct i915_request *rq)
 * request that we have removed from the HW and put back on a run
 * queue.
 */
-   remove_from_engine(rq);
+   if (!list_empty(&rq->sched.link))
+   remove_from_engine(rq);
 
spin_lock_irq(&rq->lock);
if (!i915_request_signaled(rq))
-- 
2.20.1


[Intel-gfx] [PATCH 55/66] drm/i915: Fair low-latency scheduling

2020-07-15 Thread Chris Wilson
The first "scheduler" was a topological sorting of requests into
priority order. The execution order was deterministic, the earliest
submitted, highest priority request would be executed first. Priority
inherited ensured that inversions were kept at bay, and allowed us to
dynamically boost priorities (e.g. for interactive pageflips).

The minimalistic timeslicing scheme was an attempt to introduce fairness
between long running requests, by evicting the active request at the end
of a timeslice and moving it to the back of its priority queue (while
ensuring that dependencies were kept in order). For short running
requests from many clients of equal priority, the scheme is still very
much FIFO submission ordering, and as unfair as before.

To impose fairness, we need an external metric that ensures that clients
are interspersed, so that we don't execute one long chain from client A before
executing any of client B. This could be imposed by the clients by using
fences based on an external clock, that is they only submit work for a
"frame" at frame-interval, instead of submitting as much work as they
are able to. The standard SwapBuffers approach is akin to double
buffering, where, as one frame is being executed, the next is being
submitted, such that there is always a maximum of two frames per client
in the pipeline. Even this scheme exhibits unfairness under load as a
single client will execute two frames back to back before the next, and
with enough clients, deadlines will be missed.

The idea introduced by BFS/MuQSS is that fairness is achieved by
metering with an external clock. Every request, when it becomes ready to
execute is assigned a virtual deadline, and execution order is then
determined by earliest deadline. Priority is used as a hint, rather than
strict ordering, where high priority requests have earlier deadlines,
but not necessarily earlier than outstanding work. Thus work is executed
in order of 'readiness', with timeslicing to demote long running work.

The Achilles' heel of this scheduler is its strong preference for
low-latency and favouring of new queues. Whereas it was easy to dominate
the old scheduler by flooding it with many requests over a short period
of time, the new scheduler can be dominated by a 'synchronous' client
that waits for each of its requests to complete before submitting the
next. As such a client has no history, it is always considered
ready-to-run and receives an earlier deadline than the long running
requests.

To check the impact on throughput (often the downfall of latency
sensitive schedulers), we used gem_wsim to simulate various transcode
workloads with different load balancers, and varying the number of
competing [heterogeneous] clients.

[box plot of relative gem_wsim throughput per client count omitted]

Clients      Min       Max     Median          Avg        Stddev
1          -8.20      5.4      -0.045     -0.02375   0.094722134
2         -15.96     19.28     -0.64      -1.05        2.2428076
4          -5.11      2.95     -1.15      -1.068      0.72382651
8          -5.63      1.85     -0.905     -0.87122449  0.73390971

The impact was on average 1% under contention due to the change in context
execution order and number of context switches.
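
A toy illustration of the deadline ordering described above (the way priority
is scaled into a deadline here is made up; only the "earliest virtual deadline
first, priority as a hint" shape is the point):

#include <stdio.h>
#include <stdlib.h>

struct request {
        const char *name;
        unsigned long ready;      /* when the request became ready to run */
        int prio;                 /* hint only: higher -> shorter relative deadline */
        unsigned long deadline;
};

/* Illustrative: higher priority shrinks the slice added to 'ready'. */
static unsigned long virtual_deadline(unsigned long ready, int prio)
{
        unsigned long slice = 1000;

        return ready + slice / (prio + 1);
}

static int by_deadline(const void *a, const void *b)
{
        const struct request *ra = a, *rb = b;

        return (ra->deadline > rb->deadline) - (ra->deadline < rb->deadline);
}

int main(void)
{
        struct request rq[] = {
                { "old-low-prio",  .ready = 0,   .prio = 0 },
                { "new-high-prio", .ready = 800, .prio = 3 },
                { "new-low-prio",  .ready = 900, .prio = 0 },
        };
        unsigned int i;

        for (i = 0; i < 3; i++)
                rq[i].deadline = virtual_deadline(rq[i].ready, rq[i].prio);

        /* Execution order: earliest deadline first, not strict priority. */
        qsort(rq, 3, sizeof(*rq), by_deadline);
        for (i = 0; i < 3; i++)
                printf("%s (deadline %lu)\n", rq[i].name, rq[i].deadline);

        return 0;
}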

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |  12 +-
 .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
 drivers/gpu/drm/i915/gt/intel_engine_pm.c |   4 +-
 drivers/gpu/drm/i915/gt/intel_engine_types.h  |  14 -
 drivers/gpu/drm/i915/gt/intel_lrc.c   | 230 +---
 drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |   5 +-
 drivers/gpu/drm/i915/gt/selftest_lrc.c|  41 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   6 +-
 drivers/gpu/drm/i915/i915_priolist_types.h|   7 +-
 drivers/gpu/drm/i915/i915_request.c   |   1 +
 drivers/gpu/drm/i915/i915_scheduler.c | 352 +-
 drivers/gpu/drm/i915/i915_scheduler.h |  24 +-
 driver

[Intel-gfx] [PATCH 31/66] drm/i915/gt: Acquire backing storage for the context

2020-07-15 Thread Chris Wilson
Pull the individual acquisition of the context objects (state, ring,
timeline) under a common i915_acquire_ctx in preparation to allow the
context to evict memory (or rather the i915_acquire_ctx on its behalf).

The context objects maintain their semi-permanent status; that is they
are assumed to be accessible by the HW at all times until we receive a
signal from the HW that they are no longer in use. Currently, we
generate such a signal ourselves from the context switch following the
final use of the objects. This means that they will remain on the HW for
an indefinite amount of time, and we retain the use of pinning to keep
them in the same place. As they are pinned, they can be processed
outside of the working set for the requests within the context. This is
useful, as the context shares some global state, causing it to incur a
global lock via its objects. By only requiring that lock as the context
is activated, it is both reduced in frequency and reduced in duration
(as compared to execbuf).

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gt/intel_context.c   | 108 ++--
 drivers/gpu/drm/i915/gt/intel_ring.c  |  17 ++-
 drivers/gpu/drm/i915/gt/intel_ring.h  |   5 +-
 .../gpu/drm/i915/gt/intel_ring_submission.c   | 117 --
 drivers/gpu/drm/i915/gt/intel_timeline.c  |  14 ++-
 drivers/gpu/drm/i915/gt/intel_timeline.h  |  10 +-
 drivers/gpu/drm/i915/gt/mock_engine.c |   2 +
 drivers/gpu/drm/i915/gt/selftest_timeline.c   |  30 -
 8 files changed, 237 insertions(+), 66 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c 
b/drivers/gpu/drm/i915/gt/intel_context.c
index 52db2bde44a3..2f1606365f63 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -6,6 +6,7 @@
 
 #include "gem/i915_gem_context.h"
 #include "gem/i915_gem_pm.h"
+#include "mm/i915_acquire_ctx.h"
 
 #include "i915_drv.h"
 #include "i915_globals.h"
@@ -93,6 +94,27 @@ static void intel_context_active_release(struct 
intel_context *ce)
i915_active_release(&ce->active);
 }
 
+static int __intel_context_sync(struct intel_context *ce)
+{
+   int err;
+
+   err = i915_vma_wait_for_bind(ce->ring->vma);
+   if (err)
+   return err;
+
+   err = i915_vma_wait_for_bind(ce->timeline->hwsp_ggtt);
+   if (err)
+   return err;
+
+   if (ce->state) {
+   err = i915_vma_wait_for_bind(ce->state);
+   if (err)
+   return err;
+   }
+
+   return 0;
+}
+
 int __intel_context_do_pin(struct intel_context *ce)
 {
int err;
@@ -118,6 +140,10 @@ int __intel_context_do_pin(struct intel_context *ce)
}
 
if (likely(!atomic_add_unless(&ce->pin_count, 1, 0))) {
+   err = __intel_context_sync(ce);
+   if (unlikely(err))
+   goto out_unlock;
+
err = intel_context_active_acquire(ce);
if (unlikely(err))
goto out_unlock;
@@ -166,12 +192,12 @@ void intel_context_unpin(struct intel_context *ce)
intel_context_put(ce);
 }
 
-static int __context_pin_state(struct i915_vma *vma)
+static int __context_active_locked(struct i915_vma *vma)
 {
unsigned int bias = i915_ggtt_pin_bias(vma) | PIN_OFFSET_BIAS;
int err;
 
-   err = i915_ggtt_pin(vma, 0, bias | PIN_HIGH);
+   err = i915_ggtt_pin_locked(vma, 0, bias | PIN_HIGH);
if (err)
return err;
 
@@ -200,11 +226,11 @@ static void __context_unpin_state(struct i915_vma *vma)
__i915_vma_unpin(vma);
 }
 
-static int __ring_active(struct intel_ring *ring)
+static int __ring_active_locked(struct intel_ring *ring)
 {
int err;
 
-   err = intel_ring_pin(ring);
+   err = intel_ring_pin_locked(ring);
if (err)
return err;
 
@@ -244,27 +270,53 @@ static void __intel_context_retire(struct i915_active 
*active)
intel_context_put(ce);
 }
 
-static int __intel_context_active(struct i915_active *active)
+static int
+__intel_context_acquire_lock(struct intel_context *ce,
+struct i915_acquire_ctx *ctx)
+{
+   return i915_acquire_ctx_lock(ctx, ce->state->obj);
+}
+
+static int
+intel_context_acquire_lock(struct intel_context *ce,
+  struct i915_acquire_ctx *ctx)
 {
-   struct intel_context *ce = container_of(active, typeof(*ce), active);
int err;
 
-   CE_TRACE(ce, "active\n");
+   err = intel_ring_acquire_lock(ce->ring, ctx);
+   if (err)
+   return err;
 
-   intel_context_get(ce);
+   if (ce->state) {
+   err = __intel_context_acquire_lock(ce, ctx);
+   if (err)
+   return err;
+   }
 
-   err = __ring_active(ce->ring);
+   /* Note that the timeline will migrate as the seqno wrap around */
+   err = intel_timeline_acquire_lock(ce->time

[Intel-gfx] [PATCH 60/66] drm/i915/gt: Couple tasklet scheduling for all CS interrupts

2020-07-15 Thread Chris Wilson
If any engine asks for the tasklet to be kicked from the CS interrupt,
do so. Currently, this is used by the execlists scheduler backends to
feed in the next request to the HW, and similarly could be used by a
ring scheduler, as will be seen in the next patch.

Signed-off-by: Chris Wilson 
Reviewed-by: Mika Kuoppala 
---
 drivers/gpu/drm/i915/gt/intel_gt_irq.c | 19 ++-
 drivers/gpu/drm/i915/gt/intel_gt_irq.h |  3 +++
 drivers/gpu/drm/i915/gt/intel_rps.c|  2 +-
 drivers/gpu/drm/i915/i915_irq.c|  8 
 4 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_irq.c 
b/drivers/gpu/drm/i915/gt/intel_gt_irq.c
index b05da68e52f4..b825b93b4b05 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_irq.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_irq.c
@@ -61,6 +61,15 @@ cs_irq_handler(struct intel_engine_cs *engine, u32 iir)
tasklet_hi_schedule(&engine->execlists.tasklet);
 }
 
+void gen2_engine_cs_irq(struct intel_engine_cs *engine)
+{
+   if (!list_empty(&engine->breadcrumbs.signalers))
+   intel_engine_signal_breadcrumbs(engine);
+
+   if (intel_engine_needs_breadcrumb_tasklet(engine))
+   tasklet_hi_schedule(&engine->execlists.tasklet);
+}
+
 static u32
 gen11_gt_engine_identity(struct intel_gt *gt,
 const unsigned int bank, const unsigned int bit)
@@ -274,9 +283,9 @@ void gen11_gt_irq_postinstall(struct intel_gt *gt)
 void gen5_gt_irq_handler(struct intel_gt *gt, u32 gt_iir)
 {
if (gt_iir & GT_RENDER_USER_INTERRUPT)
-   
intel_engine_signal_breadcrumbs(gt->engine_class[RENDER_CLASS][0]);
+   gen2_engine_cs_irq(gt->engine_class[RENDER_CLASS][0]);
if (gt_iir & ILK_BSD_USER_INTERRUPT)
-   
intel_engine_signal_breadcrumbs(gt->engine_class[VIDEO_DECODE_CLASS][0]);
+   gen2_engine_cs_irq(gt->engine_class[VIDEO_DECODE_CLASS][0]);
 }
 
 static void gen7_parity_error_irq_handler(struct intel_gt *gt, u32 iir)
@@ -300,11 +309,11 @@ static void gen7_parity_error_irq_handler(struct intel_gt 
*gt, u32 iir)
 void gen6_gt_irq_handler(struct intel_gt *gt, u32 gt_iir)
 {
if (gt_iir & GT_RENDER_USER_INTERRUPT)
-   
intel_engine_signal_breadcrumbs(gt->engine_class[RENDER_CLASS][0]);
+   gen2_engine_cs_irq(gt->engine_class[RENDER_CLASS][0]);
if (gt_iir & GT_BSD_USER_INTERRUPT)
-   
intel_engine_signal_breadcrumbs(gt->engine_class[VIDEO_DECODE_CLASS][0]);
+   gen2_engine_cs_irq(gt->engine_class[VIDEO_DECODE_CLASS][0]);
if (gt_iir & GT_BLT_USER_INTERRUPT)
-   
intel_engine_signal_breadcrumbs(gt->engine_class[COPY_ENGINE_CLASS][0]);
+   gen2_engine_cs_irq(gt->engine_class[COPY_ENGINE_CLASS][0]);
 
if (gt_iir & (GT_BLT_CS_ERROR_INTERRUPT |
  GT_BSD_CS_ERROR_INTERRUPT |
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_irq.h 
b/drivers/gpu/drm/i915/gt/intel_gt_irq.h
index 886c5cf408a2..6c69cd563fe1 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_irq.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_irq.h
@@ -9,6 +9,7 @@
 
 #include 
 
+struct intel_engine_cs;
 struct intel_gt;
 
 #define GEN8_GT_IRQS (GEN8_GT_RCS_IRQ | \
@@ -19,6 +20,8 @@ struct intel_gt;
  GEN8_GT_PM_IRQ | \
  GEN8_GT_GUC_IRQ)
 
+void gen2_engine_cs_irq(struct intel_engine_cs *engine);
+
 void gen11_gt_irq_reset(struct intel_gt *gt);
 void gen11_gt_irq_postinstall(struct intel_gt *gt);
 void gen11_gt_irq_handler(struct intel_gt *gt, const u32 master_ctl);
diff --git a/drivers/gpu/drm/i915/gt/intel_rps.c 
b/drivers/gpu/drm/i915/gt/intel_rps.c
index 97ba14ad52e4..49910425e986 100644
--- a/drivers/gpu/drm/i915/gt/intel_rps.c
+++ b/drivers/gpu/drm/i915/gt/intel_rps.c
@@ -1741,7 +1741,7 @@ void gen6_rps_irq_handler(struct intel_rps *rps, u32 
pm_iir)
return;
 
if (pm_iir & PM_VEBOX_USER_INTERRUPT)
-   intel_engine_signal_breadcrumbs(gt->engine[VECS0]);
+   gen2_engine_cs_irq(gt->engine[VECS0]);
 
if (pm_iir & PM_VEBOX_CS_ERROR_INTERRUPT)
DRM_DEBUG("Command parser error, pm_iir 0x%08x\n", pm_iir);
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 1fa67700d8f4..27a0b3b89ddf 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -3734,7 +3734,7 @@ static irqreturn_t i8xx_irq_handler(int irq, void *arg)
intel_uncore_write16(&dev_priv->uncore, GEN2_IIR, iir);
 
if (iir & I915_USER_INTERRUPT)
-   
intel_engine_signal_breadcrumbs(dev_priv->gt.engine[RCS0]);
+   gen2_engine_cs_irq(dev_priv->gt.engine[RCS0]);
 
if (iir & I915_MASTER_ERROR_INTERRUPT)
i8xx_error_irq_handler(dev_priv, eir, eir_stuck);
@@ -3839,7 +3839,7 @@ static irqreturn_t i915_irq_handler(int irq, void *arg)
I915

[Intel-gfx] [PATCH 22/66] drm/i915/gem: Bind the fence async for execbuf

2020-07-15 Thread Chris Wilson
It is illegal to wait on another vma while holding the vm->mutex, as
that easily leads to ABBA deadlocks (we wait on a second vma that waits
on us to release the vm->mutex). So while the vm->mutex exists, move the
waiting outside of the lock into the async binding pipeline.

Signed-off-by: Chris Wilson 
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c|  21 +--
 drivers/gpu/drm/i915/gt/intel_ggtt_fencing.c  | 137 +-
 drivers/gpu/drm/i915/gt/intel_ggtt_fencing.h  |   5 +
 3 files changed, 151 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c 
b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index bdcbb82bfc3d..af2b4aeb6df0 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -1056,15 +1056,6 @@ static int eb_reserve_vma(struct eb_vm_work *work, 
struct eb_bind_vma *bind)
return err;
 
 pin:
-   if (unlikely(exec_flags & EXEC_OBJECT_NEEDS_FENCE)) {
-   err = __i915_vma_pin_fence(vma); /* XXX no waiting */
-   if (unlikely(err))
-   return err;
-
-   if (vma->fence)
-   bind->ev->flags |= __EXEC_OBJECT_HAS_FENCE;
-   }
-
bind_flags &= ~atomic_read(&vma->flags);
if (bind_flags) {
err = set_bind_fence(vma, work);
@@ -1095,6 +1086,15 @@ static int eb_reserve_vma(struct eb_vm_work *work, 
struct eb_bind_vma *bind)
bind->ev->flags |= __EXEC_OBJECT_HAS_PIN;
GEM_BUG_ON(eb_vma_misplaced(entry, vma, bind->ev->flags));
 
+   if (unlikely(exec_flags & EXEC_OBJECT_NEEDS_FENCE)) {
+   err = __i915_vma_pin_fence_async(vma, &work->base);
+   if (unlikely(err))
+   return err;
+
+   if (vma->fence)
+   bind->ev->flags |= __EXEC_OBJECT_HAS_FENCE;
+   }
+
return 0;
 }
 
@@ -1160,6 +1160,9 @@ static void __eb_bind_vma(struct eb_vm_work *work)
struct eb_bind_vma *bind = &work->bind[n];
struct i915_vma *vma = bind->ev->vma;
 
+   if (bind->ev->flags & __EXEC_OBJECT_HAS_FENCE)
+   __i915_vma_apply_fence_async(vma);
+
if (!bind->bind_flags)
goto put;
 
diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt_fencing.c 
b/drivers/gpu/drm/i915/gt/intel_ggtt_fencing.c
index 7fb36b12fe7a..734b6aa61809 100644
--- a/drivers/gpu/drm/i915/gt/intel_ggtt_fencing.c
+++ b/drivers/gpu/drm/i915/gt/intel_ggtt_fencing.c
@@ -21,10 +21,13 @@
  * IN THE SOFTWARE.
  */
 
+#include "i915_active.h"
 #include "i915_drv.h"
 #include "i915_scatterlist.h"
+#include "i915_sw_fence_work.h"
 #include "i915_pvinfo.h"
 #include "i915_vgpu.h"
+#include "i915_vma.h"
 
 /**
  * DOC: fence register handling
@@ -340,19 +343,37 @@ static struct i915_fence_reg *fence_find(struct i915_ggtt 
*ggtt)
return ERR_PTR(-EDEADLK);
 }
 
+static int fence_wait_bind(struct i915_fence_reg *reg)
+{
+   struct dma_fence *fence;
+   int err = 0;
+
+   fence = i915_active_fence_get(®->active.excl);
+   if (fence) {
+   err = dma_fence_wait(fence, true);
+   dma_fence_put(fence);
+   }
+
+   return err;
+}
+
 int __i915_vma_pin_fence(struct i915_vma *vma)
 {
struct i915_ggtt *ggtt = i915_vm_to_ggtt(vma->vm);
-   struct i915_fence_reg *fence;
+   struct i915_fence_reg *fence = vma->fence;
struct i915_vma *set = i915_gem_object_is_tiled(vma->obj) ? vma : NULL;
int err;
 
lockdep_assert_held(&vma->vm->mutex);
 
/* Just update our place in the LRU if our fence is getting reused. */
-   if (vma->fence) {
-   fence = vma->fence;
+   if (fence) {
GEM_BUG_ON(fence->vma != vma);
+
+   err = fence_wait_bind(fence);
+   if (err)
+   return err;
+
atomic_inc(&fence->pin_count);
if (!fence->dirty) {
list_move_tail(&fence->link, &ggtt->fence_list);
@@ -384,6 +405,116 @@ int __i915_vma_pin_fence(struct i915_vma *vma)
return err;
 }
 
+static int set_bind_fence(struct i915_fence_reg *fence,
+ struct dma_fence_work *work)
+{
+   struct dma_fence *prev;
+   int err;
+
+   if (rcu_access_pointer(fence->active.excl.fence) == &work->dma)
+   return 0;
+
+   err = i915_sw_fence_await_active(&work->chain,
+&fence->active,
+I915_ACTIVE_AWAIT_ACTIVE);
+   if (err)
+   return err;
+
+   if (i915_active_acquire(&fence->active))
+   return -ENOENT;
+
+   prev = i915_active_set_exclusive(&fence->active, &work->dma);
+   if (unlikely(prev)) {
+   err = i915_sw_fence_await_dma_fence(&work->chain, prev, 0,
+   

[Intel-gfx] [PATCH 58/66] drm/i915: Move saturated workload detection to the GT

2020-07-15 Thread Chris Wilson
When we introduced the saturated workload detection to tell us to back
off from semaphore usage [semaphores have a noticeable impact on
contended bus cycles with the CPU for some heavy workloads], we first
introduced it as a per-context tracker. This allows individual contexts
to try and optimise their own usage, but we found that with the local
tracking and the no-semaphore boosting, the first context to disable
semaphores got a massive priority boost and so would starve the rest and
all new contexts (as they started with semaphores enabled and lower
priority). Hence we moved the saturated workload detection to the
engine, and as a consequence had to disable semaphores on virtual engines.

Now that we do not have semaphore priority boosting, we can move the
tracking to the GT and virtual engines can now utilise the faster
inter-engine synchronisation, while maintaining the global information
to back off on saturation.

References: 44d89409a12e ("drm/i915: Make the semaphore saturation mask global")
Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gt/intel_engine_pm.c|  2 +-
 drivers/gpu/drm/i915/gt/intel_engine_types.h |  2 --
 drivers/gpu/drm/i915/gt/intel_gt_types.h |  2 ++
 drivers/gpu/drm/i915/gt/intel_lrc.c  | 15 ---
 drivers/gpu/drm/i915/i915_request.c  | 13 -
 5 files changed, 11 insertions(+), 23 deletions(-)
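
In effect the tracking becomes one shared bitmask on the GT, which every
sibling of a virtual engine feeds into and consults. A toy sketch of that
decision (helper names invented, plain assignments standing in for set_bit):

#include <stdio.h>

/* One bit per physical engine, shared by the whole GT (sketch only). */
static unsigned long gt_saturated;

static void note_late_semaphore(unsigned int engine_id)
{
        gt_saturated |= 1ul << engine_id;       /* set_bit() in the real code */
}

static int use_semaphore(unsigned int engine_id)
{
        /* Back off from semaphores once this engine has been seen saturated. */
        return !(gt_saturated & (1ul << engine_id));
}

int main(void)
{
        note_late_semaphore(2);                 /* e.g. a semaphore signalled too late */
        printf("engine 0 semaphores: %d\n", use_semaphore(0));
        printf("engine 2 semaphores: %d\n", use_semaphore(2));
        return 0;
}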

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c 
b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
index a95099b7b759..630c0cf8cffd 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
@@ -231,7 +231,7 @@ static int __engine_park(struct intel_wakeref *wf)
struct intel_engine_cs *engine =
container_of(wf, typeof(*engine), wakeref);
 
-   engine->saturated = 0;
+   clear_bit(engine->id, &engine->gt->saturated);
 
/*
 * If one and only one request is completed between pm events,
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h 
b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index af6f1154200a..8c502cf34de7 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -324,8 +324,6 @@ struct intel_engine_cs {
 
struct intel_context *kernel_context; /* pinned */
 
-   intel_engine_mask_t saturated; /* submitting semaphores too late? */
-
struct {
struct delayed_work work;
struct i915_request *systole;
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_types.h 
b/drivers/gpu/drm/i915/gt/intel_gt_types.h
index 6d39a4a11bf3..6e7719082add 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt_types.h
@@ -74,6 +74,8 @@ struct intel_gt {
 */
intel_wakeref_t awake;
 
+   unsigned long saturated; /* submitting semaphores too late? */
+
u32 clock_frequency;
 
struct intel_llc llc;
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c 
b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 6054695611ad..a22d24a5696e 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -5541,21 +5541,6 @@ intel_execlists_create_virtual(struct intel_engine_cs 
**siblings,
ve->base.instance = I915_ENGINE_CLASS_INVALID_VIRTUAL;
ve->base.uabi_instance = I915_ENGINE_CLASS_INVALID_VIRTUAL;
 
-   /*
-* The decision on whether to submit a request using semaphores
-* depends on the saturated state of the engine. We only compute
-* this during HW submission of the request, and we need for this
-* state to be globally applied to all requests being submitted
-* to this engine. Virtual engines encompass more than one physical
-* engine and so we cannot accurately tell in advance if one of those
-* engines is already saturated and so cannot afford to use a semaphore
-* and be pessimized in priority for doing so -- if we are the only
-* context using semaphores after all other clients have stopped, we
-* will be starved on the saturated system. Such a global switch for
-* semaphores is less than ideal, but alas is the current compromise.
-*/
-   ve->base.saturated = ALL_ENGINES;
-
snprintf(ve->base.name, sizeof(ve->base.name), "virtual");
 
intel_engine_init_active(&ve->base, ENGINE_VIRTUAL);
diff --git a/drivers/gpu/drm/i915/i915_request.c 
b/drivers/gpu/drm/i915/i915_request.c
index a90e90e96c19..e4bafd90432b 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -535,7 +535,7 @@ bool __i915_request_submit(struct i915_request *request)
 */
if (request->sched.semaphores &&
i915_sw_fence_signaled(&request->semaphore))
-   engine->saturated |= request->sched.semaphores;
+   set_bit(engine->id, &engine->gt->saturated);
 
engine->emit_fini_breadcrumb(request,
   

[Intel-gfx] [PATCH 12/66] drm/i915: Switch to object allocations for page directories

2020-07-15 Thread Chris Wilson
The GEM object is grossly overweight for the practicality of tracking
large numbers of individual pages, yet it is currently our only
abstraction for tracking DMA allocations. Since those allocations need
to be reserved upfront before an operation, and since we need to break
away from simple system memory, we need to ditch using plain struct page
wrappers.

In the process, we drop the WC mapping as we ended up clflushing
everything anyway due to various issues across a wider range of
platforms. Though in a future step, we need to drop the kmap_atomic
approach which suggests we need to pre-map all the pages and keep them
mapped.

v2: Verify our large scratch page is suitably DMA aligned; and manually
clear the scratch since we are allocating random struct pages.

Signed-off-by: Chris Wilson 
Cc: Matthew Auld 
---
 .../gpu/drm/i915/gem/i915_gem_object_types.h  |   1 +
 .../gpu/drm/i915/gem/selftests/huge_pages.c   |   2 +-
 .../drm/i915/gem/selftests/i915_gem_context.c |   2 +-
 drivers/gpu/drm/i915/gt/gen6_ppgtt.c  |  53 +--
 drivers/gpu/drm/i915/gt/gen6_ppgtt.h  |   1 +
 drivers/gpu/drm/i915/gt/gen8_ppgtt.c  |  89 ++---
 drivers/gpu/drm/i915/gt/intel_ggtt.c  |  37 ++-
 drivers/gpu/drm/i915/gt/intel_gtt.c   | 303 --
 drivers/gpu/drm/i915/gt/intel_gtt.h   |  94 ++
 drivers/gpu/drm/i915/gt/intel_ppgtt.c |  42 ++-
 .../gpu/drm/i915/gt/intel_ring_submission.c   |  16 +-
 drivers/gpu/drm/i915/gvt/scheduler.c  |  17 +-
 drivers/gpu/drm/i915/i915_drv.c   |   1 +
 drivers/gpu/drm/i915/i915_drv.h   |   5 -
 drivers/gpu/drm/i915/i915_vma.c   |  18 +-
 drivers/gpu/drm/i915/selftests/i915_gem_gtt.c |  23 ++
 drivers/gpu/drm/i915/selftests/i915_perf.c|   4 +-
 drivers/gpu/drm/i915/selftests/mock_gtt.c |   4 +
 18 files changed, 289 insertions(+), 423 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h 
b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
index 5335f799b548..d0847d7896f9 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object_types.h
@@ -282,6 +282,7 @@ struct drm_i915_gem_object {
} userptr;
 
unsigned long scratch;
+   u64 encode;
 
void *gvt_info;
};
diff --git a/drivers/gpu/drm/i915/gem/selftests/huge_pages.c 
b/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
index 8291ede6902c..e2f3d014acb2 100644
--- a/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
+++ b/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
@@ -393,7 +393,7 @@ static int igt_mock_exhaust_device_supported_pages(void 
*arg)
 */
 
for (i = 1; i < BIT(ARRAY_SIZE(page_sizes)); i++) {
-   unsigned int combination = 0;
+   unsigned int combination = SZ_4K; /* Required for ppGTT */
 
for (j = 0; j < ARRAY_SIZE(page_sizes); j++) {
if (i & BIT(j))
diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c 
b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
index 7ffc3c751432..d176b015353f 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
@@ -1748,7 +1748,7 @@ static int check_scratch_page(struct i915_gem_context 
*ctx, u32 *out)
if (!vm)
return -ENODEV;
 
-   page = vm->scratch[0].base.page;
+   page = __px_page(vm->scratch[0]);
if (!page) {
pr_err("No scratch page!\n");
return -EINVAL;
diff --git a/drivers/gpu/drm/i915/gt/gen6_ppgtt.c 
b/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
index ee2e149454cb..a823d2e3c39c 100644
--- a/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
+++ b/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
@@ -16,8 +16,10 @@ static inline void gen6_write_pde(const struct gen6_ppgtt 
*ppgtt,
  const unsigned int pde,
  const struct i915_page_table *pt)
 {
+   dma_addr_t addr = pt ? px_dma(pt) : px_dma(ppgtt->base.vm.scratch[1]);
+
/* Caller needs to make sure the write completes if necessary */
-   iowrite32(GEN6_PDE_ADDR_ENCODE(px_dma(pt)) | GEN6_PDE_VALID,
+   iowrite32(GEN6_PDE_ADDR_ENCODE(addr) | GEN6_PDE_VALID,
  ppgtt->pd_addr + pde);
 }
 
@@ -79,7 +81,7 @@ static void gen6_ppgtt_clear_range(struct i915_address_space 
*vm,
 {
struct gen6_ppgtt * const ppgtt = to_gen6_ppgtt(i915_vm_to_ppgtt(vm));
const unsigned int first_entry = start / I915_GTT_PAGE_SIZE;
-   const gen6_pte_t scratch_pte = vm->scratch[0].encode;
+   const gen6_pte_t scratch_pte = vm->scratch[0]->encode;
unsigned int pde = first_entry / GEN6_PTES;
unsigned int pte = first_entry % GEN6_PTES;
unsigned int num_entries = length / I915_GTT_PAGE_SIZE;
@@ -90,8 +92,6 @@ static void gen6_ppgtt_clear_range(struct i915_address_space 
*vm,
   

[Intel-gfx] [PATCH 44/66] drm/i915/gt: Drop atomic for engine->fw_active tracking

2020-07-15 Thread Chris Wilson
Since schedule-in/out is now entirely serialised by the tasklet bitlock,
we do not need to worry about concurrent in/out operations and so reduce
the atomic operations to plain instructions.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c| 2 +-
 drivers/gpu/drm/i915/gt/intel_engine_types.h | 2 +-
 drivers/gpu/drm/i915/gt/intel_lrc.c  | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index c10521fdbbe4..10997cae5e41 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -1615,7 +1615,7 @@ void intel_engine_dump(struct intel_engine_cs *engine,
   ktime_to_ms(intel_engine_get_busy_time(engine,
  &dummy)));
drm_printf(m, "\tForcewake: %x domains, %d active\n",
-  engine->fw_domain, atomic_read(&engine->fw_active));
+  engine->fw_domain, READ_ONCE(engine->fw_active));
 
rcu_read_lock();
rq = READ_ONCE(engine->heartbeat.systole);
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h 
b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index 36981ba1db75..f86efafd385f 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -327,7 +327,7 @@ struct intel_engine_cs {
 * as possible.
 */
enum forcewake_domains fw_domain;
-   atomic_t fw_active;
+   unsigned int fw_active;
 
unsigned long context_tag;
 
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c 
b/drivers/gpu/drm/i915/gt/intel_lrc.c
index a59332f28cd3..72b343242251 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1341,7 +1341,7 @@ __execlists_schedule_in(struct i915_request *rq)
ce->lrc.ccid |= engine->execlists.ccid;
 
__intel_gt_pm_get(engine->gt);
-   if (engine->fw_domain && !atomic_fetch_inc(&engine->fw_active))
+   if (engine->fw_domain && !engine->fw_active++)
intel_uncore_forcewake_get(engine->uncore, engine->fw_domain);
execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_IN);
intel_engine_context_in(engine);
@@ -1434,7 +1434,7 @@ static inline void __execlists_schedule_out(struct 
i915_request *rq)
intel_context_update_runtime(ce);
intel_engine_context_out(engine);
execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_OUT);
-   if (engine->fw_domain && !atomic_dec_return(&engine->fw_active))
+   if (engine->fw_domain && !--engine->fw_active)
intel_uncore_forcewake_put(engine->uncore, engine->fw_domain);
intel_gt_pm_put_async(engine->gt);
 
-- 
2.20.1


[Intel-gfx] [PATCH 01/66] drm/i915: Reduce i915_request.lock contention for i915_request_wait

2020-07-15 Thread Chris Wilson
Currently, we use i915_request_completed() directly in
i915_request_wait() and follow up with a manual invocation of
dma_fence_signal(). This appears to cause a large number of contentions
on i915_request.lock as when the process is woken up after the fence is
signaled by an interrupt, we will then try and call dma_fence_signal()
ourselves while the signaler is still holding the lock.
dma_fence_is_signaled() has the benefit of checking the
DMA_FENCE_FLAG_SIGNALED_BIT prior to calling dma_fence_signal() and so
avoids most of that contention.

Signed-off-by: Chris Wilson 
Cc: Matthew Auld 
Cc: Tvrtko Ursulin 
---
 drivers/gpu/drm/i915/i915_request.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)
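
The saving comes from checking a signalled flag before touching the contended
lock at all. The shape of that fast path, shrunk to a userspace toy (a plain
mutex and an atomic flag stand in for fence->lock and the signalled bit):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_fence {
        atomic_bool signalled;          /* stands in for the signalled bit */
        pthread_mutex_t lock;           /* stands in for fence->lock */
};

static void toy_fence_signal(struct toy_fence *f)
{
        pthread_mutex_lock(&f->lock);   /* contended path: signaler holds the lock */
        atomic_store(&f->signalled, true);
        /* ...run callbacks, wake waiters... */
        pthread_mutex_unlock(&f->lock);
}

static bool toy_fence_is_signalled(struct toy_fence *f)
{
        /* Fast path: no lock taken if the flag is already set. */
        return atomic_load(&f->signalled);
}

int main(void)
{
        struct toy_fence f = { .lock = PTHREAD_MUTEX_INITIALIZER };

        toy_fence_signal(&f);
        printf("signalled: %d\n", toy_fence_is_signalled(&f));
        return 0;
}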

diff --git a/drivers/gpu/drm/i915/i915_request.c 
b/drivers/gpu/drm/i915/i915_request.c
index 0b2fe55e6194..bb4eb1a8780e 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1640,7 +1640,7 @@ static bool busywait_stop(unsigned long timeout, unsigned 
int cpu)
return this_cpu != cpu;
 }
 
-static bool __i915_spin_request(const struct i915_request * const rq, int 
state)
+static bool __i915_spin_request(struct i915_request * const rq, int state)
 {
unsigned long timeout_ns;
unsigned int cpu;
@@ -1673,7 +1673,7 @@ static bool __i915_spin_request(const struct i915_request 
* const rq, int state)
timeout_ns = READ_ONCE(rq->engine->props.max_busywait_duration_ns);
timeout_ns += local_clock_ns(&cpu);
do {
-   if (i915_request_completed(rq))
+   if (dma_fence_is_signaled(&rq->fence))
return true;
 
if (signal_pending_state(state, current))
@@ -1766,10 +1766,8 @@ long i915_request_wait(struct i915_request *rq,
 * duration, which we currently lack.
 */
if (IS_ACTIVE(CONFIG_DRM_I915_MAX_REQUEST_BUSYWAIT) &&
-   __i915_spin_request(rq, state)) {
-   dma_fence_signal(&rq->fence);
+   __i915_spin_request(rq, state))
goto out;
-   }
 
/*
 * This client is about to stall waiting for the GPU. In many cases
@@ -1796,10 +1794,8 @@ long i915_request_wait(struct i915_request *rq,
for (;;) {
set_current_state(state);
 
-   if (i915_request_completed(rq)) {
-   dma_fence_signal(&rq->fence);
+   if (dma_fence_is_signaled(&rq->fence))
break;
-   }
 
intel_engine_flush_submission(rq->engine);
 
-- 
2.20.1


[Intel-gfx] [PATCH 11/66] drm/i915: Preallocate stashes for vma page-directories

2020-07-15 Thread Chris Wilson
We need to make the DMA allocations used for page directories up front
so that we can include those allocations in our
memory reservation pass. The downside is that we have to assume the
worst case, even before we know the final layout, and always allocate
enough page directories for this object, even when there will be overlap.
This unfortunately can be quite expensive, especially as we have to
clear/reset the page directories and DMA pages, but it should only be
required during early phases of a workload when new objects are being
discovered, or after memory/eviction pressure when we need to rebind.
Once we reach steady state, the objects should not be moved and we no
longer need to preallocate the page tables.

It should be noted that the lifetime for the page directories DMA is
more or less decoupled from individual fences as they will be shared
across objects across timelines.

v2: Only allocate enough PD space for the PTEs we may use; we do not need
to allocate PDs that will be left as scratch.
v3: Store the shift into the first PD level to encapsulate the different
PTE counts for gen6/gen8.

Signed-off-by: Chris Wilson 
Cc: Matthew Auld 
---
 .../gpu/drm/i915/gem/i915_gem_client_blt.c| 11 +--
 drivers/gpu/drm/i915/gt/gen6_ppgtt.c  | 40 -
 drivers/gpu/drm/i915/gt/gen8_ppgtt.c  | 78 +
 drivers/gpu/drm/i915/gt/intel_ggtt.c  | 60 ++
 drivers/gpu/drm/i915/gt/intel_gtt.h   | 46 ++
 drivers/gpu/drm/i915/gt/intel_ppgtt.c | 83 ---
 drivers/gpu/drm/i915/i915_vma.c   | 27 +++---
 drivers/gpu/drm/i915/selftests/i915_gem_gtt.c | 60 --
 drivers/gpu/drm/i915/selftests/mock_gtt.c | 22 ++---
 9 files changed, 237 insertions(+), 190 deletions(-)
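
A rough model of the worst-case sizing this implies, assuming 4KiB pages and
512 entries per level as on gen8 (illustrative numbers and helper name only):
for each level of the tree, count how many tables the range can touch and
stash that many.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT      12      /* 4KiB pages */
#define LEVEL_BITS      9       /* 512 entries per level */

static uint64_t tables_for_level(uint64_t start, uint64_t length,
                                 unsigned int shift)
{
        uint64_t first = start >> shift;
        uint64_t last = (start + length - 1) >> shift;

        return last - first + 1;
}

int main(void)
{
        uint64_t start = 0x123000, length = 64ull << 20; /* arbitrary 64MiB range */
        unsigned int shift = PAGE_SHIFT + LEVEL_BITS;    /* one PT covers 2MiB */
        unsigned int lvl;
        uint64_t total = 0;

        /* 3 levels of tables below the fixed top level of a 4-level ppGTT. */
        for (lvl = 0; lvl < 3; lvl++, shift += LEVEL_BITS) {
                uint64_t n = tables_for_level(start, length, shift);

                printf("level %u: %llu tables\n", lvl, (unsigned long long)n);
                total += n;
        }
        printf("worst-case stash: %llu page-table objects\n",
               (unsigned long long)total);
        return 0;
}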

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_client_blt.c 
b/drivers/gpu/drm/i915/gem/i915_gem_client_blt.c
index 278664f831e7..947c8aa8e13e 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_client_blt.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_client_blt.c
@@ -32,12 +32,13 @@ static void vma_clear_pages(struct i915_vma *vma)
vma->pages = NULL;
 }
 
-static int vma_bind(struct i915_address_space *vm,
-   struct i915_vma *vma,
-   enum i915_cache_level cache_level,
-   u32 flags)
+static void vma_bind(struct i915_address_space *vm,
+struct i915_vm_pt_stash *stash,
+struct i915_vma *vma,
+enum i915_cache_level cache_level,
+u32 flags)
 {
-   return vm->vma_ops.bind_vma(vm, vma, cache_level, flags);
+   vm->vma_ops.bind_vma(vm, stash, vma, cache_level, flags);
 }
 
 static void vma_unbind(struct i915_address_space *vm, struct i915_vma *vma)
diff --git a/drivers/gpu/drm/i915/gt/gen6_ppgtt.c 
b/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
index cdc0b9c54305..ee2e149454cb 100644
--- a/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
+++ b/drivers/gpu/drm/i915/gt/gen6_ppgtt.c
@@ -177,16 +177,16 @@ static void gen6_flush_pd(struct gen6_ppgtt *ppgtt, u64 
start, u64 end)
mutex_unlock(&ppgtt->flush);
 }
 
-static int gen6_alloc_va_range(struct i915_address_space *vm,
-  u64 start, u64 length)
+static void gen6_alloc_va_range(struct i915_address_space *vm,
+   struct i915_vm_pt_stash *stash,
+   u64 start, u64 length)
 {
struct gen6_ppgtt *ppgtt = to_gen6_ppgtt(i915_vm_to_ppgtt(vm));
struct i915_page_directory * const pd = ppgtt->base.pd;
-   struct i915_page_table *pt, *alloc = NULL;
+   struct i915_page_table *pt;
bool flush = false;
u64 from = start;
unsigned int pde;
-   int ret = 0;
 
spin_lock(&pd->lock);
gen6_for_each_pde(pt, pd, start, length, pde) {
@@ -195,21 +195,17 @@ static int gen6_alloc_va_range(struct i915_address_space 
*vm,
if (px_base(pt) == px_base(&vm->scratch[1])) {
spin_unlock(&pd->lock);
 
-   pt = fetch_and_zero(&alloc);
-   if (!pt)
-   pt = alloc_pt(vm);
-   if (IS_ERR(pt)) {
-   ret = PTR_ERR(pt);
-   goto unwind_out;
-   }
+   pt = stash->pt[0];
+   GEM_BUG_ON(!pt);
 
fill32_px(pt, vm->scratch[0].encode);
 
spin_lock(&pd->lock);
if (pd->entry[pde] == &vm->scratch[1]) {
+   stash->pt[0] = pt->stash;
+   atomic_set(&pt->used, 0);
pd->entry[pde] = pt;
} else {
-   alloc = pt;
pt = pd->entry[pde];
}
 
@@ -226,15 +222,6 @@ static int gen

[Intel-gfx] [PATCH 29/66] drm/i915: Hold wakeref for the duration of the vma GGTT binding

2020-07-15 Thread Chris Wilson
Now that we have pushed the binding itself outside of the vm->mutex, we
are clear of the potential wakeref inversions and can take the wakeref
around the actual duration of the HW interaction.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gt/intel_ggtt.c | 39 
 drivers/gpu/drm/i915/i915_vma.c  |  6 -
 2 files changed, 22 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c 
b/drivers/gpu/drm/i915/gt/intel_ggtt.c
index 59a4a3ab6bfd..a78ae2733fd6 100644
--- a/drivers/gpu/drm/i915/gt/intel_ggtt.c
+++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c
@@ -434,27 +434,39 @@ static void i915_ggtt_clear_range(struct 
i915_address_space *vm,
intel_gtt_clear_range(start >> PAGE_SHIFT, length >> PAGE_SHIFT);
 }
 
-static void ggtt_bind_vma(struct i915_address_space *vm,
- struct i915_vm_pt_stash *stash,
- struct i915_vma *vma,
- enum i915_cache_level cache_level,
- u32 flags)
+static void __ggtt_bind_vma(struct i915_address_space *vm,
+   struct i915_vm_pt_stash *stash,
+   struct i915_vma *vma,
+   enum i915_cache_level cache_level,
+   u32 flags)
 {
struct drm_i915_gem_object *obj = vma->obj;
+   intel_wakeref_t wakeref;
u32 pte_flags;
 
-   if (i915_vma_is_bound(vma, ~flags & I915_VMA_BIND_MASK))
-   return;
-
/* Applicable to VLV (gen8+ do not support RO in the GGTT) */
pte_flags = 0;
if (i915_gem_object_is_readonly(obj))
pte_flags |= PTE_READ_ONLY;
 
-   vm->insert_entries(vm, vma, cache_level, pte_flags);
+   with_intel_runtime_pm(vm->gt->uncore->rpm, wakeref)
+   vm->insert_entries(vm, vma, cache_level, pte_flags);
+
vma->page_sizes.gtt = I915_GTT_PAGE_SIZE;
 }
 
+static void ggtt_bind_vma(struct i915_address_space *vm,
+ struct i915_vm_pt_stash *stash,
+ struct i915_vma *vma,
+ enum i915_cache_level cache_level,
+ u32 flags)
+{
+   if (i915_vma_is_bound(vma, ~flags & I915_VMA_BIND_MASK))
+   return;
+
+   __ggtt_bind_vma(vm, stash, vma, cache_level, flags);
+}
+
 static void ggtt_unbind_vma(struct i915_address_space *vm, struct i915_vma 
*vma)
 {
vm->clear_range(vm, vma->node.start, vma->size);
@@ -571,19 +583,12 @@ static void aliasing_gtt_bind_vma(struct 
i915_address_space *vm,
  enum i915_cache_level cache_level,
  u32 flags)
 {
-   u32 pte_flags;
-
-   /* Currently applicable only to VLV */
-   pte_flags = 0;
-   if (i915_gem_object_is_readonly(vma->obj))
-   pte_flags |= PTE_READ_ONLY;
-
if (flags & I915_VMA_LOCAL_BIND)
ppgtt_bind_vma(&i915_vm_to_ggtt(vm)->alias->vm,
   stash, vma, cache_level, flags);
 
if (flags & I915_VMA_GLOBAL_BIND)
-   vm->insert_entries(vm, vma, cache_level, pte_flags);
+   __ggtt_bind_vma(vm, stash, vma, cache_level, flags);
 }
 
 static void aliasing_gtt_unbind_vma(struct i915_address_space *vm,
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index 633f335ce892..e584a3355911 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -820,7 +820,6 @@ static int __wait_for_unbind(struct i915_vma *vma, unsigned 
int flags)
 int i915_vma_pin(struct i915_vma *vma, u64 size, u64 alignment, u64 flags)
 {
struct i915_vma_work *work = NULL;
-   intel_wakeref_t wakeref = 0;
unsigned int bound;
int err;
 
@@ -839,9 +838,6 @@ int i915_vma_pin(struct i915_vma *vma, u64 size, u64 
alignment, u64 flags)
return err;
}
 
-   if (flags & PIN_GLOBAL)
-   wakeref = intel_runtime_pm_get(&vma->vm->i915->runtime_pm);
-
err = __wait_for_unbind(vma, flags);
if (err)
goto err_rpm;
@@ -951,8 +947,6 @@ int i915_vma_pin(struct i915_vma *vma, u64 size, u64 
alignment, u64 flags)
 err_fence:
dma_fence_work_commit_imm(&work->base);
 err_rpm:
-   if (wakeref)
-   intel_runtime_pm_put(&vma->vm->i915->runtime_pm, wakeref);
if (vma->obj)
i915_gem_object_unpin_pages(vma->obj);
return err;
-- 
2.20.1


[Intel-gfx] [PATCH 28/66] drm/i915/gem: Replace i915_gem_object.mm.mutex with reservation_ww_class

2020-07-15 Thread Chris Wilson
Our goal is to pull all memory reservations (next iteration
obj->ops->get_pages()) under a ww_mutex, and to align those reservations
with other drivers, i.e. control all such allocations with the
reservation_ww_class. Currently, this is under the purview of the
obj->mm.mutex, and while obj->mm remains an embedded struct we can
"simply" switch to using the reservation_ww_class obj->base.resv->lock

The major consequence is the impact on the shrinker paths as the
reservation_ww_class is used to wrap allocations, and a ww_mutex does
not support subclassing so we cannot do our usual trick of knowing that
we never recurse inside the shrinker and instead have to finish the
reclaim with a trylock. This may result in us failing to release the
pages after having released the vma. This will have to do until a better
idea comes along.
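
To make the shrinker consequence concrete, reclaim ends up with roughly
this shape (an illustrative sketch only, not part of this patch;
evict_object_pages() is a stand-in for whatever the real reclaim step
becomes):

static bool try_reclaim_object(struct drm_i915_gem_object *obj)
{
	bool released = false;

	/*
	 * Reclaim may run underneath an allocation that already holds
	 * obj->base.resv->lock, and ww_mutex has no subclasses with which
	 * to annotate away that recursion, so never block here: trylock
	 * and simply skip busy objects.
	 */
	if (!dma_resv_trylock(obj->base.resv))
		return false;

	/* evict_object_pages() is hypothetical; release the backing store */
	released = evict_object_pages(obj);

	dma_resv_unlock(obj->base.resv);
	return released;
}

The cost, as noted, is that a busy object is skipped rather than waited
for, so we may fail to release its pages even after the vma is gone.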

However, this step only converts the mutex over and continues to treat
everything as a single allocation and pinning the pages. With the
ww_mutex in place we can remove the temporary pinning, as we can then
reserve all storage en masse.

One last thing to do: kill the implicit page pinning for active vma.
This will require us to invalidate the vma->pages when the backing store
is removed (and we expect that while the vma is active, we mark the
backing store as active so that it cannot be removed while the HW is
busy.)

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gem/i915_gem_clflush.c   |  20 +-
 drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c|  18 +-
 drivers/gpu/drm/i915/gem/i915_gem_domain.c|  65 ++
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c|  40 +++-
 drivers/gpu/drm/i915/gem/i915_gem_object.c|   8 +-
 drivers/gpu/drm/i915/gem/i915_gem_object.h|  37 +--
 .../gpu/drm/i915/gem/i915_gem_object_types.h  |   1 -
 drivers/gpu/drm/i915/gem/i915_gem_pages.c | 134 ++-
 drivers/gpu/drm/i915/gem/i915_gem_phys.c  |   8 +-
 drivers/gpu/drm/i915/gem/i915_gem_shrinker.c  |  13 +-
 drivers/gpu/drm/i915/gem/i915_gem_tiling.c|   2 -
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c   |  15 +-
 .../gpu/drm/i915/gem/selftests/huge_pages.c   |  32 ++-
 .../i915/gem/selftests/i915_gem_coherency.c   |  14 +-
 .../drm/i915/gem/selftests/i915_gem_context.c |  10 +-
 .../drm/i915/gem/selftests/i915_gem_mman.c|   2 +
 drivers/gpu/drm/i915/gt/gen6_ppgtt.c  |   2 -
 drivers/gpu/drm/i915/gt/gen8_ppgtt.c  |   1 -
 drivers/gpu/drm/i915/gt/intel_ggtt.c  |   5 +-
 drivers/gpu/drm/i915/gt/intel_gtt.h   |   2 -
 drivers/gpu/drm/i915/gt/intel_ppgtt.c |   1 +
 drivers/gpu/drm/i915/i915_gem.c   |  16 +-
 drivers/gpu/drm/i915/i915_vma.c   | 217 +++---
 drivers/gpu/drm/i915/i915_vma_types.h |   6 -
 drivers/gpu/drm/i915/mm/i915_acquire_ctx.c|  12 +-
 drivers/gpu/drm/i915/selftests/i915_gem_gtt.c |   4 +-
 .../drm/i915/selftests/intel_memory_region.c  |  17 +-
 27 files changed, 313 insertions(+), 389 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_clflush.c 
b/drivers/gpu/drm/i915/gem/i915_gem_clflush.c
index bc0223716906..a32fd0d5570b 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_clflush.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_clflush.c
@@ -27,16 +27,8 @@ static void __do_clflush(struct drm_i915_gem_object *obj)
 static int clflush_work(struct dma_fence_work *base)
 {
struct clflush *clflush = container_of(base, typeof(*clflush), base);
-   struct drm_i915_gem_object *obj = clflush->obj;
-   int err;
-
-   err = i915_gem_object_pin_pages(obj);
-   if (err)
-   return err;
-
-   __do_clflush(obj);
-   i915_gem_object_unpin_pages(obj);
 
+   __do_clflush(clflush->obj);
return 0;
 }
 
@@ -44,7 +36,7 @@ static void clflush_release(struct dma_fence_work *base)
 {
struct clflush *clflush = container_of(base, typeof(*clflush), base);
 
-   i915_gem_object_put(clflush->obj);
+   i915_gem_object_unpin_pages(clflush->obj);
 }
 
 static const struct dma_fence_work_ops clflush_ops = {
@@ -63,8 +55,14 @@ static struct clflush *clflush_work_create(struct 
drm_i915_gem_object *obj)
if (!clflush)
return NULL;
 
+   if (__i915_gem_object_get_pages_locked(obj)) {
+   kfree(clflush);
+   return NULL;
+   }
+
dma_fence_work_init(&clflush->base, &clflush_ops);
-   clflush->obj = i915_gem_object_get(obj); /* obj <-> clflush cycle */
+   __i915_gem_object_pin_pages(obj);
+   clflush->obj = obj; /* Beware the obj.resv <-> clflush fence cycle */
 
return clflush;
 }
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c 
b/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c
index 2679380159fc..049a15e6b496 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c
@@ -124,19 +124,12 @@ static int i915_gem_begin_cpu_access(struct dma_buf 
*dma_buf, enum dma_data_dire
bool write = (direction == DMA_

Re: [Intel-gfx] [PATCH 19/25] drm/amdgpu: s/GFP_KERNEL/GFP_ATOMIC in scheduler code

2020-07-15 Thread Daniel Vetter
On Wed, Jul 15, 2020 at 11:17 AM Christian König
 wrote:
>
> Am 14.07.20 um 16:31 schrieb Daniel Vetter:
> > On Tue, Jul 14, 2020 at 01:40:11PM +0200, Christian König wrote:
> >> Am 14.07.20 um 12:49 schrieb Daniel Vetter:
> >>> On Tue, Jul 07, 2020 at 10:12:23PM +0200, Daniel Vetter wrote:
>  My dma-fence lockdep annotations caught an inversion because we
>  allocate memory where we really shouldn't:
> 
> kmem_cache_alloc+0x2b/0x6d0
> amdgpu_fence_emit+0x30/0x330 [amdgpu]
> amdgpu_ib_schedule+0x306/0x550 [amdgpu]
> amdgpu_job_run+0x10f/0x260 [amdgpu]
> drm_sched_main+0x1b9/0x490 [gpu_sched]
> kthread+0x12e/0x150
> 
>  Trouble right now is that lockdep only validates against GFP_FS, which
>  would be good enough for shrinkers. But for mmu_notifiers we actually
>  need !GFP_ATOMIC, since they can be called from any page laundering,
>  even if GFP_NOFS or GFP_NOIO are set.
> 
>  I guess we should improve the lockdep annotations for
>  fs_reclaim_acquire/release.
> 
>  Ofc real fix is to properly preallocate this fence and stuff it into
>  the amdgpu job structure. But GFP_ATOMIC gets the lockdep splat out of
>  the way.
> 
>  v2: Two more allocations in scheduler paths.
> 
>  First one:
> 
> __kmalloc+0x58/0x720
> amdgpu_vmid_grab+0x100/0xca0 [amdgpu]
> amdgpu_job_dependency+0xf9/0x120 [amdgpu]
> drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
> drm_sched_main+0xf9/0x490 [gpu_sched]
> 
>  Second one:
> 
> kmem_cache_alloc+0x2b/0x6d0
> amdgpu_sync_fence+0x7e/0x110 [amdgpu]
> amdgpu_vmid_grab+0x86b/0xca0 [amdgpu]
> amdgpu_job_dependency+0xf9/0x120 [amdgpu]
> drm_sched_entity_pop_job+0x3f/0x440 [gpu_sched]
> drm_sched_main+0xf9/0x490 [gpu_sched]
> 
>  Cc: linux-me...@vger.kernel.org
>  Cc: linaro-mm-...@lists.linaro.org
>  Cc: linux-r...@vger.kernel.org
>  Cc: amd-...@lists.freedesktop.org
>  Cc: intel-gfx@lists.freedesktop.org
>  Cc: Chris Wilson 
>  Cc: Maarten Lankhorst 
>  Cc: Christian König 
>  Signed-off-by: Daniel Vetter 
> >>> Has anyone from amd side started looking into how to fix this properly?
> >> Yeah I checked both and neither are any real problem.
> > I'm confused ... do you mean "no real problem fixing them" or "not
> > actually a real problem"?
>
> Both, at least the VMID stuff is trivial to avoid.
>
> And the fence allocation is extremely unlikely. E.g. when we allocate a
> new one we previously most likely just freed one already.

Yeah, I think the debugging case we can avoid: just stop debugging if
things get hung up like that. So a mempool for the hw fences should be
perfectly fine.

The vmid stuff I don't really understand enough, but the hw fence
stuff I think I grok, plus other scheduler users need that too from a
quick look. I might be tackling that one (maybe put the mempool
outright into drm_scheduler code as a helper), unless you have
patches already in the works. The vmid side I'll leave to you guys :-)
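
Just to sketch the shape I have in mind (illustrative only; the helper
names and the minimum pool size are made up):

#include <linux/dma-fence.h>
#include <linux/mempool.h>

/* Bounded by the number of submissions a ring can have in flight. */
#define DRM_SCHED_MIN_HW_FENCES 64

static mempool_t *hw_fence_pool;

static int drm_sched_fence_pool_init(void)
{
	/* A real driver would size this for its embedded fence struct. */
	hw_fence_pool = mempool_create_kmalloc_pool(DRM_SCHED_MIN_HW_FENCES,
						    sizeof(struct dma_fence));
	return hw_fence_pool ? 0 : -ENOMEM;
}

static struct dma_fence *drm_sched_hw_fence_alloc(void)
{
	/* Never recurse into reclaim from the scheduler kthread. */
	return mempool_alloc(hw_fence_pool, GFP_NOWAIT);
}

static void drm_sched_hw_fence_free(struct dma_fence *fence)
{
	/* Must be callable without any lock held around the alloc side. */
	mempool_free(fence, hw_fence_pool);
}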

-Daniel

>
> >
> >>> I looked a bit into fixing this with mempool, and the big guarantee we
> >>> need is that
> >>> - there's a hard upper limit on how many allocations we minimally need to
> >>> guarantee forward progress. And the entire vmid allocation and
> >>> amdgpu_sync_fence stuff kinda makes me question that's a valid
> >>> assumption.
> >> We do have hard upper limits for those.
> >>
> >> The VMID allocation could as well just return the fence instead of putting
> >> it into the sync object IIRC. So that just needs some cleanup and can avoid
> >> the allocation entirely.
> > Yeah embedding should be simplest solution of all.
> >
> >> The hardware fence is limited by the number of submissions we can have
> >> concurrently on the ring buffers, so also not a problem at all.
> > Ok that sounds good. Wrt releasing the memory again, is that also done
> > without any of the allocation-side locks held? I've seen some vmid manager
> > somewhere ...
>
> Well that's the issue. We can't guarantee that for the hardware fence
> memory since it could be that we hold another reference during debugging
> IIRC.
>
> Still looking if and how we could fix this. But as I said this problem
> is so extremely unlikely.
>
> Christian.
>
> > -Daniel
> >
> >> Regards,
> >> Christian.
> >>
> >>> - mempool_free must be called without any locks in the way which are held
> >>> while we call mempool_alloc. Otherwise we again have a nice deadlock
> >>> with no forward progress. I tried auditing that, but got lost in 
> >>> amdgpu
> >>> and scheduler code. Some lockdep annotations for mempool.c might help,
> >>> but they're not going to catch everything. Plus it would be again 
> >>> manual
> >>> annotations because this is yet another cross-release issue. So not 
> >>> sure
> >>> that helps at all.
> >>>
> >>> iow, not sure

[Intel-gfx] [PATCH 13/66] drm/i915/gem: Don't drop the timeline lock during execbuf

2020-07-15 Thread Chris Wilson
Our timeline lock is our defence against a concurrent execbuf
interrupting our request construction. We need to hold it throughout or,
for example, a second thread may interject a relocation request in
between our own relocation request and execution in the ring.

A second, major benefit is that it allows us to preserve a large chunk
of the ringbuffer for our exclusive use, which should virtually
eliminate the threat of hitting a wait_for_space during request
construction -- although we should have already dropped other
contentious locks at that point.
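
The rule this relies on is simple to state as code. A minimal sketch,
where construct() stands in for the relocation and ring emission steps
(the real execbuf obviously does far more around it):

static int submit_under_timeline(struct intel_timeline *tl,
				 int (*construct)(void *data), void *data)
{
	int err;

	/* One lock held from the first relocation to the final emission. */
	err = mutex_lock_interruptible(&tl->mutex);
	if (err)
		return err;

	/*
	 * While we hold tl->mutex no other execbuf on this timeline can
	 * interject a request between our relocations and our batch, and
	 * any ring space we reserve stays ours until we unlock.
	 */
	err = construct(data);

	mutex_unlock(&tl->mutex);
	return err;
}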

Signed-off-by: Chris Wilson 
---
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c| 413 +++---
 .../i915/gem/selftests/i915_gem_execbuffer.c  |  24 +-
 2 files changed, 281 insertions(+), 156 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c 
b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
index 719ba9fe3e85..af3499aafd22 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c
@@ -259,6 +259,8 @@ struct i915_execbuffer {
bool has_fence : 1;
bool needs_unfenced : 1;
 
+   struct intel_context *ce;
+
struct i915_vma *target;
struct i915_request *rq;
struct i915_vma *rq_vma;
@@ -639,6 +641,35 @@ static int eb_reserve_vma(const struct i915_execbuffer *eb,
return 0;
 }
 
+static void retire_requests(struct intel_timeline *tl)
+{
+   struct i915_request *rq, *rn;
+
+   list_for_each_entry_safe(rq, rn, &tl->requests, link)
+   if (!i915_request_retire(rq))
+   break;
+}
+
+static int wait_for_timeline(struct intel_timeline *tl)
+{
+   do {
+   struct dma_fence *fence;
+   int err;
+
+   fence = i915_active_fence_get(&tl->last_request);
+   if (!fence)
+   return 0;
+
+   err = dma_fence_wait(fence, true);
+   dma_fence_put(fence);
+   if (err)
+   return err;
+
+   /* Retiring may trigger a barrier, requiring an extra pass */
+   retire_requests(tl);
+   } while (1);
+}
+
 static int eb_reserve(struct i915_execbuffer *eb)
 {
const unsigned int count = eb->buffer_count;
@@ -646,7 +677,6 @@ static int eb_reserve(struct i915_execbuffer *eb)
struct list_head last;
struct eb_vma *ev;
unsigned int i, pass;
-   int err = 0;
 
/*
 * Attempt to pin all of the buffers into the GTT.
@@ -662,18 +692,37 @@ static int eb_reserve(struct i915_execbuffer *eb)
 * room for the earlier objects *unless* we need to defragment.
 */
 
-   if (mutex_lock_interruptible(&eb->i915->drm.struct_mutex))
-   return -EINTR;
-
pass = 0;
do {
+   int err = 0;
+
+   /*
+* We need to hold one lock as we bind all the vma so that
+* we have a consistent view of the entire vm and can plan
+* evictions to fill the whole GTT. If we allow a second
+* thread to run as we do this, it will either unbind
+* everything we want pinned, or steal space that we need for
+* ourselves. The closer we are to a full GTT, the more likely
+* such contention will cause us to fail to bind the workload
+* for this batch. Since we know at this point we need to
+* find space for new buffers, we know that extra pressure
+* from contention is likely.
+*
+* In lieu of being able to hold vm->mutex for the entire
+* sequence (it's complicated!), we opt for struct_mutex.
+*/
+   if (mutex_lock_interruptible(&eb->i915->drm.struct_mutex))
+   return -EINTR;
+
list_for_each_entry(ev, &eb->unbound, bind_link) {
err = eb_reserve_vma(eb, ev, pin_flags);
if (err)
break;
}
-   if (!(err == -ENOSPC || err == -EAGAIN))
-   break;
+   if (!(err == -ENOSPC || err == -EAGAIN)) {
+   mutex_unlock(&eb->i915->drm.struct_mutex);
+   return err;
+   }
 
/* Resort *all* the objects into priority order */
INIT_LIST_HEAD(&eb->unbound);
@@ -702,38 +751,50 @@ static int eb_reserve(struct i915_execbuffer *eb)
list_add_tail(&ev->bind_link, &last);
}
list_splice_tail(&last, &eb->unbound);
+   mutex_unlock(&eb->i915->drm.struct_mutex);
 
if (err == -EAGAIN) {
-   mutex_unlock(&eb->i915->drm.struct_mutex);
flush_workqueue(eb->i915->mm.userp

[Intel-gfx] [PATCH 05/66] drm/i915: Skip taking acquire mutex for no ref->active callback

2020-07-15 Thread Chris Wilson
If no active callback is defined for i915_active, we do not need to
serialise its enabling with the mutex. We do still only want to call the
debug activate once, and must still serialise with a concurrent retire.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/i915_active.c | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_active.c 
b/drivers/gpu/drm/i915/i915_active.c
index d960d0be5bd2..841b5c30950a 100644
--- a/drivers/gpu/drm/i915/i915_active.c
+++ b/drivers/gpu/drm/i915/i915_active.c
@@ -416,6 +416,14 @@ bool i915_active_acquire_if_busy(struct i915_active *ref)
return atomic_add_unless(&ref->count, 1, 0);
 }
 
+static void __i915_active_activate(struct i915_active *ref)
+{
+   spin_lock_irq(&ref->tree_lock); /* __active_retire() */
+   if (!atomic_fetch_inc(&ref->count))
+   debug_active_activate(ref);
+   spin_unlock_irq(&ref->tree_lock);
+}
+
 int i915_active_acquire(struct i915_active *ref)
 {
int err;
@@ -423,23 +431,22 @@ int i915_active_acquire(struct i915_active *ref)
if (i915_active_acquire_if_busy(ref))
return 0;
 
+   if (!ref->active) {
+   __i915_active_activate(ref);
+   return 0;
+   }
+
err = mutex_lock_interruptible(&ref->mutex);
if (err)
return err;
 
if (likely(!i915_active_acquire_if_busy(ref))) {
-   if (ref->active)
-   err = ref->active(ref);
-   if (!err) {
-   spin_lock_irq(&ref->tree_lock); /* __active_retire() */
-   debug_active_activate(ref);
-   atomic_inc(&ref->count);
-   spin_unlock_irq(&ref->tree_lock);
-   }
+   err = ref->active(ref);
+   if (!err)
+   __i915_active_activate(ref);
}
 
mutex_unlock(&ref->mutex);
-
return err;
 }
 
-- 
2.20.1

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


[Intel-gfx] [PATCH 09/66] drm/i915: Provide a fastpath for waiting on vma bindings

2020-07-15 Thread Chris Wilson
Before we can execute a request, we must wait for all of its vma to be
bound. This is a frequent operation for which we can optimise away a
few atomic operations (notably a cmpxchg) in lieu of the RCU protection.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/i915_active.h | 15 +++
 drivers/gpu/drm/i915/i915_vma.c|  9 +++--
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_active.h 
b/drivers/gpu/drm/i915/i915_active.h
index b9e0394e2975..fb165d3f01cf 100644
--- a/drivers/gpu/drm/i915/i915_active.h
+++ b/drivers/gpu/drm/i915/i915_active.h
@@ -231,4 +231,19 @@ struct i915_active *i915_active_create(void);
 struct i915_active *i915_active_get(struct i915_active *ref);
 void i915_active_put(struct i915_active *ref);
 
+static inline int __i915_request_await_exclusive(struct i915_request *rq,
+struct i915_active *active)
+{
+   struct dma_fence *fence;
+   int err = 0;
+
+   fence = i915_active_fence_get(&active->excl);
+   if (fence) {
+   err = i915_request_await_dma_fence(rq, fence);
+   dma_fence_put(fence);
+   }
+
+   return err;
+}
+
 #endif /* _I915_ACTIVE_H_ */
diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
index bc64f773dcdb..cd12047c7791 100644
--- a/drivers/gpu/drm/i915/i915_vma.c
+++ b/drivers/gpu/drm/i915/i915_vma.c
@@ -1167,6 +1167,12 @@ void i915_vma_revoke_mmap(struct i915_vma *vma)
list_del(&vma->obj->userfault_link);
 }
 
+static int
+__i915_request_await_bind(struct i915_request *rq, struct i915_vma *vma)
+{
+   return __i915_request_await_exclusive(rq, &vma->active);
+}
+
 int __i915_vma_move_to_active(struct i915_vma *vma, struct i915_request *rq)
 {
int err;
@@ -1174,8 +1180,7 @@ int __i915_vma_move_to_active(struct i915_vma *vma, 
struct i915_request *rq)
GEM_BUG_ON(!i915_vma_is_pinned(vma));
 
/* Wait for the vma to be bound before we start! */
-   err = i915_request_await_active(rq, &vma->active,
-   I915_ACTIVE_AWAIT_EXCL);
+   err = __i915_request_await_bind(rq, vma);
if (err)
return err;
 
-- 
2.20.1

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


[Intel-gfx] [PATCH 52/66] drm/i915: Teach the i915_dependency to use a double-lock

2020-07-15 Thread Chris Wilson
Currently, we construct and tear down the i915_dependency chains using a
global spinlock. As the lists are entirely local, it should be possible
to use a double-lock with explicit nesting [signaler -> waiter,
always] and so avoid the costly convenience of a global spinlock.
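
As a sketch, the ordering rule used throughout the patch below,
collected in one place:

static void add_dependency_locked(struct i915_sched_node *signal,
				  struct i915_sched_node *waiter,
				  struct i915_dependency *dep)
{
	/* The signaler's lock is always the outer lock of the pair... */
	spin_lock_irq(&signal->lock);

	/* ...and the waiter's lock nests strictly inside it. */
	spin_lock_nested(&waiter->lock, SINGLE_DEPTH_NESTING);

	list_add_rcu(&dep->signal_link, &waiter->signalers_list);
	list_add_rcu(&dep->wait_link, &signal->waiters_list);

	spin_unlock(&waiter->lock);
	spin_unlock_irq(&signal->lock);
}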

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/gt/intel_lrc.c |  6 +--
 drivers/gpu/drm/i915/i915_request.c |  2 +-
 drivers/gpu/drm/i915/i915_scheduler.c   | 44 +
 drivers/gpu/drm/i915/i915_scheduler.h   |  2 +-
 drivers/gpu/drm/i915/i915_scheduler_types.h |  1 +
 5 files changed, 34 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c 
b/drivers/gpu/drm/i915/gt/intel_lrc.c
index fdeeed8b45d5..2dd116c0d2a1 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1831,7 +1831,7 @@ static void defer_request(struct i915_request *rq, struct 
list_head * const pl)
struct i915_request *w =
container_of(p->waiter, typeof(*w), sched);
 
-   if (p->flags & I915_DEPENDENCY_WEAK)
+   if (!p->waiter || p->flags & I915_DEPENDENCY_WEAK)
continue;
 
/* Leave semaphores spinning on the other engines */
@@ -2683,7 +2683,7 @@ static void __execlists_hold(struct i915_request *rq)
struct i915_request *w =
container_of(p->waiter, typeof(*w), sched);
 
-   if (p->flags & I915_DEPENDENCY_WEAK)
+   if (!p->waiter || p->flags & I915_DEPENDENCY_WEAK)
continue;
 
/* Leave semaphores spinning on the other engines */
@@ -2781,7 +2781,7 @@ static void __execlists_unhold(struct i915_request *rq)
struct i915_request *w =
container_of(p->waiter, typeof(*w), sched);
 
-   if (p->flags & I915_DEPENDENCY_WEAK)
+   if (!p->waiter || p->flags & I915_DEPENDENCY_WEAK)
continue;
 
/* Propagate any change in error status */
diff --git a/drivers/gpu/drm/i915/i915_request.c 
b/drivers/gpu/drm/i915/i915_request.c
index 1c00edf427f0..6528ace4c0b7 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -334,7 +334,7 @@ bool i915_request_retire(struct i915_request *rq)
intel_context_unpin(rq->context);
 
free_capture_list(rq);
-   i915_sched_node_fini(&rq->sched);
+   i915_sched_node_retire(&rq->sched);
i915_request_put(rq);
 
return true;
diff --git a/drivers/gpu/drm/i915/i915_scheduler.c 
b/drivers/gpu/drm/i915/i915_scheduler.c
index 9f744f470556..2e4d512e61d8 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.c
+++ b/drivers/gpu/drm/i915/i915_scheduler.c
@@ -353,6 +353,8 @@ void i915_request_set_priority(struct i915_request *rq, int 
prio)
 
 void i915_sched_node_init(struct i915_sched_node *node)
 {
+   spin_lock_init(&node->lock);
+
INIT_LIST_HEAD(&node->signalers_list);
INIT_LIST_HEAD(&node->waiters_list);
INIT_LIST_HEAD(&node->link);
@@ -390,7 +392,8 @@ bool __i915_sched_node_add_dependency(struct 
i915_sched_node *node,
 {
bool ret = false;
 
-   spin_lock_irq(&schedule_lock);
+   /* The signal->lock is always the outer lock in this double-lock. */
+   spin_lock_irq(&signal->lock);
 
if (!node_signaled(signal)) {
INIT_LIST_HEAD(&dep->dfs_link);
@@ -399,15 +402,17 @@ bool __i915_sched_node_add_dependency(struct 
i915_sched_node *node,
dep->flags = flags;
 
/* All set, now publish. Beware the lockless walkers. */
+   spin_lock_nested(&node->lock, SINGLE_DEPTH_NESTING);
list_add_rcu(&dep->signal_link, &node->signalers_list);
list_add_rcu(&dep->wait_link, &signal->waiters_list);
+   spin_unlock(&node->lock);
 
/* Propagate the chains */
node->flags |= signal->flags;
ret = true;
}
 
-   spin_unlock_irq(&schedule_lock);
+   spin_unlock_irq(&signal->lock);
 
return ret;
 }
@@ -433,39 +438,46 @@ int i915_sched_node_add_dependency(struct i915_sched_node 
*node,
return 0;
 }
 
-void i915_sched_node_fini(struct i915_sched_node *node)
+void i915_sched_node_retire(struct i915_sched_node *node)
 {
struct i915_dependency *dep, *tmp;
 
-   spin_lock_irq(&schedule_lock);
+   spin_lock_irq(&node->lock);
 
/*
 * Everyone we depended upon (the fences we wait to be signaled)
 * should retire before us and remove themselves from our list.
 * However, retirement is run independently on each timeline and
-* so we may be called out-of-order.
+* so we may be ca

[Intel-gfx] [PATCH 07/66] drm/i915: Keep the most recently used active-fence upon discard

2020-07-15 Thread Chris Wilson
Whenever an i915_active idles, we prune its tree of old fence slots to
prevent a gradual leak should it be used to track many, many timelines.
The downside is that we then have to frequently reallocate the rbtree.
A compromise is that we keep the most recently used fence slot, and
reuse that for the next active reference as that is the most likely
timeline to be reused.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/i915_active.c | 27 ---
 drivers/gpu/drm/i915/i915_active.h |  4 
 2 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_active.c 
b/drivers/gpu/drm/i915/i915_active.c
index 799282fb1bb9..0854b1552bc1 100644
--- a/drivers/gpu/drm/i915/i915_active.c
+++ b/drivers/gpu/drm/i915/i915_active.c
@@ -130,8 +130,8 @@ static inline void debug_active_assert(struct i915_active 
*ref) { }
 static void
 __active_retire(struct i915_active *ref)
 {
+   struct rb_root root = RB_ROOT;
struct active_node *it, *n;
-   struct rb_root root;
unsigned long flags;
 
GEM_BUG_ON(i915_active_is_idle(ref));
@@ -143,9 +143,21 @@ __active_retire(struct i915_active *ref)
GEM_BUG_ON(rcu_access_pointer(ref->excl.fence));
debug_active_deactivate(ref);
 
-   root = ref->tree;
-   ref->tree = RB_ROOT;
-   ref->cache = NULL;
+   /* Even if we have not used the cache, we may still have a barrier */
+   if (!ref->cache)
+   ref->cache = fetch_node(ref->tree.rb_node);
+
+   /* Keep the MRU cached node for reuse */
+   if (ref->cache) {
+   /* Discard all other nodes in the tree */
+   rb_erase(&ref->cache->node, &ref->tree);
+   root = ref->tree;
+
+   /* Rebuild the tree with only the cached node */
+   rb_link_node(&ref->cache->node, NULL, &ref->tree.rb_node);
+   rb_insert_color(&ref->cache->node, &ref->tree);
+   GEM_BUG_ON(ref->tree.rb_node != &ref->cache->node);
+   }
 
spin_unlock_irqrestore(&ref->tree_lock, flags);
 
@@ -156,6 +168,7 @@ __active_retire(struct i915_active *ref)
/* ... except if you wait on it, you must manage your own references! */
wake_up_var(ref);
 
+   /* Finally free the discarded timeline tree  */
rbtree_postorder_for_each_entry_safe(it, n, &root, node) {
GEM_BUG_ON(i915_active_fence_isset(&it->base));
kmem_cache_free(global.slab_cache, it);
@@ -750,16 +763,16 @@ int i915_sw_fence_await_active(struct i915_sw_fence 
*fence,
return await_active(ref, flags, sw_await_fence, fence, fence);
 }
 
-#if IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM)
 void i915_active_fini(struct i915_active *ref)
 {
debug_active_fini(ref);
GEM_BUG_ON(atomic_read(&ref->count));
GEM_BUG_ON(work_pending(&ref->work));
-   GEM_BUG_ON(!RB_EMPTY_ROOT(&ref->tree));
mutex_destroy(&ref->mutex);
+
+   if (ref->cache)
+   kmem_cache_free(global.slab_cache, ref->cache);
 }
-#endif
 
 static inline bool is_idle_barrier(struct active_node *node, u64 idx)
 {
diff --git a/drivers/gpu/drm/i915/i915_active.h 
b/drivers/gpu/drm/i915/i915_active.h
index 73ded3c52a04..b9e0394e2975 100644
--- a/drivers/gpu/drm/i915/i915_active.h
+++ b/drivers/gpu/drm/i915/i915_active.h
@@ -217,11 +217,7 @@ i915_active_is_idle(const struct i915_active *ref)
return !atomic_read(&ref->count);
 }
 
-#if IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM)
 void i915_active_fini(struct i915_active *ref);
-#else
-static inline void i915_active_fini(struct i915_active *ref) { }
-#endif
 
 int i915_active_acquire_preallocate_barrier(struct i915_active *ref,
struct intel_engine_cs *engine);
-- 
2.20.1

___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] sw_sync deadlock avoidance, take 3

2020-07-15 Thread Daniel Vetter
On Wed, Jul 15, 2020 at 1:47 PM Daniel Stone  wrote:
>
> Hi,
>
> On Wed, 15 Jul 2020 at 12:05, Bas Nieuwenhuizen  
> wrote:
> > On Wed, Jul 15, 2020 at 12:34 PM Chris Wilson  
> > wrote:
> > > Maybe now is the time to ask: are you using sw_sync outside of
> > > validation?
> >
> > Yes, this is used as part of the Android stack on Chrome OS (need to
> > see if ChromeOS specific, but
> > https://source.android.com/devices/graphics/sync#sync_timeline
> > suggests not)
>
> Android used to mandate it for their earlier iteration of release
> fences, which was an empty/future fence having no guarantee of
> eventual forward progress until someone committed work later on. For
> example, when you committed a buffer to SF, it would give you an empty
> 'release fence' for that buffer which would only be tied to work to
> signal it when you committed your _next_ buffer, which might never
> happen. They removed that because a) future fences were a bad idea,
> and b) it was only ever useful if you assumed strictly
> FIFO/round-robin return order which wasn't always true.
>
> So now it's been watered down to 'use this if you don't have a
> hardware timeline', but why don't we work with Android people to get
> that removed entirely?

I think there are some testcases still using these, but most real fence
testcases use vgem nowadays. So from an upstream pov there's indeed
not much if anything holding us back from just deleting this all. And
would probably be a good idea.

Adding Rob and John for more of the android pov.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: Reduce i915_request.lock contention for i915_request_wait

2020-07-15 Thread Tvrtko Ursulin



On 15/07/2020 11:50, Chris Wilson wrote:

Currently, we use i915_request_completed() directly in
i915_request_wait() and follow up with a manual invocation of
dma_fence_signal(). This appears to cause a large number of contentions
on i915_request.lock as when the process is woken up after the fence is
signaled by an interrupt, we will then try and call dma_fence_signal()
ourselves while the signaler is still holding the lock.
dma_fence_is_signaled() has the benefit of checking the
DMA_FENCE_FLAG_SIGNALED_BIT prior to calling dma_fence_signal() and so
avoids most of that contention.
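
For reference, the fast path being relied upon is roughly the following
(a paraphrase of the dma-fence helper, not a verbatim quote):

static bool fence_is_signaled(struct dma_fence *fence)
{
	/* Lockless fast path: already marked signaled, no fence->lock. */
	if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
		return true;

	/* Slow path: poll the backend and latch the result. */
	if (fence->ops->signaled && fence->ops->signaled(fence)) {
		dma_fence_signal(fence);
		return true;
	}

	return false;
}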

Signed-off-by: Chris Wilson 
Cc: Matthew Auld 
Cc: Tvrtko Ursulin 
---
  drivers/gpu/drm/i915/i915_request.c | 12 
  1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_request.c 
b/drivers/gpu/drm/i915/i915_request.c
index 0b2fe55e6194..bb4eb1a8780e 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1640,7 +1640,7 @@ static bool busywait_stop(unsigned long timeout, unsigned 
int cpu)
return this_cpu != cpu;
  }
  
-static bool __i915_spin_request(const struct i915_request * const rq, int state)

+static bool __i915_spin_request(struct i915_request * const rq, int state)
  {
unsigned long timeout_ns;
unsigned int cpu;
@@ -1673,7 +1673,7 @@ static bool __i915_spin_request(const struct i915_request 
* const rq, int state)
timeout_ns = READ_ONCE(rq->engine->props.max_busywait_duration_ns);
timeout_ns += local_clock_ns(&cpu);
do {
-   if (i915_request_completed(rq))
+   if (dma_fence_is_signaled(&rq->fence))
return true;
  
  		if (signal_pending_state(state, current))

@@ -1766,10 +1766,8 @@ long i915_request_wait(struct i915_request *rq,
 * duration, which we currently lack.
 */
if (IS_ACTIVE(CONFIG_DRM_I915_MAX_REQUEST_BUSYWAIT) &&
-   __i915_spin_request(rq, state)) {
-   dma_fence_signal(&rq->fence);
+   __i915_spin_request(rq, state))
goto out;
-   }
  
  	/*

 * This client is about to stall waiting for the GPU. In many cases
@@ -1796,10 +1794,8 @@ long i915_request_wait(struct i915_request *rq,
for (;;) {
set_current_state(state);
  
-		if (i915_request_completed(rq)) {

-   dma_fence_signal(&rq->fence);
+   if (dma_fence_is_signaled(&rq->fence))
break;
-   }
  
  		intel_engine_flush_submission(rq->engine);
  



In other words, putting some latency back into the waiters, which is
probably okay, since sync waits are not our primary model.


I have a slight concern about the remaining value of busy spinning if
the i915_request_completed check is removed from there as well. Of
course it doesn't make sense to have different completion criteria
between the two. We could wait a bit longer if the real check in the
busy-spin said the request is actually completed, just not signal it
ourselves but wait for the breadcrumbs to do it.


Regards,

Tvrtko
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH 2/2] dma-buf/dma-fence: Add quick tests before dma_fence_remove_callback

2020-07-15 Thread Daniel Vetter
On Wed, Jul 15, 2020 at 11:49:05AM +0100, Chris Wilson wrote:
> When waiting with a callback on the stack, we must remove the callback
> upon wait completion. Since this will be notified by the fence signal
> callback, the removal often contends with the fence->lock being held by
> the signaler. We can look at the list entry to see if the callback was
> already signaled before we take the contended lock.
> 
> Signed-off-by: Chris Wilson 
> ---
>  drivers/dma-buf/dma-fence.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> index 8d5bdfce638e..b910d7bc0854 100644
> --- a/drivers/dma-buf/dma-fence.c
> +++ b/drivers/dma-buf/dma-fence.c
> @@ -420,6 +420,9 @@ dma_fence_remove_callback(struct dma_fence *fence, struct 
> dma_fence_cb *cb)
>   unsigned long flags;
>   bool ret;
>  
> + if (list_empty(&cb->node))

I was about to say "but the races" but then noticed that Paul fixed
list_empty to use READ_ONCE like 5 years ago :-)

Reviewed-by: Daniel Vetter 

> + return false;
> +
>   spin_lock_irqsave(fence->lock, flags);
>  
>   ret = !list_empty(&cb->node);
> -- 
> 2.20.1
> 
> ___
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


[Intel-gfx] ✗ Fi.CI.SPARSE: warning for drm/i915: Reduce i915_request.lock contention for i915_request_wait

2020-07-15 Thread Patchwork
== Series Details ==

Series: drm/i915: Reduce i915_request.lock contention for i915_request_wait
URL   : https://patchwork.freedesktop.org/series/79514/
State : warning

== Summary ==

$ dim sparse --fast origin/drm-tip
Sparse version: v0.6.0
Fast mode used, each commit won't be checked separately.


___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH 2/2] dma-buf/dma-fence: Add quick tests before dma_fence_remove_callback

2020-07-15 Thread Chris Wilson
Quoting Daniel Vetter (2020-07-15 13:10:22)
> On Wed, Jul 15, 2020 at 11:49:05AM +0100, Chris Wilson wrote:
> > When waiting with a callback on the stack, we must remove the callback
> > upon wait completion. Since this will be notified by the fence signal
> > callback, the removal often contends with the fence->lock being held by
> > the signaler. We can look at the list entry to see if the callback was
> > already signaled before we take the contended lock.
> > 
> > Signed-off-by: Chris Wilson 
> > ---
> >  drivers/dma-buf/dma-fence.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
> > index 8d5bdfce638e..b910d7bc0854 100644
> > --- a/drivers/dma-buf/dma-fence.c
> > +++ b/drivers/dma-buf/dma-fence.c
> > @@ -420,6 +420,9 @@ dma_fence_remove_callback(struct dma_fence *fence, 
> > struct dma_fence_cb *cb)
> >   unsigned long flags;
> >   bool ret;
> >  
> > + if (list_empty(&cb->node))
> 
> I was about to say "but the races" but then noticed that Paul fixed
> list_empty to use READ_ONCE like 5 years ago :-)

I'm always going "when exactly do we need list_empty_careful()"?

We can rule out a concurrent dma_fence_add_callback() for the same
dma_fence_cb, as that is a lost cause. So we only have to worry about
the concurrent list_del_init() from dma_fence_signal_locked(). So it's
the timing of
list_del_init(): WRITE_ONCE(list->next, list)
vs
READ_ONCE(list->next) == list
and we don't need to care about the trailing instructions in
list_del_init()...

Wait, that trailing instruction is actually important here if the
dma_fence_cb is on the stack, or is otherwise about to be freed.

Ok, this does need to be list_empty_careful!
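
(For reference, roughly how the two helpers read: list_empty() only
looks at ->next, so it can report empty after just the first store of
list_del_init(), while list_empty_careful() also insists that ->prev
agrees, i.e. that the trailing store has landed.)

static inline int list_empty(const struct list_head *head)
{
	/* observes only the first WRITE_ONCE() of list_del_init() */
	return READ_ONCE(head->next) == head;
}

static inline int list_empty_careful(const struct list_head *head)
{
	struct list_head *next = head->next;

	/* additionally requires the trailing ->prev store */
	return (next == head) && (next == head->prev);
}
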
-Chris
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH] drm/i915: Reduce i915_request.lock contention for i915_request_wait

2020-07-15 Thread Tvrtko Ursulin


On 15/07/2020 13:06, Tvrtko Ursulin wrote:


On 15/07/2020 11:50, Chris Wilson wrote:

Currently, we use i915_request_completed() directly in
i915_request_wait() and follow up with a manual invocation of
dma_fence_signal(). This appears to cause a large number of contentions
on i915_request.lock as when the process is woken up after the fence is
signaled by an interrupt, we will then try and call dma_fence_signal()
ourselves while the signaler is still holding the lock.
dma_fence_is_signaled() has the benefit of checking the
DMA_FENCE_FLAG_SIGNALED_BIT prior to calling dma_fence_signal() and so
avoids most of that contention.

Signed-off-by: Chris Wilson 
Cc: Matthew Auld 
Cc: Tvrtko Ursulin 
---
  drivers/gpu/drm/i915/i915_request.c | 12 
  1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_request.c 
b/drivers/gpu/drm/i915/i915_request.c

index 0b2fe55e6194..bb4eb1a8780e 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1640,7 +1640,7 @@ static bool busywait_stop(unsigned long timeout, 
unsigned int cpu)

  return this_cpu != cpu;
  }
-static bool __i915_spin_request(const struct i915_request * const rq, 
int state)
+static bool __i915_spin_request(struct i915_request * const rq, int 
state)

  {
  unsigned long timeout_ns;
  unsigned int cpu;
@@ -1673,7 +1673,7 @@ static bool __i915_spin_request(const struct 
i915_request * const rq, int state)

  timeout_ns = READ_ONCE(rq->engine->props.max_busywait_duration_ns);
  timeout_ns += local_clock_ns(&cpu);
  do {
-    if (i915_request_completed(rq))
+    if (dma_fence_is_signaled(&rq->fence))
  return true;
  if (signal_pending_state(state, current))
@@ -1766,10 +1766,8 @@ long i915_request_wait(struct i915_request *rq,
   * duration, which we currently lack.
   */
  if (IS_ACTIVE(CONFIG_DRM_I915_MAX_REQUEST_BUSYWAIT) &&
-    __i915_spin_request(rq, state)) {
-    dma_fence_signal(&rq->fence);
+    __i915_spin_request(rq, state))
  goto out;
-    }
  /*
   * This client is about to stall waiting for the GPU. In many cases
@@ -1796,10 +1794,8 @@ long i915_request_wait(struct i915_request *rq,
  for (;;) {
  set_current_state(state);
-    if (i915_request_completed(rq)) {
-    dma_fence_signal(&rq->fence);
+    if (dma_fence_is_signaled(&rq->fence))
  break;
-    }
  intel_engine_flush_submission(rq->engine);



In other words, putting some latency back into the waiters, which is
probably okay, since sync waits are not our primary model.


I have a slight concern about the remaining value of busy spinning if
the i915_request_completed check is removed from there as well. Of
course it doesn't make sense to have different completion criteria
between the two. We could wait a bit longer if the real check in the
busy-spin said the request is actually completed, just not signal it
ourselves but wait for the breadcrumbs to do it.


What a load of nonsense.. :)

Okay, I think the only real question is i915_request_completed vs 
dma_fence_is_signaled in __i915_spin_request. Do we want to burn CPU
cycles waiting on the GPU and breadcrumb irq work, or just the GPU?


Regards,

Tvrtko



___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


[Intel-gfx] ✓ Fi.CI.BAT: success for drm/i915: Reduce i915_request.lock contention for i915_request_wait

2020-07-15 Thread Patchwork
== Series Details ==

Series: drm/i915: Reduce i915_request.lock contention for i915_request_wait
URL   : https://patchwork.freedesktop.org/series/79514/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_8749 -> Patchwork_18176


Summary
---

  **SUCCESS**

  No regressions found.

  External URL: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18176/index.html

Known issues


  Here are the changes found in Patchwork_18176 that come from known issues:

### IGT changes ###

 Issues hit 

  * igt@gem_exec_suspend@basic-s0:
- fi-tgl-u2:  [PASS][1] -> [FAIL][2] ([i915#1888])
   [1]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8749/fi-tgl-u2/igt@gem_exec_susp...@basic-s0.html
   [2]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18176/fi-tgl-u2/igt@gem_exec_susp...@basic-s0.html

  * igt@i915_pm_rpm@module-reload:
- fi-kbl-guc: [PASS][3] -> [SKIP][4] ([fdo#109271])
   [3]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8749/fi-kbl-guc/igt@i915_pm_...@module-reload.html
   [4]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18176/fi-kbl-guc/igt@i915_pm_...@module-reload.html

  * igt@i915_selftest@live@gt_contexts:
- fi-snb-2520m:   [PASS][5] -> [DMESG-FAIL][6] ([i915#541])
   [5]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8749/fi-snb-2520m/igt@i915_selftest@live@gt_contexts.html
   [6]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18176/fi-snb-2520m/igt@i915_selftest@live@gt_contexts.html

  * igt@kms_force_connector_basic@force-connector-state:
- fi-tgl-y:   [PASS][7] -> [DMESG-WARN][8] ([i915#1982]) +1 similar 
issue
   [7]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8749/fi-tgl-y/igt@kms_force_connector_ba...@force-connector-state.html
   [8]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18176/fi-tgl-y/igt@kms_force_connector_ba...@force-connector-state.html

  * igt@prime_vgem@basic-write:
- fi-tgl-y:   [PASS][9] -> [DMESG-WARN][10] ([i915#402]) +1 similar 
issue
   [9]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8749/fi-tgl-y/igt@prime_v...@basic-write.html
   [10]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18176/fi-tgl-y/igt@prime_v...@basic-write.html

  
 Possible fixes 

  * igt@i915_pm_rpm@module-reload:
- fi-byt-j1900:   [DMESG-WARN][11] ([i915#1982]) -> [PASS][12]
   [11]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8749/fi-byt-j1900/igt@i915_pm_...@module-reload.html
   [12]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18176/fi-byt-j1900/igt@i915_pm_...@module-reload.html

  * igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic:
- fi-bsw-n3050:   [DMESG-WARN][13] ([i915#1982]) -> [PASS][14]
   [13]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8749/fi-bsw-n3050/igt@kms_cursor_leg...@basic-busy-flip-before-cursor-atomic.html
   [14]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18176/fi-bsw-n3050/igt@kms_cursor_leg...@basic-busy-flip-before-cursor-atomic.html
- fi-bsw-kefka:   [DMESG-WARN][15] ([i915#1982]) -> [PASS][16]
   [15]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8749/fi-bsw-kefka/igt@kms_cursor_leg...@basic-busy-flip-before-cursor-atomic.html
   [16]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18176/fi-bsw-kefka/igt@kms_cursor_leg...@basic-busy-flip-before-cursor-atomic.html

  * igt@kms_flip@basic-plain-flip@d-dsi1:
- {fi-tgl-dsi}:   [DMESG-WARN][17] ([i915#1982]) -> [PASS][18]
   [17]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8749/fi-tgl-dsi/igt@kms_flip@basic-plain-f...@d-dsi1.html
   [18]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18176/fi-tgl-dsi/igt@kms_flip@basic-plain-f...@d-dsi1.html

  * igt@vgem_basic@create:
- fi-tgl-y:   [DMESG-WARN][19] ([i915#402]) -> [PASS][20]
   [19]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8749/fi-tgl-y/igt@vgem_ba...@create.html
   [20]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18176/fi-tgl-y/igt@vgem_ba...@create.html

  
 Warnings 

  * igt@kms_cursor_legacy@basic-flip-before-cursor-varying-size:
- fi-kbl-x1275:   [DMESG-WARN][21] ([i915#62] / [i915#92]) -> 
[DMESG-WARN][22] ([i915#62] / [i915#92] / [i915#95]) +5 similar issues
   [21]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8749/fi-kbl-x1275/igt@kms_cursor_leg...@basic-flip-before-cursor-varying-size.html
   [22]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18176/fi-kbl-x1275/igt@kms_cursor_leg...@basic-flip-before-cursor-varying-size.html

  * igt@kms_force_connector_basic@force-edid:
- fi-kbl-x1275:   [DMESG-WARN][23] ([i915#62] / [i915#92] / [i915#95]) 
-> [DMESG-WARN][24] ([i915#62] / [i915#92]) +1 similar issue
   [23]: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_8749/fi-kbl-x1275/igt@kms_force_connector_ba...@force-edid.html
   [24]: 
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_18176/fi-kbl-x1275/ig
