Re: [Intel-gfx] [PATCH 10/34] drm/i915: Remove GPU reset dependence on struct_mutex

2019-01-24 Thread Chris Wilson
Quoting Chris Wilson (2019-01-24 12:50:30)
> Quoting Mika Kuoppala (2019-01-24 12:06:01)
> > Chris Wilson  writes:
> > > - /*
> > > -  * We want to perform per-engine reset from atomic context (e.g.
> > > -  * softirq), which imposes the constraint that we cannot sleep.
> > > -  * However, experience suggests that spending a bit of time waiting
> > > -  * for a reset helps in various cases, so for a full-device reset
> > > -  * we apply the opposite rule and wait if we want to. As we should
> > > -  * always follow up a failed per-engine reset with a full device 
> > > reset,
> > > -  * being a little faster, stricter and more error prone for the
> > > -  * atomic case seems an acceptable compromise.
> > > -  *
> > > -  * Unfortunately this leads to a bimodal routine, when the goal was
> > > -  * to have a single reset function that worked for resetting any
> > > -  * number of engines simultaneously.
> > > -  */
> > > - might_sleep_if(engine_mask == ALL_ENGINES);
> > 
> > Oh here it is. I was after this on atomic resets.
> 
> I was saying it didn't make sense to lift the restriction until we
> relied upon it. Just because the code became safe doesn't mean it was then
> part of the API :)

Hmm, set-wedged is meant to be magical more or less now. That gives more
weight to the argument of making intel_gpu_reset() magical and removing
this comment earlier.
-Chris


Re: [Intel-gfx] [PATCH 10/34] drm/i915: Remove GPU reset dependence on struct_mutex

2019-01-24 Thread Chris Wilson
Quoting Chris Wilson (2019-01-24 12:50:30)
> Quoting Mika Kuoppala (2019-01-24 12:06:01)
> > Why did you remove the comment on needing an empty request?
> 
> I don't believe it's true any more, and it was detracting from the
> emphasis on the idea of restoring a context.

Thinking about it, it was probably fixed by the work for reset handling
prior to the gen6/gen7 ppgtt updates.
-Chris


Re: [Intel-gfx] [PATCH 10/34] drm/i915: Remove GPU reset dependence on struct_mutex

2019-01-24 Thread Chris Wilson
Quoting Mika Kuoppala (2019-01-24 12:06:01)
> Chris Wilson  writes:
> > + mutex_lock(&i915->drm.struct_mutex);
> >   i915_gem_contexts_lost(i915);
> >   mutex_unlock(&i915->drm.struct_mutex);
> >  }
> > @@ -4534,6 +4528,8 @@ int i915_gem_suspend(struct drm_i915_private *i915)
> >   wakeref = intel_runtime_pm_get(i915);
> >   intel_suspend_gt_powersave(i915);
> >  
> > + flush_workqueue(i915->wq);
> 
> I don't know what is happening here. Why
> don't we need the i915_gem_drain_workqueue in here?

It's just a poke at the queue before we end up doing the same work
ourselves.
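
Purely as an illustration of the distinction (generic workqueue API only, not
the i915 helpers, and the function names below are made up): a single flush
waits once for work already queued, whereas a drain keeps going until even
self-requeueing work has stopped. The suspend path only needs the former here
because it goes on to retire everything by hand straight afterwards.

#include <linux/workqueue.h>

/* One pass: wait for whatever is queued on wq right now to complete. */
static void poke_the_queue(struct workqueue_struct *wq)
{
	flush_workqueue(wq);
}

/* Keep flushing until even self-requeueing work stops resubmitting. */
static void drain_the_queue(struct workqueue_struct *wq)
{
	drain_workqueue(wq);
}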

> >   mutex_lock(&i915->drm.struct_mutex);
> >  
> >   /*
> > @@ -4563,11 +4559,9 @@ int i915_gem_suspend(struct drm_i915_private *i915)
> >   i915_retire_requests(i915); /* ensure we flush after wedging */
> >  
> >   mutex_unlock(&i915->drm.struct_mutex);
> > + i915_reset_flush(i915);
> >  
> > - intel_uc_suspend(i915);
> > -
> > - cancel_delayed_work_sync(&i915->gpu_error.hangcheck_work);
> > - cancel_delayed_work_sync(&i915->gt.retire_work);
> > + drain_delayed_work(&i915->gt.retire_work);
> 
> Hangcheck is inside reset flush but why the change
> for retire?

So we didn't leave the retire_work in an ill-defined state. I was
probably thinking consistency over cleverness. It's also highly probable
that I stuck something else in there at one point.
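
For reference, a sketch of the difference (drain_delayed_work() is an
i915-local helper as far as I recall; the body below is an approximation, not
a quote of the tree): cancel_delayed_work_sync() may kill the work before it
has run at all, whereas draining lets it run, and re-run, until it stops
rearming itself, so retire_work finishes in a known state.

#include <linux/workqueue.h>

/* Approximate shape of the helper: flush until the work stops rearming. */
static inline void drain_delayed_work(struct delayed_work *dw)
{
	do {
		while (flush_delayed_work(dw))
			;
	} while (delayed_work_pending(dw));
}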

> > - ecode = i915_error_generate_code(dev_priv, error, &engine_id);
> > + for (i = 0; i < ARRAY_SIZE(error->engine); i++)
> > + if (!error->engine[i].context.pid)
> > + engines &= ~BIT(i);
> 
> No more grouping for driver internal hangs...?

There's no pid, right, so the message at the moment is garbled. The ecode is
just noise: you don't use it for analysis, you don't use it for error
state matching (or at least you shouldn't, as it has no meaning wrt the
error state).

> >   len = scnprintf(error->error_msg, sizeof(error->error_msg),
> > - "GPU HANG: ecode %d:%d:0x%08x",
> > - INTEL_GEN(dev_priv), engine_id, ecode);
> > -
> > - if (engine_id != -1 && error->engine[engine_id].context.pid)
> > + "GPU HANG: ecode %d:%lx:0x%08x",
> > + INTEL_GEN(error->i915), engines,
> > + i915_error_generate_code(error, engines));
> > + if (engines) {
> > + /* Just show the first executing process, more is confusing */
> > + i = ffs(engines);
> 
> then why not just make the ecode accept a single engine and move it here?

I don't think the ecode should just accept a single engine. I don't
think the ecode should exist at all, but that's another matter.
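
To make the flow of the quoted hunk concrete, here is a self-contained toy
(not driver code; the struct and the sample values are invented): engines
whose hang carried no user context are masked out of the summary, and only
the first remaining engine's process is named.

#include <stdio.h>
#include <strings.h>	/* ffs() */

struct toy_engine { int pid; const char *comm; };

int main(void)
{
	struct toy_engine engine[4] = {
		{ 0, NULL }, { 1234, "glxgears" }, { 0, NULL }, { 1235, "heaven" },
	};
	unsigned long engines = 0xf;	/* pretend all four engines hung */
	unsigned int i;

	for (i = 0; i < 4; i++)		/* drop driver-internal hangs */
		if (!engine[i].pid)
			engines &= ~(1UL << i);

	printf("GPU HANG: engines %lx\n", engines);
	if (engines) {			/* just show the first executing process */
		i = ffs(engines) - 1;
		printf("culprit: %s [%d]\n", engine[i].comm, engine[i].pid);
	}
	return 0;
}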

> > - /*
> > -  * We want to perform per-engine reset from atomic context (e.g.
> > -  * softirq), which imposes the constraint that we cannot sleep.
> > -  * However, experience suggests that spending a bit of time waiting
> > -  * for a reset helps in various cases, so for a full-device reset
> > -  * we apply the opposite rule and wait if we want to. As we should
> > -  * always follow up a failed per-engine reset with a full device 
> > reset,
> > -  * being a little faster, stricter and more error prone for the
> > -  * atomic case seems an acceptable compromise.
> > -  *
> > -  * Unfortunately this leads to a bimodal routine, when the goal was
> > -  * to have a single reset function that worked for resetting any
> > -  * number of engines simultaneously.
> > -  */
> > - might_sleep_if(engine_mask == ALL_ENGINES);
> 
> Oh here it is. I was after this on atomic resets.

I was saying it didn't make sense to lift the restriction until we
relied upon it. Just because the code became safe doesn't mean it was then
part of the API :)

> > +static void restart_work(struct work_struct *work)
> >  {
> > + struct i915_gpu_restart *arg = container_of(work, typeof(*arg), work);
> > + struct drm_i915_private *i915 = arg->i915;
> >   struct intel_engine_cs *engine;
> >   enum intel_engine_id id;
> > + intel_wakeref_t wakeref;
> >  
> > - lockdep_assert_held(&i915->drm.struct_mutex);
> > + wakeref = intel_runtime_pm_get(i915);
> > + mutex_lock(&i915->drm.struct_mutex);
> >  
> > - i915_retire_requests(i915);
> 
> Can't do this anymore yes. What will be the effect
> of delaying this and the other explicit retirements?
> Are we more prone to starvation?

No. We risk gen3 spontaneously dying as we have no idea why it needs a
request in the pipeline soon after a reset, so no idea if a potential
delay will kill it.

> > + smp_store_mb(i915->gpu_error.restart, NULL);
> 
> Checkpatch might want a comment for the mb.

Checkpatch is silly about this one. It's precisely because the other
side is unserialised that it is an smp_store_mb(). But it's just a glorified
WRITE_ONCE, not that important.
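
For reference (only the store appears in the patch; the reader side and the
names below are assumptions): smp_store_mb() is a WRITE_ONCE() followed by a
full barrier, which is the natural pairing when the observer on the other side
takes no lock at all.

#include <linux/types.h>
#include <linux/compiler.h>
#include <asm/barrier.h>

static void *restart_token;	/* stand-in for i915->gpu_error.restart */

static void publish_restart_done(void)
{
	/* roughly: WRITE_ONCE(restart_token, NULL); smp_mb(); */
	smp_store_mb(restart_token, NULL);
}

static bool restart_pending(void)
{
	return READ_ONCE(restart_token) != NULL;	/* unserialised reader */
}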

> >   for_each_engine(engine, i915, id) {
> > - struct intel_context *ce;
> > -
> > - 

Re: [Intel-gfx] [PATCH 10/34] drm/i915: Remove GPU reset dependence on struct_mutex

2019-01-24 Thread Mika Kuoppala
Chris Wilson  writes:

> Now that the submission backends are controlled via their own spinlocks,
> with a wave of a magic wand we can lift the struct_mutex requirement
> around GPU reset. That is, we allow the submission frontend (userspace)
> to keep on submitting while we process the GPU reset as we can suspend
> the backend independently.
>
> The major change is around the backoff/handoff strategy for performing
> the reset. With no mutex deadlock, we no longer have to coordinate with
> any waiter, and just perform the reset immediately.
>
> Signed-off-by: Chris Wilson 
> ---
>  drivers/gpu/drm/i915/i915_debugfs.c   |  38 +-
>  drivers/gpu/drm/i915/i915_drv.h   |   5 -
>  drivers/gpu/drm/i915/i915_gem.c   |  18 +-
>  drivers/gpu/drm/i915/i915_gem_fence_reg.h |   1 -
>  drivers/gpu/drm/i915/i915_gem_gtt.h   |   1 +
>  drivers/gpu/drm/i915/i915_gpu_error.c | 104 +++--
>  drivers/gpu/drm/i915/i915_gpu_error.h |  28 +-
>  drivers/gpu/drm/i915/i915_request.c   |  47 ---
>  drivers/gpu/drm/i915/i915_reset.c | 397 --
>  drivers/gpu/drm/i915/i915_reset.h |   3 +
>  drivers/gpu/drm/i915/intel_engine_cs.c|   6 +-
>  drivers/gpu/drm/i915/intel_guc_submission.c   |   5 +-
>  drivers/gpu/drm/i915/intel_hangcheck.c|  28 +-
>  drivers/gpu/drm/i915/intel_lrc.c  |  92 ++--
>  drivers/gpu/drm/i915/intel_overlay.c  |   2 -
>  drivers/gpu/drm/i915/intel_ringbuffer.c   |  91 ++--
>  drivers/gpu/drm/i915/intel_ringbuffer.h   |  17 +-
>  .../gpu/drm/i915/selftests/intel_hangcheck.c  |  57 +--
>  .../drm/i915/selftests/intel_workarounds.c|   3 -
>  .../gpu/drm/i915/selftests/mock_gem_device.c  |   4 +-
>  20 files changed, 393 insertions(+), 554 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c 
> b/drivers/gpu/drm/i915/i915_debugfs.c
> index 24d6d4ce14ef..3ec369980d40 100644
> --- a/drivers/gpu/drm/i915/i915_debugfs.c
> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> @@ -1284,8 +1284,6 @@ static int i915_hangcheck_info(struct seq_file *m, void 
> *unused)
>   seq_puts(m, "Wedged\n");
>   if (test_bit(I915_RESET_BACKOFF, _priv->gpu_error.flags))
>   seq_puts(m, "Reset in progress: struct_mutex backoff\n");
> > - if (test_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags))
> - seq_puts(m, "Reset in progress: reset handoff to waiter\n");
>   if (waitqueue_active(_priv->gpu_error.wait_queue))
>   seq_puts(m, "Waiter holding struct mutex\n");
>   if (waitqueue_active(_priv->gpu_error.reset_queue))
> @@ -1321,15 +1319,15 @@ static int i915_hangcheck_info(struct seq_file *m, 
> void *unused)
>   struct rb_node *rb;
>  
>   seq_printf(m, "%s:\n", engine->name);
> - seq_printf(m, "\tseqno = %x [current %x, last %x]\n",
> + seq_printf(m, "\tseqno = %x [current %x, last %x], %dms ago\n",
>  engine->hangcheck.seqno, seqno[id],
> -intel_engine_last_submit(engine));
> - seq_printf(m, "\twaiters? %s, fake irq active? %s, stalled? %s, 
> wedged? %s\n",
> +intel_engine_last_submit(engine),
> +jiffies_to_msecs(jiffies -
> + 
> engine->hangcheck.action_timestamp));
> + seq_printf(m, "\twaiters? %s, fake irq active? %s\n",
>  yesno(intel_engine_has_waiter(engine)),
>  yesno(test_bit(engine->id,
> -   
> > &dev_priv->gpu_error.missed_irq_rings)),
> -yesno(engine->hangcheck.stalled),
> -yesno(engine->hangcheck.wedged));
> +   
> > &dev_priv->gpu_error.missed_irq_rings)));
>  
>   spin_lock_irq(>rb_lock);
>   for (rb = rb_first(>waiters); rb; rb = rb_next(rb)) {
> @@ -1343,11 +1341,6 @@ static int i915_hangcheck_info(struct seq_file *m, 
> void *unused)
>   seq_printf(m, "\tACTHD = 0x%08llx [current 0x%08llx]\n",
>  (long long)engine->hangcheck.acthd,
>  (long long)acthd[id]);
> - seq_printf(m, "\taction = %s(%d) %d ms ago\n",
> -hangcheck_action_to_str(engine->hangcheck.action),
> -engine->hangcheck.action,
> -jiffies_to_msecs(jiffies -
> - 
> engine->hangcheck.action_timestamp));

Yeah, it is a timestamp for the sample and most decisions are made on
top of the seqno. The compression is welcome.

>  
>   if (engine->id == RCS) {
>   seq_puts(m, "\tinstdone read =\n");
> @@ -3886,8 +3879,6 @@ static int
>  i915_wedged_set(void *data, u64 val)

*hones his axe*

>  {
>   struct drm_i915_private *i915 = data;
> - struct intel_engine_cs *engine;
> - 

[Intel-gfx] [PATCH 10/34] drm/i915: Remove GPU reset dependence on struct_mutex

2019-01-21 Thread Chris Wilson
Now that the submission backends are controlled via their own spinlocks,
with a wave of a magic wand we can lift the struct_mutex requirement
around GPU reset. That is, we allow the submission frontend (userspace)
to keep on submitting while we process the GPU reset as we can suspend
the backend independently.

The major change is around the backoff/handoff strategy for performing
the reset. With no mutex deadlock, we no longer have to coordinate with
any waiter, and just perform the reset immediately.

Signed-off-by: Chris Wilson 
---
 drivers/gpu/drm/i915/i915_debugfs.c   |  38 +-
 drivers/gpu/drm/i915/i915_drv.h   |   5 -
 drivers/gpu/drm/i915/i915_gem.c   |  18 +-
 drivers/gpu/drm/i915/i915_gem_fence_reg.h |   1 -
 drivers/gpu/drm/i915/i915_gem_gtt.h   |   1 +
 drivers/gpu/drm/i915/i915_gpu_error.c | 104 +++--
 drivers/gpu/drm/i915/i915_gpu_error.h |  28 +-
 drivers/gpu/drm/i915/i915_request.c   |  47 ---
 drivers/gpu/drm/i915/i915_reset.c | 397 --
 drivers/gpu/drm/i915/i915_reset.h |   3 +
 drivers/gpu/drm/i915/intel_engine_cs.c|   6 +-
 drivers/gpu/drm/i915/intel_guc_submission.c   |   5 +-
 drivers/gpu/drm/i915/intel_hangcheck.c|  28 +-
 drivers/gpu/drm/i915/intel_lrc.c  |  92 ++--
 drivers/gpu/drm/i915/intel_overlay.c  |   2 -
 drivers/gpu/drm/i915/intel_ringbuffer.c   |  91 ++--
 drivers/gpu/drm/i915/intel_ringbuffer.h   |  17 +-
 .../gpu/drm/i915/selftests/intel_hangcheck.c  |  57 +--
 .../drm/i915/selftests/intel_workarounds.c|   3 -
 .../gpu/drm/i915/selftests/mock_gem_device.c  |   4 +-
 20 files changed, 393 insertions(+), 554 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c 
b/drivers/gpu/drm/i915/i915_debugfs.c
index 24d6d4ce14ef..3ec369980d40 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1284,8 +1284,6 @@ static int i915_hangcheck_info(struct seq_file *m, void 
*unused)
seq_puts(m, "Wedged\n");
if (test_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags))
seq_puts(m, "Reset in progress: struct_mutex backoff\n");
-   if (test_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags))
-   seq_puts(m, "Reset in progress: reset handoff to waiter\n");
if (waitqueue_active(&dev_priv->gpu_error.wait_queue))
seq_puts(m, "Waiter holding struct mutex\n");
if (waitqueue_active(&dev_priv->gpu_error.reset_queue))
@@ -1321,15 +1319,15 @@ static int i915_hangcheck_info(struct seq_file *m, void 
*unused)
struct rb_node *rb;
 
seq_printf(m, "%s:\n", engine->name);
-   seq_printf(m, "\tseqno = %x [current %x, last %x]\n",
+   seq_printf(m, "\tseqno = %x [current %x, last %x], %dms ago\n",
   engine->hangcheck.seqno, seqno[id],
-  intel_engine_last_submit(engine));
-   seq_printf(m, "\twaiters? %s, fake irq active? %s, stalled? %s, 
wedged? %s\n",
+  intel_engine_last_submit(engine),
+  jiffies_to_msecs(jiffies -
+   
engine->hangcheck.action_timestamp));
+   seq_printf(m, "\twaiters? %s, fake irq active? %s\n",
   yesno(intel_engine_has_waiter(engine)),
   yesno(test_bit(engine->id,
- 
&dev_priv->gpu_error.missed_irq_rings)),
-  yesno(engine->hangcheck.stalled),
-  yesno(engine->hangcheck.wedged));
+ 
&dev_priv->gpu_error.missed_irq_rings)));
 
spin_lock_irq(&b->rb_lock);
for (rb = rb_first(&b->waiters); rb; rb = rb_next(rb)) {
@@ -1343,11 +1341,6 @@ static int i915_hangcheck_info(struct seq_file *m, void 
*unused)
seq_printf(m, "\tACTHD = 0x%08llx [current 0x%08llx]\n",
   (long long)engine->hangcheck.acthd,
   (long long)acthd[id]);
-   seq_printf(m, "\taction = %s(%d) %d ms ago\n",
-  hangcheck_action_to_str(engine->hangcheck.action),
-  engine->hangcheck.action,
-  jiffies_to_msecs(jiffies -
-   
engine->hangcheck.action_timestamp));
 
if (engine->id == RCS) {
seq_puts(m, "\tinstdone read =\n");
@@ -3886,8 +3879,6 @@ static int
 i915_wedged_set(void *data, u64 val)
 {
struct drm_i915_private *i915 = data;
-   struct intel_engine_cs *engine;
-   unsigned int tmp;
 
/*
 * There is no safeguard against this debugfs entry colliding
@@ -3900,18 +3891,8 @@ i915_wedged_set(void *data, u64 val)
if (i915_reset_backoff(&i915->gpu_error))
return -EAGAIN;
 
-