Re: [Intel-gfx] [PATCH 10/34] drm/i915: Remove GPU reset dependence on struct_mutex
Quoting Chris Wilson (2019-01-24 12:50:30)
> Quoting Mika Kuoppala (2019-01-24 12:06:01)
> > Chris Wilson writes:
> > > -	/*
> > > -	 * We want to perform per-engine reset from atomic context (e.g.
> > > -	 * softirq), which imposes the constraint that we cannot sleep.
> > > -	 * However, experience suggests that spending a bit of time waiting
> > > -	 * for a reset helps in various cases, so for a full-device reset
> > > -	 * we apply the opposite rule and wait if we want to. As we should
> > > -	 * always follow up a failed per-engine reset with a full device reset,
> > > -	 * being a little faster, stricter and more error prone for the
> > > -	 * atomic case seems an acceptable compromise.
> > > -	 *
> > > -	 * Unfortunately this leads to a bimodal routine, when the goal was
> > > -	 * to have a single reset function that worked for resetting any
> > > -	 * number of engines simultaneously.
> > > -	 */
> > > -	might_sleep_if(engine_mask == ALL_ENGINES);
> >
> > Oh here it is. I was after this on atomic resets.
>
> I was saying it didn't make sense to lift the restriction until we
> relied upon it. Just because the code became safe doesn't mean it was
> then part of the API :)

Hmm, set-wedged is meant to be magical more or less now. That gives more
weight to the argument of making intel_gpu_reset() magical and removing
this comment earlier.
-Chris
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [PATCH 10/34] drm/i915: Remove GPU reset dependence on struct_mutex
Quoting Chris Wilson (2019-01-24 12:50:30)
> Quoting Mika Kuoppala (2019-01-24 12:06:01)
> > Why did you remove the comment on needing an empty request?
>
> I don't believe it's true any more, and it was detracting from the
> emphasis on the idea of restoring a context.

Thinking about it, it was probably fixed by the work for reset handling
prior to the gen6/gen7 ppgtt updates.
-Chris
Re: [Intel-gfx] [PATCH 10/34] drm/i915: Remove GPU reset dependence on struct_mutex
Quoting Mika Kuoppala (2019-01-24 12:06:01)
> Chris Wilson writes:
> > +	mutex_lock(&i915->drm.struct_mutex);
> >  	i915_gem_contexts_lost(i915);
> >  	mutex_unlock(&i915->drm.struct_mutex);
> >  }
> > @@ -4534,6 +4528,8 @@ int i915_gem_suspend(struct drm_i915_private *i915)
> >  	wakeref = intel_runtime_pm_get(i915);
> >  	intel_suspend_gt_powersave(i915);
> >
> > +	flush_workqueue(i915->wq);
>
> I don't know what is happening here. Why
> don't we need the i915_gem_drain_workqueue in here?

It's just a poke at the queue before we end up doing the same work
ourselves.

> >  	mutex_lock(&i915->drm.struct_mutex);
> >
> >  	/*
> > @@ -4563,11 +4559,9 @@ int i915_gem_suspend(struct drm_i915_private *i915)
> >  	i915_retire_requests(i915); /* ensure we flush after wedging */
> >
> >  	mutex_unlock(&i915->drm.struct_mutex);
> > +	i915_reset_flush(i915);
> >
> > -	intel_uc_suspend(i915);
> > -
> > -	cancel_delayed_work_sync(&i915->gpu_error.hangcheck_work);
> > -	cancel_delayed_work_sync(&i915->gt.retire_work);
> > +	drain_delayed_work(&i915->gt.retire_work);
>
> Hangcheck is inside reset flush, but why the change
> for retire?

So we didn't leave retire_work in an ill-defined state. I was probably
thinking consistency over cleverness. It's also highly probable that I
stuck something else in there at one point.

> > -	ecode = i915_error_generate_code(dev_priv, error, &engine_id);
> > +	for (i = 0; i < ARRAY_SIZE(error->engine); i++)
> > +		if (!error->engine[i].context.pid)
> > +			engines &= ~BIT(i);
>
> No more grouping for driver internal hangs...?

There's no pid, right, so the message at the moment is garbled. The
ecode is just noise: you don't use it for analysis, you don't use it
for error state matching (or at least you shouldn't, as it has no
meaning wrt the error state).
> >  	len = scnprintf(error->error_msg, sizeof(error->error_msg),
> > -			"GPU HANG: ecode %d:%d:0x%08x",
> > -			INTEL_GEN(dev_priv), engine_id, ecode);
> > -
> > -	if (engine_id != -1 && error->engine[engine_id].context.pid)
> > +			"GPU HANG: ecode %d:%lx:0x%08x",
> > +			INTEL_GEN(error->i915), engines,
> > +			i915_error_generate_code(error, engines));
> > +	if (engines) {
> > +		/* Just show the first executing process, more is confusing */
> > +		i = ffs(engines);
>
> Then why not just make the ecode accept a single engine and move it here?

I don't think the ecode should just accept a single engine. I don't
think the ecode should exist at all, but that's another matter.

> > -	/*
> > -	 * We want to perform per-engine reset from atomic context (e.g.
> > -	 * softirq), which imposes the constraint that we cannot sleep.
> > -	 * However, experience suggests that spending a bit of time waiting
> > -	 * for a reset helps in various cases, so for a full-device reset
> > -	 * we apply the opposite rule and wait if we want to. As we should
> > -	 * always follow up a failed per-engine reset with a full device reset,
> > -	 * being a little faster, stricter and more error prone for the
> > -	 * atomic case seems an acceptable compromise.
> > -	 *
> > -	 * Unfortunately this leads to a bimodal routine, when the goal was
> > -	 * to have a single reset function that worked for resetting any
> > -	 * number of engines simultaneously.
> > -	 */
> > -	might_sleep_if(engine_mask == ALL_ENGINES);
>
> Oh here it is. I was after this on atomic resets.

I was saying it didn't make sense to lift the restriction until we
relied upon it.
Just because the code became safe doesn't mean it was then part of the
API :)

> > +static void restart_work(struct work_struct *work)
> >  {
> > +	struct i915_gpu_restart *arg = container_of(work, typeof(*arg), work);
> > +	struct drm_i915_private *i915 = arg->i915;
> >  	struct intel_engine_cs *engine;
> >  	enum intel_engine_id id;
> > +	intel_wakeref_t wakeref;
> >
> > -	lockdep_assert_held(&i915->drm.struct_mutex);
> > +	wakeref = intel_runtime_pm_get(i915);
> > +	mutex_lock(&i915->drm.struct_mutex);
> >
> > -	i915_retire_requests(i915);
>
> Can't do this anymore, yes. What will be the effect
> of delaying this and the other explicit retirements?
> Are we more prone to starvation?

No. We risk gen3 spontaneously dying, as we have no idea why it needs a
request in the pipeline soon after a reset, so no idea if a potential
delay will kill it.

> > +	smp_store_mb(i915->gpu_error.restart, NULL);
>
> Checkpatch might want a comment for the mb.

Checkpatch is silly about this one. It's precisely because the other
side is unserialised that it is smp_store_mb. But it's just a glorified
WRITE_ONCE, not that important.

> >  	for_each_engine(engine, i915, id) {
> > -		struct intel_context *ce;
> > -
Re: [Intel-gfx] [PATCH 10/34] drm/i915: Remove GPU reset dependence on struct_mutex
Chris Wilson writes:

> Now that the submission backends are controlled via their own spinlocks,
> with a wave of a magic wand we can lift the struct_mutex requirement
> around GPU reset. That is, we allow the submission frontend (userspace)
> to keep on submitting while we process the GPU reset, as we can suspend
> the backend independently.
>
> The major change is around the backoff/handoff strategy for performing
> the reset. With no mutex deadlock, we no longer have to coordinate with
> any waiter, and just perform the reset immediately.
>
> Signed-off-by: Chris Wilson
> ---
>  drivers/gpu/drm/i915/i915_debugfs.c           |  38 +-
>  drivers/gpu/drm/i915/i915_drv.h               |   5 -
>  drivers/gpu/drm/i915/i915_gem.c               |  18 +-
>  drivers/gpu/drm/i915/i915_gem_fence_reg.h     |   1 -
>  drivers/gpu/drm/i915/i915_gem_gtt.h           |   1 +
>  drivers/gpu/drm/i915/i915_gpu_error.c         | 104 +++--
>  drivers/gpu/drm/i915/i915_gpu_error.h         |  28 +-
>  drivers/gpu/drm/i915/i915_request.c           |  47 ---
>  drivers/gpu/drm/i915/i915_reset.c             | 397 --
>  drivers/gpu/drm/i915/i915_reset.h             |   3 +
>  drivers/gpu/drm/i915/intel_engine_cs.c        |   6 +-
>  drivers/gpu/drm/i915/intel_guc_submission.c   |   5 +-
>  drivers/gpu/drm/i915/intel_hangcheck.c        |  28 +-
>  drivers/gpu/drm/i915/intel_lrc.c              |  92 ++--
>  drivers/gpu/drm/i915/intel_overlay.c          |   2 -
>  drivers/gpu/drm/i915/intel_ringbuffer.c       |  91 ++--
>  drivers/gpu/drm/i915/intel_ringbuffer.h       |  17 +-
>  .../gpu/drm/i915/selftests/intel_hangcheck.c  |  57 +--
>  .../drm/i915/selftests/intel_workarounds.c    |   3 -
>  .../gpu/drm/i915/selftests/mock_gem_device.c  |   4 +-
>  20 files changed, 393 insertions(+), 554 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> index 24d6d4ce14ef..3ec369980d40 100644
> --- a/drivers/gpu/drm/i915/i915_debugfs.c
> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> @@ -1284,8 +1284,6 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
>  		seq_puts(m, "Wedged\n");
>  	if (test_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags))
>  		seq_puts(m, "Reset in progress: struct_mutex backoff\n");
> -	if (test_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags))
> -		seq_puts(m, "Reset in progress: reset handoff to waiter\n");
>  	if (waitqueue_active(&dev_priv->gpu_error.wait_queue))
>  		seq_puts(m, "Waiter holding struct mutex\n");
>  	if (waitqueue_active(&dev_priv->gpu_error.reset_queue))
> @@ -1321,15 +1319,15 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
>  		struct rb_node *rb;
>
>  		seq_printf(m, "%s:\n", engine->name);
> -		seq_printf(m, "\tseqno = %x [current %x, last %x]\n",
> +		seq_printf(m, "\tseqno = %x [current %x, last %x], %dms ago\n",
>  			   engine->hangcheck.seqno, seqno[id],
> -			   intel_engine_last_submit(engine));
> -		seq_printf(m, "\twaiters? %s, fake irq active? %s, stalled? %s, wedged? %s\n",
> +			   intel_engine_last_submit(engine),
> +			   jiffies_to_msecs(jiffies -
> +					    engine->hangcheck.action_timestamp));
> +		seq_printf(m, "\twaiters? %s, fake irq active? %s\n",
>  			   yesno(intel_engine_has_waiter(engine)),
>  			   yesno(test_bit(engine->id,
> -					  &dev_priv->gpu_error.missed_irq_rings)),
> -			   yesno(engine->hangcheck.stalled),
> -			   yesno(engine->hangcheck.wedged));
> +					  &dev_priv->gpu_error.missed_irq_rings)));
>
>  		spin_lock_irq(&b->rb_lock);
>  		for (rb = rb_first(&b->waiters); rb; rb = rb_next(rb)) {
> @@ -1343,11 +1341,6 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
>  		seq_printf(m, "\tACTHD = 0x%08llx [current 0x%08llx]\n",
>  			   (long long)engine->hangcheck.acthd,
>  			   (long long)acthd[id]);
> -		seq_printf(m, "\taction = %s(%d) %d ms ago\n",
> -			   hangcheck_action_to_str(engine->hangcheck.action),
> -			   engine->hangcheck.action,
> -			   jiffies_to_msecs(jiffies -
> -					    engine->hangcheck.action_timestamp));

Yeah, it is a timestamp for the sample, and most decisions are made on
top of seqno. Welcome compression.
>  		if (engine->id == RCS) {
>  			seq_puts(m, "\tinstdone read =\n");
> @@ -3886,8 +3879,6 @@ static int
>  i915_wedged_set(void *data, u64 val)

*hones his axe*

>  {
>  	struct drm_i915_private *i915 = data;
> -	struct intel_engine_cs *engine;
> -
[Intel-gfx] [PATCH 10/34] drm/i915: Remove GPU reset dependence on struct_mutex
Now that the submission backends are controlled via their own spinlocks,
with a wave of a magic wand we can lift the struct_mutex requirement
around GPU reset. That is, we allow the submission frontend (userspace)
to keep on submitting while we process the GPU reset, as we can suspend
the backend independently.

The major change is around the backoff/handoff strategy for performing
the reset. With no mutex deadlock, we no longer have to coordinate with
any waiter, and just perform the reset immediately.

Signed-off-by: Chris Wilson
---
 drivers/gpu/drm/i915/i915_debugfs.c           |  38 +-
 drivers/gpu/drm/i915/i915_drv.h               |   5 -
 drivers/gpu/drm/i915/i915_gem.c               |  18 +-
 drivers/gpu/drm/i915/i915_gem_fence_reg.h     |   1 -
 drivers/gpu/drm/i915/i915_gem_gtt.h           |   1 +
 drivers/gpu/drm/i915/i915_gpu_error.c         | 104 +++--
 drivers/gpu/drm/i915/i915_gpu_error.h         |  28 +-
 drivers/gpu/drm/i915/i915_request.c           |  47 ---
 drivers/gpu/drm/i915/i915_reset.c             | 397 --
 drivers/gpu/drm/i915/i915_reset.h             |   3 +
 drivers/gpu/drm/i915/intel_engine_cs.c        |   6 +-
 drivers/gpu/drm/i915/intel_guc_submission.c   |   5 +-
 drivers/gpu/drm/i915/intel_hangcheck.c        |  28 +-
 drivers/gpu/drm/i915/intel_lrc.c              |  92 ++--
 drivers/gpu/drm/i915/intel_overlay.c          |   2 -
 drivers/gpu/drm/i915/intel_ringbuffer.c       |  91 ++--
 drivers/gpu/drm/i915/intel_ringbuffer.h       |  17 +-
 .../gpu/drm/i915/selftests/intel_hangcheck.c  |  57 +--
 .../drm/i915/selftests/intel_workarounds.c    |   3 -
 .../gpu/drm/i915/selftests/mock_gem_device.c  |   4 +-
 20 files changed, 393 insertions(+), 554 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 24d6d4ce14ef..3ec369980d40 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1284,8 +1284,6 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
 		seq_puts(m, "Wedged\n");
 	if (test_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags))
 		seq_puts(m, "Reset in progress: struct_mutex backoff\n");
-	if (test_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags))
-		seq_puts(m, "Reset in progress: reset handoff to waiter\n");
 	if (waitqueue_active(&dev_priv->gpu_error.wait_queue))
 		seq_puts(m, "Waiter holding struct mutex\n");
 	if (waitqueue_active(&dev_priv->gpu_error.reset_queue))
@@ -1321,15 +1319,15 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
 		struct rb_node *rb;

 		seq_printf(m, "%s:\n", engine->name);
-		seq_printf(m, "\tseqno = %x [current %x, last %x]\n",
+		seq_printf(m, "\tseqno = %x [current %x, last %x], %dms ago\n",
 			   engine->hangcheck.seqno, seqno[id],
-			   intel_engine_last_submit(engine));
-		seq_printf(m, "\twaiters? %s, fake irq active? %s, stalled? %s, wedged? %s\n",
+			   intel_engine_last_submit(engine),
+			   jiffies_to_msecs(jiffies -
+					    engine->hangcheck.action_timestamp));
+		seq_printf(m, "\twaiters? %s, fake irq active? %s\n",
 			   yesno(intel_engine_has_waiter(engine)),
 			   yesno(test_bit(engine->id,
-					  &dev_priv->gpu_error.missed_irq_rings)),
-			   yesno(engine->hangcheck.stalled),
-			   yesno(engine->hangcheck.wedged));
+					  &dev_priv->gpu_error.missed_irq_rings)));

 		spin_lock_irq(&b->rb_lock);
 		for (rb = rb_first(&b->waiters); rb; rb = rb_next(rb)) {
@@ -1343,11 +1341,6 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
 		seq_printf(m, "\tACTHD = 0x%08llx [current 0x%08llx]\n",
 			   (long long)engine->hangcheck.acthd,
 			   (long long)acthd[id]);
-		seq_printf(m, "\taction = %s(%d) %d ms ago\n",
-			   hangcheck_action_to_str(engine->hangcheck.action),
-			   engine->hangcheck.action,
-			   jiffies_to_msecs(jiffies -
-					    engine->hangcheck.action_timestamp));

 		if (engine->id == RCS) {
 			seq_puts(m, "\tinstdone read =\n");
@@ -3886,8 +3879,6 @@ static int
 i915_wedged_set(void *data, u64 val)
 {
 	struct drm_i915_private *i915 = data;
-	struct intel_engine_cs *engine;
-	unsigned int tmp;

 	/*
 	 * There is no safeguard against this debugfs entry colliding
@@ -3900,18 +3891,8 @@ i915_wedged_set(void *data, u64 val)
 	if (i915_reset_backoff(&i915->gpu_error))
 		return -EAGAIN;

-