Quoting Chris Wilson (2019-07-17 14:40:26)
> Quoting Tvrtko Ursulin (2019-07-17 14:31:00)
> >
> > On 16/07/2019 13:49, Chris Wilson wrote:
> > > By stopping the rings, we may trigger an arbitration point resulting in
> > > a premature context-switch (i.e. a completion event before the request
> > > is actually complete). This clears the active context before the reset,
> > > but we must remember to rewind the incomplete context for replay upon
> > > resume.
> > >
> > > Signed-off-by: Chris Wilson <ch...@chris-wilson.co.uk>
> > > ---
> > >  drivers/gpu/drm/i915/gt/intel_lrc.c | 6 ++++--
> > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > index 9b87a2fc186c..7570a9256001 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > @@ -1419,7 +1419,8 @@ static void process_csb(struct intel_engine_cs *engine)
> > >                        * coherent (visible from the CPU) before the
> > >                        * user interrupt and CSB is processed.
> > >                        */
> > > -                     GEM_BUG_ON(!i915_request_completed(*execlists->active));
> > > +                     GEM_BUG_ON(!i915_request_completed(*execlists->active) &&
> > > +                                !reset_in_progress(execlists));
> > >                       execlists_schedule_out(*execlists->active++);
> > >
> > >                       GEM_BUG_ON(execlists->active - execlists->inflight >
> > > @@ -2254,7 +2255,7 @@ static void __execlists_reset(struct intel_engine_cs *engine, bool stalled)
> > >        */
> > >       rq = execlists_active(execlists);
> > >       if (!rq)
> > > -             return;
> > > +             goto unwind;
> > >
> > >       ce = rq->hw_context;
> > >       GEM_BUG_ON(i915_active_is_idle(&ce->active));
> > > @@ -2331,6 +2332,7 @@ static void __execlists_reset(struct intel_engine_cs *engine, bool stalled)
> > >       intel_ring_update_space(ce->ring);
> > >       __execlists_update_reg_state(ce, engine);
> > >
> > > +unwind:
> > >       /* Push back any incomplete requests for replay after the reset. */
> > >       __unwind_incomplete_requests(engine);
> > >  }
> > >
> >
> > Sounds plausible.
> >
> > Reviewed-by: Tvrtko Ursulin <tvrtko.ursu...@intel.com>
> >
> > Shouldn't there be a Fixes: tag to go with it?
>
> Yeah, it's rare even by our standards, I think there's a live_hangcheck
> failure about once a month that could be the result of this. However,
> the result would be an unrecoverable GPU hang as each attempt at
> resetting would not see the missing request and so it would remain
> perpetually in the engine->active.list until a set-wedged (i.e. suspend
> in the user case).
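
For anyone following along, the failure mode above can be sketched outside the driver. This is a toy model only, not i915 code: every toy_* name and both structs are made up for illustration, and the context-image fix-up is reduced to a comment. It contrasts the old early return with the "goto unwind" path when a premature context switch has already cleared the active slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model, NOT the i915 code: all names below are hypothetical. */
enum { TOY_MAX_REQUESTS = 4 };

struct toy_request {
	bool completed;	/* did the GPU actually finish this request? */
	bool queued;	/* pushed back onto the queue for replay? */
};

struct toy_engine {
	struct toy_request *active;	/* what the reset sees as active, may be NULL */
	struct toy_request *submitted[TOY_MAX_REQUESTS];
	size_t nr_submitted;
};

/* Push every incomplete submitted request back for replay. */
static void toy_unwind_incomplete(struct toy_engine *e)
{
	assert(e->nr_submitted <= TOY_MAX_REQUESTS);
	for (size_t i = 0; i < e->nr_submitted; i++)
		if (!e->submitted[i]->completed)
			e->submitted[i]->queued = true;
}

/* Old shape: no active request means we bail before unwinding. */
static void toy_reset_early_return(struct toy_engine *e)
{
	if (!e->active)
		return;
	/* (context image fix-up would happen here) */
	toy_unwind_incomplete(e);
}

/* Patched shape: skip the fix-up but always unwind. */
static void toy_reset_with_unwind(struct toy_engine *e)
{
	if (!e->active)
		goto unwind;
	/* (context image fix-up would happen here) */
unwind:
	toy_unwind_incomplete(e);
}

/*
 * Scenario from the thread: the active slot was cleared by a premature
 * context switch although the request never completed. Returns true if
 * the chosen reset variant pushed the request back for replay.
 */
static bool toy_replayed_after_reset(bool apply_fix)
{
	struct toy_request rq = { .completed = false, .queued = false };
	struct toy_engine e = {
		.active = NULL,
		.submitted = { &rq },
		.nr_submitted = 1,
	};

	if (apply_fix)
		toy_reset_with_unwind(&e);
	else
		toy_reset_early_return(&e);
	return rq.queued;
}
```

In this model toy_replayed_after_reset(false) leaves the request stranded on every reset attempt, while toy_replayed_after_reset(true) queues it for replay.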
Heh, the commit responsible was one that was itself trying to work around
the effect of stop_engines() setting RING_HEAD=0 :)

commit 1863e3020ab50bd5f68d85719ba26356cc282643
Author: Chris Wilson <ch...@chris-wilson.co.uk>
Date:   Thu Apr 11 14:05:15 2019 +0100

    drm/i915/execlists: Always reset the context's RING registers

    During reset, we try and stop the active ring. This has the consequence
    that we often clobber the RING registers within the context image. When
    we find an active request, we update the context image to rerun that
    request (if it was guilty, we replace the hanging user payload with
    NOPs). However, we were ignoring an active context if the request had
    completed, with the consequence that the next submission on that request
    would start with RING_HEAD==0 and not the tail of the previous request,
    causing all requests still in the ring to be rerun. Rare, but
    occasionally seen within CI where we would spot that the context seqno
    would reverse and complain that we were retiring an incomplete request.

-Chris
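
To put a number on the RING_HEAD==0 effect described in that commit message, here is a rough sketch under simplifying assumptions: a hypothetical 4096-byte ring in which every request occupies exactly 64 bytes. Neither constant nor the helper name comes from the driver; they only illustrate how many requests lie between head and tail.

```c
#include <assert.h>

/* Hypothetical toy ring, not driver values: fixed-size requests. */
enum { TOY_RING_SIZE = 4096, TOY_REQ_SIZE = 64 };

/*
 * Number of requests the hardware would execute when it resumes at
 * 'head' and runs up to 'tail' (both byte offsets into the ring).
 * The modulo handles the case where tail has wrapped past head.
 */
static unsigned int toy_requests_to_run(unsigned int head, unsigned int tail)
{
	assert(head < TOY_RING_SIZE && tail < TOY_RING_SIZE);
	return ((tail - head) % TOY_RING_SIZE) / TOY_REQ_SIZE;
}
```

With three completed requests ending at offset 192 and one new request ending at 256, toy_requests_to_run(192, 256) covers only the new request, whereas toy_requests_to_run(0, 256) covers all four, i.e. the three already-completed requests are rerun.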