Quoting Tvrtko Ursulin (2018-05-25 09:36:18)
>
> On 24/05/2018 14:40, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-05-24 14:34:41)
> >>
> >> On 19/05/2018 10:04, Chris Wilson wrote:
> >>> Quoting Tvrtko Ursulin (2018-05-18 15:42:00)
> >>>>
> >>>> On 18/05/2018 15:13, Chris Wilson wrote:
> >>>>> Quoting Tvrtko Ursulin (2018-05-18 13:36:52)
> >>>>>>
> >>>>>> On 18/05/2018 13:28, Chris Wilson wrote:
> >>>>>>> Quoting Tvrtko Ursulin (2018-05-18 12:50:41)
> >>>>>>>>
> >>>>>>>> On 18/05/2018 12:10, Chris Wilson wrote:
> >>>>>>>>> Quoting Tvrtko Ursulin (2018-05-18 12:05:17)
> >>>>>>>>>>
> >>>>>>>>>> On 18/05/2018 11:09, Chris Wilson wrote:
> >>>>>>>>>>> Inside the live_hangcheck (reset) selftests, we occasionally see
> >>>>>>>>>>> failures like
> >>>>>>>>>>>
> >>>>>>>>>>> <7>[ 239.094840] i915_gem_set_wedged rcs0
> >>>>>>>>>>> <7>[ 239.094843] i915_gem_set_wedged current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
> >>>>>>>>>>> <7>[ 239.094846] i915_gem_set_wedged Reset count: 6239 (global 1)
> >>>>>>>>>>> <7>[ 239.094848] i915_gem_set_wedged Requests:
> >>>>>>>>>>> <7>[ 239.095052] i915_gem_set_wedged first 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> >>>>>>>>>>> <7>[ 239.095056] i915_gem_set_wedged last 19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
> >>>>>>>>>>> <7>[ 239.095059] i915_gem_set_wedged active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> >>>>>>>>>>> <7>[ 239.095062] i915_gem_set_wedged [head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
> >>>>>>>>>>> <7>[ 239.100050] i915_gem_set_wedged ring->start: 0x00283000
> >>>>>>>>>>> <7>[ 239.100053] i915_gem_set_wedged ring->head: 0x000001f8
> >>>>>>>>>>> <7>[ 239.100055] i915_gem_set_wedged ring->tail: 0x000002a8
> >>>>>>>>>>> <7>[ 239.100057] i915_gem_set_wedged ring->emit: 0x000002a8
> >>>>>>>>>>> <7>[ 239.100059] i915_gem_set_wedged ring->space: 0x00000f10
> >>>>>>>>>>> <7>[ 239.100085] i915_gem_set_wedged RING_START: 0x00283000
> >>>>>>>>>>> <7>[ 239.100088] i915_gem_set_wedged RING_HEAD: 0x00000260
> >>>>>>>>>>> <7>[ 239.100091] i915_gem_set_wedged RING_TAIL: 0x000002a8
> >>>>>>>>>>> <7>[ 239.100094] i915_gem_set_wedged RING_CTL: 0x00000001
> >>>>>>>>>>> <7>[ 239.100097] i915_gem_set_wedged RING_MODE: 0x00000300 [idle]
> >>>>>>>>>>> <7>[ 239.100100] i915_gem_set_wedged RING_IMR: fffffefe
> >>>>>>>>>>> <7>[ 239.100104] i915_gem_set_wedged ACTHD: 0x00000000_0000609c
> >>>>>>>>>>> <7>[ 239.100108] i915_gem_set_wedged BBADDR: 0x00000000_0000609d
> >>>>>>>>>>> <7>[ 239.100111] i915_gem_set_wedged DMA_FADDR: 0x00000000_00283260
> >>>>>>>>>>> <7>[ 239.100114] i915_gem_set_wedged IPEIR: 0x00000000
> >>>>>>>>>>> <7>[ 239.100117] i915_gem_set_wedged IPEHR: 0x02800000
> >>>>>>>>>>> <7>[ 239.100120] i915_gem_set_wedged Execlist status: 0x00044052 00000002
> >>>>>>>>>>> <7>[ 239.100124] i915_gem_set_wedged Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
> >>>>>>>>>>> <7>[ 239.100128] i915_gem_set_wedged ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> >>>>>>>>>>> <7>[ 239.100132] i915_gem_set_wedged ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> >>>>>>>>>>> <7>[ 239.100135] i915_gem_set_wedged HW active? 0x5
> >>>>>>>>>>> <7>[ 239.100250] i915_gem_set_wedged E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> >>>>>>>>>>> <7>[ 239.100338] i915_gem_set_wedged E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> >>>>>>>>>>> <7>[ 239.100340] i915_gem_set_wedged Queue priority: 139
> >>>>>>>>>>> <7>[ 239.100343] i915_gem_set_wedged Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
> >>>>>>>>>>> <7>[ 239.100346] i915_gem_set_wedged Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
> >>>>>>>>>>> <7>[ 239.100349] i915_gem_set_wedged Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
> >>>>>>>>>>> <7>[ 239.100352] i915_gem_set_wedged Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
> >>>>>>>>>>> <7>[ 239.100356] i915_gem_set_wedged Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
> >>>>>>>>>>> <7>[ 239.100362] i915_gem_set_wedged drv_selftest [5894] waiting for 19a99
> >>>>>>>>>>>
> >>>>>>>>>>> where the GPU saw an arbitration point and idles; AND HAS NOT BEEN RESET!
> >>>>>>>>>>> The RING_MODE indicates that it is idle and has the STOP_RING bit set, so
> >>>>>>>>>>> try clearing it.
> >>>>>>>>>>>
> >>>>>>>>>>> v2: Only clear the bit on restarting the ring, as we want to be sure the
> >>>>>>>>>>> STOP_RING bit is kept if reset fails on wedging.
> >>>>>>>>>>> v3: Spot when the ring state doesn't make sense when re-initialising the
> >>>>>>>>>>> engine and dump it to the logs so that we don't have to wait for an
> >>>>>>>>>>> error later and try to guess what happened earlier.
> >>>>>>>>>>> v4: Prepare to print all the unexpected state, not just the first.
> >>>>>>>>>>>
> >>>>>>>>>>> Signed-off-by: Chris Wilson <ch...@chris-wilson.co.uk>
> >>>>>>>>>>> Cc: Tvrtko Ursulin <tvrtko.ursu...@linux.intel.com>
> >>>>>>>>>>> ---
> >>>>>>>>>>>  drivers/gpu/drm/i915/intel_lrc.c | 22 ++++++++++++++++++++++
> >>>>>>>>>>>  1 file changed, 22 insertions(+)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> >>>>>>>>>>> index 3744f5750624..ba8411ba4abf 100644
> >>>>>>>>>>> --- a/drivers/gpu/drm/i915/intel_lrc.c
> >>>>>>>>>>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> >>>>>>>>>>> @@ -1781,6 +1781,9 @@ static void enable_execlists(struct intel_engine_cs *engine)
> >>>>>>>>>>>  		I915_WRITE(RING_MODE_GEN7(engine),
> >>>>>>>>>>>  			   _MASKED_BIT_ENABLE(GFX_RUN_LIST_ENABLE));
> >>>>>>>>>>>
> >>>>>>>>>>> +	I915_WRITE(RING_MI_MODE(engine->mmio_base),
> >>>>>>>>>>> +		   _MASKED_BIT_DISABLE(STOP_RING));
> >>>>>>>>>>
> >>>>>>>>>> Worries me a bit to clear it unconditionally since documentation says
> >>>>>>>>>> nothing (that I can find) about this scenario.
> >>>>>>>>>>
> >>>>>>>>>>> +
> >>>>>>>>>>>  	I915_WRITE(RING_HWS_PGA(engine->mmio_base),
> >>>>>>>>>>>  		   engine->status_page.ggtt_offset);
> >>>>>>>>>>>  	POSTING_READ(RING_HWS_PGA(engine->mmio_base));
> >>>>>>>>>>> @@ -1789,6 +1792,19 @@ static void enable_execlists(struct intel_engine_cs *engine)
> >>>>>>>>>>>  	engine->execlists.csb_head = -1;
> >>>>>>>>>>>  }
> >>>>>>>>>>>
> >>>>>>>>>>> +static bool unexpected_starting_state(struct intel_engine_cs *engine)
> >>>>>>>>>>> +{
> >>>>>>>>>>> +	struct drm_i915_private *dev_priv = engine->i915;
> >>>>>>>>>>> +	bool unexpected = false;
> >>>>>>>>>>> +
> >>>>>>>>>>> +	if (I915_READ(RING_MI_MODE(engine->mmio_base)) & STOP_RING) {
> >>>>>>>>>>> +		DRM_DEBUG_DRIVER("STOP_RING still set in RING_MI_MODE\n");
> >>>>>>>>>>
> >>>>>>>>>> Also worries me to bury this as a debug message unless we can find in
> >>>>>>>>>> documentation that this is known to happen occasionally.
> >>>>>>>>>>
> >>>>>>>>>> If we had the check and logging immediately after reset, with a more
> >>>>>>>>>> visible log, we could build a larger data set of reported failures and
> >>>>>>>>>> then go back to hw people and ask them about it.
> >>>>>>>>>
> >>>>>>>>> The problem with the reset side is trying to avoid mixing in the case
> >>>>>>>>> where we are faking it and expect it to still be set. On the ringbuffer
> >>>>>>>>> path, we handle STOP_RING as being part of the init sequence. The log
> >>>>>>>>> already says we did a reset and STOP_RING is set (since we have tracing
> >>>>>>>>> through the reset); I just need to find the right spot to handle it.
> >>>>>>>>
> >>>>>>>> The "distance" in code paths between reset and init_hw is probably the
> >>>>>>>> smallest of my concerns. If we could make this message not debug only,
> >>>>>>>> and clear STOP_RING only conditionally, I think I would be happy with
> >>>>>>>> giving this a go to see if it collects any data.
> >>>>>>>
> >>>>>>> Why? Who else but us is going to read it, and why bother if it is not
> >>>>>>> actionable? It is logged (was logged already, now highlighted) and for
> >>>>>>> the user none of it is relevant; the only thing that matters to them is
> >>>>>>> that they don't even notice the system suffered a GPU hang.
> >>>>>>
> >>>>>> My reasoning is this - something has happened which is completely
> >>>>>> unexpected and may have other negative effects which are currently
> >>>>>> unknown to us. By logging it at a higher level we collect this now very
> >>>>>> specific signature on a wider collection of platforms and can then act
> >>>>>> on the collected data set. Also, a very specific log message makes bug
> >>>>>> triage easier, with a lower chance it gets bunched up with other hang
> >>>>>> reports.
> >>>>>
> >>>>> You do have a very specific debug message, but having a flip-flopping
> >>>>> test that has no impact on system stability (as evidenced by the lack of
> >>>>> errors; if there was a subsequent error it is imperative that the post
> >>>>> mortem capture does its best, everything else is lost) does not help us.
> >>>>
> >>>> What if we deploy this as DRM_NOTICE/INFO, get reports from the field
> >>>> and not just the selftest, and then realize the issue is actually
> >>>> interesting and could be fed back to HW design?
> >>>
> >>> But we already have enough to point out that the HW is not behaving as
> >>> we^W I would expect.
> >>>
> >>> For field analysis, I strongly believe that unless you can capture
> >>> everything you need without user intervention, you will never get the
> >>> debug information required. The instructions we have are "oops,
> >>> something went wrong, please file a new bug report here and attach this
> >>> file". That's already too complicated and wants to be automated, e.g. by
> >>> abrtd. But for whatever reason we don't get the reports from abrtd.
> >>>
> >>> Elsewhere we use the method of light periodic checks and, if they spot an
> >>> issue, switch on a heavy debug mode. For something like this where we want
> >>> a trace of operations as they happen? I think we need to be able to turn
> >>> on GEM tracing on the fly and be able to capture the compressed buffers.
> >>
> >> Not sure that we need a trace of pre-events for this. It seems pretty
> >> straightforward to me. The register is not in the state in which it should
> >> be immediately after reset.
> >
> > But you don't always have a reset, which makes throwing a tantrum here
> > difficult.
> >
> >> It will help sort the hang reports into a new bucket.
> >
> > The information is already in the hang report. What isn't, and what would
> > be useful, is the trace of events that precede certain hangs (ones that I
> > suspect are driver bugs as opposed to userspace GPU programming issues).
>
> It is there but takes more skill to triage.
It is largely pattern matching. I'm sure an AI could do a far better job
at finding discrepancies worth investigating. Add another item to the
wishlist.

> Okay, given that the check is not only after reset, and given how you do
> the lion's share of bug triaging anyway:
>
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursu...@intel.com>

Sorry for holding out on not parcelling too much into this; thanks for the
review. Pushed, hoping that it truly does stop the flip-flops.
-Chris
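For anyone following the _MASKED_BIT_DISABLE(STOP_RING) exchange above, here
is a minimal userspace model of the masked-register convention those writes
rely on. The STOP_RING bit position and the *_BIT_* helpers mirror the
definitions in i915_reg.h; everything else (masked_write(), the local mi_mode
variable) is an illustrative sketch, not the driver's MMIO path.

/*
 * Userspace model of an i915 "masked" register: the high 16 bits of
 * a write select which of the low 16 bits take effect, so individual
 * bits can be set or cleared without a read-modify-write cycle.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define STOP_RING (1 << 8) /* MI_MODE "stop ring" bit, as in i915_reg.h */

#define MASKED_BIT_ENABLE(a)  (((a) << 16) | (a))
#define MASKED_BIT_DISABLE(a) ((a) << 16)

/* Apply a masked write: only the bits selected in the high half change. */
static uint32_t masked_write(uint32_t reg, uint32_t val)
{
	uint32_t mask = val >> 16;

	return (reg & ~mask) | (val & mask);
}

int main(void)
{
	/* Pretend the hang left STOP_RING set alongside an unrelated bit. */
	uint32_t mi_mode = STOP_RING | (1 << 3);

	/* The check from the patch: report the unexpected starting state. */
	if (mi_mode & STOP_RING)
		printf("STOP_RING still set in RING_MI_MODE\n");

	/* Clearing STOP_RING leaves the unrelated bit untouched... */
	mi_mode = masked_write(mi_mode, MASKED_BIT_DISABLE(STOP_RING));
	assert(!(mi_mode & STOP_RING));
	assert(mi_mode & (1 << 3));

	/* ...and a reset path that wants the ring stopped sets only it back. */
	mi_mode = masked_write(mi_mode, MASKED_BIT_ENABLE(STOP_RING));
	assert(mi_mode & STOP_RING);

	return 0;
}

Because each write carries its own mask in the high 16 bits, enable_execlists()
can drop STOP_RING without first reading RING_MI_MODE and without disturbing
whatever else the hang (or a deliberately failed reset) left in the register.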