Re: [Intel-gfx] ✗ Fi.CI.BAT: failure for drm/i915: ttm for stolen (rev5)

Tvrtko Ursulin Tue, 28 Jun 2022 01:46:51 -0700


On 27/06/2022 18:08, Robert Beckett wrote:

On 22/06/2022 10:05, Tvrtko Ursulin wrote:
On 21/06/2022 20:11, Robert Beckett wrote:
On 21/06/2022 18:37, Patchwork wrote:
*Patch Details*
*Series:*    drm/i915: ttm for stolen (rev5)
*URL:*    https://patchwork.freedesktop.org/series/101396/
*State:*    failure
*Details:*    https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html
  CI Bug Log - changes from CI_DRM_11790 -> Patchwork_101396v5


    Summary

*FAILURE*
Serious unknown changes coming with Patchwork_101396v5 absolutely need to be
verified manually.

If you think the reported changes have nothing to do with the changes
introduced in Patchwork_101396v5, please notify your bug team to allow them
to document this new failure mode, which will reduce false positives in CI.
External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html
    Participating hosts (40 -> 41)

Additional (2): fi-icl-u2 bat-dg2-9
Missing (1): fi-bdw-samus


    Possible new issues
Here are the unknown changes that may have been introduced in Patchwork_101396v5:
      IGT changes


        Possible regressions

  * igt@i915_selftest@live@reset:
      o bat-adlp-4: PASS
<https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11790/bat-adlp-4/igt@i915_selftest@l...@reset.html>
        -> DMESG-FAIL
<https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/bat-adlp-4/igt@i915_selftest@l...@reset.html>
I keep hitting clobbered pages during engine resets on bat-adlp-4.
It seems to happen most of the time on that machine and occasionally on bat-adlp-6.
Should bat-adlp-4 be considered an unreliable machine like bat-adlp-6 is for now?
Alternatively, seeing the history of this in
commit 3da3c5c1c9825c24168f27b021339e90af37e969 "drm/i915: Exclude low pages (128KiB) of stolen from use"
could this be an indication that maybe the original issue is worse on adlp machines?
I have only ever seen page 135 or 136 clobbered across many runs via trybot, so it looks fairly consistent.
Though excluding the use of over 540K of stolen might be too severe.
Don't know, but I see that on the latest version you even hit pages 165/166.
Any history of hitting this in CI without your series? If not, are there some other changes which could explain it? Are you touching the selftest itself?
Hexdump of the clobbered page looks quite complex. Especially POISON_FREE. Any idea how that ends up there?
(see https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_105517v4/fi-rkl-guc/igt@i915_selftest@l...@reset.html#dmesg-warnings702)
After lots of slow debug via CI, it looks like the issue is that a ring buffer was allocated and taking up that page during the initial crc capture in the test, but by the time it came to check for corruption, it had been freed from that page.
The test has a number of weaknesses:
1. the busy check is done twice, without taking into account any change in between. I assume previously this could be relied on never to occur, but now it can for some reason (more on that later)

You mean the stolen page used/unused test? Probably the premise is that the test controls the driver completely, i.e. is the sole user, and the two checks are run at the time where nothing else could have changed the state.

With the nerfed request (as with GuC) this actually should hold. In the generic case I am less sure, my working knowledge faded a bit, but perhaps there was something guaranteeing the spinner couldn't have been retired yet at the time of the second check. Would need clarifying at least in comments.

2. the engine reset returns early with an error for guc submission engines, but it is silently ignored in the test. Perhaps it should ignore guc submission engines as it is a largely useless test for those situations.

Yes, looks dodgy indeed. You will need to summon the owners of the GuC backend to comment on this.

However, even if the test should be skipped with GuC, it is extremely interesting that you are hitting this, so I suspect there is a more serious issue at play.

A quick obvious fix is to have a busy bitmask that remembers each page's busy state initially and only check for corruption if it was busy during both checks.
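
A minimal sketch of that busy-bitmask idea, written as standalone C rather than kernel code. It assumes the test walks the stolen pages twice (once at crc capture, once at the corruption check) and simply skips any page whose busy state differs between the two passes; whether the comparison should then apply to the busy or the unused pages is left to the caller. All names here (`busy_map`, `busy_map_record`, `should_compare_crc`) and the page count are hypothetical and do not come from the selftest.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NUM_PAGES 256 /* hypothetical number of stolen pages scanned */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_WORDS ((NUM_PAGES + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* One bit per page: set if the page was busy during the first pass. */
struct busy_map {
	unsigned long bits[BITMAP_WORDS];
};

static void busy_map_init(struct busy_map *map)
{
	memset(map, 0, sizeof(*map));
}

/* Pass 1: call once per page while capturing the initial crc. */
static void busy_map_record(struct busy_map *map, size_t page, bool busy)
{
	assert(page < NUM_PAGES);
	if (busy)
		map->bits[page / BITS_PER_WORD] |= 1UL << (page % BITS_PER_WORD);
}

static bool busy_map_was_busy(const struct busy_map *map, size_t page)
{
	assert(page < NUM_PAGES);
	return (map->bits[page / BITS_PER_WORD] >> (page % BITS_PER_WORD)) & 1UL;
}

/*
 * Pass 2: a crc mismatch is only meaningful if the page's busy state is
 * the same as in pass 1. If it changed, the contents were legitimately
 * rewritten by an allocation or a free in between.
 */
static bool should_compare_crc(const struct busy_map *map, size_t page,
			       bool busy_now)
{
	return busy_map_was_busy(map, page) == busy_now;
}
```

With this, a page that held a ring buffer during the first pass and was freed before the second would be skipped instead of being reported as clobbered.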
However, the main question is why this is occurring now with my changes.
I have added more debug to check where the stolen memory is being freed, but the first run last night didn't hit the issue for once.
I am running again now, will report back if I figure out where it is being freed.
I am pretty sure the "corruption" (which isn't actually corruption) is from a ring buffer.
The POISON_FREE is the only difference between the captured before and after dumps:
[0040] 00000000 02800000 6b6b6b6b 6b6b6b6b 6b6b6b6b 6b6b6b6b 6b6b6b6b 6b6b6b6b
with the 2nd dword being the MI_ARB_CHECK used for the spinner.
I think this is the request poisoning from i915_request_retire()
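
To make the reading of that dump row concrete, here is a small standalone C sketch that classifies each dword. The 0x6b byte is the kernel's POISON_FREE value from include/linux/poison.h. The MI_ARB_CHECK value is assumed to be opcode 0x05 shifted left by 23 bits, which gives the 0x02800000 seen in the row; that encoding is an assumption here, not something stated in this thread. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POISON_FREE_DWORD 0x6b6b6b6bu /* four POISON_FREE (0x6b) bytes */
#define ARB_CHECK_DWORD (0x05u << 23) /* assumed MI_ARB_CHECK: 0x02800000 */

enum dword_kind {
	DWORD_ZERO,
	DWORD_ARB_CHECK,
	DWORD_POISON_FREE,
	DWORD_OTHER,
};

/* Label a single dword from the hexdump. */
static enum dword_kind classify_dword(uint32_t dw)
{
	if (dw == 0)
		return DWORD_ZERO;
	if (dw == ARB_CHECK_DWORD)
		return DWORD_ARB_CHECK;
	if (dw == POISON_FREE_DWORD)
		return DWORD_POISON_FREE;
	return DWORD_OTHER;
}

/* Count how many dwords in a dump row carry the free-poison pattern. */
static size_t count_poisoned(const uint32_t *row, size_t n)
{
	size_t count = 0;

	assert(row != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (classify_dword(row[i]) == DWORD_POISON_FREE)
			count++;
	return count;
}
```

Applied to the row at offset 0x40, this labels the first dword as zero, the second as the arbitration check, and the remaining six as free-poison.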
The bit I don't know yet is why a ring buffer was freed between the initial crc capture and the corruption check. The spinner should be active across the entire test, maintaining a ref on the context and its ring.
Hopefully my latest debug will give more answers.

Yeah, if you can figure out whether a) the spinner is still active during the 2nd check (as I think it should be), and b) the corruption is detected in the same pages which were used in the 1st pass, that would be interesting.


Regards,

Tvrtko

Btw what is the benefit of converting stolen to start with? It's not much of a backend since it just uses the drm range manager. So quite thin and uneventful. Diffstats for the series also do not look like you end up with much code reduction?
Regards,

Tvrtko
