gt: Increase suspend timeout

Thomas Hellström Thu, 23 Sep 2021 06:19:49 -0700


On 9/23/21 2:59 PM, Tvrtko Ursulin wrote:

On 23/09/2021 12:47, Thomas Hellström wrote:
Hi, Tvrtko,

On 9/23/21 12:13 PM, Tvrtko Ursulin wrote:
On 22/09/2021 07:25, Thomas Hellström wrote:
With GuC submission on DG1, the execution of the requests times out
for the gem_exec_suspend igt test case after executing around 800-900
of 1000 submitted requests.
Given the time we allow elsewhere for fences to signal (in the order of seconds), increase the timeout before we mark the gt wedged and proceed.
I suspect it is not about requests not retiring in time but about the intel_guc_wait_for_idle part of intel_gt_wait_for_idle. Although I don't know which G2H message the code is waiting for at suspend time, so perhaps something to run past the GuC experts.
So what's happening here is that the test submits 1000 requests, each writing a value to an object, and then that object content is checked after resume. With GuC it turns out that only 800-900 or so values are actually written before we time out, and the test (basic-S3) fails, but not on every run.
Yes, and that did not make sense to me. It is a single context even, so I did not come up with an explanation for why GuC would be slower.
Unless it somehow manages to not even update the ring tail in time and requests are still only stuck in the software queue? Perhaps you can see that from context tail and head when it happens.
This is a bit interesting in itself, because I never saw the hang-S3 test fail, which from what I can tell basically is an identical test but with a spinner submitted after the 1000th request. Could be that the suspend backup code ends up waiting for something before we end up in intel_gt_wait_for_idle, giving more requests time to execute.
No idea, I don't know the suspend paths that well. For instance, before looking at the code I thought we would preempt what's executing and not wait for everything that has been submitted to finish. :)
Anyway, if that turns out to be correct then perhaps it would be better to split the two timeouts (for instance if the required GuC timeout is fundamentally independent) so it's clear who needs how much time. Adding Matt and John to comment.
You mean we have separate timeouts depending on whether we're using GuC or execlists submission?
No, I don't know yet. First I think we need to figure out what exactly is happening.

Well then TBH I will need to file a separate Jira about that. There might be various things going on here, like switching between the migrate context for eviction of unrelated LMEM buffers and the context used by gem_exec_suspend. The gem_exec_suspend failures are blocking DG1 BAT so it's pretty urgent to get this series merged. If you insist I can leave this patch out for now, but I'd rather commit it as is and file a Jira instead.


/Thomas

Re: [Intel-gfx] [PATCH v6 3/9] drm/i915/gt: Increase suspend timeout
