On 11/7/2022 06:09, Tvrtko Ursulin wrote:
On 04/11/2022 17:45, John Harrison wrote:
On 11/4/2022 03:01, Tvrtko Ursulin wrote:
On 03/11/2022 19:16, John Harrison wrote:
On 11/3/2022 02:38, Tvrtko Ursulin wrote:
On 03/11/2022 09:18, Tvrtko Ursulin wrote:
On 03/11/2022 01:33, John Harrison wrote:
On 11/2/2022 07:20, Tvrtko Ursulin wrote:
On 02/11/2022 12:12, Jani Nikula wrote:
On Tue, 01 Nov 2022, john.c.harri...@intel.com wrote:
From: John Harrison <john.c.harri...@intel.com>
At the end of each test, IGT does a drop caches call via
sysfs with
sysfs?
Sorry, that was meant to say debugfs. I've also been working on
some sysfs IGT issues and evidently got my wires crossed!
special flags set. One of the possible paths waits for idle with an
infinite timeout. That causes problems for debugging issues when CI
catches a "can't go idle" test failure. Best case, the CI system times
out (after 90s), attempts a bunch of state dump actions and then
reboots the system to recover it. Worst case, the CI system can't do
anything at all and then times out (after 1000s) and simply reboots.
Sometimes a serial port log of dmesg might be available, sometimes not.

So rather than making life hard for ourselves, change the timeout to
be 10s rather than infinite. Also, trigger the standard
wedge/reset/recover sequence so that testing can continue with a
working system (if possible).
Signed-off-by: John Harrison <john.c.harri...@intel.com>
---
drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index ae987e92251dd..9d916fbbfc27c 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -641,6 +641,9 @@ DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
 		  DROP_RESET_ACTIVE | \
 		  DROP_RESET_SEQNO | \
 		  DROP_RCU)
+
+#define DROP_IDLE_TIMEOUT (HZ * 10)
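[The hunk that actually consumes the new timeout is trimmed from the quote
above. Purely as a sketch of the intent, not the exact diff, the drop caches
wait goes from an unbounded wait to roughly:

	if (val & (DROP_IDLE | DROP_ACTIVE)) {
		ret = intel_gt_wait_for_idle(gt, DROP_IDLE_TIMEOUT);
		if (ret == -ETIME)
			intel_gt_set_wedged(gt); /* wedge, then report the timeout */
		if (ret)
			return ret; /* -ETIME on timeout, -EINTR on a signal */
	}

i.e. a timeout surfaces to the debugfs writer as -ETIME and also triggers the
wedge/reset/recover sequence mentioned in the commit message.]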
I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's also only used here.
So move here, dropping i915 prefix, next to the newly proposed
one?
Sure, can do that.
I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only used in
gt/intel_gt.c.
Move there and rename to GT_IDLE_TIMEOUT?
I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in
intel_gt_pm.c.
No action needed, maybe drop i915 prefix if wanted.
These two are totally unrelated and in code not being touched by
this change. I would rather not conflate changing random other
things with fixing this specific issue.
I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies.
Add _MS suffix if wanted.
My head spins.
I follow and raise that the newly proposed DROP_IDLE_TIMEOUT
applies to DROP_ACTIVE and not only DROP_IDLE.
My original intention for the name was that it is the 'drop caches
timeout for intel_gt_wait_for_idle'. Which is quite the mouthful
and hence abbreviated to DROP_IDLE_TIMEOUT. But yes, I realised
later that the name can be conflated with the DROP_IDLE flag. Will
rename.
Things get refactored, code moves around, bits get left behind,
who knows. No reason to get too worked up. :) As long as people
are taking a wider view when touching the code base, and are
not afraid to send cleanups, things should be good.
On the other hand, if every patch gets blocked in code review
because someone points out some completely unrelated piece of
code could be a bit better then nothing ever gets fixed. If you
spot something that you think should be improved, isn't the
general idea that you should post a patch yourself to improve it?
There are two maintainers per branch and an order of magnitude or
two more developers, so it'd be nice if cleanups would just be
incoming on a self-initiative basis. ;)
For the actual functional change at hand - it would be nice if the
code paths in question could handle SIGINT and then we could
punt the decision on how long someone wants to wait purely to
userspace. But it's probably hard and it's only debugfs so
whatever.
The code paths in question will already abort on a signal, won't
they? Both intel_gt_wait_for_idle() and
intel_guc_wait_for_pending_msg(), which is where the
uc_wait_for_idle eventually ends up, have an 'if(signal_pending)
return -EINTR;' check. Beyond that, it sounds like what you are
asking for is a change in the IGT libraries and/or CI framework
to start sending signals after some specific timeout. That seems
like a significantly more complex change (in terms of the number
of entities affected and number of groups involved) and
unnecessary.
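[Roughly, the shape of those checks, as a standalone sketch rather than the
actual i915 functions: poll for idle, yield, and bail out early if a signal
is pending, with a bound so it cannot spin forever:

	static int wait_for_idle_sketch(struct intel_gt *gt,
					unsigned long timeout_jiffies)
	{
		const unsigned long end = jiffies + timeout_jiffies;

		while (!intel_engines_are_idle(gt)) {
			if (signal_pending(current))
				return -EINTR;	/* a signal aborts the wait */
			if (time_after(jiffies, end))
				return -ETIME;	/* bounded, so no infinite hang */
			cond_resched();
		}
		return 0;
	}

The real waits are built on the request retirement machinery rather than a
raw poll loop, but the signal check sits inside the loop in the same way.]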
If you say so, I haven't looked at them all. But if the code path
in question already aborts on signals then I am not sure what the
patch is fixing. I assumed you were trying to avoid the write being
stuck in D state forever, which then prevents driver unload and
everything, requiring the test runner to eventually reboot. If
you say SIGINT works then you can already recover from userspace,
no?
Whether or not 10s is enough, CI will hopefully tell us. I'd
probably err on the side of safety and make it longer, but at
most half of the test runner timeout.
This is supposed to be test clean up. This is not about how long
a particular test takes to complete but about how long it takes
to declare the system broken after the test has already
finished. I would argue that even 10s is massively longer than
required.
I am not convinced that wedging is correct though. Conceptually it
could be just that the timeout is too short. What does wedging
really give us, on top of limiting the wait, when the latter AFAIU
is the key factor which would prevent the need to reboot the
machine?
It gives us a system that knows what state it is in. If we can't
idle the GT then something is very badly wrong. Wedging
indicates that. It also ensures that a full GT reset will be
attempted before the next test is run, helping to prevent a
failure on test X from propagating into failures of unrelated
tests X+1, X+2, ... And if the GT reset does not work because
the system is really that badly broken then future tests will
not run rather than report erroneous failures.
This is not about getting a more stable system for end users by
sweeping issues under the carpet and pretending all is well. End
users don't run IGTs or explicitly call dodgy debugfs entry
points. The sole motivation here is to get more accurate results
from CI. That is, correctly identifying which test has hit a
problem, getting valid debug analysis for that test (logs and
such) and allowing further testing to complete correctly in the
case where the system can be recovered.
I don't really oppose shortening of the timeout in principle, I
just want a clear statement on whether this is something IGT / the
test runner could already do or not. It can apply a timeout, it can
also send SIGINT, and it could even trigger a reset from outside.
Sure, it is debugfs hacks so the general "kernel should not implement
policy" rule need not be strictly followed, but let's be clear about
what the options are.
One conceptual problem with applying this policy is that the code is:
	if (val & (DROP_IDLE | DROP_ACTIVE)) {
		ret = intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
		if (ret)
			return ret;
	}

	if (val & DROP_IDLE) {
		ret = intel_gt_pm_wait_for_idle(gt);
		if (ret)
			return ret;
	}
So if someone passes in DROP_IDLE then why would only the first
branch have a short timeout and wedge? Yeah, some bug happens to
be there at the moment, but put a bug in a different place and
you hang on the second branch and then need another patch. Versus
perhaps making it all respect SIGINT and handling it from outside.
The pm_wait_for_idle can only be called after gt_wait_for_idle has
completed successfully. There is no route to skip the GT idle or to
do the PM idle even if the GT idle fails. So the chances of the PM
idle failing are greatly reduced. There would have to be something
outside of a GT keeping the GPU awake and there isn't a whole lot
of hardware left at that point!
Well "greatly reduced" is beside my point. Point is today bug is
here and we add a timeout, tomorrow bug is there and then the same
dance. It can be just a sw bug which forgets to release the pm ref
in some circumstances, doesn't really matter.
Huh?
Greatly reduced is the whole point. Today there is a bug and it
causes a kernel hang which requires the CI framework to reboot the
system in an extremely unfriendly way which makes it very hard to
work out what happened. Logs are likely not available. We don't even
necessarily know which test was being run at the time. Etc. So we
replace the infinite timeout with a meaningful timeout. CI now
correctly marks the single test as failing, captures all the correct
logs, creates a useful bug report and continues on testing more stuff.
So what is preventing CI from collecting logs if IGT is forever stuck in an
interruptible wait? Surely it can collect the logs at that point if
the kernel is healthy enough. If it isn't then I don't see how wedging
the GPU will make the kernel any healthier.
Is i915 preventing better log collection or could the test runner be
improved?
Sure, there is still the chance of hitting an infinite timeout. But
that one is significantly more complicated to remove. And the chances
of hitting that one are significantly smaller than the chances of
hitting the first one.
This statement relies on intimate knowledge of implementation details
and a bit too much of a white box testing approach, but that's okay,
let's move past this one.
So you are arguing that because I can't fix the last 0.1% of possible
failures, I am not allowed to fix the first 99.9% of the failures?
I am clearly not arguing for that. But we are also not talking about
"fixing failures" here. Just how to make CI cope better with a class
of i915 bugs.
Regarding signals, the PM idle code ends up at
wait_var_event_killable(). I assume that is interruptible via at
least a KILL signal, if not any signal, although it's not entirely
clear when trying to follow through the implementation of this code.
Also, I have no idea if there is a safe way to add a timeout to
that code (or why it wasn't already written with a timeout
included). Someone more familiar with the wakeref internals would
need to comment.
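[For reference, from memory the core of that wait is roughly the following;
a paraphrase of intel_wakeref_wait_for_idle() rather than a verbatim quote,
and the real function does a little more around it:

	static int wakeref_idle_wait_sketch(struct intel_wakeref *wf)
	{
		/* Killable wait: only fatal signals (e.g. SIGKILL) break it,
		 * and there is no timeout argument to plumb a bound through.
		 */
		return wait_var_event_killable(&wf->wakeref,
					       !intel_wakeref_is_active(wf));
	}

So it is interruptible, but only by fatal signals, and bounding it would
mean touching the wakeref internals.]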
However, I strongly disagree that we should not fix the driver just
because it is possible to workaround the issue by re-writing the CI
framework. Feel free to bring a redesign plan to the IGT WG and
whatever equivalent CI meetings in parallel. But we absolutely
should not have infinite waits in the kernel if there is a trivial
way to not have infinite waits.
I thought I was clear that I am not really opposed to the timeout.
About the rest of the paragraph I don't really care; the point is moot
because it's debugfs so we can do whatever, as long as it is not
burdensome to i915, which this isn't. If either wasn't the case then
we certainly wouldn't be adding any workarounds in the kernel if it
can be achieved in IGT.
Also, sending a signal does not result in the wedge happening. I
specifically did not want to change that code path because I was
assuming there was a valid reason for it. If you have been
interrupted then you are in the territory of maybe it would have
succeeded if you just left it for a moment longer. Whereas, hitting
the timeout says that someone very deliberately said this is too
long to wait and therefore the system must be broken.
I wanted to know specifically about wedging - why can't you
wedge/reset from IGT if DROP_IDLE times out in quiescent or
wherever, if that's what you say is the right thing?
Huh?
DROP_IDLE has two waits. One that I am trying to change from infinite
to finite + wedge. One that would take considerable effort to change
and would be quite invasive to a lot more of the driver and which can
only be hit if the first timeout actually completed successfully and
is therefore of less importance anyway. Both of those timeouts
appear to respect signal interrupts.
That's a policy decision so why would i915 wedge if an arbitrary
timeout expired? I915 is not controlling how much work there is
outstanding at the point the IGT decides to call DROP_IDLE.
Because this is a debug test interface that is used solely by IGT
after it has finished its testing. This is not about wedging the
device at some random arbitrary point because an AI compute workload
takes three hours to complete. This is about a very specific test
framework cleaning up after testing is completed and making sure the
test did not fry the system.
And even if an IGT test was calling DROP_IDLE in the middle of a test
for some reason, it should not be deliberately pushing 10+ seconds of
work through and then calling a debug-only interface to flush it out.
If a test wants to verify that the system can cope with submitting a
minute's worth of rendering and then waiting for it to complete, then
the test should be using official channels for that wait.
Plus, infinite wait is not a valid code path in the first place so
any change in behaviour is not really a change in behaviour. Code
can't be relying on a kernel call to never return for its correct
operation!
Why wouldn't an infinite wait be valid? Then you'd better change the
other one as well. ;P
In what universe is it ever valid to wait forever for a test to
complete?
Well above you claimed both paths respect SIGINT. If that is so then
the wait is as infinite as the IGT wanted it to be.
See above, the PM code would require much more invasive changes. This
was low-hanging fruit. It was supposed to be a two-minute change to a
very self-contained section of code that would provide significant
benefit to debugging a small class of very hard to debug problems.
Sure, but I'd still like to know why you can't do what you want from
the IGT framework.
Have the timeout reduction in i915, again that's fine, assuming 10
seconds is enough to not break something by accident.
CI showed no regressions. And if someone does find a valid reason why a
post-test drop caches call should legitimately take a stupidly long time,
then it is easy to track back where the ETIME error came from and bump
the timeout.
With that change you have already broken the "infinite wait". It makes
the debugfs write return -ETIME in a time much shorter than the test
runner timeout(s). My question is: what is the thing that you cannot do
from IGT at that point? You want to wedge then? Send
DROP_RESET_ACTIVE to do it for you? If that doesn't work, add a new
flag which will wedge explicitly.
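[For what it's worth, the userspace side of that idea could look something
like the sketch below: bound the debugfs write with a SIGALRM so the
interruptible wait returns -EINTR, then let the test runner decide whether
to reset. The path and flag value are illustrative placeholders, not
something to copy verbatim, and as discussed above this only helps for the
interruptible wait, not the killable PM one:

	#include <fcntl.h>
	#include <signal.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	static void on_alarm(int sig) { (void)sig; /* just interrupt the blocked write() */ }

	int main(void)
	{
		const char *path = "/sys/kernel/debug/dri/0/i915_gem_drop_caches";
		const char *flags = "0x40"; /* hypothetical DROP_IDLE-style value */
		struct sigaction sa;
		int fd = open(path, O_WRONLY);

		if (fd < 0)
			return 1;

		memset(&sa, 0, sizeof(sa));
		sa.sa_handler = on_alarm; /* deliberately no SA_RESTART */
		sigemptyset(&sa.sa_mask);
		sigaction(SIGALRM, &sa, NULL);
		alarm(10); /* the policy lives in the test runner, not i915 */

		if (write(fd, flags, strlen(flags)) < 0)
			perror("drop_caches"); /* EINTR means the wait was cut short */

		alarm(0);
		close(fd);
		return 0;
	}

Whether to then wedge via DROP_RESET_ACTIVE, a new flag, or in the kernel as
this patch does is exactly the policy question being debated.]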
We are again degrading into a huge philosophical discussion and all I
wanted to start with is to hear how exactly things go bad.
I have no idea what you are wanting. I am trying to have a technical
discussion about improving the stability of the driver during CI
testing. I have no idea if you are arguing that this change is good,
bad, broken, wrong direction or what.
Things go bad as explained in the commit message. The CI framework does
not use signals. The IGT framework does not use signals. There is no
watchdog that sends a TERM or KILL signal after a specified timeout. All
that happens is the IGT sits there forever waiting for the drop caches
debugfs write to return. The CI framework eventually gives up waiting for the
test to complete and tries to recover. There are many different CI
frameworks in use across Intel. Some timeout quickly, some timeout
slowly. But basically, they all eventually give up and don't bother
trying any kind of remedial action but just hit the reset button
(sometimes by literally power cycling the DUT). As a result, background
processes that are saving dmesg, stdout, etc do not necessarily
terminate cleanly. That results in logs that are at best truncated, at
worst missing entirely. It also results in some frameworks aborting
testing at that point. So no results are generated for all the other
tests that have yet to be run. Some frameworks also run tests in
batches. All they log is that something, somewhere in the batch died. So
you don't even know which specific test actually hit the problem.
Can the CI frameworks be improved? Undoubtedly. In very many ways. Is
that something we have the ability to do with a simple patch? No. Would
re-writing the IGT framework to add watchdog mechanisms improve things?
Yes. Can it be done with a simple patch? No. Would a simple patch to
i915 significantly improve the situation? Yes. Will it solve every
possible CI hang? No. Will it fix any actual end user visible bugs? No.
Will it introduce any new bugs? No. Will it help us to debug at least
some CI failures? Yes.
John.
Regards,
Tvrtko