Francisco Jerez <curroje...@riseup.net> writes:

> Chris Wilson <ch...@chris-wilson.co.uk> writes:
>
>> Quoting Francisco Jerez (2020-04-14 22:00:25)
>>> Chris Wilson <ch...@chris-wilson.co.uk> writes:
>>>
>>> > Quoting Francisco Jerez (2020-04-14 20:39:48)
>>> >> Chris Wilson <ch...@chris-wilson.co.uk> writes:
>>> >>
>>> >> > Quoting Chris Wilson (2020-04-14 17:14:23)
>>> >> >> Try to make RPS dramatically more responsive by shrinking the
>>> >> >> evaluation intervals by a factor of 100! The issue is that as we
>>> >> >> now park the GPU rapidly upon idling, a short or bursty workload
>>> >> >> such as the composited desktop never sustains enough work to fill
>>> >> >> and complete an evaluation window. As such, the frequency we
>>> >> >> program remains stuck. This was first reported as once boosted, we
>>> >> >> never relinquished the boost [see commit 21abf0bf168d
>>> >> >> ("drm/i915/gt: Treat idling as a RPS downclock event")] but it
>>> >> >> equally applies in the other direction for bursty workloads that
>>> >> >> *need* low latency, like desktop animations.
>>> >> >>
>>> >> >> What we could try is to preserve the incomplete EI history across
>>> >> >> idling; it is not clear whether that would be effective, nor
>>> >> >> whether the presumption of continuous workloads is accurate. A
>>> >> >> clearer path seems to be to treat it as symptomatic that we fail
>>> >> >> to handle bursty workloads with the current EI, and seek to
>>> >> >> address that by shrinking the EI so the evaluations are run much
>>> >> >> more often.
>>> >> >>
>>> >> >> This will likely entail more frequent interrupts, and by the time
>>> >> >> we process the interrupt in the bottom half [from inside a
>>> >> >> worker], the workload on the GPU has changed. To address the
>>> >> >> changeable nature, in the previous patch we compared the previous
>>> >> >> complete EI with the interrupt request and only up/down clock if
>>> >> >> both agree. The impact of asking for, and presumably receiving,
>>> >> >> more interrupts is still to be determined and mitigations sought.
>>> >> >> The first idea is to differentiate between up/down responsivity
>>> >> >> and make upclocking more responsive than downclocking. This
>>> >> >> should both help thwart jitter on bursty workloads by making it
>>> >> >> easier to increase than it is to decrease frequencies, and reduce
>>> >> >> the number of interrupts we would need to process.
>>> >> >
>>> >> > Another worry I'd like to raise is that by reducing the EI we risk
>>> >> > unstable evaluations. I'm not sure how accurate the HW is, and I
>>> >> > worry about borderline workloads (if that is possible), but mainly
>>> >> > the worry is how the HW is sampling.
>>> >> >
>>> >> > The other unmentioned unknown is the latency in reprogramming the
>>> >> > frequency. At what point does it start to become a significant
>>> >> > factor? I'm presuming the RPS evaluation itself is free, until it
>>> >> > has to talk across the chip to send an interrupt.
>>> >> > -Chris
>>> >>
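For reference, the "only act when both agree" rule described in the patch
quoted above boils down to something like the sketch below. The names are
invented for illustration only; this is not the actual i915 code.

enum rps_decision { RPS_HOLD, RPS_UP, RPS_DOWN };

/*
 * Combine the decision derived from the last complete EI with the
 * decision requested by the new interrupt: only up/down clock when the
 * two agree, otherwise hold the current frequency.
 */
static enum rps_decision rps_combine(enum rps_decision last_ei,
				     enum rps_decision irq)
{
	return last_ei == irq ? irq : RPS_HOLD;
}
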
>>> >> At least on ICL the problem which this patch and 21abf0bf168d were
>>> >> working around seems to have to do with RPS interrupt delivery
>>> >> being inadvertently blocked for extended periods of time. Looking
>>> >> at the GPU utilization and RPS events on a graph I could see the
>>> >> GPU being stuck at low frequency without any RPS interrupts firing,
>>> >> for a time interval orders of magnitude greater than the EI we're
>>> >> theoretically programming today. IOW, it seems like the real
>>> >> problem isn't that our EIs are too long, but that we're missing a
>>> >> bunch of them.
>>> >>
>>> >> The solution I was suggesting for this on IRC during the last
>>> >> couple of days wouldn't have any of the drawbacks you mention
>>> >> above; I'll send it to this list in a moment if the general
>>> >> approach seems okay to you:
>>> >>
>>> >> https://github.com/curro/linux/commit/f7bc31402aa727a52d957e62d985c6dae6be4b86
>>> >
>>> > We were explicitly told to mask the interrupt generation at source
>>> > to conserve power. So I would hope for a statement as to whether
>>> > that is still a requirement from the HW architects; but I can't see
>>> > why we would not apply the mask, and that this is just paper. If the
>>> > observation about forcewake tallies, would this not suggest that it
>>> > is still conserving power on icl?
>>> >
>>>
>>> Yeah, it's hard to see how disabling interrupt generation could save
>>> any additional power in a unit which is powered off -- at least on
>>> ICL, where even the interrupt masking register is powered off...
>>>
>>> > I haven't looked at whether I see the same phenomenon as you
>>> > [missing interrupts on icl] locally, but I was expecting something
>>> > like the bug report, since the observation that render times are
>>> > less than the EI was causing the clocks to stay high. And I noticed
>>> > your problem statement and was hopeful for a link.
>>> >
>>>
>>> Right. In the workloads I was looking at last week the GPU would
>>> often be active for periods of time several times greater than the
>>> EI, and we would still fail to clock up.
>>>
>>> > They sound like two different problems. (Underclocking for
>>> > animations is not icl specific.)
>>> >
>>>
>>> Sure. But it seems like the underclocking problem has been greatly
>>> exacerbated by 21abf0bf168d, which may have been mitigating the same
>>> ICL problem I was looking at leading to RPS interrupt loss. Maybe
>>> 21abf0bf168d wouldn't be necessary with working RPS interrupt
>>> delivery? And without 21abf0bf168d, platforms earlier than ICL
>>> wouldn't have as much of an underclocking problem either.
>>
>> 21abf0bf168d ("drm/i915/gt: Treat idling as a RPS downclock event")
>>
>> is necessary because we can set a boost frequency and then never run
>> the RPS worker due to short activity cycles. See
>> igt/i915_pm_rc6_residency for rc6-idle; the bg_load is essentially
>> just
>>
>>	for (;;) {
>>		execbuf(&nop);
>>		sleep(.1);
>>	}
>>
>> Without 21abf0bf168d, if you trigger a waitboost just before, it never
>> recovers and power utilisation is measurably higher.
>>
>
> I can believe that, but can you confirm that the problem isn't a
> symptom of the PMINTRMSK issue I was talking about earlier? Assuming
> it isn't, couldn't you do something like 21abf0bf168d which simply
> stores a timestamp at GPU parking time and delays the actual frequency
> ramp-down to GPU unpark, at which point it's performed *only* if the
> time spent idle goes over a down-clock threshold based on the current
> "power" mode (e.g. something like the value of GEN6_RP_DOWN_EI minus
> GEN6_RP_DOWN_THRESHOLD) -- so the behavior of frequency ramp-down due
> to parking more closely matches what the RPS would have done if it had
> been active.
>
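Roughly, that fixed-delay variant could be sketched as below. The
structure, field names and the way the delay is derived are illustrative
guesses only, not the actual i915 implementation.

#include <stdbool.h>
#include <stdint.h>

/*
 * Record when the GPU was parked, and on unpark only mimic a RPS down
 * event if the GPU stayed idle at least as long as the RPS itself would
 * have required before firing a down interrupt (roughly the down EI
 * minus the down threshold of the current "power" mode).
 */
struct rps_park_state {
	uint64_t park_time_ns;
	uint64_t down_delay_ns;	/* ~(down EI - down threshold), in ns */
};

static void rps_note_park(struct rps_park_state *s, uint64_t now_ns)
{
	s->park_time_ns = now_ns;
}

static bool rps_downclock_on_unpark(const struct rps_park_state *s,
				    uint64_t now_ns)
{
	return now_ns - s->park_time_ns >= s->down_delay_ns;
}
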
If you want to match the behavior of the RPS even more closely you could
use a variable delay, kind of like the one I'm using for overload
tracking in:

https://github.com/curro/linux/commit/0669ce23148eafa00e15fc44b9d07f7919802af0

However, using a fixed delay based on GEN6_RP_DOWN_THRESHOLD as I was
describing above might be sufficient to avoid the waitboost energy
efficiency problem you were talking about.

>>> >> That said, it *might* be helpful to reduce the EIs we use right
>>> >> now in addition, but a factor of 100 seems over the top since that
>>> >> will cause the evaluation interval to be roughly two orders of
>>> >> magnitude shorter than the rendering time of a typical frame, which
>>> >> can lead to massive oscillations even in workloads that use a small
>>> >> fraction of the GPU time to render a single frame. Maybe we want
>>> >> something in between?
>>> >
>>> > Probably; as you can guess these were pulled out of nowhere based on
>>> > the observation that the frame lengths are much shorter than the
>>> > current EI, and that ramping up to max clocks within a single frame
>>> > of animation would take about 4 samples per frame. Based on the
>>> > reporter's observations, we do have to ramp up very quickly for a
>>> > single frame of rendering in order to hit the vblank, as we are
>>> > ramping down afterwards.
>>> >
>>> > With a target of 4 samples within say 1ms, 160us isn't too far off
>>> > the mark. (We have to allow some extra time to catch up rendering.)
>>>
>>> How about we stop ramping down after the rendering of a single frame?
>>> It's not like we save any power by doing that, since the GPU seems to
>>> be forced to the minimum frequency for as long as it remains parked
>>> anyway. If the GPU remains idle long enough for the RPS utilization
>>> counters to drop below the threshold and qualify for a ramp-down, the
>>> RPS should send us an interrupt, at which point we will ramp down the
>>> frequency.
>>
>> Because it demonstrably and quite dramatically reduces power
>> consumption for very light desktop workloads.
>>
>>> Unconditionally ramping down on parking seems to disturb the accuracy
>>> of that RPS feedback loop, which then needs to be worked around by
>>> reducing the averaging window of the RPS to a tiny fraction of the
>>> oscillation period of any typical GPU workload, which is going to
>>> prevent the RPS from seeing a whole oscillation period before it
>>> reacts, which is almost guaranteed to have a serious energy-efficiency
>>> cost.
>>
>> There is no feedback loop in these workloads. There are no completed
>> RPS workers; they are all cancelled if they were even scheduled. (Now
>> we might be tempted to look at the results if they were scheduled and
>> take that into consideration instead of unconditionally downclocking.)
>>
>
> Processing any pending RPS work seems better IMHO than unconditionally
> assuming it indicates a clock-down.
>
>>> > As for steady state workloads, I'm optimistic the smoothing helps.
>>> > (It's harder to find steady state, unthrottled workloads!)
>>>
>>> I'm curious, how is that smoothing you do in PATCH 1 better than
>>> simply setting 2x the EIs? (Which would also mean half the interrupt
>>> processing overhead compared to this series.)
>>
>> I'm anticipating that the RPS worker may not run until the next jiffy
>> or two, by which point the iir is stale. But yes, there's only a
>> subtle difference between comparing the last 320us every 160us and
>> comparing the last 320us every 320us.
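For concreteness, that subtle difference can be sketched something like
the following. The constants and names are made up for illustration and
do not correspond to the actual patches.

#include <stdbool.h>
#include <stdint.h>

#define EI_US	160	/* illustrative evaluation interval */

/*
 * Smoothed variant: every EI_US, decide based on the busy time of the
 * last two EI samples, i.e. 320us of history evaluated every 160us.
 */
static bool above_up_threshold_smoothed(uint32_t prev_busy_us,
					uint32_t cur_busy_us,
					uint32_t threshold_pct)
{
	return (prev_busy_us + cur_busy_us) * 100 >
		2 * EI_US * threshold_pct;
}

/*
 * Doubled-EI variant: every 2 * EI_US, decide based on a single sample
 * spanning 320us, i.e. the same history evaluated half as often.
 */
static bool above_up_threshold_doubled(uint32_t busy_us_320,
				       uint32_t threshold_pct)
{
	return busy_us_320 * 100 > 2 * EI_US * threshold_pct;
}
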
>
> Sounds like the benefit from smoothing is very slight, considering its
> accuracy and interrupt processing overhead costs.
>
>> -Chris