Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
Francisco Jerez writes: > This series attempts to solve an energy efficiency problem of the > current active-mode non-HWP governor of the intel_pstate driver used > for the most part on low-power platforms. Under heavy IO load the > current controller tends to increase frequencies to the maximum turbo > P-state, partly due to IO wait boosting, partly due to the roughly > flat frequency response curve of the current controller (see > [6]), which causes it to ramp frequencies up and down repeatedly for > any oscillating workload (think of graphics, audio or disk IO when any > of them becomes a bottleneck), severely increasing energy usage > relative to a (throughput-wise equivalent) controller able to provide > the same average frequency without fluctuation. The core energy > efficiency improvement has been observed to be of the order of 20% via > RAPL, but it's expected to vary substantially between workloads (see > perf-per-watt comparison [2]). > > One might expect that this could come at some cost in terms of system > responsiveness, but the governor implemented in PATCH 6 has a variable > response curve controlled by a heuristic that keeps the controller in > a low-latency state unless the system is under heavy IO load for an > extended period of time. The low-latency behavior is actually > significantly more aggressive than the current governor, allowing it > to achieve better throughput in some scenarios where the load > ping-pongs between the CPU and some IO device (see PATCH 6 for more of > the rationale). The controller offers relatively lower latency than > the upstream one particularly while C0 residency is low (which by > itself contributes to mitigate the increased energy usage while on > C0). 
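The oscillation argument above can be illustrated with a toy model (not code from the series; it assumes a convex cubic power curve, P(f) ~ C*f^3, a common first-order model for CMOS dynamic power, and invented min/turbo frequencies):

```python
# Hedged sketch: why a controller that ping-pongs between P-states uses more
# energy than one holding the equivalent average frequency. By Jensen's
# inequality, for a convex power curve the time-average of P(f) exceeds
# P(average f).

def power(f_ghz, c=1.0):
    """Instantaneous power (arbitrary units) at frequency f, assuming P ~ c*f^3."""
    return c * f_ghz ** 3

def avg_power_oscillating(f_lo, f_hi):
    """Controller ramps between f_lo and f_hi with a 50% duty cycle."""
    return 0.5 * (power(f_lo) + power(f_hi))

def avg_power_steady(f_lo, f_hi):
    """Throughput-equivalent controller holding the same average frequency."""
    return power(0.5 * (f_lo + f_hi))

lo, hi = 0.8, 2.4  # hypothetical min and turbo P-states in GHz
osc = avg_power_oscillating(lo, hi)
steady = avg_power_steady(lo, hi)
print(f"oscillating: {osc:.3f}, steady: {steady:.3f}, "
      f"overhead: {100 * (osc / steady - 1):.0f}%")
```

With these made-up endpoints the oscillating controller draws 75% more average power for the same average frequency; the real-world gap depends on the actual power curve and duty cycle.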
However under certain conditions the low-latency heuristic may > increase power consumption (see perf-per-watt comparison [2], the > apparent regressions are correlated with an increase in performance in > the same benchmark due to the use of the low-latency heuristic) -- If > this is a problem a different trade-off between latency and energy > usage shouldn't be difficult to achieve, but it will come at a > performance cost in some cases. I couldn't observe a statistically > significant increase in idle power consumption due to this behavior > (on BXT J3455): > > package-0 RAPL (W):XX ±0.14% x8 -> XX ±0.15% x9 > d=-0.04% ±0.14% p=61.73% > > [Absolute benchmark results are unfortunately omitted from this letter > due to company policies, but the percent change and Student's T > p-value are included above and in the referenced benchmark results] > > The most obvious impact of this series will likely be the overall > improvement in graphics performance on systems with an IGP integrated > into the processor package (though for the moment this is only enabled > on BXT+), because the TDP budget shared among CPU and GPU can > frequently become a limiting factor in low-power devices. On heavily > TDP-bound devices this series improves performance of virtually any > non-trivial graphics rendering by a significant amount (of the order > of the energy efficiency improvement for that workload assuming the > optimization didn't cause it to become non-TDP-bound). > > See [1]-[5] for detailed numbers including various graphics benchmarks > and a sample of the Phoronix daily-system-tracker. Some popular > graphics benchmarks like GfxBench gl_manhattan31 and gl_4 improve > between 5% and 11% on our systems. 
The exact improvement can vary > substantially between systems (compare the benchmark results from the > two different J3455 systems [1] and [3]) due to a number of factors, > including the ratio between CPU and GPU processing power, the behavior > of the userspace graphics driver, the windowing system and resolution, > the BIOS (which has an influence on the package TDP), the thermal > characteristics of the system, etc. > > Unigine Valley and Heaven improve by a similar factor on some systems > (see the J3455 results [1]), but on others the improvement is lower > because the benchmark fails to fully utilize the GPU, which causes the > heuristic to remain in low-latency state for longer, which leaves a > reduced TDP budget available to the GPU, which prevents performance > from increasing further. This can be avoided by using the alternative > heuristic parameters suggested in the commit message of PATCH 8, which > provide a lower IO utilization threshold and hysteresis for the > controller to attempt to save energy. I'm not proposing those for > upstream (yet) because they would also increase the risk for > latency-sensitive IO-heavy workloads to regress (like SynMark2 > OglTerrainFly* and some arguably poorly designed IPC-bound X11 > benchmarks). > > Discrete graphics aren't likely to experience that much of a visible > improvement from this, even though many non-IGP workloads *could* > benefit by reducing the system's energy usage while the discrete GPU > (or really, any
Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
Francisco Jerez writes: > This series attempts to solve an energy efficiency problem of the > current active-mode non-HWP governor of the intel_pstate driver used > for the most part on low-power platforms. Under heavy IO load the > current controller tends to increase frequencies to the maximum turbo > P-state, partly due to IO wait boosting, partly due to the roughly > flat frequency response curve of the current controller (see > [6]), which causes it to ramp frequencies up and down repeatedly for > any oscillating workload (think of graphics, audio or disk IO when any > of them becomes a bottleneck), severely increasing energy usage > relative to a (throughput-wise equivalent) controller able to provide > the same average frequency without fluctuation. The core energy > efficiency improvement has been observed to be of the order of 20% via > RAPL, but it's expected to vary substantially between workloads (see > perf-per-watt comparison [2]). > > One might expect that this could come at some cost in terms of system > responsiveness, but the governor implemented in PATCH 6 has a variable > response curve controlled by a heuristic that keeps the controller in > a low-latency state unless the system is under heavy IO load for an > extended period of time. The low-latency behavior is actually > significantly more aggressive than the current governor, allowing it > to achieve better throughput in some scenarios where the load > ping-pongs between the CPU and some IO device (see PATCH 6 for more of > the rationale). The controller offers relatively lower latency than > the upstream one particularly while C0 residency is low (which by > itself contributes to mitigate the increased energy usage while on > C0). 
However under certain conditions the low-latency heuristic may > increase power consumption (see perf-per-watt comparison [2], the > apparent regressions are correlated with an increase in performance in > the same benchmark due to the use of the low-latency heuristic) -- If > this is a problem a different trade-off between latency and energy > usage shouldn't be difficult to achieve, but it will come at a > performance cost in some cases. I couldn't observe a statistically > significant increase in idle power consumption due to this behavior > (on BXT J3455): > > package-0 RAPL (W):XX ±0.14% x8 -> XX ±0.15% x9 > d=-0.04% ±0.14% p=61.73% > In case anyone is wondering what's going on, Srinivas pointed me at a larger idle power usage increase off-list, ultimately caused by the low-latency heuristic as discussed in the paragraph above. I have a v2 of PATCH 6 that gives the controller a third response curve roughly intermediate between the low-latency and low-power states of this revision, which avoids the energy usage increase expected for v1 while C0 residency is low (e.g. during idle). The low-latency behavior of this revision is still going to be available based on a heuristic (in particular when a realtime-priority task is scheduled). We're carrying out some additional testing; I'll post the code here eventually. > [Absolute benchmark results are unfortunately omitted from this letter > due to company policies, but the percent change and Student's T > p-value are included above and in the referenced benchmark results] > > The most obvious impact of this series will likely be the overall > improvement in graphics performance on systems with an IGP integrated > into the processor package (though for the moment this is only enabled > on BXT+), because the TDP budget shared among CPU and GPU can > frequently become a limiting factor in low-power devices. 
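The three-state response-curve selection described for v2 might look roughly like the following sketch. This is purely hypothetical pseudocode, not the actual get_target_pstate_lp() logic; the function and threshold names are invented for illustration:

```python
# Hypothetical sketch of v2's selection between three response curves:
# aggressive low-latency when a realtime task is around, low-power under
# sustained heavy IO, and an intermediate default that avoids the v1
# idle-power regression (low C0 residency no longer implies the
# aggressive curve).

LOW_LATENCY, INTERMEDIATE, LOW_POWER = "low-latency", "intermediate", "low-power"

def select_response_curve(rt_task_scheduled, io_load_avg, io_threshold=0.8):
    """Pick a response curve from invented heuristic inputs.

    rt_task_scheduled -- a realtime-priority task is runnable on this CPU
    io_load_avg       -- smoothed 0..1 estimate of IO-bound time (invented)
    """
    if rt_task_scheduled:
        # Latency-sensitive work pending: be aggressive regardless of IO load.
        return LOW_LATENCY
    if io_load_avg > io_threshold:
        # Heavy sustained IO: higher frequency buys no throughput, save energy.
        return LOW_POWER
    # Default: intermediate curve, cheaper than low-latency while idling.
    return INTERMEDIATE
```

Again, the real heuristic inputs and thresholds are whatever the v2 patch defines; this only captures the three-state structure described above.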
On heavily > TDP-bound devices this series improves performance of virtually any > non-trivial graphics rendering by a significant amount (of the order > of the energy efficiency improvement for that workload assuming the > optimization didn't cause it to become non-TDP-bound). > > See [1]-[5] for detailed numbers including various graphics benchmarks > and a sample of the Phoronix daily-system-tracker. Some popular > graphics benchmarks like GfxBench gl_manhattan31 and gl_4 improve > between 5% and 11% on our systems. The exact improvement can vary > substantially between systems (compare the benchmark results from the > two different J3455 systems [1] and [3]) due to a number of factors, > including the ratio between CPU and GPU processing power, the behavior > of the userspace graphics driver, the windowing system and resolution, > the BIOS (which has an influence on the package TDP), the thermal > characteristics of the system, etc. > > Unigine Valley and Heaven improve by a similar factor on some systems > (see the J3455 results [1]), but on others the improvement is lower > because the benchmark fails to fully utilize the GPU, which causes the > heuristic to remain in low-latency state for longer, which leaves a > reduced TDP budget available to the GPU, which prevents performance > from increasing fur
Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
On Tue, 2018-04-10 at 15:28 -0700, Francisco Jerez wrote: > Francisco Jerez writes: > [...] > For the case anyone is wondering what's going on, Srinivas pointed me > at > a larger idle power usage increase off-list, ultimately caused by the > low-latency heuristic as discussed in the paragraph above. I have a > v2 > of PATCH 6 that gives the controller a third response curve roughly > intermediate between the low-latency and low-power states of this > revision, which avoids the energy usage increase while C0 residency > is > low (e.g. during idle) expected for v1. The low-latency behavior of > this revision is still going to be available based on a heuristic (in > particular when a realtime-priority task is scheduled). We're > carrying > out some additional testing, I'll post the code here eventually. Please try the sched-util governor also. There is a frequency-invariant patch, which I can send you (this will eventually be pushed by Peter). We want to avoid adding complexity to intel-pstate for non-HWP power-sensitive platforms as far as possible. Thanks, Srinivas > > > [Absolute benchmark results are unfortunately omitted from this > > letter > > due to company policies, but the percent change and Student's T > > p-value are included above and in the referenced benchmark results] > > > > The most obvious impact of this series will likely be the overall > > improvement in graphics performance on systems with an IGP > > integrated > > into the processor package (though for the moment this is only > > enabled > > on BXT+), because the TDP budget shared among CPU and GPU can > > frequently become a limiting factor in low-power devices. On > > heavily > > TDP-bound devices this series improves performance of virtually any > > non-trivial graphics rendering by a significant amount (of the > > order > > of the energy efficiency improvement for that workload assuming the > > optimization didn't cause it to become non-TDP-bound). 
> > > > See [1]-[5] for detailed numbers including various graphics > > benchmarks > > and a sample of the Phoronix daily-system-tracker. Some popular > > graphics benchmarks like GfxBench gl_manhattan31 and gl_4 improve > > between 5% and 11% on our systems. The exact improvement can vary > > substantially between systems (compare the benchmark results from > > the > > two different J3455 systems [1] and [3]) due to a number of > > factors, > > including the ratio between CPU and GPU processing power, the > > behavior > > of the userspace graphics driver, the windowing system and > > resolution, > > the BIOS (which has an influence on the package TDP), the thermal > > characteristics of the system, etc. > > > > Unigine Valley and Heaven improve by a similar factor on some > > systems > > (see the J3455 results [1]), but on others the improvement is lower > > because the benchmark fails to fully utilize the GPU, which causes > > the > > heuristic to remain in low-latency state for longer, which leaves a > > reduced TDP budget available to the GPU, which prevents performance > > from increasing further. This can be avoided by using the > > alternative > > heuristic parameters suggested in the commit message of PATCH 8, > > which > > provide a lower IO utilization threshold and hysteresis for the > > controller to attempt to save energy. I'm not proposing those for > > upstream (yet) because they would also increase the risk for > > latency-sensitive IO-heavy workloads to regress (like SynMark2 > > OglTerrainFly* and some arguably poorly designed IPC-bound X11 > > benchmarks). 
> > > > Discrete graphics aren't likely to experience that much of a > > visible > > improvement from this, even though many non-IGP workloads *could* > > benefit by reducing the system's energy usage while the discrete > > GPU > > (or really, any other IO device) becomes a bottleneck, but this is > > not > > attempted in this series, since that would involve making an energy > > efficiency/latency trade-off that only the maintainers of the > > respective drivers are in a position to make. The cpufreq > > interface > > introduced in PATCH 1 to achieve this is left as an opt-in for that > > reason, only the i915 DRM driver is hooked up since it will get the > > most direct pay-off due to the increased energy budget available to > > the GPU, but other power-hungry third-party gadgets built into the > > same package (*cough* AMD *cough* Mali *cough* PowerVR *cough*) may > > be > > able to benefit from this interface eventually by instrumenting the > > driver in a similar way. > > > > The cpufreq interface is not exclusively tied to the intel_pstate > > driver, because other governors can make use of the statistic > > calculated as result to avoid over-optimizing for latency in > > scenarios > > where a lower frequency would be able to achieve similar throughput > > while using less energy. The interpretation of this statistic > > relies > > on the observation that for as long as the system is CPU-bound, a
Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
Hi Srinivas, Srinivas Pandruvada writes: > On Tue, 2018-04-10 at 15:28 -0700, Francisco Jerez wrote: >> Francisco Jerez writes: >> > [...] > > >> For the case anyone is wondering what's going on, Srinivas pointed me >> at >> a larger idle power usage increase off-list, ultimately caused by the >> low-latency heuristic as discussed in the paragraph above. I have a >> v2 >> of PATCH 6 that gives the controller a third response curve roughly >> intermediate between the low-latency and low-power states of this >> revision, which avoids the energy usage increase while C0 residency >> is >> low (e.g. during idle) expected for v1. The low-latency behavior of >> this revision is still going to be available based on a heuristic (in >> particular when a realtime-priority task is scheduled). We're >> carrying >> out some additional testing, I'll post the code here eventually. > > Please try sched-util governor also. There is a frequency-invariant > patch, which I can send you (This eventually will be pushed by Peter). > We want to avoid complexity to intel-pstate for non HWP power sensitive > platforms as far as possible. > Unfortunately the schedutil governor (whether frequency invariant or not) has the exact same energy efficiency issues as the present intel_pstate non-HWP governor. Its response is severely underdamped leading to energy-inefficient behavior for any oscillating non-CPU-bound workload. To exacerbate that problem the frequency is maxed out on frequent IO waiting just like the current intel_pstate cpu-load controller does, even though the frequent IO waits may actually be an indication that the system is IO-bound (which means that the large energy usage increase may not be translated in any performance benefit in practice, not to speak of performance being impacted negatively in TDP-bound scenarios like GPU rendering). Regarding run-time complexity, I haven't observed this governor to be measurably more computationally intensive than the present one. 
It's a bunch more instructions indeed, but still within the same ballpark as the current governor. The average increase in CPU utilization on my BXT with this series is less than 0.03% (sampled via ftrace for v1, I can repeat the measurement for the v2 I have in the works, though I don't expect the result to be substantially different). If this is a problem for you there are several optimization opportunities that would cut down the number of CPU cycles get_target_pstate_lp() takes to execute by a large percent (most of the optimization ideas I can think of right now though would come at some accuracy/maintainability/debuggability cost, but may still be worth pursuing), but the computational overhead is low enough at this point that the impact on any benchmark or real workload would be orders of magnitude lower than its variance, which makes it kind of difficult to keep the discussion data-driven [as possibly any performance optimization discussion should ever be ;)]. > > Thanks, > Srinivas > > > >> >> > [Absolute benchmark results are unfortunately omitted from this >> > letter >> > due to company policies, but the percent change and Student's T >> > p-value are included above and in the referenced benchmark results] >> > >> > The most obvious impact of this series will likely be the overall >> > improvement in graphics performance on systems with an IGP >> > integrated >> > into the processor package (though for the moment this is only >> > enabled >> > on BXT+), because the TDP budget shared among CPU and GPU can >> > frequently become a limiting factor in low-power devices. On >> > heavily >> > TDP-bound devices this series improves performance of virtually any >> > non-trivial graphics rendering by a significant amount (of the >> > order >> > of the energy efficiency improvement for that workload assuming the >> > optimization didn't cause it to become non-TDP-bound). 
>> > >> > See [1]-[5] for detailed numbers including various graphics >> > benchmarks >> > and a sample of the Phoronix daily-system-tracker. Some popular >> > graphics benchmarks like GfxBench gl_manhattan31 and gl_4 improve >> > between 5% and 11% on our systems. The exact improvement can vary >> > substantially between systems (compare the benchmark results from >> > the >> > two different J3455 systems [1] and [3]) due to a number of >> > factors, >> > including the ratio between CPU and GPU processing power, the >> > behavior >> > of the userspace graphics driver, the windowing system and >> > resolution, >> > the BIOS (which has an influence on the package TDP), the thermal >> > characteristics of the system, etc. >> > >> > Unigine Valley and Heaven improve by a similar factor on some >> > systems >> > (see the J3455 results [1]), but on others the improvement is lower >> > because the benchmark fails to fully utilize the GPU, which causes >> > the >> > heuristic to remain in low-latency state for longer, which leaves a >> > reduced TDP budget available to th
Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
Francisco Jerez writes: > Hi Srinivas, > > Srinivas Pandruvada writes: > >> On Tue, 2018-04-10 at 15:28 -0700, Francisco Jerez wrote: >>> Francisco Jerez writes: >>> >> [...] >> >> >>> For the case anyone is wondering what's going on, Srinivas pointed me >>> at >>> a larger idle power usage increase off-list, ultimately caused by the >>> low-latency heuristic as discussed in the paragraph above. I have a >>> v2 >>> of PATCH 6 that gives the controller a third response curve roughly >>> intermediate between the low-latency and low-power states of this >>> revision, which avoids the energy usage increase while C0 residency >>> is >>> low (e.g. during idle) expected for v1. The low-latency behavior of >>> this revision is still going to be available based on a heuristic (in >>> particular when a realtime-priority task is scheduled). We're >>> carrying >>> out some additional testing, I'll post the code here eventually. >> >> Please try sched-util governor also. There is a frequency-invariant >> patch, which I can send you (This eventually will be pushed by Peter). >> We want to avoid complexity to intel-pstate for non HWP power sensitive >> platforms as far as possible. >> > > Unfortunately the schedutil governor (whether frequency invariant or > not) has the exact same energy efficiency issues as the present > intel_pstate non-HWP governor. Its response is severely underdamped > leading to energy-inefficient behavior for any oscillating non-CPU-bound > workload. To exacerbate that problem the frequency is maxed out on > frequent IO waiting just like the current intel_pstate cpu-load "just like" here is possibly somewhat unfair to the schedutil governor, admittedly its progressive IOWAIT boosting behavior seems somewhat less wasteful than the intel_pstate non-HWP governor's IOWAIT boosting behavior, but it's still largely unhelpful on IO-bound conditions. 
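The contrast drawn here between schedutil's progressive IOWAIT boosting and the intel_pstate cpu-load governor's jump to maximum can be sketched as follows. The doubling/halving shape matches schedutil's documented behavior, but the constants and function names are illustrative, not kernel code:

```python
# Hedged sketch: schedutil ramps its iowait boost progressively (doubling on
# consecutive iowait wakeups, halving on decay), while the intel_pstate
# cpu-load governor effectively boosts straight to the maximum on IO wait.

MAX_BOOST = 1024           # illustrative capacity scale
MIN_BOOST = MAX_BOOST // 8 # illustrative initial boost step

def schedutil_boost(prev_boost, woke_from_iowait):
    """Progressive boost: double per consecutive iowait wakeup, else decay."""
    if woke_from_iowait:
        return min(max(2 * prev_boost, MIN_BOOST), MAX_BOOST)
    return prev_boost // 2

def pstate_boost(prev_boost, woke_from_iowait):
    """Boost-to-max behavior: jump straight to the ceiling on iowait."""
    return MAX_BOOST if woke_from_iowait else 0

boost = 0
for step in range(4):
    boost = schedutil_boost(boost, True)
    print(step, boost)  # boost climbs 128, 256, 512, 1024
```

Under sustained IO-bound load both schemes end up pinned at the maximum, which is why the progressive ramp, while less wasteful, is still largely unhelpful in that regime.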
> controller does, even though the frequent IO waits may actually be an > indication that the system is IO-bound (which means that the large > energy usage increase may not be translated in any performance benefit > in practice, not to speak of performance being impacted negatively in > TDP-bound scenarios like GPU rendering). > > Regarding run-time complexity, I haven't observed this governor to be > measurably more computationally intensive than the present one. It's a > bunch more instructions indeed, but still within the same ballpark as > the current governor. The average increase in CPU utilization on my BXT > with this series is less than 0.03% (sampled via ftrace for v1, I can > repeat the measurement for the v2 I have in the works, though I don't > expect the result to be substantially different). If this is a problem > for you there are several optimization opportunities that would cut down > the number of CPU cycles get_target_pstate_lp() takes to execute by a > large percent (most of the optimization ideas I can think of right now > though would come at some accuracy/maintainability/debuggability cost, > but may still be worth pursuing), but the computational overhead is low > enough at this point that the impact on any benchmark or real workload > would be orders of magnitude lower than its variance, which makes it > kind of difficult to keep the discussion data-driven [as possibly any > performance optimization discussion should ever be ;)]. 
> >> >> Thanks, >> Srinivas >> >> >> >>> >>> > [Absolute benchmark results are unfortunately omitted from this >>> > letter >>> > due to company policies, but the percent change and Student's T >>> > p-value are included above and in the referenced benchmark results] >>> > >>> > The most obvious impact of this series will likely be the overall >>> > improvement in graphics performance on systems with an IGP >>> > integrated >>> > into the processor package (though for the moment this is only >>> > enabled >>> > on BXT+), because the TDP budget shared among CPU and GPU can >>> > frequently become a limiting factor in low-power devices. On >>> > heavily >>> > TDP-bound devices this series improves performance of virtually any >>> > non-trivial graphics rendering by a significant amount (of the >>> > order >>> > of the energy efficiency improvement for that workload assuming the >>> > optimization didn't cause it to become non-TDP-bound). >>> > >>> > See [1]-[5] for detailed numbers including various graphics >>> > benchmarks >>> > and a sample of the Phoronix daily-system-tracker. Some popular >>> > graphics benchmarks like GfxBench gl_manhattan31 and gl_4 improve >>> > between 5% and 11% on our systems. The exact improvement can vary >>> > substantially between systems (compare the benchmark results from >>> > the >>> > two different J3455 systems [1] and [3]) due to a number of >>> > factors, >>> > including the ratio between CPU and GPU processing power, the >>> > behavior >>> > of the userspace graphics driver, the windowing system and >>> > resolution, >>> > the BIOS (which has an in
Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
On Wed, 2018-04-11 at 09:26 -0700, Francisco Jerez wrote: > > "just like" here is possibly somewhat unfair to the schedutil > governor, > admittedly its progressive IOWAIT boosting behavior seems somewhat > less > wasteful than the intel_pstate non-HWP governor's IOWAIT boosting > behavior, but it's still largely unhelpful on IO-bound conditions. > OK, if you think so, then improve it for the sched-util governor or other mechanisms (as Juri suggested) instead of intel-pstate. This will benefit all architectures including x86 + non-i915. BTW intel-pstate can be driven by the sched-util governor (passive mode), so if you prove benefits on Broxton, this can be the default. As before:
- No regression to idle power at all. This is more important than benchmarks.
- Not just score; performance/watt is important.
Thanks, Srinivas > > controller does, even though the frequent IO waits may actually be > > an > > indication that the system is IO-bound (which means that the large > > energy usage increase may not be translated in any performance > > benefit > > in practice, not to speak of performance being impacted negatively > > in > > TDP-bound scenarios like GPU rendering). > > > > Regarding run-time complexity, I haven't observed this governor to > > be > > measurably more computationally intensive than the present > > one. It's a > > bunch more instructions indeed, but still within the same ballpark > > as > > the current governor. The average increase in CPU utilization on > > my BXT > > with this series is less than 0.03% (sampled via ftrace for v1, I > > can > > repeat the measurement for the v2 I have in the works, though I > > don't > > expect the result to be substantially different). 
If this is a > > problem > > for you there are several optimization opportunities that would cut > > down > > the number of CPU cycles get_target_pstate_lp() takes to execute by > > a > > large percent (most of the optimization ideas I can think of right > > now > > though would come at some accuracy/maintainability/debuggability > > cost, > > but may still be worth pursuing), but the computational overhead is > > low > > enough at this point that the impact on any benchmark or real > > workload > > would be orders of magnitude lower than its variance, which makes > > it > > kind of difficult to keep the discussion data-driven [as possibly > > any > > performance optimization discussion should ever be ;)]. > > > > > > > > Thanks, > > > Srinivas > > > > > > > > > > > > > > > > > > [Absolute benchmark results are unfortunately omitted from > > > > > this > > > > > letter > > > > > due to company policies, but the percent change and Student's > > > > > T > > > > > p-value are included above and in the referenced benchmark > > > > > results] > > > > > > > > > > The most obvious impact of this series will likely be the > > > > > overall > > > > > improvement in graphics performance on systems with an IGP > > > > > integrated > > > > > into the processor package (though for the moment this is > > > > > only > > > > > enabled > > > > > on BXT+), because the TDP budget shared among CPU and GPU can > > > > > frequently become a limiting factor in low-power devices. On > > > > > heavily > > > > > TDP-bound devices this series improves performance of > > > > > virtually any > > > > > non-trivial graphics rendering by a significant amount (of > > > > > the > > > > > order > > > > > of the energy efficiency improvement for that workload > > > > > assuming the > > > > > optimization didn't cause it to become non-TDP-bound). 
> > > > > > > > > > See [1]-[5] for detailed numbers including various graphics > > > > > benchmarks > > > > > and a sample of the Phoronix daily-system-tracker. Some > > > > > popular > > > > > graphics benchmarks like GfxBench gl_manhattan31 and gl_4 > > > > > improve > > > > > between 5% and 11% on our systems. The exact improvement can > > > > > vary > > > > > substantially between systems (compare the benchmark results > > > > > from > > > > > the > > > > > two different J3455 systems [1] and [3]) due to a number of > > > > > factors, > > > > > including the ratio between CPU and GPU processing power, the > > > > > behavior > > > > > of the userspace graphics driver, the windowing system and > > > > > resolution, > > > > > the BIOS (which has an influence on the package TDP), the > > > > > thermal > > > > > characteristics of the system, etc. > > > > > > > > > > Unigine Valley and Heaven improve by a similar factor on some > > > > > systems > > > > > (see the J3455 results [1]), but on others the improvement is > > > > > lower > > > > > because the benchmark fails to fully utilize the GPU, which > > > > > causes > > > > > the > > > > > heuristic to remain in low-latency state for longer, which > > > > > leaves a > > > > > reduced TDP budget available to the GPU, which prevents > > > > > performance > > > > > from increasing further. This can be avoided by using the > > > > > alternative > > > > > heuri
Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
On Wed, Apr 11, 2018 at 09:26:11AM -0700, Francisco Jerez wrote: > "just like" here is possibly somewhat unfair to the schedutil governor, > admittedly its progressive IOWAIT boosting behavior seems somewhat less > wasteful than the intel_pstate non-HWP governor's IOWAIT boosting > behavior, but it's still largely unhelpful on IO-bound conditions. So you understand why we need the iowait boosting right? It is just that when we get back to runnable, we want to process the next data packet ASAP. See also here: https://lkml.kernel.org/r/20170522082154.f57cqovterd2q...@hirez.programming.kicks-ass.net What I don't really understand is why it is costing so much power; after all, when we're in iowait the CPU is mostly idle and can power-gate. ___ Intel-gfx mailing list Intel-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
Peter Zijlstra writes: > On Wed, Apr 11, 2018 at 09:26:11AM -0700, Francisco Jerez wrote: >> "just like" here is possibly somewhat unfair to the schedutil governor, >> admittedly its progressive IOWAIT boosting behavior seems somewhat less >> wasteful than the intel_pstate non-HWP governor's IOWAIT boosting >> behavior, but it's still largely unhelpful on IO-bound conditions. > > So you understand why we need the iowait boosting right? > Yeah, sort of. The latency-minimizing state of this governor provides a comparable effect, but it's based on a pessimistic estimate of the frequency required for the workload to achieve maximum throughput (rather than a plain or exponential boost up to the max frequency which can substantially deviate from that frequency, see the explanation in PATCH 6 for more details). It's enabled under conditions partially overlapping but not identical to iowait boosting: The optimization is not applied under IO-bound conditions (in order to avoid impacting energy efficiency negatively for zero or negative payoff), OTOH the optimization is applied in some cases where the current governor wouldn't, like RT-priority threads (that's the main difference with v2 I'm planning to send out next week). > It is just that when we get back to runnable, we want to process the > next data packet ASAP. See also here: > > > https://lkml.kernel.org/r/20170522082154.f57cqovterd2q...@hirez.programming.kicks-ass.net > > What I don't really understand is why it is costing so much power; after > all, when we're in iowait the CPU is mostly idle and can power-gate. The reason for the energy efficiency problem of iowait boosting is precisely the greater oscillation between turbo and idle. Say that iowait boost increases the frequency by a factor alpha relative to the optimal frequency f0 (in terms of energy efficiency) required to execute some IO-bound workload. 
This will cause the CPU to be busy for a fraction of the time it was busy originally, approximately t1 = t0 / alpha, which indeed divides the overall energy usage by a factor alpha, but at the same time multiplies the instantaneous power consumption while busy by a factor potentially much greater than alpha, since the CPU's power curve is largely non-linear, and in fact approximately convex within the frequency range allowed by the policy, so you get an average energy usage possibly much greater than the optimal.
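The convexity argument can be checked with a toy model (the cubic power curve is an illustrative textbook approximation, not a measured curve):

```python
# Energy to retire a fixed amount of work under a convex power curve.
# P(f) = c * f**3 is an illustrative model, not a measured curve; units are
# arbitrary (f in GHz, work in billions of cycles, c normalized to 1).

def busy_energy(f, work=1.0, c=1.0):
    """Energy to execute `work` at frequency f."""
    t_busy = work / f      # busy time shrinks as f grows...
    power = c * f ** 3     # ...but instantaneous power grows faster
    return power * t_busy  # = c * work * f**2, strictly increasing in f

f0, alpha = 1.0, 2.0       # optimal frequency and iowait-boost factor
print(busy_energy(alpha * f0) / busy_energy(f0))  # -> 4.0: half the busy
                                                  # time, four times the energy
```

With any convex power curve the f**2-like growth of busy energy dominates the alpha-fold reduction in busy time, which is exactly the oscillation penalty described above.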
Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
On Thu, Apr 12, 2018 at 11:34:54AM -0700, Francisco Jerez wrote: > The reason for the energy efficiency problem of iowait boosting is > precisely the greater oscillation between turbo and idle. Say that > iowait boost increases the frequency by a factor alpha relative to the > optimal frequency f0 (in terms of energy efficiency) required to execute > some IO-bound workload. This will cause the CPU to be busy for a > fraction of the time it was busy originally, approximately t1 = t0 / > alpha, which indeed divides the overall energy usage by a factor alpha, > but at the same time multiplies the instantaneous power consumption > while busy by a factor potentially much greater than alpha, since the > CPU's power curve is largely non-linear, and in fact approximately > convex within the frequency range allowed by the policy, so you get an > average energy usage possibly much greater than the optimal. Ah, but we don't (yet) have the (normalized) power curves, so we cannot make that call. Once we have the various energy/OPP numbers required for EAS we can compute the optimal. I think such was even mentioned in the thread I referred earlier. Until such time; we boost to max for lack of a better option.
Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
Peter Zijlstra writes: > On Thu, Apr 12, 2018 at 11:34:54AM -0700, Francisco Jerez wrote: >> The reason for the energy efficiency problem of iowait boosting is >> precisely the greater oscillation between turbo and idle. Say that >> iowait boost increases the frequency by a factor alpha relative to the >> optimal frequency f0 (in terms of energy efficiency) required to execute >> some IO-bound workload. This will cause the CPU to be busy for a >> fraction of the time it was busy originally, approximately t1 = t0 / >> alpha, which indeed divides the overall energy usage by a factor alpha, >> but at the same time multiplies the instantaneous power consumption >> while busy by a factor potentially much greater than alpha, since the >> CPU's power curve is largely non-linear, and in fact approximately >> convex within the frequency range allowed by the policy, so you get an >> average energy usage possibly much greater than the optimal. > > Ah, but we don't (yet) have the (normalized) power curves, so we cannot > make that call. > > Once we have the various energy/OPP numbers required for EAS we can > compute the optimal. I think such was even mentioned in the thread I > referred earlier. > > Until such time; we boost to max for lack of a better option. Actually assuming that a single geometric feature of the power curve is known -- it being convex in the frequency range allowed by the policy (which is almost always the case, not only for Intel CPUs), the optimal frequency for an IO-bound workload is fully independent of the exact power curve -- It's just the minimum CPU frequency that's able to keep the bottlenecking IO device at 100% utilization. Any frequency higher than that will lead to strictly lower energy efficiency whatever the exact form of the power curve is. I agree though that exact knowledge of the power curve *might* be useful as a mechanism to estimate the potential costs of exceeding that optimal frequency (e.g. 
as a mechanism to offset performance loss heuristically for the case where the workload fluctuates by giving the governor an upward bias with an approximately known energy cost), but that's not required for the governor's behavior to be approximately optimal in IO-bound conditions. Not making further assumptions about the power curve beyond its convexity makes the algorithm fairly robust against any inaccuracy in the power curve numbers (which there will always be, since the energy efficiency of the workload is really dependent on the behavior of multiple components of the system interacting with each other), and makes it easily reusable on platforms where the exact power curves are not known.
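Under the convexity assumption the optimal choice reduces to a one-liner; a sketch with hypothetical numbers for the IO device's peak rate and the per-request CPU cost:

```python
# Sketch: the energy-optimal frequency for an IO-bound pipeline is the
# slowest one that still keeps the bottlenecking IO device saturated.
# The numbers are hypothetical; only convexity of the power curve is assumed.

def optimal_frequency(io_items_per_sec, cpu_cycles_per_item, f_min, f_max):
    """Minimum frequency (Hz) sustaining the IO rate, clamped to the policy."""
    f_needed = io_items_per_sec * cpu_cycles_per_item
    return min(max(f_needed, f_min), f_max)

# An IO device caps out at 100k requests/s, each costing ~20k CPU cycles:
f = optimal_frequency(100e3, 20e3, f_min=400e6, f_max=2.4e9)
print(f / 1e9)  # -> 2.0 (GHz): any higher frequency only adds idle time
                # at a higher instantaneous power
```

Note that no power-curve coefficients appear anywhere: convexity alone guarantees that running above this frequency is strictly less energy-efficient.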
Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
Juri Lelli writes: > Hi, > > On 11/04/18 09:26, Francisco Jerez wrote: >> Francisco Jerez writes: >> >> > Hi Srinivas, >> > >> > Srinivas Pandruvada writes: >> > >> >> On Tue, 2018-04-10 at 15:28 -0700, Francisco Jerez wrote: >> >>> Francisco Jerez writes: >> >>> >> >> [...] >> >> >> >> >> >>> For the case anyone is wondering what's going on, Srinivas pointed me >> >>> at >> >>> a larger idle power usage increase off-list, ultimately caused by the >> >>> low-latency heuristic as discussed in the paragraph above. I have a >> >>> v2 >> >>> of PATCH 6 that gives the controller a third response curve roughly >> >>> intermediate between the low-latency and low-power states of this >> >>> revision, which avoids the energy usage increase while C0 residency >> >>> is >> >>> low (e.g. during idle) expected for v1. The low-latency behavior of >> >>> this revision is still going to be available based on a heuristic (in >> >>> particular when a realtime-priority task is scheduled). We're >> >>> carrying >> >>> out some additional testing, I'll post the code here eventually. >> >> >> >> Please try sched-util governor also. There is a frequency-invariant >> >> patch, which I can send you (This eventually will be pushed by Peter). >> >> We want to avoid complexity to intel-pstate for non HWP power sensitive >> >> platforms as far as possible. >> >> >> > >> > Unfortunately the schedutil governor (whether frequency invariant or >> > not) has the exact same energy efficiency issues as the present >> > intel_pstate non-HWP governor. Its response is severely underdamped >> > leading to energy-inefficient behavior for any oscillating non-CPU-bound >> > workload. 
To exacerbate that problem the frequency is maxed out on >> > frequent IO waiting just like the current intel_pstate cpu-load >> >> "just like" here is possibly somewhat unfair to the schedutil governor, >> admittedly its progressive IOWAIT boosting behavior seems somewhat less >> wasteful than the intel_pstate non-HWP governor's IOWAIT boosting >> behavior, but it's still largely unhelpful on IO-bound conditions. > > Sorry if I jump in out of the blue, but what you are trying to solve > looks very similar to what IPA [1] is targeting as well. I might be > wrong (I'll try to spend more time reviewing your set), but my first > impression is that we should try to solve similar problems with a more > general approach that could benefit different sys/archs. > Thanks, seems interesting, I've also been taking a look at your whitepaper and source code. The problem we've both been trying to solve is indeed closely related, there may be an opportunity for sharing efforts both ways. Correct me if I didn't understand the whole details about your power allocation code, but IPA seems to be dividing up the available power budget proportionally to the power requested by the different actors (up to the point that causes some actor to reach its maximum power) and configured weights. From my understanding of the get_requested_power implementations for cpufreq and devfreq, the requested power attempts to approximate the current power usage of each device (whether it's estimated from the current frequency and a capacitance model, from the get_real_power callback, or other mechanism), which can be far from the optimal power consumption in cases where the device's governor is programming a frequency that wildly deviates from the optimal one (as is the case with the current intel_pstate governor for any IO-bound workload, which incidentally will suffer the greatest penalty from a suboptimal power allocation in cases where the IO device is actually an integrated GPU). 
Is there any mechanism in place to prevent the system from stabilizing at a power allocation that prevents it from achieving maximum throughput? E.g. in a TDP-limited system with two devices consuming a total power of Pmax = P0(f0) + P1(f1), with f0 much greater than the optimal, and f1 capped at a frequency lower than the optimal due to TDP or thermal constraints, and assuming that the system is bottlenecking at the second device. In such a scenario wouldn't IPA distribute power in a way that roughly approximates the pre-existing suboptimal distribution? If that's the case, I understand that it's the responsibility of the device's (or CPU's) frequency governor to request a frequency which is reasonably energy-efficient in the first place for the balancer to function correctly? (That's precisely the goal of this series) -- Which in addition allows the system to use less power to get the same work done in cases where the system is not thermally or TDP-limited as a whole, so the balancing logic wouldn't have any effect at all. > I'm Cc-ing some Arm folks... > > Best, > > - Juri > > [1] https://developer.arm.com/open-source/intelligent-power-allocation
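The scenario above can be made concrete with a toy proportional balancer (a drastic simplification of IPA's allocation step: no PID control, no weights, and all numbers are hypothetical):

```python
# Toy proportional power balancer, loosely modeled on IPA's allocation step
# (drastically simplified: no PID control, no weights; numbers hypothetical).

def allocate(budget_w, requests_w):
    """Grant each actor power proportional to its request, within the budget."""
    total = sum(requests_w)
    scale = min(1.0, budget_w / total)
    return [r * scale for r in requests_w]

# The CPU governor over-requests (running far above its optimal frequency)
# while the bottlenecking GPU is starved; a proportional split preserves
# that skew rather than correcting it:
print(allocate(6.0, [4.0, 4.0]))  # -> [3.0, 3.0]
# If the CPU governor requested a sane 1.5 W instead, the GPU could get the
# power it needs within the same 6 W budget:
print(allocate(6.0, [1.5, 4.0]))  # -> [1.5, 4.0]
```

This is the sense in which the balancer depends on each device's governor requesting a reasonably efficient operating point in the first place.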
Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
Hi, On 11/04/18 09:26, Francisco Jerez wrote: > Francisco Jerez writes: > > > Hi Srinivas, > > > > Srinivas Pandruvada writes: > > > >> On Tue, 2018-04-10 at 15:28 -0700, Francisco Jerez wrote: > >>> Francisco Jerez writes: > >>> > >> [...] > >> > >> > >>> For the case anyone is wondering what's going on, Srinivas pointed me > >>> at > >>> a larger idle power usage increase off-list, ultimately caused by the > >>> low-latency heuristic as discussed in the paragraph above. I have a > >>> v2 > >>> of PATCH 6 that gives the controller a third response curve roughly > >>> intermediate between the low-latency and low-power states of this > >>> revision, which avoids the energy usage increase while C0 residency > >>> is > >>> low (e.g. during idle) expected for v1. The low-latency behavior of > >>> this revision is still going to be available based on a heuristic (in > >>> particular when a realtime-priority task is scheduled). We're > >>> carrying > >>> out some additional testing, I'll post the code here eventually. > >> > >> Please try sched-util governor also. There is a frequency-invariant > >> patch, which I can send you (This eventually will be pushed by Peter). > >> We want to avoid complexity to intel-pstate for non HWP power sensitive > >> platforms as far as possible. > >> > > > > Unfortunately the schedutil governor (whether frequency invariant or > > not) has the exact same energy efficiency issues as the present > > intel_pstate non-HWP governor. Its response is severely underdamped > > leading to energy-inefficient behavior for any oscillating non-CPU-bound > > workload. 
To exacerbate that problem the frequency is maxed out on > > frequent IO waiting just like the current intel_pstate cpu-load > > "just like" here is possibly somewhat unfair to the schedutil governor, > admittedly its progressive IOWAIT boosting behavior seems somewhat less > wasteful than the intel_pstate non-HWP governor's IOWAIT boosting > behavior, but it's still largely unhelpful on IO-bound conditions. Sorry if I jump in out of the blue, but what you are trying to solve looks very similar to what IPA [1] is targeting as well. I might be wrong (I'll try to spend more time reviewing your set), but my first impression is that we should try to solve similar problems with a more general approach that could benefit different sys/archs. I'm Cc-ing some Arm folks... Best, - Juri [1] https://developer.arm.com/open-source/intelligent-power-allocation
Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
On Thu, Apr 12, 2018 at 12:55:39PM -0700, Francisco Jerez wrote: > Actually assuming that a single geometric feature of the power curve is > known -- it being convex in the frequency range allowed by the policy > (which is almost always the case, not only for Intel CPUs), the optimal > frequency for an IO-bound workload is fully independent of the exact > power curve -- It's just the minimum CPU frequency that's able to keep > the bottlenecking IO device at 100% utilization. I think that is difficult to determine with the information at hand. We have lost all device information by the time we reach the scheduler.
Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
Peter Zijlstra writes: > On Thu, Apr 12, 2018 at 12:55:39PM -0700, Francisco Jerez wrote: >> Actually assuming that a single geometric feature of the power curve is >> known -- it being convex in the frequency range allowed by the policy >> (which is almost always the case, not only for Intel CPUs), the optimal >> frequency for an IO-bound workload is fully independent of the exact >> power curve -- It's just the minimum CPU frequency that's able to keep >> the bottlenecking IO device at 100% utilization. > > I think that is difficult to determine with the information at hand. We > have lost all device information by the time we reach the scheduler. I assume you mean it's difficult to tell whether the workload is CPU-bound or IO-bound? Yeah, it's non-trivial to determine whether the system is bottlenecking on IO; it requires additional infrastructure to keep track of IO utilization (that's the purpose of PATCH 1), and even then it involves some heuristic assumptions which are not guaranteed to be fail-proof, so the controller needs to be prepared for things to behave reasonably when the assumptions deviate from reality (see the comments in PATCH 6 for more details on what happens in such cases) -- How frequently that happens in practice is what determines how far the controller's response will be from the optimally energy-efficient behavior in a real workload. It seems to work fairly well in practice, at least in the sample of test-cases I've been able to gather data from so far. Anyway that's the difficult part. Once (if) you know you're IO-bound, determining the optimal (most energy-efficient) CPU frequency is relatively straightforward, and doesn't require knowledge of the exact power curve of the CPU (beyond clamping the controller response to the convexity region of the power curve).
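The IO utilization statistic mentioned here (PATCH 1 itself is not shown in this thread) might amount to something like the following sketch -- a guess at the idea rather than the actual implementation: track the fraction of wall time with at least one IO request in flight.

```python
# Guess at the idea behind an IO utilization statistic (the actual PATCH 1
# is not reproduced in this thread): the fraction of wall time during which
# at least one IO request was in flight.

class IOUtilization:
    def __init__(self):
        self.inflight = 0      # IO requests currently outstanding
        self.busy_time = 0.0   # accumulated time with inflight > 0
        self.last_ts = 0.0

    def _account(self, now):
        # Charge the elapsed interval as IO-busy if anything was in flight.
        if self.inflight > 0:
            self.busy_time += now - self.last_ts
        self.last_ts = now

    def io_start(self, now):
        self._account(now)
        self.inflight += 1

    def io_done(self, now):
        self._account(now)
        self.inflight -= 1

    def utilization(self, window_start, now):
        self._account(now)
        return self.busy_time / (now - window_start)

u = IOUtilization()
u.io_start(0.0); u.io_done(0.25)   # device busy for 0.25 s
u.io_start(0.5); u.io_done(0.75)   # busy for another 0.25 s
print(u.utilization(0.0, 1.0))     # -> 0.5: IO-bound half of the window
```

A statistic like this is what would let a governor distinguish "frequent IO waits because the workload ping-pongs" from "the IO device is saturated", which is the heuristic distinction the reply above is describing.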
Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
Hi Srinivas, Srinivas Pandruvada writes: > On Wed, 2018-04-11 at 09:26 -0700, Francisco Jerez wrote: >> >> "just like" here is possibly somewhat unfair to the schedutil >> governor, >> admittedly its progressive IOWAIT boosting behavior seems somewhat >> less >> wasteful than the intel_pstate non-HWP governor's IOWAIT boosting >> behavior, but it's still largely unhelpful on IO-bound conditions. >> > > OK, if you think so, then improve it for sched-util governor or other > mechanisms (as Juri suggested) instead of intel-pstate. You may not have realized but this series provides a full drop-in replacement for the current non-HWP governor of the intel_pstate driver, it should be strictly superior to the current cpu-load governor in terms of energy usage and performance under most scenarios (hold on for v2 for the idle consumption issue). Main reason it's implemented as a separate governor currently is for us to be able to deploy it on BXT+ platforms only for the moment, in order to decrease our initial validation effort and get enough test coverage on BXT (which is incidentally the platform that's going to get the greatest payoff) during a few release cycles. Are you no longer interested in improving those aspects of the non-HWP governor? Is it that you're planning to delete it and move back to a generic cpufreq governor for non-HWP platforms in the near future? > This will benefit all architectures including x86 + non i915. > The current design encourages re-use of the IO utilization statistic (see PATCH 1) by other governors as a mechanism driving the trade-off between energy efficiency and responsiveness based on whether the system is close to CPU-bound, in whatever way is applicable to each governor (e.g. it would make sense for it to be hooked up to the EPP preference knob in the case of the intel_pstate HWP governor, which would allow it to achieve better energy efficiency in IO-bound situations just like this series does for non-HWP parts). 
There's nothing really x86- nor i915-specific about it. > BTW intel-pstate can be driven by sched-util governor (passive mode), > so if you prove benefits to Broxton, this can be a default. > As before: > - No regression to idle power at all. This is more important than > benchmarks > - Not just score, performance/watt is important > Is schedutil actually on par with the intel_pstate non-HWP governor as of today, according to these metrics and the overall benchmark numbers? > Thanks, > Srinivas [...]
Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
Hi Francisco, [...] > Are you no longer interested in improving those aspects of the non-HWP > governor? Is it that you're planning to delete it and move back to a > generic cpufreq governor for non-HWP platforms in the near future? Yes, that is the plan for Atom platforms, which are the only non-HWP platforms till now. You have to show good gains in performance and performance/watt to carry and maintain such a big change. So we have to see your performance and power numbers. > > This will benefit all architectures including x86 + non-i915. > The current design encourages re-use of the IO utilization statistic > (see PATCH 1) by other governors as a mechanism driving the trade-off > between energy efficiency and responsiveness based on whether the system > is close to CPU-bound, in whatever way is applicable to each governor > (e.g. it would make sense for it to be hooked up to the EPP preference > knob in the case of the intel_pstate HWP governor, which would allow it > to achieve better energy efficiency in IO-bound situations just like > this series does for non-HWP parts). There's nothing really x86- nor > i915-specific about it. > > BTW intel-pstate can be driven by sched-util governor (passive mode), > > so if you prove benefits to Broxton, this can be a default. > > As before: > > - No regression to idle power at all. This is more important than > > benchmarks > > - Not just score, performance/watt is important > Is schedutil actually on par with the intel_pstate non-HWP governor as > of today, according to these metrics and the overall benchmark numbers? Yes, except for a few cases. I have not tested recently, so it may be better.
Thanks, Srinivas [...]
Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
On Fri, Apr 13, 2018 at 06:57:39PM -0700, Francisco Jerez wrote: > Peter Zijlstra writes: > > > On Thu, Apr 12, 2018 at 12:55:39PM -0700, Francisco Jerez wrote: > >> Actually assuming that a single geometric feature of the power curve is > >> known -- it being convex in the frequency range allowed by the policy > >> (which is almost always the case, not only for Intel CPUs), the optimal > >> frequency for an IO-bound workload is fully independent of the exact > >> power curve -- It's just the minimum CPU frequency that's able to keep > >> the bottlenecking IO device at 100% utilization. > > > > I think that is difficult to determine with the information at hand. We > > have lost all device information by the time we reach the scheduler. > > I assume you mean it's difficult to tell whether the workload is > CPU-bound or IO-bound? Yeah, it's non-trivial to determine whether the > system is bottlenecking on IO, it requires additional infrastructure to > keep track of IO utilization (that's the purpose of PATCH 1), and even Note that I've not actually seen any of your patches; I got Cc'ed later on.
Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
Hi, On 14.04.2018 07:01, Srinivas Pandruvada wrote: > Hi Francisco, [...] >> Are you no longer interested in improving those aspects of the non-HWP >> governor? Is it that you're planning to delete it and move back to a >> generic cpufreq governor for non-HWP platforms in the near future? > Yes, that is the plan for Atom platforms, which are the only non-HWP > platforms till now. You have to show good gains in performance and > performance/watt to carry and maintain such a big change. So we have to > see your performance and power numbers. For the active cases, you can look at the links at the beginning / bottom of this mail thread. Francisco provided performance results for >100 benchmarks. On this side of the Atlantic, we've been testing different versions of the patchset over the past few months for >50 Linux 3D benchmarks on 6 different platforms. On Geminilake and a few BXT configurations (where 3D benchmarks are TDP limited), many tests' performance improves by 5-15%, including complex ones. And more importantly, there were no regressions. (You can see details + links to more info in Jira ticket VIZ-12078.) *On (fully) TDP limited cases, power usage (obviously) stays the same, so performance/watt improvements can be derived from the measured performance improvements.* We have data also for earlier platforms from slightly older versions of the patchset, but on those it didn't have any significant impact on performance. I think the main reason for this is that the BYT & BSW NUCs that we have only have space for a single memory module. Without a dual-channel memory configuration, benchmarks are too memory-bottlenecked to utilize the GPU enough to make things TDP limited on those platforms. However, now that I look at the old BYT & BSW data (for the few benchmarks which improved most on BXT & GLK), I see that there's a reduction in the CPU power utilization according to RAPL, at least on BSW. - Eero [...]
Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
On Mon, 2018-04-16 at 17:04 +0300, Eero Tamminen wrote:
> Hi,
>
> On 14.04.2018 07:01, Srinivas Pandruvada wrote:
> > Hi Francisco,
> >
> > [...]
> >
> > > Are you no longer interested in improving those aspects of the
> > > non-HWP governor? Is it that you're planning to delete it and
> > > move back to a generic cpufreq governor for non-HWP platforms in
> > > the near future?
> >
> > Yes, that is the plan for Atom platforms, which are the only
> > non-HWP platforms till now. You have to show good gain for
> > performance and performance/watt to carry and maintain such a big
> > change. So we have to see your performance and power numbers.
>
> For the active cases, you can look at the links at the beginning /
> bottom of this mail thread. Francisco provided performance results
> for >100 benchmarks.

Looks like you didn't test the idle cases, which are more important.
Systems will tend to be more idle (increased +50% by the patches).
Once you fix the idle power, you have to retest, and then the results
will be interesting. Once you fix this, then it is pure algorithm;
whether it is done in intel-pstate or the sched-util governor is not a
big difference. It is better to do it in sched-util, as this will
benefit all architectures and will get better test coverage and
maintenance.

Thanks,
Srinivas

> At this side of the Atlantic, we've been testing different versions
> of the patchset in the past few months for >50 Linux 3D benchmarks
> on 6 different platforms.
>
> On Geminilake and a few BXT configurations (where 3D benchmarks are
> TDP limited), many tests' performance improves by 5-15%, also
> complex ones. And more importantly, there were no regressions.
>
> (You can see details + links to more info in Jira ticket VIZ-12078.)
> *On (fully) TDP limited cases, power usage (obviously) keeps the
> same, so performance/watt improvements can be derived from the
> measured performance improvements.*
>
> We have data also for earlier platforms from slightly older versions
> of the patchset, but on those it didn't have any significant impact
> on performance.
>
> I think the main reason for this is that the BYT & BSW NUCs that we
> have, have space only for a single memory module. Without a
> dual-channel memory configuration, benchmarks are too
> memory-bottlenecked to utilize the GPU enough to make things TDP
> limited on those platforms.
>
> However, now that I look at the old BYT & BSW data (for the few
> benchmarks which improved most on BXT & GLK), I see that there's a
> reduction in the CPU power utilization according to RAPL, at least
> on BSW.
>
> - Eero
>
> > > This will benefit all architectures including x86 + non i915.
> > >
> > > The current design encourages re-use of the IO utilization
> > > statistic (see PATCH 1) by other governors as a mechanism
> > > driving the trade-off between energy efficiency and
> > > responsiveness based on whether the system is close to
> > > CPU-bound, in whatever way is applicable to each governor
> > > (e.g. it would make sense for it to be hooked up to the EPP
> > > preference knob in the case of the intel_pstate HWP governor,
> > > which would allow it to achieve better energy efficiency in
> > > IO-bound situations just like this series does for non-HWP
> > > parts). There's nothing really x86- nor i915-specific about it.
> > >
> > > > BTW intel-pstate can be driven by the sched-util governor
> > > > (passive mode), so if you prove benefits to Broxton, this can
> > > > be a default. As before:
> > > > - No regression to idle power at all.
> > > > This is more important than benchmarks
> > > > - Not just score, performance/watt is important
> > >
> > > Is schedutil actually on par with the intel_pstate non-HWP
> > > governor as of today, according to these metrics and the overall
> > > benchmark numbers?
> >
> > Yes, except for a few cases. I have not tested recently, so it may
> > be better.
> >
> > Thanks,
> > Srinivas
> >
> > > > Thanks,
> > > > Srinivas
> > > >
> > > > > > controller does, even though the frequent IO waits may
> > > > > > actually be an indication that the system is IO-bound
> > > > > > (which means that the large energy usage increase may not
> > > > > > be translated into any performance benefit in practice,
> > > > > > not to speak of performance being impacted negatively in
> > > > > > TDP-bound scenarios like GPU rendering).
> > > > > >
> > > > > > Regarding run-time complexity, I haven't observed this
> > > > > > governor to be measurably more computationally intensive
> > > > > > than the present one. It's a bunch more instructions
> > > > > > indeed, but still within the same ballpark as the current
> > > > > > governor. The [...]
Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
I have to ask, if this is all just to work around iowait triggering
high frequencies for GPU-bound applications, does it all just boil
down to i915 incorrectly using iowait. Does this patch set perform
better than

diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 9ca9c24b4421..7e7c95411bcd 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1267,7 +1267,7 @@ long i915_request_wait(struct i915_request *rq,
 			goto complete;
 		}
 
-		timeout = io_schedule_timeout(timeout);
+		timeout = schedule_timeout(timeout);
 	} while (1);
 
 	GEM_BUG_ON(!intel_wait_has_seqno(&wait));

Quite clearly the general framework could prove useful in a broader
range of situations, but does the above suffice? (And can be
backported to stable.)
-Chris
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
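For readers wondering why that one-line change matters:
io_schedule_timeout() marks the sleeping task as waiting on IO, and
cpufreq governors treat wakeups from IO wait as a hint to boost
frequency, while plain schedule_timeout() does not. A much-simplified
model of a schedutil-style iowait-boost heuristic (illustrative
constants and Python, not kernel code):

```python
# Simplified model of an iowait-boost heuristic in the spirit of the
# schedutil governor: each wakeup from an IO wait doubles the boost up
# to the maximum, and the boost halves whenever a sample passes without
# an IO wait.  Sleeping via io_schedule_timeout() is what makes the
# wakeup count as "from IO wait"; schedule_timeout() would not.

MIN_BOOST = 0.125   # fraction of max frequency (illustrative values)
MAX_BOOST = 1.0

def update_boost(boost, woke_from_iowait):
    if woke_from_iowait:
        return min(MAX_BOOST, max(MIN_BOOST, boost * 2))
    half = boost / 2
    return half if half >= MIN_BOOST else 0.0

boost = 0.0
for _ in range(4):                  # a burst of IO waits ramps the boost up
    boost = update_boost(boost, True)
print(boost)                        # 1.0 (pinned at MAX_BOOST)
for _ in range(4):                  # ...and it decays once the waits stop
    boost = update_boost(boost, False)
print(boost)                        # 0.0
```

This is why frequent GPU-fence waits flagged as IO can drag the CPU
toward the maximum turbo P-state even when the extra frequency buys no
rendering throughput.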
Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
On Tue, 2018-04-17 at 15:03 +0100, Chris Wilson wrote:
> I have to ask, if this is all just to work around iowait triggering
> high frequencies for GPU bound applications, does it all just boil
> down to i915 incorrectly using iowait. Does this patch set perform
> better than
>
> diff --git a/drivers/gpu/drm/i915/i915_request.c
> b/drivers/gpu/drm/i915/i915_request.c
> index 9ca9c24b4421..7e7c95411bcd 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1267,7 +1267,7 @@ long i915_request_wait(struct i915_request *rq,
>  			goto complete;
>  		}
>
> -		timeout = io_schedule_timeout(timeout);
> +		timeout = schedule_timeout(timeout);
>  	} while (1);
>
>  	GEM_BUG_ON(!intel_wait_has_seqno(&wait));
>
> Quite clearly the general framework could prove useful in a broader
> range of situations, but does the above suffice? (And can be
> backported to stable.)

Definitely a very good test to do.

Thanks,
Srinivas

> -Chris
Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
Hey Chris,

Chris Wilson writes:
> I have to ask, if this is all just to work around iowait triggering
> high frequencies for GPU bound applications, does it all just boil
> down to i915 incorrectly using iowait. Does this patch set perform
> better than
>
> diff --git a/drivers/gpu/drm/i915/i915_request.c
> b/drivers/gpu/drm/i915/i915_request.c
> index 9ca9c24b4421..7e7c95411bcd 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1267,7 +1267,7 @@ long i915_request_wait(struct i915_request *rq,
>  			goto complete;
>  		}
>
> -		timeout = io_schedule_timeout(timeout);
> +		timeout = schedule_timeout(timeout);
>  	} while (1);
>
>  	GEM_BUG_ON(!intel_wait_has_seqno(&wait));
>
> Quite clearly the general framework could prove useful in a broader
> range of situations, but does the above suffice? (And can be
> backported to stable.)
> -Chris

Nope... This hunk is one of the first things I tried when I started
looking into this. It didn't cut it, and it seemed to lead to some
regressions in latency-bound test-cases that were relying on the
upward bias provided by IOWAIT boosting in combination with the
upstream P-state governor.

The reason why it's not sufficient is that the bulk of the energy
efficiency improvement from this series is obtained by dampening
high-frequency oscillations of the CPU P-state, which currently occur
for any periodically fluctuating workload (not only i915 rendering)
regardless of whether IOWAIT boosting kicks in. i915 using IO waits
does exacerbate the problem with the upstream governor by amplifying
the oscillation, but it's not really the ultimate cause.
In combination with v2 (you can take a peek at the half-baked patch
here [1]; I'm planning to make a few more changes this week, so it
isn't quite ready for review yet), this hunk will actually cause more
serious regressions, because v2 is able to use the frequent IOWAIT
stalls of the i915 driver in combination with an IO-underutilized
system as a strong indication that the workload is latency-bound,
which causes it to transition to latency-minimizing mode,
significantly improving performance of latency-bound rendering (most
visible in a handful of X11 test-cases and some GPU/CPU sync
test-cases from SynMark2).

[1] https://people.freedesktop.org/~currojerez/intel_pstate-lp/0001-cpufreq-intel_pstate-Implement-variably-low-pass-fil-v1.8.patch
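The oscillation-dampening argument above can be pictured with a
first-order low-pass filter applied to a square-wave load. This is only
a toy model of the general idea, not the actual controller in the
series (which, as the v2 patch title suggests, uses a variable cutoff
among other refinements):

```python
# Toy model (not the actual intel_pstate code) of why low-pass
# filtering the utilization signal saves energy for an oscillating
# workload: an unfiltered controller bounces between min and max
# P-state, while an exponentially-weighted moving average settles near
# the average frequency that gives the same throughput.

def ewma_filter(samples, alpha):
    """First-order exponential low-pass filter."""
    out, y = [], samples[0]
    for x in samples:
        y = alpha * x + (1 - alpha) * y
        out.append(y)
    return out

load = [1.0, 0.0] * 50                  # load ping-pongs CPU <-> IO
filtered = ewma_filter(load, alpha=0.1)

raw_swing = max(load) - min(load)       # unfiltered: full-range swings
# Small residual ripple once the filter settles:
filtered_swing = max(filtered[20:]) - min(filtered[20:])
print(raw_swing, round(filtered_swing, 2))
```

An unfiltered governor tracking `load` directly would request the
maximum and minimum P-state on alternate samples; the filtered request
hovers near the midpoint, which is the fluctuation-free behavior the
cover letter argues is more energy-efficient at equal throughput.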