Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

2018-03-30 Thread Francisco Jerez
Francisco Jerez  writes:

> This series attempts to solve an energy efficiency problem of the
> current active-mode non-HWP governor of the intel_pstate driver used
> for the most part on low-power platforms.  Under heavy IO load the
> current controller tends to increase frequencies to the maximum turbo
> P-state, partly due to IO wait boosting, partly due to the roughly
> flat frequency response curve of the current controller (see
> [6]), which causes it to ramp frequencies up and down repeatedly for
> any oscillating workload (think of graphics, audio or disk IO when any
> of them becomes a bottleneck), severely increasing energy usage
> relative to a (throughput-wise equivalent) controller able to provide
> the same average frequency without fluctuation.  The core energy
> efficiency improvement has been observed to be of the order of 20% via
> RAPL, but it's expected to vary substantially between workloads (see
> perf-per-watt comparison [2]).
>
> One might expect that this could come at some cost in terms of system
> responsiveness, but the governor implemented in PATCH 6 has a variable
> response curve controlled by a heuristic that keeps the controller in
> a low-latency state unless the system is under heavy IO load for an
> extended period of time.  The low-latency behavior is actually
> significantly more aggressive than the current governor, allowing it
> to achieve better throughput in some scenarios where the load
> ping-pongs between the CPU and some IO device (see PATCH 6 for more of
> the rationale).  The controller offers relatively lower latency than
> the upstream one particularly while C0 residency is low (which by
> itself contributes to mitigate the increased energy usage while on
> C0).  However under certain conditions the low-latency heuristic may
> increase power consumption (see perf-per-watt comparison [2], the
> apparent regressions are correlated with an increase in performance in
> the same benchmark due to the use of the low-latency heuristic) -- If
> this is a problem a different trade-off between latency and energy
> usage shouldn't be difficult to achieve, but it will come at a
> performance cost in some cases.  I couldn't observe a statistically
> significant increase in idle power consumption due to this behavior
> (on BXT J3455):
>
> package-0 RAPL (W): XX ±0.14% x8 -> XX ±0.15% x9
> d=-0.04% ±0.14%  p=61.73%
>
> [Absolute benchmark results are unfortunately omitted from this letter
> due to company policies, but the percent change and Student's T
> p-value are included above and in the referenced benchmark results]
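Numbers in that form (a percent change with a propagated relative error) can be reproduced with a short sketch like the following; the absolute means are made up here, since the real values were redacted:

```python
import math

def rel_change_with_error(mean_a, sem_a, mean_b, sem_b):
    """Percent change from A to B with first-order error propagation.

    mean_*: sample means in absolute units (redacted as XX above);
    sem_*: standard errors of the means, in the same units.
    """
    d = (mean_b - mean_a) / mean_a * 100.0
    # Relative errors of independent means add in quadrature for a ratio.
    err = math.sqrt((sem_a / mean_a) ** 2 + (sem_b / mean_b) ** 2) * 100.0
    return d, err

# Hypothetical package-0 RAPL means (W); the true values were omitted.
d, err = rel_change_with_error(2.000, 0.002, 1.9992, 0.002)
print(f"d={d:+.2f}% ±{err:.2f}%")  # d=-0.04% ±0.14%
```

The quoted p-value comes from a Student's t-test on the two sets of runs (x8 and x9 samples), which needs the raw measurements rather than just the means.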
>
> The most obvious impact of this series will likely be the overall
> improvement in graphics performance on systems with an IGP integrated
> into the processor package (though for the moment this is only enabled
> on BXT+), because the TDP budget shared among CPU and GPU can
> frequently become a limiting factor in low-power devices.  On heavily
> TDP-bound devices this series improves performance of virtually any
> non-trivial graphics rendering by a significant amount (of the order
> of the energy efficiency improvement for that workload assuming the
> optimization didn't cause it to become non-TDP-bound).
>
> See [1]-[5] for detailed numbers including various graphics benchmarks
> and a sample of the Phoronix daily-system-tracker.  Some popular
> graphics benchmarks like GfxBench gl_manhattan31 and gl_4 improve
> between 5% and 11% on our systems.  The exact improvement can vary
> substantially between systems (compare the benchmark results from the
> two different J3455 systems [1] and [3]) due to a number of factors,
> including the ratio between CPU and GPU processing power, the behavior
> of the userspace graphics driver, the windowing system and resolution,
> the BIOS (which has an influence on the package TDP), the thermal
> characteristics of the system, etc.
>
> Unigine Valley and Heaven improve by a similar factor on some systems
> (see the J3455 results [1]), but on others the improvement is lower
> because the benchmark fails to fully utilize the GPU, which causes the
> heuristic to remain in low-latency state for longer, which leaves a
> reduced TDP budget available to the GPU, which prevents performance
> from increasing further.  This can be avoided by using the alternative
> heuristic parameters suggested in the commit message of PATCH 8, which
> provide a lower IO utilization threshold and hysteresis for the
> controller to attempt to save energy.  I'm not proposing those for
> upstream (yet) because they would also increase the risk for
> latency-sensitive IO-heavy workloads to regress (like SynMark2
> OglTerrainFly* and some arguably poorly designed IPC-bound X11
> benchmarks).
>
> Discrete graphics aren't likely to experience that much of a visible
> improvement from this, even though many non-IGP workloads *could*
> benefit by reducing the system's energy usage while the discrete GPU
> (or really, any other IO device) becomes a bottleneck, but this is not
> attempted in this series, since that would involve making an energy
> efficiency/latency trade-off that only the maintainers of the
> respective drivers are in a position to make.
> [...]

Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

2018-04-10 Thread Francisco Jerez
Francisco Jerez  writes:

> This series attempts to solve an energy efficiency problem of the
> current active-mode non-HWP governor of the intel_pstate driver used
> for the most part on low-power platforms.  Under heavy IO load the
> current controller tends to increase frequencies to the maximum turbo
> P-state, partly due to IO wait boosting, partly due to the roughly
> flat frequency response curve of the current controller (see
> [6]), which causes it to ramp frequencies up and down repeatedly for
> any oscillating workload (think of graphics, audio or disk IO when any
> of them becomes a bottleneck), severely increasing energy usage
> relative to a (throughput-wise equivalent) controller able to provide
> the same average frequency without fluctuation.  The core energy
> efficiency improvement has been observed to be of the order of 20% via
> RAPL, but it's expected to vary substantially between workloads (see
> perf-per-watt comparison [2]).
>
> One might expect that this could come at some cost in terms of system
> responsiveness, but the governor implemented in PATCH 6 has a variable
> response curve controlled by a heuristic that keeps the controller in
> a low-latency state unless the system is under heavy IO load for an
> extended period of time.  The low-latency behavior is actually
> significantly more aggressive than the current governor, allowing it
> to achieve better throughput in some scenarios where the load
> ping-pongs between the CPU and some IO device (see PATCH 6 for more of
> the rationale).  The controller offers relatively lower latency than
> the upstream one particularly while C0 residency is low (which by
> itself contributes to mitigate the increased energy usage while on
> C0).  However under certain conditions the low-latency heuristic may
> increase power consumption (see perf-per-watt comparison [2], the
> apparent regressions are correlated with an increase in performance in
> the same benchmark due to the use of the low-latency heuristic) -- If
> this is a problem a different trade-off between latency and energy
> usage shouldn't be difficult to achieve, but it will come at a
> performance cost in some cases.  I couldn't observe a statistically
> significant increase in idle power consumption due to this behavior
> (on BXT J3455):
>
> package-0 RAPL (W): XX ±0.14% x8 -> XX ±0.15% x9
> d=-0.04% ±0.14%  p=61.73%
>

In case anyone is wondering what's going on, Srinivas pointed me at
a larger idle power usage increase off-list, ultimately caused by the
low-latency heuristic as discussed in the paragraph above.  I have a v2
of PATCH 6 that gives the controller a third response curve roughly
intermediate between the low-latency and low-power states of this
revision, which avoids the energy usage increase while C0 residency is
low (e.g. during idle) expected for v1.  The low-latency behavior of
this revision is still going to be available based on a heuristic (in
particular when a realtime-priority task is scheduled).  We're carrying
out some additional testing, I'll post the code here eventually.
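Schematically, the v2 behavior described above amounts to something like the sketch below; the curve names and enable conditions are placeholders inferred from this description, not the actual v2 code:

```python
def pick_response_curve(heavy_io_load, rt_task_runnable):
    """Toy three-way response-curve selection.

    low-latency:  most aggressive ramp-up, now gated on a heuristic
                  such as a realtime-priority task being scheduled.
    low-power:    heavily damped curve used under sustained IO load.
    intermediate: the new v2 default, which avoids v1's energy usage
                  increase while C0 residency is low (e.g. idle).
    """
    if rt_task_runnable:
        return "low-latency"
    if heavy_io_load:
        return "low-power"
    return "intermediate"

print(pick_response_curve(False, False))  # intermediate (e.g. while idle)
```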

> [Absolute benchmark results are unfortunately omitted from this letter
> due to company policies, but the percent change and Student's T
> p-value are included above and in the referenced benchmark results]
>
> The most obvious impact of this series will likely be the overall
> improvement in graphics performance on systems with an IGP integrated
> into the processor package (though for the moment this is only enabled
> on BXT+), because the TDP budget shared among CPU and GPU can
> frequently become a limiting factor in low-power devices.  On heavily
> TDP-bound devices this series improves performance of virtually any
> non-trivial graphics rendering by a significant amount (of the order
> of the energy efficiency improvement for that workload assuming the
> optimization didn't cause it to become non-TDP-bound).
>
> See [1]-[5] for detailed numbers including various graphics benchmarks
> and a sample of the Phoronix daily-system-tracker.  Some popular
> graphics benchmarks like GfxBench gl_manhattan31 and gl_4 improve
> between 5% and 11% on our systems.  The exact improvement can vary
> substantially between systems (compare the benchmark results from the
> two different J3455 systems [1] and [3]) due to a number of factors,
> including the ratio between CPU and GPU processing power, the behavior
> of the userspace graphics driver, the windowing system and resolution,
> the BIOS (which has an influence on the package TDP), the thermal
> characteristics of the system, etc.
>
> Unigine Valley and Heaven improve by a similar factor on some systems
> (see the J3455 results [1]), but on others the improvement is lower
> because the benchmark fails to fully utilize the GPU, which causes the
> heuristic to remain in low-latency state for longer, which leaves a
> reduced TDP budget available to the GPU, which prevents performance
> from increasing further.
> [...]

Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

2018-04-10 Thread Srinivas Pandruvada
On Tue, 2018-04-10 at 15:28 -0700, Francisco Jerez wrote:
> Francisco Jerez  writes:
> 
[...]


> In case anyone is wondering what's going on, Srinivas pointed me
> at
> a larger idle power usage increase off-list, ultimately caused by the
> low-latency heuristic as discussed in the paragraph above.  I have a
> v2
> of PATCH 6 that gives the controller a third response curve roughly
> intermediate between the low-latency and low-power states of this
> revision, which avoids the energy usage increase while C0 residency
> is
> low (e.g. during idle) expected for v1.  The low-latency behavior of
> this revision is still going to be available based on a heuristic (in
> particular when a realtime-priority task is scheduled).  We're
> carrying
> out some additional testing, I'll post the code here eventually.

Please try the schedutil governor as well. There is a frequency-invariance
patch, which I can send you (it will eventually be pushed by Peter).
We want to avoid adding complexity to intel_pstate for non-HWP
power-sensitive platforms as far as possible.


Thanks,
Srinivas



> 
> > [Absolute benchmark results are unfortunately omitted from this
> > letter
> > due to company policies, but the percent change and Student's T
> > p-value are included above and in the referenced benchmark results]
> > 
> > The most obvious impact of this series will likely be the overall
> > improvement in graphics performance on systems with an IGP
> > integrated
> > into the processor package (though for the moment this is only
> > enabled
> > on BXT+), because the TDP budget shared among CPU and GPU can
> > frequently become a limiting factor in low-power devices.  On
> > heavily
> > TDP-bound devices this series improves performance of virtually any
> > non-trivial graphics rendering by a significant amount (of the
> > order
> > of the energy efficiency improvement for that workload assuming the
> > optimization didn't cause it to become non-TDP-bound).
> > 
> > See [1]-[5] for detailed numbers including various graphics
> > benchmarks
> > and a sample of the Phoronix daily-system-tracker.  Some popular
> > graphics benchmarks like GfxBench gl_manhattan31 and gl_4 improve
> > between 5% and 11% on our systems.  The exact improvement can vary
> > substantially between systems (compare the benchmark results from
> > the
> > two different J3455 systems [1] and [3]) due to a number of
> > factors,
> > including the ratio between CPU and GPU processing power, the
> > behavior
> > of the userspace graphics driver, the windowing system and
> > resolution,
> > the BIOS (which has an influence on the package TDP), the thermal
> > characteristics of the system, etc.
> > 
> > Unigine Valley and Heaven improve by a similar factor on some
> > systems
> > (see the J3455 results [1]), but on others the improvement is lower
> > because the benchmark fails to fully utilize the GPU, which causes
> > the
> > heuristic to remain in low-latency state for longer, which leaves a
> > reduced TDP budget available to the GPU, which prevents performance
> > from increasing further.  This can be avoided by using the
> > alternative
> > heuristic parameters suggested in the commit message of PATCH 8,
> > which
> > provide a lower IO utilization threshold and hysteresis for the
> > controller to attempt to save energy.  I'm not proposing those for
> > upstream (yet) because they would also increase the risk for
> > latency-sensitive IO-heavy workloads to regress (like SynMark2
> > OglTerrainFly* and some arguably poorly designed IPC-bound X11
> > benchmarks).
> > 
> > Discrete graphics aren't likely to experience that much of a
> > visible
> > improvement from this, even though many non-IGP workloads *could*
> > benefit by reducing the system's energy usage while the discrete
> > GPU
> > (or really, any other IO device) becomes a bottleneck, but this is
> > not
> > attempted in this series, since that would involve making an energy
> > efficiency/latency trade-off that only the maintainers of the
> > respective drivers are in a position to make.  The cpufreq
> > interface
> > introduced in PATCH 1 to achieve this is left as an opt-in for that
> > reason, only the i915 DRM driver is hooked up since it will get the
> > most direct pay-off due to the increased energy budget available to
> > the GPU, but other power-hungry third-party gadgets built into the
> > same package (*cough* AMD *cough* Mali *cough* PowerVR *cough*) may
> > be
> > able to benefit from this interface eventually by instrumenting the
> > driver in a similar way.
> > 
> > The cpufreq interface is not exclusively tied to the intel_pstate
> > driver, because other governors can make use of the statistic
> > calculated as result to avoid over-optimizing for latency in
> > scenarios
> > where a lower frequency would be able to achieve similar throughput
> > while using less energy.  The interpretation of this statistic
> > relies
> > on the observation that for as long as the system is CPU-bound, a
> > [...]

Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

2018-04-11 Thread Francisco Jerez
Hi Srinivas,

Srinivas Pandruvada  writes:

> On Tue, 2018-04-10 at 15:28 -0700, Francisco Jerez wrote:
>> Francisco Jerez  writes:
>> 
> [...]
>
>
>> In case anyone is wondering what's going on, Srinivas pointed me
>> at
>> a larger idle power usage increase off-list, ultimately caused by the
>> low-latency heuristic as discussed in the paragraph above.  I have a
>> v2
>> of PATCH 6 that gives the controller a third response curve roughly
>> intermediate between the low-latency and low-power states of this
>> revision, which avoids the energy usage increase while C0 residency
>> is
>> low (e.g. during idle) expected for v1.  The low-latency behavior of
>> this revision is still going to be available based on a heuristic (in
>> particular when a realtime-priority task is scheduled).  We're
>> carrying
>> out some additional testing, I'll post the code here eventually.
>
> Please try the schedutil governor as well. There is a frequency-invariance
> patch, which I can send you (it will eventually be pushed by Peter).
> We want to avoid adding complexity to intel_pstate for non-HWP
> power-sensitive platforms as far as possible.
>

Unfortunately the schedutil governor (whether frequency invariant or
not) has the exact same energy efficiency issues as the present
intel_pstate non-HWP governor.  Its response is severely underdamped
leading to energy-inefficient behavior for any oscillating non-CPU-bound
workload.  To exacerbate that problem the frequency is maxed out on
frequent IO waiting just like the current intel_pstate cpu-load
controller does, even though the frequent IO waits may actually be an
indication that the system is IO-bound (which means that the large
increase in energy usage may not translate into any performance benefit
in practice, not to speak of performance being impacted negatively in
TDP-bound scenarios like GPU rendering).
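The energy cost of an underdamped (oscillating) response can be illustrated with a toy convexity argument: assume dynamic power scales roughly as f³ (voltage tracking frequency; the real exponent is platform-dependent), and compare a steady frequency against an oscillating one with the same average:

```python
# Equal-length time slices; energy per slice ~ f**3 under the assumed
# cubic power model.
def energy(freqs):
    return sum(f ** 3 for f in freqs)

steady      = [1.0] * 8          # flat response at the average frequency
oscillating = [1.5, 0.5] * 4     # same average, underdamped oscillation

print(energy(steady), energy(oscillating))  # 8.0 14.0
```

Because f³ is convex, any fluctuation around the average strictly increases energy for the same mean frequency, which is the core of the case against ramping frequencies up and down for an oscillating workload.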

Regarding run-time complexity, I haven't observed this governor to be
measurably more computationally intensive than the present one.  It's a
bunch more instructions indeed, but still within the same ballpark as
the current governor.  The average increase in CPU utilization on my BXT
with this series is less than 0.03% (sampled via ftrace for v1, I can
repeat the measurement for the v2 I have in the works, though I don't
expect the result to be substantially different).  If this is a problem
for you there are several optimization opportunities that would cut down
the number of CPU cycles get_target_pstate_lp() takes to execute by a
large percent (most of the optimization ideas I can think of right now
though would come at some accuracy/maintainability/debuggability cost,
but may still be worth pursuing), but the computational overhead is low
enough at this point that the impact on any benchmark or real workload
would be orders of magnitude lower than its variance, which makes it
kind of difficult to keep the discussion data-driven [as possibly any
performance optimization discussion should ever be ;)].

>
> Thanks,
> Srinivas
>
>
>
>> 
>> > [Absolute benchmark results are unfortunately omitted from this
>> > letter
>> > due to company policies, but the percent change and Student's T
>> > p-value are included above and in the referenced benchmark results]
>> > 
>> > The most obvious impact of this series will likely be the overall
>> > improvement in graphics performance on systems with an IGP
>> > integrated
>> > into the processor package (though for the moment this is only
>> > enabled
>> > on BXT+), because the TDP budget shared among CPU and GPU can
>> > frequently become a limiting factor in low-power devices.  On
>> > heavily
>> > TDP-bound devices this series improves performance of virtually any
>> > non-trivial graphics rendering by a significant amount (of the
>> > order
>> > of the energy efficiency improvement for that workload assuming the
>> > optimization didn't cause it to become non-TDP-bound).
>> > 
>> > See [1]-[5] for detailed numbers including various graphics
>> > benchmarks
>> > and a sample of the Phoronix daily-system-tracker.  Some popular
>> > graphics benchmarks like GfxBench gl_manhattan31 and gl_4 improve
>> > between 5% and 11% on our systems.  The exact improvement can vary
>> > substantially between systems (compare the benchmark results from
>> > the
>> > two different J3455 systems [1] and [3]) due to a number of
>> > factors,
>> > including the ratio between CPU and GPU processing power, the
>> > behavior
>> > of the userspace graphics driver, the windowing system and
>> > resolution,
>> > the BIOS (which has an influence on the package TDP), the thermal
>> > characteristics of the system, etc.
>> > 
>> > Unigine Valley and Heaven improve by a similar factor on some
>> > systems
>> > (see the J3455 results [1]), but on others the improvement is lower
>> > because the benchmark fails to fully utilize the GPU, which causes
>> > the
>> > heuristic to remain in low-latency state for longer, which leaves a
>> > reduced TDP budget available to the GPU, which prevents performance
>> > from increasing further.
>> > [...]

Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

2018-04-11 Thread Francisco Jerez
Francisco Jerez  writes:

> Hi Srinivas,
>
> Srinivas Pandruvada  writes:
>
>> On Tue, 2018-04-10 at 15:28 -0700, Francisco Jerez wrote:
>>> Francisco Jerez  writes:
>>> 
>> [...]
>>
>>
>>> In case anyone is wondering what's going on, Srinivas pointed me
>>> at
>>> a larger idle power usage increase off-list, ultimately caused by the
>>> low-latency heuristic as discussed in the paragraph above.  I have a
>>> v2
>>> of PATCH 6 that gives the controller a third response curve roughly
>>> intermediate between the low-latency and low-power states of this
>>> revision, which avoids the energy usage increase while C0 residency
>>> is
>>> low (e.g. during idle) expected for v1.  The low-latency behavior of
>>> this revision is still going to be available based on a heuristic (in
>>> particular when a realtime-priority task is scheduled).  We're
>>> carrying
>>> out some additional testing, I'll post the code here eventually.
>>
>> Please try the schedutil governor as well. There is a frequency-invariance
>> patch, which I can send you (it will eventually be pushed by Peter).
>> We want to avoid adding complexity to intel_pstate for non-HWP
>> power-sensitive platforms as far as possible.
>>
>
> Unfortunately the schedutil governor (whether frequency invariant or
> not) has the exact same energy efficiency issues as the present
> intel_pstate non-HWP governor.  Its response is severely underdamped
> leading to energy-inefficient behavior for any oscillating non-CPU-bound
> workload.  To exacerbate that problem the frequency is maxed out on
> frequent IO waiting just like the current intel_pstate cpu-load

"just like" here is possibly somewhat unfair to the schedutil governor;
admittedly its progressive IOWAIT boosting behavior seems somewhat less
wasteful than the intel_pstate non-HWP governor's IOWAIT boosting
behavior, but it's still largely unhelpful on IO-bound conditions.

> controller does, even though the frequent IO waits may actually be an
> indication that the system is IO-bound (which means that the large
> increase in energy usage may not translate into any performance benefit
> in practice, not to speak of performance being impacted negatively in
> TDP-bound scenarios like GPU rendering).
>
> Regarding run-time complexity, I haven't observed this governor to be
> measurably more computationally intensive than the present one.  It's a
> bunch more instructions indeed, but still within the same ballpark as
> the current governor.  The average increase in CPU utilization on my BXT
> with this series is less than 0.03% (sampled via ftrace for v1, I can
> repeat the measurement for the v2 I have in the works, though I don't
> expect the result to be substantially different).  If this is a problem
> for you there are several optimization opportunities that would cut down
> the number of CPU cycles get_target_pstate_lp() takes to execute by a
> large percent (most of the optimization ideas I can think of right now
> though would come at some accuracy/maintainability/debuggability cost,
> but may still be worth pursuing), but the computational overhead is low
> enough at this point that the impact on any benchmark or real workload
> would be orders of magnitude lower than its variance, which makes it
> kind of difficult to keep the discussion data-driven [as possibly any
> performance optimization discussion should ever be ;)].
>
>>
>> Thanks,
>> Srinivas
>>
>>
>>
>>> 
>>> > [Absolute benchmark results are unfortunately omitted from this
>>> > letter
>>> > due to company policies, but the percent change and Student's T
>>> > p-value are included above and in the referenced benchmark results]
>>> > 
>>> > The most obvious impact of this series will likely be the overall
>>> > improvement in graphics performance on systems with an IGP
>>> > integrated
>>> > into the processor package (though for the moment this is only
>>> > enabled
>>> > on BXT+), because the TDP budget shared among CPU and GPU can
>>> > frequently become a limiting factor in low-power devices.  On
>>> > heavily
>>> > TDP-bound devices this series improves performance of virtually any
>>> > non-trivial graphics rendering by a significant amount (of the
>>> > order
>>> > of the energy efficiency improvement for that workload assuming the
>>> > optimization didn't cause it to become non-TDP-bound).
>>> > 
>>> > See [1]-[5] for detailed numbers including various graphics
>>> > benchmarks
>>> > and a sample of the Phoronix daily-system-tracker.  Some popular
>>> > graphics benchmarks like GfxBench gl_manhattan31 and gl_4 improve
>>> > between 5% and 11% on our systems.  The exact improvement can vary
>>> > substantially between systems (compare the benchmark results from
>>> > the
>>> > two different J3455 systems [1] and [3]) due to a number of
>>> > factors,
>>> > including the ratio between CPU and GPU processing power, the
>>> > behavior
>>> > of the userspace graphics driver, the windowing system and
>>> > resolution,
>>> > the BIOS (which has an influence on the package TDP), the thermal
>>> > characteristics of the system, etc.
>>> > [...]

Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

2018-04-11 Thread Srinivas Pandruvada
On Wed, 2018-04-11 at 09:26 -0700, Francisco Jerez wrote:
> 
> "just like" here is possibly somewhat unfair to the schedutil
> governor;
> admittedly its progressive IOWAIT boosting behavior seems somewhat
> less
> wasteful than the intel_pstate non-HWP governor's IOWAIT boosting
> behavior, but it's still largely unhelpful on IO-bound conditions.
> 

OK, if you think so, then improve it for the schedutil governor or other
mechanisms (as Juri suggested) instead of intel_pstate. This will
benefit all architectures, including x86 + non-i915.

BTW, intel_pstate can be driven by the schedutil governor (passive mode),
so if you prove benefits on Broxton, this can become the default.
As before:
- No regression in idle power at all. This is more important than
benchmarks.
- Not just the score; performance/watt is important.

Thanks,
Srinivas


> > controller does, even though the frequent IO waits may actually be
> > an
> > indication that the system is IO-bound (which means that the large
> > increase in energy usage may not translate into any performance
> > benefit in practice, not to speak of performance being impacted
> > negatively in TDP-bound scenarios like GPU rendering).
> > 
> > Regarding run-time complexity, I haven't observed this governor to
> > be
> > measurably more computationally intensive than the present
> > one.  It's a
> > bunch more instructions indeed, but still within the same ballpark
> > as
> > the current governor.  The average increase in CPU utilization on
> > my BXT
> > with this series is less than 0.03% (sampled via ftrace for v1, I
> > can
> > repeat the measurement for the v2 I have in the works, though I
> > don't
> > expect the result to be substantially different).  If this is a
> > problem
> > for you there are several optimization opportunities that would cut
> > down
> > the number of CPU cycles get_target_pstate_lp() takes to execute by
> > a
> > large percent (most of the optimization ideas I can think of right
> > now
> > though would come at some accuracy/maintainability/debuggability
> > cost,
> > but may still be worth pursuing), but the computational overhead is
> > low
> > enough at this point that the impact on any benchmark or real
> > workload
> > would be orders of magnitude lower than its variance, which makes
> > it
> > kind of difficult to keep the discussion data-driven [as possibly
> > any
> > performance optimization discussion should ever be ;)].
> > 
> > > 
> > > Thanks,
> > > Srinivas
> > > 
> > > 
> > > 
> > > > 
> > > > > [Absolute benchmark results are unfortunately omitted from
> > > > > this
> > > > > letter
> > > > > due to company policies, but the percent change and Student's
> > > > > T
> > > > > p-value are included above and in the referenced benchmark
> > > > > results]
> > > > > 
> > > > > The most obvious impact of this series will likely be the
> > > > > overall
> > > > > improvement in graphics performance on systems with an IGP
> > > > > integrated
> > > > > into the processor package (though for the moment this is
> > > > > only
> > > > > enabled
> > > > > on BXT+), because the TDP budget shared among CPU and GPU can
> > > > > frequently become a limiting factor in low-power devices.  On
> > > > > heavily
> > > > > TDP-bound devices this series improves performance of
> > > > > virtually any
> > > > > non-trivial graphics rendering by a significant amount (of
> > > > > the
> > > > > order
> > > > > of the energy efficiency improvement for that workload
> > > > > assuming the
> > > > > optimization didn't cause it to become non-TDP-bound).
> > > > > 
> > > > > See [1]-[5] for detailed numbers including various graphics
> > > > > benchmarks
> > > > > and a sample of the Phoronix daily-system-tracker.  Some
> > > > > popular
> > > > > graphics benchmarks like GfxBench gl_manhattan31 and gl_4
> > > > > improve
> > > > > between 5% and 11% on our systems.  The exact improvement can
> > > > > vary
> > > > > substantially between systems (compare the benchmark results
> > > > > from
> > > > > the
> > > > > two different J3455 systems [1] and [3]) due to a number of
> > > > > factors,
> > > > > including the ratio between CPU and GPU processing power, the
> > > > > behavior
> > > > > of the userspace graphics driver, the windowing system and
> > > > > resolution,
> > > > > the BIOS (which has an influence on the package TDP), the
> > > > > thermal
> > > > > characteristics of the system, etc.
> > > > > 
> > > > > Unigine Valley and Heaven improve by a similar factor on some
> > > > > systems
> > > > > (see the J3455 results [1]), but on others the improvement is
> > > > > lower
> > > > > because the benchmark fails to fully utilize the GPU, which
> > > > > causes
> > > > > the
> > > > > heuristic to remain in low-latency state for longer, which
> > > > > leaves a
> > > > > reduced TDP budget available to the GPU, which prevents
> > > > > performance
> > > > > from increasing further.  This can be avoided by using the
> > > > > alternative
> > > > > heuristic parameters suggested in the commit message of PATCH 8,
> > > > > which provide a lower IO utilization threshold and hysteresis for
> > > > > the controller to attempt to save energy.
> > > > > [...]

Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

2018-04-12 Thread Peter Zijlstra
On Wed, Apr 11, 2018 at 09:26:11AM -0700, Francisco Jerez wrote:
> "just like" here is possibly somewhat unfair to the schedutil governor;
> admittedly its progressive IOWAIT boosting behavior seems somewhat less
> wasteful than the intel_pstate non-HWP governor's IOWAIT boosting
> behavior, but it's still largely unhelpful on IO-bound conditions.

So you understand why we need the iowait boosting right?

It is just that when we get back to runnable, we want to process the
next data packet ASAP. See also here:

  
https://lkml.kernel.org/r/20170522082154.f57cqovterd2q...@hirez.programming.kicks-ass.net

What I don't really understand is why it is costing so much power; after
all, when we're in iowait the CPU is mostly idle and can power-gate.
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

2018-04-12 Thread Francisco Jerez
Peter Zijlstra  writes:

> On Wed, Apr 11, 2018 at 09:26:11AM -0700, Francisco Jerez wrote:
>> "just like" here is possibly somewhat unfair to the schedutil governor;
>> admittedly its progressive IOWAIT boosting behavior seems somewhat less
>> wasteful than the intel_pstate non-HWP governor's IOWAIT boosting
>> behavior, but it's still largely unhelpful on IO-bound conditions.
>
> So you understand why we need the iowait boosting right?
>

Yeah, sort of.  The latency-minimizing state of this governor provides a
comparable effect, but it's based on a pessimistic estimate of the
frequency required for the workload to achieve maximum throughput
(rather than a plain or exponential boost up to the max frequency which
can substantially deviate from that frequency, see the explanation in
PATCH 6 for more details).  It's enabled under conditions partially
overlapping but not identical to iowait boosting: The optimization is
not applied under IO-bound conditions (in order to avoid impacting
energy efficiency negatively for zero or negative payoff), OTOH the
optimization is applied in some cases where the current governor wouldn't
apply it, like RT-priority threads (that's the main difference from the v2
I'm planning to send out next week).

> It is just that when we get back to runnable, we want to process the
> next data packet ASAP. See also here:
>
>   
> https://lkml.kernel.org/r/20170522082154.f57cqovterd2q...@hirez.programming.kicks-ass.net
>
> What I don't really understand is why it is costing so much power; after
> all, when we're in iowait the CPU is mostly idle and can power-gate.

The reason for the energy efficiency problem of iowait boosting is
precisely the greater oscillation between turbo and idle.  Say that
iowait boost increases the frequency by a factor alpha relative to the
optimal frequency f0 (in terms of energy efficiency) required to execute
some IO-bound workload.  This will cause the CPU to be busy for a
fraction of the time it was busy originally, approximately t1 = t0 /
alpha, which indeed divides the overall energy usage by a factor alpha,
but at the same time multiplies the instantaneous power consumption
while busy by a factor potentially much greater than alpha, since the
CPU's power curve is largely non-linear, and in fact approximately
convex within the frequency range allowed by the policy, so you get an
average energy usage possibly much greater than the optimal.
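The oscillation argument can be made concrete with a toy model (the cubic power curve and all constants below are illustrative assumptions, not measurements of any particular CPU):

```python
# Toy model of the energy cost of iowait boosting above the optimal
# frequency f0, assuming a convex power curve P(f) = c * f**3 (voltage
# scales roughly with frequency, so dynamic power grows roughly cubically).

def busy_energy(freq_ghz, work_gcycles, c=1.0):
    """Energy to execute a fixed amount of work at a given frequency.

    busy time: t = work / freq
    power:     P = c * freq**3     (convex in freq)
    energy:    E = P * t = c * freq**2 * work
    """
    return (c * freq_ghz ** 3) * (work_gcycles / freq_ghz)

f0 = 1.0      # GHz, hypothetical optimal frequency for the IO-bound workload
alpha = 2.0   # iowait boost factor relative to f0
work = 10.0   # Gcycles of CPU work per IO request

e_opt = busy_energy(f0, work)
e_boost = busy_energy(alpha * f0, work)

# Busy time shrinks by a factor alpha, but instantaneous power grows by
# alpha**3, so energy per unit of work grows by alpha**2.
print(e_boost / e_opt)  # -> 4.0
```

With this (made-up) cubic curve a 2x boost costs 4x the energy for the same work, which is the "average energy usage possibly much greater than the optimal" above.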




Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

2018-04-12 Thread Peter Zijlstra
On Thu, Apr 12, 2018 at 11:34:54AM -0700, Francisco Jerez wrote:
> The reason for the energy efficiency problem of iowait boosting is
> precisely the greater oscillation between turbo and idle.  Say that
> iowait boost increases the frequency by a factor alpha relative to the
> optimal frequency f0 (in terms of energy efficiency) required to execute
> some IO-bound workload.  This will cause the CPU to be busy for a
> fraction of the time it was busy originally, approximately t1 = t0 /
> alpha, which indeed divides the overall energy usage by a factor alpha,
> but at the same time multiplies the instantaneous power consumption
> while busy by a factor potentially much greater than alpha, since the
> CPU's power curve is largely non-linear, and in fact approximately
> convex within the frequency range allowed by the policy, so you get an
> average energy usage possibly much greater than the optimal.

Ah, but we don't (yet) have the (normalized) power curves, so we cannot
make that call.

Once we have the various energy/OPP numbers required for EAS we can
compute the optimal. I think such was even mentioned in the thread I
referred earlier.

Until such time; we boost to max for lack of a better option.


Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

2018-04-12 Thread Francisco Jerez
Peter Zijlstra  writes:

> On Thu, Apr 12, 2018 at 11:34:54AM -0700, Francisco Jerez wrote:
>> The reason for the energy efficiency problem of iowait boosting is
>> precisely the greater oscillation between turbo and idle.  Say that
>> iowait boost increases the frequency by a factor alpha relative to the
>> optimal frequency f0 (in terms of energy efficiency) required to execute
>> some IO-bound workload.  This will cause the CPU to be busy for a
>> fraction of the time it was busy originally, approximately t1 = t0 /
>> alpha, which indeed divides the overall energy usage by a factor alpha,
>> but at the same time multiplies the instantaneous power consumption
>> while busy by a factor potentially much greater than alpha, since the
>> CPU's power curve is largely non-linear, and in fact approximately
>> convex within the frequency range allowed by the policy, so you get an
>> average energy usage possibly much greater than the optimal.
>
> Ah, but we don't (yet) have the (normalized) power curves, so we cannot
> make that call.
>
> Once we have the various energy/OPP numbers required for EAS we can
> compute the optimal. I think such was even mentioned in the thread I
> referred earlier.
>
> Until such time; we boost to max for lack of a better option.

Actually assuming that a single geometric feature of the power curve is
known -- it being convex in the frequency range allowed by the policy
(which is almost always the case, not only for Intel CPUs), the optimal
frequency for an IO-bound workload is fully independent of the exact
power curve -- It's just the minimum CPU frequency that's able to keep
the bottlenecking IO device at 100% utilization.  Any frequency higher
than that will lead to strictly lower energy efficiency whatever the
exact form of the power curve is.

I agree though that exact knowledge of the power curve *might* be useful
as a mechanism to estimate the potential costs of exceeding that optimal
frequency (e.g. as a mechanism to offset performance loss heuristically
for the case the workload fluctuates by giving the governor an upward
bias with an approximately known energy cost), but that's not required
for the governor's behavior to be approximately optimal in IO-bound
conditions.  Not making further assumptions about the power curve beyond
its convexity makes the algorithm fairly robust against any inaccuracy
in the power curve numbers (which there will always be, since the energy
efficiency of the workload is really dependent on the behavior of
multiple components of the system interacting with each other), and
makes it easily reusable on platforms where the exact power curves are
not known.
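The convexity claim above can be checked numerically with a toy model (the power curve, its coefficients, and the IO timings are all illustrative assumptions, not real hardware numbers):

```python
# Numeric sketch: for an IO-bound workload and a convex power curve, energy
# per unit of work is minimized at the lowest CPU frequency that keeps the
# IO device 100% utilized, independently of the exact curve coefficients.

P_IDLE = 0.5          # W, hypothetical idle package power
C3 = 2.0e-28          # W/Hz^3, hypothetical dynamic-power coefficient
C_CPU = 2.0e6         # CPU cycles needed to submit one IO request
F_MIN = 2.0e9         # Hz, lowest frequency that saturates the device
T_IO = C_CPU / F_MIN  # s, device time per request (consistent with F_MIN)

def power(f_hz):
    """Convex CPU power curve (idle floor plus cubic dynamic term)."""
    return P_IDLE + C3 * f_hz ** 3

def per_request_energy(f_hz):
    """Energy per IO request, or None if the CPU can't keep the device fed."""
    t_busy = C_CPU / f_hz
    if t_busy > T_IO:
        return None   # device under-utilized: throughput drops
    # Device saturated: one request completes every T_IO seconds.
    return power(f_hz) * t_busy + P_IDLE * (T_IO - t_busy)

# Any frequency above F_MIN strictly increases energy per request; only
# convexity of power(), not its exact coefficients, is needed for this.
for f in (1.5 * F_MIN, 2 * F_MIN, 3 * F_MIN):
    assert per_request_energy(f) > per_request_energy(F_MIN)
```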




Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

2018-04-12 Thread Francisco Jerez
Juri Lelli  writes:

> Hi,
>
> On 11/04/18 09:26, Francisco Jerez wrote:
>> Francisco Jerez  writes:
>> 
>> > Hi Srinivas,
>> >
>> > Srinivas Pandruvada  writes:
>> >
>> >> On Tue, 2018-04-10 at 15:28 -0700, Francisco Jerez wrote:
>> >>> Francisco Jerez  writes:
>> >>> 
>> >> [...]
>> >>
>> >>
>> >>> For the case anyone is wondering what's going on, Srinivas pointed me
>> >>> at
>> >>> a larger idle power usage increase off-list, ultimately caused by the
>> >>> low-latency heuristic as discussed in the paragraph above.  I have a
>> >>> v2
>> >>> of PATCH 6 that gives the controller a third response curve roughly
>> >>> intermediate between the low-latency and low-power states of this
>> >>> revision, which avoids the energy usage increase while C0 residency
>> >>> is
>> >>> low (e.g. during idle) expected for v1.  The low-latency behavior of
>> >>> this revision is still going to be available based on a heuristic (in
>> >>> particular when a realtime-priority task is scheduled).  We're
>> >>> carrying
>> >>> out some additional testing, I'll post the code here eventually.
>> >>
>> >> Please try sched-util governor also. There is a frequency-invariant
>> >> patch, which I can send you (This eventually will be pushed by Peter).
>> >> We want to avoid complexity to intel-pstate for non HWP power sensitive
>> >> platforms as far as possible.
>> >>
>> >
>> > Unfortunately the schedutil governor (whether frequency invariant or
>> > not) has the exact same energy efficiency issues as the present
>> > intel_pstate non-HWP governor.  Its response is severely underdamped
>> > leading to energy-inefficient behavior for any oscillating non-CPU-bound
>> > workload.  To exacerbate that problem the frequency is maxed out on
>> > frequent IO waiting just like the current intel_pstate cpu-load
>> 
>> "just like" here is possibly somewhat unfair to the schedutil governor,
>> admittedly its progressive IOWAIT boosting behavior seems somewhat less
>> wasteful than the intel_pstate non-HWP governor's IOWAIT boosting
>> behavior, but it's still largely unhelpful on IO-bound conditions.
>
> Sorry if I jump in out of the blue, but what you are trying to solve
> looks very similar to what IPA [1] is targeting as well. I might be
> wrong (I'll try to spend more time reviewing your set), but my first
> impression is that we should try to solve similar problems with a more
> general approach that could benefit different sys/archs.
>

Thanks, seems interesting, I've also been taking a look at your
whitepaper and source code.  The problem we've both been trying to solve
is indeed closely related, there may be an opportunity for sharing
efforts both ways.

Correct me if I didn't understand the whole details about your power
allocation code, but IPA seems to be dividing up the available power
budget proportionally to the power requested by the different actors (up
to the point that causes some actor to reach its maximum power) and
configured weights.  From my understanding of the get_requested_power
implementations for cpufreq and devfreq, the requested power attempts to
approximate the current power usage of each device (whether it's
estimated from the current frequency and a capacitance model, from the
get_real_power callback, or other mechanism), which can be far from the
optimal power consumption in cases where the device's governor is
programming a frequency that wildly deviates from the optimal one (as is
the case with the current intel_pstate governor for any IO-bound
workload, which incidentally will suffer the greatest penalty from a
suboptimal power allocation in cases where the IO device is actually an
integrated GPU).

Is there any mechanism in place to prevent the system from stabilizing
at a power allocation that prevents it from achieving maximum
throughput?  E.g. in a TDP-limited system with two devices consuming a
total power of Pmax = P0(f0) + P1(f1), with f0 much greater than the
optimal, and f1 capped at a frequency lower than the optimal due to TDP
or thermal constraints, and assuming that the system is bottlenecking at
the second device.  In such a scenario wouldn't IPA distribute power in
a way that roughly approximates the pre-existing suboptimal
distribution?

If that's the case, I understand that it's the responsibility of the
device's (or CPU's) frequency governor to request a frequency which is
reasonably energy-efficient in the first place for the balancer to
function correctly? (That's precisely the goal of this series) -- Which
in addition allows the system to use less power to get the same work
done in cases where the system is not thermally or TDP-limited as a
whole, so the balancing logic wouldn't have any effect at all.
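The scenario in question can be caricatured numerically (this is a deliberately simplified proportional split, not IPA's actual allocation algorithm, and every number below is made up):

```python
# Toy illustration of a proportional power balancer dividing a fixed TDP
# between a CPU running faster than necessary and a bottlenecking GPU.

P_MAX = 10.0  # W, hypothetical shared TDP budget

def proportional_split(requests, budget=P_MAX):
    """Divide the budget proportionally to the power each actor requests."""
    total = sum(requests)
    return [budget * r / total for r in requests]

# The CPU governor requests power matching its (suboptimally high) current
# frequency; the bottlenecking GPU requests what it needs to stop throttling.
cpu_grant, gpu_grant = proportional_split([8.0, 8.0])
# -> 5.0 W each: the GPU stays power-starved even though the CPU (not being
# the bottleneck) could run at, say, 3 W with no loss of throughput.

# If the CPU governor requested an energy-efficient frequency instead, the
# same balancer would leave the GPU most of the headroom:
cpu_grant2, gpu_grant2 = proportional_split([3.0, 8.0])
print(gpu_grant2)  # ~7.27 W, enough to lift GPU (and overall) performance
```

The point of the sketch is that a proportional balancer can only redistribute what the frequency governors request, so a CPU governor requesting close to its energy-optimal frequency is a precondition for the split to approach the throughput-optimal one.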

> I'm Cc-ing some Arm folks...
>
> Best,
>
> - Juri
>
> [1] https://developer.arm.com/open-source/intelligent-power-allocation



Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

2018-04-13 Thread Juri Lelli
Hi,

On 11/04/18 09:26, Francisco Jerez wrote:
> Francisco Jerez  writes:
> 
> > Hi Srinivas,
> >
> > Srinivas Pandruvada  writes:
> >
> >> On Tue, 2018-04-10 at 15:28 -0700, Francisco Jerez wrote:
> >>> Francisco Jerez  writes:
> >>> 
> >> [...]
> >>
> >>
> >>> For the case anyone is wondering what's going on, Srinivas pointed me
> >>> at
> >>> a larger idle power usage increase off-list, ultimately caused by the
> >>> low-latency heuristic as discussed in the paragraph above.  I have a
> >>> v2
> >>> of PATCH 6 that gives the controller a third response curve roughly
> >>> intermediate between the low-latency and low-power states of this
> >>> revision, which avoids the energy usage increase while C0 residency
> >>> is
> >>> low (e.g. during idle) expected for v1.  The low-latency behavior of
> >>> this revision is still going to be available based on a heuristic (in
> >>> particular when a realtime-priority task is scheduled).  We're
> >>> carrying
> >>> out some additional testing, I'll post the code here eventually.
> >>
> >> Please try sched-util governor also. There is a frequency-invariant
> >> patch, which I can send you (This eventually will be pushed by Peter).
> >> We want to avoid complexity to intel-pstate for non HWP power sensitive
> >> platforms as far as possible.
> >>
> >
> > Unfortunately the schedutil governor (whether frequency invariant or
> > not) has the exact same energy efficiency issues as the present
> > intel_pstate non-HWP governor.  Its response is severely underdamped
> > leading to energy-inefficient behavior for any oscillating non-CPU-bound
> > workload.  To exacerbate that problem the frequency is maxed out on
> > frequent IO waiting just like the current intel_pstate cpu-load
> 
> "just like" here is possibly somewhat unfair to the schedutil governor,
> admittedly its progressive IOWAIT boosting behavior seems somewhat less
> wasteful than the intel_pstate non-HWP governor's IOWAIT boosting
> behavior, but it's still largely unhelpful on IO-bound conditions.

Sorry if I jump in out of the blue, but what you are trying to solve
looks very similar to what IPA [1] is targeting as well. I might be
wrong (I'll try to spend more time reviewing your set), but my first
impression is that we should try to solve similar problems with a more
general approach that could benefit different sys/archs.

I'm Cc-ing some Arm folks...

Best,

- Juri

[1] https://developer.arm.com/open-source/intelligent-power-allocation


Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

2018-04-13 Thread Peter Zijlstra
On Thu, Apr 12, 2018 at 12:55:39PM -0700, Francisco Jerez wrote:
> Actually assuming that a single geometric feature of the power curve is
> known -- it being convex in the frequency range allowed by the policy
> (which is almost always the case, not only for Intel CPUs), the optimal
> frequency for an IO-bound workload is fully independent of the exact
> power curve -- It's just the minimum CPU frequency that's able to keep
> the bottlenecking IO device at 100% utilization. 

I think that is difficult to determine with the information at hand. We
have lost all device information by the time we reach the scheduler.


Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

2018-04-13 Thread Francisco Jerez
Peter Zijlstra  writes:

> On Thu, Apr 12, 2018 at 12:55:39PM -0700, Francisco Jerez wrote:
>> Actually assuming that a single geometric feature of the power curve is
>> known -- it being convex in the frequency range allowed by the policy
>> (which is almost always the case, not only for Intel CPUs), the optimal
>> frequency for an IO-bound workload is fully independent of the exact
>> power curve -- It's just the minimum CPU frequency that's able to keep
>> the bottlenecking IO device at 100% utilization. 
>
> I think that is difficult to determine with the information at hand. We
> have lost all device information by the time we reach the scheduler.

I assume you mean it's difficult to tell whether the workload is
CPU-bound or IO-bound?  Yeah, it's non-trivial to determine whether the
system is bottlenecking on IO, it requires additional infrastructure to
keep track of IO utilization (that's the purpose of PATCH 1), and even
then it involves some heuristic assumptions which are not guaranteed
fail-proof, so the controller needs to be prepared for things to behave
reasonably when the assumptions deviate from reality (see the comments
in PATCH 6 for more details on what happens in such cases) -- How
frequently that happens in practice is what determines how far the
controller's response will be from the optimally energy-efficient
behavior in a real workload.  It seems to work fairly well in practice,
at least in the sample of test-cases I've been able to gather data from
so far.

Anyway that's the difficult part.  Once (if) you know you're IO-bound,
determining the optimal (most energy-efficient) CPU frequency is
relatively straightforward, and doesn't require knowledge of the exact
power curve of the CPU (beyond clamping the controller response to the
convexity region of the power curve).




Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

2018-04-13 Thread Francisco Jerez
Hi Srinivas,

Srinivas Pandruvada  writes:

> On Wed, 2018-04-11 at 09:26 -0700, Francisco Jerez wrote:
>> 
>> "just like" here is possibly somewhat unfair to the schedutil
>> governor,
>> admittedly its progressive IOWAIT boosting behavior seems somewhat
>> less
>> wasteful than the intel_pstate non-HWP governor's IOWAIT boosting
>> behavior, but it's still largely unhelpful on IO-bound conditions.
>> 
>
> OK, if you think so, then improve it for sched-util governor or other
> mechanisms (as Juri suggested) instead of intel-pstate.

You may not have realized but this series provides a full drop-in
replacement for the current non-HWP governor of the intel_pstate driver,
it should be strictly superior to the current cpu-load governor in terms
of energy usage and performance under most scenarios (hold on for v2 for
the idle consumption issue).  The main reason it's currently implemented as
a separate governor is for us to be able to deploy it on BXT+ platforms
only for the moment, in order to decrease our initial validation effort
and get enough test coverage on BXT (which is incidentally the platform
that's going to get the greatest payoff) during a few release cycles.

Are you no longer interested in improving those aspects of the non-HWP
governor?  Is it that you're planning to delete it and move back to a
generic cpufreq governor for non-HWP platforms in the near future?

> This will benefit all architectures including x86 + non i915.
>

The current design encourages re-use of the IO utilization statistic
(see PATCH 1) by other governors as a mechanism driving the trade-off
between energy efficiency and responsiveness based on whether the system
is close to CPU-bound, in whatever way is applicable to each governor
(e.g. it would make sense for it to be hooked up to the EPP preference
knob in the case of the intel_pstate HWP governor, which would allow it
to achieve better energy efficiency in IO-bound situations just like
this series does for non-HWP parts).  There's nothing really x86- nor
i915-specific about it.

> BTW intel-pstate can be driven by sched-util governor (passive mode),
> so if your prove benefits to Broxton, this can be a default.
> As before:
> - No regression to idle power at all. This is more important than
> benchmarks
> - Not just score, performance/watt is important
>

Is schedutil actually on par with the intel_pstate non-HWP governor as
of today, according to these metrics and the overall benchmark numbers?

> Thanks,
> Srinivas
>
>
>> > controller does, even though the frequent IO waits may actually be
>> > an
>> > indication that the system is IO-bound (which means that the large
>> > energy usage increase may not be translated in any performance
>> > benefit
>> > in practice, not to speak of performance being impacted negatively
>> > in
>> > TDP-bound scenarios like GPU rendering).
>> > 
>> > Regarding run-time complexity, I haven't observed this governor to
>> > be
>> > measurably more computationally intensive than the present
>> > one.  It's a
>> > bunch more instructions indeed, but still within the same ballpark
>> > as
>> > the current governor.  The average increase in CPU utilization on
>> > my BXT
>> > with this series is less than 0.03% (sampled via ftrace for v1, I
>> > can
>> > repeat the measurement for the v2 I have in the works, though I
>> > don't
>> > expect the result to be substantially different).  If this is a
>> > problem
>> > for you there are several optimization opportunities that would cut
>> > down
>> > the number of CPU cycles get_target_pstate_lp() takes to execute by
>> > a
>> > large percent (most of the optimization ideas I can think of right
>> > now
>> > though would come at some accuracy/maintainability/debuggability
>> > cost,
>> > but may still be worth pursuing), but the computational overhead is
>> > low
>> > enough at this point that the impact on any benchmark or real
>> > workload
>> > would be orders of magnitude lower than its variance, which makes
>> > it
>> > kind of difficult to keep the discussion data-driven [as possibly
>> > any
>> > performance optimization discussion should ever be ;)].
>> > 
>> > > 
>> > > Thanks,
>> > > Srinivas
>> > > 
>> > > 
>> > > 
>> > > > 
>> > > > > [Absolute benchmark results are unfortunately omitted from
>> > > > > this
>> > > > > letter
>> > > > > due to company policies, but the percent change and Student's
>> > > > > T
>> > > > > p-value are included above and in the referenced benchmark
>> > > > > results]
>> > > > > 
>> > > > > The most obvious impact of this series will likely be the
>> > > > > overall
>> > > > > improvement in graphics performance on systems with an IGP
>> > > > > integrated
>> > > > > into the processor package (though for the moment this is
>> > > > > only
>> > > > > enabled
>> > > > > on BXT+), because the TDP budget shared among CPU and GPU can
>> > > > > frequently become a limiting factor in low-power devices.  On
>> > > > > heavily
>> > > > > TDP-bound devices this serie

Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

2018-04-13 Thread Srinivas Pandruvada
Hi Francisco,

[...]

> Are you no longer interested in improving those aspects of the non-
> HWP
> governor?  Is it that you're planning to delete it and move back to a
> generic cpufreq governor for non-HWP platforms in the near future?

Yes, that is the plan for Atom platforms, which are the only non-HWP
platforms till now. You have to show good gains in performance and
performance/watt to justify carrying and maintaining such a big change. So
we have to see your performance and power numbers.

> 
> > This will benefit all architectures including x86 + non i915.
> > 
> 
> The current design encourages re-use of the IO utilization statistic
> (see PATCH 1) by other governors as a mechanism driving the trade-off
> between energy efficiency and responsiveness based on whether the
> system
> is close to CPU-bound, in whatever way is applicable to each governor
> (e.g. it would make sense for it to be hooked up to the EPP
> preference
> knob in the case of the intel_pstate HWP governor, which would allow
> it
> to achieve better energy efficiency in IO-bound situations just like
> this series does for non-HWP parts).  There's nothing really x86- nor
> i915-specific about it.
> 
> > BTW intel-pstate can be driven by sched-util governor (passive
> > mode),
> > so if your prove benefits to Broxton, this can be a default.
> > As before:
> > - No regression to idle power at all. This is more important than
> > benchmarks
> > - Not just score, performance/watt is important
> > 
> 
> Is schedutil actually on par with the intel_pstate non-HWP governor
> as
> of today, according to these metrics and the overall benchmark
> numbers?
Yes, except for a few cases. I have not tested recently, so it may be
better now.

Thanks,
Srinivas


> > Thanks,
> > Srinivas
> > 
> > 
> > > > controller does, even though the frequent IO waits may actually
> > > > be
> > > > an
> > > > indication that the system is IO-bound (which means that the
> > > > large
> > > > energy usage increase may not be translated in any performance
> > > > benefit
> > > > in practice, not to speak of performance being impacted
> > > > negatively
> > > > in
> > > > TDP-bound scenarios like GPU rendering).
> > > > 
> > > > Regarding run-time complexity, I haven't observed this governor
> > > > to
> > > > be
> > > > measurably more computationally intensive than the present
> > > > one.  It's a
> > > > bunch more instructions indeed, but still within the same
> > > > ballpark
> > > > as
> > > > the current governor.  The average increase in CPU utilization
> > > > on
> > > > my BXT
> > > > with this series is less than 0.03% (sampled via ftrace for v1,
> > > > I
> > > > can
> > > > repeat the measurement for the v2 I have in the works, though I
> > > > don't
> > > > expect the result to be substantially different).  If this is a
> > > > problem
> > > > for you there are several optimization opportunities that would
> > > > cut
> > > > down
> > > > the number of CPU cycles get_target_pstate_lp() takes to
> > > > execute by
> > > > a
> > > > large percent (most of the optimization ideas I can think of
> > > > right
> > > > now
> > > > though would come at some
> > > > accuracy/maintainability/debuggability
> > > > cost,
> > > > but may still be worth pursuing), but the computational
> > > > overhead is
> > > > low
> > > > enough at this point that the impact on any benchmark or real
> > > > workload
> > > > would be orders of magnitude lower than its variance, which
> > > > makes
> > > > it
> > > > kind of difficult to keep the discussion data-driven [as
> > > > possibly
> > > > any
> > > > performance optimization discussion should ever be ;)].
> > > > 
> > > > > 
> > > > > Thanks,
> > > > > Srinivas
> > > > > 
> > > > > 
> > > > > 
> > > > > > 
> > > > > > > [Absolute benchmark results are unfortunately omitted
> > > > > > > from
> > > > > > > this
> > > > > > > letter
> > > > > > > due to company policies, but the percent change and
> > > > > > > Student's
> > > > > > > T
> > > > > > > p-value are included above and in the referenced
> > > > > > > benchmark
> > > > > > > results]
> > > > > > > 
> > > > > > > The most obvious impact of this series will likely be the
> > > > > > > overall
> > > > > > > improvement in graphics performance on systems with an
> > > > > > > IGP
> > > > > > > integrated
> > > > > > > into the processor package (though for the moment this is
> > > > > > > only
> > > > > > > enabled
> > > > > > > on BXT+), because the TDP budget shared among CPU and GPU
> > > > > > > can
> > > > > > > frequently become a limiting factor in low-power
> > > > > > > devices.  On
> > > > > > > heavily
> > > > > > > TDP-bound devices this series improves performance of
> > > > > > > virtually any
> > > > > > > non-trivial graphics rendering by a significant amount
> > > > > > > (of
> > > > > > > the
> > > > > > > order
> > > > > > > of the energy efficiency improvement for that workload
> > > > > > > assuming the
> > > > > > > optimization didn't cause it to become non-TDP-bound).
> >

Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

2018-04-14 Thread Peter Zijlstra
On Fri, Apr 13, 2018 at 06:57:39PM -0700, Francisco Jerez wrote:
> Peter Zijlstra  writes:
> 
> > On Thu, Apr 12, 2018 at 12:55:39PM -0700, Francisco Jerez wrote:
> >> Actually assuming that a single geometric feature of the power curve is
> >> known -- it being convex in the frequency range allowed by the policy
> >> (which is almost always the case, not only for Intel CPUs), the optimal
> >> frequency for an IO-bound workload is fully independent of the exact
> >> power curve -- It's just the minimum CPU frequency that's able to keep
> >> the bottlenecking IO device at 100% utilization. 
> >
> > I think that is difficult to determine with the information at hand. We
> > have lost all device information by the time we reach the scheduler.
> 
> I assume you mean it's difficult to tell whether the workload is
> CPU-bound or IO-bound?  Yeah, it's non-trivial to determine whether the
> system is bottlenecking on IO, it requires additional infrastructure to
> keep track of IO utilization (that's the purpose of PATCH 1), and even

Note that I've not actually seen any of your patches; I got Cc'ed on
later.



Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

2018-04-16 Thread Eero Tamminen

Hi,

On 14.04.2018 07:01, Srinivas Pandruvada wrote:
> Hi Francisco,
>
> [...]
>
>> Are you no longer interested in improving those aspects of the non-HWP
>> governor?  Is it that you're planning to delete it and move back to a
>> generic cpufreq governor for non-HWP platforms in the near future?
>
> Yes that is the plan for Atom platforms, which are only non HWP
> platforms till now. You have to show good gain for performance and
> performance/watt to carry and maintain such big change. So we have to
> see your performance and power numbers.

For the active cases, you can look at the links at the beginning /
bottom of this mail thread.  Francisco provided performance results for
more than 100 benchmarks.



On this side of the Atlantic, we've been testing different versions of the
patchset over the past few months with more than 50 Linux 3D benchmarks on
6 different platforms.


On Geminilake and a few BXT configurations (where 3D benchmarks are TDP
limited), performance of many tests improves by 5-15%, including complex
ones.  More importantly, there were no regressions.


(You can see details + links to more info in Jira ticket VIZ-12078.)

*In (fully) TDP-limited cases, power usage (obviously) stays the same, so
performance/watt improvements can be derived from the measured
performance improvements.*



We also have data for earlier platforms from slightly older versions of
the patchset, but on those it didn't have any significant impact on
performance.


I think the main reason for this is that the BYT & BSW NUCs we have only
have space for a single memory module.  Without a dual-channel memory
configuration, benchmarks are too memory-bottlenecked to utilize the GPU
enough to make things TDP limited on those platforms.


However, now that I look at the old BYT & BSW data (for the few benchmarks
which improved most on BXT & GLK), I see that there's a reduction in CPU
power usage according to RAPL, at least on BSW.



- Eero



This will benefit all architectures including x86 + non i915.



The current design encourages re-use of the IO utilization statistic
(see PATCH 1) by other governors as a mechanism driving the trade-off
between energy efficiency and responsiveness based on whether the
system
is close to CPU-bound, in whatever way is applicable to each governor
(e.g. it would make sense for it to be hooked up to the EPP
preference
knob in the case of the intel_pstate HWP governor, which would allow
it
to achieve better energy efficiency in IO-bound situations just like
this series does for non-HWP parts).  There's nothing really x86- nor
i915-specific about it.


BTW intel-pstate can be driven by the sched-util governor (passive
mode), so if you prove benefits on Broxton, this can be a default.
As before:
- No regression to idle power at all. This is more important than
  benchmarks
- Not just score, performance/watt is important



Is schedutil actually on par with the intel_pstate non-HWP governor as
of today, according to these metrics and the overall benchmark numbers?

Yes, except for a few cases. I have not tested recently, so it may be
better now.

Thanks,
Srinivas



Thanks,
Srinivas



controller does, even though the frequent IO waits may actually be an
indication that the system is IO-bound (which means that the large
energy usage increase may not be translated in any performance benefit
in practice, not to speak of performance being impacted negatively in
TDP-bound scenarios like GPU rendering).

Regarding run-time complexity, I haven't observed this governor to be
measurably more computationally intensive than the present one.  It's a
bunch more instructions indeed, but still within the same ballpark as
the current governor.  The average increase in CPU utilization on my BXT
with this series is less than 0.03% (sampled via ftrace for v1, I can
repeat the measurement for the v2 I have in the works, though I don't
expect the result to be substantially different).  If this is a problem
for you there are several optimization opportunities that would cut down
the number of CPU cycles get_target_pstate_lp() takes to execute by a
large percent (most of the optimization ideas I can think of right now
though would come at some accuracy/maintainability/debuggability cost,
but may still be worth pursuing), but the computational overhead is low
enough at this point that the impact on any benchmark or real workload
would be orders of magnitude lower than its variance, which makes it
kind of difficult to keep the discussion data-driven [as possibly any
performance optimization discussion should ever be ;)].



Thanks,
Srinivas






[Absolute benchmark results are unfortunately omitted from this letter
due to company policies, but the percent change and Student's T p-value
are included above and in the referenced benchmark results]

The most obvious impact of this series will likely be the overall
improvement in graphics performance on systems with an IGP integrated
into the processor package (though for the moment this is only enabled
on 

Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

2018-04-16 Thread Srinivas Pandruvada
On Mon, 2018-04-16 at 17:04 +0300, Eero Tamminen wrote:
> Hi,
> 
> On 14.04.2018 07:01, Srinivas Pandruvada wrote:
> > Hi Francisco,
> > 
> > [...]
> > 
> > > Are you no longer interested in improving those aspects of the
> > > non-
> > > HWP
> > > governor?  Is it that you're planning to delete it and move back
> > > to a
> > > generic cpufreq governor for non-HWP platforms in the near
> > > future?
> > 
> > Yes, that is the plan for Atom platforms, which are the only non-HWP
> > platforms till now. You have to show a good gain in performance and
> > performance/watt to carry and maintain such a big change. So we have
> > to see your performance and power numbers.
> 
> For the active cases, you can look at the links at the beginning / 
> bottom of this mail thread.  Francisco provided performance results
> for 
>  >100 benchmarks.
Looks like you didn't test the idle cases, which are more important.
Systems will tend to be more idle (increased +50% by the patches). Once
you fix the idle behavior, you have to retest, and then the results will
be interesting.

Once you fix this, then it is purely a question of algorithm; whether it
is done in intel-pstate or the sched-util governor does not make a big
difference. It is better to do it in sched-util, as this will benefit
all architectures and will get better test coverage and maintenance.

Thanks,
Srinivas



> 
> 
> On this side of the Atlantic, we've been testing different versions
> of the patchset over the past few months with >50 Linux 3D benchmarks
> on 6 different platforms.
> 
> On Geminilake and a few BXT configurations (where 3D benchmarks are
> TDP limited), many tests' performance improves by 5-15%, including
> complex ones.  More importantly, there were no regressions.
> 
> (You can see details + links to more info in Jira ticket VIZ-12078.)
> 
> *On (fully) TDP limited cases, power usage (obviously) stays the
> same, so performance/watt improvements can be derived from the
> measured performance improvements.*
> 
> 
> We have data also for earlier platforms from slightly older versions
> of 
> the patchset, but on those it didn't have any significant impact on 
> performance.
> 
> I think the main reason for this is that the BYT & BSW NUCs that we
> have only have space for a single memory module.  Without a
> dual-channel memory configuration, benchmarks are too
> memory-bottlenecked to utilize the GPU enough to make things TDP
> limited on those platforms.
> 
> However, now that I look at the old BYT & BSW data (for the few
> benchmarks which improved most on BXT & GLK), I see that there's a
> reduction in CPU power utilization according to RAPL, at least on
> BSW.
> 
> 
>   - Eero
> 
> 
> > > > This will benefit all architectures including x86 + non i915.
> > > > 
> > > 
> > > The current design encourages re-use of the IO utilization
> > > statistic
> > > (see PATCH 1) by other governors as a mechanism driving the
> > > trade-off
> > > between energy efficiency and responsiveness based on whether the
> > > system
> > > is close to CPU-bound, in whatever way is applicable to each
> > > governor
> > > (e.g. it would make sense for it to be hooked up to the EPP
> > > preference
> > > knob in the case of the intel_pstate HWP governor, which would
> > > allow
> > > it
> > > to achieve better energy efficiency in IO-bound situations just
> > > like
> > > this series does for non-HWP parts).  There's nothing really x86-
> > > nor
> > > i915-specific about it.
> > > 
> > > > BTW intel-pstate can be driven by sched-util governor (passive
> > > > mode),
> > > > so if you prove benefits on Broxton, this can be a default.
> > > > As before:
> > > > - No regression to idle power at all. This is more important
> > > > than
> > > > benchmarks
> > > > - Not just score, performance/watt is important
> > > > 
> > > 
> > > Is schedutil actually on par with the intel_pstate non-HWP
> > > governor
> > > as
> > > of today, according to these metrics and the overall benchmark
> > > numbers?
> > 
> > Yes, except for a few cases. I have not tested recently, so it may
> > be better now.
> > 
> > Thanks,
> > Srinivas
> > 
> > 
> > > > Thanks,
> > > > Srinivas
> > > > 
> > > > 
> > > > > > controller does, even though the frequent IO waits may
> > > > > > actually
> > > > > > be
> > > > > > an
> > > > > > indication that the system is IO-bound (which means that
> > > > > > the
> > > > > > large
> > > > > > energy usage increase may not be translated in any
> > > > > > performance
> > > > > > benefit
> > > > > > in practice, not to speak of performance being impacted
> > > > > > negatively
> > > > > > in
> > > > > > TDP-bound scenarios like GPU rendering).
> > > > > > 
> > > > > > Regarding run-time complexity, I haven't observed this
> > > > > > governor
> > > > > > to
> > > > > > be
> > > > > > measurably more computationally intensive than the present
> > > > > > one.  It's a
> > > > > > bunch more instructions indeed, but still within the same
> > > > > > ballpark
> > > > > > as
> > > > > > the current governor.  The 

Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

2018-04-17 Thread Chris Wilson
I have to ask, if this is all just to work around iowait triggering high
frequencies for GPU bound applications, does it all just boil down to
i915 incorrectly using iowait. Does this patch set perform better than

diff --git a/drivers/gpu/drm/i915/i915_request.c 
b/drivers/gpu/drm/i915/i915_request.c
index 9ca9c24b4421..7e7c95411bcd 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1267,7 +1267,7 @@ long i915_request_wait(struct i915_request *rq,
goto complete;
}
 
-   timeout = io_schedule_timeout(timeout);
+   timeout = schedule_timeout(timeout);
} while (1);
 
GEM_BUG_ON(!intel_wait_has_seqno(&wait));

Quite clearly the general framework could prove useful in a broader
range of situations, but does the above suffice? (And can be backported
to stable.)
-Chris
___
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

2018-04-17 Thread Srinivas Pandruvada
On Tue, 2018-04-17 at 15:03 +0100, Chris Wilson wrote:
> I have to ask, if this is all just to work around iowait triggering
> high
> frequencies for GPU bound applications, does it all just boil down to
> i915 incorrectly using iowait. Does this patch set perform better
> than
> 
> diff --git a/drivers/gpu/drm/i915/i915_request.c
> b/drivers/gpu/drm/i915/i915_request.c
> index 9ca9c24b4421..7e7c95411bcd 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1267,7 +1267,7 @@ long i915_request_wait(struct i915_request *rq,
> goto complete;
> }
>  
> -   timeout = io_schedule_timeout(timeout);
> +   timeout = schedule_timeout(timeout);
> } while (1);
>  
> GEM_BUG_ON(!intel_wait_has_seqno(&wait));
> 
> Quite clearly the general framework could prove useful in a broader
> range of situations, but does the above suffice? (And can be
> backported
> to stable.)

Definitely a very good test to do.

Thanks,
Srinivas

> -Chris


Re: [Intel-gfx] [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

2018-04-17 Thread Francisco Jerez
Hey Chris,

Chris Wilson  writes:

> I have to ask, if this is all just to work around iowait triggering high
> frequencies for GPU bound applications, does it all just boil down to
> i915 incorrectly using iowait. Does this patch set perform better than
>
> diff --git a/drivers/gpu/drm/i915/i915_request.c 
> b/drivers/gpu/drm/i915/i915_request.c
> index 9ca9c24b4421..7e7c95411bcd 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1267,7 +1267,7 @@ long i915_request_wait(struct i915_request *rq,
> goto complete;
> }
>  
> -   timeout = io_schedule_timeout(timeout);
> +   timeout = schedule_timeout(timeout);
> } while (1);
>  
> GEM_BUG_ON(!intel_wait_has_seqno(&wait));
>
> Quite clearly the general framework could prove useful in a broader
> range of situations, but does the above suffice? (And can be backported
> to stable.)
> -Chris

Nope...  This hunk is one of the first things I tried when I started
looking into this.  It didn't cut it, and it seemed to lead to some
regressions in latency-bound test-cases that were relying on the upward
bias provided by IOWAIT boosting in combination with the upstream
P-state governor.

The reason why it's not sufficient is that the bulk of the energy
efficiency improvement from this series is obtained by dampening
high-frequency oscillations of the CPU P-state, which occur currently
for any periodically fluctuating workload (not only i915 rendering)
regardless of whether IOWAIT boosting kicks in.  i915 using IO waits
does exacerbate the problem with the upstream governor by amplifying the
oscillation, but it's not really the ultimate cause.

In combination with v2 (you can take a peek at the half-baked patch here
[1], planning to make a few more changes this week so it isn't quite
ready for review yet) this hunk will actually cause more serious
regressions because v2 is able to use the frequent IOWAIT stalls of the
i915 driver in combination with an IO-underutilized system as a strong
indication that the workload is latency-bound, which causes it to
transition to latency-minimizing mode, which significantly improves
performance of latency-bound rendering (most visible in a handful X11
test-cases and some GPU/CPU sync test-cases from SynMark2).

[1] 
https://people.freedesktop.org/~currojerez/intel_pstate-lp/0001-cpufreq-intel_pstate-Implement-variably-low-pass-fil-v1.8.patch

