Bug#814578: laptop-mode-tools throttles pstate-managed CPUs to half speed on battery power

Ritesh Raj Sarraf Thu, 07 Apr 2016 04:06:46 -0700

Regarding cpufreq-info's output: after re-reading Arjan's explanation, I think
it makes sense.

Arjan's comments
https://plus.google.com/+ArjanvandeVen/posts/dLn9T4ehywL
=================

Some basics on CPU P states on Intel processors

There seem to be a lot of things people don't realize about how P state
selection works on Intel processors, and arguably the documentation is slightly
confusing in this regard... and things have been changing generation to
generation.

First.. why use the word "P state" and not "frequency"? This is important in
terms of thinking about how this works.

"Clock frequency" is something that you measure over some period of time,
basically an average on how fast a clock signal went up/down. It's something you
can measure, but it's backwards looking. Intel CPUs expose two counters (aperf
and mperf) via MSR registers, and if you look at these two registers at two
separate times (far enough apart to avoid rounding effects), the ratio of the
delta in these two registers gives you a very nice "average frequency" over your
measurement interval. (The official SDM documentation has the exact formula for
this)
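
As a rough illustration of the aperf/mperf ratio described above, here is a
minimal Python sketch. It does not read the MSRs itself; it only does the
arithmetic on two samples of each counter. The function names and the
base_mhz parameter are made up for this example, and the sketch assumes that
mperf ticks at a fixed reference frequency while aperf ticks at the actual
clock. The SDM remains the authority for the exact formula.

```python
COUNTER_BITS = 64  # assumption: both counters are treated as 64-bit values


def counter_delta(start, end, bits=COUNTER_BITS):
    """Difference between two counter samples, tolerating one wraparound."""
    return (end - start) % (1 << bits)


def average_frequency_mhz(aperf_start, aperf_end, mperf_start, mperf_end, base_mhz):
    """Average frequency over the sampling interval.

    base_mhz is the (assumed) fixed frequency at which mperf ticks.
    """
    d_aperf = counter_delta(aperf_start, aperf_end)
    d_mperf = counter_delta(mperf_start, mperf_end)
    if d_mperf == 0:
        # No reference ticks were counted, so no ratio can be formed.
        raise ValueError("mperf did not advance; sample over a longer interval")
    return base_mhz * d_aperf / d_mperf
```

For example, if aperf advanced half as much as mperf on a part whose reference
frequency is 2000 MHz, the result is an average of 1000 MHz over the interval.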

A P state is a number the OS tells the hardware regarding how much performance
it would like to see on a certain (logical) cpu; a P state request is very much
something forward looking.

So how are these related? 
In the ten year old, single core, no hyperthreading world, things were
relatively simple. You could basically map a P state to some "frequency" that
you'd get, and as the marketing folks told us, a higher frequency means more
performance.

Today, things are much more complex in several key ways.

First of all, and this is important and different from 10 years ago... no matter
which P state you ask for, when a logical processor is idle (C state), its
frequency is typically 0. The exception to this "typically" is the lightest of
the C states (C1), where the frequency is the lowest frequency the CPU supports,
and not zero. (but going into C1 is pretty rare, and very short lived, so for
this posting, I'm going to ignore C1).

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A second important aspect is that of "coordination". For practical reasons, on
current Intel processors, all the cores in a package share the same voltage. And
because running at a lower frequency than possible at a certain voltage is
inefficient, all the cores will also share the same clock frequency at any one
time. Of course, except the cores that are idle, because their frequency is
zero!
Because the OS will ask each individual logical processor for a separate P
state, some reconciliation is needed between the different cores. This
reconciliation is actually very simple: at any point in time, the frequency of
all the cores is the maximum of what each of the individual cores wants. Of
course, minus the idle cores. Their frequency is zero, and the maximum of
"something" and "zero" is "something". 

A simple example is appropriate here.
Let's take a two core system (core A and core B, that are initially both busy).
Core A would want to have a clock that ticks at 1 GHz, and Core B wants a clock
that ticks at 2 GHz.
The maximum of 1 GHz and 2 GHz is .. 2 GHz, so Core A and Core B will both run
at 2 GHz, even though core A only asked for 1 GHz.
But now at time X, Core B is going idle. Since an idle core has a frequency of
zero, and the maximum of zero and 1 GHz is 1 GHz... Core A now runs with a clock
of 1 GHz.

The key thing here is that Core A gets a very variable behavior, independent of
what it asked for, due to what Core B is doing.
Or in other words, the forward predictive value of a P state selection on a
logical CPU is rather limited.
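
The reconciliation rule from the two core example above can be written down
in a few lines. This is only a toy model of the simplified description given
here, not of what the hardware really does; the function name and the use of
None to mark an idle core are choices made for this sketch.

```python
def package_frequency(requests):
    """Shared frequency of all running cores in a package.

    requests maps a core name to the frequency (in GHz) it asked for,
    or to None when that core is idle. Idle cores do not take part in
    the maximum; if every core is idle the result is 0.0.
    """
    active = [freq for freq in requests.values() if freq is not None]
    return max(active, default=0.0)


# Both cores busy: core A asked for 1 GHz but runs at 2 GHz.
both_busy = package_frequency({"A": 1.0, "B": 2.0})

# Core B goes idle: core A drops back to the 1 GHz it asked for.
b_idle = package_frequency({"A": 1.0, "B": None})
```

The point of the model is that the value returned for core A changes without
core A changing its own request.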

Sound complex? Now imagine that the GPU on die is in many ways like a CPU
core.... and realize that what I described above is actually a simplification of
reality.

Another development in the last few years has been that of "Turbo".
Some people call it "overclocking", but it isn't overclocking, it's all within
the specs of the hardware. Turbo exists because in a multi-core system, it's
possible to run a single core faster than the frequency that is on the label of
the box when you buy the processor. This has to do with power budgets; when you
buy a 35 Watt TDP cpu, the CPU isn't supposed to use more than 35 Watts. So if
you have, say, 4 cores, that means each core by itself can use a little less
than 9 Watts to fit that budget.
But if 3 of the 4 cores are idle... the one remaining core can use the whole 35
Watts. (Now add in that the GPU also counts into this 35 Watts as do several
other shared resources, and it gets much more complex).
If this single core would be limited to 9 Watts instead of the full 35W even
when the others are idle, a lot of potential performance is left on the table.
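
The power budget arithmetic above can be sketched as follows. It assumes the
budget is split evenly between the cores that are running and ignores the GPU
and the other shared resources mentioned above, so it is a deliberate
oversimplification; the function name is invented for this example.

```python
def per_core_budget_watts(tdp_watts, active_cores):
    """Even share of the package power budget per running core.

    Assumption for this sketch: only CPU cores draw from the budget and
    they split it equally.
    """
    if active_cores < 1:
        raise ValueError("at least one core must be active")
    return tdp_watts / active_cores


all_busy = per_core_budget_watts(35, 4)  # 8.75 W each, "a little less than 9"
one_busy = per_core_budget_watts(35, 1)  # the single running core gets all 35 W
```

The gap between the two results is the headroom that Turbo is meant to use.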

Now in the first processors that supported Turbo, the available "extra range"
was limited, but this range has been growing and growing as core counts have
gone up, power sensors have been added to the CPU and power levels have come
down. (don't be surprised to see that your CPU has more levels in the turbo
range than it has outside the turbo range)

What does this mean? Well, when the OS asks for a P state value that is in the
"Turbo Range", it may not actually get the performance that maps to that level;
the sum of the power in the system could be exceeding the allowed TDP value if
that performance (clock frequency) was granted to all cores (remember from above
that all running cores share clock frequency).
What you do get at any one point in time depends on what other cores and the GPU
etc are doing.... and this will vary over time as cores go idle or become
active, or as the GPU finishes a frame or starts a new complex frame... and even
with temperature.
Or in other words, what frequency you get is highly dependent on other things
including the C state selection policy and the graphics subsystem.

Another fun angle is that when a task is running completely memory bound, the
performance of this task is basically independent of the clock frequency.... and
some systems will detect this condition and temporarily lower the clock
frequency to save power without reducing performance too much (all within the
bounds of all the things I described above).

If it wasn't clear yet, a lot of what I described above varies from generation
to generation quite a bit... and it's going to change quite a bit more in the
next few years.

In the 3.9 kernel we've introduced a new controller driver for the P states,
simply because the previous, 10+ year old algorithm wasn't cutting it anymore;
too much has changed. By making the driver CPU generation specific, we can now
select and tune algorithms for each specific generation, and do significantly
better (30%+) than when we used a very generic algorithm.

Another thing to realize from all of this is that while it's easy to talk and
look at performance looking backwards (aperf/mperf allow us to do that),
predicting performance going forward, even if you are very deliberately picking
a P state value, is often near impossible since what you will actually get
depends a LOT on what the other parts of the system are doing.

On Thu, 2016-04-07 at 15:25 +0530, Ritesh Raj Sarraf wrote:
> On the other hand, that tool cpufreq-info, seems to be giving odd results.
> This
> is on a machine which is idle. The reading was taken after interrupting the
> openssl command and giving it some time to settle down.
> 
> The power saving setting was set to full speed,
> i.e. BATT_INTEL_PSTATE_PERF_MAX_PCT=100
> 
> rrs@learner:/media/SSHD/rrs-home/devel/Laptop-Mode-Tools/laptop-mode-tools
> (lmt-
> upstream)$ sudo cpufreq-info 
> cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009
> Report errors and bugs to cpuf...@vger.kernel.org, please.
> analyzing CPU 0:
>   driver: intel_pstate
>   CPUs which run at the same hardware frequency: 0
>   CPUs which need to have their frequency coordinated by software: 0
>   maximum transition latency: 0.97 ms.
>   hardware limits: 800 MHz - 3.10 GHz
>   available cpufreq governors: performance, powersave
>   current policy: frequency should be within 800 MHz and 3.10 GHz.
>                   The governor "powersave" may decide which speed to use
>                   within this range.
>   current CPU frequency is 1.90 GHz (asserted by call to hardware).
> analyzing CPU 1:
>   driver: intel_pstate
>   CPUs which run at the same hardware frequency: 1
>   CPUs which need to have their frequency coordinated by software: 1
>   maximum transition latency: 0.97 ms.
>   hardware limits: 800 MHz - 3.10 GHz
>   available cpufreq governors: performance, powersave
>   current policy: frequency should be within 800 MHz and 3.10 GHz.
>                   The governor "powersave" may decide which speed to use
>                   within this range.
>   current CPU frequency is 1.90 GHz (asserted by call to hardware).
> analyzing CPU 2:
>   driver: intel_pstate
>   CPUs which run at the same hardware frequency: 2
>   CPUs which need to have their frequency coordinated by software: 2
>   maximum transition latency: 0.97 ms.
>   hardware limits: 800 MHz - 3.10 GHz
>   available cpufreq governors: performance, powersave
>   current policy: frequency should be within 800 MHz and 3.10 GHz.
>                   The governor "powersave" may decide which speed to use
>                   within this range.
>   current CPU frequency is 1.90 GHz (asserted by call to hardware).
> analyzing CPU 3:
>   driver: intel_pstate
>   CPUs which run at the same hardware frequency: 3
>   CPUs which need to have their frequency coordinated by software: 3
>   maximum transition latency: 0.97 ms.
>   hardware limits: 800 MHz - 3.10 GHz
>   available cpufreq governors: performance, powersave
>   current policy: frequency should be within 800 MHz and 3.10 GHz.
>                   The governor "powersave" may decide which speed to use
>                   within this range.
>   current CPU frequency is 1.86 GHz (asserted by call to hardware).
> 2016-04-07 / 15:15:27 ♒♒♒  ☺  
-- 
Ritesh Raj Sarraf | http://people.debian.org/~rrs
Debian - The Universal Operating System
