Bug#945055: great CPU temperature increase from 5.2 to 5.5 ... and when using intel_pstate

2020-04-19 Thread Christoph Anton Mitterer
Control: retitle -1 huge CPU temperature increase from 5.2 to 5.5 ... and when using intel_pstate


I've upgraded to 5.5.17 (again the stock Debian sid package), and all
future tests with 5.5.x will be with this.

Problems unchanged.




I've also checked 5.5.17 with intel_pstate being enabled but at the
same time using:

iommu=off mitigations=off pci=nomsi
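
A minimal sketch for confirming that such parameters are really active on the running kernel, assuming they show up verbatim in /proc/cmdline. The helper names (parse_cmdline, missing_params) are made up for illustration, and quoted values containing spaces are not handled:

```python
from __future__ import annotations

from pathlib import Path


def parse_cmdline(cmdline: str) -> dict[str, str | None]:
    """Split a kernel command line into {parameter: value}, value None for bare flags."""
    params: dict[str, str | None] = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def missing_params(cmdline: str, wanted: dict[str, str]) -> list[str]:
    """Return the wanted 'key=value' pairs that are not set on the given command line."""
    active = parse_cmdline(cmdline)
    return [f"{key}={value}" for key, value in wanted.items() if active.get(key) != value]


if __name__ == "__main__":
    # The three parameters used in the test described above.
    wanted = {"iommu": "off", "mitigations": "off", "pci": "nomsi"}
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        print("missing:", missing_params(proc_cmdline.read_text(), wanted) or "none")
```

An empty result means all three parameters were passed to the kernel at boot; it says nothing about whether they had any effect.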


I didn't repeat all tests as extensively as those in the git repo,
but I've played back a video with mpv and did some casual work (Alt-Tab
switching between windows, scrolling up/down in some windows, etc.).

None of these seem to help in terms of my CPU temperature going through
the roof.



Bug#945055: great CPU temperature increase from 5.2 to 5.5 ... and when using intel_pstate

2020-04-17 Thread Christoph Anton Mitterer
Control: tags -1 - patch
Control: notfound -1 3.16.81-1
Control: notfound -1 4.19.87-1
Control: notfound -1 5.4-1~exp1
Control: forwarded -1 https://bugzilla.kernel.org/show_bug.cgi?id=207245


Hey Ben.


On Fri, 2020-04-17 at 18:02 +0100, Ben Hutchings wrote:
> This is now neither "fixed" nor "found" in any 5.5 version.  Please
> update the versions properly.

Took a while till I got the mail that the bug was unarchived so I
didn't update everything immediately.


> This is also tagged "patch" but without a direct link to the
> patch(es)
> that are supposed to fix it.  (Linking to the upstream bug report is
> not specific enough.)

Sorry for the confusion I might have caused. The patch tag and also
found-in-version was based on my guess that the problems I see since
versions > 5.2 were caused by 
https://gitlab.freedesktop.org/drm/intel/issues/614

That bug was a regression introduced by a security fix that prevented
the GPU from entering RC6 sleep states.

perf showed me that I was affected by it, so I assumed the fix (which
was introduced in 5.5rc-something) would solve everything.

It didn't, as my further test series, which I've just sent to this
Debian bug as well, showed.


Even with 5.5 I see a tremendous temperature increase.



Unfortunately I'm by far not enough of an expert to really tell where
the problem comes from (I'd say there may even be different problems
involved)... and I'd also need guidance on what to actually test, to
better nail it down.


When I saw the problem still occurs with 5.5, I've made another test
series and reported it first at lkml:
https://lore.kernel.org/lkml/ce8097694ddfab616616f8f81521495d99c74416.ca...@scientia.net/T/#u

When I got no response I've updated my older ticket at intel-drm:
https://gitlab.freedesktop.org/drm/intel/-/issues/953


My tests would indicate that there are a number of temperature
problems, in short:

- GPU intensive stuff (like playing videos)
- GPU stuff which shouldn't be intensive at all (e.g. moving around
windows)

but also:
- supposedly non-GPU intensive stuff (like Alt-Tab-ing between windows,
scrolling up/down in lists in the GUI)
- stuff which doesn't even do graphics at all (see the unhide-brute and
(SHA)-verify tests I've made).



For the GPU-intensive stuff (specifically that I hit 100°C when I play
any videos) there is:
https://gitlab.freedesktop.org/drm/intel/issues/956
(intel-drm folks had asked me to put it in a separate issue)


For the general stuff (e.g. unhide brute or SHA512 verification running
much hotter), there is:
- the post to lkml
- https://bugzilla.kernel.org/show_bug.cgi?id=207245
- and for the case of intel_pstate being enabled there's also:
  https://bugzilla.kernel.org/show_bug.cgi?id=207247


The different tickets also contain descriptions of symptoms I've seen,
e.g. where temperatures go through the roof even when just moving
windows, Alt-Tab-switching between them, scrolling up/down in a window,
and so on.


See especially the plots in the git repo I've provided, which show how
much higher the temperature is from 5.2 to 5.5 (and for each of them
with intel_pstate being on or off).



Any help on what to test would be highly appreciated.


I did some preliminary tests with perf record, while then e.g.
scrolling up/down in a GUI window (used the mail list in Evolution)
while the temperatures go up to ~80°C ...
This indicated that, while doing so, the number of events recorded by
perf record grows by an order of magnitude.

I haven't had time yet to make more systematic tests.


Thanks,
Chris.



Bug#945055: great CPU temperature increase from 5.2 to 5.5 ... and when using intel_pstate

2020-04-17 Thread Christoph Anton Mitterer
Hey.


For several months now, I've been chasing a tremendous heat increase
(CPU/GPU) respectively power usage on my notebook.

It basically started after upgrading from 5.2 to 5.3, at least I
haven't explicitly noted any grave changes from before 5.2 to 5.2.
The issue (actually there might be several) persists until at least 5.4
and 5.5.

Things are so bad that I can hear the fan go up considerably (and
temps up to 90°C) just by typing this mail in the mail client (while it
goes back to - still insane - 65°C idle when not typing... ok, idle
here(!) is with firefox running).
Similar things when I scroll through a terminal window, Alt-Tab cycle
between windows, and so on.


Testing is a bit difficult for me, as I couldn't come up with an easy
way to reproducibly generate real world load (like this typing, or
scrolling terminal windows), yet I tried to do an extensive test
series, which I think will illustrate some things.


Not really sure what the normal average or idle temps of that CPU are,
but I guess averaging >80°C by just typing shouldn't be the case.




1) Previous tests
*****************
When first searching for the reason for the temperature increase, I
had opened several tickets:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=945055
https://lore.kernel.org/lkml/d05aba2742ae42783788c954e2a380e7fcb10830.ca...@scientia.net/

Only to finally find (by coincidence):
https://gitlab.freedesktop.org/drm/intel/issues/614
when reporting:
https://gitlab.freedesktop.org/drm/intel/-/issues/953
myself.

At first I thought #614 would be the bug, but the fix for that went
into 5.5-rc, and in fact, with 5.5.x I do see the GPU entering RC6
sleep states again, yet the temperature of my system is still crazy.




2) Testing Environment
**********************
(for these new tests here)
- Fujitsu Lifebook U757
- most recent BIOS version (1.25) in the tests below (I had used an
  older one in the previous tests from the links)
- 32GB memory, some Sandisk SSD
- Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
- microcode: sig=0x806e9, pf=0x80, revision=0xca
- Debian sid, all packages (except some totally unrelated stuff) at
  their newest versions in unstable
- all used kernels are stock kernels from Debian
- I do use full dm-crypt encryption of the system, but that shouldn't
  be a cause for the problems, I guess.
- in my /etc/sysfs.conf I have:
  devices/system/cpu/intel_pstate/no_turbo = 1
  basically since I have that laptop... with turbo enabled I always got
  these:
  Apr  5 18:27:07 heisenberg kernel: [ 9884.510420] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 2609)
  Apr  5 18:27:07 heisenberg kernel: [ 9884.510422] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 2609)
  Apr  5 18:27:07 heisenberg kernel: [ 9884.510465] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 2609)
  Apr  5 18:27:07 heisenberg kernel: [ 9884.510467] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 2609)
  Apr  5 18:27:07 heisenberg kernel: [ 9884.511427] mce: CPU3: Package temperature/speed normal
  Apr  5 18:27:07 heisenberg kernel: [ 9884.511430] mce: CPU0: Package temperature/speed normal
  Apr  5 18:27:07 heisenberg kernel: [ 9884.511431] mce: CPU1: Package temperature/speed normal
  Apr  5 18:27:07 heisenberg kernel: [ 9884.511436] mce: CPU2: Package temperature/speed normal
  => so for the tests with intel_pstate not being disabled, turbo mode
     was always disabled
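
A rough sketch for counting such throttle messages in a saved kernel log, assuming one message per line in the format quoted above. The function name throttle_summary is hypothetical and this is not something used in the tests reported here:

```python
import re
from collections import Counter

# Matches the "above threshold" messages as they appear in the kernel log above.
THROTTLE_RE = re.compile(
    r"mce: CPU(\d+): Package temperature above threshold, "
    r"cpu clock throttled \(total events = (\d+)\)"
)


def throttle_summary(log_lines):
    """Return ({cpu: number of throttle messages}, highest 'total events' value seen)."""
    per_cpu = Counter()
    max_total = 0
    for line in log_lines:
        match = THROTTLE_RE.search(line)
        if match:
            per_cpu[int(match.group(1))] += 1
            max_total = max(max_total, int(match.group(2)))
    return dict(per_cpu), max_total


if __name__ == "__main__":
    sample = [
        "Apr  5 18:27:07 heisenberg kernel: [ 9884.510420] mce: CPU3: Package "
        "temperature above threshold, cpu clock throttled (total events = 2609)",
        "Apr  5 18:27:07 heisenberg kernel: [ 9884.511427] mce: CPU3: Package "
        "temperature/speed normal",
    ]
    print(throttle_summary(sample))
```

Fed with the log of a whole test run, the per-CPU counts would give a quick indication of how often the package hit its thermal threshold with turbo enabled versus disabled.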




3) How tests were made
**********************
I've tested with the following combinations:
- kernels 5.2.17 and 5.5.13
- with and without intel_pstate=disable
- with Cinnamon and GNOME Shell in classic mode

For all tests the notebook was placed in the same position and ran with
the same commands for tests, no other major processes (like firefox or
so) were running, just the respective bare desktop environment
(cinnamon or gnome shell classic), cron/anacron were stopped.

I always took temperature measurements with the output from sensors and
CSV output from powertop (which contains all the sleep states and high
energy users).

Temperature and powertop measurements were started at basically the
same time, with powertop running for n iterations of 20s each.
But since powertop takes a while to start, the temperature measurements
are effectively shorter.


a) deep-idle
For these tests I've waited very long (like 5 minutes or more) for the
system to cool down.
Measurements with, e.g.:
export NAME="5.2.17/ipstate-disable/thermald-no/gnome-shell-classic/deep-idle" ; timeout 80 sh -c "while true; do sleep 1; sensors; done | grep °C > ${NAME}.temp"
and
export NAME="5.2.17/ipstate-disable/thermald-no/cinnamon/deep-idle" ; powertop -i 4 --csv=${NAME}.powertop.csv
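
A minimal sketch of how a .temp file written by the first command could be summarized per sensor. It assumes the usual sensors line layout ("Label:  +45.0°C  (high = ...)"); the function name summarize is hypothetical, and this is not the code behind the plots in the git repo:

```python
import re
from statistics import mean

# Label up to the first colon, then the first temperature on the line.
LINE_RE = re.compile(r"^\s*([^:]+):\s*([+-]?\d+(?:\.\d+)?)°C")


def summarize(lines):
    """Return {sensor label: (min, mean, max)} in °C over all samples."""
    readings = {}
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            label = match.group(1).strip()
            readings.setdefault(label, []).append(float(match.group(2)))
    return {
        label: (min(values), mean(values), max(values))
        for label, values in readings.items()
    }


if __name__ == "__main__":
    sample = [
        "Package id 0:  +45.0°C  (high = +100.0°C, crit = +100.0°C)",
        "Core 0:        +44.0°C  (high = +100.0°C, crit = +100.0°C)",
        "Package id 0:  +47.0°C  (high = +100.0°C, crit = +100.0°C)",
    ]
    for label, (low, avg, high) in summarize(sample).items():
        print(f"{label}: min={low} mean={avg} max={high}")
```

For a real file one would pass an open file handle instead of the sample list, e.g. summarize(open(path, encoding="utf-8")), so that the °C character is decoded correctly.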


b) idle
Basically the same as (a), just not waiting so long to cool down.
Effectively I've always produced some load (with the fan and CPU 

Bug#945055: great CPU temperature increase from 5.2 to 5.5 ... and when using intel_pstate

2020-04-17 Thread Christoph Anton Mitterer
I've made some further very extensive tests in the meantime, but these
were mostly for clearly GPU related stuff, i.e. the problem that the
temperatures go through the roof when playing back any video.
These were reported here:
https://gitlab.freedesktop.org/drm/intel/-/issues/953#note_463451

But I haven't made any plots/conclusions for that new set of tests, yet
(will keep this ticket updated once I'm done).



As for the general extreme temperature increase that I see since >5.2
(I mean even when doing non-graphics-intensive stuff like the
unhide-brute or sha512 sum verify tests that I've described above),
what I would try next is whether mitigations=off changes anything (it
didn't for video playback).


Also I found out about the nice features of perf record respectively
perf report.
I've played a bit with that already and the first "results" showed that
when I do anything (like just typing at the keyboard, quickly moving
up/down in e.g. Evolution's mail list, or just Alt-Tab-ing between
windows), the number of events recorded there increases by orders of
magnitude(!!).


I'd be thankful for any guidance on what to actually test to better
nail down the problem I see.

Thanks!