Bug#945055: great CPU temperature increase from 5.2 to 5.5 ... and when using intel_pstate
Control: retitle -1 huge CPU temperature increase from 5.2 to 5.5 ... and when using intel_pstate

I've upgraded to 5.5.17 (again the stock Debian sid package), and all future tests with 5.5.x will use this version. The problems are unchanged.

I've also checked 5.5.17 with intel_pstate enabled while at the same time booting with:
  iommu=off mitigations=off pci=nomsi

I didn't repeat all tests as extensively as those in the git repo, but I played back a video with mpv and did some casual work (Alt-Tab switching between windows, scrolling up/down in some windows, etc.).

None of these parameters seems to help: my CPU temperature still goes through the roof.
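When testing boot parameters like the ones above, it is worth confirming that they actually reached the running kernel. The following is only a minimal sketch: the helper names has_param and report_params are made up for illustration, and the parameter list is simply the three options named in this mail. It reads /proc/cmdline, which holds the command line the kernel was booted with.

```shell
#!/bin/sh
# has_param CMDLINE PARAM (hypothetical helper):
# succeed if PARAM appears as a whole space-separated word in CMDLINE.
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# report_params CMDLINE (hypothetical helper):
# print one status line for each parameter tested in this mail.
report_params() {
    for p in iommu=off mitigations=off pci=nomsi; do
        if has_param "$1" "$p"; then
            echo "$p: active"
        else
            echo "$p: MISSING"
        fi
    done
}

# On a running Linux system, check the live kernel command line.
if [ -r /proc/cmdline ]; then
    report_params "$(cat /proc/cmdline)"
fi
```

Matching whole words avoids false positives from longer options that merely contain one of the strings.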
Bug#945055: great CPU temperature increase from 5.2 to 5.5 ... and when using intel_pstate
Control: tags -1 - patch
Control: notfound -1 3.16.81-1
Control: notfound -1 4.19.87-1
Control: notfound -1 5.4-1~exp1
Control: forwarded -1 https://bugzilla.kernel.org/show_bug.cgi?id=207245

Hey Ben.

On Fri, 2020-04-17 at 18:02 +0100, Ben Hutchings wrote:
> This is now neither "fixed" nor "found" in any 5.5 version. Please
> update the versions properly.

It took a while until I got the mail that the bug was unarchived, so I didn't update everything immediately.

> This is also tagged "patch" but without a direct link to the
> patch(es) that are supposed to fix it. (Linking to the upstream bug
> report is not specific enough.)

Sorry for the confusion I might have caused.

The patch tag and also the found-in version were based on my guess that the problems I see since versions > 5.2 were caused by
  https://gitlab.freedesktop.org/drm/intel/issues/614

That bug was a regression introduced by a security fix that prevented the GPU from entering RC6 sleep states. perf showed me that I was affected by it, so I assumed the fix (which was introduced in 5.5rc-something) would solve everything.

It didn't, as my further test series, which I've just sent to this Debian bug as well, showed. Even with 5.5 I see a tremendous temperature increase.

Unfortunately I'm far from expert enough to really tell where the problem comes from (I'd say there may even be several different problems involved)... and I'd also need guidance on what to actually test to better nail it down.

When I saw the problem still occurs with 5.5, I made another test series and reported it first at lkml:
  https://lore.kernel.org/lkml/ce8097694ddfab616616f8f81521495d99c74416.ca...@scientia.net/T/#u

When I got no response I updated my older ticket at intel-drm:
  https://gitlab.freedesktop.org/drm/intel/-/issues/953

My tests would indicate that there are a number of temperature problems, in short:
- GPU-intensive stuff (like playing videos)
- GPU stuff which shouldn't be intensive at all (e.g. moving windows around)
but also:
- supposedly non-GPU-intensive stuff (like Alt-Tab-ing between windows, or scrolling up/down in lists in the GUI)
- stuff which doesn't even do graphics at all (see the unhide-brute and (SHA)-verify tests I've made)

For the GPU-intensive stuff (specifically that I hit 100°C when I play any videos) there is:
  https://gitlab.freedesktop.org/drm/intel/issues/956
(the intel-drm folks had asked me to put it in a separate issue)

For the general stuff (e.g. unhide-brute or SHA512 verification running much hotter), there is:
- the post to lkml
- https://bugzilla.kernel.org/show_bug.cgi?id=207245
- and, for the increase with intel_pstate enabled, there's also:
  https://bugzilla.kernel.org/show_bug.cgi?id=207247

The different tickets also contain descriptions of the symptoms I've seen, e.g. where temperatures go through the roof even when just moving windows, Alt-Tab switching between them, scrolling up/down in a window, and so on.

See especially the plots in the git repo I've provided, which show how much higher the temperature is in 5.5 than in 5.2 (and, for each of them, with intel_pstate on or off).

Any help on what to test would be highly appreciated.

I did some preliminary tests with perf record while e.g. scrolling up/down in a GUI window (I used the mail list in Evolution) and the temperatures went up to ~80°C. These would indicate that, during such activity, the number of events recorded by perf record grows by an order of magnitude. I haven't had time yet to make more systematic tests.

Thanks,
Chris.
Bug#945055: great CPU temperature increase from 5.2 to 5.5 ... and when using intel_pstate
Hey.

For several months now, I've been chasing a tremendous increase in heat (CPU/GPU) and power usage on my notebook.

It basically started after upgrading from 5.2 to 5.3; at least I hadn't explicitly noticed any grave changes going from earlier versions to 5.2. The issue (actually there might be several) persists through at least 5.4 and 5.5.

Things are so bad that, just by typing this mail in the mail client, I can hear the fan spin up considerably (with temps up to 90°C), while it goes back to a still insane 65°C idle when not typing (ok, idle here(!) is with firefox running). Similar things happen when I scroll through a terminal window, Alt-Tab cycle between windows, and so on.

Testing is a bit difficult for me, as I couldn't come up with an easy way to reproducibly generate real-world load (like this typing, or scrolling terminal windows), yet I tried to do an extensive test series, which I think will illustrate some things.

I'm not really sure what the normal average or idle temps of that CPU are, but I guess reaching an average of >80°C by just typing shouldn't be the case.

1) Previous tests

When first searching for the reason for the temperature increase, I had opened several tickets:
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=945055
  https://lore.kernel.org/lkml/d05aba2742ae42783788c954e2a380e7fcb10830.ca...@scientia.net/
finally finding (by coincidence):
  https://gitlab.freedesktop.org/drm/intel/issues/614
when reporting:
  https://gitlab.freedesktop.org/drm/intel/-/issues/953
myself.

At first I thought #614 would be the bug, but the fix for that went into 5.5-rc, and in fact, with 5.5.x I do see the GPU entering RC6 sleep states again, yet the temperature of my system is still crazy.
2) Testing Environment
(for these new tests here)

- Fujitsu Lifebook U757
- most recent BIOS version (1.25) in the tests below (I had used an older one in the previous tests from the links)
- 32GB memory, some Sandisk SSD
- Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
- microcode: sig=0x806e9, pf=0x80, revision=0xca
- Debian sid, all packages (except some totally unrelated stuff) at their newest versions in unstable
- all used kernels are stock kernels from Debian
- I do use full dm-crypt encryption of the system, but that shouldn't be a cause for the problems, I guess.
- in my /etc/sysfs.conf I have:
    devices/system/cpu/intel_pstate/no_turbo = 1
  basically since I have had that laptop... with turbo enabled I always got these:
    Apr 5 18:27:07 heisenberg kernel: [ 9884.510420] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 2609)
    Apr 5 18:27:07 heisenberg kernel: [ 9884.510422] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 2609)
    Apr 5 18:27:07 heisenberg kernel: [ 9884.510465] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 2609)
    Apr 5 18:27:07 heisenberg kernel: [ 9884.510467] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 2609)
    Apr 5 18:27:07 heisenberg kernel: [ 9884.511427] mce: CPU3: Package temperature/speed normal
    Apr 5 18:27:07 heisenberg kernel: [ 9884.511430] mce: CPU0: Package temperature/speed normal
    Apr 5 18:27:07 heisenberg kernel: [ 9884.511431] mce: CPU1: Package temperature/speed normal
    Apr 5 18:27:07 heisenberg kernel: [ 9884.511436] mce: CPU2: Package temperature/speed normal
  => so for the tests with intel_pstate not being disabled, turbo mode was always disabled

3) How tests were made

I've tested with the following combinations:
- kernels 5.2.17 and 5.5.13
- with and without intel_pstate=disable
- with Cinnamon and GNOME Shell in classic mode

For all tests the notebook was placed in the same position and the same commands were run. No other major processes (like firefox or so) were running, just the respective bare desktop environment (cinnamon or gnome shell classic), and cron/anacron were stopped.

I always took temperature measurements from the output of sensors, plus CSV output from powertop (which contains all the sleep states and the high energy users).

Temperature and powertop measurements were started at basically the same time, with powertop running for n iterations of 20s each. But since powertop takes a while to start, the temperature measurements are effectively shorter.

a) deep-idle
For these tests I waited very long (like 5 minutes or more) for the system to cool down.
Measurements were taken with, e.g.:
  export NAME="5.2.17/ipstate-disable/thermald-no/gnome-shell-classic/deep-idle" ; timeout 80 sh -c "while true; do sleep 1; sensors; done | grep °C > ${NAME}.temp"
and
  export NAME="5.2.17/ipstate-disable/thermald-no/cinnamon/deep-idle" ; powertop -i 4 --csv=${NAME}.powertop.csv

b) idle
Basically the same as (a), just not waiting so long to cool down. Effectively I've always produced some load (with the fan and CPU
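
The .temp files written by the sensors loop in (a) can be condensed to a few numbers per run before plotting. The following is only a sketch and not the processing actually used for the plots in the git repo: the function name summarize_temps is made up, and the label "Package id 0" is an assumption about how sensors names the package sensor on this machine. It takes the first signed number after the label's colon as the reading and ignores the high/crit limits that follow on the same line.

```shell
#!/bin/sh
# summarize_temps LABEL (hypothetical helper): read a "sensors | grep °C"
# log on stdin and print sample count, mean and maximum for the lines
# that start with LABEL.
summarize_temps() {
    awk -v label="$1" '
        index($0, label) == 1 {
            # Everything after the first colon; the reading comes first,
            # the "(high = ..., crit = ...)" limits come later.
            rest = substr($0, index($0, ":") + 1)
            if (match(rest, /[+-][0-9]+(\.[0-9]+)?/)) {
                t = substr(rest, RSTART, RLENGTH) + 0
                sum += t
                n++
                if (n == 1 || t > max) max = t
            }
        }
        END {
            if (n) printf "n=%d mean=%.1f max=%.1f\n", n, sum / n, max
            else print "n=0"
        }'
}

# Possible usage, with NAME set as in the measurement commands:
#   summarize_temps "Package id 0" < "${NAME}.temp"
```

Comparing the mean and maximum between the 5.2.17 and 5.5.13 directories for the same test then gives a quick check before looking at full plots.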
Bug#945055: great CPU temperature increase from 5.2 to 5.5 ... and when using intel_pstate
I've made some further very extensive tests in the meantime, but these were mostly for clearly GPU-related stuff, i.e. the problem that the temperatures go through the roof when playing back any video. These were reported here:
  https://gitlab.freedesktop.org/drm/intel/-/issues/953#note_463451

But I haven't made any plots or drawn any conclusions for that new set of tests yet (I will keep this ticket updated once I've done so).

As for the general extreme temperature increase that I see since versions > 5.2 (I mean even when doing non-graphics-intensive stuff like the unhide-brute or sha512sum verify tests that I've described above), what I would try next is whether mitigations=off changes anything (it didn't for video playback).

Also, I found out about the nice features of perf record and perf report. I've played a bit with them already, and the first "results" showed that when I do anything (like just typing at the keyboard, quickly moving up/down in e.g. Evolution's mail list, or just Alt-Tab-ing between windows), the number of events recorded there increases by orders of magnitude(!!).

I'd be thankful for any guidance on what to actually test to better nail down the problem I see.

Thanks!
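
To put a number on "orders of magnitude", the approximate event counts from two perf reports (one taken while idle, one while e.g. scrolling) can be compared. This is only a sketch under assumptions: the function names and the file names idle.txt and busy.txt are made up, and it relies on the "# Event count (approx.): N" header line that perf report --stdio prints for each event.

```shell
#!/bin/sh
# event_count (hypothetical helper): read a "perf report --stdio" dump on
# stdin and print the sum of all "# Event count (approx.): N" header values.
event_count() {
    awk -F': *' '
        /^# Event count \(approx\.\)/ { total += $2 }
        END { printf "%.0f\n", total }'
}

# event_ratio IDLE BUSY (hypothetical helper): print BUSY / IDLE with one
# decimal place, or "n/a" if IDLE is zero.
event_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        if (a > 0) printf "%.1f\n", b / a
        else print "n/a"
    }'
}

# Possible usage (file names are placeholders):
#   perf report --stdio -i perf.data.idle > idle.txt
#   perf report --stdio -i perf.data.busy > busy.txt
#   event_ratio "$(event_count < idle.txt)" "$(event_count < busy.txt)"
```

The two recordings should cover the same duration, otherwise the ratio mostly reflects the difference in recording time.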