[Kernel-packages] [Bug 2026658] Re: CPU frequency governor broken after upgrading from 22.10 to 23.04, stuck at 400Mhz on Alder Lake

Eli Sat, 02 Sep 2023 18:40:39 -0700

I have an interesting update:
I went and compiled/installed this tool: 
https://github.com/phush0/razer-laptop-control-no-dkms


@jadgardner: you will definitely want this, boost mode is at least +15%
performance.

It's a CLI tool for poking the Razer hardware bits to set the different
power modes across the combination of CPU and GPU. None of these affect
the power details variables or intel_pstate values. For simplicity I'm
using commands below that leave the GPU untouched. (The tool also lets
you control the LEDs and the fan; those are irrelevant here, and fan
control doesn't work for me.)

All of the below testing was done with that compiled build of thermald
running in --adaptive mode. I have attached the debug log (it's 3 megs,
so I have compressed it; the exact command was
`sudo ./thermald --no-daemon --adaptive --loglevel=debug`). I doubt a
single long and fiddly run makes for a great data source, so please let
me know if there is a specific combination of settings you would like me
to test.

(Reminder: bug 1 is the 400MHz drop and lock; bug 2 is
intel_pstate/no_turbo getting set to 1. Bug 2 is way harder to trigger.)
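
For anyone repeating these runs, the two intel_pstate values referred to in this message can be read from sysfs. The following is only a minimal sketch, assuming the standard intel_pstate sysfs directory; the function name and the `base` parameter are hypothetical, chosen for illustration, and are not part of any tool mentioned here.

```python
from pathlib import Path

# Standard location of the intel_pstate global attributes (assumed present;
# the directory does not exist when booted with intel_pstate=disable).
INTEL_PSTATE_DIR = Path("/sys/devices/system/cpu/intel_pstate")


def read_pstate_flags(base=INTEL_PSTATE_DIR):
    """Return no_turbo and max_perf_pct as ints, or None where unreadable."""
    flags = {}
    for name in ("no_turbo", "max_perf_pct"):
        try:
            flags[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            # Missing file, permission problem, or unexpected content.
            flags[name] = None
    return flags
```

Calling `read_pstate_flags()` before and after a `stress` run gives a quick before/after snapshot of whether no_turbo flipped to 1 or max_perf_pct dropped.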

1) razer-cli write power ac 4 3 0
Highest "boost" performance mode.
CPU has much higher all core and multicore speeds, cpu package temp spikes to 
100 nearly instantly under even moderate load.
I cannot trigger bug 1 (or bug 2) in this mode, `stress -c 1` pegs a core at 
4.8GHz and it stays there. `stress -c 2` stays at values around ~4775MHz with 
periodic drops down to ~4450MHz but then jumps right back up after about 1 
second. long_term_power is either rock solid at 65 or very briefly drops down 
and then goes back up.

2) razer-cli write power ac 4 2 0
"High" performance mode.
It looks like this one sets a cpu target temperature around 90 for 
all/multicore, frequencies are higher than normal, temperature does force 
frequencies down at least until the fan ramps up a bit.
`stress -c 1` pegged a cpu core for less than a minute at 4.8GHz, and then 
intel_pstate/max_perf_pct got set to 90, cpu frequency dropped to 400MHz, 
and long_term_power dropped to 0.125 (i.e. bug 1).
Swapping back to level 3 (Boost) mode did not resolve the bug.
Setting intel_pstate/max_perf_pct back to 100 does not resolve the bug.
Setting long_term_power back to 65 resolves the bug.

3) razer-cli write power ac 4 1 0
"Medium" power mode.
The behavior looks like a normal ~aggressive laptop performance behavior.
`stress -c 1` pegs a cpu core at 4.8GHz, temps spike, fan slowly spins up. CPU 
speed drops down to various levels (2.8GHz, 4.5GHz, 4.3GHz, 4.2GHz), 
temperatures drop from mid 90s to mid 80s or 70s for a bit, long_term_power 
drops to values in the 20-30s, but then resets back to 65 after a few seconds.
I was able to trigger long_term_power down to 0.125 once by toggling stressing 
1 vs 2 cpu cores, but it still reset up to 65 after a few seconds. Otherwise I 
was not able to trigger bug 1 or 2.
(All and multicore CPU speeds are pretty close to normal, looks like it targets 
~75 degrees)

4) razer-cli write power ac 4 0 0
"low" power mode.
All and multicore CPU speeds are pretty close to normal, looks like it targets 
~70 degrees, very close to "Medium".
This behavior looks like a lower or fairly passive power mode on a laptop.
Pegging a single core would get 4.8GHz for a while, adding a second would 
briefly get the normal behavior, but then the system would back down to 4.6, 
4.5, 4.2, 3.9GHz. Temperature would spike to the 92-100 range only in the 
4.7-4.8GHz range (this is normal) and then would back off pretty quickly. 
long_term_power would drop to a value like 22 pretty quickly and stay there. If 
I was very quick with toggling between one and two pegged cores I could hit 16, 
but then the value would recover back up to the default 65 and reset the 
process.
I was not able to trigger bug 1 at any point.

5) razer-cli write power ac 0 (this is a slightly different command than the 
others)
"Balanced" mode (according to the docs this is the normal operating mode, I'm 
skeptical?)
Seems to behave like a laptop that uses up its boost-range thermal headroom 
very quickly.
All and multicore seem to target 70 degrees, normal-to-low cpu frequencies.
I could only get 1 or 2 pegged cores to reach 4.8Ghz for a handful of seconds, 
then the max speed became about 4.6 but would hover in the 4.2-4.6 range for 
the rest of the test. If I let it sit idle for 30 seconds it would recover to 
allow for brief 4.8 spikes again.
long_term_power would drop to ~30, and would sit there until load dropped, at 
which point it would recover to the default 65.
Due to the inability to hit 4.8Ghz for long periods I was not able to trigger 
bug 1.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to thermald in Ubuntu.
https://bugs.launchpad.net/bugs/2026658

Title:
  CPU frequency governor broken after upgrading from 22.10 to 23.04,
  stuck at 400Mhz on Alder Lake

Status in linux package in Ubuntu:
  Incomplete
Status in thermald package in Ubuntu:
  In Progress

Bug description:
  I've tried to include as much detail as possible in this bug report, I
  originally assembled it just after the release of ubuntu 23.04.  There
  has been no change since then.

  
  I have had substantial performance problems since updating from Ubuntu 22.10 
to 23.04.
  The computer in question is the 17 inch Razer Blade laptop from 2022 with an 
Intel i7-12800H.
  Current kernel is 6.2.0-20-generic.  (Now I'm on 6.2.0-24-generic and nothing 
has changed.)
  This issue occurs regardless of whether the OpenRazer 
(https://openrazer.github.io/) drivers etc. are installed.

  
  Description of problem:
  I have discovered what may be two separate bugs involving low level power 
management details on the cpu, they involve the cpu entering different types of 
throttled states and never recovering. These issues appeared immediately after 
upgrading from ubuntu 22.10.  The computer is a large ~gaming laptop with 
plenty of thermal headroom, cpu temperatures cannot reach concerning values 
except when using stress testing tools.

  (I don't know how to properly untangle these two issues, so I'm posting
  them as one. I apologize for the review complexity this causes, but I
  think posting the information all in one spot is more constructive
  here.)

  
  High level testing notes:
  - This issue occurs with use of both the intel_pstate driver and the cpufreq 
driver. (I don't have the same level of detail for cpufreq, but the issue still 
occurs.)
  - I have additionally tested a handful of intel_pstate parameters (and 
others) via grub kernel command line arguments to no effect. All testing 
reported here was done with:
    GRUB_CMDLINE_LINUX_DEFAULT="modprobe.blacklist=nouveau"
    GRUB_CMDLINE_LINUX=""
    (loading nouveau caused problems for me on 22.10, I have not bothered 
reinvestigating it on 23.04)
  - There is a firmware update available from the manufacturer when I boot into 
Windows, I have not installed it (yet).
  - - Update: I installed it. No change.
  - Changing the cpu governor setting from "powersave" to "performance" using 
`cpupower frequency-set -g performance` has no effect. (Note: this action is 
separate from the intel_pstate's power-saver/balanced/performance setting 
visible with the `powerprofilesctl` utility.) It doesn't seem to be a governor 
bug.
  - - (There is a tertiary issue where I also see substantial (+50%) 
performance degradation using the "performance" profile in a test suite I run 
constantly for my job; that is clearly a problem but it is an unrelated bug 
that has existed for quite some time.)

  
  Summary and my own conclusions:
  These are my takeaways, the ~raw data is in the followup section.

  
  Bug 1)
  The reported cpu power limits are progressively constrained over time. Once 
this failure mode starts the performance never recovers.
    - As this situation progresses the observed cpu speeds (I'm using htop) 
list as 2800MHz at idle, but the instant any load at all is placed on a cpu 
core that core immediately drops to exactly 400MHz.
    - This situation occurs quite quickly in human terms, frequently within 20 
minutes of normal usage after a boot, but it will also occur when the computer 
is just sitting there unused for a handful of hours.
    - This also occurs when using the cpufreq governor (by including 
"intel_pstate=disable" on the grub command line args).
    - At boot the default value for short_term_time looks wrong to me. This is 
the duration of higher thermal targets in seconds, ~0.002 seconds seems 
extremely short. A normal value would be a handful of seconds.
    - This situation can be remedied by running the following python script. It 
uses the undervolt package (pip install undervolt==0.3.0) to force particular 
power limits (the provided values are intentional overkill):
       from undervolt import read_power_limit, set_power_limit, PowerLimit, ADDRESSES
       from pprint import pprint

       limits = read_power_limit(ADDRESSES)
       pprint(vars(limits))  # print current values before setting them

       POWER_LIMITS = PowerLimit()
       POWER_LIMITS.locked = True  # lock means don't allow the value to be reset until a reboot.
       POWER_LIMITS.backup_rest = 281474976776192  # afaik this is just a backup-on-failure setting, it has no effect here.
       POWER_LIMITS.long_term_enabled = True
       POWER_LIMITS.long_term_power = 160  # values are intentional overkill
       POWER_LIMITS.long_term_time = 2880.0
       POWER_LIMITS.short_term_enabled = True
       POWER_LIMITS.short_term_power = 250
       POWER_LIMITS.short_term_time = 500.0
       set_power_limit(POWER_LIMITS, ADDRESSES)

       limits2 = read_power_limit(ADDRESSES)  # and print the new state
       pprint(vars(limits2))

  
  Bug 2)
  `powerprofilesctl` has unearthed some bug where the cpu performance enters 
the degraded state "high-operating-temperature", and never recovers.
    - This appears to happen for no reason. There is a brief cpu temperature 
spike in the example data below, but it does not hit the listed hardware limit 
values so I am at a loss for its cause.
    - I ran a cpu stress test (prime95/mprime torture test), it immediately 
spikes cpu temperature to 100 degrees and throttles the cpu, but doesn't 
trigger the high temperature degraded state. Go figure.
    - This bug takes quite a while to kick in, uptime in my example below was 
at over 14 hours.
    - When this situation occurs the maximum cpu speed becomes 2400Mhz across 
all cpu cores. The cpu power management appears to behave correctly in the 
400-2400Mhz range. I believe this means all turbo frequencies are disabled.
    - Running the command `sudo cpupower frequency-set -u 4800000` (or any value 
above 2400000) does not correct the reported cpu_policy_range, it remains 
locked at 2400MHz.
    - The only fix I know is a reboot.
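
  The clamped state described in this list can also be checked without `cpupower`. This is a minimal sketch, assuming the policy range that `cpupower` reports corresponds to the cpufreq sysfs file scaling_max_freq and the hardware range to cpuinfo_max_freq (both in kHz); the function name and the `root` parameter are hypothetical, for illustration only.

```python
from pathlib import Path


def policy_clamp(cpu=0, root="/sys/devices/system/cpu"):
    """Compare the policy ceiling with the hardware ceiling for one CPU.

    Returns (scaling_max_khz, cpuinfo_max_khz, clamped), where clamped is
    True when the policy maximum is below what the hardware advertises.
    """
    cpufreq = Path(root) / f"cpu{cpu}" / "cpufreq"
    scaling_max = int((cpufreq / "scaling_max_freq").read_text())
    hw_max = int((cpufreq / "cpuinfo_max_freq").read_text())
    return scaling_max, hw_max, scaling_max < hw_max
```

  Under those assumptions, a healthy system would report the two values as equal, while the bug 2 state would show a policy maximum of 2400000 against a higher hardware maximum.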

  
  THE DATA:

  Bug 1:
  This output was gathered using a python package called undervolt's 
read_power_limit function from a script that starts running at ~boot.
  long_term_power and short_term_power metrics are values in watts, 
long_term_time and short_term_time are values in seconds.
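
  To help interpret the numbers that follow, here is a sketch of how such values decode, assuming they come from the RAPL package power limit register (MSR_PKG_POWER_LIMIT) as laid out in Intel's documentation. The unit sizes (1/8 W and 1/1024 s) are assumptions; the real ones are reported by MSR_RAPL_POWER_UNIT and may differ on other CPUs. The function name is hypothetical and this is not the undervolt package's code.

```python
def decode_pkg_power_limit(raw, power_unit_bits=3, time_unit_bits=10):
    """Decode a raw 64-bit MSR_PKG_POWER_LIMIT value into watts and seconds.

    Assumed units: power in 1/2**3 W, time in 1/2**10 s.
    """
    power_unit = 1.0 / (1 << power_unit_bits)
    time_unit = 1.0 / (1 << time_unit_bits)

    def window(field):
        # 7-bit field: low 5 bits are the exponent Y, high 2 bits are Z;
        # window = 2**Y * (1 + Z/4) * time_unit
        y = field & 0x1F
        z = (field >> 5) & 0x3
        return (1 << y) * (1 + z / 4) * time_unit

    return {
        "long_term_power": (raw & 0x7FFF) * power_unit,
        "long_term_enabled": bool((raw >> 15) & 1),
        "long_term_time": window((raw >> 17) & 0x7F),
        "short_term_power": ((raw >> 32) & 0x7FFF) * power_unit,
        "short_term_enabled": bool((raw >> 47) & 1),
        "short_term_time": window((raw >> 49) & 0x7F),
        "locked": bool((raw >> 63) & 1),
    }
```

  Under these assumed units, a short_term_time of 0.00244140625 is exactly 2.5/1024 seconds (Y=1, Z=1), and a long_term_power of 0.125 is a single power unit, i.e. the smallest nonzero limit the register can express.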

  2023-05-12  15:14:32 up 0 min,  0 user,  load average: 0.39, 0.10, 0.03 
  (boot, log starts after normal user login)
   long_term_power: 65.0
   long_term_time: 32.0
   short_term_power: 160.0
   short_term_time: 0.00244140625

  2023-05-12  15:20:29 up 6 min,  2 users,  load average: 1.90, 0.86, 0.37 
   long_term_power: 20.875  <-- down
   long_term_time: 28.0  <-- down
   short_term_power: 160.0
   short_term_time: 0.00244140625

  2023-05-12  15:20:46 up 6 min,  2 users,  load average: 1.63, 0.87, 0.38 
   long_term_power: 22.625  <-- hey it went up! I was still using the computer 
at this point
   long_term_time: 28.0
   short_term_power: 160.0
   short_term_time: 0.00244140625

  2023-05-12  15:46:15 up 32 min,  2 users,  load average: 0.66, 0.84, 0.79 
  (no longer at computer by the time this occurs)
   long_term_power: 20.625  <-- down
   long_term_time: 28.0
   short_term_power: 160.0
   short_term_time: 0.00244140625

  2023-05-12  16:04:46 up 50 min,  3 users,  load average: 0.46, 0.70, 0.79 
   long_term_power: 16.625  <-- down
   long_term_time: 28.0
   short_term_power: 160.0
   short_term_time: 0.00244140625

  2023-05-12  17:23:07 up  2:08,  3 users,  load average: 0.49, 0.61, 0.68 
  (by the time long_term_power hits 8.625 all cpu cores throttle to 400Mhz 
under any load. This one was preceded by ~1 second of a single cpu core 
randomly spiking to 78 degrees, output from `powerprofilesctl` remains normal. 
At this point long_term_power will never go up again. I have seen one more 
lowered stage at ~4.3125w.)
   long_term_power: 8.625  <-- way down - I've seen lower, though.
   long_term_time: 28.0
   short_term_power: 160.0
   short_term_time: 0.00244140625

  (And then after several hours stuck in this mode I returned to the
  computer and needed to run the script in the bug 1 summary to make it
  usable again.)

  
  Bug 2:
  (Some cleanup of output, script starts at ~boot)
  2023-05-11  22:21:15 up 14:15,  2 users,  load average: 0.38, 0.42, 0.52 

  Output from powerprofilesctl:
    |  performance:
    |    Driver:     intel_pstate
    |    Degraded:   no
    |* balanced:
    |    Driver:     intel_pstate
    |  power-saver:
    |    Driver:     intel_pstate

  some summarized details from the `cpupower` utility:
    | cpu_number: 2
    | cpu_range: 400 MHz - 4.70 GHz
    | cpu_policy_range: 400 MHz and 4.70 GHz.
    | governor: powersave

  output from `sensors` (slightly compactified, I don't know what's up with the 
cpu core numbers):
    | iwlwifi_1-virtual-0 - Adapter: Virtual device - temp1: +49.0°C  
    | nvme-pci-0300 - Adapter: PCI adapter - Composite:
    |   +40.9°C  (low = -5.2°C, high = +89.8°C) (crit = +93.8°C)
    | nvme-pci-0200 - Adapter: PCI adapter:
    |   Composite:   +36.9°C  (low = -273.1°C, high = +80.8°C) (crit = +84.8°C)
    |   Sensor 1:    +36.9°C  (low = -273.1°C, high = +65261.8°C)
    |   Sensor 2:    +38.9°C  (low = -273.1°C, high = +65261.8°C)
    | coretemp-isa-0000 - Adapter: ISA adapter
    | Package id 0:  +77.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 0:        +52.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 4:        +54.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 8:        +77.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 12:       +52.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 16:       +64.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 20:       +45.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 24:       +52.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 25:       +52.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 26:       +52.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 27:       +52.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 28:       +50.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 29:       +50.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 30:       +50.0°C  (high = +100.0°C, crit = +100.0°C)
    | Core 31:       +50.0°C  (high = +100.0°C, crit = +100.0°C)
    | acpitz-acpi-0 - Adapter: ACPI interface: temp1: +27.8°C (crit = +105.0°C)

  
  2023-05-11  22:21:17 up 14:15,  2 users,  load average: 0.38, 0.42, 0.52    
(2 seconds later)

  output from `powerprofilesctl`:
    |  performance:
    |    Driver:     intel_pstate
    |    Degraded:   yes (high-operating-temperature)
    |* balanced:
    |    Driver:     intel_pstate
    |  power-saver:
    |    Driver:     intel_pstate

  some summarized details from the `cpupower` utility:
    | cpu_number: 8
    | cpu_range: 400 MHz - 4.70 GHz
    | cpu_policy_range: 400 MHz and 2.40 GHz.
    | governor: powersave

  output from `sensors` (slightly compactified, I don't know what's up with the 
cpu core numbers):
    | iwlwifi_1-virtual-0 Adapter: Virtual device temp1: +49.0°C  
    | nvme-pci-0300 - Adapter: PCI adapter
    |   Composite:    +40.9°C  (low =  -5.2°C, high = +89.8°C) (crit = +93.8°C)
    | nvme-pci-0200 - Adapter: PCI adapter
    |   Composite:    +36.9°C  (low = -273.1°C, high = +80.8°C) (crit = +84.8°C)
    |   Sensor 1:     +36.9°C  (low = -273.1°C, high = +65261.8°C)
    |   Sensor 2:     +38.9°C  (low = -273.1°C, high = +65261.8°C)
    | coretemp-isa-0000 - Adapter: ISA adapter
    |   Package id 0:  +60.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 0:        +53.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 4:        +59.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 8:        +54.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 12:       +58.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 16:       +58.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 20:       +60.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 24:       +58.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 25:       +58.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 26:       +58.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 27:       +58.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 28:       +55.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 29:       +55.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 30:       +55.0°C  (high = +100.0°C, crit = +100.0°C)
    |   Core 31:       +55.0°C  (high = +100.0°C, crit = +100.0°C)
    | acpitz-acpi-0 - Adapter: ACPI interface - temp1: +27.8°C (crit = +105.0°C)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2026658/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
