On Tue, Oct 15, 2019 at 06:31:46AM -0700, Srinivas Pandruvada wrote:
> On Tue, 2019-10-15 at 10:48 +0200, Peter Zijlstra wrote:
> > On Mon, Oct 14, 2019 at 02:21:00PM -0700, Srinivas Pandruvada wrote:
> > > Some modern systems have very tight thermal tolerances. Because of this
> > > they may cross thermal thresholds when running normal workloads (even
> > > during boot). The CPU hardware will react by limiting power/frequency
> > > and using duty cycles to bring the temperature back into normal range.
> > >
> > > Thus users may see a "critical" message about the "temperature above
> > > threshold" which is soon followed by "temperature/speed normal". These
> > > messages are rate limited, but still may repeat every few minutes.
> > >
> > > The solution here is to set a timeout when the temperature first exceeds
> > > the threshold.
> >
> > Why can we even reach critical thresholds when the fans are working? I
> > always thought it was BAD to ever reach the critical temps and have the
> > hardware throttle.
>
> CPU temperature doesn't have to hit max(TjMax) to get these warnings.
> OEMs has an ability to program a threshold where a thermal interrupt
> can be generated. In some systems the offset is 20C+ (Read only value).
>
> In recent systems, there is another offset on top of it which can be
> programmed by OS, once some agent can adjust power limits dynamically.
> By default this is set to low by the firmware, which I guess the prime
> motivation of Benjamin to submit the patch.

That all sounds like the printk should be downgraded too; it is not a
KERN_CRIT warning. It is more a notification that we're getting warm.
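A minimal sketch of the behaviour discussed in this thread, written as plain user-space C rather than kernel code: the first threshold crossing only records a timestamp, and a message is produced only if the temperature is still above the threshold once a delay has elapsed. The message is worded as a notice rather than a critical warning, in line with the suggestion above. The names therm_state, therm_event() and THERM_LOG_DELAY_MS, and the 5 s delay, are hypothetical and not taken from the actual patch. The caller is assumed to feed timestamped samples; a real driver would more likely arm a timer or delayed work when the threshold is first crossed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical debounce delay in milliseconds (not the patch's value). */
#define THERM_LOG_DELAY_MS 5000UL

/* Hypothetical per-CPU state for this sketch. */
struct therm_state {
    bool above;              /* currently above the threshold? */
    bool reported;           /* has this excursion been logged? */
    unsigned long since_ms;  /* when the current excursion began */
};

/*
 * Feed one timestamped sample. Returns a message to log (in the kernel
 * this would presumably be pr_notice() rather than KERN_CRIT), or NULL
 * when nothing should be printed.
 */
static const char *therm_event(struct therm_state *s, bool hot,
                               unsigned long now_ms)
{
    if (hot) {
        if (!s->above) {
            /* First crossing: start the timeout but stay quiet. */
            s->above = true;
            s->reported = false;
            s->since_ms = now_ms;
            return NULL;
        }
        /* Still hot: report once, only after the delay has elapsed. */
        if (!s->reported && now_ms - s->since_ms >= THERM_LOG_DELAY_MS) {
            s->reported = true;
            return "notice: CPU temperature above threshold";
        }
        return NULL;
    }

    if (!s->above)
        return NULL;            /* already normal, nothing to say */

    s->above = false;
    /* Announce recovery only if the excursion itself was announced. */
    if (s->reported) {
        s->reported = false;
        return "notice: CPU temperature back to normal";
    }
    return NULL;                /* short spike: never logged at all */
}
```

For example, a 1 s spike (hot samples at 0 ms and 1000 ms, normal at 2000 ms) produces no output at all, while an excursion that is still hot 5000 ms after it began yields one "above threshold" notice and, on recovery, one "back to normal" notice.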