On 03/03/2012 12:13 AM, Phillip Susi wrote:
> On 02/29/2012 04:40 PM, John Moser wrote:
>> At full load (encoding a video), it eventually reaches 80C and the
>> system shuts down.
>
> It sounds like you have some broken hardware. The stock heatsink and fan are designed to keep the cpu from overheating under full load at the design frequency and voltage. You might want to verify that your motherboard is driving the cpu at the correct frequency and voltage.


Possibly.

The only other use case I can think of is when ambient temperature is hot. Remember server rooms use air conditioning; I did find that for a while my machine would quickly overheat if the room temperature was above 79F, and so kept the room at 75F. The heat sink was completely clogged with dust at the time, though, which is why I recently cleaned and inspected it and checked all the fan speed monitors and motherboard settings to make sure everything was running as appropriate.

In any case, if the A/C goes down in a server room, it would be nice to have CPU frequency scaling kick in and take the clock speed down before the chip overheats. Modern servers--for example, the new revision of the Dell PowerEdge II and III from 4 or 5 years ago--lean on their low-power capabilities, and modern data centers use a centralized DC converter and high voltage (220V) DC mains to reduce power waste because of the high cost of electricity. It's extremely likely that such servers could drop to a clock speed low enough to avoid overheating when the air conditioning fails, which is an emergency situation.

Of course, the side benefit of not overheating desktops with inadequate cooling or faulty motherboard behavior is simply a bonus. Still, I believe in fault tolerance.

>> I currently have cpufreqd configured to clock to 1.8GHz at 73C, and move
>> to the ondemand governor at 70C.
>
> This need for manual configuring is a good reason why it is not a candidate for standard install.


I've attached a configuration that generically uses sensors (i.e. if the program 'sensors' gives useful output, this works). It only covers one core, though (a multi-core system reads the same temperature for all cores, since the sensor is per-CPU); a configuration like this could easily be generated automatically.
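
As a rough illustration of that generation step, here is a minimal Python sketch that emits one [Rule] block per temperature band. The function name make_rules and the BANDS list are hypothetical; the key names (name, sensor, profile) and the temp1 label simply mirror the attached configuration, and the output has not been checked against cpufreqd's parser.

```python
# Hypothetical helper: build cpufreqd-style [Rule] blocks from temperature bands.
# Band values and the "temp1" label mirror the attached config; adjust for real hardware.

def make_rules(sensor, bands):
    """bands: list of (rule_name, low_c, high_c, profile_name) tuples."""
    blocks = []
    for rule_name, low, high, profile in bands:
        blocks.append(
            "[Rule]\n"
            f"name={rule_name}\n"
            f"sensor={sensor}:{low}-{high}\n"
            f"profile={profile}\n"
            "[/Rule]\n"
        )
    return "\n".join(blocks)


# Same bands as the attached configuration (degrees C).
BANDS = [
    ("Normal", 0, 70, "Standard"),
    ("CPU Hot", 73, 76, "Hot"),
    ("CPU Too Hot", 76, 100, "Overheating"),
]

if __name__ == "__main__":
    print(make_rules("temp1", BANDS))
```

A real generator would first have to discover which sensor labels exist on the machine (for instance by parsing the output of 'sensors'), which this sketch leaves out.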

Mind you, on the topic of automatic generation, 80C is a hard limit. It just is. My machine reports (through sensors) +95.0C as "Critical", but my BIOS shuts down the system at +80.0C immediately. Silicon physically does not tolerate temperatures above 80.0C well at all; if a chip claims it can run at 95.0C it's lying. Even SOD-CMOS doesn't tolerate those temperatures.

As well, again, you could write some generic profiles that detect when the system is running on battery (UPS, laptop) and make appropriate adjustments based on how much battery life is left.
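
One way to picture such profiles is a small selection function. The Python sketch below is only an assumption about how the mapping might look: the function name pick_max_freq and the 50% and 20% cut-offs with their caps are all hypothetical and do not come from cpufreqd or from any tested setup.

```python
# Hypothetical mapping from power state to a maximum-frequency cap (percent).
# The thresholds and caps are illustrative assumptions, not measured values.

def pick_max_freq(on_ac, battery_pct):
    """Return a maxfreq cap in percent for the given power state."""
    if on_ac:
        return 100  # mains power: no cap
    if battery_pct > 50:
        return 80   # plenty of charge: mild cap
    if battery_pct > 20:
        return 50   # getting low: stronger cap
    return 10       # nearly empty: stretch the remaining runtime


if __name__ == "__main__":
    for on_ac, pct in [(True, 100), (False, 75), (False, 30), (False, 5)]:
        print(f"ac={on_ac} battery={pct}% -> maxfreq={pick_max_freq(on_ac, pct)}%")
```

Each returned cap would correspond to a [Profile] with that maxfreq value, selected by a rule keyed on the power state.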

>> At 73C, the system switches from 1.9GHz to 1.8GHz. Ten seconds later,
>> it's at 70C and switches back to 1.9GHz. 41 seconds after that, it
>> reaches 73C again and switches to 1.8GHz.
>>
>> That means at stock frequency (1.9GHz) with stock cooling equipment, the
>> CPU overheats under full load. Clocked 0.1GHz slower than its rated
>> speed, it rapidly cools. Which is ridiculous; who designed this thing?
>
> This sounds like your motherboard is overvolting the cpu in that 1.9 GHz stepping.


Possibly, but the settings are all default, with nothing set to overclock (the board has jumper-free overclocking configuration, but the option "Standard" is the default for the clock rate and voltage settings, which I assume the CPU supplies).

Basically the argument here is between "Supply fault tolerance" and "Well your motherboard is [old|poorly designed] so buy a new one." That's an excellent argument for hard drives (I have, in fact, suggested in the past that Ubuntu monitor hard disks for behavior indicative of dying drives--SMART errors, IDE RESET commands because the drive hangs, etc--and begin annoying the user with messages about the SEVERE risk of extreme data loss if he doesn't back up his data), but really if my mobo/CPU is aging and the CPU runs a little hot I'm not going to cry when the CPU suddenly burns out and my machine shuts down. I'll be confused, annoyed, but I'll buy a new one--I might buy an entire new computer, unaware that just my CPU is broken, and shove the hard drive in there. So there's no harm in allowing the user's hardware to go ahead and burn itself out if you think that's what's going on here.

None of that means you can't have a diagnostic center somewhere that the user can review to see the whole collection. "Ethernet: Lots of garbage [Possibly: Faulty switch, faulty NIC, another computer with a chattering NIC spewing packets]." "CPU: Overheats under high CPU load [Possibly: Dust-clogged CPU heat sink, failing CPU fan, overclocking, failing CPU, failing motherboard voltage regulators, buggy motherboard BIOS]." "/!\ Hard drive: Freezes and needs IDE Resets [Possibly: Dying hard drive/!\, dying IDE controller, dying RAID controller] /!\WARNING: SEVERE DATA LOSS POSSIBLE". Etc. Looks like you really need a new computer...

Yes, I have strange ideas about what a computer should and shouldn't do. But then, you know, people run huge racks of computers that fail catastrophically if you don't pipe an air conditioning line straight to the chassis fan intake (take a look under the cabinet: the floor tile directly under each server rack is perforated--the raised floor has A/C pumped under it and it vents directly and exclusively into the server cabinets).

# this is a comment
# see CPUFREQD.CONF(5) manpage for a complete reference
#
# Note: ondemand/conservative Profiles are disabled because
#       they are not available on many platforms.

[General]
pidfile=/var/run/cpufreqd.pid
poll_interval=0.2
verbosity=4
#enable_remote=1
#remote_group=root
[/General]

[Profile]
name=Standard
minfreq=0%
maxfreq=100%
policy=ondemand
[/Profile]

[Profile]
name=Hot
minfreq=50%
maxfreq=95%
policy=ondemand
[/Profile]

[Profile]
name=Overheating
minfreq=0%
maxfreq=10%
policy=ondemand
[/Profile]

##
# Basic states
##
[Rule]
name=Normal
#acpi_temperature=0-70
sensor=temp1:0-70
#cpu_interval=00-100
profile=Standard
[/Rule]

##
# Special Rules
##
# CPU Too hot!
[Rule]
name=CPU Hot
#acpi_temperature=4-5
sensor=temp1:73-76
#cpu_interval=00-100
profile=Hot
[/Rule]

[Rule]
name=CPU Too Hot
#acpi_temperature=50-100
sensor=temp1:76-100
#cpu_interval=00-100
profile=Overheating
[/Rule]

-- 
Ubuntu-devel-discuss mailing list
Ubuntu-devel-discuss@lists.ubuntu.com
Modify settings or unsubscribe at: 
https://lists.ubuntu.com/mailman/listinfo/ubuntu-devel-discuss
