I have an i7-3612QE system that kernel panics fairly consistently after a 
couple of minutes of running a custom video streaming application.  After the 
kernel panic, the machine reboots.  When I turn off turbo mode in the BIOS, the 
panics happen less frequently.  All other BIOS settings are at their defaults, 
with no overclocking or anything fancy.



-Can be reproduced by running the streaming application at ~250% CPU; temps 
are a little high, floating around 68-71 degrees

-sysbench runs fine with 8 threads and drives CPU usage up to ~800%, with no 
kernel panics; temps remain below 70 degrees

-MemTest did not report any errors

-Intel Processor Diagnostic Tool passed

-Tried swapping RAM

-Reproducible on multiple machines, not just a single processor (which 
probably rules out a single bad processor)



-Able to mitigate most of the kernel panics and reboots by disabling Turbo mode 
(this is unacceptable, just including this for debugging purposes)

-Also able to mitigate kernel panics and reboots by changing the CPU frequency 
scaling_governor from ondemand to conservative.  Conservative 
"gracefully increases and decreases the CPU speed rather than jumping to max 
speed the moment there is any load on the CPU" 
(https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt)
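For anyone who wants to repeat the governor test, here is a minimal sketch of how the governor can be inspected and switched through sysfs. It assumes the usual cpufreq layout under /sys/devices/system/cpu/cpuN/cpufreq/scaling_governor; the function names are only illustrative, and writing requires root.

```python
import glob
import os

# Assumed standard sysfs location of the per-CPU cpufreq directories.
SYSFS_CPU = "/sys/devices/system/cpu"


def governor_files(base=SYSFS_CPU):
    """Return the scaling_governor file of every CPU that exposes cpufreq."""
    pattern = os.path.join(base, "cpu[0-9]*", "cpufreq", "scaling_governor")
    return sorted(glob.glob(pattern))


def read_governors(base=SYSFS_CPU):
    """Map each CPU directory name (e.g. 'cpu0') to its current governor."""
    result = {}
    for path in governor_files(base):
        cpu = path.split(os.sep)[-3]  # .../cpuN/cpufreq/scaling_governor
        with open(path) as f:
            result[cpu] = f.read().strip()
    return result


def set_governor(name, base=SYSFS_CPU):
    """Write the governor name to every CPU (needs root on a real system)."""
    for path in governor_files(base):
        with open(path, "w") as f:
            f.write(name)


if __name__ == "__main__":
    print(read_governors())
```

Calling set_governor("conservative") as root should be equivalent to echoing the name into each scaling_governor file by hand; the setting does not persist across reboots.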



Here is what I could copy down from the kernel panic on the monitor.  The 
messages sometimes vary slightly, but the TSC and PROCESSOR messages are almost 
always the same.



[Hardware Error]:  TSC 6e496d96062

[Hardware Error]:  PROCESSOR 0:306a9 TIME 1418929330 SOCKET 0 APIC 3 microcode 
12

[Hardware Error]:  Run the above through 'mcelog --ascii'

[Hardware Error]:  CPU 0: Machine Check Exception: 5 Bank 4: b200000000100402

[Hardware Error]:  RIP !INEXACT! 10:<ffffffff811ee04c> {intel_idle+0xb9/0x119}

[Hardware Error]:  TSC 148c99828a0

[Hardware Error]:  PROCESSOR 0:306a9 TIME 1418929330 SOCKET 0 APIC 3 microcode 
12

[Hardware Error]:  Run the above through 'mcelog --ascii'

[Hardware Error]:  Some CPUs didn't answer in synchronization

[Hardware Error]:  Machine check:  Processor context corrupt

Kernel panic - not syncing: Fatal machine check on current CPU

Pid: 0, comm: swapper/3 Tainted: P  M   0 3.2.0-4-amd64 #1 Debian 3.2.51-1

Call Trace: ...
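As a rough aid for reading the "Bank 4" status value above, here is a small sketch that splits it into the architectural fields of IA32_MCi_STATUS. The bit positions are the ones documented in the Intel SDM; the function name is just illustrative, and this is not a replacement for running the full record through mcelog.

```python
# Bit positions of the architectural flags in IA32_MCi_STATUS (Intel SDM).
FLAGS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected
    "EN": 60,     # error reporting was enabled for this bank
    "MISCV": 59,  # MCi_MISC register holds extra information
    "ADDRV": 58,  # MCi_ADDR register holds a valid address
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error-code fields."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


if __name__ == "__main__":
    decoded = decode_mci_status(0xB200000000100402)
    set_flags = [name for name, on in decoded["flags"].items() if on]
    print("flags set:", ", ".join(set_flags))
    print("MCA error code: 0x%04x" % decoded["mca_error_code"])
    print("model-specific code: 0x%04x" % decoded["model_specific_code"])
```

For b200000000100402 this gives VAL, UC, EN and PCC set, which agrees with the "Processor context corrupt" line, and an MCA error code of 0x0402. As far as I can tell from the SDM, codes of the form 0000 01xx xxxx xxxx are "internal unclassified" errors, but I would treat that reading as tentative.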



Also, here is a post from Super User with some suggestions that I tried: 
"Kernel Panic from overheating?" 
<http://superuser.com/questions/854199/kernel-panic-from-overheating?noredirect=1#comment1130272_854199>



Here is my thread on the Intel communities forum, where someone recommended I 
post to the Debian list: https://communities.intel.com/thread/58372?sr=stream



I am looking to actually debug and fix this, but I am not finding much 
information out there.  Any ideas for what to try next?
