Re: Please help - kernel crashes often

2006-02-02 Thread mike
[EMAIL PROTECTED] ~]# mcelog --k8 --ascii mce3.txt CPU 0 4 northbridge TSC 23f6fd4262e9 RIP 00:413bd6 Northbridge Chipkill ECC error Chipkill ECC syndrome = 3d3c bit40 = error found by scrub bit45 = uncorrected ecc error bit61 = error uncorrected bit62 =
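For anyone following the mcelog output above, the flag lines can be reproduced with a simple bit test on the status word. This is only a sketch: the three bit labels are copied from that output, while the function name decode_status and the sample value are made up for illustration, and none of it is taken from mcelog's own source.

```python
# Bit labels copied from the mcelog output quoted in this message.
# Other bits (including bit62, whose label is cut off above) are not decoded.
NB_STATUS_FLAGS = {
    40: "error found by scrub",
    45: "uncorrected ecc error",
    61: "error uncorrected",
}


def decode_status(status):
    """Return one 'bitN = label' string for each known flag set in status."""
    return [
        f"bit{bit} = {label}"
        for bit, label in sorted(NB_STATUS_FLAGS.items())
        if (status >> bit) & 1
    ]


# Hypothetical status word with only the three known flags set;
# this is NOT the real register value from the crash.
example_status = (1 << 61) | (1 << 45) | (1 << 40)

for line in decode_status(example_status):
    print(line)
```

With a real 64-bit status value, the same call would list whichever of these three flags are set and silently skip everything else.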

Re: Please help - kernel crashes often

2006-02-01 Thread Lennart Sorensen
On Tue, Jan 31, 2006 at 08:09:16PM -0800, mike wrote: Do you mean disable the CPU cache (L2?) or disable the ECC memory cache? I meant L2 CPU cache. I haven't looked at the BIOS yet to see which options I have, but I want to make sure I do the right one :) Wouldn't disabling some caches

Re: Please help - kernel crashes often

2006-02-01 Thread mike
After running memtest86 (V3.3) for at least 24 hours, I came back and saw that each machine completed 61-63 cycles of tests, with 0 errors... However, I did look through the BIOS for cache disabling - and it doesn't appear I can disable the CPU cache. I did turn on chipkill and some other

Re: Please help - kernel crashes often

2006-02-01 Thread Lennart Sorensen
On Wed, Feb 01, 2006 at 08:47:30AM -0800, mike wrote: After running memtest86 (V3.3) for at least 24 hours, I came back and saw that each machine completed 61-63 cycles of tests, with 0 errors... However, I did look through the BIOS for cache disabling - and it doesn't appear I can disable

Re: Please help - kernel crashes often

2006-02-01 Thread Paul Brook
On Wednesday 01 February 2006 16:47, mike wrote: After running memtest86 (V3.3) for at least 24 hours, I came back and saw that each machine completed 61-63 cycles of tests, with 0 errors... However, I did look through the BIOS for cache disabling - and it doesn't appear I can disable the

Re: Please help - kernel crashes often

2006-02-01 Thread mike
Yeah, I looked at the memory. It's got a PQI sticker (at least one set). It's SUPPOSED to be (Samsung, Micron, Elpida, Infineon, Hynix OEM) - which would align it basically with what Supermicro suggests:

Re: Please help - kernel crashes often

2006-01-31 Thread mike
Yes, I was able to go down and get on the console, record it, and found a thread on how to decipher it. The MCE was: CPU 0: Machine Check Exception: 4 Bank 0: f60da833 TSC 23fd7acec1e ADDR 797db2c0 Kernel panic - not syncing: Machine check The output from mcelog was: web03:~#
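As an aside for anyone transcribing a panic like the one above by hand: the fields can be pulled apart with a regular expression before looking them up. A minimal sketch, assuming the line has exactly the layout quoted in this message; the pattern, the function name parse_mce_line, and the field names (including exception_value for the number after "Machine Check Exception:") are my own labels, not anything defined by the kernel or mcelog.

```python
import re

# Layout assumed from the panic line quoted in this message.
MCE_PATTERN = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check Exception: (?P<exception_value>[0-9a-fA-F]+) "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]+) "
    r"TSC (?P<tsc>[0-9a-fA-F]+) ADDR (?P<addr>[0-9a-fA-F]+)"
)


def parse_mce_line(line):
    """Split a hand-copied MCE panic line into integer fields, or return None."""
    match = MCE_PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "cpu": int(fields["cpu"]),    # decimal CPU number
        "bank": int(fields["bank"]),  # decimal bank number
        # The remaining fields are treated as hexadecimal.
        "exception_value": int(fields["exception_value"], 16),
        "status": int(fields["status"], 16),
        "tsc": int(fields["tsc"], 16),
        "addr": int(fields["addr"], 16),
    }


# The line exactly as it was transcribed above.
panic = ("CPU 0: Machine Check Exception: 4 Bank 0: f60da833 "
         "TSC 23fd7acec1e ADDR 797db2c0")
print(parse_mce_line(panic))
```

A line that does not match the assumed layout returns None instead of raising, which makes a copying mistake easy to spot.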

Re: Please help - kernel crashes often

2006-01-31 Thread Lennart Sorensen
On Tue, Jan 31, 2006 at 01:45:01AM -0800, mike wrote: Yes, I was able to go down and get on the console, record it, and found a thread on how to decipher it. The MCE was: CPU 0: Machine Check Exception: 4 Bank 0: f60da833 TSC 23fd7acec1e ADDR 797db2c0 Kernel panic - not

Re: Please help - kernel crashes often

2006-01-31 Thread mike
Do you mean disable the CPU cache (L2?) or disable the ECC memory cache? I haven't looked at the BIOS yet to see which options I have, but I want to make sure I do the right one :) Wouldn't disabling some caches wind up cutting the performance down, especially if you mean the CPU cache? Having

Re: Please help - kernel crashes often

2006-01-30 Thread Anthony DeRobertis
mike wrote: Sure enough, one of the two others I tested failed within only 10-15 mins. Does this look appropriate for a dual-core, single chip, Opteron 175 in a 1u chassis? The max any CPU gets is 62C... That's pretty damn hot. Something closer to 40C is more normal...

Re: Please help - kernel crashes often

2006-01-30 Thread Anthony DeRobertis
Harald Dunkel wrote: Hi Mike, A machine check exception indicates a hardware problem, i.e. a broken CPU. (I am not sure whether it could indicate bad ECC memory, too. Did you run memtest86?) ECC failures will generate MCEs. The MCE message *should* provide some hint as to what is wrong.

Please help - kernel crashes often

2006-01-29 Thread mike
I'm having a problem - any time I do a large rsync, or even compile a kernel over and over (basically a stress test - but not really throwing a major load on it) it eventually throws a machine check exception (I don't have a console into it right now so I can't paste the exact dump...) I'm

Re: Please help - kernel crashes often

2006-01-29 Thread mike
I forgot, I should have included dmesg... and lspci I suppose. Bootdata ok (command line is root=/dev/sda2 ro) Linux version 2.6.14.3-mike ([EMAIL PROTECTED]) (gcc version 4.0.3 2005 (prerelease) (Debian 4.0.2-4)) #3 SMP Fri Dec 2 05:39:39 PST 2005 BIOS-provided physical RAM map: BIOS-e820:

Re: Please help - kernel crashes often

2006-01-29 Thread Harald Dunkel
Hi Mike, A machine check exception indicates a hardware problem, i.e. a broken CPU. (I am not sure whether it could indicate bad ECC memory, too. Did you run memtest86?) Since you get the problem under heavy load I would suggest checking the CPU fan. If the CPU is overheated then this is the

Re: Please help - kernel crashes often

2006-01-29 Thread mike
Sure enough, one of the two others I tested failed within only 10-15 mins. Does this look appropriate for a dual-core, single chip, Opteron 175 in a 1u chassis? The max any CPU gets is 62C... ipmisensors CPU Temparture (C) =62.00 Min=0.00 Max=75.00ok System Temparture (C)=25.00

Re: Please help - kernel crashes often

2006-01-29 Thread Christoph Fassbach
I would suppose it is a hardware issue, too. Have you tried memtest86 (http://www.memtest.org/)? Let this check run for about 20 hours. I experienced a problem on my router after 12 runs of the complete test suite, but immediately unzipping one special file. According to