[EMAIL PROTECTED] ~]# mcelog --k8 --ascii mce3.txt
CPU 0 4 northbridge TSC 23f6fd4262e9
RIP 00:413bd6
Northbridge Chipkill ECC error
Chipkill ECC syndrome = 3d3c
bit40 = error found by scrub
bit45 = uncorrected ecc error
bit61 = error uncorrected
bit62 =
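For anyone reading along: the "bitN = ..." lines in the mcelog output above are just set bits in the 64-bit machine check status word. A minimal sketch of that lookup follows, using only the three labels that appear in full in this output. The function name decode_status and the example value are made up for illustration, and this is not how mcelog itself is implemented.

```python
# Bit labels copied from the mcelog output above; no other bits are assumed.
K8_NB_STATUS_BITS = {
    40: "error found by scrub",
    45: "uncorrected ecc error",
    61: "error uncorrected",
}


def decode_status(status):
    """Return 'bitN = label' strings for each known bit set in status."""
    return [
        "bit%d = %s" % (bit, label)
        for bit, label in sorted(K8_NB_STATUS_BITS.items())
        if (status >> bit) & 1
    ]


# Hypothetical status value built only to exercise the three bits above;
# it is not taken from the machine check in this thread.
example = (1 << 40) | (1 << 45) | (1 << 61)

for line in decode_status(example):
    print(line)
```

Bits without an entry in the table are silently ignored, so a real status word will usually carry more information than this sketch reports.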
On Tue, Jan 31, 2006 at 08:09:16PM -0800, mike wrote:
Do you mean disable the CPU cache (L2?) or disable the ECC memory cache?
I meant L2 CPU cache.
I haven't looked at the BIOS yet to see which options I have, but I
want to make sure I do the right one :)
Wouldn't disabling some caches
After running memtest86 (V3.3) for at least 24 hours, I came back and
saw that each machine completed 61-63 cycles of tests, with 0
errors...
However, I did look through the BIOS for cache disabling - and it
doesn't appear I can disable the CPU cache.
I did turn on chipkill and some other
On Wed, Feb 01, 2006 at 08:47:30AM -0800, mike wrote:
After running memtest86 (V3.3) for at least 24 hours, I came back and
saw that each machine completed 61-63 cycles of tests, with 0
errors...
However, I did look through the BIOS for cache disabling - and it
doesn't appear I can disable
On Wednesday 01 February 2006 16:47, mike wrote:
After running memtest86 (V3.3) for at least 24 hours, I came back and
saw that each machine completed 61-63 cycles of tests, with 0
errors...
However, I did look through the BIOS for cache disabling - and it
doesn't appear I can disable the
Yeah, I looked at the memory. It's got a PQI sticker (at least one set)
It's SUPPOSED to be (Samsung, Micron, Elpida, Infineon, Hynix OEM) -
which would align it basically with what Supermicro suggests:
Yes, I was able to go down and get on the console, record it, and
found a thread on how to decipher it.
The MCE was:
CPU 0: Machine Check Exception: 4 Bank 0: f60da833
TSC 23fd7acec1e ADDR 797db2c0
Kernel panic - not syncing: Machine check
the output from mcelog was:
web03:~#
On Tue, Jan 31, 2006 at 01:45:01AM -0800, mike wrote:
Yes, I was able to go down and get on the console, record it, and
found a thread on how to decipher it.
The MCE was:
CPU 0: Machine Check Exception: 4 Bank 0: f60da833
TSC 23fd7acec1e ADDR 797db2c0
Kernel panic - not
Do you mean disable the CPU cache (L2?) or disable the ECC memory cache?
I haven't looked at the BIOS yet to see which options I have, but I
want to make sure I do the right one :)
Wouldn't disabling some caches wind up cutting the performance down,
especially if you mean the CPU cache? Having
mike wrote:
Sure enough, one of the two others I tested failed within only 10-15 mins.
Does this look appropriate for a dual-core, single chip, Opteron 175
in a 1u chassis? The max any CPU gets is 62C...
That's pretty damn hot. Something closer to 40C is more normal...
Harald Dunkel wrote:
Hi Mike,
A machine check exception indicates a hardware problem, i.e.
a broken CPU. (I am not sure whether it could indicate
bad ECC memory, too. Did you run memtest86?)
ECC failures will generate MCEs. The MCE message *should* provide some
hint as to what is wrong.
I'm having a problem - any time I do a large rsync, or even compile a
kernel over and over (basically a stress test - but not really
throwing a major load on it) it eventually throws a machine check
exception (I don't have a console into it right now so I can't paste
the exact dump...)
I'm
I forgot, I should have included dmesg... and lspci I suppose.
Bootdata ok (command line is root=/dev/sda2 ro)
Linux version 2.6.14.3-mike ([EMAIL PROTECTED]) (gcc version 4.0.3 2005
(prerelease) (Debian 4.0.2-4)) #3 SMP Fri Dec 2 05:39:39 PST 2005
BIOS-provided physical RAM map:
BIOS-e820:
Hi Mike,
A machine check exception indicates a hardware problem, i.e.
a broken CPU. (I am not sure whether it could indicate
bad ECC memory, too. Did you run memtest86?)
Since you get the problem under heavy load, I would suggest
checking the CPU fan. If the CPU is overheated then this
is the
Sure enough, one of the two others I tested failed within only 10-15 mins.
Does this look appropriate for a dual-core, single chip, Opteron 175
in a 1u chassis? The max any CPU gets is 62C...
ipmisensors
CPU Temparture (C) =62.00 Min=0.00 Max=75.00ok
System Temparture (C)=25.00
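A rough way to watch these readings during a stress test is to parse the sensor lines and flag anything above a chosen limit. The sketch below assumes the output always looks like the two lines above (including the tool's own "Temparture" spelling). The function names are made up, and the 40C default is only the rule of thumb from the reply in this thread, not a vendor specification.

```python
import re

# Matches lines such as "CPU Temparture (C) =62.00 Min=0.00 Max=75.00ok".
# Only the sensor name and the current reading are captured.
LINE_RE = re.compile(r"^\s*(?P<name>.+?)\s*\(C\)\s*=\s*(?P<value>\d+(?:\.\d+)?)")


def parse_temps(text):
    """Map each sensor name to its current reading in degrees C."""
    temps = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            temps[m.group("name")] = float(m.group("value"))
    return temps


def too_hot(temps, limit_c=40.0):
    """Return the sorted names of sensors reading above limit_c."""
    return sorted(name for name, value in temps.items() if value > limit_c)


sample = """CPU Temparture (C) =62.00 Min=0.00 Max=75.00ok
System Temparture (C)=25.00"""

readings = parse_temps(sample)
print(too_hot(readings))
```

Raising limit_c toward the sensor's own Max value makes the check less noisy, at the cost of warning later.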
I would suppose it is a hardware issue, too.
Have you tried memtest86 ( http://www.memtest.org/ ) ?
Let this check run for about 20 hours. I experienced a problem on my router
only after 12 runs of the complete test suite, but immediately when
unzipping one special file.
According to