Alexander Puchmayr <[EMAIL PROTECTED]> posted
[EMAIL PROTECTED], excerpted below, on 
Sun, 27 Apr 2008 15:23:25 +0200:

> After being abroad for a month I tried to boot my amd64 maschine and it
> crashes all the time, doing spontaneous reboots, hangs in the bios, and
> occasionally it tells me something about mce exception when loading
> hal-daemon, and immediately after that kernel panic, nothing works.
> 
> The MCE-exception also says i should run mcelog --ascii, but how should
> I do that if the maschine is virtually dead in the kernel panic state?

If you can, get the MCE number, write it down or whatever.  Then get the 
program parsemce (written by Dave Jones, the kernel guy).  The sources 
are here:

http://www.kernel.org/pub/linux/kernel/people/davej/tools/parsemce.c

The google hits are here:

http://www.google.com/linux?lr=lang_en&hl=en&q=parsemce

The personal experience:

The (generic) memory I originally had in this machine was apparently not 
quite up to its on-module and sold-as rating (PC3200).  Unfortunately, 
the original BIOS and early updates didn't have memory timing control; 
the board simply ran the memory at whatever the on-module rating said it 
was good for (assuming the board and CPUs, Opterons with onboard memory 
controllers, could run it as well), and that proved somewhat optimistic.

Memtest86 checked out clean, because it tests memory retention in various 
patterns, not memory timing, which it doesn't stress.  However, I'd get 
crashes often enough I knew something was wrong, and they'd happen more 
frequently under high activity, tho it didn't seem to be directly CPU 
related.  I did notice occasional bunzip2 errors claiming a corrupt 
archive, but trying again would usually work.  That seemed the only 
common user-space symptom I had -- the only other one was the MCEs and 
often (but not always) kernel panics as a result.  (I had the option to 
panic automatically on detected error turned off, setting it to try to 
continue if it could -- the only way I was able to compile things, 
sometimes.  I also learned how to restart an in-progress merge after a 
reboot...)
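
For anyone wanting to provoke that bunzip2 symptom on purpose, here is a 
minimal sketch, not something I ran at the time.  It assumes Python with 
its standard bz2 module, and the function name and sizes are made up for 
illustration.  It compresses and decompresses random data repeatedly and 
counts rounds that don't come back intact.  A clean run proves nothing, 
but failures that come and go between runs point at hardware rather than 
a bad archive.

```python
import bz2
import hashlib
import os


def roundtrip_check(rounds=20, size=4 * 1024 * 1024):
    """Compress and decompress random data, returning the number of bad rounds."""
    failures = 0
    for _ in range(rounds):
        data = os.urandom(size)
        expected = hashlib.sha256(data).hexdigest()
        try:
            restored = bz2.decompress(bz2.compress(data))
        except (OSError, ValueError):
            # A stream corrupted in flight usually surfaces as a decode error.
            failures += 1
            continue
        if hashlib.sha256(restored).hexdigest() != expected:
            # Decoded without complaint, but the bytes changed.
            failures += 1
    return failures


if __name__ == "__main__":
    print(f"{roundtrip_check()} failed round(s)")
```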

I suspected it might be memory but as I said memtest came up clean, and 
it could have been the memory controller on the CPU, or the on-chip 
cache, or...

Finally I found out that MCE stood for machine-check-exception, and that 
the number it reports could be checked against a chart AMD and Intel both 
provide to see what the machine check said was going wrong.  Parsemce was 
the way to do this at the time, tho it now seems there may be other 
methods, based on your comments.
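
To give an idea of what that lookup involves, here is a rough sketch in 
Python.  It is not parsemce, and it only covers the generic parts of the 
64-bit status value as I understand them from the vendor manuals: a few 
flag bits at the top, and a 16-bit error code at the bottom whose bit 
pattern says whether it was a bus, cache, or TLB error.  The function 
name and the sample value are made up, and the bit layout should be 
checked against the AMD or Intel documentation for your particular CPU.

```python
# Flag bits at the top of a 64-bit machine-check status value (assumed layout).
FLAGS = {
    63: "VAL (register holds a valid error)",
    62: "OVER (an earlier error was overwritten)",
    61: "UC (uncorrected error)",
    60: "EN (reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (address register valid)",
    57: "PCC (processor context corrupt)",
}


def decode_status(status):
    """Split a status value into set flags, the low 16-bit code, and a rough class."""
    flags = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    code = status & 0xFFFF
    # Only the three common compound-code families are classified here.
    if code & 0x0800:
        kind = "bus/interconnect error"
    elif code & 0x0100:
        kind = "memory hierarchy (cache) error"
    elif code & 0x0010:
        kind = "TLB error"
    else:
        kind = "other (look it up in the vendor chart)"
    return {"flags": flags, "error_code": code, "kind": kind}


if __name__ == "__main__":
    # Made-up value for illustration: VAL, UC, EN, ADDRV set, code 0x0813.
    sample = (1 << 63) | (1 << 61) | (1 << 60) | (1 << 58) | 0x0813
    result = decode_status(sample)
    print(hex(result["error_code"]), result["kind"])
    for flag in result["flags"]:
        print("  ", flag)
```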

Looking it up I found it was indeed memory bus errors, but it still could 
have been the board or something, not the memory itself.  Still, knowing 
it was generic memory, I figured it was.  The problem was that the 
socket-940 Opterons (which I have) take registered memory, which is 
nowhere near the commodity that unregistered non-ecc memory is, and I 
really couldn't afford it at the time.

Well, Tyan finally came out with a BIOS update that exposed the necessary 
memory knobs for me to tweak, and I found that limiting speed just one 
notch, from 400 MHz to 383 MHz (basically PC3000 speed instead of PC3200) 
solved the problem ENTIRELY.  At that speed I could even significantly 
tighten down the individual memory timings and the system was STILL 
stable as a rock -- it was the overall clockspeed that the memory just 
couldn't quite handle, at least on my board.  I suspect the memory was 
generic because it sat right at the edge of tolerance at one end, and the 
board is probably near tolerance at the other.  The board had originally 
only been rated to PC2700 speeds, so I was surprised it clocked the 
memory at PC3200 in the first place.  I had figured generic PC3200 memory 
should be plenty good to run at PC2700, and it would have been, only by 
then the BIOS had been updated to take up to PC3200 memory even tho it 
hadn't been updated to let me clock-limit it if necessary.  So the memory 
simply wasn't stable at those speeds in this board, but would have worked 
just fine at lower speeds (as it did for me once a BIOS update let me set 
them) or at rated speeds on many other boards.
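
For anyone wondering how the speed numbers line up with the PC ratings, 
the arithmetic is simple enough to sketch.  This assumes the usual 
convention that the PC number is peak bandwidth in MB/s over a 64-bit (8 
byte) module, and treats the 400 and 383 figures as effective data rates 
in millions of transfers per second.  The function name is made up.

```python
BUS_BYTES = 8  # a standard DDR module moves 64 bits (8 bytes) per transfer


def peak_mb_per_s(transfers_per_us):
    """Peak bandwidth in MB/s for a data rate given in millions of transfers/s."""
    return transfers_per_us * BUS_BYTES


if __name__ == "__main__":
    for rate in (400, 383, 333):
        print(f"{rate} -> about {peak_mb_per_s(rate)} MB/s")
```

So 400 works out to the 3200 in PC3200, 383 lands a bit over 3000, and 
333 lands just under 2700, with the marketing names being rounded.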

So I ran the memory at PC3000 speeds but with many of the individual 
timings tightened for a year or so... at which point I could finally 
afford to upgrade memory to what I /really/ wanted all along -- the 8 
gigs of memory I'm now running, tho it cost me ~US$1100 to do it.  It's 
Super Talent brand, perhaps not top of the line, but at least it's stable 
at rated speeds.

Anyway, first moral of the story is that the MCE reports were accurate.  
Altho they didn't pin it down all the way, they certainly pointed to the 
right area, confirming my earlier suspicions, and parsemce can help with 
the number to English description mapping.  Second is that just because 
memtest86 says it's fine doesn't necessarily mean the /timings/ are fine, 
as it tests memory cell reliability and doesn't stress timings.

Other morals would be to avoid expensive registered memory requirements 
if possible, avoid generic memory, avoid making assumptions about what a 
board will try to run the memory at despite what it's rated to run it at, 
and when shopping, confirm if possible that the BIOS contains memory 
speed tweaking knobs.  

... The reason I got Opterons in the first place was because I wanted a 
dual CPU system, and I wanted AMD64, and that's the way you got it at the 
time.  I've been very happy with the system in general.  Tyan is quite 
good with Linux support on many of their boards, including this one -- it 
was certified with several Linux distributions and they even had a 
preconfigured lm_sensors.conf available for download! =8^)  Additionally, 
when the dual-cores came out, all I had to do was upgrade the BIOS and I 
was ready to upgrade the CPUs (which I've now done, to dual dual-core 
Opteron 290s, at 2.8 GHz, top of the socket 940 line).  It sure would 
have been nice to have the memory tweaking stuff in the BIOS rather 
earlier, however.

Meanwhile, I plan on keeping this one for a while (thus the money sunk in 
upgrades).  When I do upgrade again, I'd like to make it a Tyan once 
again, but 4 cores is really plenty and likely will still be plenty at 
that time, and that's available in single socket desktops now, so 
I won't need to spend the extra $$ on server class gear.

As for your problems, keep in mind that a failing or overloaded 
power supply can produce similar symptoms, as can unstable wall power.  
Or overclocking, altho it doesn't sound like you're into that.
But since you are getting MCEs, I'd definitely check what they say is 
wrong before anything else.  The machine is providing the information, 
why not use it?

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
