Lance Jacobs <[EMAIL PROTECTED]> posted
[EMAIL PROTECTED], excerpted below, on
 Wed, 31 May 2006 10:58:16 -0400:

> Just FYI, I have this fixed now, and you were right on the money.  I
> wouldn't have believed it if I hadn't seen it, but it was the RAM.  I
> replaced the OCZ memory with equivalent parts from Crucial, and the
> system is fine now.  It still seems strange that many things seemed to
> run fine with the old RAM, except for bunzip2 and md5sum, and now
> everything is good -- only some code is intolerant of bad bits?  It just
> seems wrong....  Anyway, the system is rock solid now.

It's all down to the application (as in what the software does, rather
than as in the specific executable)... md5sum and bunzip2 just happen to
be in a class of application that is more sensitive to this sort of thing
than others, since part of their job is to verify integrity, and even a
single bit-flip somewhere will cause that verification to fail.  Most
applications aren't that sensitive.  In an ordinary application, a single
random bit-flip won't make a lot of difference.  If it's in the lower
order bits of an image bitmap or sound sample, you'll not notice it at all
(see steganography), and certainly the result can still be played or
viewed without error.  If it's in the wrong place in an executable, you'll
get bad results, though likely not bad enough to crash immediately.
Instead, you get output that gets worse and worse, and executables that
get less and less stable, over time.  If you don't tend to run
executables for days or weeks at a time, if you shut down your computer
when not in use, and you never use integrity verification applications,
such memory unreliability may go entirely undetected and unsuspected.
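
To make the difference concrete, here is a rough Python sketch of the
idea.  It is only an illustration under my own assumptions, not anything
measured on the system in question: the helper names flip_bit and
bunzip_notices are made up for this example, and the "samples" buffer is
just a stand-in for 8-bit image or sound data.  It flips one bit in
software and compares how raw sample data, an MD5 digest, and a bzip2
stream react.

```python
import bz2
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def bunzip_notices(compressed: bytes, original: bytes) -> bool:
    """True if decompression fails or no longer yields the original data."""
    try:
        return bz2.decompress(compressed) != original
    except (OSError, ValueError, EOFError):
        # Invalid stream, CRC mismatch, or truncated data.
        return True


# Hypothetical stand-in for 8-bit sound or image samples.
samples = bytes(range(256)) * 16

# Case 1: the lowest-order bit of one raw sample flips.
damaged = flip_bit(samples, 8 * 1000)  # bit 0 of byte 1000
worst_change = max(abs(a - b) for a, b in zip(samples, damaged))
print("largest change in any sample value:", worst_change)

# Case 2: the same single flip, as seen by an integrity check.
same_digest = hashlib.md5(samples).digest() == hashlib.md5(damaged).digest()
print("md5 digests still match:", same_digest)

# Case 3: flip each bit of the bzip2 stream in turn and count how many
# of those flips decompression notices.
packed = bz2.compress(samples)
total_bits = len(packed) * 8
noticed = sum(
    bunzip_notices(flip_bit(packed, i), samples) for i in range(total_bits)
)
print("bzip2 noticed", noticed, "of", total_bits, "single-bit flips")
```

The raw sample moves by one step out of 256, which nobody would hear or
see, while the digest comparison fails outright and most flips in the
compressed stream are caught as well.  A few flips in a bzip2 stream can
slip through (the block-size digit in the header, for example), which is
why the sketch counts them rather than assuming every one is detected.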

You mention that you have the memory in something else now, for further
testing.  Note that depending on the exact nature of the problem, the
memory may come up clean on a different mobo.  Hardware tolerances and
resistance to data signal noise being what they are, it's entirely
possible the memory was just at one end of the spec and the board at the
other, in terms of tolerances that would work, and they were thus
incompatible with each other, while each remains within spec or only
slightly out of spec, and will work with 90% of what's out there -- they
just wouldn't work when that particular memory was in that particular
board.

Based on the experience I had (which I posted earlier), you may also find
that the memory is perfectly fine under most conditions, but is subject to
errors in certain corner conditions.  If you've ever seen the complete set
of non-auto memory parameters available in some BIOS setups, there's quite
a list of them, ten or so.  Suppose one of the rare corner-case ones
misses the on-stick memory ratings by even a single clock, but that state
transfer doesn't happen very frequently, and even when it does, all but
one of the chips on the stick are fine with it, and even then, that single
exception only fails in a certain temperature zone, on the third memory
access in a row of a specific pattern.  That could be extremely difficult
to find or verify, yet cause annoying problems just often enough to be a
real frustration!

You did mention that the memory was under warranty, however, and that
it's going back regardless, and that's a wise decision.



-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

-- 
gentoo-amd64@gentoo.org mailing list
