Lance Jacobs <[EMAIL PROTECTED]> posted [EMAIL PROTECTED], excerpted below, on Wed, 31 May 2006 10:58:16 -0400:
> Just FYI, I have this fixed now, and you were right on the money. I
> wouldn't have believed it if I hadn't seen it, but it was the RAM. I
> replaced the OCZ memory with equivalent parts from Crucial, and the
> system is fine now. It still seems strange that many things seemed to
> run fine with the old RAM, except for bunzip2 and md5sum, and now
> everything is good -- only some code is intolerant of bad bits? It just
> seems wrong.... Anyway, the system is rock solid now.

It's all down to the application (as in what the software does, rather
than as in the specific executable). md5sum and bunzip2 just happen to
be in a class of application that is more sensitive to this sort of
thing than others, since their application (or part of it) is to verify
integrity, and even a single bit-flip somewhere will cause that
verification to fail.

Most applications aren't that sensitive. In a normal executable, a
single random bit-flip won't make a lot of difference. If it's in the
lower order bits of an image bitmap or sound sample, you'll not notice
it at all (see steganography), and certainly the result can still be
played or viewed without error. If it's in the wrong place in an
executable, you'll get bad results, though likely not bad enough to
crash immediately. Rather, you'll see output that gets worse and worse,
and executables that get less and less stable, over time. If you don't
tend to run executables for days or weeks at a time, if you shut down
your computer when not in use, and you never use integrity verification
applications, such memory unreliability may go entirely undetected and
unsuspected.

You mention that you have the memory in something else now, for further
testing. Note that depending on the exact nature of the problem, the
memory may come up clean on a different mobo.
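
To make the earlier point about md5sum and bunzip2 concrete, here is a
minimal Python sketch of why integrity-checking code notices a single
flipped bit when most other code would not. It is only an illustration
under stated assumptions: the payload is made-up stand-in data, the
helper names (flip_bit, md5_hex, bunzip2_ok) are hypothetical, and the
bit-flip is simulated in software rather than caused by bad RAM.

```python
import bz2
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest, the same value md5sum prints for a file."""
    return hashlib.md5(data).hexdigest()


def bunzip2_ok(compressed: bytes, expected: bytes) -> bool:
    """True only if the stream decompresses cleanly to the expected bytes."""
    try:
        return bz2.decompress(compressed) == expected
    except (OSError, ValueError):
        # bz2 raises OSError for an invalid stream (including a failed
        # CRC check) and ValueError for a stream that ends too early.
        return False


# Stand-in data (an assumption, not anything from the thread): 64 KiB of
# deterministic, hard-to-compress bytes.
payload = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(2048))
compressed = bz2.compress(payload)

# Simulate one bad bit in the plain data and one in the compressed stream.
damaged_payload = flip_bit(payload, 12345)
damaged_compressed = flip_bit(compressed, len(compressed) * 4)

print("md5 unchanged after flip: ", md5_hex(payload) == md5_hex(damaged_payload))
print("clean stream verifies:    ", bunzip2_ok(compressed, payload))
print("damaged stream verifies:  ", bunzip2_ok(damaged_compressed, payload))
```

The damaged payload differs from the original in one bit out of roughly
half a million, which a viewer or player would shrug off, yet the digest
no longer matches and the damaged bzip2 stream no longer verifies.
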
Hardware tolerances and resistance to data signal noise being what they
are, it's entirely possible the memory was at one end of the spec and
the board at the other, in terms of tolerances that would work. The two
were thus incompatible with each other, while each remains within spec
or only slightly out of spec, and will work with 90% of what's out
there -- they just wouldn't work when that particular memory was in
that particular board.

Based on the experience I had (which I posted earlier), you may also
find that the memory is perfectly fine under most conditions, but is
subject to errors in certain corner conditions. If you've ever seen the
complete set of non-auto memory parameters available in some BIOS
setups, there's quite a list of them, ten or so. Suppose one of the rare
corner-case ones misses the on-stick memory ratings by even a single
clock, but that state transition doesn't happen very frequently. Suppose
further that even when it does, all but one of the chips on the stick
are fine with it, and that the single exception only misbehaves in a
certain temperature range, on the third memory access in a row of a
specific pattern. That could be extremely difficult to find or verify,
yet cause annoying problems just often enough to be a real frustration!

You did mention that the memory was under warranty, however, and that
it's going back regardless, and that's a wise decision.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

-- 
gentoo-amd64@gentoo.org mailing list