I ran a little test over the Thanksgiving holiday to see how common random errors are in non-ECC memory. I used the memtest86+ bit fade test mode, which writes all 1s, waits 90 minutes, checks the result, then does the same thing for all 0s. This was the best test I could find for detecting the occasional gamma ray type data loss event. The result: no errors logged in 5 solid days of testing. So this class of error (the type ECC would detect and probably fix) apparently occurs on these machines at a rate of less than 1 per 840 Gigabyte-hours. Possibly the bound is only half as strong (less than 1 per 420 Gigabyte-hours) if data can only be lost on a 1 -> 0 transition, or vice versa, since each bit would then spend only half of the test in the vulnerable state. This assumes the bit fade test works, which cannot be independently verified from these results.
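For anyone who wants to check the arithmetic, here is a minimal sketch of the exposure calculation. The total memory figure is not a measured value; it is only what the 840 Gigabyte-hour number implies for 5 days (120 hours) of testing. The function and variable names are made up for illustration.

```python
# Exposure arithmetic for the bit fade run (a sketch, not a measurement).
HOURS_PER_DAY = 24


def exposure_gb_hours(total_gb, days):
    """Memory under test (GB) multiplied by test duration (hours)."""
    return total_gb * days * HOURS_PER_DAY


test_days = 5
quoted_exposure = 840.0  # Gigabyte-hours, the figure quoted in the post

# Assumption: total memory is inferred from the quoted figure, not stated.
implied_total_gb = quoted_exposure / (test_days * HOURS_PER_DAY)

# If a flip can only be seen in one of the two patterns (all 1s or all 0s),
# each bit is only vulnerable for half of the run.
one_direction_exposure = exposure_gb_hours(implied_total_gb, test_days) / 2

print(implied_total_gb)        # 7.0
print(one_direction_exposure)  # 420.0
```

So the quoted bound corresponds to about 7 GB under test for the whole run, and to 420 Gigabyte-hours of useful exposure in the one-direction case.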
On the web there are references to an IBM study which found 1 bit error/256Mb/month, which would be 1 per (0.25 * 30 * 24) = 180 Gigabyte-hours. If IBM's numbers held for my hardware I should have seen 4 or 5 errors in total. My machines are in a basement in a concrete building; perhaps that provided some shielding relative to IBM's test conditions. The memory was Corsair Twinx1024-3200C2. When first installed, all of this memory had run for 24 hours with no errors in normal memtest86+ testing.

Regards,

David Mathog
[EMAIL PROTECTED]
Manager, Sequence Analysis Facility, Biology Division, Caltech
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
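A postscript on the comparison with the IBM figure, as a rough sketch. It assumes that "256Mb" means 0.25 Gigabytes (as the conversion in the post does), that a month is 30 days, and that errors arrive independently, so a Poisson model applies. The Poisson step is an added assumption, not something taken from the IBM study, and the names are illustrative only.

```python
import math


def gb_hours_per_error(errors, gigabytes, hours):
    """Convert 'errors per memory size per time' into GB-hours per error."""
    return gigabytes * hours / errors


# 1 error per 256 MB per month, taking 256 MB = 0.25 GB and a 30-day month.
ibm_gb_hours = gb_hours_per_error(1, 0.25, 30 * 24)

my_exposure = 840.0  # Gigabyte-hours accumulated in the bit fade run
expected_errors = my_exposure / ibm_gb_hours

# Assumption: independent errors (Poisson), so P(0 errors) = exp(-mean).
p_zero = math.exp(-expected_errors)

print(ibm_gb_hours)               # 180.0
print(round(expected_errors, 2))  # 4.67
print(round(p_zero, 4))           # 0.0094
```

Under those assumptions, seeing zero errors where 4 or 5 were expected would happen by chance a bit less than 1% of the time, which suggests the difference from the IBM rate is probably real rather than luck.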
