David Mathog wrote:
> I ran a little test over the Thanksgiving holiday to see how common
> random errors in nonECC memory are.  I used the memtest86+ bit fade test
> mode, which writes all 1s, waits 90 minutes, checks the result, then
> does the same thing for all 0s.   Anyway, this was the best test I could
> find for detecting the occasional gamma ray type data loss event.  The
> [...]

Hello, David.

Memtest86+ is fine for 'burn-in' tests, but it does not stress memory realistically, under the conditions in which normal applications run. I test new non-ECC compute nodes by booting memtest86+ and running it for 24h. If there are no errors, I reboot into Linux and run memtester. I've found memory that passes a 24h memtest86+ test but fails memtester:

    http://pyropus.ca/software/memtester/

If one of our compute nodes crashes while in use, it is re-tested the same way before being allowed to rejoin the openMosix cluster. It is possible that faults detected by memtester are caused by other components, such as CPUs overheating or PSUs struggling to provide enough power, but the important point is that these problems affect applications in a similar way.

All the compute nodes in our Beowulf cluster have to pass 24h of memtest86+ cleanly, followed by 100 memtester runs on 128MB of RAM, before being trusted to accept openMosix migrated processes or to be used as LAM MPI hosts.
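
For anyone who wants to script the memtester stage, here is a rough Python sketch of how it could be wrapped. It is not the script we use. The function names are made up for illustration, and it assumes memtester is on the PATH and is run as root so it can lock the memory it tests. The command form (size, then loop count) and the exit status bits are as I understand them from the memtester man page, so check them against your installed version.

```python
import subprocess


def build_memtester_command(megabytes=128, loops=100):
    """Build the argument list for 'memtester <mem>M <loops>'.

    The defaults mirror the 128MB / 100 runs quoted above.
    """
    return ["memtester", f"{megabytes}M", str(loops)]


def describe_exit_status(status):
    """Translate a memtester exit status into readable findings.

    Assumption (from the memtester man page): 0 means success, otherwise
    the status is the logical OR of 0x01, 0x02 and 0x04.
    """
    if status == 0:
        return ["pass"]
    if status < 0:
        # subprocess reports death by signal as a negative return code.
        return [f"terminated by signal {-status}"]
    findings = []
    if status & 0x01:
        findings.append("could not allocate or lock memory, or bad invocation")
    if status & 0x02:
        findings.append("stuck address test failed")
    if status & 0x04:
        findings.append("one or more of the other tests failed")
    return findings or [f"unrecognised exit status {status}"]


def run_memtester(megabytes=128, loops=100):
    """Run memtester and return (trusted, findings).

    A node is treated as trusted only when memtester exits with status 0.
    """
    result = subprocess.run(build_memtester_command(megabytes, loops))
    return result.returncode == 0, describe_exit_status(result.returncode)
```

Calling run_memtester() on a node that has already passed memtest86+ returns a pass/fail flag plus the decoded findings. Anything other than a clean pass would keep the node out of the cluster until the hardware has been looked at.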

Best wishes,

    Tony.
--
Dr. A.J.Travis,                     |  mailto:[EMAIL PROTECTED]
Rowett Research Institute,          |    http://www.rri.sari.ac.uk/~ajt
Greenburn Road, Bucksburn,          |   phone:+44 (0)1224 712751
Aberdeen AB21 9SB, Scotland, UK.    |     fax:+44 (0)1224 716687
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
