- 2 Tyan S2882 dual processor Opteron 244 stepping 10
- 2 Tyan S2882-D dual processor dual core Opteron 275 stepping 2
OK, those are obviously fairly different in age and power.
We have two (relatively complicated) numerical models (RAMS and a homegrown
one) that will blow up in random locations on the 244 machines but run fine
on the 275 machines.
they blow up consistently on multiple 244's? might all the 244's have the
same potentially flawed cooling/heatsink-compound/powersupply/etc? are you
saying there's something _similar_ between the 244s and the 275s?
By blow up, I mean the calculations appear to get corrupted in some way: in
RAMS the numbers become unphysical and the simulation exits. With the other
model we get segfaults.
is your ram ECC (and enabled as such in bios, preferably with scrub enabled)?
if ECC, have you run mcelog?
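
if the kernel has an EDAC driver loaded, the corrected/uncorrected error
counters in sysfs are another place to look besides mcelog. a minimal sketch,
assuming the standard EDAC layout under /sys/devices/system/edac/mc (the
function name and the root parameter are mine, just for illustration, and
whether the driver is available on these FC4/5 kernels is an assumption):

```python
import glob
import os


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters.

    Assumes the usual EDAC layout: one mcN directory per memory controller,
    each holding ce_count and ue_count files. Returns {} if none are found.
    """
    counts = {}
    for mc in sorted(glob.glob(os.path.join(root, "mc*"))):
        try:
            with open(os.path.join(mc, "ce_count")) as f:
                ce = int(f.read().strip())
            with open(os.path.join(mc, "ue_count")) as f:
                ue = int(f.read().strip())
        except (OSError, ValueError):
            continue  # counter missing or unreadable; skip this controller
        counts[os.path.basename(mc)] = (ce, ue)
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("no EDAC memory controllers found (driver not loaded, or ECC off?)")
    for name, (ce, ue) in counts.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

a corrected count that keeps climbing during a model run would point at
memory rather than software; all zeros doesn't prove much if ECC isn't
actually enabled in the bios.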
We've tried FC4/5 on the 244 machines. At one point all were running
identical FC5 installs with the same problems.
why do you think the problem is software?
Unfortunately the problem is not exactly reproducible. The models crash at
different times in the simulations, but with the length of runs we are doing
they always crash at some point.
sounds like heat/power to me.
Are there any cpu tests out there that would check the accuracy of various
calculations?
you don't mean accuracy, do you? like some subtle problem with low-order bits?
I tend to use HPL for this kind of test - it's not too hard to tune it to use
however much memory you want, and to run for as long as you want. it's not
the ultimate memory-grinder, but it's pretty intense.
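
if you want something quick to try before setting up HPL, the basic idea is
to run the same deterministic calculation many times and compare the results
bit-for-bit: on healthy hardware every run must match exactly. a toy sketch
of that idea follows. this is not HPL, the workload and all the names in it
are made up for illustration, and pure python will not load the memory or
heat the cpu anywhere near as hard as HPL does:

```python
import hashlib
import struct


def workload(n=200, steps=50):
    """Deterministic floating-point work: repeated dense matrix-vector products."""
    a = [[((i * 31 + j * 17) % 97) / 97.0 for j in range(n)] for i in range(n)]
    x = [1.0 / (i + 1) for i in range(n)]
    for _ in range(steps):
        y = [sum(row[j] * x[j] for j in range(n)) for row in a]
        norm = max(abs(v) for v in y)
        x = [v / norm for v in y]  # rescale so values stay in range
    return x


def digest(values):
    """Hash the exact bit patterns of the doubles, so any flipped bit shows up."""
    return hashlib.sha256(struct.pack(f"<{len(values)}d", *values)).hexdigest()


def check_repeatability(runs=5, **kw):
    """Run the workload `runs` times; return (all_identical, list_of_digests)."""
    digests = [digest(workload(**kw)) for _ in range(runs)]
    return len(set(digests)) == 1, digests


if __name__ == "__main__":
    ok, digests = check_repeatability(runs=5)
    print("all runs identical" if ok else "MISMATCH between runs:")
    if not ok:
        for d in digests:
            print(" ", d)
```

a mismatch between runs on the same machine means the hardware gave
different answers for identical input, which is hard to blame on the models.
identical digests only mean this small test didn't trip anything.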
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf