On Mon, 30 Apr 2007, Orion Poplawski wrote:
> I do mean accuracy, and not necessarily subtle - things blow up bad. Perform some set of calculations over and over and error if it doesn't give the expected result.
Again, this sounds like a bug, not a hardware problem. If it occurs on two completely different systems I'd suspect a plain old programming or systems bug until proven otherwise. There are so many of these possible that it is difficult to know where to start. Program Bessel functions wrong (recursion in the wrong direction) and you end up with garbage. Program incomplete gamma functions wrong, ditto. Use single precision where you should be using double, do a sum in the wrong order, overwrite the boundary of an array...

The fact that you got/get a segment violation in one of your runs (IIRC) is pretty much telling you "hey, you're writing out of bounds in memory somewhere". This can be a simple case of somebody not doing bounds checking in a program and your using the program with inputs they didn't expect or in some uncommonly traced way.

The reason to use open source code for doing serious work like this is that you can debug it. Everybody has their own methodology, but as I think Greg mentioned recently in a different context, ultimately it involves instrumenting your code with output statements and tracing it through one or more crashes. How difficult this ends up being depends on who wrote the code, how well commented it is, and whether or not you can make a clever guess as to where the problem "probably" lies.

I've been coding, one way or another, for coming up on 35 years or thereabouts, starting with paper tape, going through cards (lots of cards), and up the evolutionary ladder. In all of that time, I've encountered one -- count it, one -- time that a consistent error in code I was running was due to a real failure in the hardware I was running on and not a bug in my own code. And that was on crap hardware. I cannot begin to count the number of times that I've discovered bugs in my own code, including ones that were so subtle that I SWORE that the computer was making a mistake -- until I discovered my own.
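To make the Bessel remark above concrete, here is a minimal Python sketch. It is an illustration only, not code from the program under discussion, and the names bessel_series, bessel_upward and bessel_downward are made up for the example. It assumes small x, so the power series can serve as a reference value, and it uses the standard three-term recurrence J_{n+1}(x) = (2n/x) J_n(x) - J_{n-1}(x). Run upward past n > x, that recurrence amplifies roundoff. Run downward from an arbitrary starting value and normalized afterwards (usually called Miller's algorithm), it is stable.

```python
import math

def bessel_series(n, x, terms=40):
    """Reference J_n(x) from its power series (adequate for small x only)."""
    return math.fsum(
        (-1) ** k / (math.factorial(k) * math.factorial(n + k))
        * (x / 2) ** (2 * k + n)
        for k in range(terms)
    )

def bessel_upward(nmax, x):
    """J_0..J_nmax by upward recurrence: unstable once n exceeds x."""
    j = [bessel_series(0, x), bessel_series(1, x)]
    for n in range(1, nmax):
        j.append(2 * n / x * j[n] - j[n - 1])
    return j

def bessel_downward(nmax, x, extra=20):
    """J_0..J_nmax by downward recurrence from an arbitrary start value."""
    start = nmax + extra
    j = [0.0] * (start + 2)
    j[start] = 1.0  # arbitrary seed; j[start + 1] stays 0
    for n in range(start, 0, -1):
        j[n - 1] = 2 * n / x * j[n] - j[n + 1]
    # Normalize with the identity J_0 + 2*(J_2 + J_4 + ...) = 1.
    norm = j[0] + 2 * sum(j[2::2])
    return [v / norm for v in j[: nmax + 1]]

if __name__ == "__main__":
    x, n = 1.0, 20
    print("series  :", bessel_series(n, x))
    print("upward  :", bessel_upward(n, x)[n])
    print("downward:", bessel_downward(n, x)[n])
```

For x = 1 and n = 20 the downward value agrees with the series to many digits, while the upward value is wrong by many orders of magnitude. Same formula, same hardware, and the only difference is the direction of the loop.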
The question you should be asking, then, isn't "what could make my program fail?" Nobody can answer that from far away. There is a near infinity of possible causes. The right question is "how can I FIGURE OUT why my program is failing?" And the answer is, slowly and deliberately, by presuming a bug in the code and instrumenting the code until you can "see" the failure occurring and completely understand where and why it fails. Nothing else works. Seriously.

   rgb
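P.S. As a rough illustration of what "instrumenting the code" can look like, here is a small Python sketch. Everything in it is hypothetical: check_finite is a made-up helper, and run is a toy loop that overflows on purpose, standing in for whatever the real calculation is. The idea is only that a cheap check after each step tells you the first step and the first element where things go bad, instead of leaving you with garbage at the end.

```python
import math
import sys

def check_finite(label, step, values, log=sys.stderr):
    """Report the first non-finite entry; return True if all are finite."""
    for i, v in enumerate(values):
        if not math.isfinite(v):
            print(f"[trace] {label}: step {step}, index {i}, value {v!r}",
                  file=log)
            return False
    return True

def run(steps, growth=1e40):
    """Toy iteration (a stand-in for the real calculation).

    Returns the first step at which the state goes bad, or None.
    """
    state = [1.0, 2.0, 3.0]
    for step in range(steps):
        state = [v * growth for v in state]  # deliberately blows up
        if not check_finite("state", step, state):
            return step
    return None

if __name__ == "__main__":
    bad = run(20)
    print("first bad step:", bad)
```

Once the failing step is known, move the checks inward (into the routines called at that step) and repeat until the offending line is in plain view.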
--
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:[EMAIL PROTECTED]
