On Oct 15 2012, Iliev, Hristo wrote:
Numeric differences are to be expected with parallel applications. The basic reason for that is that on many architectures floating-point operations are performed using higher internal precision than that of the arguments and only the final result is rounded back to the lower output precision. When performing the same operation in parallel, intermediate results are communicated using the lower precision and thus the final result could differ. ...
Not quite. That's ONE reason.
You could try to "cure" this (non-problem) by telling your compiler not to use higher precision for intermediate results.
But it wouldn't help if the problem is the other reason, which is that floating-point arithmetic is not associative. That means that the actual order of the operations makes a difference to the final result, and that order is (correctly) unspecified for MPI_Reduce.

I have had long arguments with people who believe in deterministic floating-point (i.e. that consistency implies correctness), but the fact is that this is an unavoidable consequence of using floating-point in parallel, or indeed of any serious numeric optimisation. So the summary is that anyone doing floating-point work has to learn to live with it. Any traditional book on numerical programming (i.e. from before 1980) will take that for granted.

Regards,
Nick Maclaren.