Robert G. Brown wrote:
Perhaps fortunately (perhaps not) there is a lot less variation in system performance with system design than there once was. Everybody uses one of a few CPUs, one of a few chipsets, generic memory, standardized peripherals. There can be small variations from system to system, but in many cases one can get a pretty good idea of the nonlinear "performance fingerprint" of a given CPU/OS/compiler family (e.g. opteron/linux/gcc) all at once and have it not be crazy wrong or unintelligible as you vary similar systems from different manufacturers or vary clock speed within the family. There are enough exceptions that it isn't wise to TRUST this rule, but it is still likely correct within 10% or so.
I agree that this rule holds for almost all codes ... that fit entirely in cache and that do not try to benefit from specific optimisations.
HPC codes, however, are always pushing the limits, which means you will always stumble on some bottleneck somewhere. Once you remove that bottleneck, you stumble on the next one. And each bottleneck masks all the others until you remove it.
E.g. it was already mentioned in this thread that one should not forget to pay attention to storage. Yet people often run parallel codes in which each process performs heavy I/O, without a storage system adapted to that load.
Or another example: GotoBLAS is well known to outperform netlib BLAS. However, in an application calling many dgemm's on small matrices (up to 50x50), netlib BLAS will _really_ (i.e. by a factor of 30) outperform GotoBLAS, because GotoBLAS 'loses' time aligning the matrices etc., which becomes significant for small matrices.
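The small-matrix regime above is easy to probe yourself. Here is a minimal, hypothetical micro-benchmark sketch: it times many products of 50x50 matrices via whatever BLAS the local NumPy happens to be linked against. Which BLAS that is decides the outcome, so the numbers you get are illustrative only and not a reproduction of the factor-30 figure quoted above; the function name and parameters are my own, not from the thread.

```python
import time
import numpy as np

def time_small_dgemms(n=50, reps=2000, seed=0):
    """Time `reps` products of n-by-n matrices -- the small-matrix regime
    discussed above. Returns (elapsed seconds, last product)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    t0 = time.perf_counter()
    for _ in range(reps):
        c = a @ b  # dispatches to the dgemm of whichever BLAS NumPy links
    return time.perf_counter() - t0, c

if __name__ == "__main__":
    elapsed, _ = time_small_dgemms()
    print(f"{elapsed:.3f} s for 2000 dgemm calls on 50x50 matrices")
```

Running the same script against two NumPy builds linked to different BLAS libraries is a crude but effective way to see whether per-call overhead (alignment, copying, thread startup) dominates at these matrix sizes.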
toon

_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
