Robert G. Brown wrote:
Perhaps fortunately (perhaps not) there is a lot less variation in
system performance with system design than there once was.  Everybody
uses one of a few CPUs, one of a few chipsets, generic memory,
standardized peripherals.  There can be small variations from system to
system, but in many cases one can get a pretty good idea of the
nonlinear "performance fingerprint" of a given CPU/OS/compiler family
(e.g. opteron/linux/gcc) all at once and have it not be crazy wrong or
unintelligible as you vary similar systems from different manufacturers
or vary clock speed within the family.  There are enough exceptions that
it isn't wise to TRUST this rule, but it is still likely correct within
10% or so.


I agree that this rule holds for almost all codes ... that fit perfectly in cache and that do not try to benefit from specific optimisations.

HPC codes, however, are always pushing the limits, which means you will always stumble on some bottleneck somewhere. Once you remove one bottleneck, you stumble on the next. And each bottleneck masks all the others until you remove it.

E.g. it was already mentioned in this thread that one should not forget to pay attention to storage. Yet people often run parallel codes with each process performing heavy I/O, without a storage system adapted to that load.

Or another example: GotoBLAS is well known to outperform netlib BLAS. However, in an application calling many dgemm's on small matrices (up to 50x50), netlib BLAS will _really_ outperform GotoBLAS (by a factor of 30), because GotoBLAS 'loses' time aligning the matrices etc., which becomes significant for small matrices.
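The small-matrix effect is easy to see for yourself with whichever BLAS your numpy happens to link against (this is a generic sketch, not a GotoBLAS-vs-netlib comparison per se): the fixed per-call overhead of dispatch, packing and alignment dwarfs the 2n^3 flops of the actual multiply when n is small.

```python
import time
import numpy as np

def time_matmul(n, reps):
    # average wall-clock time per n-by-n matrix multiply
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    t0 = time.perf_counter()
    for _ in range(reps):
        a @ b  # dgemm via whatever BLAS numpy is built against
    return (time.perf_counter() - t0) / reps

small = time_matmul(8, 2000)   # overhead-dominated regime
large = time_matmul(512, 10)   # compute-dominated regime

# effective flop rate: 2*n^3 flops per multiply
flops_small = 2 * 8**3 / small
flops_large = 2 * 512**3 / large
print(f"effective GFLOP/s at n=8:   {flops_small / 1e9:.2f}")
print(f"effective GFLOP/s at n=512: {flops_large / 1e9:.2f}")
```

The effective flop rate at n=8 is typically orders of magnitude below the rate at n=512 on the same library, which is exactly the regime where a "dumber" BLAS with less setup work per call can win.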

toon
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
