Brian Beesley wrote:
>The optimization that should probably be done for Athlon is to
>organize the code to allow FMUL & FADD to execute in parallel (which
>the Pentium II/III core just can't manage). This could give a speedup
>of the order of 40%.
That would be nice if true, but I suspect it's a bit overoptimistic.
The reason is this: the Athlon utilizes out-of-order execution, i.e.
even if the assembly code indicates a certain instruction ordering
(e.g. FADDs interleaved with FMULs, as required for the Pentium, which
can complete just one double-precision floating-point op per cycle), the CPU
is free to execute them in a different order, as long as any data
dependencies are preserved. That means the Athlon is probably already
executing quite a few such FADD/FMUL pairs in parallel, unless I'm
misunderstanding something fundamental about its OOE capabilities.
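
To make the ordering point concrete, here is a minimal C sketch. It is not taken from any actual LL code, and the function names grouped() and interleaved() are made up for illustration. Both routines perform the same two adds and two multiplies, none of which depends on another, so the source order carries no data dependency and the results must be identical. How a given compiler or CPU actually schedules them is not something this sketch can show.

```c
/* Hypothetical illustration: four mutually independent operations
   written in two different source orders. Because no operation reads
   another's result, an out-of-order core (or the compiler) may
   reorder them freely without changing the outputs. */

/* Adds first, then multiplies. */
void grouped(const double *a, const double *b, double *out)
{
    out[0] = a[0] + b[0];   /* add */
    out[1] = a[1] + b[1];   /* add */
    out[2] = a[2] * b[2];   /* multiply */
    out[3] = a[3] * b[3];   /* multiply */
}

/* Same operations, hand-interleaved add/multiply/add/multiply. */
void interleaved(const double *a, const double *b, double *out)
{
    out[0] = a[0] + b[0];   /* add */
    out[2] = a[2] * b[2];   /* multiply */
    out[1] = a[1] + b[1];   /* add */
    out[3] = a[3] * b[3];   /* multiply */
}
```

Calling both on the same inputs should fill the two output arrays with the same four values; any speed difference between the two orderings would come from scheduling alone.
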
As I've found the available Athlon documentation (the technical brief
and the code optimization guide from the AMD website) to be frustratingly
vague about things like the register set architecture and the functional
units, can anyone answer the following for me?
1a,b,c) How many floating-point registers does the Athlon have? Are these
all 80 bits? Are they accessed via the same kind of stack-based model as
the Pentium?
2a,b,c) I believe the Athlon has two floating-point adders in addition to a
floating-point multiplier. Can it dispatch 2 FADDs and 1 FMUL per cycle? Can
it do 2 double-precision FADDs per cycle, or just do single-precision adds in parallel?
(The former would help with the higher-radix FFTs in an LL code, since these
have more adds than multiplies, but the latter, while nice for multimedia
applications, would be useless for speeding LL testing.)
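
As a rough illustration of the add/multiply imbalance, here is a textbook radix-4 butterfly in C. This is an assumption-laden sketch, not the butterfly from any particular LL program: the type cplx and the functions cmul, cadd, csub, mul_neg_i and radix4 are names invented here, and it assumes complex data with three general twiddle factors. Counting the operations in the comments gives 12 real multiplies against 22 real adds.

```c
typedef struct { double re, im; } cplx;

/* Complex multiply: 4 real multiplies, 2 real adds. */
static cplx cmul(cplx a, cplx b)
{
    cplx r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}

/* Complex add and subtract: 2 real adds each. */
static cplx cadd(cplx a, cplx b) { cplx r = { a.re + b.re, a.im + b.im }; return r; }
static cplx csub(cplx a, cplx b) { cplx r = { a.re - b.re, a.im - b.im }; return r; }

/* Multiply by -i: swap the parts and flip one sign; no multiply, no add. */
static cplx mul_neg_i(cplx a) { cplx r = { a.im, -a.re }; return r; }

/* Radix-4 butterfly, forward convention exp(-2*pi*i*j*k/4).
   w[0..2] are the twiddles applied to x[1..3].
   Cost: 3 cmul (12 mul + 6 add) plus 8 cadd/csub (16 add)
         = 12 real multiplies, 22 real adds. */
void radix4(cplx x[4], const cplx w[3])
{
    cplx b  = cmul(x[1], w[0]);
    cplx c  = cmul(x[2], w[1]);
    cplx d  = cmul(x[3], w[2]);
    cplx t0 = cadd(x[0], c);
    cplx t1 = csub(x[0], c);
    cplx t2 = cadd(b, d);
    cplx t3 = mul_neg_i(csub(b, d));
    x[0] = cadd(t0, t2);
    x[1] = cadd(t1, t3);
    x[2] = csub(t0, t2);
    x[3] = csub(t1, t3);
}
```

By the same counting, a radix-2 butterfly with one general twiddle costs 4 multiplies and 6 adds, so the add-to-multiply ratio rises from 1.5 at radix 2 to roughly 1.8 at radix 4 under these assumptions.
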
Thanks,
-Ernst