Mersenne: How fast can we square? (Was: K7 vs. x86)

EWMAYER Sun, 22 Aug 1999 20:56:05 -0700
What I neglected to state explicitly in my last posting is that the fact that
there is a lower limit to the adds in an FFT also sets a limit on the 
achievable
performance. Let's consider a Mersenne-mod DWT-based squaring with a (real) 
vector
length of 2^20, or just over a million, performed using 64-bit floats on a CPU
which can execute one floating add per clock cycle (i.e. most processors).

When one includes the FFT twiddle multiplies, a typical number is that we need
3 adds per float per power of 2 in the vector length, i.e. 30 per float to 
effect
a length-2^20 transform. Going to higher radices and highly optimized 
transforms
shaves a little bit off this figure, but mainly has the effect of reducing the
number of floating multiplies and allowing one to do lots of operations while
data are in registers, i.e. of the processor well fed. We thus estimate 60
FADDs per datum for a forward FFT, and an equal number for the inverse FFT.
Throw in some adds for the pointwise square and the carry step, and 135-150
adds per float per iteration sounds reasonable. We thus need to do around
150 million adds per iteration. On a 450 MHz CPU, even if one does the 
theoretical
maximum of one add per clock, each iteration will need a third of second. That
math applies to the IA-64, so we can expect good performance from the code 
George
is presumably writing for that CPU, but not miracles - George tells me his 
Prime95
code needs about 2/3 second to do a length-2^20 squaring on a 450MHz PII, so 
we
can expect at most a factor of 2 speedup on a similar-speed Merced.

Looking at my Mlucas code, on a 500MHz Alpha 21264 (also at most one add per
clock) it needs 0.4 seconds to do the above squaring, which is not much longer
than the estimated minimum time of 0.3 seconds. This tells me that a huge ASM
coding effort targeted at the 21264 is not justified, since one can't hope to
squeeze much more speed out of the code that way. On the other hand, the code
runs 3x as slow on an equal-speed 21164, so trying to boost the performance
on that architecture by improving the source code and doing some assembly 
coding
should be worth some effort.

-Ernst

_________________________________________________________________
Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers
Mersenne: How fast can we square? (Was: K7 vs. x86)

Reply via email to