What I neglected to state explicitly in my last posting is that the fact that there is a lower limit to the adds in an FFT also sets a limit on the achievable performance. Let's consider a Mersenne-mod DWT-based squaring with a (real) vector length of 2^20, or just over a million, performed using 64-bit floats on a CPU which can execute one floating add per clock cycle (i.e. most processors). When one includes the FFT twiddle multiplies, a typical number is that we need 3 adds per float per power of 2 in the vector length, i.e. 30 per float to effect a length-2^20 transform. Going to higher radices and highly optimized transforms shaves a little bit off this figure, but mainly has the effect of reducing the number of floating multiplies and allowing one to do lots of operations while data are in registers, i.e. of the processor well fed. We thus estimate 60 FADDs per datum for a forward FFT, and an equal number for the inverse FFT. Throw in some adds for the pointwise square and the carry step, and 135-150 adds per float per iteration sounds reasonable. We thus need to do around 150 million adds per iteration. On a 450 MHz CPU, even if one does the theoretical maximum of one add per clock, each iteration will need a third of second. That math applies to the IA-64, so we can expect good performance from the code George is presumably writing for that CPU, but not miracles - George tells me his Prime95 code needs about 2/3 second to do a length-2^20 squaring on a 450MHz PII, so we can expect at most a factor of 2 speedup on a similar-speed Merced. Looking at my Mlucas code, on a 500MHz Alpha 21264 (also at most one add per clock) it needs 0.4 seconds to do the above squaring, which is not much longer than the estimated minimum time of 0.3 seconds. This tells me that a huge ASM coding effort targeted at the 21264 is not justified, since one can't hope to squeeze much more speed out of the code that way. On the other hand, the code runs 3x as slow on an equal-speed 21164, so trying to boost the performance on that architecture by improving the source code and doing some assembly coding should be worth some effort. -Ernst _________________________________________________________________ Unsubscribe & list info -- http://www.scruz.net/~luke/signup.htm Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers