Remember also that an integer FFT requires all results to be reduced modulo
some prime number. For general primes this modular multiplication requires
at least four 64x64->64 multiplies, and these typically cannot pipeline
at all (if you have a pipelined multiplier, you've got to overlap several
independent modmul operations; if you don't have a pipelined multiplier,
you'll get dismal performance no matter what you do). There are special primes for which modular reductions are cheap; using one of these
primes and a Fast Galois Transform, the fastest integer FFT squaring I've managed is ~15% slower than Mlucas on an opteron.

the am64 PMULUDQ (whatever it was, I closed those PDFs hours ago) claims to do two 32x32->64 in one instruction (using 128 bit XMR registers of which there are 16), has a latency of 4 clocks, and can execute a pair of multiplies every two clocks. the XMR adders and other stuff can run in parallel. Did your integer FFT use the am64's SSE 128bit extensions?




_______________________________________________
Prime mailing list
[email protected]
http://hogranch.com/mailman/listinfo/prime

Reply via email to