In the case of an FFT we have to provide enough bits per word to safely hold the squares of the values in frequency space without rounding errors during the inverse transform. During the forward transform, however, we don't need that many bits, which opens the possibility of using a floating-point format with a smaller mantissa. I assume that, unfortunately, the 24-bit mantissa of single-precision numbers is just a few bits too short to fully utilize the double-precision 53-bit mantissa on the way back.
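To put a rough number on that, here is a crude back-of-the-envelope sketch in C (this is not a real rounding-error analysis; the 4-bit safety margin is just a guess for illustration): the largest convolution output grows like N*(2^b)^2, so roughly 2*b + log2(N) plus some margin has to fit into the 53-bit mantissa.

#include <stdio.h>

/* Crude estimate only: largest convolution output is about N * (2^b)^2,
 * so require roughly 2*b + log2(N) + margin <= 53 (double mantissa). */
int main(void)
{
    const double mantissa_bits = 53.0;
    const double margin = 4.0;          /* hypothetical safety margin */
    for (int log2n = 16; log2n <= 24; log2n++) {
        double b = (mantissa_bits - margin - log2n) / 2.0;
        printf("N = 2^%d  ->  about %.1f bits per word\n", log2n, b);
    }
    return 0;
}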
However, there may still be a gain from using SP for the forward and DP for the inverse transform, since SIMD instructions give roughly double the throughput for single-precision calculations on some architectures. The memory traffic wouldn't decrease much, if at all, because IMO it would be better to keep the 32-bit values at nearly the same locations as the later doubles, to avoid the additional TLB/cache-line issues that could appear with an out-of-place transform style.
A further enhancement would be to group the SP values of two cache lines together into one to reduce memory traffic. The remaining cache line's worth of space would later be used for the doubles, and packing the SP values together is also necessary to load them efficiently with SIMD instructions.
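Here is a rough C sketch of that layout idea; the names and the 16-value block size are just made up for illustration, and a real implementation would of course widen with SIMD loads/stores rather than a scalar loop:

#include <stdio.h>

/* Two 64-byte cache lines hold 16 doubles for the inverse transform;
 * during the forward transform the same 16 values fit as packed floats
 * in the first cache line, so only half the lines are touched. */
#define LINE_PAIR_VALUES 16

typedef union {
    float  f[LINE_PAIR_VALUES];   /* packed SP values, first cache line only */
    double d[LINE_PAIR_VALUES];   /* DP values filling both cache lines      */
} line_pair;                      /* 128 bytes total */

/* Widen the packed floats to doubles in place; walking backwards means
 * each write only clobbers float slots that have already been read. */
static void widen_in_place(line_pair *p)
{
    for (int i = LINE_PAIR_VALUES - 1; i >= 0; i--)
        p->d[i] = (double)p->f[i];
}

int main(void)
{
    line_pair p;
    for (int i = 0; i < LINE_PAIR_VALUES; i++)
        p.f[i] = (float)i * 0.5f;
    widen_in_place(&p);
    printf("d[3] = %f, d[15] = %f\n", p.d[3], p.d[15]);
    return 0;
}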
Another story is whether this scheme could be applied to NTTs by using primes of different sizes. Unfortunately, the values we get from a forward transform are the result of repeatedly multiplying by powers of roots and adding, always modulo the chosen prime. So I imagine it will be difficult or even impossible to change these values in a way that allows a different prime (with more bits) to be used for the inverse transform. Any ideas?
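To illustrate why I suspect the information is simply gone, here is a tiny example with made-up primes: once a value has been reduced mod p, two integers that differ by a multiple of p are indistinguishable, so there is nothing left to tell us what the result should become modulo a larger prime q.

#include <stdio.h>

int main(void)
{
    const unsigned p = 97;        /* hypothetical small forward-transform prime  */
    const unsigned q = 193;       /* hypothetical larger inverse-transform prime */
    unsigned a = 40, b = 40 + p;  /* congruent mod p, but different as integers  */

    printf("a mod p = %u, b mod p = %u\n", a % p, b % p);  /* identical */
    printf("a mod q = %u, b mod q = %u\n", a % q, b % q);  /* different */
    return 0;
}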