At 07:41 PM 3/17/02 +0100, you wrote:
>Usually, we do several NTT modulo different primes and then use the Chinese
>Remainder Theorem to get t[0] ;...; t[i] ;...; t[l-1] modulo p, because the
>terms in the central loop of the transform can exceed the largest integer
>that fits in a word machine (for instance, 2^32-1) . That's one of the
>reasons why NTT are slower than FTT.
>
>Why don't we compute these terms modulo p at each step of the central loop ?
>Consequently, it would be useless to do several NTT and the Chinese
>Remainder Theorem would not be required.
If you can do computations modulo p directly there's no reason to use
the Chinese remainder theorem. The down side is that if you can break p
up into L primes, arithmetic mod p is much more than L times as expensive
as arithmetic mod the smaller primes, at least for most p. Using the CRT
means a mess when the convolution has finished but not in any intermediate
step; that's the big advantage.
Also, when talking strictly about Mersenne transforms, your prime p must
have a large power-of-two root of 2, and so must all of your L primes. I
don't think any of the 32-bit primes that are suitable for an NTT have this
property.
If you want to explore your idea further, here are some primes that may
be useful. All are of the form i*2^e+1, and have the advantage that roots
of unity out to some small order are "simple":
i exponent FFT radix DWT size root
-----------------------------------------------------------------
007d43548abc4361 320 8 2^17 67
04235deab8321000 512 8 2^22 19
0051656f7fe405c1 2048 8 2^19 29
07ab47a650572a01 2112 8 2^16 29
06954fe21e3e8100 384 32 N/A 37
290d741 1792 32 N/A 5
>Even if I am wrong, the speed of NTT can be improved by using 64 bit
>integers ( shorter NTT length ) supported by both new processors and new C
>compilers, and by using Intel MMX instructions.
The only operations MMX can perform at full 64-bit precision are shifts.
Modular arithmetic using MMX is possible but extremely unwieldy; SSE would
work better, but nowhere near as well as SSE floating point.
My personal opinion is that NTTs will not be faster than FFTs until modular
multiplication has the same throughput as floating point multiplication.
Right now we're at least a factor of 3-5 away from that.
Hope this helps,
jasonp
_________________________________________________________________________
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ -- http://www.tasam.com/~lrwiman/FAQ-mers