At 07:41 PM 3/17/02 +0100, you wrote:

>Usually, we do several NTT modulo different primes and then use the Chinese 
>Remainder Theorem to get t[0] ;...; t[i] ;...; t[l-1] modulo p, because the 
>terms in the central loop of the transform can exceed the largest integer 
>that fits in a word machine (for instance, 2^32-1) . That's one of the 
>reasons why NTT are slower than FTT.
>
>Why don't we compute these terms modulo p at each step of the central loop ?
>Consequently, it would be useless to do several NTT and the Chinese 
>Remainder Theorem would not be required.

If you can do computations modulo p directly there's no reason to use
the Chinese remainder theorem. The down side is that if you can break p 
up into L primes, arithmetic mod p is much more than L times as expensive
as arithmetic mod the smaller primes, at least for most p. Using the CRT 
means a mess when the convolution has finished but not in any intermediate
step; that's the big advantage.

Also, when talking strictly about Mersenne transforms, your prime p must
have a large power-of-two root of 2, and so must all of your L primes. I
don't think any of the 32-bit primes that are suitable for an NTT have this
property.

If you want to explore your idea further, here are some primes that may
be useful. All are of the form i*2^e+1, and have the advantage that roots
of unity out to some small order are "simple":

       i            exponent   FFT radix    DWT size   root
-----------------------------------------------------------------
007d43548abc4361       320         8           2^17      67
04235deab8321000       512         8           2^22      19
0051656f7fe405c1      2048         8           2^19      29
07ab47a650572a01      2112         8           2^16      29

06954fe21e3e8100       384        32           N/A       37
         290d741      1792        32           N/A        5


>Even if I am wrong, the speed of NTT can be improved by using 64 bit 
>integers ( shorter NTT length ) supported by both new processors and new C 
>compilers, and by using Intel MMX instructions.

The only operations MMX can perform at full 64-bit precision are shifts.
Modular arithmetic using MMX is possible but extremely unwieldy; SSE would
work better, but nowhere near as well as SSE floating point.

My personal opinion is that NTTs will not be faster than FFTs until modular
multiplication has the same throughput as floating point multiplication. 
Right now we're at least a factor of 3-5 away from that.

Hope this helps,
jasonp

_________________________________________________________________________
Unsubscribe & list info -- http://www.ndatech.com/mersenne/signup.htm
Mersenne Prime FAQ      -- http://www.tasam.com/~lrwiman/FAQ-mers

Reply via email to