On Oct 22, 2006, at 6:32 PM, Bill Hart wrote:

> I am now absolutely certain MAGMA uses the FFT for multiplying
> polynomials over ZZ right down to degree 16 (when the bit length is
> 1000). This is a **much** lower cutoff than NTL uses, which is
> indicative of the fact that MAGMA's FFT is way better implemented.

I recall from looking at NTL's FFT code that it doesn't seem all that
cache-friendly. Each pass goes all the way across the array -- I would
have thought that at some point you want to start working entirely
within cache.
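
Roughly what I have in mind, as a sketch in plain C (not NTL's code --
the prime p, the root w and mulmod() are just placeholders): a
decimation-in-frequency transform that recurses on contiguous halves,
so once a subproblem fits in cache it stays resident, instead of every
pass streaming the full array again.

    #include <stdint.h>

    typedef uint64_t word;

    /* Placeholder modular multiplication; a real implementation would
       use precomputed inverses or assembler, not a 128-bit division. */
    static word mulmod(word a, word b, word p)
    {
        return (word)(((unsigned __int128) a * b) % p);
    }

    /* Recursive radix-2 decimation-in-frequency NTT over Z/pZ.
       a has length n = 2^k, w is a primitive n-th root of unity mod p,
       and p is assumed to fit in 63 bits so the sums below don't
       overflow.  Output comes out in bit-reversed order.  Each
       recursive call works on a contiguous block of length n/2, so
       small subproblems live entirely in cache. */
    static void ntt_recursive(word *a, long n, word w, word p)
    {
        if (n == 1) return;

        long h = n / 2;
        word wi = 1;
        for (long i = 0; i < h; i++)
        {
            word u = a[i], v = a[i + h];
            a[i]     = (u + v) % p;
            a[i + h] = mulmod((u + p - v) % p, wi, p);
            wi = mulmod(wi, w, p);
        }
        word w2 = mulmod(w, w, p);
        ntt_recursive(a,     h, w2, p);
        ntt_recursive(a + h, h, w2, p);
    }

The pass-based organisation touches all n words about log n times; the
recursive one only does that until the blocks drop below cache size.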

Shoup does use some assembler in the FFT (for picking up the high  
word of a word * word multiplication) which might be useful.
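
For what it's worth, on gcc one can usually get that high word without
assembler -- a sketch, assuming a 64-bit target where the compiler
provides unsigned __int128 (GMP's longlong.h umul_ppmm macro is the
other portable route):

    #include <stdint.h>

    /* High 64 bits of a 64x64-bit product.  gcc turns the __int128
       multiply into the single widening MUL that hand-written
       assembler would use. */
    static inline uint64_t mul_hi(uint64_t a, uint64_t b)
    {
        return (uint64_t)(((unsigned __int128) a * b) >> 64);
    }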

> I determined that MAGMA definitely uses FFT by doing a series of
> timings and plotting them on a graph. The characteristic FFT
> step-function timing graph was quite recognizable.
>
> They have definitely missed one major trick though, so if I can figure
> out how to get an implementation anywhere near as efficient as theirs,
> I will be able to beat it.

Something I have always wondered is whether it would be good to work
with an FFT of length 3^n 2^m. You need to choose your FFT primes more
carefully, and the FFT itself is more painful to implement, but you get
much finer-grained control over the length, and fewer discontinuities
in performance.
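
To make the length argument concrete, here is a toy length chooser (my
own illustration, nothing to do with MAGMA or NTL). With admissible
lengths 3^a * 2^b, the gap between consecutive sizes shrinks to a factor
of at most about 4/3, versus 2 for pure powers of two, so the
worst-case zero-padding drops from nearly 100% to roughly 33%:

    #include <stdint.h>

    /* Smallest transform length of the form 3^a * 2^b that is >= n.
       A toy search; overflow for absurdly large n is ignored. */
    static uint64_t best_length(uint64_t n)
    {
        uint64_t best = UINT64_MAX;
        for (uint64_t t = 1; t < 3 * n; t *= 3)   /* t = 3^a */
        {
            uint64_t len = t;                     /* len = 3^a * 2^b */
            while (len < n)
                len *= 2;
            if (len < best)
                best = len;
        }
        return best;
    }

For n = 1000 this picks 1024, but for n = 1100 it picks 1152 = 9 * 128
rather than jumping all the way to 2048.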

> Anyhow, I implemented the algorithm that Victor calls HomMul. I think
> he hardly uses it any more. I think I know why. It is as slow as a wet
> week. And there is very little that can be done to speed it up. As I
> suspected, it is a completely out-of-date and pretty much useless
> algorithm these days. It is completely dominated by the time taken to
> do modular reductions for large coefficients and the time taken to do
> Karatsuba for large degree and medium-sized coefficients. It just
> reduces to Karatsuba for sufficiently small coefficients.

The thing that really puzzles me though is that NTL is much faster  
than MAGMA when the degree is very large and the coefficients are  
relatively small. I know NTL uses HomMul in that case, so it can't be  
all that bad, at least in certain situations.

> If anyone sees any papers on extremely efficient implementations of
> Schoenhage-Strassen, please let me know. Implementing it seems to be
> fairly simple, but one has to believe that Victor's implementation
> doesn't use all the known tricks.

The only thing I'm aware of is the bit-operations I mentioned in an  
earlier email about his SSMul function.

David

