David Harvey wrote:

> One possibility I thought about was some kind of FFT working
> simultaneously in both axes (degree and coefficient). It might be the
> case that for these sizes, the "rectangle" isn't long enough or tall
> enough to do fast arithmetic in either direction independently, but
> taken as a whole it might be okay. For example, consider degree 256
> times 1024-bit coefficients again. Say the machine word is 32 bits. I
> think NTL chooses SSMul here. Then your input data is 8192 machine
> words. This is probably large enough to efficiently perform a 2-d FFT
> in one go. But if you use SS, then each coefficient multiplication is
> only 32 words long, and GMP uses basecase (or possibly one level of
> Karatsuba?) in this range.

I very much doubt a 2-D FFT would work, though the thought of trying
it had certainly occurred to me. Basically, any multiplication in Z
that you do is already highly optimized, because GMP does it. I don't
believe breaking those multiplications apart will help.

I do see what you are suggesting: don't let GMP sort out the
coefficient multiplications, but do them as part of a 2-D FFT. But I
don't think there is any hope of beating, in C, what GMP does in
assembly code, especially for such a complicated algorithm, so I doubt
that what is lost by working in C would be made up by the 2-D FFT. In
fact, I am sure enough of this that I wouldn't even try.
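
Just to put numbers on your example (a quick back-of-envelope sketch
in C; the sizes are the ones you quoted, nothing below comes from NTL
or GMP itself):

    #include <stdio.h>

    int main(void)
    {
        /* Sizes from the example above: 256 coefficients of 1024 bits
           each, on a machine with 32-bit words.                        */
        unsigned terms      = 256;
        unsigned coeff_bits = 1024;
        unsigned word_bits  = 32;

        unsigned words_per_coeff = coeff_bits / word_bits;     /* 32   */
        unsigned total_words     = terms * words_per_coeff;    /* 8192 */

        printf("words per coefficient: %u\n", words_per_coeff);
        printf("total input words:     %u\n", total_words);
        return 0;
    }

So the whole input is 8192 words, but each coefficient product is only
32 words by 32 words, which is exactly the range where GMP's
hand-written basecase (or one level of Karatsuba) is very hard to beat
from C.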

I am actually extremely skeptical that SSMul is the correct
multiplication to be using in the range you are speaking of
(~300-digit coefficients in ~300-term polynomials). I believe what NTL
calls HomMul (which is actually a generic term, as Dan Bernstein has
shown, and probably should be eschewed since it isn't very
descriptive) is the correct way to do things. This suggests that there
is something seriously wrong with the HomMul function in NTL. In fact,
I really can't see why Karatsuba, Toom-3, or one of their many
variants shouldn't be enough, if composed in a sufficiently potent
cocktail.
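
To give an idea of the sort of thing I have in mind, here is a rough
sketch of Karatsuba done at the polynomial level, with every
coefficient operation left to GMP. It is only an illustration, not
code I would actually commit: the function names and the crossover
value are made up, and the real crossover would have to be found by
timing.

    /* Sketch: Karatsuba for polynomials in Z[x], with the coefficient
       arithmetic handled entirely by GMP.  Compile with -lgmp.        */
    #include <stdio.h>
    #include <stdlib.h>
    #include <gmp.h>

    #define KARA_CUTOFF 8   /* made-up crossover; tune by timing */

    /* Schoolbook product: res[k] += sum_{i+j=k} a[i]*b[j].
       res must hold zeros (or a partial result) on entry.     */
    static void poly_mul_basecase(mpz_t *res, mpz_t *a, long la,
                                  mpz_t *b, long lb)
    {
        for (long i = 0; i < la; i++)
            for (long j = 0; j < lb; j++)
                mpz_addmul(res[i + j], a[i], b[j]);
    }

    static mpz_t *poly_alloc(long len)   /* zero-initialised coefficients */
    {
        mpz_t *p = malloc(len * sizeof(mpz_t));
        for (long i = 0; i < len; i++)
            mpz_init(p[i]);
        return p;
    }

    static void poly_free(mpz_t *p, long len)
    {
        for (long i = 0; i < len; i++)
            mpz_clear(p[i]);
        free(p);
    }

    /* Karatsuba for equal lengths n, falling back to schoolbook on
       small or odd sizes.  With h = n/2, a = a0 + a1*x^h, b = b0 + b1*x^h:
         a*b = a0*b0 + ((a0+a1)(b0+b1) - a0*b0 - a1*b1)*x^h + a1*b1*x^(2h).
       res has length 2n-1 and must hold zeros on entry.                  */
    static void poly_mul_kara(mpz_t *res, mpz_t *a, mpz_t *b, long n)
    {
        if (n <= KARA_CUTOFF || (n & 1)) {
            poly_mul_basecase(res, a, n, b, n);
            return;
        }
        long h = n / 2;
        mpz_t *sa = poly_alloc(h), *sb = poly_alloc(h);
        mpz_t *t0 = poly_alloc(2*h - 1);   /* a0*b0           */
        mpz_t *t1 = poly_alloc(2*h - 1);   /* a1*b1           */
        mpz_t *t2 = poly_alloc(2*h - 1);   /* (a0+a1)(b0+b1)  */

        for (long i = 0; i < h; i++) {
            mpz_add(sa[i], a[i], a[h + i]);
            mpz_add(sb[i], b[i], b[h + i]);
        }
        poly_mul_kara(t0, a,     b,     h);
        poly_mul_kara(t1, a + h, b + h, h);
        poly_mul_kara(t2, sa,    sb,    h);

        for (long i = 0; i < 2*h - 1; i++) {
            mpz_sub(t2[i], t2[i], t0[i]);            /* middle term */
            mpz_sub(t2[i], t2[i], t1[i]);
            mpz_add(res[i],       res[i],       t0[i]);
            mpz_add(res[h + i],   res[h + i],   t2[i]);
            mpz_add(res[2*h + i], res[2*h + i], t1[i]);
        }
        poly_free(sa, h);  poly_free(sb, h);
        poly_free(t0, 2*h - 1);  poly_free(t1, 2*h - 1);
        poly_free(t2, 2*h - 1);
    }

    int main(void)   /* tiny smoke test: (1 + x + ... + x^15)^2 */
    {
        long n = 16;
        mpz_t *a = poly_alloc(n), *c = poly_alloc(2*n - 1);
        for (long i = 0; i < n; i++)
            mpz_set_ui(a[i], 1);
        poly_mul_kara(c, a, a, n);
        gmp_printf("middle coefficient: %Zd\n", c[n - 1]);   /* 16 */
        poly_free(a, n);  poly_free(c, 2*n - 1);
        return 0;
    }

Toom-3 works the same way with five evaluation points instead of
three, and these can be stacked on top of one another; the point is
just that the coefficient arithmetic stays as whole GMP calls, which
is where all the optimized assembly lives.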

I will try implementing some polynomial multiplication routines over
the next week and see just how bad NTL's routines are. I don't expect
to beat NTL straight away: there are so many possible algorithms and
variants that it is going to take me quite some time to determine the
best ones to use. However, I should at least be able to get some idea
of what is really going on.

I realised that cache locality is not the issue. There is hardly
enough data to actually fill the cache, and certainly not enough to
cause a particularly bad problem, so I now believe the algorithm is at
fault after all. Some multiplication algorithms are particularly
cache-friendly, but that is not going to matter in the regime you are
talking about.
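
Rough figures to back that up (a back-of-envelope sketch; I am
assuming ~300 coefficients of ~300 decimal digits, i.e. about 1000
bits each, and ignoring the small amount of growth in the product
coefficients):

    #include <stdio.h>

    int main(void)
    {
        /* Assumed sizes: ~300-term polynomials with ~300-digit
           (roughly 1000-bit) coefficients.                      */
        double terms      = 300;
        double coeff_bits = 300 * 3.322;              /* ~1000 bits */

        double in_kb  = terms * coeff_bits / 8 / 1024;
        double out_kb = (2 * terms - 1) * (2 * coeff_bits) / 8 / 1024;

        printf("each input : ~%.0f KB\n", in_kb);        /* ~36 KB  */
        printf("product    : ~%.0f KB\n", out_kb);       /* ~146 KB */
        printf("total      : ~%.0f KB\n", 2 * in_kb + out_kb);
        return 0;
    }

That is roughly a quarter of a megabyte all told, which sits
comfortably inside a typical L2 cache, let alone main memory.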

>
> Of course this is a very woolly, imprecise idea, and I'm not sure
> exactly what I mean.
>
> I'm kind of surprised that NTL's SSMul is faster than HomMul for this
> case. It feels like HomMul should be faster, especially seeing as NTL
> does small-coefficient multiplications very quickly.

Not as quickly as GMP, and this may be part of the problem. I wouldn't
be surprised to find a factor of two here, and a factor of two or
three in the algorithm used. The MAGMA people probably have a nice
intermediate algorithm.

> I got the same result over multiple trials. I can't think why the
> drive would be thrashing every time I started it. It wasn't a memory-
> intensive program.

A memory leak is about the only thing I can think of, though I am
certain Victor is too careful for that. Still, strange things do
happen, and it would explain the symptoms.
 
> David

Bill.

