Hey everybody. I've realized after finishing an integer convolution
library for the Alpha (using number-theoretic FFTs) that even a nice
64-bit processor like the Alpha couldn't give me the speed I wanted for
integer-only Mersenne-mod squaring. I've also wanted for a while to build
integer-only convolution on x86 machines that was at least somewhat
comparable to Prime95 in speed.

The big problem with the code now is that integer multiplies still take
too long. A floating point FFT can not only rearrange its structure to
minimize the amount of arithmetic but can also take advantage of floating
point hardware that can finish a result (or even two) every clock
cycle. In contrast, 62-bit integer modular multiplication takes a minimum
of 6 clocks on an Alpha ev6 (using highly vectorized assembly code) and
with extreme contortions takes 40 clocks on a Pentium and 30 clocks on an
Athlon. There are also no simplifications you can make to a
number-theoretic FFT, because typically a 4th or 8th root of unity has
no special form that reduces the amount of arithmetic when you're working
modulo a 62-bit prime (I use 62 bits because carries are easier to handle
than with 64-bit primes).
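
To make that concrete, here is roughly what one radix-2 butterfly looks
like in C modulo a 62-bit prime. This is only a sketch (the names are
just for illustration, and it assumes a compiler with an unsigned
__int128 type); tuned code replaces the division with multiply-high
tricks, but the wide multiply is still the expensive part:

    #include <stdint.h>

    static uint64_t mulmod62(uint64_t a, uint64_t b, uint64_t p)
    {
        /* full 124-bit product, then reduce mod the 62-bit prime */
        return (uint64_t)(((unsigned __int128)a * b) % p);
    }

    /* one butterfly: (x, y) -> (x + w*y, x - w*y) mod p */
    static void butterfly(uint64_t *x, uint64_t *y, uint64_t w, uint64_t p)
    {
        uint64_t t = mulmod62(*y, w, p);
        uint64_t u = *x;

        *x = u + t;                        /* p < 2^62, so no overflow */
        if (*x >= p)
            *x -= p;
        *y = (u >= t) ? (u - t) : (u - t + p);
    }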

Choosing different kinds of primes won't help either. There are maybe 8
primes of 32-bit size that could be used for a large integer FFT, but none
of them allow a large root of 2 like a Mersenne-mod DWT requires
(never mind that you'd need two of them and CRT reconstruction). For
primes of the form i*2^32 + 1, with i between 2^30 and 2^31, there are 14
primes that allow a large root of 2 and hence allow a big DWT; however, I
can't find any primitive roots for these primes that allow "simple" roots
of unity, and so FFT speed is limited to the cycle counts mentioned
previously. As several people on this list are aware, 2^64-2^32+1 is
prime, allows for a big DWT, and has primitive roots that allow for simple
64th roots of unity. I've written a lot of x86 assembly code for small
FFTs in a finite field modulo this prime, and it seems that an FFT
butterfly takes 20-25 clocks on a Pentium (probably somewhat fewer on an
Athlon). While this is wonderful compared to general 64-bit arithmetic,
it's a far cry from the 10 clocks or so that a floating-point FFT
butterfly may need. Dealing with carries and borrows out of 64 bits also
slows it down on the Alpha, to the point where (on paper, at least) it's
just as fast to use one of those 14 other primes that do the same job.
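
For anyone wondering why 2^64-2^32+1 is so nice: since 2^64 = 2^32 - 1
and 2^96 = -1 modulo that prime, a 128-bit product reduces with shifts,
adds and a couple of conditional fixups instead of a division, and since
2 has order 192 mod p, 8 is a 64th root of unity, so the twiddle
multiplies in a length-64 transform reduce to shifts. Here is a rough
sketch of the reduction (again assuming an unsigned __int128 type; on
x86 all of this has to be pieced together from 32-bit halves):

    #include <stdint.h>

    /* reduce a 128-bit value x modulo p = 2^64 - 2^32 + 1 */
    static uint64_t reduce_p64(unsigned __int128 x)
    {
        const uint64_t p = 0xffffffff00000001ULL;
        uint64_t lo  = (uint64_t)x;              /* bits   0..63  */
        uint64_t hi  = (uint64_t)(x >> 64);
        uint64_t mid = hi & 0xffffffffULL;       /* bits  64..95  */
        uint64_t top = hi >> 32;                 /* bits  96..127 */
        uint64_t r, s;

        /* x = lo + mid*2^64 + top*2^96
             = lo + mid*(2^32 - 1) - top   (mod p) */
        r = lo - top;
        if (lo < top)
            r -= 0xffffffffULL;     /* the wrap added 2^64; we wanted +p  */
        s = r + mid * 0xffffffffULL;     /* mid*(2^32-1) fits in 64 bits  */
        if (s < r)
            s += 0xffffffffULL;     /* the wrap dropped 2^64 = 2^32-1 mod p */
        if (s >= p)
            s -= p;
        return s;
    }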

Nussbaumer convolution and Schonhage-Strassen squaring are other
candidates, and have very little arithmetic (Nussbaumer in particular can
be optimized to be really fast), but their memory efficiency is
horrible. Both require arithmetic on huge numbers in their initial stages,
and there's no way to load a block of data into cache and spin on it for
several FFT passes like you can do with a floating point FFT. There's also
much more movement of data with these algorithms, although with Nussbaumer
you can minimize that by making the movement "virtual". Prime95 gets most
of its speed from very careful data placement that minimizes memory
traffic, and these algorithms are not good substitutes.

Finally, there are fast Galois transforms that use complex arithmetic but
with integers. A large-radix FFT is possible here, and using Mersenne
primes makes that arithmetic easier; but each complex multiply has four of
those awful integer multiplies in it, and even with a split-radix
formulation the work only seems to be about 10%-15% less than a
conventional integer FFT, for much more complexity. That's on the Alpha,
which can handle complex integer multiplies enormously faster than x86
machines can. Performance comparable to Prime95's would require at least
a further factor of two on top of that.
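
To show the shape of that complex multiply, here is a sketch in GF(p^2)
with the Mersenne prime p = 2^61-1 as an example, writing elements as
a + b*i with i^2 = -1 (legal because p = 3 mod 4, and since p+1 = 2^61
there are plenty of power-of-2 roots of unity in GF(p^2)). The names and
the choice of p are just for illustration, and it assumes an unsigned
__int128 type; the point is that each reduction is a shift and an add,
but the four wide multiplies are still there:

    #include <stdint.h>

    static uint64_t mulmod_m61(uint64_t a, uint64_t b)
    {
        const uint64_t p = (1ULL << 61) - 1;
        unsigned __int128 t = (unsigned __int128)a * b;
        /* fold the high bits down: 2^61 = 1 (mod p) */
        uint64_t r = ((uint64_t)t & p) + (uint64_t)(t >> 61);
        return (r >= p) ? r - p : r;
    }

    static uint64_t addmod_m61(uint64_t a, uint64_t b)
    {
        const uint64_t p = (1ULL << 61) - 1;
        uint64_t r = a + b;
        return (r >= p) ? r - p : r;
    }

    static uint64_t submod_m61(uint64_t a, uint64_t b)
    {
        const uint64_t p = (1ULL << 61) - 1;
        return (a >= b) ? a - b : a - b + p;
    }

    /* (ar + ai*i) * (br + bi*i) = (ar*br - ai*bi) + (ar*bi + ai*br)*i */
    static void cmul_m61(uint64_t ar, uint64_t ai, uint64_t br, uint64_t bi,
                         uint64_t *cr, uint64_t *ci)
    {
        *cr = submod_m61(mulmod_m61(ar, br), mulmod_m61(ai, bi));
        *ci = addmod_m61(mulmod_m61(ar, bi), mulmod_m61(ai, br));
    }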

I first discovered GIMPS around the middle of 1996, and I guess I'm
spouting off like this because years of sweating over all these methods
with no breakthrough have made me bitter. I want to contribute more than just
one dinky computer to this project, because computational number theory
and code optimization both fascinate me.

There may be a way. Nobody seems to have considered using very large
primes for an integer FFT (by large I mean 512 bits or more). Using huge
primes like this means less memory consumption (the run lengths are
smaller), more flexible FFT sizes (since the number of bits you pack into
an array element can vary over a wide range, you can use power-of-2 run
lengths and still get the effect of a non-power-of-2 FFT size), and a cost
of arithmetic that is amortized over many machine words.

As with small primes, the chief obstacle is to find a way to do
relatively big multiprecision multiplies at very high speed. A fancy
multiply algorithm wouldn't give the speedup I want, so instead I'm
looking for large primes that have simple roots of unity. In particular,
I'm after numbers of the form i*2^512+1, with i the size of one machine
word, that

1) have i > 2^24 (allows 512 bits per convolution element),
2) have a large root of two (2^13 or more), and
3) have some power-of-two primitive root of simple form.

For example, when i = 0x1326301 there are many primitive roots that allow
a 4th root of unity with only 4 bits in it. That permits a radix-4
integer FFT, which is more efficient than the radix-2 variety one is
ordinarily stuck with. Unfortunately, this prime doesn't have a large root
of 2.
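
To make the search conditions concrete, the per-candidate test looks
roughly like this. This is a bare-bones GMP sketch, not the actual
search program: it only covers conditions 1 and 2 (a "simple" primitive
root still has to be spotted separately), and the loop bound assumes a
64-bit unsigned long.

    #include <stdio.h>
    #include <gmp.h>

    /* p = i*2^512 + 1: check that p is (probably) prime and that 2 has
       a 2^k-th root mod p, i.e. 2^((p-1)/2^k) == 1 (mod p). */
    static int candidate_ok(unsigned long i, unsigned long k)
    {
        mpz_t p, e, r;
        int ok = 0;

        mpz_init(p); mpz_init(e); mpz_init(r);
        mpz_set_ui(p, i);
        mpz_mul_2exp(p, p, 512);            /* p = i*2^512     */
        mpz_add_ui(p, p, 1);                /* p = i*2^512 + 1 */

        if (mpz_probab_prime_p(p, 10) > 0) {
            mpz_sub_ui(e, p, 1);
            mpz_tdiv_q_2exp(e, e, k);       /* e = (p-1)/2^k   */
            mpz_set_ui(r, 2);
            mpz_powm(r, r, e, p);           /* r = 2^e mod p   */
            ok = (mpz_cmp_ui(r, 1) == 0);
        }
        mpz_clear(p); mpz_clear(e); mpz_clear(r);
        return ok;
    }

    int main(void)
    {
        unsigned long i;

        for (i = (1UL << 24) + 1; i < (1UL << 32); i++)
            if (candidate_ok(i, 13))
                printf("i = %lu looks promising\n", i);
        return 0;
    }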

I want to find more of these numbers, but it's pretty slow going. The
Alpha here will take weeks to search through all 32-bit values of i,
and if a suitable one isn't found (likely) I'll have to try i*2^32. Is
there anyone here who can lend computer power to the search? I have a
program that uses GMP and is optimized to the hilt; I have also ported
a subset of GMP 3.0 to Windows and can provide a DOS executable, as well
as source, if anyone is interested.

Any help with the theory behind all this or offers of computational
firepower would be greatly appreciated.

jasonp
