2. FFT's for nitwits (like me)
====================

I will explain why I am a nitwit in a later section. But to understand
that, you need to understand some FFT basics first. So I describe them
here.

2a. Description of the FFT
===================

An FFT is not an integer multiplication routine per se, but it can be
used to construct one.

The first step in multiplying huge integers A and B is to view the
problem as a polynomial multiplication. Write your large integers in
some base D (usually D = 2^r for some r), i.e. write A = f1(D), B =
f2(D) for polynomials f1 and f2 whose coefficients are in the range
[0, D).

Next, multiply the polynomials to obtain g(x) = f1(x)*f2(x).

Finally, evaluate the product polynomial at D, i.e. C = A*B = f1(D)*f2(D).

So from now on I'll talk about large integer multiplication as though
it is polynomial multiplication.
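
To make the split-multiply-evaluate picture concrete, here is a toy C
sketch (my own illustration, not the code from new_fft.c - all names
are made up). It multiplies two integers given as arrays of base-D
digits, D = 2^16, by schoolbook polynomial multiplication, then
evaluates the product polynomial at D by propagating carries:

#include <stdint.h>
#include <string.h>

#define D_BITS 16                     /* D = 2^16: digit products fit in 64 bits */
#define D_MASK ((1u << D_BITS) - 1)

/* c receives the la + lb base-D digits of A*B */
void mul_via_polys(uint32_t *c, const uint32_t *a, int la,
                   const uint32_t *b, int lb)
{
    uint64_t g[la + lb];              /* coefficients of g(x) = f1(x)*f2(x) */
    memset(g, 0, sizeof(g));

    for (int i = 0; i < la; i++)      /* schoolbook polynomial multiplication */
        for (int j = 0; j < lb; j++)
            g[i + j] += (uint64_t) a[i] * b[j];

    uint64_t carry = 0;               /* evaluate g at x = D, i.e. release carries */
    for (int i = 0; i < la + lb; i++)
    {
        carry += g[i];
        c[i] = carry & D_MASK;        /* digit back in [0, D) */
        carry >>= D_BITS;
    }
}

An FFT-based multiply replaces the double loop with a convolution; the
splitting and the final carry propagation stay the same.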

Polynomial multiplication is computed using convolution of the two
polynomials. (Cyclic) convolution of polynomials is defined (up to
scaling) as multiplication of the polynomials modulo x^n - 1 for some
n. Note that if the degree of g is less than n then g(x) can be
recovered precisely from the convolution.

The FFT can be used to efficiently compute the convolution.
Specifically convolution of two polynomials f1 and f2 can be computed
as FFT^(-1)(FFT(f1).FFT(f2)) where the dot is the coordinate-wise
product of vectors. The inverse of an FFT can be computed (up to
scaling) by an IFFT. Thus for convolution modulo x^n - 1 where n =
2^k, the convolution can be computed as 2^(-k)*IFFT(FFT(f1).FFT(f2)),
where all the operations are on vectors of coefficients. In particular
the coordinate-wise product multiplies the transformed coefficients in
pairs. These products are known as pointwise multiplications.
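
As a reference point, here is the naive O(n^2) cyclic convolution
which the FFT computes quickly (a sketch of mine, ignoring coefficient
overflow - useful for testing an FFT version against):

#include <stdint.h>

/* c = a*b modulo x^n - 1, computed directly: exponents wrap mod n */
void cyclic_convolve(uint64_t *c, const uint64_t *a,
                     const uint64_t *b, int n)
{
    for (int i = 0; i < n; i++) c[i] = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            c[(i + j) % n] += a[i] * b[j];
}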

You should view the FFT as evaluating the polynomial at n = 2^k
different roots of unity, 1, z, z^2, ..., z^{n-1}, where z is a
primitive 2^k-th root of unity. The pointwise products multiply such
values, effectively giving you the product polynomial g (modulo x^n -
1) evaluated at each of these roots of unity. The coefficients of the
polynomial g are retrieved by inverting the FFT process, i.e. by
interpolating the vector of pointwise products.

So how does the FFT actually work?

We start with a polynomial modulo x^n - 1 where n = 2^k. To save on
notation, let's write x^(2m) - 1 where m = 2^(k-1).

We write x^(2m) - 1 = (x^m - 1)(x^m + 1). If we wish to multiply
polynomials modulo x^(2m) - 1, our first step is to express our
polynomials modulo x^m - 1 and x^m + 1. We then compute the products
mod x^m - 1 and mod x^m + 1 and recombine using CRT.

Let f(x) = b(x)*x^m + a(x), where the degrees of a(x) and b(x) are
less than m. Then f(x) modulo x^m - 1 is just a(x) + b(x), i.e. we
simply add coefficients in pairs. Similarly, modulo x^m + 1 the
polynomial becomes a(x) - b(x).

The product of the polynomials modulo x^m - 1 can then be computed
recursively using the same algorithm, i.e. by a convolution modulo
x^m - 1.

To compute the product modulo x^m + 1 we make a substitution y = z*x
where z is a primitive 2^k-th root of unity. This transforms the
problem into one modulo y^m - 1, which can again be done recursively
using the same algorithm. Note that this substitution is equivalent to
multiplying the coefficient of x^i in f(x) by a power z^i of the root
of unity. So we twiddle the coefficients by roots of unity.

A convolution mod x^m - 1 is called a cyclic convolution, and a
convolution mod x^m + 1 is called a negacyclic convolution. So the FFT
algorithm works by breaking the cyclic convolution into a cyclic
convolution of half the size and a negacyclic convolution of half the
size and transforming the negacyclic convolution into a cyclic
convolution by twiddling by roots of unity.

In practice the evaluation step, adding or subtracting coefficients in
pairs, and the twiddle step for the negacyclic half, are combined.
This leads to the familiar FFT "butterfly" which replaces the left
half of a vector of coefficients of f(x), [a0, a1, a2, ..., a{m-1},
b0, b1, ..., b{m-1}], with [a0+b0, a1+b1, ..., a{m-1}+b{m-1}] and the
right half with [a0-b0, z*(a1-b1), z^2*(a2-b2), ...,
z^{m-1}*(a{m-1}-b{m-1})]. The butterfly step can be conveniently
written [a{i}, b{i}] => [a{i}+b{i}, z^i*(a{i}-b{i})].

One then recursively computes the FFT butterfly of the left and right
halves of the vector. Eventually one gets down to vectors of length 1
which represent polynomials modulo x - 1 (or really x - z^i for i =
0, ..., 2m-1). Note that taking a polynomial f(x) modulo x - z^i is
the same thing as evaluating it at z^i. So what the FFT really
computes is [f(1), f(z), f(z^2), ..., f(z^{2m-1})] (well, perhaps not
in that order, but in some order - in fact bit reversed order, i.e.
z^i is swapped with z^j where i is the reverse of j in binary, when i
and j are considered as binary numbers of log_2(2m) bits).

This left-right FFT recursion scheme is known as a decimation in
frequency (DIF) FFT.
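
Here is a minimal recursive radix 2 DIF FFT over the complex numbers
implementing exactly the butterfly above. This is a sketch of mine for
exposition only (the released code works over a Fermat ring, not over
C; M_PI assumes a POSIX math.h, and you'd compile with -lm). The
output comes out in bit reversed order:

#include <complex.h>
#include <math.h>

/* In-place DIF FFT of v, length n a power of 2.
   Butterfly: [a{i}, b{i}] => [a{i}+b{i}, z^i*(a{i}-b{i})]. */
void fft_dif(double complex *v, long n)
{
    if (n == 1) return;               /* f mod x - z^i: a single value */

    long m = n / 2;
    double complex z = cexp(-2.0 * M_PI * I / n);  /* primitive n-th root */
    double complex t = 1.0;                        /* t = z^i */

    for (long i = 0; i < m; i++)
    {
        double complex a = v[i], b = v[i + m];
        v[i]     = a + b;             /* f mod x^m - 1 */
        v[i + m] = t * (a - b);       /* f mod x^m + 1, twiddled to y^m - 1 */
        t *= z;
    }

    fft_dif(v, m);                    /* cyclic half */
    fft_dif(v + m, m);                /* twiddled negacyclic half */
}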

Note that multiplication of polynomials modulo x-1 is simply integer
multiplication. These are the pointwise multiplications we spoke of
earlier. We simply multiply our vectors of coefficients pointwise.

Now suppose I have a polynomial known modulo x^m - 1 and x^m + 1, how
can I invert this process and obtain it modulo x^(2m) - 1?
Specifically, if I have c{i} = a{i}+b{i} and d{i} = z^i*(a{i}-b{i}),
how can I retrieve a{i} and b{i}? Clearly the answer is to multiply
d{i} by z^(-i); adding the result to c{i} yields 2*a{i} and
subtracting it from c{i} yields 2*b{i}. Notice the extra factor of 2
which has crept in.

So the IFFT butterfly step replaces [c0, d0, c1, d1, ..., c{m-1},
d{m-1}] with [c0+d0, c0-d0, c1+z^(-1)*d1, c1-z^(-1)*d1, ...,
c{m-1}+z^(-(m-1))*d{m-1}, c{m-1}-z^(-(m-1))*d{m-1}]. Or we can write
the butterfly conveniently as [c{i}, d{i}] => [c{i}+z^(-i)*d{i},
c{i}-z^(-i)*d{i}].

Note we have paired the odd indexed entries in our vector with the
even indexed entries. An FFT which has the butterfly step [c{i}, d{i}]
=> [c{i}+z^i*d{i}, c{i}-z^i*d{i}] (note the change in sign of the
exponents on the twiddle factors to convert from an IFFT to an FFT) is
called a decimation in time (DIT) FFT.

So our IFFT is simply a DIT FFT with the sign of the exponent of the
twiddle factors changed.

Now, an important note for later is that the reason this all worked
is that we can invert the FFT process. Specifically c{i}+z^(-i)*d{i} =
(a{i}+b{i}) + (a{i}-b{i}) = 2*a{i}, and similarly we can retrieve
2*b{i}.

One final note. These factors of 2 that creep in can be removed at any
stage. They are sometimes removed at each butterfly step, sometimes
all at once at the end of the IFFT.
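
Putting it together, here is a matching DIT IFFT, written so that it
exactly undoes fft_dif from the sketch above (again my illustration
only, same includes as before), plus a driver which strips all the
accumulated factors of 2 in one go at the end by dividing by n:

/* In-place DIT IFFT, inverting fft_dif up to a factor of n.
   Butterfly: [c{i}, d{i}] => [c{i}+z^(-i)*d{i}, c{i}-z^(-i)*d{i}]. */
void ifft_dit(double complex *v, long n)
{
    if (n == 1) return;

    long m = n / 2;
    ifft_dit(v, m);                   /* undo the recursive halves first */
    ifft_dit(v + m, m);

    double complex zinv = cexp(2.0 * M_PI * I / n);  /* z^(-1) */
    double complex t = 1.0;                          /* t = z^(-i) */

    for (long i = 0; i < m; i++)
    {
        double complex c = v[i], d = t * v[i + m];   /* untwiddle d{i} */
        v[i]     = c + d;             /* 2*a{i} */
        v[i + m] = c - d;             /* 2*b{i} */
        t *= zinv;
    }
}

/* Cyclic convolution mod x^n - 1: transform, multiply pointwise, invert */
void convolve(double complex *a, double complex *b, long n)
{
    fft_dif(a, n);
    fft_dif(b, n);
    for (long i = 0; i < n; i++) a[i] *= b[i];  /* pointwise multiplications */
    ifft_dit(a, n);
    for (long i = 0; i < n; i++) a[i] /= n;     /* strip the factors of 2 */
}

The pointwise products line up because both forward transforms produce
the same (bit reversed) ordering.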

2b. FFT algorithms
==============

There are many different ways of implementing the FFT algorithm in
practice, and a number of important additional algorithms used in
multiplying integers using an FFT. I will describe them here.

a) Complex FFT. One can use complex numbers and floating point
arithmetic to multiply integers using the FFT. One has to be careful
that at each step not too much precision is being lost. For example,
one could work with double precision floating point numbers.

Naturally it is inefficient to use complex numbers, as the polynomials
we are trying to multiply have only real coefficients. There are some
strategies for getting around this. One can exploit the symmetries of
the FFT to get FFT's which work on real values and take roughly half
the time/space. For example the Fast Hartley Transform works with only
real values, and there are various other variations on this theme.

Note that the roots of unity here are complex values, or for the Fast
Hartley Transform, real values which are linear combinations of sines
and cosines. So computing the twiddle factors can be expensive.

The biggest issue with real/complex FFT's is the loss of precision.
They are really fast, as they can use the native floating point
arithmetic of the processor, but they usually don't guarantee a
correct result; those that do are either not so fast or don't support
very long transforms.

b) Schonhage-Strassen FFTs. Schonhage and Strassen suggested working
in the ring Z/pZ where p is of the form 2^m + 1. Here 2 is a 2m-th
(2^k-th) root of unity: raising 2 to the power m gives -1 in this
ring, thus 2^(2m) must be 1.

Note that multiplication by a root of unity then becomes
multiplication by a power of 2 (modulo 2^m + 1), or a bitshift on a
binary computer. Also working in this ring is convenient, as reduction
modulo p can be done with a subtraction. In practice we can work in a
ring Z/pZ where p = 2^(a*m)+1 for some value a. Here 2^a is a 2m-th
(2^k-th) root of unity. A ring of this form is called a Fermat ring
and p a generalised Fermat number.

Usually p is chosen to be a large number that takes up numerous
machine limbs. For performance reasons a and m are usually chosen so
that reduction modulo p does not involve any bitshifts, i.e. so that
a*m is a multiple of the limb size in bits.
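
As a toy single-word illustration of why this is convenient (my own
sketch - a real implementation works with multi-limb residues), here
is arithmetic modulo p = 2^32 + 1 in a uint64_t; reduction is a
subtraction, and multiplying by a root of unity 2^s is a shift:

#include <stdint.h>

#define P ((1ULL << 32) + 1)          /* toy Fermat ring: p = 2^32 + 1 */

/* reduce x < 2^64 mod p using 2^32 = -1 (mod p): x = hi*2^32 + lo = lo - hi */
static uint64_t fermat_reduce(uint64_t x)
{
    uint64_t lo = x & 0xffffffffULL, hi = x >> 32;
    return lo >= hi ? lo - hi : lo + P - hi;
}

/* multiply a residue x <= 2^32 by the root of unity 2^s, 0 <= s < 32 */
static uint64_t mul_2exp(uint64_t x, unsigned s)
{
    return fermat_reduce(x << s);     /* shift, then a single subtraction */
}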

c) Mersenne Ring FFT. Work in a ring modulo p = 2^(b*m) - 1, again
for m a power of 2 and b arbitrary. Here 2^b is an m-th root of unity,
and reduction modulo p is an addition. Numbers p of this form are
called generalised Mersenne numbers.
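
The corresponding toy for a Mersenne ring (again a single-word sketch
of mine): reduction modulo p = 2^31 - 1 folds the high bits back in
with an addition, since 2^31 = 1 (mod p):

#include <stdint.h>

#define MP ((1u << 31) - 1)           /* toy Mersenne ring: p = 2^31 - 1 */

static uint32_t mersenne_reduce(uint64_t x)
{
    x = (x & MP) + (x >> 31);         /* fold using 2^31 = 1 (mod p) */
    x = (x & MP) + (x >> 31);         /* fold the small remaining carry */
    if (x >= MP) x -= MP;
    return (uint32_t) x;
}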

d) Number Theoretic Transforms (NTT's). Work in a ring Z/pZ where p is
a prime such that Z/pZ has lots of roots of unity. Usually p is chosen
to be a small prime that fits in a machine word. We usually start with
a prime p of the form p = a*n + 1 for some small value a and some
larger value n (for example a power of 2 so that reduction mod p has a
chance of being cheap). We find a primitive root r mod p; then s =
r^a (mod p) is a primitive n-th root of unity, since s^n = r^(p-1) = 1
by Fermat's Little Theorem and r has order p - 1. Thus we can support
a transform of length n on the ring Z/pZ.
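
For example (a sketch of mine), take the well known NTT prime
p = 15*2^27 + 1. The only prime factors of p - 1 are 2, 3 and 5, so a
primitive root can be found by direct testing, and from it a root of
unity for any power-of-2 transform length up to 2^27:

#include <stdint.h>
#include <stdio.h>

static uint64_t powmod(uint64_t b, uint64_t e, uint64_t p)
{
    uint64_t r = 1;                   /* p < 2^31, so b*b cannot overflow */
    for (b %= p; e; e >>= 1, b = b * b % p)
        if (e & 1) r = r * b % p;
    return r;
}

int main(void)
{
    uint64_t p = 15 * (1ULL << 27) + 1;      /* p - 1 = 3 * 5 * 2^27 */
    uint64_t r = 2;

    /* r is a primitive root iff r^((p-1)/q) != 1 for each prime q | p - 1 */
    while (powmod(r, (p - 1) / 2, p) == 1 ||
           powmod(r, (p - 1) / 3, p) == 1 ||
           powmod(r, (p - 1) / 5, p) == 1)
        r++;

    uint64_t n = 1ULL << 27;                 /* transform length */
    uint64_t s = powmod(r, (p - 1) / n, p);  /* primitive n-th root of unity */
    printf("r = %llu, s = %llu\n", (unsigned long long) r,
           (unsigned long long) s);
    return 0;
}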

Many of the historically large integer products that have been
computed, e.g. in computations of many digits of pi, involved NTT's.
One can use a few transforms modulo different primes p (usually 2-5 of
them) and combine the results using CRT.

NTT's have several problems. Firstly, once one gets beyond about 5
primes, the CRT step becomes a significant proportion of the runtime.
Also, NTT implementations don't compare favourably in practice with
optimised Schonhage-Strassen implementations, often being an order of
magnitude slower. On a 32 bit machine, the maximum transform length
for a word sized prime can also be limiting for very large problems.

A further drawback of NTT's is that multiplication by a root of unity
is no longer a bit shift. One must also compute all the roots of
unity, which usually means keeping some table of them handy. For very
long transforms, the table itself may exceed the size of cache. Or one
can partially compute the table and then perform some additional
computations on-the-fly.

-----------------

Now I will discuss some additional algorithms which one needs to know about.

e) Multiplication of integers modulo B^n + 1, using the negacyclic
convolution. Recall that the negacyclic convolution can be used to
multiply polynomials modulo x^n + 1. If we evaluate such a polynomial
at some point x = B then we can use a negacyclic convolution to
multiply integers modulo B^n + 1. There's nothing special about this
strategy for multiplying ordinary integers: a cyclic convolution will
generally be faster, as one doesn't need the extra twiddle factors
required to turn the negacyclic convolution into a cyclic one (so that
it can be computed using an FFT). But for multiplying integers modulo
a number of the form B^n + 1 it is very useful.

Note that the pointwise multiplications in our convolution algorithm
are precisely multiplications modulo a number of the form B^m + 1,
when we are working in a Fermat ring. Thus we can effect our pointwise
multiplications using a negacyclic convolution. This allows the FFT to
recurse down to an FFT for the pointwise multiplications (when they
become sufficiently large). This is more efficient than multiplying
using an ordinary cyclic convolution and then reducing modulo p, as a
convolution of half the length can be used (you actually *want* it to
wrap around). This can save a significant amount of time once the
pointwise multiplications become so large that the use of an FFT is
indicated. In fact, because the convolution will be half the size, a
negacyclic convolution can be used *before* one would consider using
an ordinary FFT routine to multiply the integers.
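
For reference, the negacyclic convolution differs from the cyclic one
only in the sign of the wrapped terms, since x^n = -1. A naive version
(my sketch, with signed coefficients so the negations are visible - a
real implementation works modulo p):

#include <stdint.h>

/* c = a*b modulo x^n + 1: wrapped products come back in negated */
void negacyclic_convolve(int64_t *c, const int64_t *a,
                         const int64_t *b, int n)
{
    for (int i = 0; i < n; i++) c[i] = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
        {
            if (i + j < n) c[i + j]     += a[i] * b[j];
            else           c[i + j - n] -= a[i] * b[j];   /* x^n = -1 */
        }
}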

That's all I need to discuss for this section, as I will explain some
other algorithms in later sections as they are needed.

Bill.

2010/1/1 Bill Hart <goodwillh...@googlemail.com>:
> 1. Formal release of code.
>
> I have included two new files in the top source directory of mpir in
> svn. The first is new_fft_with_flint.c, the second new_fft.c. They are
> isomorphic except that the second does not need to be linked against
> FLINT to run. There will be no difference in timings or results, I
> just removed all the timing and test code which relied on FLINT.
>
> Here is the command I use to build the code once you have built MPIR:
>
> gcc -O2 new_fft.c -o fft -I/home/wbhart/mpir-trunk
> -L/home/wbhart/mpir-trunk/.libs -lmpir
>
> Modify as desired for your own system.
>
> Please note the usual disclaimer:
>
> This code is INCOMPLETE, INEFFICIENT, INCONSISTENT often INCORRECT,
> ILL-TESTED and the original author is ILL-TEMPERED.
>
> It is LGPL v2+ licensed and I have taken extreme care to ensure that
> all the code was written from scratch by me without following someone
> else's work. The four functions FFT_split_* and FFT_combine_* were
> taken from FLINT, but to my knowledge were written entirely by me. I
> hereby release them under the LGPL v2+.
>
> The timing code does not print times, it merely completes a certain
> number of iterations and exits. Use the linux time command to get
> times.
>
> I have run very few tests on the multiplication routine, perhaps on
> the order of a few thousand integer multiplications, at very few
> sizes. This code is extremely new and should be treated as quite
> likely broken.
>
> Note that the negacyclic convolution is currently disabled. I don't
> even know if it passes its test code when switched on at present. I
> modified a number of the FFT's since writing it, so probably not. But
> it did pass its test code. However note it does not deal with a
> couple of (trivial) special cases, so is incomplete and formally
> incorrect because of that.
>
> There is an implementation of the split radix FFT contained in the
> file too. I found the times were *identical* to those I got from the
> radix 2 code, so I abandoned it. There is also an old implementation
> of the radix 2 FFT/IFFT which I also abandoned. The basic radix 2 FFT
> function is called FFT_radix2_new.
>
> All the truncate, mfa and twiddle functions are new in that they don't
> rely on the old radix 2 implementation, I believe.
>
> Many of the code comments have been cut and pasted from earlier code
> they were derived from, so will be incorrect. This is especially true
> of the comments for the FFT and IFFT functions (including the truncate
> and mfa functions), except for the radix2_new and split_radix
> functions which I hope are about right.
>
> Bill.
>
> 2010/1/1 Bill Hart <goodwillh...@googlemail.com>:
>> As the code for this FFT is quite involved, I plan to incrementally
>> release a set of notes about it, starting tomorrow morning (today
>> already), which may take hours to days for me to complete. I got the
>> code working for the first time this afternoon, so please be patient
>> with me.
>>
>> Here is the rough list of topics I hope to cover in the notes:
>>
>> Table of Contents
>> =============
>>
>> 1) The formal code release. I need to check all the test code still
>> passes and make it easy for people to build and run. That will take an
>> hour or so for me to do.
>>
>> 2) FFT's for nitwits (like me). How does an FFT help in multiplying
>> numbers, basically, and what are the important algorithms and
>> techniques.
>>
>> 3) Why a new FFT? This is an important question which needs answering.
>> I will discuss what led me to work on a new FFT implementation, rather
>> than improve or port another implementation.
>>
>> 4) The strategy. What does this FFT actually do? Which algorithms does
>> it use?  I tried many things, most of which completely failed. In
>> section 3 I will have already discussed other strategies I tried which
>> did not work. Here I will discuss what did work. Whilst many of the
>> details came from my head, I'm pretty sure the general outline will
>> contain few surprises.
>>
>> 5) References. Few. I purposely tried to be creative.
>>
>> 8) Implementors notes. A long list of things that can be improved. In
>> releasing the code publicly it is my hope that contributors will help
>> make this the best FFT implementation it can be. The list will include
>> many trivial items which anyone could work on, all the way through to
>> fairly advanced topics which would probably require original research
>> to do really well. I'll rank the items according to how hard I think
>> they'd be to do.
>>
>> The code will be under revision control, so it is my hope that people
>> will be willing to contribute to it. I believe I spent between 150 and
>> 200 hours writing it, all told.
>>
>> Till tomorrow.
>>
>> Bill.
>>
>> 2010/1/1 Bill Hart <goodwillh...@googlemail.com>:
>>> Ha! I forgot to mention, those numbers represent the size of the
>>> integers in bits. I multiply two different integers with the same
>>> number of bits, though that is not a limitation of the code. It will
>>> accept integers of different lengths and adjust accordingly.
>>>
>>> Bill.
>>>
>>> 2010/1/1 Bill Hart <goodwillh...@googlemail.com>:
>>>> I re-timed that point and I believe I may have just mistranscribed it.
>>>> Here it is with the corrected time. Thanks for pointing it out!
>>>>
>>>> I've also included some times from the FLINT FFT, though only
>>>> non-truncated ones (i.e. where FLINT uses a convolution which is an
>>>> exact power of 2 length - not a limitation of FLINT, I just didn't get
>>>> around to timing it at the other points yet).
>>>>
>>>> 8364032:  6.8s / 8.0s
>>>> 9409536:  9.8s / 8.7s
>>>> 10455040:  10.8s / 9.6s
>>>> 11500544: 12.3s / 12.6s
>>>> 12546048: 12.7s / 15.3s
>>>> 13591552: 14.0s / 14.2s
>>>> 14637056: 14.9s / 15.3s
>>>> 15682560: 16.0s / 14.5s
>>>> 16728064: 16.8s / 19.5s
>>>>
>>>> 20910080: 24.3s / 23.2s
>>>> 25092096: 28.5s / 33.1s
>>>> 29274112: 33.1s / 31.8s
>>>> 33456128: 37.4s / 47.7s
>>>>
>>>> 519168: iters 10000: 30.7s / 32.2s / 32.3s
>>>> 2084864: iters 1000: 13.9s / 15.2s / 13.0s
>>>> 8364032: iters 100: 6.8s / 8.0s / 7.6s
>>>> 33497088: iters 100: 37.4s / 47.7s / 37.5s
>>>> 134103040: iters 10: 17.0s / 24.8s / 19.6s
>>>> 536608768: iters 1: 8.6s / 9.4s / 8.6s
>>>> 2146959360: iters 1: 42.3s / 40.3s / 38.9s
>>>>
>>>> Bill.
>>>>
>>>>
>>>> 2009/12/31 Bill Hart <goodwillh...@googlemail.com>:
>>>>> It's either a tuning issue or timing instability on the machine.
>>>>>
>>>>> Bill.
>>>>>
>>>>> On 31/12/2009, Robert Gerbicz <robert.gerb...@gmail.com> wrote:
>>>>>> Hi!
>>>>>>
>>>>>> How is it possible that to multiply longer numbers takes less time, for
>>>>>> example:
>>>>>>
>>>>>> 12546048: 14.7s / 15.3s
>>>>>> 13591552: 14.0s / 14.2s
>>>>>>
>>>>>
>>>>
>>>
>>
>
