2. FFT's for nitwits (like me)
====================

I will explain why I am a nitwit in a later section. But to understand that, you need to understand some FFT basics first. So I describe them here.
2a. Description of the FFT
===================

An FFT is not an integer multiplication routine per se, but it can be used to construct one. The first step in multiplying huge integers A and B is to view the problem as a polynomial multiplication. Write your large integers in some base D (usually D = 2^r for some r), i.e. write A = f1(D), B = f2(D) for polynomials f1 and f2 whose coefficients are in the range [0, D). Next, multiply the polynomials: g(x) = f1(x)*f2(x). Finally, evaluate the product polynomial at D, i.e. C = A*B = f1(D)*f2(D). So from now on I'll talk about large integer multiplication as though it is polynomial multiplication.

Polynomial multiplication is computed using convolution of the two polynomials. (Cyclic) convolution of polynomials is defined (up to scaling) as multiplication of the polynomials modulo x^n - 1 for some n. Note that if the degree of g is less than n then g(x) can be recovered precisely from the convolution.

The FFT can be used to efficiently compute the convolution. Specifically, the convolution of two polynomials f1 and f2 can be computed as FFT^(-1)(FFT(f1).FFT(f2)), where the dot is the coordinate-wise product of vectors. The inverse of an FFT can be computed (up to scaling) by an IFFT. Thus for convolution modulo x^n - 1 where n = 2^k, the convolution can be computed as 2^(-k)*IFFT(FFT(f1).FFT(f2)), where all the operations are on vectors of coefficients. In particular the coordinate-wise product involves multiplication of the transformed coefficients. These products are known as pointwise multiplications.

You should view the FFT as evaluating the polynomial at n = 2^k different roots of unity, 1, z, z^2, ...., z^(n-1), where z is a primitive 2^k-th root of unity. The pointwise products multiply such values, effectively giving you the product polynomial g (modulo x^n - 1) evaluated at each of these roots of unity. The coefficients of the polynomial g are retrieved by inverting the FFT process, i.e.
by interpolating the vector of pointwise products.

So how does the FFT actually work? We start with a polynomial modulo x^n - 1 where n = 2^k. To save on notation, let's write x^(2m) - 1 where m = 2^(k-1). We write x^(2m) - 1 = (x^m - 1)(x^m + 1). If we wish to multiply polynomials modulo x^(2m) - 1, our first step is to express our polynomials modulo x^m - 1 and x^m + 1. We then compute the products mod x^m - 1 and mod x^m + 1 and recombine using CRT.

Let f(x) = b(x)*x^m + a(x) where the degrees of a(x) and b(x) are less than m. Then f(x) modulo x^m - 1 is just a(x) + b(x), i.e. we simply add coefficients in pairs. Similarly, modulo x^m + 1 the polynomial would be a(x) - b(x).

The product of polynomials modulo x^m - 1 can then be computed recursively using the same algorithm, i.e. by a convolution modulo x^m - 1. To compute the product modulo x^m + 1 we make a substitution y = z*x where z is a primitive 2^k-th root of unity. This transforms the problem into one modulo y^m - 1, which can again be done recursively using the same algorithm. Note that substituting y = z*x is equivalent to multiplying the coefficient of x^i in f(x) by z^i. So we twiddle the coefficients by roots of unity.

A convolution mod x^m - 1 is called a cyclic convolution, and a convolution mod x^m + 1 is called a negacyclic convolution. So the FFT algorithm works by breaking the cyclic convolution into a cyclic convolution of half the size and a negacyclic convolution of half the size, transforming the negacyclic convolution into a cyclic convolution by twiddling by roots of unity. In practice the evaluation step, adding or subtracting coefficients in pairs, and the twiddle step for the negacyclic half are combined.
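Before going on, the overall pipeline from section 2a can be sketched in a few lines of Python. This is a toy sketch of my own (all function names are mine, nothing here is from the MPIR code): split the integers in base D, convolve the coefficient vectors cyclically mod x^n - 1 (here by schoolbook convolution, which is exactly the step the FFT machinery replaces), then evaluate the product at D.

```python
def to_coeffs(A, D, n):
    """Write A in base D as a list of n coefficients, low to high."""
    coeffs = []
    for _ in range(n):
        coeffs.append(A % D)
        A //= D
    return coeffs

def cyclic_convolve(f, g):
    """Schoolbook convolution of f and g modulo x^n - 1."""
    n = len(f)
    h = [0] * n
    for i in range(n):
        for j in range(n):
            h[(i + j) % n] += f[i] * g[j]   # exponents wrap, since x^n = 1
    return h

def multiply(A, B, D, n):
    """A*B, correct as long as deg(f1*f2) < n so no wraparound occurs."""
    h = cyclic_convolve(to_coeffs(A, D, n), to_coeffs(B, D, n))
    return sum(c * D**i for i, c in enumerate(h))   # evaluate g at x = D
```

Note the condition in `multiply`: the convolution is only usable for integer multiplication here because n is chosen large enough that the product polynomial does not wrap around mod x^n - 1.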
The combined evaluation-and-twiddle step leads to the familiar FFT "butterfly", which replaces the left half of a vector of coefficients of f(x), [a0, a1, a2, ...., a{m-1}, b0, b1, ..., b{m-1}], with [a0+b0, a1+b1, ...., a{m-1}+b{m-1}] and the right half with [a0-b0, z*(a1-b1), z^2*(a2-b2), ..., z^{m-1}*(a{m-1}-b{m-1})]. The butterfly step can be conveniently written [a{i}, b{i}] => [a{i}+b{i}, z^i*(a{i}-b{i})].

One then recursively computes the FFT butterfly of the left and right halves of the vector. Eventually one gets down to vectors of length 1, which represent polynomials modulo x - 1 (or really x - z^i for i = 0, ..., 2m-1). Note that taking a polynomial f(x) modulo x - z^i is the same thing as evaluating it at z^i. So what the FFT really computes is [f(1), f(z), f(z^2), ...., f(z^{2m-1})] (well, perhaps not in that order, but in some order - probably bit reversed order, i.e. z^i is swapped with z^j where i is the reverse of j in binary, when they are considered as binary numbers of log_2(2m) bits). This left-right FFT recursion scheme is known as a decimation in frequency (DIF) FFT.

Note that multiplication of polynomials modulo x - 1 is simply integer multiplication. These are the pointwise multiplications we spoke of earlier. We simply multiply our vectors of coefficients pointwise.

Now suppose I have a polynomial known modulo x^m - 1 and x^m + 1; how can I invert this process and obtain it modulo x^(2m) - 1? Specifically, if I have c{i} = a{i}+b{i} and d{i} = z^i*(a{i}-b{i}), how can I retrieve a{i} and b{i}? Clearly the answer is to multiply d{i} by z^(-i); then adding it to c{i} yields 2*a{i} and subtracting it from c{i} yields 2*b{i}. Notice the extra factor of 2 which has crept in.

So the IFFT butterfly step replaces [c0, d0, c1, d1, ...., c{m-1}, d{m-1}] with [c0+d0, c0-d0, c1+z^(-1)*d1, c1-z^(-1)*d1, ...., c{m-1}+z^(-(m-1))*d{m-1}, c{m-1}-z^(-(m-1))*d{m-1}]. Or we can write the butterfly conveniently as [c{i}, d{i}] => [c{i}+z^(-i)*d{i}, c{i}-z^(-i)*d{i}].
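The two butterflies just described can be written down directly. Here is a hedged toy sketch of my own (it is not the new_fft.c code, and it uses complex floating point roots of unity rather than a modular ring): a recursive DIF FFT, and the matching inverse built from the [c{i}, d{i}] => [c{i}+z^(-i)*d{i}, c{i}-z^(-i)*d{i}] butterfly. The inverse returns n times the original coefficients, one factor of 2 per butterfly level, as described above.

```python
import cmath

def fft_dif(v, z):
    """DIF FFT: butterfly [a_i, b_i] -> [a_i + b_i, z^i*(a_i - b_i)],
    then recurse on each half with the root z^2.
    z must be a primitive len(v)-th root of unity."""
    n = len(v)
    if n == 1:
        return v
    m = n // 2
    a, b = v[:m], v[m:]
    left = [a[i] + b[i] for i in range(m)]            # f mod x^m - 1
    right = [z**i * (a[i] - b[i]) for i in range(m)]  # twiddled f mod x^m + 1
    return fft_dif(left, z * z) + fft_dif(right, z * z)

def ifft_dit(v, z):
    """Structural inverse of fft_dif: butterfly
    [c_i, d_i] -> [c_i + z^-i*d_i, c_i - z^-i*d_i].
    Returns n times the original coefficient vector."""
    n = len(v)
    if n == 1:
        return v
    m = n // 2
    c = ifft_dit(v[:m], z * z)
    d = ifft_dit(v[m:], z * z)
    out = [0.0] * n
    for i in range(m):
        t = z**(-i) * d[i]
        out[i] = c[i] + t      # recovers 2*a_i (times the recursive factor)
        out[m + i] = c[i] - t  # recovers 2*b_i
    return out
```

With z = exp(2*pi*i/n), `ifft_dit(fft_dif(v, z), z)` gives n*v, and a cyclic convolution is the inverse transform of the pointwise product of two forward transforms, divided by n. The pointwise product is unaffected by the bit-reversed output order, since both forward transforms come out in the same order.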
Note we have paired the odd indexed entries in our vector with the even indexed entries. An FFT which has the butterfly step [c{i}, d{i}] => [c{i}+z^i*d{i}, c{i}-z^i*d{i}] (note the change in sign of the exponents on the twiddle factors to convert from an IFFT to an FFT) is called a decimation in time (DIT) FFT. So our IFFT is simply a DIT FFT with the sign of the exponent of the twiddle factors changed.

Now, an important note for later is that the reason this all worked is that we can invert the FFT process. Specifically c{i}+z^(-i)*d{i} = a{i}+b{i}+a{i}-b{i} = 2*a{i}, and similarly we can retrieve 2*b{i}.

One final note. These factors of 2 that creep in can be removed at any stage. They are sometimes removed at each butterfly step, sometimes all at once at the end of the IFFT.

2b. FFT algorithms
==============

There are multiple different ways of implementing the FFT algorithm in practice, and a number of important additional algorithms used in multiplying integers using an FFT. I will describe them here.

a) Complex FFT. One can use complex numbers and floating point arithmetic to multiply integers using the FFT. One has to be careful that not too much precision is lost at each step. For example, one could work with double precision floating point numbers. Naturally it is inefficient to use complex numbers, as the polynomials we are trying to multiply have only real coefficients. There are some strategies for getting around this. One can exploit the symmetries of the FFT to get FFTs which work on real values and take roughly half the time/space. For example the Fast Hartley Transform works with only real values, and there are various other variations on this theme. Note that the roots of unity here are complex values, or for the fast Hartley transform, real values which are linear combinations of sines and cosines. So computing the twiddle factors can be expensive.

The biggest issue with real/complex FFTs is the loss of precision.
They are really fast as they can use the native floating point arithmetic of the processor; they just usually don't guarantee a correct result, and if they do, they aren't so fast or don't support very long transforms.

b) Schonhage-Strassen FFTs. Schonhage and Strassen suggested working in the ring Z/pZ where p is of the form 2^m+1. Here 2 is a 2m-th (2^k-th) root of unity. Simply compute 2^m and you see that you get -1 in this ring; thus 2^(2m) must be 1 in this ring. Note that multiplication by a root of unity then becomes multiplication by a power of 2 (modulo 2^m + 1), or a bitshift on a binary computer. Working in this ring is also convenient, as reduction modulo p can be done with a subtraction.

In practice we can work in a ring Z/pZ where p = 2^(a*m)+1 for some value a. Here 2^a is a 2m-th (2^k-th) root of unity. A ring of this form is called a Fermat ring and p a generalised Fermat number. Usually p is large, taking up numerous machine limbs (p need not be prime for the convolution to work). For performance reasons a and m are usually chosen so that reduction modulo p does not involve any bitshifts, i.e. so that 2^(a*m) corresponds to a whole number of limbs.

c) Mersenne Ring FFT. Work in a ring modulo p = 2^(b*m) - 1, again for m a power of 2 and b arbitrary. Here 2^b is an m-th root of unity, and reduction modulo p is an addition. Numbers p of this form are called generalised Mersenne numbers.

d) Number Theoretic Transforms (NTTs). Work in a ring Z/pZ where p is a prime such that Z/pZ has lots of roots of unity. Usually p is chosen to be a small prime that fits in a machine word. We usually start with a prime p of the form p = a*n + 1 for some small value a and some larger value n (for example a power of 2, so that reduction mod p has a chance of being cheap). We find a primitive root r mod p; then s = r^a (mod p) is a primitive n-th root of unity, since r has order p - 1 = a*n by Fermat's Little Theorem. Thus we can support a transform of length n on the ring Z/pZ.
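The root-of-unity construction for an NTT can be illustrated concretely. This is a toy example of my own (the prime and transform length are illustrative, not taken from any real implementation): take p = a*n + 1 prime, find a primitive root r, and then s = r^a has order exactly n, since r has order a*n.

```python
def is_primitive_root(r, p, prime_factors):
    """r generates (Z/pZ)* iff r^((p-1)/q) != 1 mod p
    for every prime q dividing p - 1."""
    return all(pow(r, (p - 1) // q, p) != 1 for q in prime_factors)

p, a, n = 193, 3, 64       # p = 3*64 + 1 is prime; p - 1 = 2^6 * 3
prime_factors = [2, 3]     # the prime factors of p - 1
r = next(x for x in range(2, p) if is_primitive_root(x, p, prime_factors))
s = pow(r, a, p)           # a primitive n-th root of unity mod p

# s supports a transform of length 64 over Z/193Z:
# s^64 = 1 and s^32 = -1 (mod 193).
```

Since 64 is a power of 2, s also yields roots of unity for every shorter power-of-2 transform, by repeated squaring.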
Many of the historically large integer products that have been computed, e.g. in computation of many digits of pi, involved NTTs. One can use a few of them, modulo different primes p (usually 2-5 of them), and combine using CRT.

NTTs have several problems. Firstly, once one gets beyond about 5 primes, the CRT step becomes a significant proportion of the runtime. Also, in practice, NTT implementations don't compare favourably with optimised Schonhage-Strassen implementations, being an order of magnitude slower. On a 32 bit machine, the maximum transform length for a word sized prime can also be limiting for very large problems. The main drawback with NTTs is that multiplication by a root of unity is no longer a bit shift. One must also compute all the roots of unity, which usually means keeping some table of them handy. For very long transforms, the table itself may exceed the size of cache. Alternatively one can partially compute the table and then perform some additional computations on-the-fly.

-----------------

Now I will discuss some additional algorithms which one needs to know about.

e) Multiplication of integers modulo B^m + 1, using the negacyclic convolution. Recall that the negacyclic convolution can be used to multiply polynomials modulo x^n + 1. If we evaluate such a polynomial at some point x = B then we can use a negacyclic convolution to multiply integers modulo B^n + 1.

There's nothing special about using this strategy to multiply ordinary integers. A cyclic convolution will generally be faster, as one doesn't need the extra twiddle factors that turn the negacyclic convolution into a cyclic convolution (so that it can be computed using an FFT). But for multiplying integers modulo a number of the form B^n + 1 it is very useful. Note that the pointwise multiplications in our convolution algorithm are precisely multiplications modulo a number of the form B^m + 1, when we are working in a Fermat ring.
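Point e) can be sketched directly. This is my own toy version (schoolbook rather than FFT-based, and the function names are mine): the wraparound term in a negacyclic convolution picks up a minus sign because x^n = -1, and evaluated at x = D that sign is exactly the reduction modulo D^n + 1 happening for free.

```python
def digits(A, D, n):
    """Write A < D^n in base D as n digits, low to high."""
    out = []
    for _ in range(n):
        out.append(A % D)
        A //= D
    return out

def negacyclic_convolve(f, g):
    """Schoolbook convolution of f and g modulo x^n + 1."""
    n = len(f)
    h = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                h[i + j] += f[i] * g[j]
            else:
                h[i + j - n] -= f[i] * g[j]   # x^(i+j) = -x^(i+j-n)
    return h

def mulmod_fermat(A, B, D, n):
    """A*B mod D^n + 1, for A, B < D^n, via a length-n negacyclic
    convolution. No padding to length 2n is needed: we *want* wraparound."""
    h = negacyclic_convolve(digits(A, D, n), digits(B, D, n))
    return sum(c * D**i for i, c in enumerate(h)) % (D**n + 1)
```

Compare with the cyclic `multiply` sketch earlier: there the transform had to be long enough to avoid wraparound, whereas here a transform of only length n suffices for operands of n digits, which is where the factor-of-two saving comes from.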
Thus we can effect our pointwise multiplications using a negacyclic convolution. This allows the FFT to recurse down to an FFT for the pointwise multiplications (when they become sufficiently large). This is more efficient than multiplying using an ordinary cyclic convolution and then reducing modulo p, as a convolution of half the length can be used (you actually *want* it to wrap around). This can save a significant amount of time once the pointwise multiplications become so large that the use of an FFT is indicated. In fact, because the convolution will be half the size, a negacyclic convolution can be used *before* one would consider using an ordinary FFT routine to multiply the integers.

That's all I need to discuss for this section, as I will explain some other algorithms in later sections as they are needed.

Bill.

2010/1/1 Bill Hart <goodwillh...@googlemail.com>:
> 1. Formal release of code.
>
> I have included two new files in the top source directory of mpir in
> svn. The first is new_fft_with_flint.c, the second new_fft.c. They are
> isomorphic except that the second does not need to be linked against
> FLINT to run. There will be no difference in timings or results, I
> just removed all the timing and test code which relied on FLINT.
>
> Here is the command I use to build the code once you have built MPIR:
>
> gcc -O2 new_fft.c -o fft -I/home/wbhart/mpir-trunk
> -L/home/wbhart/mpir-trunk/.libs -lmpir
>
> Modify as desired for your own system.
>
> Please note the usual disclaimer:
>
> This code is INCOMPLETE, INEFFICIENT, INCONSISTENT, often INCORRECT,
> ILL-TESTED and the original author is ILL-TEMPERED.
>
> It is LGPL v2+ licensed and I have taken extreme care to ensure that
> all the code was written from scratch by me without following someone
> else's work. The four functions FFT_split_* and FFT_combine_* were
> taken from FLINT, but to my knowledge were written entirely by me. I
> hereby release them under the LGPL v2+.
>
> The timing code does not print times, it merely completes a certain
> number of iterations and exits. Use the linux time command to get
> times.
>
> I have run very few tests on the multiplication routine, perhaps on
> the order of a few thousand integer multiplications, at very few
> sizes. This code is extremely new and should be treated as quite
> likely broken.
>
> Note that the negacyclic convolution is currently disabled. I don't
> even know if it passes its test code when switched on at present. I
> modified a number of the FFT's since writing it, so probably not. But
> it did pass its test code. However note it does not deal with a
> couple of (trivial) special cases, so is incomplete and formally
> incorrect because of that.
>
> There is an implementation of the split radix FFT contained in the
> file too. I found the times were *identical* to those I got from the
> radix 2 code, so I abandoned it. There is also an old implementation
> of the radix 2 FFT/IFFT which I also abandoned. The basic radix 2 FFT
> function is called FFT_radix2_new.
>
> All the truncate, mfa and twiddle functions are new in that they don't
> rely on the old radix 2 implementation, I believe.
>
> Many of the code comments have been cut and pasted from earlier code
> they were derived from, so will be incorrect. This is especially true
> of the comments for the FFT and IFFT functions (including the truncate
> and mfa functions), except for the radix2_new and split_radix
> functions which I hope are about right.
>
> Bill.
>
> 2010/1/1 Bill Hart <goodwillh...@googlemail.com>:
>> As the code for this FFT is quite involved, I plan to incrementally
>> release a set of notes about it, starting tomorrow morning (today
>> already), which may take hours to days for me to complete. I got the
>> code working for the first time this afternoon, so please be patient
>> with me.
>>
>> Here is the rough list of topics I hope to cover in the notes:
>>
>> Table of Contents
>> =============
>>
>> 1) The formal code release. I need to check all the test code still
>> passes and make it easy for people to build and run. That will take an
>> hour or so for me to do.
>>
>> 2) FFT's for nitwits (like me). How does an FFT help in multiplying
>> numbers, basically, and what are the important algorithms and
>> techniques.
>>
>> 3) Why a new FFT? This is an important question which needs answering.
>> I will discuss what led me to work on a new FFT implementation, rather
>> than improve or port another implementation.
>>
>> 4) The strategy. What does this FFT actually do? Which algorithms does
>> it use? I tried many things, most of which completely failed. In
>> section 3 I will have already discussed other strategies I tried which
>> did not work. Here I will discuss what did work. Whilst many of the
>> details came from my head, I'm pretty sure the general outline will
>> contain few surprises.
>>
>> 5) References. Few. I purposely tried to be creative.
>>
>> 8) Implementors notes. A long list of things that can be improved. In
>> releasing the code publicly it is my hope that contributors will help
>> make this the best FFT implementation it can be. The list will include
>> many trivial items which anyone could work on, all the way through to
>> fairly advanced topics which would probably require original research
>> to do really well. I'll rank the items according to how hard I think
>> they'd be to do.
>>
>> The code will be under revision control, so it is my hope that people
>> will be willing to contribute to it. I believe I spent between 150 and
>> 200 hours writing it, all told.
>>
>> Till tomorrow.
>>
>> Bill.
>>
>> 2010/1/1 Bill Hart <goodwillh...@googlemail.com>:
>>> Ha! I forgot to mention, those numbers represent the size of the
>>> integers in bits.
>>> I multiply two different integers with the same
>>> number of bits, though that is not a limitation of the code. It will
>>> accept integers of different lengths and adjust accordingly.
>>>
>>> Bill.
>>>
>>> 2010/1/1 Bill Hart <goodwillh...@googlemail.com>:
>>>> I re-timed that point and I believe I may have just mistranscribed it.
>>>> Here it is with the corrected time. Thanks for pointing it out!
>>>>
>>>> I've also included some times from the FLINT FFT, though only
>>>> non-truncated ones (i.e. where FLINT uses a convolution which is an
>>>> exact power of 2 length - not a limitation of FLINT, I just didn't get
>>>> around to timing it at the other points yet).
>>>>
>>>> 8364032: 6.8s / 8.0s
>>>> 9409536: 9.8s / 8.7s
>>>> 10455040: 10.8s / 9.6s
>>>> 11500544: 12.3s / 12.6s
>>>> 12546048: 12.7s / 15.3s
>>>> 13591552: 14.0s / 14.2s
>>>> 14637056: 14.9s / 15.3s
>>>> 15682560: 16.0s / 14.5s
>>>> 16728064: 16.8s / 19.5s
>>>>
>>>> 20910080: 24.3s / 23.2s
>>>> 25092096: 28.5s / 33.1s
>>>> 29274112: 33.1s / 31.8s
>>>> 33456128: 37.4s / 47.7s
>>>>
>>>> 519168: iters 10000: 30.7s / 32.2s / 32.3s
>>>> 2084864: iters 1000: 13.9s / 15.2s / 13.0s
>>>> 8364032: iters 100: 6.8s / 8.0s / 7.6s
>>>> 33497088: iters 100: 37.4s / 47.7s / 37.5s
>>>> 134103040: iters 10: 17.0s / 24.8s / 19.6s
>>>> 536608768: iters 1: 8.6s / 9.4s / 8.6s
>>>> 2146959360: iters 1: 42.3s / 40.3s / 38.9s
>>>>
>>>> Bill.
>>>>
>>>>
>>>> 2009/12/31 Bill Hart <goodwillh...@googlemail.com>:
>>>>> It's either a tuning issue or timing instability on the machine.
>>>>>
>>>>> Bill.
>>>>>
>>>>> On 31/12/2009, Robert Gerbicz <robert.gerb...@gmail.com> wrote:
>>>>>> Hi!
>>>>>>
>>>>>> How is it possible that to multiply longer numbers takes less time, for
>>>>>> example:
>>>>>>
>>>>>> 12546048: 14.7s / 15.3s
>>>>>> 13591552: 14.0s / 14.2s
--
You received this message because you are subscribed to the Google Groups "mpir-devel" group.
To post to this group, send email to mpir-de...@googlegroups.com.
To unsubscribe from this group, send email to mpir-devel+unsubscr...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/mpir-devel?hl=en.