I've done some more timings. This time I turned off MPIR's sqrt2 trick, which I haven't implemented yet.
For some multiplications this causes MPIR to take far too long due to tuning issues; for example, there are points where larger numbers take less time to multiply. I've made a comparison showing how much faster the new code is than MPIR, with a negative percentage where MPIR is faster. I've omitted all percentages where the two times are within 5% of each other, which is probably below the error threshold for these timings anyway. For the MPIR times, where a later time is lower I used that when computing the percentages, as technically this effect could be tuned away. As you can see, the new code is up to 100% faster, and over quite a large range it has a big advantage over the MPIR FFT. Eventually MPIR catches up. For smaller multiplications MPIR has a distinct advantage, which I will discuss later, but even so it is rarely more than a few percent faster in this range!

The format is bits/iterations: new code / MPIR (speedup).

519168/10000: 29.5s / 33.9s (14.9%)
584064/10000: 35.5s / 34.1s
648960/1000: 4.0s / 3.9s
713856/1000: 4.4s / 4.3s (-11.4%)
778752/1000: 4.7s / 5.5s (17.0%)
843648/1000: 5.4s / 5.3s
908544/1000: 5.7s / 7.6s
973440/1000: 6.1s / 5.9s
1038336/1000: 6.4s / 6.7s
1297920/1000: 8.7s / 9.0s
1557504/1000: 10.1s / 12.7s (25.7%)
1817088/1000: 12.5s / 15.8s (26.4%)
2084864/1000: 13.9s / 15.2s (9.4%)
2345472/1000: 16.5s / 16.6s
2606080/1000: 18.5s / 18.2s
2866688/1000: 20.1s / 19.9s
3127296/1000: 21.6s / 26.8s (24%)
3387904/1000: 23.7s / 28.2s (15.2%)
3648512/1000: 25.3s / 30.5s (7.9%)
3909120/1000: 26.9s / 27.3s
4169728/1000: 28.4s / 35.4s (24.6%)
5212160/1000: 43.0s / 40.6s (-5.6%)
6254592/100: 5.0s / 6.5s (30%)
7297024/100: 6.1s / 7.0s (14.8%)
8364032/100: 6.6s / 8.7s (31.8%)
9409536/100: 9.4s / 17.8s (60.6%)
10455040/100: 10.3s / 19.6s (46.6%)
11500544/100: 11.3s / 20.5s (33.6%)
12546048/100: 12.8s / 15.1s (18.0%)
13591552/100: 13.9s / 16.6s (10.1%)
14637056/100: 14.8s / 15.3s
15682560/100: 15.9s / 24.7s (35.8%)
16728064/100: 16.0s / 21.6s (35%)
20910080/100: 21.7s / 23.5s (8.3%)
25092096/100: 25.5s / 37.8s (48.2%)
29274112/100: 29.6s / 45.9s (55.1%)
33456128/100: 35.2s / 50.7s (44.0%)
37684224/10: 4.7s / 8.7s (48.9%)
41871360/10: 5.1s / 9.4s (37.3%)
46058496/10: 5.6s / 7.0s (25.0%)
50245632/10: 6.0s / 7.9s (31.7%)
54432768/10: 6.6s / 8.2s (24.2%)
58619904/10: 7.0s / 8.0s (14.3%)
62807040/10: 7.5s / 11.6s (54.7%)
66994176/10: 8.3s / 16.7s (101.2%)
83742720/10: 10.4s / 16.9s (62.5%)
100491264/10: 12.3s / 16.4s (33.3%)
117239808/10: 14.8s / 17.2s (16.2%)
134103040/10: 16.9s / 32.2s (90.5%)
150865920/10: 23.3s / 32.2s (38.2%)
167628800/10: 26.3s / 33.8s (28.5%)
184391680/10: 28.7s / 34.0s (18.5%)
201154560/10: 31.1s / 40.6s (30.5%)
217917440/10: 32.9s / 37.3s (13.4%)
234680320/10: 35.0s / 37.8s (8.0%)
251443200/10: 38.8s / 39.8s
268206080/1: 4.2s / 5.7s (35.7%)
335257600/1: 5.5s / 6.4s (16.4%)
402309120/1: 6.3s / 7.9s (25.4%)
469360640/1: 7.5s / 8.7s (16.0%)
536608768/1: 8.4s / 9.9s (17.9%)
603684864/1: 11.7s / 10.5s (-10.3%)
670760960/1: 13.3s / 12.6s (-5.3%)
737837056/1: 14.1s / 13.3s (-5.7%)
804913152/1: 15.7s / 15.9s
871989248/1: 17.2s / 16.6s
939065344/1: 17.9s / 17.5s
1006141440/1: 19.0s / 17.8s (-6.3%)
1073217536/1: 20.7s / 20.6s
1341521920/1: 26.8s / 25.3s (-5.6%)
1609826304/1: 30.8s / 31.2s
1878130688/1: 36.8s / 37.4s
2146959360/1: 41.1s / 41.9s

Bill.

2010/1/1 Bill Hart <goodwillh...@googlemail.com>:
> 2. FFT's for nitwits (like me)
> ====================
>
> I will explain why I am a nitwit in a later section. But to understand that, you need to understand some FFT basics first. So I describe them here.
>
> 2a. Description of the FFT
> ===================
>
> An FFT is not an integer multiplication routine per se, but it can be used to construct one.
>
> The first step in multiplying huge integers A and B is to view the problem as a polynomial multiplication. Write your large integers in some base D (usually D = 2^r for some r), i.e. write A = f1(D), B = f2(D) for polynomials f1 and f2 whose coefficients are in the range [0, D).
>
> Next, multiply the polynomials: g(x) = f1(x)*f2(x).
>
> Finally, evaluate the product polynomial at D, i.e. C = A*B = g(D) = f1(D)*f2(D).
>
> So from now on I'll talk about large integer multiplication as though it is polynomial multiplication.
>
> Polynomial multiplication is computed using convolution of the two polynomials. (Cyclic) convolution of polynomials is defined (up to scaling) as multiplication of the polynomials modulo x^n - 1 for some n. Note that if the degree of g is less than n then g(x) can be recovered precisely from the convolution.
>
> The FFT can be used to efficiently compute the convolution. Specifically, the convolution of two polynomials f1 and f2 can be computed as FFT^(-1)(FFT(f1).FFT(f2)), where the dot is the coordinate-wise product of vectors. The inverse of an FFT can be computed (up to scaling) by an IFFT. Thus for convolution modulo x^n - 1 where n = 2^k, the convolution can be computed as 2^(-k)*IFFT(FFT(f1).FFT(f2)), where all the operations are on vectors of coefficients. In particular the coordinate-wise product involves multiplying the transformed coefficients in pairs. These products are known as pointwise multiplications.
>
> You should view the FFT as evaluating the polynomial at n = 2^k different roots of unity, 1, z, z^2, ..., z^(2^k - 1), where z is a primitive 2^k-th root of unity. The pointwise products multiply such values, effectively giving you the product polynomial g (modulo x^n - 1) evaluated at each of these roots of unity. The coefficients of the polynomial g are retrieved by inverting the FFT process, i.e. by interpolating the vector of pointwise products.
>
> So how does the FFT actually work?
>
> We start with a polynomial modulo x^n - 1 where n = 2^k. To save on notation, let's write x^(2m) - 1 where m = 2^(k-1).
>
> We write x^(2m) - 1 = (x^m - 1)(x^m + 1). If we wish to multiply polynomials modulo x^(2m) - 1, our first step is to express our polynomials modulo x^m - 1 and x^m + 1. We then compute the products mod x^m - 1 and mod x^m + 1 and recombine using CRT.
>
> Let f(x) = b(x)*x^m + a(x), where the degrees of a(x) and b(x) are less than m. Then f(x) modulo x^m - 1 is just a(x) + b(x), i.e. we simply add coefficients in pairs. Similarly, modulo x^m + 1 the polynomial would be a(x) - b(x).
>
> The product of polynomials modulo x^m - 1 can then be computed recursively using the same algorithm, i.e. by a convolution modulo x^m - 1.
>
> To compute the product modulo x^m + 1 we make a substitution y = z*x, where z is a primitive 2^k-th root of unity. This transforms the problem into one modulo y^m - 1, which can again be done recursively using the same algorithm. Note that substituting y = z*x is equivalent to multiplying the coefficient of x^i in f(x) by z^i. So we twiddle the coefficients by roots of unity.
>
> A convolution mod x^m - 1 is called a cyclic convolution, and a convolution mod x^m + 1 is called a negacyclic convolution.
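[To make the polynomial view concrete, here is a toy C illustration of my own for these notes - it has nothing to do with the actual code in new_fft.c, and all the names in it are made up for the sketch. It splits two word-sized integers into base D = 2^16 coefficients, multiplies the coefficient polynomials with a naive O(n^2) convolution, and evaluates the product back at D. An FFT-based multiply does exactly this, except the convolution loop is replaced by the IFFT(FFT(f1).FFT(f2)) computation on zero-padded vectors; since the product has degree less than 2N, a cyclic convolution of length 2N recovers it exactly.

#include <stdio.h>
#include <stdint.h>

#define N 4               /* number of base-D coefficients per input */

int main(void)
{
    const uint64_t D = 1ULL << 16;

    /* A = f1(D), B = f2(D), with coefficients in [0, D) */
    uint64_t A = 0x123456789ABCDEFULL;
    uint64_t B = 0xFEDCBA987654321ULL;

    uint64_t f1[N], f2[N], g[2*N] = {0};
    for (int i = 0; i < N; i++) {
        f1[i] = (A >> (16*i)) & (D - 1);
        f2[i] = (B >> (16*i)) & (D - 1);
    }

    /* naive convolution: g(x) = f1(x)*f2(x) */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            g[i + j] += f1[i] * f2[j];

    /* evaluate g at D by Horner's rule; the arithmetic wraps mod 2^64,
       so this yields the low 64 bits of the product */
    uint64_t lo = 0;
    for (int i = 2*N - 1; i >= 0; i--)
        lo = (lo << 16) + g[i];

    printf("%016llx\n", (unsigned long long)lo);      /* these two lines */
    printf("%016llx\n", (unsigned long long)(A * B)); /* should agree    */
    return 0;
}

Note that the coefficients of g overflow the base: carry propagation when evaluating at D is what turns the polynomial product back into an integer.]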
> So the FFT algorithm works by breaking the cyclic convolution into a cyclic convolution of half the size and a negacyclic convolution of half the size, and transforming the negacyclic convolution into a cyclic convolution by twiddling by roots of unity.
>
> In practice the evaluation step (adding or subtracting coefficients in pairs) and the twiddle step for the negacyclic half are combined. This leads to the familiar FFT "butterfly", which replaces the left half of a vector of coefficients of f(x), [a0, a1, a2, ..., a{m-1}, b0, b1, ..., b{m-1}], with [a0+b0, a1+b1, ..., a{m-1}+b{m-1}] and the right half with [a0-b0, z*(a1-b1), z^2*(a2-b2), ..., z^{m-1}*(a{m-1}-b{m-1})]. The butterfly step can be conveniently written [a{i}, b{i}] => [a{i}+b{i}, z^i*(a{i}-b{i})].
>
> One then recursively computes the FFT butterfly of the left and right halves of the vector. Eventually one gets down to vectors of length 1, which represent polynomials modulo x - 1 (or really x - z^i for i = 0, ..., 2m-1). Note that taking a polynomial f(x) modulo x - z^i is the same thing as evaluating it at z^i. So what the FFT really computes is [f(1), f(z), f(z^2), ..., f(z^{2m-1})] (well, perhaps not in that order, but in some order - probably bit reversed order, i.e. z^i is swapped with z^j where i is the reverse of j in binary, when they are considered as binary numbers of log_2(2m) bits).
>
> This left-right FFT recursion scheme is known as a decimation in frequency (DIF) FFT.
>
> Note that multiplication of polynomials modulo x - 1 is just multiplication of single coefficients, i.e. integer multiplication. These are the pointwise multiplications we spoke of earlier. We simply multiply our vectors of coefficients pointwise.
>
> Now suppose I have a polynomial known modulo x^m - 1 and x^m + 1; how can I invert this process and obtain it modulo x^(2m) - 1? Specifically, if I have c{i} = a{i}+b{i} and d{i} = z^i*(a{i}-b{i}), how can I retrieve a{i} and b{i}? Clearly the answer is to multiply d{i} by z^{-i}; then adding it to c{i} yields 2*a{i} and subtracting it from c{i} yields 2*b{i}. Notice the extra factor of 2 which has crept in.
>
> So the IFFT butterfly step replaces [c0, d0, c1, d1, ..., c{m-1}, d{m-1}] with [c0+d0, c0-d0, c1+z^(-1)*d1, c1-z^(-1)*d1, ..., c{m-1}+z^{-(m-1)}*d{m-1}, c{m-1}-z^{-(m-1)}*d{m-1}]. Or we can write the butterfly conveniently as [c{i}, d{i}] => [c{i}+z^-i*d{i}, c{i}-z^-i*d{i}].
>
> Note we have paired the odd indexed entries in our vector with the even indexed entries. An FFT which has the butterfly step [c{i}, d{i}] => [c{i}+z^i*d{i}, c{i}-z^i*d{i}] (note the change in sign of the exponents on the twiddle factors to convert from an IFFT to an FFT) is called a decimation in time (DIT) FFT.
>
> So our IFFT is simply a DIT FFT with the sign of the exponent of the twiddle factors changed.
>
> Now, an important note for later is that the reason this all worked is that we can invert the FFT process. Specifically c{i}+z^-i*d{i} = (a{i}+b{i}) + (a{i}-b{i}) = 2*a{i}, and similarly we can retrieve 2*b{i}.
>
> One final note. These factors of 2 that creep in can be removed at any stage. They are sometimes removed at each butterfly step, sometimes all at once at the end of the IFFT.
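[Since it is easy to get the two butterflies muddled, here is a minimal self-contained C sketch of the pair - again only an illustration for these notes, using complex doubles rather than the ring arithmetic the new code uses, with function names invented for the sketch. It runs a recursive DIF FFT and then the matching DIT IFFT with negated twiddle exponents, recovering the input multiplied by n. Compile with gcc file.c -lm.

#include <stdio.h>
#include <math.h>
#include <complex.h>

#define N 8  /* transform length, a power of 2 */

/* DIF FFT: butterfly [a_i, b_i] -> [a_i + b_i, z^i*(a_i - b_i)],
   then recurse on the two halves with z^2. Output is bit-reversed. */
static void fft_dif(double complex *v, long n, double complex z)
{
    if (n == 1) return;
    long m = n / 2;
    double complex t = 1.0;  /* t = z^i */
    for (long i = 0; i < m; i++) {
        double complex a = v[i], b = v[m + i];
        v[i] = a + b;
        v[m + i] = t * (a - b);
        t *= z;
    }
    fft_dif(v, m, z * z);
    fft_dif(v + m, m, z * z);
}

/* DIT IFFT: recurse first, then butterfly
   [c_i, d_i] -> [c_i + z^-i*d_i, c_i - z^-i*d_i].
   Applied to the output of fft_dif it returns n times the input. */
static void ifft_dit(double complex *v, long n, double complex z)
{
    if (n == 1) return;
    long m = n / 2;
    ifft_dit(v, m, z * z);
    ifft_dit(v + m, m, z * z);
    double complex t = 1.0;  /* t = z^-i, as conj(z) = 1/z for |z| = 1 */
    for (long i = 0; i < m; i++) {
        double complex c = v[i], d = t * v[m + i];
        v[i] = c + d;
        v[m + i] = c - d;
        t *= conj(z);
    }
}

int main(void)
{
    double complex v[N];
    for (long i = 0; i < N; i++) v[i] = i + 1;  /* some test data */

    const double pi = acos(-1.0);
    double complex z = cexp(2 * pi * I / N);  /* primitive N-th root of 1 */
    fft_dif(v, N, z);
    ifft_dit(v, N, z);

    /* each entry should now be N times the original value */
    for (long i = 0; i < N; i++)
        printf("%.1f ", creal(v[i]) / N);
    printf("\n");
    return 0;
}

Note the structure: the DIF FFT butterflies before recursing, while the DIT IFFT recurses before butterflying, which is what lets it consume the bit-reversed output of the DIF transform directly.]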
>
> 2b. FFT algorithms
> ==============
>
> There are multiple different ways of implementing the FFT algorithm in practice, and a number of important additional algorithms used in multiplying integers using an FFT. I will describe them here.
>
> a) Complex FFT. One can use complex numbers and floating point arithmetic to multiply integers using the FFT. One has to be careful that at each step not too much precision is lost. For example, one could work with double precision floating point numbers.
>
> Naturally it is inefficient to use complex numbers, as the polynomials we are trying to multiply have only real coefficients. There are some strategies for getting around this. One can exploit the symmetries of the FFT to get FFT's which work on real values and take roughly half the time/space. For example the Fast Hartley Transform works with only real values, and there are various other variations on this theme.
>
> Note that the roots of unity here are complex values, or for the Fast Hartley Transform, real values which are linear combinations of sines and cosines. So computing the twiddle factors can be expensive.
>
> The biggest issue with real/complex FFT's is the loss of precision. They are really fast, as they can use the native floating point arithmetic of the processor; they just usually don't guarantee a correct result, and those that do are either not so fast or don't support very long transforms.
>
> b) Schonhage-Strassen FFTs. Schonhage and Strassen suggested working in the ring Z/pZ where p is of the form 2^m + 1. Here 2 is a 2m-th (2^k-th) root of unity: raise 2 to the power m and you see that you get -1 in this ring, thus 2^(2m) must be 1 in this ring.
>
> Note that multiplication by a root of unity then becomes multiplication by a power of 2 (modulo 2^m + 1), or a bitshift on a binary computer. Working in this ring is also convenient, as reduction modulo p can be done with a subtraction. In practice we can work in a ring Z/pZ where p = 2^(a*m) + 1 for some value a. Here 2^a is a 2m-th (2^k-th) root of unity. A ring of this form is called a Fermat ring and p a generalised Fermat number.
>
> Usually p is chosen to be a large number that takes up numerous machine limbs (it need not be prime for the algorithm to work). For performance reasons a and m are usually chosen so that reduction modulo p does not involve any bitshifts, i.e. so that a*m is a multiple of the number of bits in a limb.
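[To see the bitshift trick in action, here is a small standalone C demo of my own - an illustration, not MPIR code - working modulo the Fermat prime p = 2^16 + 1, where 2 is a 32nd root of unity. Multiplying by a power of 2 mod p needs only shifts, masks and adds/subtracts, because each 16-bit chunk of the shifted value contributes with alternating sign (2^16 = -1 mod p).

#include <stdio.h>
#include <stdint.h>

#define P ((1u << 16) + 1)  /* 2^16 + 1 = 65537, the Fermat prime F4 */

/* multiply x by 2^j modulo P without any multiplication:
   2^16 = -1 (mod P), so each 16-bit chunk of the shifted value
   contributes with alternating sign */
static uint32_t mul_pow2_mod(uint32_t x, unsigned j)
{
    uint64_t y = (uint64_t)x << j;  /* at most 17 + 32 = 49 bits here */
    int64_t acc = 0;
    int sign = 1;
    while (y) {
        acc += sign * (int64_t)(y & 0xFFFF);
        y >>= 16;
        sign = -sign;
    }
    acc %= P;
    if (acc < 0) acc += P;
    return (uint32_t)acc;
}

int main(void)
{
    /* 2 is a 32nd root of unity mod P: 2^16 = -1, 2^32 = 1 */
    printf("2^16 mod %u = %u\n", P, mul_pow2_mod(1, 16));  /* P - 1 */
    printf("2^32 mod %u = %u\n", P, mul_pow2_mod(1, 32));  /* 1 */

    /* agrees with an ordinary modular reduction */
    uint32_t x = 12345;
    unsigned j = 13;
    printf("%u\n", mul_pow2_mod(x, j));
    printf("%llu\n", (unsigned long long)(((uint64_t)x << j) % P));
    return 0;
}

The real thing applies the same identity limb-by-limb with mpn arithmetic, so a twiddle costs a shift and a subtraction rather than a multiplication.]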
> c) Mersenne Ring FFT. Work in a ring modulo p = 2^(b*m) - 1, again for m a power of 2 and b arbitrary. Here 2^b is an m-th root of unity, and reduction modulo p is an addition. Numbers p of this form are called generalised Mersenne numbers.
>
> d) Number Theoretic Transforms (NTT's). Work in a ring Z/pZ where p is a prime such that Z/pZ has lots of roots of unity. Usually p is chosen to be a small prime that fits in a machine word. We usually start with a prime p of the form p = a*n + 1 for some small value a and some larger value n (for example a power of 2, so that reduction mod p has a chance of being cheap). We find a primitive root r mod p, and then by Fermat's Little Theorem s = r^a (mod p) is a primitive n-th root of unity. Thus we can support a transform of length n on the ring Z/pZ.
>
> Many of the historically large integer products that have been computed, e.g. in computation of many digits of pi, involved NTT's. One can use a few of them, modulo different primes p (usually 2-5 of them), and combine using CRT.
>
> The problems with NTT's are twofold. Firstly, once one gets beyond about 5 primes, the CRT step becomes a significant proportion of the runtime. Also, in practice, NTT implementations don't compare favourably with optimised Schonhage-Strassen implementations, being an order of magnitude slower. On a 32 bit machine, the maximum transform length for a word sized prime can also be limiting for very large problems.
>
> The main drawback with NTT's is that multiplication by a root of unity is no longer a bit shift. One must also compute all the roots of unity, which usually means keeping some table of them handy. For very long transforms, the table itself may exceed the size of cache. Or one can partially compute the table and then perform some additional computations on-the-fly.
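[As a concrete instance of the p = a*n + 1 recipe - a standard textbook example, nothing specific to this code: the word-sized prime p = 998244353 = 119*2^23 + 1 has primitive root r = 3, so s = 3^119 mod p is a primitive 2^23-rd root of unity, supporting transforms of length up to 2^23. A quick C check:

#include <stdio.h>
#include <stdint.h>

/* x^e mod m by binary exponentiation */
static uint64_t powmod(uint64_t x, uint64_t e, uint64_t m)
{
    uint64_t r = 1;
    x %= m;
    while (e) {
        if (e & 1) r = r * x % m;
        x = x * x % m;
        e >>= 1;
    }
    return r;
}

int main(void)
{
    /* p = a*n + 1 with a = 119, n = 2^23; r = 3 is a primitive root */
    uint64_t p = 998244353, a = 119, n = 1u << 23, r = 3;

    uint64_t s = powmod(r, a, p);  /* s = r^a, a primitive n-th root of 1 */

    printf("s       = %llu\n", (unsigned long long)s);
    printf("s^n     = %llu (should be 1)\n",
           (unsigned long long)powmod(s, n, p));
    printf("s^(n/2) = %llu (should be p-1, i.e. -1)\n",
           (unsigned long long)powmod(s, n / 2, p));
    return 0;
}

Since the order of s divides n = 2^23, getting s^(n/2) = -1 rather than 1 confirms the order is exactly n.]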
>
> -----------------
>
> Now I will discuss some additional algorithms which one needs to know about.
>
> e) Multiplication of integers modulo B^m + 1, using the negacyclic convolution. Recall that the negacyclic convolution can be used to multiply polynomials modulo x^m + 1. If we evaluate such a polynomial at some point x = B then we can use a negacyclic convolution to multiply integers modulo B^m + 1. There's nothing special about using this strategy to multiply ordinary integers: a cyclic convolution will generally be faster, as one doesn't need the extra twiddle factors to turn the negacyclic convolution into a cyclic convolution (so that it can be computed using an FFT). But for multiplying integers modulo a number of the form B^m + 1 it is very useful.
>
> Note that the pointwise multiplications in our convolution algorithm are precisely multiplications modulo a number of the form B^m + 1, when we are working in a Fermat ring. Thus we can effect our pointwise multiplications using a negacyclic convolution. This allows the FFT to recurse down to an FFT for the pointwise multiplications (when they become sufficiently large). This is more efficient than multiplying using an ordinary cyclic convolution and then reducing modulo p, as a convolution of half the length can be used (you actually *want* it to wrap around). This can save a significant amount of time once the pointwise multiplications become so large that the use of an FFT is indicated. In fact, because the convolution will be half the size, a negacyclic convolution can be used *before* one would consider using an ordinary FFT routine to multiply the integers.
>
> That's all I need to discuss for this section, as I will explain some other algorithms in later sections as they are needed.
>
> Bill.
>
> 2010/1/1 Bill Hart <goodwillh...@googlemail.com>:
>> 1. Formal release of code.
>>
>> I have included two new files in the top source directory of mpir in svn. The first is new_fft_with_flint.c, the second new_fft.c. They are isomorphic except that the second does not need to be linked against FLINT to run. There will be no difference in timings or results; I just removed all the timing and test code which relied on FLINT.
>>
>> Here is the command I use to build the code once you have built MPIR:
>>
>> gcc -O2 new_fft.c -o fft -I/home/wbhart/mpir-trunk -L/home/wbhart/mpir-trunk/.libs -lmpir
>>
>> Modify as desired for your own system.
>>
>> Please note the usual disclaimer:
>>
>> This code is INCOMPLETE, INEFFICIENT, INCONSISTENT, often INCORRECT, ILL-TESTED and the original author is ILL-TEMPERED.
>>
>> It is LGPL v2+ licensed and I have taken extreme care to ensure that all the code was written from scratch by me without following someone else's work. The four functions FFT_split_* and FFT_combine_* were taken from FLINT, but to my knowledge were written entirely by me. I hereby release them under the LGPL v2+.
>>
>> The timing code does not print times; it merely completes a certain number of iterations and exits. Use the linux time command to get times.
>>
>> I have run very few tests on the multiplication routine, perhaps on the order of a few thousand integer multiplications, at very few sizes. This code is extremely new and should be treated as quite likely broken.
>>
>> Note that the negacyclic convolution is currently disabled. I don't even know if it passes its test code when switched on at present. I modified a number of the FFT's since writing it, so probably not. But it did pass its test code originally. However, note it does not deal with a couple of (trivial) special cases, so is incomplete and formally incorrect because of that.
>>
>> There is an implementation of the split radix FFT contained in the file too. I found the times were *identical* to those I got from the radix 2 code, so I abandoned it. There is also an old implementation of the radix 2 FFT/IFFT which I also abandoned. The basic radix 2 FFT function is called FFT_radix2_new.
>>
>> All the truncate, mfa and twiddle functions are new in that they don't rely on the old radix 2 implementation, I believe.
>>
>> Many of the code comments have been cut and pasted from earlier code they were derived from, so they will be incorrect. This is especially true of the comments for the FFT and IFFT functions (including the truncate and mfa functions), except for the radix2_new and split_radix functions, which I hope are about right.
>>
>> Bill.
>>
>> 2010/1/1 Bill Hart <goodwillh...@googlemail.com>:
>>> As the code for this FFT is quite involved, I plan to incrementally release a set of notes about it, starting tomorrow morning (today already), which may take hours to days for me to complete. I got the code working for the first time this afternoon, so please be patient with me.
>>>
>>> Here is the rough list of topics I hope to cover in the notes:
>>>
>>> Table of Contents
>>> =============
>>>
>>> 1) The formal code release. I need to check all the test code still passes and make it easy for people to build and run. That will take an hour or so for me to do.
>>>
>>> 2) FFT's for nitwits (like me). How does an FFT help in multiplying numbers, basically, and what are the important algorithms and techniques.
>>>
>>> 3) Why a new FFT? This is an important question which needs answering. I will discuss what led me to work on a new FFT implementation, rather than improve or port another implementation.
>>>
>>> 4) The strategy. What does this FFT actually do? Which algorithms does it use? I tried many things, most of which completely failed. In section 3 I will have already discussed other strategies I tried which did not work. Here I will discuss what did work. Whilst many of the details came from my head, I'm pretty sure the general outline will contain few surprises.
>>>
>>> 5) References. Few. I purposely tried to be creative.
>>>
>>> 8) Implementors notes. A long list of things that can be improved. In releasing the code publicly it is my hope that contributors will help make this the best FFT implementation it can be. The list will include many trivial items which anyone could work on, all the way through to fairly advanced topics which would probably require original research to do really well. I'll rank the items according to how hard I think they'd be to do.
>>>
>>> The code will be under revision control, so it is my hope that people will be willing to contribute to it. I believe I spent between 150 and 200 hours writing it, all told.
>>>
>>> Till tomorrow.
>>>
>>> Bill.
>>>
>>> 2010/1/1 Bill Hart <goodwillh...@googlemail.com>:
>>>> Ha! I forgot to mention, those numbers represent the size of the integers in bits. I multiply two different integers with the same number of bits, though that is not a limitation of the code. It will accept integers of different lengths and adjust accordingly.
>>>>
>>>> Bill.
>>>>
>>>> 2010/1/1 Bill Hart <goodwillh...@googlemail.com>:
>>>>> I re-timed that point and I believe I may have just mistranscribed it. Here it is with the corrected time. Thanks for pointing it out!
>>>>>
>>>>> I've also included some times from the FLINT FFT, though only non-truncated ones (i.e. where FLINT uses a convolution which is an exact power of 2 length - not a limitation of FLINT, I just didn't get around to timing it at the other points yet).
>>>>>
>>>>> 8364032: 6.8s / 8.0s
>>>>> 9409536: 9.8s / 8.7s
>>>>> 10455040: 10.8s / 9.6s
>>>>> 11500544: 12.3s / 12.6s
>>>>> 12546048: 12.7s / 15.3s
>>>>> 13591552: 14.0s / 14.2s
>>>>> 14637056: 14.9s / 15.3s
>>>>> 15682560: 16.0s / 14.5s
>>>>> 16728064: 16.8s / 19.5s
>>>>>
>>>>> 20910080: 24.3s / 23.2s
>>>>> 25092096: 28.5s / 33.1s
>>>>> 29274112: 33.1s / 31.8s
>>>>> 33456128: 37.4s / 47.7s
>>>>>
>>>>> 519168: iters 10000: 30.7s / 32.2s / 32.3s
>>>>> 2084864: iters 1000: 13.9s / 15.2s / 13.0s
>>>>> 8364032: iters 100: 6.8s / 8.0s / 7.6s
>>>>> 33497088: iters 100: 37.4s / 47.7s / 37.5s
>>>>> 134103040: iters 10: 17.0s / 24.8s / 19.6s
>>>>> 536608768: iters 1: 8.6s / 9.4s / 8.6s
>>>>> 2146959360: iters 1: 42.3s / 40.3s / 38.9s
>>>>>
>>>>> Bill.
>>>>>
>>>>> 2009/12/31 Bill Hart <goodwillh...@googlemail.com>:
>>>>>> It's either a tuning issue or timing instability on the machine.
>>>>>>
>>>>>> Bill.
>>>>>>
>>>>>> On 31/12/2009, Robert Gerbicz <robert.gerb...@gmail.com> wrote:
>>>>>>> Hi!
>>>>>>>
>>>>>>> How is it possible that multiplying longer numbers takes less time, for example:
>>>>>>>
>>>>>>> 12546048: 14.7s / 15.3s
>>>>>>> 13591552: 14.0s / 14.2s