3. Why a new FFT?

The timings I just posted probably justify writing a new FFT. However
that is not what concretely motivated me to write one.

I began thinking about writing a new FFT around the end of September
this year. At the time I was skimming through a huge number of papers
on the FFT for something else I was working on, and this gave me some
new ideas (some of which I haven't tried yet).

Of course when you want to try out completely new ideas, you very
often have to start from scratch. The main thing which motivated me to
start writing the current FFT a couple of weeks back was a desire to
try and write a radix 3 FFT. This has little to nothing in common with
an ordinary radix 2 FFT and thus I had to begin afresh.

Of course the radix 3 FFT didn't work out. More about that shortly.

Let me first discuss some of the issues a good FFT implementation has
to address.

1) Fast butterflies. The main part of the FFT and IFFT routines is the
butterfly. These need to be coded efficiently, otherwise the whole
thing slows down.

2) Cache locality. The standard iterative radix 2 FFT is very cache
unfriendly. It wants to access all of the coefficients about k times
where the length of the convolution is about 2^k. For decent sized
problems, each coefficient might be many thousands of bits, and you
might have thousands of them. So in effect you are drawing
coefficients from main memory, not cache, over and over. This is very
expensive.

The recursive FFT is much more cache friendly. It keeps halving the
size of the problem and concentrates completely on one half before
moving on to the other. This keeps things more localised and in
cache. However it has its limits. The top levels of the FFT may be
completely out of cache. For decent sized problems, maybe half of the
layers are completely out of cache. Thus making the FFT recursive
only fixes half the problem. (A toy sketch of the butterfly and this
recursive structure appears just after point 4 below.)

3) Truncation. If I multiply integers of different lengths, or even
integers of the same length but which don't break nicely into 2^k
coefficients of 2^{k-1} bits each, then the only way to multiply such
integers with a standard FFT (whose length must be a power of 2) is to
zero pad the integers, i.e. set a whole lot of the FFT coefficients to
zero. This is completely inefficient and can cause the computation to
take up to twice as long (e.g. if your integers are just a little too
large to multiply using an FFT of a given length and you have to go up
to one of twice the size).

4) Negacyclic convolution for large pointwise multiplications. As
explained in an earlier section, when the pointwise multiplications
themselves become so large that you want to use an FFT, you should use
the negacyclic convolution, as it will take about half the time.
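	
To make points 1) and 2) concrete, here is a toy sketch of a recursive
decimation in frequency (DIF) FFT over Z/pZ with the Fermat prime
p = 2^16+1, where 2 is a 32nd root of unity. This is purely my own
illustration, not code from the new FFT: the real thing works on
multi-limb coefficients, and the twiddles there are bit shifts rather
than the mulmods used here, but the butterfly and the recursive
halving have exactly the same shape.

#include <stdint.h>
#include <stdio.h>

#define P 65537ULL /* 2^16 + 1; 2 has multiplicative order 32 mod P */

/* Recursive DIF FFT of length n = 2^j <= 32, where w is a primitive
   n-th root of unity mod P. Output is in bit reversed order, which is
   fine for convolutions. */
static void fft_dif(uint64_t *a, size_t n, uint64_t w)
{
    if (n == 1) return;

    size_t m = n / 2;
    uint64_t wi = 1;

    for (size_t i = 0; i < m; i++) /* the butterflies */
    {
        uint64_t x = a[i], y = a[i + m];
        a[i]     = (x + y) % P;              /* left half: f mod x^m - 1 */
        a[i + m] = (x + P - y) % P * wi % P; /* right half: twiddled for x^m + 1 */
        wi = wi * w % P;
    }

    fft_dif(a, m, w * w % P);     /* finish one half completely ... */
    fft_dif(a + m, m, w * w % P); /* ... before touching the other (cache) */
}

int main(void)
{
    uint64_t a[8] = {1, 2, 3, 4, 0, 0, 0, 0};
    fft_dif(a, 8, 16); /* 16 = 2^4 is a primitive 8th root of unity mod P */
    for (int i = 0; i < 8; i++)
        printf("%llu ", (unsigned long long) a[i]);
    printf("\n");
    return 0;
}

In the multi-limb Schonhage-Strassen setting each a[i] is a residue
mod 2^N+1, the additions and subtractions are mpn_add_n / mpn_sub_n
style operations, and multiplying by a power of 2 is a shift, so each
butterfly is only a handful of linear-time passes over the
coefficient.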

Originally I had intended to port the FLINT FFT to MPIR. David Harvey
and I worked on that for a very long time, then at some point I got
interested in working on polynomial arithmetic instead of FFTs and
David went on and developed the FFT in its current form in FLINT. I
won't say I don't know how it works, but I certainly am not too
familiar with all the code.

The FLINT FFT does deal with all of the above issues. A lot of care
was taken with the butterflies, cache locality was improved with
Bailey's 4 step (an infinite dimensional version), truncation was
taken care of with van der Hoeven's truncated FFT/IFFT and there was a
negacyclic convolution (much more sophisticated than mine).

So porting the FLINT FFT to MPIR certainly looked attractive. Even so,
I had some problems with that. The code was quite complex, which made
it difficult to check, understand, port, maintain and improve. That
didn't pose an insurmountable problem, but it is what put me off
porting it. It also wasn't written with inclusion in MPIR in mind (it
was written before MPIR existed), so much of it would have to be
rewritten anyhow. It was spread over multiple files, which is great
for FLINT, but not so comfortable for MPIR. So, whilst tempted, I
decided not to port the FLINT FFT. The main thing that tipped the
balance though was the desire to try some different ideas.

So why not just improve the MPIR FFT? It's a great implementation, the
paper about it is very nice and we are extremely grateful to the
authors of it for making it available so that we can use it.

But the MPIR FFT has some serious problems, and this is really what
set me off with the idea to try and come up with something better.

1) It is insanely hard to tune. There are so many tuning parameters
and tuning takes hours and hours per architecture. We still haven't
tuned it for some old architectures, let alone the ones we don't have
access to. Tuning it on Windows appears to be near impossible. This
point is underlined by the crazy times I just posted. For integers of
some sizes it picks totally ridiculous parameters and ends up taking
much longer than if it were multiplying shorter integers.

2) The MPIR FFT uses both a Fermat FFT and a Mersenne FFT, i.e. it
does a multiplication of polynomials with coefficients mod p with p =
2^{aN}+1 then one with coefficients mod p = 2^N-1. Then it CRT's the
results. By varying the parameter _a_, as well as the FFT length, it
is possible to get quite a lot of control over the zero padding
wastage. This is essentially their strategy for dealing with issue 3
above.
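	
As an aside, the CRT recombination of a Fermat-ring result and a
Mersenne-ring result is itself cheap. Here is a toy single-word
illustration of my own of the kind of recombination involved (I use
the same exponent M for both moduli and M = 8, which need not match
what MPIR actually does internally; the real residues are multi-limb):

#include <stdint.h>
#include <stdio.h>

/* Toy CRT: recover x mod 2^(2M)-1 from x mod 2^M-1 and x mod 2^M+1.
   Since 2^M+1 = 2 mod 2^M-1, and the inverse of 2 mod 2^M-1 is
   2^(M-1), the inverse needed by the CRT is just a power of 2. */
int main(void)
{
    const uint64_t M = 8;
    const uint64_t me = (1ULL << M) - 1; /* Mersenne modulus, 255 */
    const uint64_t fe = (1ULL << M) + 1; /* Fermat modulus, 257 */

    uint64_t x  = 54321;               /* pretend this is the true result */
    uint64_t r1 = x % me, r2 = x % fe; /* what the two convolutions give us */

    uint64_t t = (r1 + me - r2 % me) % me * (1ULL << (M - 1)) % me;
    uint64_t recovered = r2 + fe * t;  /* = x mod 2^(2M)-1 */

    printf("x = %llu, recovered = %llu\n",
           (unsigned long long) x, (unsigned long long) recovered);
    return 0;
}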

In their paper:

http://www.loria.fr/~gaudry/publis/issac07.pdf

they claim on page 171:

"a Mersenne transform can use a larger FFT length K = 2^k than the
corresponding Fermat transform. Indeed, while K must divide N for the
Fermat transform, so that theta = 2N/K is a power of two, it only
needs to divide 2N for the Mersenne transform, so that omega =
2^(2N/K) is a power of two. This improves the efficiency for K near
sqrt(N), and enables one to use a value of K close to optimal."

However I do not understand this argument. Clearly in a ring mod 2^M+1
we have that 2 is a 2M-th root of unity. So convolutions of length 2M
can be supported. However in a ring mod 2^M-1 we have that 2 is only
an M-th root of unity and so only convolutions of length M can be
supported.
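	
A quick numeric sanity check on those orders (my own toy example with
M = 8, obviously far smaller than anything used in practice):

#include <stdint.h>
#include <stdio.h>

/* Print the multiplicative order of 2 mod 2^M+1 and mod 2^M-1 for
   M = 8. Expected output: 16 (= 2M) in the Fermat ring and 8 (= M) in
   the Mersenne ring. */
static uint64_t order_of_2(uint64_t p)
{
    uint64_t x = 2 % p, i = 1;
    while (x != 1) { x = 2 * x % p; i++; }
    return i;
}

int main(void)
{
    const uint64_t M = 8;
    printf("order of 2 mod 2^M+1 = %llu\n",
           (unsigned long long) order_of_2((1ULL << M) + 1));
    printf("order of 2 mod 2^M-1 = %llu\n",
           (unsigned long long) order_of_2((1ULL << M) - 1));
    return 0;
}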

So let's work through an example (the one given earlier in their paper
when discussing efficiency):

Fermat ring:
  N = 1,044,480 bits
  break into K = 2^10 chunks (i.e. k = 10)
  2N/K + k = 2050 bits per output coefficient
  coefficients must be a multiple of 1024 bits
  must use coefficients of 3*1024 = 3072 bits

Mersenne ring:
  N = 1,044,480 bits
  break into K = 2^10 chunks (i.e. k = 10)
  2N/K + k = 2050 bits per output coefficient
  coefficients must be a multiple of 2048 bits
  must use coefficients of 2*2048 = 4096 bits

So for convolutions of the same length, the efficiency actually goes
down for a Mersenne ring, not up. I'm probably missing something, but
I don't get the argument.

The other issue is this. Suppose we CRT a convolution with
coefficients mod 2^M-1 and one with coefficients mod 2^M+1. Even if
the Mersenne convolution uses the longest length it can support, the
Fermat ring (if it uses a convolution of the same length) is then
constrained to use a convolution of half the maximum length it can
support, i.e. the Mersenne FFT constrains the efficiency of the
Fermat FFT. So if we set a = 1 in MPIR, we automatically lose
something.

However, when multiplying smaller integers, this loss of efficiency
is actually offset by something. Had we used only a single FFT and
not CRT'd two FFT's together, we would be forced to do pointwise
multiplications of twice the size. Now if those pointwise
multiplications take up much of the time and use schoolbook
multiplication instead of an FFT, then their time is essentially
quadratic in the size, so halving the size of each pointwise
multiplication roughly quarters its cost. So I think that in fact the
reason the MPIR FFT works at all well is that it can use pointwise
multiplications of half the size.

There is no advantage to using a = 2 in MPIR. In that situation one
could just use a single FFT with coefficients of three times the size
and then break 2^{3B}+1 up into 2^B+1 and 2^{2B}-2^B+1 and CRT
together (this is just the factorisation x^3 + 1 = (x + 1)(x^2 - x + 1)
with x = 2^B). In the schoolbook range this is more efficient and so
there is no gain for pointwise multiplications in MPIR's FFT.

Well, I could carry the analysis through to an algorithm which *does*
work. But I'll leave that for the next iteration of our
implementation. Basically there is a strategy which will also beat
MPIR by a long way when the multiplications are small.

Another strategy I thought about was to do a negacyclic FFT and a
cyclic FFT of a different size and CRT together. For example one could
do one FFT mod x^m-1 with coefficients mod 2^B+1 and one FFT mod
x^{2m}+1 with coefficients mod 2^B+1. The issue with this strategy is
that to do a negacyclic FFT one first needs to twiddle all the
coefficients, so it is strictly less efficient than a cyclic FFT. But
by far the major killer is that twice as many roots of unity are
needed to effect a negacyclic FFT of the same length as are needed for
a cyclic FFT. This again constrains the efficiency. One cannot do the
cyclic FFT mod x^{2m}-1 and the negacyclic one mod x^m+1 as these two
polynomials are not coprime and thus one cannot CRT. So this
possibility went on the scrapheap of ideas.

Another strategy I tried (and fully coded) was the split radix FFT.
Roughly speaking, one starts mod x^{4m}-1 and splits into x^m-1,
x^m+1 and x^{2m}+1. Thus one branch of the recursion becomes a half
length transform (mod x^{2m}+1) whilst the remainder splits into two
quarter sized FFT's (mod x^m-1 and x^m+1).

I was utterly amazed to find that when I had optimised both a radix 2
and a split radix FFT to the same degree, the timings I got were
identical. When using complex FFT's the split radix FFT actually has
a lower operation count (fewer roots of unity to apply), but the
operation count in a Fermat ring is the same. I had hoped that at
least the overheads would be lower, that more combining of operations
could be done using the new mpn functions in MPIR, or that maybe the
cache locality would be improved. But nada. Nothing. Zippo. Zilch. So
I abandoned this idea too.

The next idea I had was to handle the truncation issue by using radix
2 and radix 3 steps. So long as you have 3^j-th roots of unity (or so
I thought) you can perform a radix 3 FFT. My strategy was to do all
the top levels of the FFT at radix 2, cutting the length of the FFT
in half with each recursion. Once the roots of unity corresponded to
shifts by a multiple of a limb, I would switch to radix 3. I reasoned
that you need fewer layers of radix 3 to get down to the pointwise
mults than you would need of radix 2. Because you can use both radix
2 and radix 3 steps, your FFT could be any length of the form
2^k*3^j. That effectively solved the truncation problem.

One can manufacture cube roots of unity by working in a ring mod
2^{3B}+1. Since 2^{3B} = -1 in such a ring, 2^{2B} has order 3, so
here I could try using it as a cube root of unity.

I went ahead and implemented the *entire thing*, including a kind of
Bailey's 4-step for cache locality. I even got timings for it, and
they were very good. Only one problem. When I multiplied integers with
it, I got the wrong answer. Coding bug? No!

The problem is this. When doing a radix 3 FFT, everything proceeds as
per a radix 2 FFT for the FFT side of things. And it works. It
evaluates at 3^j-th roots of unity.

However, suppose you know a polynomial mod x^n-1, x^n-w and x^n-w^2,
where w is a cube root of unity. If your polynomial mod x^{3n}-1 is
a(x)+b(x)*x^n+c(x)*x^{2n} you will have d1 = a+b+c, d2 = a+w*b+w^2*c
and d3 = a+w^2*b+w*c.

Now how would I retrieve a, b and c? Clearly I can get 3*a by adding
d1, d2 and d3....

Or can I?

This is actually where things fall down. In order to get 3*a I need
that b + w*b + w^2*b = 0 and c + w^2*c+w*c = 0. In other words, I need
that 1 + w + w^2 = 0.

Well, when w is a primitive complex cube root of unity, this holds.
But in my ring mod 2^{3B}+1 it doesn't. Go ahead, try it. It doesn't
work.

So in order to do a radix 3 FFT one actually needs to work in Z/pZ
where p = 2^{2B}+2^B+1. Then the roots of unity work as desired.
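	
If you want to see the failure (and the fix) with actual numbers,
here is a tiny check of my own with B = 2. Note that in Z/(2^{3B}+1)
the element of order 3 is 2^{2B} (since 2^{3B} = -1 there), and even
that genuine cube root of unity fails the relation, whereas in
Z/(2^{2B}+2^B+1) the value 2^B satisfies it exactly:

#include <stdint.h>
#include <stdio.h>

/* With B = 2: p1 = 2^(3B)+1 = 65, p2 = 2^(2B)+2^B+1 = 21.
   In Z/65 the cube root of unity w = 2^(2B) = 16 gives 1+w+w^2 = 13,
   not 0, so the radix 3 IFFT cannot recover the coefficients.
   In Z/21 the cube root of unity w = 2^B = 4 gives 1+w+w^2 = 0. */
static void check(uint64_t p, uint64_t w)
{
    printf("mod %llu: w = %llu, w^3 = %llu, 1+w+w^2 = %llu\n",
           (unsigned long long) p, (unsigned long long) w,
           (unsigned long long) (w * w % p * w % p),
           (unsigned long long) ((1 + w + w * w) % p));
}

int main(void)
{
    const uint64_t B = 2;
    check((1ULL << (3 * B)) + 1, 1ULL << (2 * B));         /* 65, 16 */
    check((1ULL << (2 * B)) + (1ULL << B) + 1, 1ULL << B); /* 21, 4  */
    return 0;
}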

But now there is a problem. The binary representation of p has 3
binary 1's. That means that to reduce mod p takes two subtractions,
not 1. In other words, the amount of time doing butterflies is
basically doubled. It's worse than that though. We are now working mod
a value p which is 2/3 of the size of what we wanted. Not only is that
so, but each butterfly has 6 operations instead of the 2 that we had
for a radix 2 FFT. In every way, radix 3 is useless in Fermat rings
(it's fine in rings of characteristic 2 and other rings where 2 is not
invertible).

So despite all the effort I put into this, it too went on the
scrapheap. And I realised I was a nitwit.

But by now I already had too much code to throw it in the garbage. So
I sat down and thought about how else I might solve the truncation
problem. By this stage I had already got something efficient to solve
all the other issues (except for the negacyclic convolution, which is
only important for really huge multiplications).

So that is why I pushed on and wrote the current FFT.

In the next section I will describe what I actually ended up doing.
It's not too surprising given the options actually available.

Bill.

2010/1/1 Bill Hart <goodwillh...@googlemail.com>:
> 2. FFT's for nitwits (like me)
> ====================
>
> I will explain why I am a nitwit in a later section. But to understand
> that, you need to understand some FFT basics first. So I describe them
> here.
>
> 2a. Description of the FFT
> ===================
>
> An FFT is not an integer multiplication routine per se, but it can be
> used to construct one.
>
> The first step in multiplying huge integers A and B is to view the
> problem as a polynomial multiplication. Write your large integers in
> some base D (usually D = 2^r for some r), i.e. write A = f1(D), B =
> f2(D) for polynomials f1 and f2 whose coefficients are in the range
> [0, D).
>
> Next, multiply the polynomials g(x) = f1(x)*f2(x).
>
> Finally, evaluate the product polynomial at D, i.e. C = A*B = f1(D)*f2(D).
>
> So from now on I'll talk about large integer multiplication as though
> it is polynomial multiplication.
>
> Polynomial multiplication is computed using convolution of the two
> polynomials. (Cyclic) convolution of polynomials is defined (up to
> scaling) as multiplication of the polynomials modulo x^n - 1 for some
> n. Note that if the degree of g is less than n then g(x) can be
> recovered precisely from the convolution.
>
> The FFT can be used to efficiently compute the convolution.
> Specifically convolution of two polynomials f1 and f2 can be computed
> as FFT^(-1)(FFT(f1).FFT(f2)) where the dot is the coordinate-wise
> product of vectors. The inverse of an FFT can be computed (up to
> scaling) by an IFFT. Thus for convolution modulo x^n - 1 where n =
> 2^k, the convolution can be computed as 2^(-k)*IFFT(FFT(f1).FFT(f2)),
> where all the operations are on vectors of coefficients. In particular
> the coordinate-wise product involves multiplication of coefficients
> which are in [0,D). These products are known as pointwise
> multiplications.
>
> You should view the FFT as evaluating the polynomial at n = 2^k
> different roots of unity, 1, z, z^2, ...., z^{2^k - 1}, where z is a
> primitive 2^k-th root of unity. The pointwise products multiply such
> values, effectively giving you the product polynomial g (modulo x^n -
> 1) evaluated at each of these roots of unity. The coefficients of the
> polynomial g are retrieved by inverting the FFT process, i.e. by
> interpolating the vector of pointwise products.
>
> So how does the FFT actually work?
>
> We start with a polynomial modulo x^n - 1 where n = 2^k. To save on
> notation, let's write x^(2m) - 1 where m = 2^(k-1).
>
> We write x^(2m) - 1 = (x^m - 1)(x^m + 1). If we wish to multiply
> polynomials modulo x^(2m) - 1, our first step is to express our
> polynomials modulo x^m - 1 and x^m + 1. We then compute the products
> mod x^m - 1 and mod x^m + 1 and recombine using CRT.
>
> Let f(x) = b(x)*x^m + a(x) where the degrees of a(x) and b(x) are less
> than m, then f(x) modulo x^m - 1 is just a(x) + b(x), i.e. we simply
> add coefficients in pairs. Similarly modulo x^m + 1 the polynomial
> would be a(x) - b(x).
>
> The product of polynomials modulo x^m - 1 can then be computed
> recursively using the same algorithm, i.e. by a convolution modulo
> x^m - 1.
>
> To compute the product modulo x^m + 1 we make a substitution y = z*x
> where z is a primitive 2^k-th root of unity. This transforms the
> problem into one modulo y^m - 1, which can again be done recursively
> using the same algorithm. Note that substituting y = z*x is equivalent
> to multiplying the coefficient x^i of f(x) by z^i. So we twiddle the
> coefficients by roots of unity.
>
> A convolution mod x^m - 1 is called a cyclic convolution, and a
> convolution mod x^m + 1 is called a negacyclic convolution. So the FFT
> algorithm works by breaking the cyclic convolution into a cyclic
> convolution of half the size and a negacyclic convolution of half the
> size and transforming the negacyclic convolution into a cyclic
> convolution by twiddling by roots of unity.
>
> In practice the evaluation step, adding or subtracting coefficients in
> pairs, and the twiddle step for the negacyclic half, are combined.
> This leads to the familiar FFT "butterfly" which replaces the left
> half of a vector of coefficients of f(x), [a0, a1, a2, ...., a{m-1},
> b0, b1, ..., b{m-1}] with [a0+b0, a1+b1, ...., a{m-1}+b{m-1}] and the
> right half by [a0-b0, z(a1-b1), z^2*(a2-b2),...,
> z^{m-1}*(a{m-1}-b{m-1})]. The butterfly step can be conveniently
> written [a{i}, b{i}] => [a{i}+b{i}, z^i*(a{i}-b{i})].
>
> One then recursively computes the FFT butterfly of the left and right
> halves of the vector. Eventually one gets down to vectors of length 1
> which represent polynomials modulo x-1 (or really x-z^i for i =
> 0,..,2m). Note that taking a polynomial f(x) modulo x - z^i is the
> same thing as evaluating it at z^i. So what the FFT really computes is
> [f(1), f(z), f(z^2), ...., f(z^{2m})] (well perhaps not in that order,
> but in some order - probably bit reversed order, i.e. z^i is swapped
> with z^j where i is the reverse of j in binary, when they are
> considered as binary numbers of log_2(2m) bits).
>
> This left-right FFT recursion scheme is known as a decimation in
> frequency (DIF) FFT.
>
> Note that multiplication of polynomials modulo x-1 is simply integer
> multiplication. These are the pointwise multiplications we spoke of
> earlier. We simply multiply our vectors of coefficients pointwise.
>
> Now suppose I have a polynomial known modulo x^m - 1 and x^m + 1, how
> can I invert this process and obtain it modulo x^(2m) - 1?
> Specifically, if I have c{i} = a{i}+b{i} and d{i} = z^i*(a{i}-b{i}),
> how can I retrieve a{i} and b{i}. Clearly the answer is to multiply
> d{i} by z^{-i}, then adding it to c{i} yields 2*a{i} and subtracting
> it from c{i} yields 2*b{i}. Notice the extra factor of 2 which has
> crept in.
>
> So the IFFT butterfly step replaces [c0, d0, c1, d1, ...., c{m-1},
> d{m-1}] with [c0+d0, c0-d0, c1+z^(-1)*d1, c1-z^(-1)*d1, ....,
> c{m-1}+z^{-(m-1)}*d{m-1}, c{m-1}-z^{-(m-1)}*d{m-1}]. Or we can write
> the butterfly conveniently as [c{i}, d{i}] => [c{i}+z^-i*d{i},
> c{i}-z^-i*d{i}].
>
> Note we have paired the odd indexed entries in our vector with the
> even indexed entries. An FFT which has the butterfly step [c{i}, d{i}]
> => [c{i}+z^i*d{i}, c{i}-z^i*d{i}] (note the change in sign of the
> exponents on the twiddle factors to convert from an IFFT to an FFT) is
> called a decimation in time (DIT) FFT.
>
> So our IFFT is simply a DIT FFT with the sign of the exponent of the
> twiddle factors changed.
>
> Now, an important note for later is that the reason this all worked is
> that we can invert the FFT process. Specifically c{i}+z^-i*d{i} =
> a{i}+b{i} + a{i}-b{i} = 2*a{i}, and similarly we can retrieve 2*b{i}.
>
> One final note. These factors of 2 that creep in can be removed at any
> stage. They are sometimes removed at each butterfly step, sometimes
> all at once at the end of the IFFT.
>
> 2b. FFT algorithms
> ==============
>
> There are multiple different ways of implementing the FFT algorithm in
> practice, and a number of important additional algorithms used in
> multiplying integers using an FFT. I will describe them here.
>
> a) Complex FFT. One can use complex numbers and floating point
> arithmetic to multiply integers using the FFT. One has to be careful
> that at each step not too much precision is being lost. For example,
> one could work with double precision floating point numbers.
>
> Naturally it is inefficient to use complex numbers, as the polynomials
> we are trying to multiply have only real coefficients. There are some
> strategies for getting around this. One can exploit the symmetries of
> the FFT to get FFT's which work on real values and take roughly half
> the time/space. For example the Fast Hartley Transform works with only
> real values, and there are various other variations on this theme.
>
> Note that roots of unity here are complex values, or for the fast
> Hartley transform, real values which are linear combinations of sines
> and cosines. So computing the twiddle factors can be expensive.
>
> The biggest issue with real/complex FFT's is the loss of precision.
> They are really fast as they can use the native floating point
> arithmetic of the processor, they just usually don't guarantee a
> correct result, and aren't so fast or don't support very long
> transforms if they do.
>
> b) Schonhage-Strassen FFTs. Schonhage and Strassen suggested working
> in the ring Z/pZ where p is of the form 2^m+1. Here 2 is a 2m-th
> (2^k-th) root of unity. Simply compute 2^m and you see that you get -1
> in this ring, thus 2^(2m) must be 1 in this ring.
>
> Note that multiplication by a root of unity then becomes
> multiplication by a power of 2 (modulo 2^m + 1), or a bitshift on a
> binary computer. Also working in this ring is convenient, as reduction
> modulo p can be done with a subtraction. In practice we can work in a
> ring Z/pZ where p = 2^(a*m)+1 for some value a. Here 2^a is a 2m-th
> (2^k-th) root of unity. A ring of this form is called a Fermat ring
> and p a generalised Fermat number.
>
> Usually p is chosen to be a large prime that takes up numerous machine
> limbs. For performance reasons a and m are usually chosen so that
> reduction modulo
> p does not involve any bitshifts, i.e. 2^(a*m) is a multiple of a limb shift.
>
> c) Mersenne Ring FFT. Work in a ring modulo p = 2^(b*m) - 1, again for
> m a power of 2 and b arbitrary. Here 2^b is an m-th root of unity,
> reduction modulo p is an addition. Numbers p of this form are called
> generalised Mersenne numbers.
>
> d) Number Theoretic Transforms (NTT's). Work in a ring Z/pZ where p is
> a prime such that Z/pZ has lots of roots of unity. Usually p is chosen
> to be a small prime that fits in a machine word. We usually start with
> a prime p of the form p = a*n + 1 for some small value a and some
> larger value n (for example a power of 2 so that reduction mod p has a
> chance of being cheap). We find a primitive root r mod p, and then s =
> r^a (mod p) by Fermat's Little Theorem is a primitive n-th root of
> unity. Thus we can support a transform of length n on the ring Z/pZ.
>
> Many of the historically large integer products that have been
> computed, e.g. in computation of many digits of pi, involved NTT's.
> One can use a few of them, modulo different primes p (usually 2-5 of
> them) and combine using CRT.
>
> The problems with NTT's are twofold. Firstly, once one gets beyond
> about 5 of them, it becomes a significant proportion of the runtime to
> do the CRT step. Also, in practice, NTT implementations don't compare
> favourably with optimised Schonhage-Strassen implementations, being an
> order of magnitude slower. On a 32 bit machine, the
> maximum transform length for a word sized prime can also be limiting
> for very large problems.
>
> The main drawback with NTT's is that multiplication by a root of unity
> is no longer a bit shift. One must also compute all the roots of
> unity, which usually means keeping some table of them handy. For very
> long transforms, the table itself may exceed the size of cache. Or one
> can partially compute the table and then perform some additional
> computations on-the-fly.
>
> -----------------
>
> Now I will discuss some additional algorithms which one needs to know about.
>
> e) Multiplication of integers modulo B^m + 1, using the negacyclic
> convolution. Recall that the negacyclic convolution can be used to
> multiply polynomials modulo x^n+1. If we evaluate such a polynomial at
> some point x = B then we can use a negacyclic convolution to multiply
> integers modulo B^n + 1. There's nothing special about using this
> strategy to multiply integers. A cyclic convolution will generally be
> faster, as one doesn't need the extra twiddle factors to turn the
> negacyclic convolution into a cyclic convolution (so it can be
> computed using an FFT). But for multiplying integers modulo a number
> of the form B^n + 1 it is very useful.
>
> Note that the pointwise multiplications in our convolution algorithm
> are precisely multiplications modulo a number of the form B^m + 1,
> when we are working in a Fermat ring. Thus we can effect our pointwise
> multiplications using a negacyclic convolution. This allows the FFT to
> recurse down to an FFT for the pointwise multiplications (when they
> become sufficiently large). This is more efficient than multiplying
> using an ordinary cyclic convolution and then reducing modulo p, as a
> convolution of half the length can be used (you actually *want* it to
> wrap around). This can save a significant amount of time once the
> pointwise multiplications become so large that the use of an FFT is
> indicated. In fact, because the convolution will be half the size, a
> negacyclic convolution can be used *before* one would consider using
> an ordinary FFT routine to multiply the integers.
>
> That's all I need to discuss for this section, as I will explain some
> other algorithms in later sections as they are needed.
>
> Bill.
>
> 2010/1/1 Bill Hart <goodwillh...@googlemail.com>:
>> 1. Formal release of code.
>>
>> I have included two new files in the top source directory of mpir in
>> svn. The first is new_fft_with_flint.c, the second new_fft.c. They are
>> isomorphic except that the second does not need to be linked against
>> FLINT to run. There will be no difference in timings or results, I
>> just removed all the timing and test code which relied on FLINT.
>>
>> Here is the command I use to build the code once you have built MPIR:
>>
>> gcc -O2 new_fft.c -o fft -I/home/wbhart/mpir-trunk
>> -L/home/wbhart/mpir-trunk/.libs -lmpir
>>
>> Modify as desired for your own system.
>>
>> Please note the usual disclaimer:
>>
>> This code is INCOMPLETE, INEFFICIENT, INCONSISTENT, often INCORRECT,
>> ILL-TESTED and the original author is ILL-TEMPERED.
>>
>> It is LGPL v2+ licensed and I have taken extreme care to ensure that
>> all the code was written from scratch by me without following someone
>> else's work. The four functions FFT_split_* and FFT_combine_* were
>> taken from FLINT, but to my knowledge were written entirely by me. I
>> hereby release them under the LGPL v2+.
>>
>> The timing code does not print times, it merely completes a certain
>> number of iterations and exits. Use the linux time command to get
>> times.
>>
>> I have run very few tests on the multiplication routine, perhaps on
>> the order of a few thousand integer multiplications, at very few
>> sizes. This code is extremely new and should be treated as quite
>> likely broken.
>>
>> Note that the negacyclic convolution is currently disabled. I don't
>> even know if it passes its test code when switched on at present. I
>> modified a number of the FFT's since writing it, so probably not. But
>> it did pass its test code. However note it does not deal with a
>> couple of (trivial) special cases, so is incomplete and formally
>> incorrect because of that.
>>
>> There is an implementation of the split radix FFT contained in the
>> file too. I found the times were *identical* to those I got from the
>> radix 2 code, so I abandoned it. There is also an old implementation
>> of the radix 2 FFT/IFFT which I also abandoned. The basic radix 2 FFT
>> function is called FFT_radix2_new.
>>
>> All the truncate, mfa and twiddle functions are new in that they don't
>> rely on the old radix 2 implementation, I believe.
>>
>> Many of the code comments have been cut and pasted from earlier code
>> they were derived from, so will be incorrect. This is especially true
>> of the comments for the FFT and IFFT functions (including the truncate
>> and mfa functions), except for the radix2_new and split_radix
>> functions which I hope are about right.
>>
>> Bill.
>>
>> 2010/1/1 Bill Hart <goodwillh...@googlemail.com>:
>>> As the code for this FFT is quite involved, I plan to incrementally
>>> release a set of notes about it, starting tomorrow morning (today
>>> already), which may take hours to days for me to complete. I got the
>>> code working for the first time this afternoon, so please be patient
>>> with me.
>>>
>>> Here is the rough list of topics I hope to cover in the notes:
>>>
>>> Table of Contents
>>> =============
>>>
>>> 1) The formal code release. I need to check all the test code still
>>> passes and make it easy for people to build and run. That will take an
>>> hour or so for me to do.
>>>
>>> 2) FFT's for nitwits (like me). How does an FFT help in multiplying
>>> numbers, basically, and what are the important algorithms and
>>> techniques.
>>>
>>> 3) Why a new FFT? This is an important question which needs answering.
>>> I will discuss what led me to work on a new FFT implementation, rather
>>> than improve or port another implementation.
>>>
>>> 4) The strategy. What does this FFT actually do? Which algorithms does
>>> it use?  I tried many things, most of which completely failed. In
>>> section 3 I will have already discussed other strategies I tried which
>>> did not work. Here I will discuss what did work. Whilst many of the
>>> details came from my head, I'm pretty sure the general outline will
>>> contain few surprises.
>>>
>>> 5) References. Few. I purposely tried to be creative.
>>>
>>> 8) Implementors notes. A long list of things that can be improved. In
>>> releasing the code publicly it is my hope that contributors will help
>>> make this the best FFT implementation it can be. The list will include
>>> many trivial items which anyone could work on, all the way through to
>>> fairly advanced topics which would probably require original research
>>> to do really well. I'll rank the items according to how hard I think
>>> they'd be to do.
>>>
>>> The code will be under revision control, so it is my hope that people
>>> will be willing to contribute to it. I believe I spent between 150 and
>>> 200 hours writing it, all told.
>>>
>>> Till tomorrow.
>>>
>>> Bill.
>>>
>>> 2010/1/1 Bill Hart <goodwillh...@googlemail.com>:
>>>> Ha! I forgot to mention, those numbers represent the size of the
>>>> integers in bits. I multiply two different integers with the same
>>>> number of bits, though that is not a limitation of the code. It will
>>>> accept integers of different lengths and adjust accordingly.
>>>>
>>>> Bill.
>>>>
>>>> 2010/1/1 Bill Hart <goodwillh...@googlemail.com>:
>>>>> I re-timed that point and I believe I may have just mistranscribed it.
>>>>> Here it is with the corrected time. Thanks for pointing it out!
>>>>>
>>>>> I've also included some times from the FLINT FFT, though only
>>>>> non-truncated ones (i.e. where FLINT uses a convolution which is an
>>>>> exact power of 2 length - not a limitation of FLINT, I just didn't get
>>>>> around to timing it at the other points yet).
>>>>>
>>>>> 8364032:  6.8s / 8.0s
>>>>> 9409536:  9.8s / 8.7s
>>>>> 10455040:  10.8s / 9.6s
>>>>> 11500544: 12.3s / 12.6s
>>>>> 12546048: 12.7s / 15.3s
>>>>> 13591552: 14.0s / 14.2s
>>>>> 14637056: 14.9s / 15.3s
>>>>> 15682560: 16.0s / 14.5s
>>>>> 16728064: 16.8s / 19.5s
>>>>>
>>>>> 20910080: 24.3s / 23.2s
>>>>> 25092096: 28.5s / 33.1s
>>>>> 29274112: 33.1s / 31.8s
>>>>> 33456128: 37.4s / 47.7s
>>>>>
>>>>> 519168: iters 10000: 30.7s / 32.2s / 32.3s
>>>>> 2084864: iters 1000: 13.9s / 15.2s / 13.0s
>>>>> 8364032: iters 100: 6.8s / 8.0s / 7.6s
>>>>> 33497088: iters 100: 37.4s / 47.7s / 37.5s
>>>>> 134103040: iters 10: 17.0s / 24.8s / 19.6s
>>>>> 536608768: iters 1: 8.6s / 9.4s / 8.6s
>>>>> 2146959360: iters 1: 42.3s / 40.3s / 38.9s
>>>>>
>>>>> Bill.
>>>>>
>>>>>
>>>>> 2009/12/31 Bill Hart <goodwillh...@googlemail.com>:
>>>>>> It's either a tuning issue or timing instability on the machine.
>>>>>>
>>>>>> Bill.
>>>>>>
>>>>>> On 31/12/2009, Robert Gerbicz <robert.gerb...@gmail.com> wrote:
>>>>>>> Hi!
>>>>>>>
>>>>>>> How is it possible that to multiply longer numbers takes less time, for
>>>>>>> example:
>>>>>>>
>>>>>>> 12546048: 14.7s / 15.3s
>>>>>>> 13591552: 14.0s / 14.2s
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
