I've had a long think, and I don't think there is a way to get the FFT
to use the mpn_mulmod_2expm1 function for pointwise mults, no matter
what we do. I think it is fundamentally impossible.

That's completely bizarre to me.

And there is no such thing as an FFT in a Mersenne ring as I called
it. So that's two shocks in one day.

The problem I'm trying to work around is quite serious. Even with
truncation there is a big jump in times at power of two lengths
because one needs to use coefficients twice as large, so the time for
the pointwise mults shoots up dramatically. It's not a problem for
very large multiplications, as doubling the size of the pointwise
mults just over doubles the time to do them, which is fine as all your
data fits in half as many of them. The problem occurs only where there
is a big jump in the times, as happens when the pointwise mults use
schoolbook multiplication, for example.

I'm amazed there isn't an obvious solution to that problem. The best I
can come up with is what was used in FLINT: try to ensure the
pointwise mults are done mod B^{3m}+1.

Bill.

2010/1/3 Bill Hart <goodwillh...@googlemail.com>:
> My mind does not keep up with my fingers typing. Of course my
> suggestion 3 will not work because of this principal unit thing.
>
> I'll need to give that whole part of the strategy some more thought.
>
> I'm also now slightly confused as to why MPIR falls so far behind the
> new code in the middle, then catches up again at the end. When timing
> the FFTs against FLINT I also noticed this, i.e. it had nothing to do
> with pointwise multiplications.
>
> Bill.
>
> 2010/1/3 Bill Hart <goodwillh...@googlemail.com>:
>> 8. What remains to be done.
>>
>> Wow, I made it all the way to section 8! This is the most important
>> section, where I describe what remains to be done in order to make
>> this implementation really fast, clean, and worth swapping in for the
>> existing implementation in MPIR.
>>
>> I'm very interested to hear if people would like to volunteer to work on
>> various projects. They range from the completely trivial, which can be
>> carried out by anyone, right through to things that could turn into
>> research projects (some of the first four things below).
>>
>> This list applies to the version of the code that I am about to commit
>> to the repo, which is slightly modified from the version currently in
>> there. Specifically I'm talking about new_fft.c, i.e. the one that
>> doesn't need FLINT (which I don't see any point in continuing to
>> update - it's there for reference only).
>>
>> Clearly the high level ideas that need the most work are:
>>
>> 1) Implement the sqrt2 trick. In the ring Z/pZ where p = 2^B + 1 the
>> value 2^{3B/4} - 2^{B/4} is a square root of 2. This means you can have
>> convolutions of length 4m instead of 2m by using these extra roots of
>> unity. It should be the case that these are only needed in the very
>> top layer of the FFT and IFFT. However one should still implement
>> these efficiently. Obviously one decomposes a multiplication by
>> (sqrt(2))^{2i+1} into 2^i*sqrt(2) (and clearly sqrt(2)^{2i} just
>> becomes 2^i which we can already do).
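>>
>> As a quick sanity check of that identity (illustration only, not FFT
>> code), the following throwaway mpz-level snippet verifies that
>> s = 2^{3B/4} - 2^{B/4} squares to 2 mod p = 2^B + 1, for B divisible
>> by 4; for B = 4 it amounts to checking 6^2 = 36 = 2 mod 17:
>>
>> #include <gmp.h>
>>
>> /* return nonzero iff (2^(3B/4) - 2^(B/4))^2 == 2 mod 2^B + 1;
>>    B must be divisible by 4 */
>> int sqrt2_identity_holds(unsigned long B)
>> {
>>    mpz_t p, s, t;
>>    int ok;
>>    mpz_inits(p, s, t, NULL);
>>    mpz_setbit(p, B);
>>    mpz_add_ui(p, p, 1);          /* p = 2^B + 1 */
>>    mpz_setbit(s, 3*B/4);         /* s = 2^(3B/4) */
>>    mpz_setbit(t, B/4);
>>    mpz_sub(s, s, t);             /* s = 2^(3B/4) - 2^(B/4) */
>>    mpz_mul(s, s, s);
>>    mpz_mod(s, s, p);             /* s = s^2 mod p */
>>    ok = (mpz_cmp_ui(s, 2) == 0);
>>    mpz_clears(p, s, t, NULL);
>>    return ok;
>> }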
>>
>> The left hand half of the FFT butterfly is unaffected, but the right
>> hand half will need multiplication by sqrt2. Writing sqrt2 =
>> 2^{B/4}*(2^{B/2} - 1), one can multiply by 2^{B/4} first, combining
>> this with the existing butterfly shifts, then handle the remaining
>> factor 2^{B/2} - 1 with a shifted subtraction for the other "leg" of
>> the sqrt2. Whilst this is less efficient than a standard butterfly it
>> allows one to double the convolution length and work with smaller
>> coefficients in the pointwise mults, and the cost is small as it is
>> only incurred at the very top layer.
>>
>> For the IFFT one must do a similar thing, decomposing the
>> multiplication by sqrt2. There it is slightly more complex. The best
>> we can probably do is to apply the usual bitshifts first, then apply a
>> sqrt2 combined with any limb shifts, then do a sumdiff. That's three
>> operations, but without additional assembly functions I don't see how
>> to do better. At least any negations can be more easily combined with
>> the sumdiff.
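>>
>> In code, multiplication by sqrt2 might look something like the
>> following sketch. The helper names and signatures here are made up
>> (loosely modelled on mpn_mul_2expmod_2expp1 in new_fft.c), so treat
>> it as pseudocode rather than a drop-in implementation:
>>
>> /* r = t * sqrt2 mod p, where p = 2^B + 1, B = limbs*GMP_LIMB_BITS,
>>    using sqrt2 = 2^(3B/4) - 2^(B/4); r, t, tmp are distinct buffers */
>> void mulmod_sqrt2(mp_limb_t * r, mp_limb_t * t, mp_limb_t * tmp,
>>                   mp_size_t limbs, mp_bitcnt_t B)
>> {
>>    mul_2expmod_2expp1(r, t, limbs, 3*B/4);  /* r   = t * 2^(3B/4) mod p */
>>    mul_2expmod_2expp1(tmp, t, limbs, B/4);  /* tmp = t * 2^(B/4)  mod p */
>>    submod_2expp1(r, r, tmp, limbs);         /* r   = r - tmp      mod p */
>> }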
>>
>> 2) Come up with a decent negacyclic convolution strategy. I think a
>> lot of work could be done on this, even research. It's currently
>> inefficient because we need to use a multiple of half the FFT length
>> for the coefficient bit size, which means too much padding.
>>
>> FLINT has a fairly intricate strategy which I barely understand. I
>> think it uses a naive negacyclic convolution with *coefficients*
>> modulo a power of the limb base, i.e. mod B^i for some small i, and
>> combines it with the information you get from an FFT negacyclic
>> convolution with *coefficients* modulo some 2^b + 1. It's still a
>> negacyclic convolution, so that still gives you the product mod 2^B+1
>> that you are after. I'm not sure if that is the idea or not. I just
>> remember it being a real headscrew for me when I tried to figure it
>> out.
>>
>> I don't think the FLINT implementation was ever carried through to its
>> logical conclusion however. It does work, but I don't know about the
>> efficiency. It is definitely more efficient than what I am currently
>> doing however, which is nothing. From memory the idea in FLINT was to
>> force the size of the pointwise mults *of the FFT negacyclic
>> convolution* to be a multiple of 3 limbs, so that the identity
>> 2^{3B} + 1 = (2^B + 1)(2^{2B} - 2^B + 1) can be used, as explained in
>> an earlier section.
>>
>> I never got through thinking about whether that strategy was also
>> worth pursuing at the FFT level itself and not just for the pointwise
>> mults. Are there Karatsuba and Toom negacyclic convolution algorithms?
>> I don't know the answer; I never thought it through.
>>
>> Anyhow, there is no requirement that we do what is done in FLINT, just
>> that we do something sensible.
>>
>> 3) Implement a transform with coefficients mod p = 2^{sB} - 1. This
>> would still be a standard cyclic convolution, using an "ordinary FFT",
>> just that you are doing your reductions mod p with an addition instead
>> of a subtraction. This would require rewriting the butterflies to work
>> modulo this different p. I didn't check whether there is a sqrt2 trick,
>> and certainly only convolutions of half the length can be supported.
>> But I am sure the savings in the pointwise multiplications will make
>> it worth implementing this trick. With truncation, efficiency becomes
>> much less of an issue.
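>>
>> The reduction itself is pleasantly simple. A minimal sketch, assuming
>> s = 1 so that p = 2^B - 1 with B = n*GMP_LIMB_BITS (the function name
>> is made up):
>>
>> #include <gmp.h>
>>
>> /* reduce the 2n-limb product {t, 2n} mod p = 2^B - 1: since
>>    2^B == 1 mod p, the high half is simply added to the low half,
>>    and any carry out wraps around as +1 */
>> void reduce_mod_2expm1(mp_limb_t * r, const mp_limb_t * t, mp_size_t n)
>> {
>>    mp_limb_t cy = mpn_add_n(r, t, t + n, n);  /* r = lo + hi */
>>    while (cy)
>>       cy = mpn_add_1(r, r, n, cy);            /* wrap the carry */
>> }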
>>
>> Now I proceed through new_fft.c one function at a time and mention
>> explicitly what needs to be done, while it is all still fresh in my
>> mind. I will number the ideas starting from 4, so that people can
>> volunteer to take them on, by number, if they should so choose.
>>
>> Feel free to add suggestions to the list.
>>
>> mpn_addmod_2expp1_1:
>>
>> 4) As noted in the code comments, this can probably be done faster in
>> the case where c is small (the generic case), whether positive or
>> negative. In fact it is probably only ever used in such cases.
>>
>> 5) The function should probably be renamed. The name doesn't make a
>> whole lot of sense.
>>
>> revbin:
>>
>> 6) This is very inefficient. I don't know an efficient algorithm, but
>> I am sure one exists.
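>>
>> One standard approach reverses a whole limb with masked swaps of
>> progressively larger blocks and then shifts down. A sketch, assuming
>> 64-bit limbs (the name is made up, see also 7):
>>
>> /* reverse the low 'bits' bits of x (64-bit limbs assumed) */
>> static mp_limb_t revbin_fast(mp_limb_t x, unsigned int bits)
>> {
>>    x = ((x & 0x5555555555555555UL) << 1) | ((x >> 1) & 0x5555555555555555UL);
>>    x = ((x & 0x3333333333333333UL) << 2) | ((x >> 2) & 0x3333333333333333UL);
>>    x = ((x & 0x0F0F0F0F0F0F0F0FUL) << 4) | ((x >> 4) & 0x0F0F0F0F0F0F0F0FUL);
>>    x = ((x & 0x00FF00FF00FF00FFUL) << 8) | ((x >> 8) & 0x00FF00FF00FF00FFUL);
>>    x = ((x & 0x0000FFFF0000FFFFUL) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFUL);
>>    x = (x << 32) | (x >> 32);
>>    return x >> (64 - bits);
>> }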
>>
>> 7) The name should be changed to something more exotic to avoid
>> polluting namespace when linking with other libraries.
>>
>> FFT_split* :
>>
>> 8) We need a function which given the parameters currently sent to
>> FFT_split_bits and a coefficient number, just returns that single
>> coefficient. The reason for this is that splitting the integer up into
>> chunks (to the bit) can then be merged with the top layer of the FFT,
>> which has consequences for cache locality. In particular if you were
>> doing an MFA transform, you'd pull out only the coefficients needed
>> for each column FFT at a time, then immediately do that FFT.
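>>
>> A hedged sketch of what such a function might look like (the name and
>> conventions are made up; it assumes the input is zero padded so that
>> reads off the end cannot happen):
>>
>> /* extract coefficient j, i.e. bits [j*bits, (j+1)*bits) of the
>>    integer at ip, into {cp, cl} with cl = ceil(bits/GMP_LIMB_BITS) */
>> void FFT_split_one_coeff(mp_limb_t * cp, const mp_limb_t * ip,
>>                          mp_bitcnt_t bits, mp_size_t j)
>> {
>>    mp_bitcnt_t start = j * bits;
>>    mp_size_t off = start / GMP_LIMB_BITS;
>>    unsigned int sh = start % GMP_LIMB_BITS;
>>    mp_size_t cl = (bits + GMP_LIMB_BITS - 1) / GMP_LIMB_BITS;
>>    mp_size_t i;
>>
>>    if (sh)
>>    {
>>       mpn_rshift(cp, ip + off, cl, sh);
>>       /* bits shifted out of limb off + cl come in at the top */
>>       cp[cl - 1] |= (ip[off + cl] << (GMP_LIMB_BITS - sh));
>>    }
>>    else
>>       for (i = 0; i < cl; i++)
>>          cp[i] = ip[off + i];
>>
>>    if (bits % GMP_LIMB_BITS)  /* mask off bits above the coefficient */
>>       cp[cl - 1] &= (((mp_limb_t) 1 << (bits % GMP_LIMB_BITS)) - 1);
>> }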
>>
>> 9) Experiment with cache hints. On Opterons in particular we found
>> this to give a speedup in FLINT. The current code indicates where
>> cache hints should be used, though this strategy should be combined
>> with 8.
>>
>> 10) Replace any ulong's which should really be mp_limb_t's so that
>> things will work on Windows.
>>
>> FFT_combine*:
>>
>> 11) Again it might be possible to combine just a single coefficient at
>> a time so that cache locality can be maintained for the MFA IFFTs.
>> One has to be slightly careful to avoid a quadratic run time, but for
>> unsigned coefficients I think it will be OK. One also needs, I would
>> guess, to selectively zero the limbs of the output integer in step
>> with these coefficient combines, so that step should probably be
>> merged with this somehow. Needs careful thought.
>>
>> 12) Pass in temporary space from the main integer multiplication routine.
>>
>> mpn_submod_i_2expp1:
>>
>> 13) Remove as it is only used in the split radix code, which is not utilised.
>>
>> mpn_lshB_sumdiffmod_2expp1:
>> mpn_sumdiff_rshBmod_2expp1:
>>
>> 14) The names are poor and should be changed. They are too specific to
>> be mpn functions.
>>
>> 15) Is there some symmetry which can be utilised to make the functions
>> simpler? There seem to be a lot of cases. I don't think there is, as
>> one is doing a subtraction; however, combining with a possible negation
>> without increasing the code complexity might be possible. Certainly
>> the case x = 0 is the (I)FFT butterfly case, so surely some
>> consolidation is possible there.
>>
>> 16) Can the entire functions be done in assembly? Either way, the
>> overheads need to be reduced as much as possible, and we still need a
>> generic C implementation.
>>
>> mpn_mul_2expmod_2expp1:
>> mpn_div_2expmod_2expp1:
>>
>> 17) The first function should probably just deal with all cases:
>> bit shifts or limb shifts, positive or negative. I ended up
>> implementing the FFT and negacyclic_twiddle functions because this
>> didn't support limb shifts, and these were needed in the truncated
>> FFT. So some rationalisation is possible there. Do away with the
>> mpn_div_2expmod_2expp1 function: a right shift by d bits can be done
>> as a left shift by 2*n*w - d bits (since 2^{2nw} = 1 mod p), with this
>> function handling the negation. But see how the FFT_twiddle and
>> FFT_negacyclic_twiddle functions are implemented: they return an int
>> to indicate whether they did their operation in-place, to save copying
>> data around when it is not necessary. Perhaps just pass the temporary
>> in as a parameter and do the swap internal to the function.
>>
>> 18) Can this function be optimised in assembly?
>>
>> FFT_split_radix_butterfly:
>> FFT_split_radix_butterfly2:
>>
>> 19) Remove as no longer utilised.
>>
>> FFT_radix2_twiddle_butterfly:
>> FFT_radix2_twiddle_inverse_butterfly:
>>
>> 20) Surely this can be combined with the ordinary
>> (IF)FFT_radix2_butterfly functions. However note that it is better to
>> pass nw than to pass n and w separately as we do elsewhere. Perhaps
>> rename this parameter to nw or limbs, not NW.
>>
>> 21) The radix2 in the name is redundant, as we only implement radix 2,
>> and other radices are impractical. Omit the twiddle from the name.
>>
>> 22) The motif:
>>
>>   b1 %= (2*NW);
>>   if (b1 >= NW)
>>   {
>>      negate2 = 1;
>>      b1 -= NW;
>>   }
>>   x = b1/GMP_LIMB_BITS;
>>   b1 -= x*GMP_LIMB_BITS;
>>
>> occurs often in this and other functions, and should be abstracted
>> away into a macro or something like that. It's not necessarily
>> optimised either.
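>>
>> A possible shape for such a macro (untested; the name is made up, and
>> unlike the fragment above it clears the negate flag first):
>>
>> /* reduce the shift b mod 2*NW, recording a negation when b lands in
>>    the upper half (since 2^NW == -1 mod p), then split b into a limb
>>    count x and a remaining bit count */
>> #define FFT_NORM_SHIFT(b, NW, negate, x)  \
>>    do {                                   \
>>       (b) %= (2*(NW));                    \
>>       (negate) = 0;                       \
>>       if ((b) >= (NW))                    \
>>       {                                   \
>>          (negate) = 1;                    \
>>          (b) -= (NW);                     \
>>       }                                   \
>>       (x) = (b)/GMP_LIMB_BITS;            \
>>       (b) -= (x)*GMP_LIMB_BITS;           \
>>    } while (0)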
>>
>> FFT_radix2_butterfly:
>> FFT_radix2_inverse_butterfly:
>>
>> 23) Surely better to pass nw and an actual bit shift rather than i,
>> since otherwise n*w needs to be computed over and over, or at least
>> i*w does.
>>
>> 24) In FFT_radix2_butterfly, I don't think i >= n is actually
>> possible, so remove that.
>>
>> FFT_split_radix:
>>
>> 25) Remove, as it is not utilised.
>>
>> FFT_radix2:
>>
>> 26) Remove as it is no longer used.
>>
>> FFT_radix2_new:
>> FFT_radix2_twiddle:
>> IFFT_radix2_new:
>> IFFT_radix2_twiddle:
>>
>> 27) radix2 is superfluous in the name, as is the new.
>>
>> 28) combine the non-twiddle functions with the twiddle versions.
>>
>> 29) I don't believe the result stride rs is ever used, and we always
>> do the FFT and IFFT in-place, so remove rr and rs entirely.
>>
>> 30) It is no longer necessary to pass temp. It isn't used.
>>
>> 31) Can t1 and t2 be combined without a loss of speed? Please
>> benchmark and be *absolutely sure*.
>>
>> 32) The motif:
>>
>>      ptr = rr[0];
>>      rr[0] = *t1;
>>      *t1 = ptr;
>>
>> occurs frequently here and elsewhere and should definitely be
>> abstracted away into a pointer swap macro, or replaced with one of the
>> already provided MPIR macros for this.
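>>
>> For instance (hypothetical; MPIR's own macro, if available, is
>> preferable):
>>
>> /* swap two mp_limb_t pointers */
>> #define FFT_PTR_SWAP(a, b)     \
>>    do {                        \
>>       mp_limb_t * __t = (a);   \
>>       (a) = (b);               \
>>       (b) = __t;               \
>>    } while (0)
>>
>> so the motif above becomes FFT_PTR_SWAP(rr[0], *t1).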
>>
>> FFT_negacyclic_twiddle:
>> FFT_twiddle:
>>
>> 33) Remove, see 17. However note that an indication of how to combine
>> the negation is provided in commented out code (which is not tested).
>>
>> FFT_radix2_truncate1:
>> FFT_radix2_truncate:
>> FFT_radix2_truncate1_twiddle:
>> FFT_radix2_truncate_twiddle:
>> IFFT_radix2_truncate1:
>> IFFT_radix2_truncate1_twiddle:
>> IFFT_radix2_truncate:
>> IFFT_radix2_truncate_twiddle:
>>
>> 34) Combine non-twiddle and twiddle versions. Remove radix2 from name.
>>
>> 35) Remove rr and rs. Do in-place.
>>
>> 36) Correct code comments which are cut and pasted from elsewhere.
>>
>> 37) Write special cases for n = 1, 2 to remove the unnecessary
>> restriction that trunc be divisible by 2. Remember to remove the
>> restriction in the multiplication code which sets trunc to a multiple
>> of 2*sqrt instead of just a multiple of sqrt.
>>
>> FFT_radix2_negacyclic:
>> IFFT_radix2_negacyclic:
>>
>> 38) Remove rr and rs. Always work in-place.
>>
>> 39) Do not pass temp.
>>
>> FFT_radix2_mfa:
>> FFT_radix2_mfa_truncate:
>> IFFT_radix2_mfa:
>> IFFT_radix2_mfa_truncate:
>>
>> 40) Do not pass temp.
>>
>> 41) Combine truncate and non-truncate versions. Actually the
>> non-truncate versions are not really needed.
>>
>> FFT_mulmod_2expp1:
>>
>> 42) Memory allocation should be done in a single block, or even passed
>> in as a temporary.
>>
>> 43) Clean up the code which handles the subtractions and make it cache
>> friendly.
>>
>> new_mpn_mulmod_2expp1:
>>
>> 44) Doesn't currently deal with the case that c != 0 when using the
>> negacyclic convolution (which it currently never uses anyway because
>> it is too inefficient), i.e. the case where one of the inputs is
>> 2^{n*w} so that it doesn't fit into {n*w}/GMP_LIMB_BITS limbs.
>>
>> 45) Doesn't currently return any carry, i.e. when the result is
>> p - 1 = 2^{n*w}, which doesn't fit in n*w bits, a carry should be
>> returned.
>>
>> new_mpn_mul_mfa:
>>
>> 46) In the memory allocation part, one shouldn't mix allocations of
>> arrays of mp_limb_t's and mp_limb_t *'s as this is not necessarily
>> safe on all machines. Or is it?
>>
>> General:
>>
>> 47) The code needs to be made Windows safe by avoiding variable
>> declarations except at the beginnings of blocks.
>>
>> 48) Some functions don't have adequate test functions.
>>
>> 49) I think the normalise function is correct, but the test code
>> doesn't check all cases.
>>
>> 50) The test code doesn't test the FFTs, IFFTs and multiplication
>> functions for lots of different sizes, especially different values of
>> n, w, and trunc.
>>
>> Bill.
>>
>> 2010/1/3 Bill Hart <goodwillh...@googlemail.com>:
>>> 5. References.
>>>
>>> I am indebted to the people who prepared the following references:
>>>
>>> "Matters Computational: ideas, algorithms and source code", by J org
>>> Arndt, see http://www.jjj.de/fxt/fxtbook.pdf
>>>
>>> Specifically I used chapter 3. Note that I took very great care to not
>>> refer to the source code. I found the notation used in equations such
>>> as 19.2-6 very helpful, and this book was also the resource I used for
>>> a description of the Matrix Fourier Algorithm and the split radix
>>> algorithm (which did not actually end up being used by my integer
>>> multiplication implementation).
>>>
>>> "Primes numbers: a computational perspective", by Richard Crandall and
>>> Carl Pomerance, 2nd ed., 2005, Springer.
>>>
>>> Specifically I used this reference to figure out why my negacyclic
>>> convolution didn't work (I had forgotten that the coefficients would
>>> be signed).
>>>
>>> Two other references which I referred to, but which did not end up
>>> featuring in the implementation I settled on (so far), and to whose
>>> authors I am grateful, are:
>>>
>>> "A GMP-based implementation of Schonhage-Strassen's Large Integer
>>> Multiplication Algorithm" by Pierrick Gaudry, Alexander Kruppa and
>>> Paul Zimmermann, ISAAC 2007 proceedings, pp 167-174. See
>>> http://www.loria.fr/~gaudry/publis/issac07.pdf
>>>
>>> and
>>>
>>> "Multidigit multiplication for mathematicians" by Dan Bernstein (to
>>> appear). see http://cr.yp.to/papers/m3.pdf
>>>
>>> Specifically this paper helped me sort out why my radix 2/3
>>> implementation didn't work. I would also say that the algebraic way of
>>> thinking about the FFT advocated in this paper helped
>>> when it came to thinking up a truncation strategy. I had been looking
>>> at this paper to work out why my radix 2/3 thing didn't work and went
>>> for a walk to town in frustration. By the time I got back I had
>>> basically figured out how to do truncation without radix 3. The
>>> initial thought that set it off was to think about the FFT as breaking
>>> into a cyclic and negacyclic half, as described in section 2 of these
>>> notes.
>>>
>>> Bill.
>>>
>>> 2010/1/3 Bill Hart <goodwillh...@googlemail.com>:
>>>> Hmm, so the only reason MPIR can be beating me for small
>>>> multiplications is general coding efficiency. My butterfly strategy
>>>> probably introduces overheads when the coefficients are small, i.e.
>>>> only a few limbs.
>>>>
>>>> The overheads could be lowered by having an iterative FFT for short
>>>> lengths. But we should worry about that later. It will complicate the
>>>> code too much and the existing code needs consolidation first. There's
>>>> a lot of code duplication currently.
>>>>
>>>> Bill.
>>>>
>>>> 2010/1/3 Bill Hart <goodwillh...@googlemail.com>:
>>>>> 2010/1/3 David Harvey <dmhar...@cims.nyu.edu>:
>>>>>>
>>>>>>
>>>>>> On Jan 3, 9:03 am, Bill Hart <goodwillh...@googlemail.com> wrote:
>>>>>>
>>>>>>> I figured I probably did misunderstand. What I think they are trying
>>>>>>> to get at is that one can use coefficients modulo 2^N+1 except at
>>>>>>> certain levels because the roots of unity are actually the same. But
>>>>>>> then the discussion becomes one of combining a Fermat and Mersenne
>>>>>>> transform, which is the basic strategy that the implementation uses. I
>>>>>>> got confused. I still am.
>>>>>>
>>>>>> I think they are saying that to multiply two integers whose product
>>>>>> has N bits, choose a and b (as in Lemma 1 of their paper) such that a
>>>>>> + b ~ N and then multiply modulo 2^a - 1 (use Mersenne transform) and
>>>>>> modulo 2^b + 1 (use Fermat transform). The smoothness comes from the
>>>>>> freedom to choose a and b.
>>>>>>
>>>>>
>>>>> OK, I think I see. Their big N corresponds to an FFT for the original
>>>>> integers, which are multiplied mod 2^N+1 in the case of a Fermat
>>>>> transform or 2^N-1 in the case of a Mersenne transform. I think I
>>>>> would call these negacyclic and cyclic convolutions respectively.
>>>>>
>>>>> But then a Mersenne transform is nothing but an ordinary FFT, no?
>>>>>
>>>>> The idea of combining with a negacyclic FFT seems to be one of the
>>>>> strategies I discussed and discarded.
>>>>>
>>>>> I still don't understand what they mean by "power-of-2 roots of unity
>>>>> are needed only at the "lower level"". I understand that by the "lower
>>>>> level" they seem to mean the pointwise mults. But what is a power of 2
>>>>> root of unity? Aren't they all powers of 2, except when using the
>>>>> sqrt2 trick.
>>>>>
>>>>> Maybe I don't understand the Schonhage-Strassen algorithm correctly.
>>>>> Perhaps they not only worked with coefficients mod 2^B+1 but also did
>>>>> the convolution mod x^m+1. They were worried about asymptotic
>>>>> complexity, not speed, so they wanted their algorithm to recurse so
>>>>> that the pointwise mults could be done using the same algorithm. Gosh
>>>>> that is exactly what the authors of this paper say!
>>>>>
>>>>> So my misunderstanding came from the fact that I assumed SS advocated
>>>>> a Mersenne transform, i.e. mod x^m-1, not a Fermat transform, mod
>>>>> x^m+1.
>>>>>
>>>>> Well, just to be clear: I worked mod x^m - 1 because it is more
>>>>> efficient, with coefficients in a ring mod p where p = 2^B + 1. In
>>>>> that ring one must perform a negacyclic convolution if one wishes to
>>>>> recurse, which involves twiddling to get an ordinary cyclic
>>>>> convolution.
>>>>>
>>>>> Anyhow, this is all a non-strategy once you have truncation. All that
>>>>> is required is to implement a single transform which works with
>>>>> coefficients mod 2^B - 1, which I was formerly calling a Mersenne
>>>>> ring, when the large integers being multiplied are "small". Then you
>>>>> don't want to recurse for doing the pointwise mults, and so the
>>>>> pointwise mults can be done mod 2^{B/2} - 1 and 2^{B/2} + 1 using our
>>>>> mpn_mulmod_2expm1 and mpn_mulmod_2expp1, which can do whatever is
>>>>> most efficient.
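>>>>>
>>>>> For what it's worth, the recombination at the end is cheap: with
>>>>> c = B/2 the moduli 2^c - 1 and 2^c + 1 are coprime (both are odd),
>>>>> and (2^c - 1)^{-1} == 2^{c-1} mod 2^c + 1. A hedged mpz-level sketch
>>>>> of the CRT step (the name is made up):
>>>>>
>>>>> #include <gmp.h>
>>>>>
>>>>> /* given r1 = x mod 2^c - 1 and r2 = x mod 2^c + 1, recover
>>>>>    x mod 2^(2c) - 1 */
>>>>> void crt_2expm1_2expp1(mpz_t x, const mpz_t r1, const mpz_t r2,
>>>>>                        unsigned long c)
>>>>> {
>>>>>    mpz_t m1, m2, t;
>>>>>    mpz_inits(m1, m2, t, NULL);
>>>>>    mpz_setbit(m1, c);
>>>>>    mpz_sub_ui(m1, m1, 1);        /* m1 = 2^c - 1 */
>>>>>    mpz_setbit(m2, c);
>>>>>    mpz_add_ui(m2, m2, 1);        /* m2 = 2^c + 1 */
>>>>>    mpz_sub(t, r2, r1);
>>>>>    mpz_mul_2exp(t, t, c - 1);    /* t = (r2 - r1) * 2^(c-1) */
>>>>>    mpz_mod(t, t, m2);            /* t = (r2 - r1)/(2^c - 1) mod m2 */
>>>>>    mpz_mul(t, t, m1);
>>>>>    mpz_add(x, r1, t);            /* x = r1 + (2^c - 1)*t */
>>>>>    mpz_clears(m1, m2, t, NULL);
>>>>> }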
>>>>>
>>>>> Bill.
>>>>>
>>>>
>>>
>>
>
