I've committed the changes to the code. The improvements are more
modest than I remember them being last night, but here are the new
timings as affected by the negacyclic convolution anyhow (times are
new code / MPIR, improvement in brackets):

536608768/1: 8.4s / 9.9s (17.9%)
603684864/1: 10.3s / 10.5s
670760960/1: 11.3s / 12.6s (11.5%)
737837056/1: 12.6s / 13.3s (5.6%)
804913152/1: 13.7s / 15.9s (16.1%)
871989248/1: 14.5s / 16.6s (14.5%)
939065344/1: 15.7s / 17.5s (19.0%)
1006141440/1: 16.5s / 17.8s (15.6%)

1073217536/1: 17.6s / 20.6s (17.0%)
1341521920/1: 23.3s / 25.3s (8.6%)
1609826304/1: 27.7s / 31.2s (12.6%)
1878130688/1: 31.4s / 37.4s (19.1%)

2146959360/1: 36.3s / 41.9s (15.4%)

Bill.

2010/1/4 Bill Hart <goodwillh...@googlemail.com>:
> I fixed my negacyclic convolution. I had to change two whole
> characters to make it more efficient. It now beats MPIR from about
> 603684864 bits up. By the time we reach about 2GB it beats MPIR by
> about 20-25%.
>
> I'll post new times and code tomorrow. It does pass the test code,
> which is a good sign.
>
> So *that* problem appears to be gone. Still lots more that could be
> done for the negacyclic convolution though.
>
> Bill.
>
> 2010/1/4 Bill Hart <goodwillh...@googlemail.com>:
>> Ah, I see that our current mpn_mulmod_2expp1 does not do anything
>> special at this stage.
>>
>> So that is my number 3:
>>
>> 3. Make mpn_mulmod_2expp1 deal with the special case where you are
>> multiplying mod 2^{3B}+1 where B is some integer divisible by
>> GMP_LIMB_BITS.
>>
>> Simply use the identity 2^{3B}+1 = (2^B+1)(2^{2B}-2^B+1) and do two
>> multiplications, one mod each factor (in precisely the way that
>> mpn_mulmod_2expp1 currently works, i.e. you could call the function
>> recursively). Ensure that there is a tuned cutoff for this trick, as
>> once the multiplications become too large it no longer wins.
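>>
>> To spell the trick out, here is a minimal mpz-level sketch (the
>> function name is invented; a real implementation would work at the
>> mpn level and call mpn_mulmod_2expp1 recursively for the 2^B+1
>> factor). It uses the fact that 2^{2B}-2^B+1 == 3 mod 2^B+1, and that
>> 3 is invertible there because B is even (being a multiple of
>> GMP_LIMB_BITS):
>>
>>    #include <gmp.h>
>>
>>    /* r = a*b mod 2^{3B}+1 via two smaller multiplications and CRT */
>>    static void mulmod_2exp3Bp1(mpz_t r, const mpz_t a, const mpz_t b,
>>                                unsigned long B)
>>    {
>>       mpz_t f1, f2, r1, r2, t;
>>       mpz_inits(f1, f2, r1, r2, t, NULL);
>>
>>       /* f1 = 2^B+1, f2 = 2^{2B}-2^B+1, so f1*f2 = 2^{3B}+1 */
>>       mpz_set_ui(f1, 1); mpz_mul_2exp(f1, f1, B); mpz_add_ui(f1, f1, 1);
>>       mpz_set_ui(f2, 1); mpz_mul_2exp(f2, f2, 2*B);
>>       mpz_set_ui(t, 1); mpz_mul_2exp(t, t, B);
>>       mpz_sub(f2, f2, t); mpz_add_ui(f2, f2, 1);
>>
>>       /* one multiplication mod each factor */
>>       mpz_mod(r1, a, f1); mpz_mod(t, b, f1);
>>       mpz_mul(r1, r1, t); mpz_mod(r1, r1, f1);
>>       mpz_mod(r2, a, f2); mpz_mod(t, b, f2);
>>       mpz_mul(r2, r2, t); mpz_mod(r2, r2, f2);
>>
>>       /* CRT: f2 == 3 mod f1, so f2^{-1} mod f1 is just 3^{-1} */
>>       mpz_set_ui(t, 3); mpz_invert(t, t, f1);
>>       mpz_sub(r1, r1, r2); mpz_mul(r1, r1, t); mpz_mod(r1, r1, f1);
>>       mpz_mul(r1, r1, f2); mpz_add(r, r2, r1);  /* r = r2 + s*f2 */
>>
>>       mpz_clears(f1, f2, r1, r2, t, NULL);
>>    }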
>>
>> Of course, in the FFT range the negacyclic convolution should be
>> used, but only once we get it fast enough to be useful.
>>
>> Bill.
>>
>> 2010/1/3 Bill Hart <goodwillh...@googlemail.com>:
>>> I've had a long think, and I don't think there is a way to get the FFT
>>> to use the mpn_mulmod_2expm1 function for pointwise mults, no matter
>>> what we do. I think it is fundamentally impossible.
>>>
>>> That's completely bizarre to me.
>>>
>>> And there is no such thing as an FFT in a Mersenne ring as I called
>>> it. So that's two shocks in one day.
>>>
>>> The problem I'm trying to work around is quite serious. Even with
>>> truncation there is a big jump in times at power of two lengths
>>> because one needs to use coefficients twice as large, so the time for
>>> the pointwise mults shoots up dramatically. It's not a problem for
>>> very large multiplications, as doubling the size of the pointwise
>>> mults just over doubles the time to do them, which is fine as all your
>>> data fits in half as many of them. The problem occurs only when the
>>> time for a pointwise mult jumps sharply with its size, as it does
>>> when the pointwise mults use schoolbook multiplication, for example.
>>>
>>> I'm amazed there isn't an obvious solution to that problem. The best I
>>> can come up with is what was used in FLINT: try to ensure the
>>> pointwise mults are done mod B^{3m}+1 (B here being the limb radix).
>>>
>>> Bill.
>>>
>>> 2010/1/3 Bill Hart <goodwillh...@googlemail.com>:
>>>> My mind does not keep up with my fingers typing. Of course my
>>>> suggestion 3 will not work because of this principal unit thing.
>>>>
>>>> I'll need to give that whole part of the strategy some more thought.
>>>>
>>>> I'm also now slightly confused as to why MPIR falls so far behind the
>>>> new code in the middle, then catches up again at the end. When timing
>>>> the FFTs against FLINT I also noticed this, i.e. it had nothing to do
>>>> with pointwise multiplications.
>>>>
>>>> Bill.
>>>>
>>>> 2010/1/3 Bill Hart <goodwillh...@googlemail.com>:
>>>>> 8. What remains to be done.
>>>>>
>>>>> Wow, I made it all the way to section 8! This is the most important
>>>>> section, where I describe what remains to be done to make this
>>>>> implementation really fast, clean and worth swapping in for the
>>>>> existing implementation in MPIR.
>>>>>
>>>>> I'm very interested to hear whether people would like to volunteer
>>>>> to work on the various projects. They range from the completely
>>>>> trivial, which can be carried out by anyone, right through to things
>>>>> that could turn into research projects (some of the first four items
>>>>> below).
>>>>>
>>>>> This list applies to the version of the code that I am about to commit
>>>>> to the repo, which is slightly modified from the version currently in
>>>>> there. Specifically I'm talking about new_fft.c, i.e. the one that
>>>>> doesn't need FLINT (which I don't see any point in continuing to
>>>>> update - it's there for reference only).
>>>>>
>>>>> Clearly the high level ideas that need the most work are:
>>>>>
>>>>> 1) Implement the sqrt2 trick. In the ring Z/pZ where p = 2^B + 1 the
>>>>> value 2^{3B/4} - 2^{B/4} is a square root of 2. This means you can
>>>>> have convolutions of length 4m instead of 2m by using these extra
>>>>> roots of unity. It should be the case that these are only needed in
>>>>> the very top layer of the FFT and IFFT. However one should still
>>>>> implement these efficiently. Obviously one decomposes a
>>>>> multiplication by sqrt(2)^{2i+1} into a multiplication by 2^i
>>>>> followed by one by sqrt(2) (and clearly sqrt(2)^{2i} just becomes
>>>>> 2^i, which we can already do).
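>>>>>
>>>>> As a sanity check on that value, here is a small self-contained mpz
>>>>> demo (not MPIR code) verifying that it squares to 2, for any B
>>>>> divisible by 4:
>>>>>
>>>>>    #include <assert.h>
>>>>>    #include <gmp.h>
>>>>>
>>>>>    int main(void)
>>>>>    {
>>>>>       unsigned long B = 64;  /* any B divisible by 4 */
>>>>>       mpz_t p, s, t;
>>>>>       mpz_inits(p, s, t, NULL);
>>>>>       mpz_set_ui(p, 1); mpz_mul_2exp(p, p, B); mpz_add_ui(p, p, 1);
>>>>>       mpz_set_ui(s, 1); mpz_mul_2exp(s, s, 3*B/4);
>>>>>       mpz_set_ui(t, 1); mpz_mul_2exp(t, t, B/4);
>>>>>       mpz_sub(s, s, t);               /* s = 2^{3B/4} - 2^{B/4} */
>>>>>       mpz_mul(s, s, s); mpz_mod(s, s, p);
>>>>>       assert(mpz_cmp_ui(s, 2) == 0);  /* s^2 == 2 mod 2^B+1 */
>>>>>       mpz_clears(p, s, t, NULL);
>>>>>       return 0;
>>>>>    }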
>>>>>
>>>>> The left hand half of the FFT butterfly is unaffected, but the right
>>>>> hand half will need multiplication by sqrt2. One can multiply by
>>>>> 2^{B/4} first, combining this with the existing butterfly shifts.
>>>>> Then one addmuls by a further 2^{B/2} (with the appropriate sign)
>>>>> for the other "leg" of the sqrt2. Whilst this is less efficient than
>>>>> a standard butterfly, it allows one to double the convolution length
>>>>> and work with smaller coefficients in the pointwise mults, and the
>>>>> cost is only small, being confined to the very top layer.
>>>>>
>>>>> For the IFFT one must do a similar thing, decomposing the
>>>>> multiplication by sqrt2. There it is slightly more complex. The best
>>>>> we can probably do is to apply the usual bitshifts first, then apply a
>>>>> sqrt2 combined with any limb shifts, then do a sumdiff. That's three
>>>>> operations, but without additional assembly functions I don't see how
>>>>> to do better. At least any negations can be more easily combined with
>>>>> the sumdiff.
>>>>>
>>>>> 2) Come up with a decent negacyclic convolution strategy. I think a
>>>>> lot of work could be done on this, even research. It's currently
>>>>> inefficient because we need to use a multiple of half the FFT length
>>>>> for the coefficient bit size, which means too much padding.
>>>>>
>>>>> FLINT has a fairly intricate strategy which I barely understand. I
>>>>> think it uses a naive negacyclic convolution with *coefficients*
>>>>> modulo a power of the limb radix, i.e. mod B^i for some small i, and
>>>>> combines it with the information you get from an FFT negacyclic
>>>>> convolution with *coefficients* modulo some 2^b+1. It's still a
>>>>> negacyclic convolution, so it still gives you the product mod 2^B+1
>>>>> that you are after. I'm not sure if that is the idea or not; I just
>>>>> remember it being a real head-scratcher for me when I tried to
>>>>> figure it out.
>>>>>
>>>>> I don't think the FLINT implementation was ever carried through to
>>>>> its logical conclusion, however. It does work, but I don't know
>>>>> about the efficiency. It is certainly more efficient than what I am
>>>>> currently doing, which is nothing. From memory the idea in FLINT was
>>>>> to force the coefficient size in the pointwise multiplications *of
>>>>> the FFT negacyclic convolution* to be a multiple of 3 limbs, so that
>>>>> the identity 2^{3B}+1 = (2^B+1)(2^{2B}-2^B+1) can be used, as
>>>>> explained in an earlier section.
>>>>>
>>>>> I never finished thinking about whether that strategy was also worth
>>>>> pursuing at the FFT level itself and not just for the pointwise
>>>>> mults. Are there Karatsuba and Toom negacyclic convolution
>>>>> algorithms? I don't know the answer to these questions; I never
>>>>> thought it through.
>>>>>
>>>>> Anyhow, there is no requirement that we do what is done in FLINT, just
>>>>> that we do something sensible.
>>>>>
>>>>> 3) Implement a transform with coefficients mod p = 2^{sB}-1. This
>>>>> would still be a standard cyclic convolution, using an "ordinary
>>>>> FFT", just that the reductions mod p are done with an addition
>>>>> instead of a subtraction. This would require rewriting the
>>>>> butterflies to work modulo this different p. I didn't check whether
>>>>> there is a sqrt2 trick, and certainly only convolutions of half the
>>>>> length can be supported. But I am sure the savings in the pointwise
>>>>> multiplications will make this trick worth implementing. With
>>>>> truncation, efficiency becomes much less of an issue.
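>>>>>
>>>>> To illustrate the addition versus subtraction point, here is a
>>>>> sketch (helper name invented) of reducing a double-length value
>>>>> {xp, 2n} mod p = 2^{n*GMP_LIMB_BITS} - 1:
>>>>>
>>>>>    #include <gmp.h>
>>>>>
>>>>>    /* rp <- {xp, 2n} mod 2^{n*GMP_LIMB_BITS} - 1, in n limbs (the
>>>>>       result may still equal p itself; callers must normalise) */
>>>>>    static void reduce_2expm1(mp_ptr rp, mp_srcptr xp, mp_size_t n)
>>>>>    {
>>>>>       mp_limb_t cy = mpn_add_n(rp, xp, xp + n, n);
>>>>>       /* 2^{n*GMP_LIMB_BITS} == 1 mod p, so the carry wraps around
>>>>>          as +1; this second add cannot itself carry out */
>>>>>       mpn_add_1(rp, rp, n, cy);
>>>>>    }
>>>>>
>>>>> With p = 2^{n*GMP_LIMB_BITS} + 1 the high half instead enters with
>>>>> weight -1, giving an mpn_sub_n and a borrow fixup.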
>>>>>
>>>>> Now I proceed through new_fft.c one function at a time and mention
>>>>> explicitly what needs to be done, while it is all still fresh in my
>>>>> mind. I will number the ideas starting from 4, so that people can
>>>>> volunteer to take them on, by number, if they should so choose.
>>>>>
>>>>> Feel free to add suggestions to the list.
>>>>>
>>>>> mpn_addmod_2expp1_1:
>>>>>
>>>>> 4) As noted in the code comments, this can probably be done faster in
>>>>> the case where c is small (the generic case), whether positive or
>>>>> negative. In fact it is probably only ever used in such cases.
>>>>>
>>>>> 5) The function should probably be renamed. The name doesn't make a
>>>>> whole lot of sense.
>>>>>
>>>>> revbin:
>>>>>
>>>>> 6) This is very inefficient. I don't know an efficient algorithm, but
>>>>> I am sure one exists.
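>>>>>
>>>>> One standard possibility is the O(log bits) mask-and-shift word
>>>>> reversal, sketched here for 32-bit indices (the masks extend in the
>>>>> obvious way to 64 bits); alternatively the reversed index can be
>>>>> maintained incrementally inside the loop that consumes it:
>>>>>
>>>>>    #include <stdint.h>
>>>>>
>>>>>    /* reverse all 32 bits of x */
>>>>>    static uint32_t revbin32(uint32_t x)
>>>>>    {
>>>>>       x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
>>>>>       x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
>>>>>       x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);
>>>>>       return (x << 24) | ((x & 0xFF00u) << 8)
>>>>>            | ((x >> 8) & 0xFF00u) | (x >> 24);
>>>>>    }
>>>>>
>>>>>    /* reverse only the low 'bits' bits, 1 <= bits <= 32 */
>>>>>    static uint32_t revbin(uint32_t x, unsigned bits)
>>>>>    {
>>>>>       return revbin32(x) >> (32 - bits);
>>>>>    }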
>>>>>
>>>>> 7) The name should be changed to something more exotic to avoid
>>>>> polluting the namespace when linking with other libraries.
>>>>>
>>>>> FFT_split* :
>>>>>
>>>>> 8) We need a function which, given the parameters currently sent to
>>>>> FFT_split_bits and a coefficient number, returns just that single
>>>>> coefficient. The reason is that splitting the integer up into chunks
>>>>> (to the bit) can then be merged with the top layer of the FFT, which
>>>>> has consequences for cache locality. In particular, if you were
>>>>> doing an MFA transform, you'd pull out only the coefficients needed
>>>>> for each column FFT at a time, then immediately do that FFT. A
>>>>> sketch follows.
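>>>>>
>>>>> A possible shape for such a function (name and exact semantics
>>>>> invented; the real FFT_split_bits also zero-pads each coefficient
>>>>> out to the size of the pointwise ring):
>>>>>
>>>>>    #include <gmp.h>
>>>>>
>>>>>    /* extract bits [i*b, (i+1)*b) of {xp, xn} into {cp, cn}, where
>>>>>       cn = ceil(b/GMP_LIMB_BITS); bits past the end read as zero */
>>>>>    static void FFT_split_one(mp_ptr cp, mp_srcptr xp, mp_size_t xn,
>>>>>                              mp_bitcnt_t b, mp_size_t i)
>>>>>    {
>>>>>       mp_bitcnt_t start = (mp_bitcnt_t) i * b;
>>>>>       mp_size_t ls = start / GMP_LIMB_BITS;  /* first source limb */
>>>>>       unsigned bs = start % GMP_LIMB_BITS;   /* bit offset in it  */
>>>>>       mp_size_t cn = (b + GMP_LIMB_BITS - 1) / GMP_LIMB_BITS;
>>>>>       mp_size_t avail = xn - ls, j;
>>>>>
>>>>>       for (j = 0; j < cn; j++)
>>>>>       {
>>>>>          mp_limb_t lo = (j < avail) ? xp[ls + j] : 0;
>>>>>          mp_limb_t hi = (j + 1 < avail) ? xp[ls + j + 1] : 0;
>>>>>          cp[j] = (bs == 0) ? lo
>>>>>                : (lo >> bs) | (hi << (GMP_LIMB_BITS - bs));
>>>>>       }
>>>>>       if (b % GMP_LIMB_BITS) /* mask bits above b in the top limb */
>>>>>          cp[cn - 1] &= (((mp_limb_t) 1 << (b % GMP_LIMB_BITS)) - 1);
>>>>>    }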
>>>>>
>>>>> 9) Experiment with cache hints. On Opterons in particular we found
>>>>> these to give a speedup in FLINT. The current code indicates where
>>>>> cache hints should be used; combine this strategy with 8.
>>>>>
>>>>> 10) Replace any ulongs which should really be mp_limb_t so that
>>>>> things will work on Windows, where unsigned long is only 32 bits.
>>>>>
>>>>> FFT_combine*:
>>>>>
>>>>> 11) Again it might be possible to combine just a single coefficient
>>>>> at a time so that cache locality can be maintained for the MFA
>>>>> IFFTs. One has to be slightly careful to avoid a quadratic run time,
>>>>> but for unsigned coefficients I think it will be OK. One also needs,
>>>>> I would guess, to selectively zero the limbs of the output integer
>>>>> in step with these coefficient combines, so that step should
>>>>> probably be merged in somehow. Needs careful thought.
>>>>>
>>>>> 12) Pass in temporary space from the main integer multiplication routine.
>>>>>
>>>>> mpn_submod_i_2expp1:
>>>>>
>>>>> 13) Remove as it is only used in the split radix code, which is not 
>>>>> utilised.
>>>>>
>>>>> mpn_lshB_sumdiffmod_2expp1:
>>>>> mpn_sumdiff_rshBmod_2expp1:
>>>>>
>>>>> 14) The names are stupid and should be changed. They are too
>>>>> specific to be mpn functions.
>>>>>
>>>>> 15) Is there some symmetry which can be utilised to make the
>>>>> functions simpler? There seem to be a lot of cases. I don't think
>>>>> there is, as one is doing a subtraction; however, combining it with
>>>>> a possible negation without increasing the code complexity might be
>>>>> possible. Certainly the case x = 0 is the (I)FFT butterfly case, so
>>>>> surely some consolidation is possible there.
>>>>>
>>>>> 16) Can these functions be done entirely in assembly? Either way,
>>>>> the overheads need to be reduced as much as possible, and we still
>>>>> need a generic C implementation.
>>>>>
>>>>> mpn_mul_2expmod_2expp1:
>>>>> mpn_div_2expmod_2expp1:
>>>>>
>>>>> 17) The first function should probably just deal with all cases:
>>>>> bit shifts alone or combined with limb shifts, positive or negative.
>>>>> I ended up implementing the FFT_twiddle and FFT_negacyclic_twiddle
>>>>> functions because this one didn't support limb shifts, and those
>>>>> were needed in the truncated FFT; so some rationalisation is
>>>>> possible there. Do away with the mpn_div_2expmod_2expp1 function: do
>>>>> a right shift by d bits as a left shift by 2*n*w - d bits, and make
>>>>> this function handle the resulting negation. But see how the
>>>>> FFT_twiddle and FFT_negacyclic_twiddle functions are implemented:
>>>>> they return an int indicating whether they did their operation
>>>>> in-place, to save copying data around when it is not necessary.
>>>>> Perhaps just pass the temporary in as a parameter and do the swap
>>>>> internal to the function.
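>>>>>
>>>>> To spell out the negation: with p = 2^{n*w}+1 we have
>>>>> 2^{n*w} == -1 mod p, so for 0 < d < n*w
>>>>>
>>>>>    2^{-d} == 2^{2*n*w - d} == 2^{n*w} * 2^{n*w - d}
>>>>>           == -2^{n*w - d} mod p,
>>>>>
>>>>> i.e. a right shift by d bits is a left shift by n*w - d bits
>>>>> followed by a negation.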
>>>>>
>>>>> 18) Can this function be optimised in assembly?
>>>>>
>>>>> FFT_split_radix_butterfly:
>>>>> FFT_split_radix_butterfly2:
>>>>>
>>>>> 19) Remove as no longer utilised.
>>>>>
>>>>> FFT_radix2_twiddle_butterfly:
>>>>> FFT_radix2_twiddle_inverse_butterfly:
>>>>>
>>>>> 20) Surely these can be combined with the ordinary
>>>>> (I)FFT_radix2_butterfly functions. However note that it is better to
>>>>> pass nw than to pass n and w separately as we do elsewhere. Perhaps
>>>>> rename this parameter to nw or limbs, not NW.
>>>>>
>>>>> 21) The radix2 in the name is redundant, as we only implement radix 2,
>>>>> and other radices are impractical. Omit the twiddle from the name.
>>>>>
>>>>> 22) The motif:
>>>>>
>>>>>   b1 %= (2*NW);
>>>>>   if (b1 >= NW)
>>>>>   {
>>>>>      negate2 = 1;
>>>>>      b1 -= NW;
>>>>>   }
>>>>>   x = b1/GMP_LIMB_BITS;
>>>>>   b1 -= x*GMP_LIMB_BITS;
>>>>>
>>>>> occurs often in this and other functions, and should be abstracted
>>>>> away into a macro or something like that. It's not necessarily
>>>>> optimised either.
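>>>>>
>>>>> For instance, something along these lines (an untested sketch; the
>>>>> macro name is invented):
>>>>>
>>>>>    #define REDUCE_TWIDDLE_EXP(b1, x, negate, NW)          \
>>>>>       do {                                                \
>>>>>          (b1) %= (2*(NW));                                \
>>>>>          if ((b1) >= (NW)) { (negate) = 1; (b1) -= (NW); }\
>>>>>          (x) = (b1)/GMP_LIMB_BITS;                        \
>>>>>          (b1) -= (x)*GMP_LIMB_BITS;                       \
>>>>>       } while (0)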
>>>>>
>>>>> FFT_radix2_butterfly:
>>>>> FFT_radix2_inverse_butterfly:
>>>>>
>>>>> 23) Surely better to pass nw and an actual bit shift rather than i,
>>>>> since otherwise n*w needs to be computed over and over, or at least
>>>>> i*w does.
>>>>>
>>>>> 24) In FFT_radix2_butterfly, I don't think i >= n is actually
>>>>> possible, so remove that.
>>>>>
>>>>> FFT_split_radix:
>>>>>
>>>>> 25) Remove, as it is not utilised.
>>>>>
>>>>> FFT_radix2:
>>>>>
>>>>> 26) Remove as it is no longer used.
>>>>>
>>>>> FFT_radix2_new:
>>>>> FFT_radix2_twiddle:
>>>>> IFFT_radix2_new:
>>>>> IFFT_radix2_twiddle:
>>>>>
>>>>> 27) radix2 is superfluous in the name, as is the new.
>>>>>
>>>>> 28) combine the non-twiddle functions with the twiddle versions.
>>>>>
>>>>> 29) I don't believe the result stride rs is ever used, and we always
>>>>> do the FFT and IFFT in-place, so remove rr and rs entirely.
>>>>>
>>>>> 30) It is no longer necessary to pass temp. It isn't used.
>>>>>
>>>>> 31) Can t1 and t2 be combined without a loss of speed? Please
>>>>> benchmark and be *absolutely sure*.
>>>>>
>>>>> 32) The motif:
>>>>>
>>>>>      ptr = rr[0];
>>>>>      rr[0] = *t1;
>>>>>      *t1 = ptr;
>>>>>
>>>>> occurs frequently here and elsewhere and should definitely be
>>>>> abstracted away into a pointer swap macro, or replaced with one of
>>>>> the macros MPIR already provides for this.
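>>>>>
>>>>> For example gmp-impl.h already provides MP_PTR_SWAP, which is
>>>>> essentially:
>>>>>
>>>>>    #define PTR_SWAP(x, y)       \
>>>>>       do {                      \
>>>>>          mp_ptr tmp_ptr = (x);  \
>>>>>          (x) = (y);             \
>>>>>          (y) = tmp_ptr;         \
>>>>>       } while (0)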
>>>>>
>>>>> FFT_negacyclic_twiddle:
>>>>> FFT_twiddle:
>>>>>
>>>>> 33) Remove; see 17. However note that an indication of how to
>>>>> combine the negation is provided in commented-out code (which is not
>>>>> tested).
>>>>>
>>>>> FFT_radix2_truncate1:
>>>>> FFT_radix2_truncate:
>>>>> FFT_radix2_truncate1_twiddle:
>>>>> FFT_radix2_truncate_twiddle:
>>>>> IFFT_radix2_truncate1:
>>>>> IFFT_radix2_truncate1_twiddle:
>>>>> IFFT_radix2_truncate:
>>>>> IFFT_radix2_truncate_twiddle:
>>>>>
>>>>> 34) Combine non-twiddle and twiddle versions. Remove radix2 from name.
>>>>>
>>>>> 35) Remove rr and rs. Do in-place.
>>>>>
>>>>> 36) Correct code comments which are cut and pasted from elsewhere.
>>>>>
>>>>> 37) Write special cases for n = 1, 2 to remove the unnecessary
>>>>> restriction that trunc be divisible by 2. Remember to remove the
>>>>> restriction in the multiplication code which sets trunc to a multiple
>>>>> of 2*sqrt instead of just a multiple of sqrt.
>>>>>
>>>>> FFT_radix2_negacyclic:
>>>>> IFFT_radix2_negacyclic:
>>>>>
>>>>> 38) Remove rr and rs. Always work in-place.
>>>>>
>>>>> 39) Do not pass temp.
>>>>>
>>>>> FFT_radix2_mfa:
>>>>> FFT_radix2_mfa_truncate:
>>>>> IFFT_radix2_mfa:
>>>>> IFFT_radix2_mfa_truncate:
>>>>>
>>>>> 40) Do not pass temp.
>>>>>
>>>>> 41) Combine truncate and non-truncate versions. Actually the
>>>>> non-truncate versions are not really needed.
>>>>>
>>>>> FFT_mulmod_2expp1:
>>>>>
>>>>> 42) Memory allocation should be done in a single block, or even passed
>>>>> in as a temporary.
>>>>>
>>>>> 43) Clean up the code which handles the subtractions and make it cache
>>>>> friendly.
>>>>>
>>>>> new_mpn_mulmod_2expp1:
>>>>>
>>>>> 44) Doesn't currently deal with the case c != 0 when using the
>>>>> negacyclic convolution (which it currently never uses anyway because
>>>>> it is too inefficient), i.e. the case where one of the inputs is
>>>>> 2^{n*w} and so doesn't fit into (n*w)/GMP_LIMB_BITS limbs.
>>>>>
>>>>> 45) Doesn't currently return any carry, i.e. when the result is p-1
>>>>> a carry should be returned, as that value doesn't fit in the output.
>>>>>
>>>>> new_mpn_mul_mfa:
>>>>>
>>>>> 46) In the memory allocation part, one shouldn't mix allocations of
>>>>> arrays of mp_limb_t's and mp_limb_t *'s as this is not necessarily
>>>>> safe on all machines. Or is it?
>>>>>
>>>>> General:
>>>>>
>>>>> 47) The code needs to be made Windows safe by avoiding variable
>>>>> declarations except at the beginnings of blocks.
>>>>>
>>>>> 48) Some functions don't have adequate test functions.
>>>>>
>>>>> 49) I think the normalise function is correct, but the test code
>>>>> doesn't check all cases.
>>>>>
>>>>> 50) The test code doesn't test the FFTs, IFFTs and multiplication
>>>>> functions for lots of different sizes, especially different values of
>>>>> n, w, and trunc.
>>>>>
>>>>> Bill.
>>>>>
>>>>> 2010/1/3 Bill Hart <goodwillh...@googlemail.com>:
>>>>>> 5. References.
>>>>>>
>>>>>> I am indebted to the people who prepared the following references:
>>>>>>
>>>>>> "Matters Computational: ideas, algorithms and source code", by J org
>>>>>> Arndt, see http://www.jjj.de/fxt/fxtbook.pdf
>>>>>>
>>>>>> Specifically I used chapter 3. Note that I took great care not to
>>>>>> refer to the source code. I found the notation used in equations such
>>>>>> as 19.2-6 very helpful, and this book was also the resource I used for
>>>>>> a description of the Matrix Fourier Algorithm and the split radix
>>>>>> algorithm (which did not actually end up being used by my integer
>>>>>> multiplication implementation).
>>>>>>
>>>>>> "Primes numbers: a computational perspective", by Richard Crandall and
>>>>>> Carl Pomerance, 2nd ed., 2005, Springer.
>>>>>>
>>>>>> Specifically I used this reference to figure out why my negacyclic
>>>>>> convolution didn't work (I had forgotten that the coefficients would
>>>>>> be signed).
>>>>>>
>>>>>> Two other references which I consulted, but which did not end up
>>>>>> featuring in the implementation I settled on (so far), and whose
>>>>>> authors I am nonetheless grateful to, are:
>>>>>>
>>>>>> "A GMP-based implementation of Schonhage-Strassen's Large Integer
>>>>>> Multiplication Algorithm" by Pierrick Gaudry, Alexander Kruppa and
>>>>>> Paul Zimmermann, ISSAC 2007 proceedings, pp 167-174. See
>>>>>> http://www.loria.fr/~gaudry/publis/issac07.pdf
>>>>>>
>>>>>> and
>>>>>>
>>>>>> "Multidigit multiplication for mathematicians" by Dan Bernstein (to
>>>>>> appear), see http://cr.yp.to/papers/m3.pdf
>>>>>>
>>>>>> Specifically, Bernstein's paper helped me sort out why my radix 2/3
>>>>>> implementation didn't work. I would also say that the algebraic way
>>>>>> of thinking about the FFT which it advocates helped when it came to
>>>>>> thinking up a truncation strategy. I had been looking
>>>>>> at this paper to work out why my radix 2/3 thing didn't work and went
>>>>>> for a walk to town in frustration. By the time I got back I had
>>>>>> basically figured out how to do truncation without radix 3. The
>>>>>> initial thought that set it off was to think about the FFT as breaking
>>>>>> into a cyclic and negacyclic half, as described in section 2 of these
>>>>>> notes.
>>>>>>
>>>>>> Bill.
>>>>>>
>>>>>> 2010/1/3 Bill Hart <goodwillh...@googlemail.com>:
>>>>>>> Hmm, so the only reason MPIR can be beating me for small
>>>>>>> multiplications is general coding efficiency. My butterfly strategy
>>>>>>> probably introduces overheads when the coefficients are small, i.e.
>>>>>>> only a few limbs.
>>>>>>>
>>>>>>> The overheads could be lowered by having an iterative FFT for short
>>>>>>> lengths. But we should worry about that later. It will complicate the
>>>>>>> code too much and the existing code needs consolidation first. There's
>>>>>>> a lot of code duplication currently.
>>>>>>>
>>>>>>> Bill.
>>>>>>>
>>>>>>> 2010/1/3 Bill Hart <goodwillh...@googlemail.com>:
>>>>>>>> 2010/1/3 David Harvey <dmhar...@cims.nyu.edu>:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Jan 3, 9:03 am, Bill Hart <goodwillh...@googlemail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I figured I probably did misunderstand. What I think they are trying
>>>>>>>>>> to get at is that one can use coefficients modulo 2^N+1 except at
>>>>>>>>>> certain levels because the roots of unity are actually the same. But
>>>>>>>>>> then the discussion becomes one of combining a Fermat and Mersenne
>>>>>>>>>> transform, which is the basic strategy that the implementation uses.
>>>>>>>>>> I got confused. I still am.
>>>>>>>>>
>>>>>>>>> I think they are saying that to multiply two integers whose product
>>>>>>>>> has N bits, choose a and b (as in Lemma 1 of their paper) such that a
>>>>>>>>> + b ~ N and then multiply modulo 2^a - 1 (use Mersenne transform) and
>>>>>>>>> modulo 2^b + 1 (use Fermat transform). The smoothness comes from the
>>>>>>>>> freedom to choose a and b.
>>>>>>>>>
>>>>>>>>
>>>>>>>> OK, I think I see. Their big N corresponds to an FFT for the original
>>>>>>>> integers, which are multiplied mod 2^N+1 in the case of a Fermat
>>>>>>>> transform or 2^N-1 in the case of a Mersenne transform. I think I
>>>>>>>> would call these negacyclic and cyclic convolutions respectively.
>>>>>>>>
>>>>>>>> But then a Mersenne transform is nothing but an ordinary FFT, no?
>>>>>>>>
>>>>>>>> The idea of combining with a negacyclic FFT seems to be one of the
>>>>>>>> strategies I discussed and discarded.
>>>>>>>>
>>>>>>>> I still don't understand what they mean by "power-of-2 roots of unity
>>>>>>>> are needed only at the "lower level"". I understand that by the "lower
>>>>>>>> level" they seem to mean the pointwise mults. But what is a power of 2
>>>>>>>> root of unity? Aren't they all powers of 2, except when using the
>>>>>>>> sqrt2 trick?
>>>>>>>>
>>>>>>>> Maybe I don't understand the Schonhage-Strassen algorithm correctly.
>>>>>>>> Perhaps they not only worked with coefficients mod 2^B+1 but also did
>>>>>>>> the convolution mod x^m+1. They were worried about asymptotic
>>>>>>>> complexity, not speed, so they wanted their algorithm to recurse so
>>>>>>>> that the pointwise mults could be done using the same algorithm. Gosh
>>>>>>>> that is exactly what the authors of this paper say!
>>>>>>>>
>>>>>>>> So my misunderstanding came from the fact that I assumed SS advocated
>>>>>>>> a Mersenne transform, i.e. mod x^m-1, not a Fermat transform, mod
>>>>>>>> x^m+1.
>>>>>>>>
>>>>>>>> Well, just to be clear: I worked mod x^m-1, because it is more
>>>>>>>> efficient, in a ring mod p where p = 2^B+1. If one wishes to
>>>>>>>> recurse one must instead perform a negacyclic convolution, which
>>>>>>>> involves twiddling to get an ordinary cyclic convolution.
>>>>>>>>
>>>>>>>> Anyhow, this is all a non-strategy once you have truncation. All that
>>>>>>>> is required is to implement a single transform which works with
>>>>>>>> coefficients mod 2^B-1, which I was formerly calling a Mersenne ring,
>>>>>>>> when the large integers being multiplied are "small". Then you don't
>>>>>>>> want to recurse for the pointwise mults, and so they can be done mod
>>>>>>>> 2^{B/2}-1 and 2^{B/2}+1 using our mpn_mulmod_2expm1 and
>>>>>>>> mpn_mulmod_2expp1, which can do whatever is most efficient.
>>>>>>>>
>>>>>>>> Bill.
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
