I've committed the changes to the code. The improvements are more modest than I remember them being last night, but here are the new timings with the negacyclic convolution change anyway:
536608768/1: 8.4s / 9.9s (17.9%)
603684864/1: 10.3s / 10.5s
670760960/1: 11.3s / 12.6s (11.5%)
737837056/1: 12.6s / 13.3s (5.6%)
804913152/1: 13.7s / 15.9s (16.1%)
871989248/1: 14.5s / 16.6s (14.5%)
939065344/1: 15.7s / 17.5s (19.0%)
1006141440/1: 16.5s / 17.8s (15.6%)
1073217536/1: 17.6s / 20.6s (17.0%)
1341521920/1: 23.3s / 25.3s (8.6%)
1609826304/1: 27.7s / 31.2s (12.6%)
1878130688/1: 31.4s / 37.4s (19.1%)
2146959360/1: 36.3s / 41.9s (15.4%)

Bill.

2010/1/4 Bill Hart <goodwillh...@googlemail.com>:

> I fixed my negacyclic convolution. I had to change two whole characters to make it more efficient. It now beats MPIR from about 603684864 bits up. By the time we reach about 2GB it beats MPIR by about 20-25%.

> I'll post new times and code tomorrow. It does pass the test code, which is a good sign.

> So *that* problem appears to be gone. Still lots more that could be done for the negacyclic convolution though.

> Bill.

> 2010/1/4 Bill Hart <goodwillh...@googlemail.com>:

>> Ah, I see that our current mpn_mulmod_2expp1 does not do anything special at this stage.

>> So that is my number 3:

>> 3. Make mpn_mulmod_2expp1 deal with the special case where you are multiplying mod 2^{3B}+1, where B is some integer divisible by GMP_LIMB_BITS.

>> Simply use the identity 2^{3B}+1 = (2^B+1)(2^{2B}-2^B+1) and do two multiplications, one mod each of these factors (in precisely the way that mpn_mulmod_2expp1 currently works, i.e. you could call the function recursively). Ensure that there is a tuned cutoff for this trick, as it stops paying off once the multiplications become too large.

>> Of course in the FFT range the negacyclic convolution should be used. But only once we get the negacyclic convolution fast enough to be useful.

>> Bill.

>> 2010/1/3 Bill Hart <goodwillh...@googlemail.com>:

>>> I've had a long think, and I don't think there is a way to get the FFT to use the mpn_mulmod_2expm1 function for pointwise mults, no matter what we do. I think it is fundamentally impossible.

>>> That's completely bizarre to me.

>>> And there is no such thing as an FFT in a Mersenne ring, as I called it. So that's two shocks in one day.

>>> The problem I'm trying to work around is quite serious. Even with truncation there is a big jump in times at power of two lengths, because one needs to use coefficients twice as large, so the time for the pointwise mults shoots up dramatically. It's not a problem for very large multiplications, as doubling the size of the pointwise mults just over doubles the time to do them, which is fine as all your data fits in half as many of them. The problem occurs only when there is a big jump up in the times, as happens when the pointwise mults are using schoolbook multiplication, for example.

>>> I'm amazed there isn't an obvious solution to that problem. The best I can come up with is what was used in FLINT: try to ensure they are mod B^{3m}+1.

>>> Bill.

>>> 2010/1/3 Bill Hart <goodwillh...@googlemail.com>:

>>>> My mind does not keep up with my fingers typing. Of course my suggestion 3 will not work, because of this principal unit thing.

>>>> I'll need to give that whole part of the strategy some more thought.

>>>> I'm also now slightly confused as to why MPIR falls so far behind the new code in the middle, then catches up again at the end. When timing the FFTs against FLINT I also noticed this, i.e. it had nothing to do with pointwise multiplications.

>>>> Bill.
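As a quick sanity check of the identity behind suggestion 3 in the quoted message above, the following standalone sketch (mpz functions only; the choice B = 64 and all variable names are purely illustrative, and this is not a proposal for how mpn_mulmod_2expp1 itself should be written) verifies that 2^{3B}+1 = (2^B+1)(2^{2B}-2^B+1), that the two factors are coprime when B is even (as it is when B is a multiple of the limb size), and that a product mod 2^{3B}+1 can therefore be recovered by CRT from the two smaller products:

    #include <stdio.h>
    #include <assert.h>
    #include <gmp.h>

    int main(void)
    {
        unsigned long B = 64;       /* any even B will do for this check */
        mpz_t p1, p2, n, a, b, full, r1, r2, inv, crt, g;
        gmp_randstate_t st;

        mpz_inits(p1, p2, n, a, b, full, r1, r2, inv, crt, g, NULL);
        gmp_randinit_default(st);

        mpz_ui_pow_ui(p1, 2, B);          /* p1 = 2^B                */
        mpz_ui_pow_ui(p2, 2, 2*B);        /* p2 = 2^(2B)             */
        mpz_sub(p2, p2, p1);              /* p2 = 2^(2B) - 2^B       */
        mpz_add_ui(p2, p2, 1);            /* p2 = 2^(2B) - 2^B + 1   */
        mpz_add_ui(p1, p1, 1);            /* p1 = 2^B + 1            */

        mpz_mul(n, p1, p2);               /* should equal 2^(3B) + 1 */
        mpz_ui_pow_ui(full, 2, 3*B);
        mpz_add_ui(full, full, 1);
        assert(mpz_cmp(n, full) == 0);

        mpz_gcd(g, p1, p2);
        assert(mpz_cmp_ui(g, 1) == 0);    /* coprime, since B is even */

        /* random a, b; compare a*b mod n with the CRT recombination of
           (a*b mod p1, a*b mod p2) */
        mpz_urandomm(a, st, n);
        mpz_urandomm(b, st, n);
        mpz_mul(full, a, b); mpz_mod(full, full, n);
        mpz_mul(r1, a, b);   mpz_mod(r1, r1, p1);
        mpz_mul(r2, a, b);   mpz_mod(r2, r2, p2);

        /* crt = r1 + p1 * ((r2 - r1) * p1^(-1) mod p2) */
        mpz_invert(inv, p1, p2);
        mpz_sub(crt, r2, r1);
        mpz_mul(crt, crt, inv);
        mpz_mod(crt, crt, p2);
        mpz_mul(crt, crt, p1);
        mpz_add(crt, crt, r1);

        assert(mpz_cmp(crt, full) == 0);
        printf("2^(3B)+1 factorisation and CRT recombination check OK\n");
        return 0;
    }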
>>>> 2010/1/3 Bill Hart <goodwillh...@googlemail.com>:

>>>>> 8. What remains to be done.

>>>>> Wow, I made it all the way to section 8! This is the most important section, where I describe what remains to be done in order to make this implementation really fast, clean and worth swapping in for the existing implementation in MPIR.

>>>>> I'm very interested to hear if people would like to volunteer to work on various projects. They range from the completely trivial, which can be carried out by anyone, right through to things that could turn into research projects (some of the first four things below).

>>>>> This list applies to the version of the code that I am about to commit to the repo, which is slightly modified from the version currently in there. Specifically I'm talking about new_fft.c, i.e. the one that doesn't need FLINT (I don't see any point in continuing to update the FLINT-based version; it's there for reference only).

>>>>> Clearly the high level ideas that need the most work are:

>>>>> 1) Implement the sqrt2 trick. In the ring Z/pZ where p = 2^B + 1 the value 2^{3B/4} - 2^{B/4} is a square root of 2. This means you can have convolutions of length 4m instead of 2m by using these extra roots of unity. It should be the case that these are only needed in the very top layer of the FFT and IFFT. However one should still implement them efficiently. Obviously one decomposes a multiplication by (sqrt(2))^{2i+1} into 2^i * sqrt(2) (and clearly sqrt(2)^{2i} just becomes 2^i, which we can already do).

>>>>> The left hand half of the FFT butterfly is unaffected, but the right hand half will need multiplication by sqrt(2). One can multiply by 2^{B/4} first, combining this with the existing butterfly shifts. Then one addmuls/submuls by a further 2^{B/2} for the other "leg" of the sqrt2. Whilst this is less efficient than a standard butterfly, it allows one to double the convolution length and work with smaller coefficients in the pointwise mults, and the cost is small since it is only incurred at the very top layer.

>>>>> For the IFFT one must do a similar thing, decomposing the multiplication by sqrt2. There it is slightly more complex. The best we can probably do is to apply the usual bitshifts first, then apply a sqrt2 combined with any limb shifts, then do a sumdiff. That's three operations, but without additional assembly functions I don't see how to do better. At least any negations can be more easily combined with the sumdiff.

>>>>> 2) Come up with a decent negacyclic convolution strategy. I think a lot of work could be done on this, even research. It's currently inefficient because we need to use a multiple of half the FFT length for the coefficient bit size, which means too much padding.

>>>>> FLINT has some fairly intricate strategy which I barely understand. I think it uses a naive negacyclic convolution with *coefficients* modulo a power of the limb size, i.e. mod B^i for some small i, and combines it with the information you get from an FFT negacyclic convolution with *coefficients* modulo some 2^b+1. It's still a negacyclic convolution, so that still gives you the product mod 2^B+1 that you are after. I'm not sure if that is the idea or not. I just remember it being a real headscrew for me when I tried to figure it out.

>>>>> I don't think the FLINT implementation was ever carried through to its logical conclusion however. It does work, but I don't know about the efficiency. It is definitely more efficient than what I am currently doing, however, which is nothing. From memory the idea in FLINT was to force the pointwise multiplications *of the FFT negacyclic convolution* to be a multiple of 3 limbs, so that the identity 2^{3B}+1 = (2^B+1)(2^{2B}-2^B+1) can be used, as explained in an earlier section.

>>>>> I never got through thinking about whether that strategy was also worth pursuing at the FFT level itself and not just for the pointwise mults. Is there a karatsuba or toom negacyclic convolution algorithm? I don't know the answer; I never thought it through.

>>>>> Anyhow, there is no requirement that we do what is done in FLINT, just that we do something sensible.
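The square root of 2 used in item 1 above is easy to check numerically. A throwaway GMP program (the name, the choice B = 128 and the variable names are mine, purely for illustration) confirming that (2^{3B/4} - 2^{B/4})^2 is congruent to 2 mod 2^B + 1 whenever 4 divides B:

    #include <stdio.h>
    #include <assert.h>
    #include <gmp.h>

    int main(void)
    {
        unsigned long B = 128;           /* any B divisible by 4 */
        mpz_t p, s, t;

        mpz_inits(p, s, t, NULL);

        mpz_ui_pow_ui(p, 2, B);
        mpz_add_ui(p, p, 1);             /* p = 2^B + 1            */

        mpz_ui_pow_ui(s, 2, 3*B/4);
        mpz_ui_pow_ui(t, 2, B/4);
        mpz_sub(s, s, t);                /* s = 2^(3B/4) - 2^(B/4) */

        mpz_mul(s, s, s);
        mpz_mod(s, s, p);                /* s = s^2 mod p          */

        assert(mpz_cmp_ui(s, 2) == 0);   /* s is a square root of 2 in Z/pZ */
        printf("sqrt2 identity holds for B = %lu\n", B);
        return 0;
    }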
>>>>> 3) Implement a transform with coefficients mod p = 2^{sB}-1. This would still be a standard cyclic convolution, using an "ordinary FFT", just that you are doing your reductions mod p with an addition instead of a subtraction. This would require rewriting the butterflies to work modulo this different p. I didn't check whether there is a sqrt2 trick, and certainly only convolutions of half the length can be supported. But I am sure the savings in the pointwise multiplications will make this trick worth implementing. With truncation, efficiency becomes much less of an issue.

>>>>> Now I proceed through new_fft.c one function at a time and mention explicitly what needs to be done, while it is all still fresh in my mind. I will number the ideas starting from 4, so that people can volunteer to take them on, by number, if they should so choose.

>>>>> Feel free to add suggestions to the list.

>>>>> mpn_addmod_2expp1_1:

>>>>> 4) As noted in the code comments, this can probably be done faster in the case where c is small (the generic case), whether positive or negative. In fact it is probably only ever used in such cases.

>>>>> 5) The function should probably be renamed. The name doesn't make a whole lot of sense.

>>>>> revbin:

>>>>> 6) This is very inefficient. I don't know an efficient algorithm, but I am sure one exists (a sketch of one possibility appears below, after item 9).

>>>>> 7) The name should be changed to something more exotic to avoid polluting the namespace when linking with other libraries.

>>>>> FFT_split* :

>>>>> 8) We need a function which, given the parameters currently sent to FFT_split_bits and a coefficient number, just returns that single coefficient. The reason for this is that splitting the integer up into chunks (to the bit) can then be merged with the top layer of the FFT, which has consequences for cache locality. In particular, if you were doing an MFA transform you'd pull out only the coefficients needed for each column FFT at a time, then immediately do that FFT.

>>>>> 9) Experiment with cache hints. On Opterons in particular we found this to give a speedup in FLINT. The current code indicates where cache hints should be used, though this strategy should be combined with 8.
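On item 6: one standard way to speed up revbin is to reverse an entire limb with masks and shifts, then discard the unwanted low-order bits. A sketch for 64-bit limbs (the function name revbin_fast and the assumption GMP_LIMB_BITS == 64 are mine; a 32-bit variant needs one fewer step and narrower masks):

    /* Reverse the low `bits` bits of x, 0 < bits <= 64, assuming 64-bit limbs. */
    static mp_limb_t revbin_fast(mp_limb_t x, unsigned long bits)
    {
        x = ((x & 0x5555555555555555ULL) << 1)  | ((x >> 1)  & 0x5555555555555555ULL);
        x = ((x & 0x3333333333333333ULL) << 2)  | ((x >> 2)  & 0x3333333333333333ULL);
        x = ((x & 0x0F0F0F0F0F0F0F0FULL) << 4)  | ((x >> 4)  & 0x0F0F0F0F0F0F0F0FULL);
        x = ((x & 0x00FF00FF00FF00FFULL) << 8)  | ((x >> 8)  & 0x00FF00FF00FF00FFULL);
        x = ((x & 0x0000FFFF0000FFFFULL) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFULL);
        x = (x << 32) | (x >> 32);      /* full 64-bit reversal done           */
        return x >> (64 - bits);        /* keep only the reversed low `bits`   */
    }

Alternatively the bit-reversed index can be maintained incrementally in the loop that permutes the coefficients, avoiding a revbin call per index altogether; either approach should beat a naive one-bit-at-a-time loop.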
>>>>> 10) Replace any ulongs which should really be mp_limb_t's so that things will work on Windows.

>>>>> FFT_combine*:

>>>>> 11) Again it might be possible to combine just a single coefficient at a time so that cache locality can be maintained for the MFA IFFTs. One has to be slightly careful to avoid a quadratic run time, but for unsigned coefficients I think it will be OK. One also needs to selectively zero the limbs of the output integer in step with these coefficient combines, I would guess, so that step should probably be combined with this somehow. Needs careful thought.

>>>>> 12) Pass in temporary space from the main integer multiplication routine.

>>>>> mpn_submod_i_2expp1:

>>>>> 13) Remove, as it is only used in the split radix code, which is not utilised.

>>>>> mpn_lshB_sumdiffmod_2expp1:
>>>>> mpn_sumdiff_rshBmod_2expp1:

>>>>> 14) The names are stupid and should be changed. They are too specific to be mpn functions.

>>>>> 15) Is there some symmetry which can be utilised to make the functions simpler? There seem to be a lot of cases. I don't think there is, as one is doing a subtraction; however, combining with a possible negation without increasing the code complexity might be possible. Certainly the case x = 0 is the (I)FFT butterfly case, so surely some consolidation is possible there.

>>>>> 16) Can these functions be done entirely in assembly? Either way, the overheads need to be reduced as much as possible, and we still need a generic C implementation.

>>>>> mpn_mul_2expmod_2expp1:
>>>>> mpn_div_2expmod_2expp1:

>>>>> 17) The first function should probably just deal with all cases, whether bit shifts or shifts including limb shifts, positive or negative. I ended up implementing the FFT and negacyclic_twiddle functions because this didn't support limb shifts, and these were needed in the truncated FFT. So some rationalisation is possible there. Do away with the mpn_div_2expmod_2expp1 function: do a right shift by d bits by doing a left shift by 2*n*w - d bits, and make this function handle the negation. But see how the FFT_twiddle and FFT_negacyclic_twiddle functions are implemented; they return an int to indicate whether they did their operation in-place, to save copying data around when it is not necessary. Perhaps just pass the temporary in as a parameter and do the swap internal to the function.

>>>>> 18) Can this function be optimised in assembly?

>>>>> FFT_split_radix_butterfly:
>>>>> FFT_split_radix_butterfly2:

>>>>> 19) Remove, as no longer utilised.

>>>>> FFT_radix2_twiddle_butterfly:
>>>>> FFT_radix2_twiddle_inverse_butterfly:

>>>>> 20) Surely this can be combined with the ordinary (I)FFT_radix2_butterfly functions. However note that it is better to pass nw than to pass n and w separately as we do elsewhere. Perhaps rename this parameter to nw or limbs, not NW.

>>>>> 21) The radix2 in the name is redundant, as we only implement radix 2, and other radices are impractical. Omit the twiddle from the name.

>>>>> 22) The motif:

>>>>>    b1 %= (2*NW);
>>>>>    if (b1 >= NW)
>>>>>    {
>>>>>       negate2 = 1;
>>>>>       b1 -= NW;
>>>>>    }
>>>>>    x = b1/GMP_LIMB_BITS;
>>>>>    b1 -= x*GMP_LIMB_BITS;

>>>>> occurs often in this and other functions, and should be abstracted away into a macro or something like that (a sketch appears directly below). It's not necessarily optimised either.
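A macro along the following lines would capture the item 22 motif. The name FFT_SPLIT_SHIFT and the exact interface are only suggestions, and it assumes, as the fragment above does, that NW is the bit size of the ring (i.e. we are working mod 2^NW + 1, so a shift by NW is a negation) and that the caller has already initialised the negate flag to 0:

    /* Reduce the shift count b modulo 2*NW, recording a negation if it lands
       in the upper half, then split the remaining shift into whole limbs and
       leftover bits.  Caller initialises `negate` to 0, as in the existing code. */
    #define FFT_SPLIT_SHIFT(b, NW, negate, limbs, bits)   \
       do {                                               \
          (b) %= (2*(NW));                                \
          if ((b) >= (NW))                                \
          {                                               \
             (negate) = 1;                                \
             (b) -= (NW);                                 \
          }                                               \
          (limbs) = (b) / GMP_LIMB_BITS;                  \
          (bits)  = (b) - (limbs) * GMP_LIMB_BITS;        \
       } while (0)

With that, the fragment in item 22 would read FFT_SPLIT_SHIFT(b1, NW, negate2, x, b1); (passing b1 for both the input shift and the leftover bits is fine here because the limb count is computed first).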
>>>>> FFT_radix2_butterfly:
>>>>> FFT_radix2_inverse_butterfly:

>>>>> 23) Surely it is better to pass nw and an actual bit shift rather than i, since otherwise n*w needs to be computed over and over, or at least i*w does.

>>>>> 24) In FFT_radix2_butterfly, I don't think i >= n is actually possible, so remove that.

>>>>> FFT_split_radix:

>>>>> 25) Remove, as it is not utilised.

>>>>> FFT_radix2:

>>>>> 26) Remove, as it is no longer used.

>>>>> FFT_radix2_new:
>>>>> FFT_radix2_twiddle:
>>>>> IFFT_radix2_new:
>>>>> IFFT_radix2_twiddle:

>>>>> 27) radix2 is superfluous in the name, as is the new.

>>>>> 28) Combine the non-twiddle functions with the twiddle versions.

>>>>> 29) I don't believe the result stride rs is ever used, and we always do the FFT and IFFT in-place, so remove rr and rs entirely.

>>>>> 30) It is no longer necessary to pass temp. It isn't used.

>>>>> 31) Can t1 and t2 be combined without a loss of speed? Please benchmark and be *absolutely sure*.

>>>>> 32) The motif:

>>>>>    ptr = rr[0];
>>>>>    rr[0] = *t1;
>>>>>    *t1 = ptr;

>>>>> occurs frequently here and elsewhere and should definitely be abstracted away into a pointer swap macro, or use one of the already provided MPIR macros for this (see the sketch below, after item 45).

>>>>> FFT_negacyclic_twiddle:
>>>>> FFT_twiddle:

>>>>> 33) Remove, see 17. However note that an indication of how to combine the negation is provided in commented out code (which is not tested).

>>>>> FFT_radix2_truncate1:
>>>>> FFT_radix2_truncate:
>>>>> FFT_radix2_truncate1_twiddle:
>>>>> FFT_radix2_truncate_twiddle:
>>>>> IFFT_radix2_truncate1:
>>>>> IFFT_radix2_truncate1_twiddle:
>>>>> IFFT_radix2_truncate:
>>>>> IFFT_radix2_truncate_twiddle:

>>>>> 34) Combine non-twiddle and twiddle versions. Remove radix2 from the name.

>>>>> 35) Remove rr and rs. Do in-place.

>>>>> 36) Correct code comments which are cut and pasted from elsewhere.

>>>>> 37) Write special cases for n = 1, 2 to remove the unnecessary restriction that trunc be divisible by 2. Remember to remove the restriction in the multiplication code which sets trunc to a multiple of 2*sqrt instead of just a multiple of sqrt.

>>>>> FFT_radix2_negacyclic:
>>>>> IFFT_radix2_negacyclic:

>>>>> 38) Remove rr and rs. Always work in-place.

>>>>> 39) Do not pass temp.

>>>>> FFT_radix2_mfa:
>>>>> FFT_radix2_mfa_truncate:
>>>>> IFFT_radix2_mfa:
>>>>> IFFT_radix2_mfa_truncate:

>>>>> 40) Do not pass temp.

>>>>> 41) Combine truncate and non-truncate versions. Actually the non-truncate versions are not really needed.

>>>>> FFT_mulmod_2expp1:

>>>>> 42) Memory allocation should be done in a single block, or even passed in as a temporary.

>>>>> 43) Clean up the code which handles the subtractions and make it cache friendly.

>>>>> new_mpn_mulmod_2expp1:

>>>>> 44) Doesn't currently deal with the case that c != 0 when using the negacyclic convolution (which it currently never uses anyway, because it is too inefficient), i.e. the case where one of the inputs is 2^{n*w} so that it doesn't fit into (n*w)/GMP_LIMB_BITS limbs.

>>>>> 45) Doesn't currently return any carry, i.e. when the result is p-1 there is a carry.
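On item 32: the pointer swap really is a one-liner. If memory serves, gmp-impl.h already provides MP_PTR_SWAP for exactly this (worth checking); failing that, a hypothetical macro such as the following does the job:

    /* Swap two mp_limb_t * pointer values (not the data they point to). */
    #define FFT_PTR_SWAP(a, b)        \
       do {                           \
          mp_limb_t * tmp_ = (a);     \
          (a) = (b);                  \
          (b) = tmp_;                 \
       } while (0)

The fragment in item 32 then becomes FFT_PTR_SWAP(rr[0], *t1); and the ptr temporary disappears from each caller.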
>>>>> new_mpn_mul_mfa:

>>>>> 46) In the memory allocation part, one shouldn't mix allocations of arrays of mp_limb_t's and mp_limb_t *'s as this is not necessarily safe on all machines. Or is it?

>>>>> General:

>>>>> 47) The code needs to be made Windows safe by avoiding variable declarations except at the beginnings of blocks.

>>>>> 48) Some functions don't have adequate test functions.

>>>>> 49) I think the normalise function is correct, but the test code doesn't check all cases.

>>>>> 50) The test code doesn't test the FFTs, IFFTs and multiplication functions for lots of different sizes, especially different values of n, w, and trunc.

>>>>> Bill.

>>>>> 2010/1/3 Bill Hart <goodwillh...@googlemail.com>:

>>>>>> 5. References.

>>>>>> I am indebted to the people who prepared the following references:

>>>>>> "Matters Computational: ideas, algorithms and source code", by Jörg Arndt, see http://www.jjj.de/fxt/fxtbook.pdf

>>>>>> Specifically I used chapter 3. Note that I took very great care not to refer to the source code. I found the notation used in equations such as 19.2-6 very helpful, and this book was also the resource I used for a description of the Matrix Fourier Algorithm and the split radix algorithm (which did not actually end up being used by my integer multiplication implementation).

>>>>>> "Prime Numbers: A Computational Perspective", by Richard Crandall and Carl Pomerance, 2nd ed., 2005, Springer.

>>>>>> Specifically I used this reference to figure out why my negacyclic convolution didn't work (I had forgotten that the coefficients would be signed).

>>>>>> Two other references which I referred to, which did not end up featuring in the implementation I settled on (so far), but for which I am grateful to the authors, are:

>>>>>> "A GMP-based implementation of Schonhage-Strassen's Large Integer Multiplication Algorithm" by Pierrick Gaudry, Alexander Kruppa and Paul Zimmermann, ISSAC 2007 proceedings, pp 167-174. See http://www.loria.fr/~gaudry/publis/issac07.pdf

>>>>>> and

>>>>>> "Multidigit multiplication for mathematicians" by Dan Bernstein (to appear). See http://cr.yp.to/papers/m3.pdf

>>>>>> Specifically this paper helped me sort out why my radix 2/3 implementation didn't work. I would also say that the algebraic way of thinking about the FFT which is advocated in this paper also helped when it came to thinking up a truncation strategy. I had been looking at this paper to work out why my radix 2/3 thing didn't work and went for a walk to town in frustration. By the time I got back I had basically figured out how to do truncation without radix 3. The initial thought that set it off was to think about the FFT as breaking into a cyclic and negacyclic half, as described in section 2 of these notes.

>>>>>> Bill.

>>>>>> 2010/1/3 Bill Hart <goodwillh...@googlemail.com>:

>>>>>>> Hmm, so the only reason MPIR can be beating me for small multiplications is general coding efficiency. My butterfly strategy probably introduces overheads when the coefficients are small, i.e. only a few limbs.
>>>>>>> >>>>>>> The overheads could be lowered by having an iterative FFT for short >>>>>>> lengths. But we should worry about that later. It will complicate the >>>>>>> code too much and the existing code needs consolidation first. There's >>>>>>> a lot of code duplication currently. >>>>>>> >>>>>>> Bill. >>>>>>> >>>>>>> 2010/1/3 Bill Hart <goodwillh...@googlemail.com>: >>>>>>>> 2010/1/3 David Harvey <dmhar...@cims.nyu.edu>: >>>>>>>>> >>>>>>>>> >>>>>>>>> On Jan 3, 9:03 am, Bill Hart <goodwillh...@googlemail.com> wrote: >>>>>>>>> >>>>>>>>>> I figured I probably did misunderstand. What I think they are trying >>>>>>>>>> to get at is that one can use coefficients modulo 2^N+1 except at >>>>>>>>>> certain levels because the roots of unity are actually the same. But >>>>>>>>>> then the discussion becomes one of combining a Fermat and Mersenne >>>>>>>>>> transform, which is the basic strategy that the implementation uses. >>>>>>>>>> I >>>>>>>>>> got confused. I still am. >>>>>>>>> >>>>>>>>> I think they are saying that to multiply two integers whose product >>>>>>>>> has N bits, choose a and b (as in Lemma 1 of their paper) such that a >>>>>>>>> + b ~ N and then multiply modulo 2^a - 1 (use Mersenne transform) and >>>>>>>>> modulo 2^b + 1 (use Fermat transform). The smoothness comes from the >>>>>>>>> freedom to choose a and b. >>>>>>>>> >>>>>>>> >>>>>>>> OK, I think I see. Their big N corresponds to an FFT for the original >>>>>>>> integers, which are multiplied mod 2^N+1 in the case of a Fermat >>>>>>>> transform or 2^N-1 in the case of a Mersenne transform. I think I >>>>>>>> would call these negacyclic and cyclic convolutions respectively. >>>>>>>> >>>>>>>> But then a Mersenne transform is nothing but an ordinary FFT, no? >>>>>>>> >>>>>>>> The idea of combining with a negacyclic FFT seems to be one of the >>>>>>>> strategies I discussed and discarded. >>>>>>>> >>>>>>>> I still don't understand what they mean by "power-of-2 roots of unity >>>>>>>> are needed only at the "lower level"". I understand that by the "lower >>>>>>>> level" they seem to mean the pointwise mults. But what is a power of 2 >>>>>>>> root of unity? Aren't they all powers of 2, except when using the >>>>>>>> sqrt2 trick. >>>>>>>> >>>>>>>> Maybe I don't understand the Schonhage-Strassen algorithm correctly. >>>>>>>> Perhaps they not only worked with coefficients mod 2^B+1 but also did >>>>>>>> the convolution mod x^m+1. They were worried about asymptotic >>>>>>>> complexity, not speed, so they wanted their algorithm to recurse so >>>>>>>> that the pointwise mults could be done using the same algorithm. Gosh >>>>>>>> that is exactly what the authors of this paper say! >>>>>>>> >>>>>>>> So my misunderstanding came from the fact that I assumed SS advocated >>>>>>>> a Mersenne transform, i.e. mod x^m-1, not a Fermat transform, mod >>>>>>>> x^m+1. >>>>>>>> >>>>>>>> Well, just to be clear. I worked mod x^m-1 because it is more >>>>>>>> efficient, and in a ring mod p where p = 2^B+1, for which one must >>>>>>>> perform a negacyclic convolution if one wishes to recurse, which >>>>>>>> involves twiddling to get an ordinary cyclic convolution. >>>>>>>> >>>>>>>> Anyhow, this is all a non-strategy once you have truncation. All that >>>>>>>> is required is to implement a single transform which works with >>>>>>>> coefficients mod 2^B-1, which I was formerly calling a Mersenne ring, >>>>>>>> when the large integers being multiplied are "small". 
>>>>>>>> Then you don't want to recurse for doing the pointwise mults, and so the pointwise mults can be done mod 2^{B/2}-1 and 2^{B/2}+1 using our mpn_mulmod_2expp1 and mpn_mulmod_2expm1, which can do whatever is most efficient.

>>>>>>>> Bill.
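To make the last point concrete: reduction modulo 2^(n*GMP_LIMB_BITS) - 1 really is just a wraparound addition, which is why a Mersenne-style mpn_mulmod_2expm1 is attractive for the pointwise mults. A minimal sketch of the reduction step, assuming the usual mpn types and functions; the name fold_mod_2expm1 is mine and the result is left semi-normalised (it may come out equal to the modulus itself, which represents 0):

    #include <gmp.h>    /* or the MPIR equivalent headers */

    /* r <- {t, 2n} mod (2^(n*GMP_LIMB_BITS) - 1), using 2^(n*GMP_LIMB_BITS) == 1.
       The result {r, n} may equal the modulus (all ones), which represents 0. */
    static void fold_mod_2expm1(mp_ptr r, mp_srcptr t, mp_size_t n)
    {
        mp_limb_t cy;

        cy = mpn_add_n(r, t, t + n, n);   /* low half + high half     */
        cy = mpn_add_1(r, r, n, cy);      /* fold the carry back in   */
        if (cy)                           /* at most one further wrap */
            mpn_add_1(r, r, n, cy);
    }

Reductions mod 2^(n*GMP_LIMB_BITS) + 1 need a subtraction and a possible borrow correction instead, which is the asymmetry alluded to in item 3 above.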