On Friday 16 October 2009 19:49:01 Fredrik Johansson wrote: > On Thu, Oct 15, 2009 at 7:34 PM, Jason Moxham <ja...@njkfrudils.plus.com> wrote: > > ----- Original Message ----- > > From: "Fredrik Johansson" <fredrik.johans...@gmail.com> > > To: <mpir-devel@googlegroups.com> > > Sent: Thursday, October 15, 2009 7:01 AM > > Subject: [mpir-devel] Function requests > > > >> Hi all, > >> > >> I'm currently looking for fast ways to sum exponential-type Taylor > >> series in fixed point arithmetic using mpz (see > >> http://code.google.com/p/fastfunlib/)). I'm mainly interested in low > >> precision (say a few hundred or thousand bits), so I don't think > >> binary splitting is worthwhile (but I could be wrong here). > > > > For such small sizes , the overhead of the mpz layer will be a fair > > fraction of the runtime. Write it in mpz's to get the algorithm and > > layout sorted , then convert to the mpn layer one part at a time. This is > > how I generally go about things , I dont want to be debugging an > > algorithm at the mpn level. > > This seems like a reasonable strategy. > > >> The most important missing ingredient is a function that does > >> > >> mpz_mul(z, x, y); > >> mpz_tdiv_q_2exp(z, z, p); > >> > >> efficiently in a single step. For other purposes, it should ideally > >> only use the necessary high limbs of x and y so that x can be a fixed > >> point value at much higher precision, without the need to place a > >> truncated copy in a temporary variable. If it helps, it would be > >> possible to choose p so as to be a multiple of the limb size. > > > > We have an undocumented (for now, ie you may need to include gmp-impl.h , > > longlong.h etc) > > > > // (rp,2n)=(xp,n)*(yp,n) > > void mpn_mulhigh_n (mp_ptr rp, mp_srcptr xp, mp_srcptr yp, mp_size_t n) > > > > where the upper n limbs are guarenteed to be correct , ( the lower n > > limbs may be junk or correct) > > > > As we havent actually used it anywhere yet (next mpir version will use it > > for roots) , the basecase sizes are still writen in C and dont compare > > that favorably with the asm writen mul_basecase. When comapiring > > like-for-like it gives ~30% speedup. > > > > For the more general (rp,n+m)=(xp,n)*(yp,m) where only the top k limbs > > are correct , there are many algorithmic choices , Thom Mulders paper has > > some details on this, but for small sizes experiment is best. Note: like > > mullow , mulhigh can be simplified when k<n,m , which is I think what you > > were asking. > > Yes, I think the more general case is needed. The products are > generally unbalanced. > > At least at this point, I'm not ambitious enough myself to implement > and debug custom multiplication algorithms, especially given the > advantage of the existing asm. So I'll probably continue waiting for > this. > > > Basecase is easy though , and basecase is usually fastest <20 limbs = > > 1280bits > > > >> Secondly, there are a lot of divisions by small integers (1, 2, 3, > >> ...). I'm not sure how much they contribute in total, but if there is > >> a way to speed up the first few with precomputation, it could help. > >> > >> A single-step x = x + y/k, k a small integer, would also be useful if > >> it could be done faster. > > > > I expect the addition could be easily absorbed into the division , by > > which I mean that a mpn_add_div would run at the same speed as a mpn_div > > , Precalculating the inverses would be a definite benefit , although we > > dont have any interfaces for that at the moment. Note: some special > > divisors can run at much greater speeds. eg 2^k , 2^k-1 ,2^k+1 , divisors > > of 2^k-1 >
Another idea is to precondition the polynomial your are evaluating , although whether it is faster depends on many things(see Knuth,vol2,evaluating polys) > It occurred to me that instead of dividing on each term, it's possible > to divide by a rising factorial once every several terms (the > increment could be chosen so that the divisor fits in a single limb or > two) and then multiply the factors back (assuming extra precision is > used). How do you think this would compare to division with > precomputed inverses? (I hope not too favorably, because it'd be a > pain to code :-) > On the AMD chip , single limb division now (mpir-1.3) runs at 10.5c/l (soon to be 9.5c/l) whereas multiplication runs at 2.5c/l . We could run two independent divisions at once to reach an effective 7c/l. > Fredrik > > --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "mpir-devel" group. To post to this group, send email to mpir-devel@googlegroups.com To unsubscribe from this group, send email to mpir-devel+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/mpir-devel?hl=en -~----------~----~----~----~------~----~------~--~---