On Friday 16 October 2009 19:49:01 Fredrik Johansson wrote:
> On Thu, Oct 15, 2009 at 7:34 PM, Jason Moxham <ja...@njkfrudils.plus.com> 
wrote:
> > ----- Original Message -----
> > From: "Fredrik Johansson" <fredrik.johans...@gmail.com>
> > To: <mpir-devel@googlegroups.com>
> > Sent: Thursday, October 15, 2009 7:01 AM
> > Subject: [mpir-devel] Function requests
> >
> >> Hi all,
> >>
> >> I'm currently looking for fast ways to sum exponential-type Taylor
> >> series in fixed point arithmetic using mpz (see
> >> http://code.google.com/p/fastfunlib/)). I'm mainly interested in low
> >> precision (say a few hundred or thousand bits), so I don't think
> >> binary splitting is worthwhile (but I could be wrong here).
> >
> > For such small sizes , the overhead of the mpz layer will be a fair
> > fraction of the runtime. Write it in mpz's to get the algorithm and
> > layout sorted , then convert to the mpn layer one part at a time. This is
> > how I generally go about things , I dont want to be debugging an
> > algorithm at the mpn level.
>
> This seems like a reasonable strategy.
>
> >> The most important missing ingredient is a function that does
> >>
> >> mpz_mul(z, x, y);
> >> mpz_tdiv_q_2exp(z, z, p);
> >>
> >> efficiently in a single step. For other purposes, it should ideally
> >> only use the necessary high limbs of x and y so that x can be a fixed
> >> point value at much higher precision, without the need to place a
> >> truncated copy in a temporary variable. If it helps, it would be
> >> possible to choose p so as to be a multiple of the limb size.
> >
> > We have an undocumented (for now, ie you may need to include gmp-impl.h ,
> > longlong.h etc)
> >
> > // (rp,2n)=(xp,n)*(yp,n)
> > void mpn_mulhigh_n (mp_ptr rp, mp_srcptr xp, mp_srcptr yp, mp_size_t n)
> >
> > where the upper n limbs are guarenteed to be correct , ( the lower n
> > limbs may be junk or correct)
> >
> > As we havent actually used it anywhere yet (next mpir version will use it
> > for roots) , the basecase sizes are still writen in C and dont compare
> > that favorably with the asm writen mul_basecase. When comapiring
> > like-for-like it gives ~30% speedup.
> >
> > For the more general (rp,n+m)=(xp,n)*(yp,m) where only the top k  limbs
> > are correct , there are many algorithmic choices , Thom Mulders paper has
> > some details on this, but for small sizes experiment is best. Note: like
> > mullow , mulhigh can be simplified when k<n,m , which is I think what you
> > were asking.
>
> Yes, I think the more general case is needed. The products are
> generally unbalanced.
>
> At least at this point, I'm not ambitious enough myself to implement
> and debug custom multiplication algorithms, especially given the
> advantage of the existing asm. So I'll probably continue waiting for
> this.
>
> > Basecase is easy though , and basecase is usually fastest <20 limbs =
> > 1280bits
> >
> >> Secondly, there are a lot of divisions by small integers (1, 2, 3,
> >> ...). I'm not sure how much they contribute in total, but if there is
> >> a way to speed up the first few with precomputation, it could help.
> >>
> >> A single-step x = x + y/k, k a small integer, would also be useful if
> >> it could be done faster.
> >
> > I expect the addition could be easily absorbed into the division , by
> > which I mean that a mpn_add_div would run at the same speed as a mpn_div
> > , Precalculating the inverses would be a definite benefit , although we
> > dont have any interfaces for that at the moment. Note: some special
> > divisors can run at much greater speeds. eg 2^k , 2^k-1 ,2^k+1 , divisors
> > of 2^k-1
>

Another idea is to precondition the polynomial your are evaluating , although 
whether it is faster depends on many things(see Knuth,vol2,evaluating polys)

> It occurred to me that instead of dividing on each term, it's possible
> to divide by a rising factorial once every several terms (the
> increment could be chosen so that the divisor fits in a single limb or
> two) and then multiply the factors back (assuming extra precision is
> used). How do you think this would compare to division with
> precomputed inverses? (I hope not too favorably, because it'd be a
> pain to code :-)
>

On the AMD chip , single limb division now (mpir-1.3) runs at 10.5c/l (soon to 
be 9.5c/l) whereas multiplication runs at 2.5c/l . We could run two 
independent divisions at once to reach an effective 7c/l.


> Fredrik
>
> 

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"mpir-devel" group.
To post to this group, send email to mpir-devel@googlegroups.com
To unsubscribe from this group, send email to 
mpir-devel+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/mpir-devel?hl=en
-~----------~----~----~----~------~----~------~--~---

Reply via email to