On Thursday 26 February 2009 11:04:52 Bill Hart wrote:
> I'd be happy if Jason's idea gave us 6-7% on K8-K10 because that must
> be how much we are behind the impending 4.3 release.
>

I doubt we will get that much by doing the addmul , but we can do 

1) have a separate mul_basecase that mul_kara calls , so we know size >=14 
therefore we can remove a branch from the outer loop. This may not show up on 
the speed benchmark , because of branch prediction , but in real world code?
2) completely pipeline the above , code size increase of 1.3x , perhaps 2-4% 
faster.

My guess is that 4.3 has the above plus a fft tweek , and a mpz tweek.

3) Write mul_kara as assembly, or perhaps just the first recursion , this 
should save a lot of overhead. Consider the difference between mul_basecase 
in assembler and C , where we save *just* function call overhead , and size%4 
decision logic , although we are working at larger sizes on the kara , but we 
do have more logic to save, who knows?? 

4) The usual kara-mul of size 2n
    3 muls of size n
    2 adds of size n
    3 adds of size 2n
    is 5 fn calls and 8n adds
   using addadd we get 4 fn calls and 7.3n adds (assuming add=1.5c , 
addadd=2.2c)
   using addmul we can certainly save n adds , and I'm confident of 2n adds
   In the mpir manual note that under kara-mul we can save n adds but fn call 
overheads kill it . Before we write any assembler we want a C version with 
minimal adds.


And more speculatively...

5) On the K8/K10 most(all?) the assembler loops are 4-way unroll , and we have 
to make this decision every time we call a mpn_function , could just have 4 
functions for each mpn_function , again depends on how good the branch 
prediction is , we would have to write it and try it on some real world code.

6) Forget any pretense that mul_kara is a recursive function , we only do 3 
levels until we hit toom3 , optimise each level separatley and perhaps find 
some extra cancelation by treating each level differently.


> Essentially it is just addmul and submul we need right? They could be
> useful to have in assembly in their own right.
>
> I think I understand Jason's idea, but I don't understand why this
> would work only for the lowest level of the recursion. Doesn't it work
> at all levels?
>

Yes , but of course only for 1/3 of the muls , 

> Bill.
>
> 2009/2/26 Jason Martin <jason.worth.mar...@gmail.com>:
> > I like Jason's idea.
> >
> > I'm not sure if it would generate a huge improvement on K8-10 type
> > processors where the mul is already quite fast, but on the Intel
> > processors with their slow multiply instructions it seems like doing
> > "add submul" would be faster than "addadd mul" just because the
> > add/sub can hide inside the latency of the mul operation.
> >

Yeah , core2 add/sub are slow due to latency of adc , so we benefit more from 
combined operations like addadd,addlsh or addmul , and the slow mul means we 
can squeeze another operation inside the mul loop easier, core2 is wider than 
K10 as well, just more difficult to use.

Look at the gmp-mparam for core2 , the kara threshold is 18  , and toom is 
65 , compared with 26,89 for K8 

> > --jason
> > - Show quoted text -
> >
> > On Thu, Feb 26, 2009 at 1:41 AM,  <ja...@njkfrudils.plus.com> wrote:
> >> On Wednesday 25 February 2009 20:00:19 Bill Hart wrote:
> >>> I had a play with the karatsuba function today. I figured that one
> >>> might be able to cut out some overhead by having it complete the last
> >>> two iterations of the recursion in one go. I wrote some code which
> >>> works for the last two iterations when n/2, at that point, is
> >>> divisible by 4. But after fiddling for quite some time to try and find
> >>> a way to make the existing code faster with this as a special case, I
> >>> have not seen any speedups. Anyhow, here is the code, in case someone
> >>> else had some ideas. It obviously only works on systems with
> >>> mpn_addadd_n and mpn_addsub_n.
> >>
> >> Consider a new assembler function
> >>
> >> carry=mpn_muladd_basecase(rp,sp1,n1,sp2,n2)
> >>
> >> functionally
> >> mpn_mul_basecase(tp,sp1,n1,sp2,n2);
> >> carry=mpn_add_n(rp,rp,tp,n1+n2)
> >>
> >> implemented exactly like mpn_mul_basecase but with the first row an
> >> addmul_1 rather than a mul_1 , and a extra carry to be taken from row to
> >> row.
> >>
> >> This is easy to implement at maximum efficiency , and it runs at the
> >> same speed as a mul_basecase but with a double sized add thrown in for
> >> free.Clearly we can do the same with sqr_basecase.Using it in the
> >> current karatsuba code would conflict with the use of addadd/addsub but
> >> I suppose/hope we can come up with a varient where we can use both.You
> >> would only use it at the lowest level of  recursion.An extension of this
> >> would be the obvious mpn_muladd_kara which we would use for the "middle
> >> mul"  as you only need it once.It may be helpful to play around with the
> >> different karatsuba formulars to find a better match.
> >>
> >> Jason
>
> 


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"mpir-devel" group.
To post to this group, send email to mpir-devel@googlegroups.com
To unsubscribe from this group, send email to 
mpir-devel+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/mpir-devel?hl=en
-~----------~----~----~----~------~----~------~--~---

Reply via email to