On Jun 19, 11:24 pm, Jason <ja...@njkfrudils.plus.com> wrote:
> > 24) New Toom22 code, the new code is smaller if we let the high part
> > >= low part, which is the opposite of the current code, so it's
> > probably easier just to rewrite the whole thing.
>
> Hi
>
> Here is an outline of the new toom22_n code. There are obvious O(1) speedups
> to do, but I'll leave them until I've tested the new assembler code, as the
> linear O(n) part is what has improved. I rewrote all the code, as that was
> the easiest way given there are other slight differences (and I do so hate
> reading other people's code). The original code takes the differences between
> the high and low parts, and this has not changed; what has changed is the
> last section, where we add/sub the sub-products together to form the desired
> full product. Originally this consisted of three adds, which on the K8 run at
> 4.5 cycles per word. This was improved with the new mpn_addadd_n function,
> which ran at 3.5 cycles per word, and now I have a new mpn_karaadd
> (i.e. mpn_addaddadd) function which runs at 2.5 cycles per word. The addadd
> function gave us a 2-7% speedup and I pretty much expect the same again. The
> lower bound is actually 2.0 cycles per word, and I think I may be able to
> reach it without too much pipelining. Similar improvements are possible on
> the Intel CPUs and many others (the RISC CPUs are probably easier). I've only
> written the inner loop of the addaddadd function so far, but I don't foresee
> any difficulties; I should be able to finish it next week. In case you are
> wondering, the new asm code won't be general like mpn_addadd_n was, but will
> be specific to toom22 multiplication, as it has to cope with operand overlap
> and the "odd" cases.
>
> Jason

Hi

I've saved about 5 instructions from the inner loop, which is good, and I'm
still searching for something faster; so far I have got it down to 2.350
cycles per word without any pipelining (a plain-C picture of what the loop
computes is just below). There are a number of possible caveats: as we are
starting to hit the load/store bound for the K8, it could get very sensitive
to the relative alignment (mod 8) of the pointers (i.e. rp, rp+n, tp). My
standard testing will be to try all combinations of rp and tp (there is a
sweep sketch at the end of this message), but in this case we also need to
consider rp+n; I think at this stage I'll ignore it and hope for the best.
The K10 can schedule loads/stores better than the K8, so maybe there is a
better inner loop in that case, and as long as I don't use pipelining,
writing both is a trivial exercise.
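
For reference, here is a rough plain-C picture of what the addaddadd inner
loop computes, i.e. rp[] = xp[] + yp[] + zp[] over n limbs with the carry
propagated. The name and signature are only illustrative, and it ignores the
toom22-specific operand overlap and "odd" cases mentioned above; the real
routine is the hand-written asm loop being discussed.

#include <stdint.h>

typedef uint64_t limb_t;

/* Illustrative only: rp[] = xp[] + yp[] + zp[] over n limbs, returning the
   carry out (at most 2).  The point of the asm version is to do this in one
   pass at ~2.5 c/w on K8 instead of two passes (addadd) or three separate
   adds. */
static limb_t addaddadd_n(limb_t *rp, const limb_t *xp,
                          const limb_t *yp, const limb_t *zp, long n)
{
    limb_t cy = 0;
    for (long i = 0; i < n; i++) {
        limb_t s1 = xp[i] + yp[i];
        limb_t c1 = (s1 < xp[i]);    /* carry from x + y           */
        limb_t s2 = s1 + zp[i];
        limb_t c2 = (s2 < s1);       /* carry from adding z        */
        limb_t s3 = s2 + cy;
        limb_t c3 = (s3 < s2);       /* carry from the carry-in    */
        rp[i] = s3;
        cy = c1 + c2 + c3;           /* total carry out, at most 2 */
    }
    return cy;
}
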
Incidentally, the new K103 (or K10.5, or Llano) is nearly out. This is a K102
with a pipelined hardware divider, which is not terribly useful for us, but
I've heard that the multiplier has been improved, though I don't know in what
way :( . I doubt it's the floating-point multiplier, as the chip is integrated
with a GPU (like Bobcat). The schedulers are deeper, which may be useful, but
the clock speeds don't look great; perhaps if you turn off the GPU?
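
And here is a sketch of the alignment sweep I mean: run the routine over every
pairing of limb offsets for rp and tp (0..7 covers the mod-8 cases, with rp+n
then fixed by rp and n) and compare against a known-good product. The
toom22-style signature mul(rp, ap, bp, n, tp) is a guess at the interface, not
the actual code.

#include <stdint.h>
#include <string.h>

typedef uint64_t limb_t;
typedef void (*mul_fn)(limb_t *rp, const limb_t *ap, const limb_t *bp,
                       long n, limb_t *tp);

/* Call mul for every pairing of limb offsets (0..7) of rp and tp inside
   padded buffers, so every relative mod-8 alignment of rp, rp+n and tp
   gets exercised, and compare each result against a reference product.
   Returns 0 on success, or 1 + (roff*8 + toff) for the first failure.
   rbuf and tbuf must each have at least 7 limbs of slack beyond what the
   routine needs. */
static int sweep_alignments(mul_fn mul, const limb_t *ap, const limb_t *bp,
                            long n, limb_t *rbuf, limb_t *tbuf,
                            const limb_t *ref)
{
    for (int roff = 0; roff < 8; roff++)
        for (int toff = 0; toff < 8; toff++) {
            limb_t *rp = rbuf + roff;
            limb_t *tp = tbuf + toff;
            mul(rp, ap, bp, n, tp);
            if (memcmp(rp, ref, 2 * (size_t)n * sizeof(limb_t)) != 0)
                return 1 + roff * 8 + toff;
        }
    return 0;
}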

Jason

