On Jun 19, 11:24 pm, Jason <ja...@njkfrudils.plus.com> wrote:
> 24) New Toom22 code. The new code is smaller if we let the high part
> >= low part, which is the opposite of the current code, so it's
> probably easier just to rewrite the whole thing.

Hi

Here is an outline of the new toom22_n code. There are obvious O(1) speedups still to do, but I'll leave them until I've tested the new assembler code, as the linear O(n) part is what has improved. I rewrote all the code, as that was the easiest way; there are other slight minor differences (and I do so hate reading other people's code). The original code takes the differences between the high and low parts, and that has not changed. What has changed is the last section, where we add/subtract the sub-products together to form the desired full product. Originally this consisted of three adds, which on the K8 run at 4.5 cycles per word; this was improved by the new mpn_addadd_n function, which ran at 3.5 cycles per word, and now I have a new mpn_karaadd (i.e. mpn_addaddadd) function which runs at 2.5 cycles per word. The addadd function gave us a 2-7% speedup, and I pretty much expect the same again. The lower bound is actually 2.0 cycles per word, and I think I may be able to get there without too much pipelining. Similar improvements are possible on the Intel CPUs and many others (the RISC CPUs are probably easier). I've only written the inner loop of the addaddadd function so far, but I don't foresee any difficulties; I should be able to finish it next week. In case you are wondering, the new asm code won't be general-purpose like mpn_addadd_n was, but will be specific to toom22 multiplication, as it has to cope with operand overlap and the "odd" cases.

Jason
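[Editor's note: for readers unfamiliar with the operation being fused, here is a portable C sketch of what an addaddadd-style primitive computes. This is not MPIR's actual code; the name addaddadd_n and the 64-bit limb type are assumptions, and the real mpn_karaadd additionally handles operand overlap and the "odd" cases mentioned above. The point of the asm version is to make a single pass over memory, near the K8's load/store bound, instead of three separate add passes.]

```c
#include <stdint.h>
#include <stddef.h>

/* Reference model of a fused three-way add:
   {rp, n} = {ap, n} + {bp, n} + {cp, n}, returning the carry out.
   Adding three limbs plus a carry-in of at most 2 can carry out at
   most 2, so the running carry stays in 0..2. */
uint64_t addaddadd_n(uint64_t *rp, const uint64_t *ap,
                     const uint64_t *bp, const uint64_t *cp, size_t n)
{
    uint64_t cy = 0;  /* running carry-in, always in 0..2 */
    for (size_t i = 0; i < n; i++) {
        uint64_t s = ap[i] + bp[i];
        uint64_t c = (s < ap[i]);   /* carry out of ap + bp */
        uint64_t t = s + cp[i];
        c += (t < s);               /* carry out of + cp */
        uint64_t u = t + cy;
        c += (u < t);               /* carry out of + carry-in */
        rp[i] = u;
        cy = c;
    }
    return cy;
}
```

Done naively as three mpn_add_n calls, the same work reads and writes rp three times; the fused loop touches each word once, which is where the drop from 4.5 to 2.5 cycles per word comes from.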
Hi

I've saved about 5 instructions from the inner loop, which is good, and I'm still searching for something faster; so far I have it down to 2.350 cycles per word without any pipelining. There are a number of possible caveats: as we are starting to hit the load/store bound for the K8, it could get very sensitive to the relative alignment (mod 8) of the pointers (i.e. rp, rp+n, tp). My standard testing is to try all combinations of rp and tp, but in this case we also need to consider rp+n. I think at this stage I'll ignore it and hope for the best. The K10 can schedule loads/stores better than the K8, so there may be a better inner loop for it, and as long as I don't use pipelining, writing both is a trivial exercise.

Incidentally, the new K103 (or K10.5, or Llano) is nearly out. This is a K102 with a pipelined hardware divider, not terribly useful for us, but I've heard that the multiplier has been improved; I don't know in what way :( . I doubt it's the floating-point multiplier, as the chip is integrated with a GPU (like Bobcat). The schedulers are deeper, which may be useful, but the clock speeds don't look great; perhaps if you turn off the GPU?

Jason

--
You received this message because you are subscribed to the Google Groups "mpir-devel" group.
To post to this group, send email to mpir-devel@googlegroups.com.
To unsubscribe from this group, send email to mpir-devel+unsubscr...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/mpir-devel?hl=en.
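[Editor's note: the "try all combinations of pointer offsets" testing described in the message above could be sketched as follows. This is illustrative only, not MPIR's harness: the names sweep_all_alignments and add_n are invented, and a two-operand add stands in for the routine under test. The idea is to slide each pointer over limb offsets 0..7 inside larger aligned buffers and check that the result is independent of the offsets; a real harness would call the asm candidate and time each combination as well.]

```c
#include <stdint.h>
#include <stddef.h>

#define N 32      /* limbs per operand in this toy sweep */
#define SLIDE 8   /* try each pointer at limb offsets 0..7 */

/* Toy stand-in for the routine under test: {rp,n} = {ap,n} + {bp,n},
   returning the carry out.  A real harness would call (and time) the
   asm candidate here instead. */
static uint64_t add_n(uint64_t *rp, const uint64_t *ap,
                      const uint64_t *bp, size_t n)
{
    uint64_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = ap[i] + cy;
        cy = (s < cy);
        rp[i] = s + bp[i];
        cy += (rp[i] < s);
    }
    return cy;
}

/* Deterministic pseudo-random limbs so every offset sees the same data. */
static uint64_t next_limb(uint64_t *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return *state;
}

/* Run the routine at every combination of limb offsets for the three
   pointers and verify the result does not depend on alignment.
   Returns 0 if all combinations agree with the offset-0 reference. */
int sweep_all_alignments(void)
{
    static uint64_t ra[N + SLIDE], aa[N + SLIDE], ba[N + SLIDE];
    uint64_t ref[N], refcy, seed = 1;

    /* reference result, computed once at offset 0 */
    for (size_t i = 0; i < N; i++) aa[i] = next_limb(&seed);
    for (size_t i = 0; i < N; i++) ba[i] = next_limb(&seed);
    refcy = add_n(ref, aa, ba, N);

    for (int ro = 0; ro < SLIDE; ro++)
        for (int ao = 0; ao < SLIDE; ao++)
            for (int bo = 0; bo < SLIDE; bo++) {
                seed = 1;  /* refill with identical data at new offsets */
                for (size_t i = 0; i < N; i++) aa[ao + i] = next_limb(&seed);
                for (size_t i = 0; i < N; i++) ba[bo + i] = next_limb(&seed);
                if (add_n(ra + ro, aa + ao, ba + bo, N) != refcy)
                    return 1;
                for (size_t i = 0; i < N; i++)
                    if (ra[ro + i] != ref[i])
                        return 1;
            }
    return 0;
}
```

Covering the extra rp+n offset the message worries about would add one more loop level; at 8^4 combinations the sweep is still cheap for correctness, though timing every combination is what actually exposes the alignment sensitivity.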