I've tweaked some more speed out of the mul_basecase function, and I now get a
benchmark of 9000 (gmp-4.2.4 is 6000). The current code is 1361 bytes long,
which includes 115 bytes for sizes <= 2. Further improvements can be made, but
they would increase the code size. We could separate out sizes <= 5 limbs; this
would take roughly 600 extra bytes (a guess) and save a branch in the outer
loop, which should save a few cycles (say 2%?) in the real world. Another way
is to pipeline the internal addmul_1's together. This would probably double the
code size, but if I can get it to work, we might get up to 10% more speed.
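
For anyone not familiar with the structure being discussed, here is a rough C
sketch of a schoolbook mul_basecase built from one mul_1 pass followed by
repeated addmul_1 passes. It is only an illustration, not the assembly code
the sizes and timings above refer to. The ref_* names and the limb_t typedef
are made up for this sketch, and it assumes 64-bit limbs and a compiler with
unsigned __int128 (GCC or Clang).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t; /* assumption: one limb is 64 bits */

/* rp[0..n-1] = up[0..n-1] * v; returns the carry-out limb. */
static limb_t ref_mul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 t = (unsigned __int128)up[i] * v + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 64);
    }
    return carry;
}

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* up[i]*v + rp[i] + carry always fits in 128 bits */
        unsigned __int128 t = (unsigned __int128)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 64);
    }
    return carry;
}

/* rp[0..un+vn-1] = up[0..un-1] * vp[0..vn-1].
   Requires un >= vn >= 1 and rp not overlapping up or vp. */
static void ref_mul_basecase(limb_t *rp, const limb_t *up, size_t un,
                             const limb_t *vp, size_t vn)
{
    assert(un >= vn && vn >= 1);
    /* First row initialises rp[0..un]. */
    rp[un] = ref_mul_1(rp, up, un, vp[0]);
    /* Each remaining row is one addmul_1, shifted up by one limb.
       This outer loop is where the per-row branch and the pipeline
       start-up and wind-down costs are paid. */
    for (size_t j = 1; j < vn; j++)
        rp[un + j] = ref_addmul_1(rp + j, up, un, vp[j]);
}
```

Special-casing small sizes or fusing consecutive addmul_1 rows would replace
the generic outer loop here with more specialised code, which is where the
code size goes.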
While typing this I just had a thought: perhaps using CMOV would allow me to
safely load the next limb ahead of the current iteration. I'll have to see if
it helps. If this can be made to work, then I won't need another "copy" for
the last part of the pipeline.
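
To make the CMOV idea concrete, here is a small C sketch of one way it might
look. It is my reading of the idea, not tested assembly. The loop loads the
next limb one iteration early, and on the final iteration the index is clamped
so the load never goes past the end of the operand. The name
addmul_1_lookahead is made up for this sketch. A compiler may turn the clamp
into a CMOV, but that is not guaranteed. As far as I know, CMOV with a memory
source still performs the load, so the select is done on the index rather than
on the load itself.

```c
#include <stddef.h>
#include <stdint.h>

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb.
   Hypothetical variant that fetches up[i+1] during iteration i. */
static uint64_t addmul_1_lookahead(uint64_t *rp, const uint64_t *up,
                                   size_t n, uint64_t v)
{
    if (n == 0)
        return 0;

    uint64_t carry = 0;
    uint64_t cur = up[0]; /* limb for the first iteration */

    for (size_t i = 0; i < n; i++) {
        /* Clamp the look-ahead index: on the last iteration this
           re-reads up[n-1] instead of reading past the end. */
        size_t next_i = (i + 1 < n) ? i + 1 : i;
        uint64_t next = up[next_i];

        unsigned __int128 t = (unsigned __int128)cur * v + rp[i] + carry;
        rp[i] = (uint64_t)t;
        carry = (uint64_t)(t >> 64);

        cur = next; /* the redundant last load is simply discarded */
    }
    return carry;
}
```

Because the last iteration is handled by the same loop body, no separate
wind-down copy of the code is needed. The price is one redundant load and the
extra select per iteration.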

So, any thoughts on how much code size would be too much? Or does anyone have
real-world benchmarks?
