I've tweaked some more speed out of the mul_basecase function, and I now get a benchmark of 9000 (gmp-4.2.4 is 6000). The current code is 1361 bytes long, which includes 115 bytes for sizes <= 2. Further improvements can be made, but they would increase the code size.

One option is to separate out the <= 5 limb cases. This would take an extra 600 bytes or so and save a branch in the outer loop, which should save a few cycles (say 2%?) in the real world.

Another option is to pipeline the internal addmul_1's together. This would probably double the code size, but if I can get it to work we might get up to 10% more speed.

While typing this I just had a thought: perhaps using CMOV would allow me to safely load the next limb ahead of the current iteration. I'll have to see whether it helps. If this can be made to work, then I won't need another "copy" for the last part of the pipeline.
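For anyone not familiar with the structure being discussed, here is a rough C sketch of a basecase multiply: one mul_1 pass followed by a series of addmul_1 passes, one per limb of the second operand. This is not the assembly code in question, only an illustration of the loop structure. The 32-bit limb_t type and the ref_ names are assumptions made for the example (they let a plain uint64_t hold a full product); the real code works on 64-bit limbs.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: 32-bit limbs, so uint64_t holds a full product. */
typedef uint32_t limb_t;

/* rp[0..n-1] = up[0..n-1] * v; returns the carry-out limb. */
static limb_t ref_mul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)up[i] * v + carry;
        rp[i] = (limb_t)t;
        carry = t >> 32;
    }
    return (limb_t)carry;
}

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Worst case is (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so t cannot overflow. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = t >> 32;
    }
    return (limb_t)carry;
}

/* rp[0..un+vn-1] = up[0..un-1] * vp[0..vn-1].
   Requires un >= 1, vn >= 1, and rp not overlapping either input. */
static void ref_mul_basecase(limb_t *rp, const limb_t *up, size_t un,
                             const limb_t *vp, size_t vn)
{
    /* First row initialises the low un+1 limbs of the result. */
    rp[un] = ref_mul_1(rp, up, un, vp[0]);

    /* Outer loop: each remaining row is accumulated one limb higher. */
    for (size_t j = 1; j < vn; j++)
        rp[un + j] = ref_addmul_1(rp + j, up, un, vp[j]);
}
```

In this sketch each addmul_1 pass runs to completion before the next one starts. Pipelining the passes together would mean overlapping the end of one row with the start of the next, which is where the extra code size comes from.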
So, any thoughts on how much code size would be too much? Or does anyone have real-world benchmarks?