For the last few months, I have been working on writing and rewriting basecase code for X64-64 processors. The result is now in the mainline GMP repo.
The basecase code I have focused on is: mul_basecase, sqr_basecase, mullo_basecase, and Hensel remainder via redc_1. At the start of this project, we had good code only for AMD K8-K10 and only for mul_basecase, sqr_basecase and mullo_basecase. Now there is optimised code for many CPUs for these functions, see the below diagram. (X means new code, o means old code.) mul sqr mullo redc_1 AMD K8 o o o X AMD K10 o o o X AMD Bobcat X X X AMD Bulldozer X X Intel Conroe/Wolfdale X X X X Intel Nehalem/Westmere X X X X Intel Sandybridge/Ivybridge X X X X Intel Haswell X X X X Intel Atom - - - X There are diagrams for the performance of the new code at <http://gmplib.org/devel/>. These diagrams don't include old performance, but instead should be used to look for further optimisation opportunities by musing over non-smoothness or unexpected relative performance. Only Nehalem (first generation Core i7) performs precisely like I want, since only Nehalem and Conroe (Core2) have the expanded mul_basecase and sqr_basecase which will be the new standard. The code is "expanded" in the sense that it at entry chooses one of typically 4 outer loops, where each outer loop contains a tailored inner loop. This structure implies an ugly code bloat, but it saves enough execution time in these critical functions to be worth it. If you want to give the new code a spin, you may want to grab e.g. <ftp://ftp.gmplib.org/pub/snapshot/gmp-5.1.90-20130926.tar.lz>. What speedup should one expect? It varies with CPU, operand size, and operation. Haswell (latest Intel CPU) and Ivy bridge (penultimate Intel CPU) have gotten greatest improvement. Of the operations, redc_1 has seen the greatest improvement thanks to pipelined quotient computation. The speedup will vary from 50% for modexp on Haswell, to only 6% for multiplication on Conroe. I think typical speedup will be around 10%. For the time being, I have eliminated special code for tiny limb counts like 1, 2, 3, except when the mainline code couldn't handle those sizes naturally. Even then, the code is pretty naive. This will surely imply significant slowdown in some cases; I will insert such tiny-operands code on a case-by-case basis. Tuning the generic code has higher priority. -- Torbjörn _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel