Richard Henderson <r...@twiddle.net> writes:

  > On my system, umaal has a latency of 3, whatever dependencies I create.
  > (There are 4 input regs and 2 output, so there are quite a few
  > possible dependency combinations; I only tried a subset.)
  >
  > Either the docs are plain wrong, or there are several variants of A9.

  Dunno.  It's at this point that I'd try asking one of the arm guys from
  the gcc list and see if they can get an answer from somewhere inside
  the company.

I suppose I need to do that, to get the GMP ARM code into really good
shape.  The ARM landscape is completely new to me, and it seems at least
as vast as the x86 landscape.
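For anyone not familiar with umaal, here is a rough C model of what the instruction computes, plus sketches of addmul_1 and addmul_2 loops built from it with no separate add. The names umaal_model, addmul_1_model and addmul_2_model are made up for illustration, 32-bit limbs are assumed, and this is not the assembly code in the GMP repository, only a sketch of the same idea.

```c
#include <stdint.h>
#include <stddef.h>

/* Model of ARM's UMAAL: {hi,lo} = a * b + lo + hi.
   The sum always fits in 64 bits, because
   (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
static void umaal_model(uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b)
{
    uint64_t t = (uint64_t)a * b + *lo + *hi;
    *lo = (uint32_t)t;
    *hi = (uint32_t)(t >> 32);
}

/* Hypothetical addmul_1: rp[0..n-1] += up[0..n-1] * v.
   Returns the carry-out limb.  One umaal per limb product. */
static uint32_t addmul_1_model(uint32_t *rp, const uint32_t *up, size_t n,
                               uint32_t v)
{
    uint32_t cy = 0;
    for (size_t i = 0; i < n; i++)
        umaal_model(&rp[i], &cy, up[i], v);
    return cy;
}

/* Hypothetical addmul_2: {rp,n} += {up,n} * (v1*2^32 + v0), n >= 1.
   Writes rp[0..n] (rp[n] is output only) and returns the top limb.
   Each result limb goes through a pair of umaals, one per carry chain. */
static uint32_t addmul_2_model(uint32_t *rp, const uint32_t *up, size_t n,
                               uint32_t v0, uint32_t v1)
{
    uint32_t c0 = 0, c1 = 0;
    uint32_t prev = 0;               /* up[i-1], zero before the first limb */
    for (size_t i = 0; i < n; i++) {
        uint32_t r = rp[i];
        umaal_model(&r, &c0, up[i], v0);   /* v0 chain */
        umaal_model(&r, &c1, prev, v1);    /* v1 chain, one limb behind */
        rp[i] = r;
        prev = up[i];
    }
    /* Wind down: limb n receives c0 plus the last v1 product. */
    uint32_t r = c0;
    umaal_model(&r, &c1, prev, v1);
    rp[n] = r;
    return c1;
}
```

In this model each carry (c0, c1) is loop-carried through one umaal per iteration, and an iteration covers two limb products. If the real loop has the same dependency shape, which is an assumption here, a 5-cycle umaal latency would bound it at 2.5 c/l.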
I pushed an addmul_2 running at 2.38 cycles per limb product.  It is
trivial to make it run at 2.25 c/l, at the expense of using more
callee-saves registers.  Again, the inner loop uses no explicit add
instructions, just umaal pairs.

Numbers speak louder than words: this code disproves ARM's cycle numbers,
as it would not run at this speed with a latency of 5 cycles.

I expect to reach 2 c/l for addmul_3 or addmul_4.  Unlike on other fast
processors, fast code for the A9 does not need to be huge, so actually
implementing addmul_3/4 seems quite reasonable.

The next thing to do is to write mul_basecase, sqr_basecase, and
mullo_basecase.  These just reuse the addmul_2 and addmul_1 loops, but
doing so saves a lot of overhead (i.e., it decreases the constant of the
O(n) term).

-- 
Torbjörn
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel