I took a brief look at the definition of these instructions. It is clear that they did not consult an expert in the area. They also added DES instructions now (in 2012).
They added a few useful instructions, addxc/addxcc and umulxhi. The former is a 64-bit addition with useful carry in and out (previous 64-bit add instructions read the carry bit which was set from the middle bit by a previous add...). They forgot to add corresponding subxc/subxcc instructions, creating a very unfortunate timing asymmetry.

The crossover point at which mpmul becomes faster than a flat-out mulx/umulxhi loop is beyond 2x2 limbs, so I don't see any value in looking into that just yet.

Did you add umulxhi use in your patch from a few days ago? I've seen that instruction in the Solaris assembler for > 10 years, but I have noticed that no chip supports it. Has that now changed? Is that instruction well-implemented on any chip? What throughput does a series of umulxhi get, if they are independent?

Unless the pipeline is poorly designed, one can usually make a mul_basecase (or even say an addmul_2) which runs at about the mulx/umulxhi bandwidth. But that requires careful software pipelining.

For example, how well does something like this run?

        .text
        .register %g2, #scratch
        .register %g3, #scratch
        .globl  main
        .type   main, #function
        .align  32
main:
        save
        sethi   %hi(1500000000), %i0    ! loop count, about one per Hz
        nop
        nop
        nop
        nop
        nop
        nop
1:      umulxhi %g5, %g5, %g1           ! eight independent multiplies
        mulx    %g5, %g5, %g2
        umulxhi %g5, %g5, %g3
        mulx    %g5, %g5, %g4
        umulxhi %g5, %g5, %i1
        mulx    %g5, %g5, %i2
        umulxhi %g5, %g5, %i3
        mulx    %g5, %g5, %i4
        brnz,a  %i0, 1b
        add     %i0, -1, %i0            ! delay slot: decrement the counter
        ret
        restore

Please replace "1500000000" in the file with the actual CPU frequency in Hz. Ideally, this would then run in 4 seconds, but 8 seconds is not bad either (with eight multiplies per iteration, that is two multiplies per cycle and one multiply per cycle, respectively). If it is slower, it will not compete with x86-64 CPUs.

But what counts when considering "mpmul" is relative performance, of course. If mpmul needs many cycles for accumulating a 128-bit product, then a mulx/umulxhi pair will surely also be slow, since odds are high that they use the same multiply hardware.
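To make the discussion above concrete, here is a rough, portable C model of what a mulx/umulxhi pair computes and of how an addmul_1-style inner loop would consume it. This is only a sketch: it assumes a compiler that provides `unsigned __int128` (GCC and Clang do), and the names `mul_lo`, `mul_hi` and `addmul_1_model` are made up for illustration; they are not GMP's actual code. The two carry additions in the loop are the kind of step that addxc/addxcc would handle in hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Low 64 bits of the 64x64-bit product: the half that mulx returns. */
static uint64_t mul_lo(uint64_t a, uint64_t b)
{
    return a * b;
}

/* High 64 bits of the unsigned 128-bit product: the half that umulxhi
   returns.  Assumes the compiler supports unsigned __int128. */
static uint64_t mul_hi(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* Hypothetical model of an addmul_1 loop:
   rp[0..n-1] += up[0..n-1] * v, returning the carry-out limb. */
static uint64_t addmul_1_model(uint64_t *rp, const uint64_t *up,
                               size_t n, uint64_t v)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t lo = mul_lo(up[i], v);
        uint64_t hi = mul_hi(up[i], v);

        lo += carry;          /* add the carry limb from the previous step */
        hi += (lo < carry);   /* propagate the carry out of that addition */

        uint64_t r = rp[i] + lo;
        hi += (r < lo);       /* carry out of the accumulate */

        rp[i] = r;
        carry = hi;           /* cannot overflow: the sum fits in 128 bits */
    }
    return carry;
}
```

Each iteration needs one low and one high product, which is why the mulx/umulxhi issue rate bounds the speed of such a loop.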
There's a lot of setup and teardown associated with using mpmul because it uses several register windows and some of the floating point registers to hold the entire set of inputs, and to provide the result.

A really silly design, since it forces a non-balanced instruction mix. Few pipelines mind some loads and stores between arithmetic, but with these instructions one will delay the start of arithmetic for tens of cycles while performing lots of loads. (For modexp, I assume one can stay in registers, making this overhead small when using a large exponent, such as for RSA signing/decryption.)

That's why realistically I'll probably only use mpmul for 3x3 and larger.

I wouldn't be surprised if mpmul would never beat truly well-optimised discrete code.

--
Torbjörn