> I think mpn/alpha/addmul_1.asm might serve as a better starting point
> than the mips64 lo/hi code. That code is simple enough, yet OK for
> pipelined in-order and out-of-order cores.

I will take a look at that.

On second thought, the top-level alpha code is overscheduled, at least
for the devices it would be used for.
The instructions should be directly 1:1 translatable to MIPS code, though.

The best loop strategy is usually to put the multiplier operand load at
the top of the loop, and then schedule the low multiply at a distance
which corresponds to L1d latency. The high multiply can then be
scheduled for multiplier hardware throughput. Then do accumulation,
scheduled after multiplier latency.

For implementations with pipelined multiply, performance might become
limited by the recurrent carry latency. To handle that problem, add
incoming carry as late as possible, and then compute outgoing carry
with as few instructions as possible.

I just use user-level QEMU.

I see. By default I would assume it to reject r6 instruction execution.

--
Torbjörn
Please encrypt, key id 0xC8601622
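
For reference, the loop strategy described above could look roughly like
the following in C. This is only a sketch of the dependency structure,
not the alpha or MIPS code from the GMP tree. The names limb_t and
addmul_1_sketch are made up for illustration, and 32-bit limbs are
assumed so that the low and high product halves can be taken from a
portable 64-bit multiply; a real MIPS64 loop would use 64-bit limbs and
separate low/high multiply instructions. Actual instruction scheduling
distances (L1d latency, multiplier throughput) cannot be expressed in C
and are only indicated in comments.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limb type for this sketch; 32 bits assumed for portability. */
typedef uint32_t limb_t;

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
static limb_t addmul_1_sketch(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t u = up[i];               /* multiplier operand load at the top */
        uint64_t p = (uint64_t)u * v;
        limb_t lo = (limb_t)p;          /* "low multiply" */
        limb_t hi = (limb_t)(p >> 32);  /* "high multiply" */

        /* Accumulation that does not depend on the incoming carry. */
        limb_t s = rp[i] + lo;
        hi += (s < lo);                 /* cannot overflow: hi <= 2^32 - 2 */

        /* Recurrent part: incoming carry is added as late as possible. */
        limb_t t = s + cy;
        cy = hi + (t < cy);             /* outgoing carry: one compare, one add */
        rp[i] = t;
    }
    return cy;
}
```

Only the last three operations in the loop body depend on cy from the
previous iteration, which is what keeps the recurrent carry path short.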