  > I think mpn/alpha/addmul_1.asm might serve as a better starting point
  > than the mips64 lo/hi code.  That code is simple enough, yet OK for
  > pipelined in-order and out-of-order cores.
  
  I will take a look at that.
  
On second thought, the top-level alpha code is overscheduled, at least
for the devices it would be used for.

The instructions should be directly 1:1 translatable to MIPS code,
though.

The best loop strategy is usually to put the multiplier operand load
at the top of the loop, and then schedule the low multiply at a distance
that corresponds to the L1d latency.  The high multiply can then be
scheduled for multiplier hardware throughput.  Finally, schedule the
accumulation after the multiplier latency.
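
As a rough illustration of that ordering, here is a minimal C sketch.
It is not taken from the GMP sources: the name addmul_1_sketch is
hypothetical, and 32-bit limbs stand in for the 64-bit limbs of real
MIPS64 code so that the high product half can be formed in portable C.
C cannot express instruction scheduling, so the statement order only
mirrors the intended sequence of load, low multiply, high multiply,
and accumulation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: rp[0..n-1] += up[0..n-1] * v, returning the
   carry-out limb.  32-bit limbs are an assumption made for portability. */
uint32_t addmul_1_sketch(uint32_t *rp, const uint32_t *up, size_t n, uint32_t v)
{
    uint32_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t u = up[i];                 /* 1. operand load at loop top */
        uint64_t p = (uint64_t)u * v;
        uint32_t lo = (uint32_t)p;          /* 2. low multiply */
        uint32_t hi = (uint32_t)(p >> 32);  /* 3. high multiply */
        uint32_t r = rp[i];
        lo += cy;                           /* 4. accumulation */
        hi += lo < cy;                      /*    carry from adding cy */
        r += lo;
        hi += r < lo;                       /*    carry from adding to rp[i] */
        rp[i] = r;
        cy = hi;                            /* cannot overflow: total < 2^64 */
    }
    return cy;
}
```

In real assembly the four steps of one limb would be spread over
several loop iterations (software pipelining) to cover the latencies
mentioned above; the sketch keeps them together for readability.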

For implementations with a pipelined multiplier, performance might become
limited by the recurrent carry latency.  To handle that problem, add the
incoming carry as late as possible, and then compute the outgoing carry with
as few instructions as possible.
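
One way to picture the carry handling is the C sketch below.  It is
only an illustration under assumptions: the name addmul_1_late_carry
is hypothetical, and 32-bit limbs stand in for 64-bit MIPS64 limbs.
Everything that does not depend on the incoming carry is computed
first, so the loop-carried chain is just an add, a compare, and an
add.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: rp[0..n-1] += up[0..n-1] * v, returning the
   carry-out limb, with the incoming carry added as late as possible. */
uint32_t addmul_1_late_carry(uint32_t *rp, const uint32_t *up, size_t n, uint32_t v)
{
    uint32_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t p = (uint64_t)up[i] * v;
        uint32_t lo = (uint32_t)p;          /* low product half */
        uint32_t hi = (uint32_t)(p >> 32);  /* high product half */

        /* Independent of cy, so it stays off the recurrent path. */
        uint32_t s = rp[i] + lo;
        hi += s < lo;                       /* hi <= 2^32 - 1 here */

        /* Recurrent path: add, compare, add. */
        uint32_t r = s + cy;
        cy = hi + (r < cy);                 /* cannot overflow: total < 2^64 */
        rp[i] = r;
    }
    return cy;
}
```

On MIPS the recurrent path would map to roughly an addu/daddu, an
sltu, and another addu/daddu per limb; whether that is the shortest
achievable chain on a given core is not something the sketch proves.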

  I just use user-level QEMU.
  
I see.  By default I would assume it to reject r6 instruction execution.

-- 
Torbjörn
Please encrypt, key id 0xC8601622