Re: [PATCH] Optimize 32-bit sparc T1 multiply routines.

Torbjorn Granlund Sat, 05 Jan 2013 16:15:46 -0800

David Miller <da...@davemloft.net> writes:

  From: Torbjorn Granlund <t...@gmplib.org>
  Date: Fri, 04 Jan 2013 14:54:15 +0100

  > Did you add umulxhi use in your patch from a few days ago?

  Yes I did use mulx/umulxhi (both T3 and T4 have umulxhi) and yes the
  multiplies do pipeline on T4 (they don't on T3), and it gets about 4
  cycles per limb in a two-way unrolled loop in mul_1.  addmul_1 gets
  about 6.5 cycles per limb.

Could you please try my mul-only loop to determine the throughput?


It is a 2-issue pipeline, right?  So the two extra instructions for
addmul_1 compared to mul_1, if both are deeply unrolled, should cost
only about 1 + epsilon extra cycles per limb.
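
For reference, here is a minimal C sketch of what the two routines compute, to make the extra per-limb work in addmul_1 visible.  It is not the GMP code or the sparc assembly under discussion: the names ref_mul_1 and ref_addmul_1 are hypothetical, limbs are assumed to be 64 bits, and the double-width product relies on the `unsigned __int128` extension of GCC/Clang (its low and high halves roughly play the roles of mulx and umulxhi).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* assumption: 64-bit limbs */

/* rp[0..n-1] = up[0..n-1] * v; returns the carry-out limb. */
limb_t ref_mul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t cy = 0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)up[i] * v;
        limb_t lo = (limb_t)p;          /* low half, roughly mulx */
        limb_t hi = (limb_t)(p >> 64);  /* high half, roughly umulxhi */
        lo += cy;
        hi += lo < cy;                  /* carry out of lo + cy */
        rp[i] = lo;
        cy = hi;
    }
    return cy;
}

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t cy = 0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)up[i] * v;
        limb_t lo = (limb_t)p;
        limb_t hi = (limb_t)(p >> 64);
        limb_t r = rp[i];               /* extra work: load the old limb */
        lo += cy;
        hi += lo < cy;
        lo += r;                        /* extra work: add it in ... */
        hi += lo < r;                   /* ... and propagate its carry */
        rp[i] = lo;
        cy = hi;
    }
    return cy;
}
```

The only per-limb difference is the load of rp[i] and the add with carry; how many instructions that becomes on T4 depends on the actual assembly, which this sketch does not try to model.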

With two-way unrolling, we will get 3 extra instructions per way (5 or 6
in total per loop).  This still does not explain the slowdown from 4 c/l
to 6.5 c/l.
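
The arithmetic behind that estimate can be sketched as below.  It assumes the extra instructions are limited only by issue width (2 per cycle, which is itself still a question above) and ignores latencies and dependency stalls; the helper name extra_cycles_per_limb is made up for illustration.

```c
#include <assert.h>

/* Extra cycles per limb if the added instructions cost nothing beyond
   their issue slots.  Assumption: a perfectly filled pipeline of the
   given issue width. */
double extra_cycles_per_limb(int extra_insns_per_loop,
                             int limbs_per_loop,
                             int issue_width)
{
    assert(limbs_per_loop > 0 && issue_width > 0);
    return (double)extra_insns_per_loop / (limbs_per_loop * issue_width);
}
```

With 2 extra instructions per limb this gives 1 cycle; with 5 or 6 extra instructions per two-limb loop it gives 1.25 to 1.5 cycles.  Added to the 4 c/l of mul_1 that predicts at most about 5.5 c/l, still a full cycle short of the observed 6.5 c/l.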

-- 
Torbjörn
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel
