Re: [PATCH] Optimize 32-bit sparc T1 multiply routines.

Torbjorn Granlund Sun, 06 Jan 2013 04:08:29 -0800

David Miller <da...@davemloft.net> writes:

  Thanks for your help, the following works.  I'll work on unrolling
  and scheduling it.
  
  PROLOGUE(mpn_sub_nc)
        ba,pt   %xcc, L(ent)
         xor    cy, 1, cy
  EPILOGUE()
  PROLOGUE(mpn_sub_n)
        mov     1, cy
  L(ent):       cmp     %g0, cy
  L(top):       ldx     [up+0], %o4
        add     up, 8, up
        ldx     [vp+0], %o5
        add     vp, 8, vp
        add     rp, 8, rp
        add     n, -1, n
        xnor    %o5, %g0, %o5
        addxccc %o4, %o5, %g3
        brgz    n, L(top)
         stx    %g3, [rp-8]
  
        clr     %o0
        retl
         movcc  %xcc, 1, %o0
  EPILOGUE()
  
Since we are working with a throughput-constrained pipeline, we should
really use as few insns as possible.

There are 6 operation insns, and it seems hard to use fewer than 5
bookkeeping insns.  With k-way unrolling we should then get to
max(3,(6k+5)/(2k)) cycles/limb.

For small k, we could instead point each pointer at the end of its
operand, then use a combined index and loop counter running from -n up
to 0.  This would give max(3,(7k+1)/(2k)) cycles/limb.
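
The combined index and loop counter idea can be sketched in Python as a
reference model.  This is only an illustration, not GMP code: the names
sub_n_model, LIMB_BITS and MASK are hypothetical, and 64-bit limbs are
assumed because the quoted loop uses ldx/stx.  It follows the same
complement-and-add-with-carry scheme as the quoted loop (xnor, then
addxccc) and lets one variable running from -n up to 0 serve as both the
operand index and the loop termination test.

```python
LIMB_BITS = 64                    # assumed limb size (the quoted loop uses ldx/stx)
MASK = (1 << LIMB_BITS) - 1


def sub_n_model(up, vp):
    """Return (rp, borrow) for up - vp, limbs stored least significant first."""
    n = len(up)
    assert len(vp) == n and n > 0
    rp = [0] * n
    carry = 1                     # a - b == a + ~b + 1, so start with carry set
    i = -n                        # one variable: operand index and loop counter
    while i < 0:
        # A negative i addresses relative to the end of the operand, like
        # pointers that were advanced past the last limb before the loop.
        s = up[i] + (vp[i] ^ MASK) + carry
        rp[i] = s & MASK
        carry = s >> LIMB_BITS
        i += 1                    # the only bookkeeping update per limb
    return rp, carry ^ 1          # borrow is the inverted final carry
```

In assembly the final increment of the index would also feed the loop
branch, which is where the saved bookkeeping insns come from.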

(The max(3...) handles the load/store bandwidth limit.  It has no
limiting effect for sub_n, but it does for add_n.)
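
The two estimates are easy to tabulate with a few lines of Python.  This
is a minimal sketch of the formulas above, nothing more: the function
names method1 and method2 are made up for illustration, and the reading
of (7k+1) as "one index update per limb plus one branch" is my
interpretation of method 2.  The default ops=6 is the sub_n count given
above; ops=5 for add_n is inferred from the add_n figures, not stated
explicitly.

```python
def method1(k, ops=6):
    """max(3, (ops*k + 5) / (2k)): five bookkeeping insns per loop iteration."""
    return max(3.0, (ops * k + 5) / (2 * k))


def method2(k, ops=6):
    """max(3, ((ops+1)*k + 1) / (2k)): read here as one index update per
    limb plus a single loop branch (an interpretation, see the lead-in)."""
    return max(3.0, ((ops + 1) * k + 1) / (2 * k))


if __name__ == "__main__":
    # Print unrounded cycles/limb for sub_n (ops=6) and add_n (ops=5).
    for name, ops in (("sub_n", 6), ("add_n", 5)):
        print(name)
        for k in range(1, 9):
            print(f"  {k}  {method1(k, ops):8.3f}  {method2(k, ops):8.3f}")
```

The max(3, ...) floor only becomes active for ops=5, which matches the
remark that the load/store limit matters for add_n but not for sub_n.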

sub_n (cycles/limb, k = unrolling factor):
  k      method 1   method 2
  1        5.5        4.0
  2        4.2        3.8
  3        3.8        3.7
  4        3.6        3.6
  5        3.5        3.6
  6        3.4        3.6
  7        3.4        3.6
  8        3.3        3.6
 oo        3.0        3.5

add_n (cycles/limb, k = unrolling factor):
  k      method 1   method 2
  1        5.0        3.5
  2        3.8        3.2
  3        3.3        3.2
  4        3.1        3.1
  5        3.0        3.1
  6        3.0        3.1
  7        3.0        3.1
  8        3.0        3.1
 oo        3.0        3.0

For add_n, I recommend either method 1 with 4-way unrolling, or method 2
with 2-way unrolling.

For sub_n we should use at least 4-way unrolling.

-- 
Torbjörn