Re: Sandybridge addmul_N challenge

Torbjorn Granlund Thu, 23 Feb 2012 02:13:36 -0800

ni...@lysator.liu.se (Niels Möller) writes:

  Here's a sketch of an adddmul_2 iteration using Karatsuba. I assume we
  have vl, vh, vd = |vl - vh| and an appropriate sign vmask in registers
  before the loop. Carry input in c0, c1, carry out in r2, r3.
  
          mov   (up), %rax
          mov   %rax, ul
          mul   vl              C low product
          mov   8(up), uh
          mov   %rax, r0
          mov   %rdx, r1
          lea   (uh, ul), %rax
          sub   uh, ul
          cmovnc        ul, %rax
          sbb   r3, r3
          mul   vd              C Middle product
          mov   r1, r2          C Add shifted low product
          add   r0, r1
          adc   $0, r2
          add   (rp), r0        C Add rp limbs
          adc   8(rp), r1
          adc   $0, r2          
          mov   %rax, p0
          mov   %rdx, p1
          mov   uh, %rax
          mul   vh              C High product
          xor   vmask, r3
          xor   r3, p0          C Conditionally negate, and add, middle product
          xor   r3, p1
          bt    $0, r3
          adc   p0, r1
          adc   p1, r2
          adc   $0, r3
          add   %rax, r1        C Add shifted high product
          adc   %rdx, r2
          adc   $0, r2
          add   c0, r0          C Add input carry limbs
          adc   c1, r1
          mov   c0, (rp)
          mov   c1, 8(rp)
          adc   %rax, r2        C Add high product
          adc   %rdx, r3
  
  37 instructions, or 12.25 instructions per limb, excluding looping logic
  (and it has to be unrolled twice, to use separate registers for input
  and output carries).
  
How did you arrive to 12.25 insns/limb?  I have not tried to understand
the code, but doesn't it perform a 2x2 limb multiply with accumulation?
That's 9.25 insn/limb product.


I very much doubt this will win for Sandybridge, unless you can decrease
the insn count with several instructions.  Unfortunately it has no
chance on Bull-dozer, since the latter has a 2 issue pipeline; you need
to beat 32 insns per 2x2 accumulation block there.

  I think the instruction count can be reduced a bit, at the cost of
  higher pressure on the out-of-order execution.
  
Perhaps some of the adc $0 could be eliminated with 2x unrolling?
  
  What do you think? If one can get one iteration to run at 12 cycles,
  that's 3 c/l and an improvement over addmul_1. If one can get it down to
  11 or 11.5, one beats 3 c/l.
  
If that is possible, it might not be enough...  See below.

  For a "vertical" mul_basecase, the quadratic work would be an iteration
  of
  
        mov     (up), %rax
          mul   (vp)
          add   %rax, r0
          adc   %rax, r1
          adc   $0, r3
  
  So there's potential for that to run at 2 cycles per limb product. But
  then there's also a significant linear cost for accumulation and carrry
  propagation, and possible bad branch-prediction due to loops of varying
  lenghts.
  
Exactly.  (But the branch misprediction problem would not happen for for
David's mulmid_basecase, I suppose.)

Some ways to deal with the branch misprediction problem:

* Have straight line code for the corners, up to a limit.  This gets rid
  of the really high relative branch misprediction for these small
  areas.

* Handle two (or more) columns in parallel, and separately for the
  low-significant right triangle, any middle rectangular part, and the
  left triangle.  This doubles (or more) the amount of useful work per
  branch misprediction.

* I suppose that making full use of out-of-order execution just before a
  mispredicted branch would make sense.

I played a bit with mul_2 yesternight.  I am not 100% the code is
correct, but I think it is.  The loopmixer found a 2.5 c/l version of
it.

I started with genxmul.c (from the loopmixer repo) using these args:
"-n2 -w4 --mul".  I then analysed the critical path and determined that
it is about 24.  The problem is adc feeding other adc feeding other adc
though a register (remember that pure carry deps are fast on
Sandybridge).  So I mindlessly introduced 4 new registers, then using
alternating accumulation registers.  I needed 8 extra insn (in total,
corresponding to 2 per way) to pairwise sum accumulation registers.

I am sure this was not done optimally, but (assuming my code is sound)
it proves that there is a lot of performance headroom, as expected.

I conjecture that we could create an addmul_N for Sandybridge that runs
at <= 2.5 c/l.  I think this will be possible already for N=2.  Perhaps
we could arrive to 2.25 for N=4, matching the K8-K10 performance.

-- 
Torbjörn
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel

Re: Sandybridge addmul_N challenge

Reply via email to