Re: Sandybridge addmul_N challenge

2012-02-23 Thread Torbjorn Granlund
Torbjorn Granlund  writes:

  If we slightly restrict the operand range, we could remove the indicated
  carry propagation insn.
  
Wrong.  Those carry propagation insns are needed.

-- 
Torbjörn
___
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel


Re: Sandybridge addmul_N challenge

2012-02-23 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  For the recurrency, the inputs are c0, c1, and the outputs are r1, r2.
  Let's write the interesting instructions out and unroll twice (using
  different registers),
  
  add c0, r0C  0  6
  adc c1, r1C  1  7
  adc $0, r2C  3  9
  
  add r1, c2C  3  9
  adc r2, c0C  4  10
  adc $0, c1C  6  12
  
  So the recurrency, for one iteration, seems to be just 3 cycles. But the
  loop mixer doesn't find anything faster than 6.36 cycles for one
  iteration, or 3.18 per limb product. Which isn't too bad (a slight
  improvement over 3.24, which I think is the best reported earlier), but
  stubbornly above 3 c/l.
  
I am playing with this block:

C carry-in lo in r14
C carry-in hi in rcx
mov 0(up), %rax
mul v1
mov 8(rp), %r8
add %rax, %r8
mov %rdx, %r9
adc $0, %r9
mov 8(up), %rax
mul v0
add %rax, %r8
adc %rdx, %r9
mov $0, R32(%rbx)
adc $0, R32(%rbx)
add %r14, %r8   C 0
adc %rcx, %r9   C 1
adc $0, R32(%rbx)   C might be removed
mov %r8, 8(rp)
C carry-out lo in r9
C carry-out hi in rbx

This is not identical to your block, I think.  It runs at exactly 3 c/l.
The recurrency path is extremely shallow, at 1.5 c/l.

If we slightly restrict the operand range, we could remove the indicated
carry propagation insn.  Then the code runs at 2.8 c/l.

Neither version is decoding-bandwidth limited.

It is further possible to supplant the 'mov $0,reg' and following 'adc
$0,reg' with 'setc reg'.  This creates a false dependency (on the upper
56 bits) and seems to run at about the same speed.

The plain code (i.e., the code which runs at 3.0 c/l) runs at 3-epsilon
if the lea pointer update insns are removed.  This is a good sign,
proving there is no magic stopping us at 3 c/l...

  > My experience of Sandybridge is that with load/store coding style, the
  > CPU typically executes 3 insn/cycle unless there is a recurrency
  > dependency stopping that.
  
  If we could get there, the above loop should run just below 3 c/l.
  
I was obviously wrong.  :-(

  Hmm. Last time I looked at that was in a 32-bit context. There's a
  32x32->64 instruction which might be useful for a 32-bit build, at least
  in theory, but as far as I can find in the manual, the latest
  ss*-extensions don't provide any wider multiplication than that.

I believe that insn is used for 32-bit builds, where it helps.  Many
improvements could be made for 32-bit builds, if one cared.

(I see a new mail has arrived, will read now.)

-- 
Torbjörn


Re: Sandybridge addmul_N challenge

2012-02-23 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> So the recurrency, for one iteration, seems to be just 3 cycles. But the
> loop mixer doesn't find anything faster than 6.36 cycles for one
> iteration, or 3.18 per limb product. Which isn't too bad (a slight
> improvement over 3.24, which I think is the best reported earlier), but
> stubbornly above 3 c/l.

One update. I have now tried unrolling four times. Then I've seen one
sequence running at 6.16 cycles per iteration, or 3.08 c/l.

See shell:~nisse/hack/loopmix/lms/addmul_2-nisse-2.nlms.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


Re: Sandybridge addmul_N challenge

2012-02-23 Thread Niels Möller
Torbjorn Granlund  writes:

> In loopmixer or manually?  I wouldn't draw any conclusions without
> mixing the code first...

With the loop mixer.

> Meaning evaluating in +1 instead of -1, I assume.

Exactly.

> Did you compute the recurrency chain?  Annotating the instructions on
> the recurrency chain helps understanding the problem.

I tried. I use this iteration:

mov (up, n, 8), %rax
mov %rax, u
mul v0
mov %rax, r0
mov %rdx, r1
mov u, %rax
mul v1
mov (rp, n, 8), t 
add t, r0  
add %rax, r1
mov %rdx, r2
adc $0, r2
add c0, r0
mov r0, (rp, n, 8)
adc c1, r1
adc $0, r2

For the recurrency, the inputs are c0, c1, and the outputs are r1, r2.
Let's write the interesting instructions out and unroll twice (using
different registers),

add c0, r0  C  0  6
adc c1, r1  C  1  7
adc $0, r2  C  3  9

add r1, c2  C  3  9
adc r2, c0  C  4  10
adc $0, c1  C  6  12

So the recurrency, for one iteration, seems to be just 3 cycles. But the
loop mixer doesn't find anything faster than 6.36 cycles for one
iteration, or 3.18 per limb product. Which isn't too bad (a slight
improvement over 3.24, which I think is the best reported earlier), but
stubbornly above 3 c/l.

> My experience of Sandybridge is that with load/store coding style, the
> CPU typically executes 3 insn/cycle unless there is a recurrency
> dependency stopping that.

If we could get there, the above loop should run just below 3 c/l.

> I don't think there are.  These instructions are mostly FP plus narrow
> integer ops.  

Hmm. Last time I looked at that was in a 32-bit context. There's a
32x32->64 instruction which might be useful for a 32-bit build, at least
in theory, but as far as I can find in the manual, the latest
ss*-extensions don't provide any wider multiplication than that.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


Re: Sandybridge addmul_N challenge

2012-02-23 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  One can decrease it a bit by adding c0, c1 earlier (do you think
  recurrency can be a problem if we add c0, c1 to the first product?) and
  doing an in-place add to (rp) and 8(rp) at the end.
  
  I could get it down to 30 instructions with a deep carry recurrency, or
  34 with a short one. I can get neither variant to run faster than 4 c/l.
  
In loopmixer or manually?  I wouldn't draw any conclusions without
mixing the code first...

  I also had a quick look at doing Karatsuba based on (u0+u1)*(v0+v1).

Meaning evaluating in +1 instead of -1, I assume.

  It's about the same number of instructions, but the updates from
  carries are independent of all products, so there's more freedom
  in where to move them around.

  I think this idea may be more useful for other processors, without the
  awkward hardwired mul registers.

True.
  
  > I played a bit with mul_2 yesternight.  I am not 100% sure the code is
  > correct, but I think it is.  The loopmixer found a 2.5 c/l version of
  > it.
  
  Nice. I've now wasted quite some time... It seems really difficult.

It is challenging, but I am getting convinced we can really speed
things up a lot.
  
  Now I also tried a very basic variant of addmul_2, doing only one u limb
  per iteration and multiplying it by the two v limbs. Even though I have
  a very nice carry recurrence between iterations (add, adc, adc $0, four
  cycles) and a small number of instructions (15 per iteration, 32 for the
  twice unrolled loop), which one might think could execute in 11 cycles,
  or 5.5 per iteration, or 2.75 cycles per limb product, it won't run
  faster than 6.5 cycles per iteration, or 3.25 c/l.
  
  So it just seems very difficult to convince the CPU to really execute
  the independent operations, outside of the recurrency, in parallel.
  
Did you compute the recurrency chain?  Annotating the instructions on
the recurrency chain helps understanding the problem.

My experience of Sandybridge is that with load/store coding style, the
CPU typically executes 3 insn/cycle unless there is a recurrency
dependency stopping that.

  BTW, are any of the SSE3 etc instructions useful here?
  
I don't think there are.  These instructions are mostly FP plus narrow
integer ops.  

-- 
Torbjörn


Re: Sandybridge addmul_N challenge

2012-02-23 Thread Niels Möller
Torbjorn Granlund  writes:

> How did you arrive at 12.25 insns/limb?

I did something wrong, I guess... 9.25 instructions per limb, or 3.1
cycles per limb if we can issue 3 instructions per cycle.

> I very much doubt this will win for Sandybridge, unless you can decrease
> the insn count with several instructions.

One can decrease it a bit by adding c0, c1 earlier (do you think
recurrency can be a problem if we add c0, c1 to the first product?) and
doing an in-place add to (rp) and 8(rp) at the end.

I could get it down to 30 instructions with a deep carry recurrency, or
34 with a short one. I can get neither variant to run faster than 4 c/l.

I also had a quick look at doing Karatsuba based on (u0+u1)*(v0+v1).
It's about the same number of instructions, but the updates from
carries are independent of all products, so there's more freedom in
where to move them around.

I think this idea may be more useful for other processors, without the
awkward hardwired mul registers. For documentation, this is what the
iteration should do:

vd = |v1 - v0|, sign in vs (outside loop)

ud = |u1 - u0|, sign in us
s = us ^ vs ^ 1
<p3, p2> = u1 * v1
<m1, m0> = ud * vd ^ <-s, -s>
<p1, p0> = u0 * v0

+-----+-----+
|p3 p2|p1 p0|
+-----+-----+
      |c1 c0|
      +-----+
      |r1 r0|
      +-----+
   |p1 p0|
   +-----+
   |p3 p2|
   +-----+
|-s| 0| s|
+--+--+--+
   |m1 m0|
   +-----+
+-----------+
|c3 c2 r1 r0|
+-----------+

or

<vc, vs> = v1 + v0 (outside loop)

<uc, us> = u1 + u0
<p3, p2> = u1 * v1
<m1, m0> = us * vs
<p1, p0> = u0 * v0

+-----+-----+
|p3 p2|p1 p0|
+-----+-----+
      |c1 c0|
      +-----+
      |r1 r0|
      +-----+
 - |p1 p0|
   +-----+
 - |p3 p2|
   +-----+
   |m1 m0|
   +-----+
|vc vs|        if uc
+-----+
   |us|        if vc
   +--+
+-----------+
|c3 c2 r1 r0|
+-----------+

> Perhaps some of the adc $0 could be eliminated with 2x unrolling?

In effect, that would be a kind of 4x2 multiply, which would then be
done as two 2x2 (I think the high limbs one gets from evaluation rule
out using toom32 or toom42).

Haven't tried that. I suspect one will run out of registers.

> I played a bit with mul_2 yesternight.  I am not 100% sure the code is
> correct, but I think it is.  The loopmixer found a 2.5 c/l version of
> it.

Nice. I've now wasted quite some time... It seems really
difficult.

Now I also tried a very basic variant of addmul_2, doing only one u limb
per iteration and multiplying it by the two v limbs. Even though I have a
very nice carry recurrence between iterations (add, adc, adc $0, four
cycles) and a small number of instructions (15 per iteration, 32 for the
twice unrolled loop), which one might think could execute in 11 cycles,
or 5.5 per iteration, or 2.75 cycles per limb product, it won't run
faster than 6.5 cycles per iteration, or 3.25 c/l.

So it just seems very difficult to convince the CPU to really execute
the independent operations, outside of the recurrency, in parallel.

BTW, are any of the SSE3 etc instructions useful here?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


Re: Sandybridge addmul_N challenge

2012-02-23 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  Here's a sketch of an addmul_2 iteration using Karatsuba. I assume we
  have vl, vh, vd = |vl - vh| and an appropriate sign vmask in registers
  before the loop. Carry input in c0, c1, carry out in r2, r3.
  
  mov   (up), %rax
  mov   %rax, ul
  mul   vl  C low product
  mov   8(up), uh
  mov   %rax, r0
  mov   %rdx, r1
  lea   (uh, ul), %rax
  sub   uh, ul
  cmovnc ul, %rax
  sbb   r3, r3
  mul   vd  C Middle product
  mov   r1, r2  C Add shifted low product
  add   r0, r1
  adc   $0, r2
  add   (rp), r0C Add rp limbs
  adc   8(rp), r1
  adc   $0, r2  
  mov   %rax, p0
  mov   %rdx, p1
  mov   uh, %rax
  mul   vh  C High product
  xor   vmask, r3
  xor   r3, p0  C Conditionally negate, and add, middle product
  xor   r3, p1
  bt$0, r3
  adc   p0, r1
  adc   p1, r2
  adc   $0, r3
  add   %rax, r1C Add shifted high product
  adc   %rdx, r2
  adc   $0, r3
  add   c0, r0  C Add input carry limbs
  adc   c1, r1
  mov   r0, (rp)
  mov   r1, 8(rp)
  adc   %rax, r2C Add high product
  adc   %rdx, r3
  
  37 instructions, or 12.25 instructions per limb, excluding looping logic
  (and it has to be unrolled twice, to use separate registers for input
  and output carries).
  
How did you arrive at 12.25 insns/limb?  I have not tried to understand
the code, but doesn't it perform a 2x2 limb multiply with accumulation?
That's 9.25 insn/limb product.

I very much doubt this will win for Sandybridge, unless you can decrease
the insn count by several instructions.  Unfortunately it has no chance
on Bulldozer, since the latter has a 2-issue pipeline; you need to beat
32 insns per 2x2 accumulation block there.

  I think the instruction count can be reduced a bit, at the cost of
  higher pressure on the out-of-order execution.
  
Perhaps some of the adc $0 could be eliminated with 2x unrolling?
  
  What do you think? If one can get one iteration to run at 12 cycles,
  that's 3 c/l and an improvement over addmul_1. If one can get it down to
  11 or 11.5, one beats 3 c/l.
  
If that is possible, it might not be enough...  See below.

  For a "vertical" mul_basecase, the quadratic work would be an iteration
  of
  
  mov   (up), %rax
  mul   (vp)
  add   %rax, r0
  adc   %rdx, r1
  adc   $0, r3
  
  So there's potential for that to run at 2 cycles per limb product. But
  then there's also a significant linear cost for accumulation and carry
  propagation, and possibly bad branch prediction due to loops of varying
  lengths.
  
Exactly.  (But the branch misprediction problem would not happen for
David's mulmid_basecase, I suppose.)

Some ways to deal with the branch misprediction problem:

* Have straight line code for the corners, up to a limit.  This gets rid
  of the really high relative branch misprediction for these small
  areas.

* Handle two (or more) columns in parallel, and separately for the
  low-significant right triangle, any middle rectangular part, and the
  left triangle.  This doubles (or more) the amount of useful work per
  branch misprediction.

* I suppose that making full use of out-of-order execution just before a
  mispredicted branch would make sense.

I played a bit with mul_2 yesternight.  I am not 100% sure the code is
correct, but I think it is.  The loopmixer found a 2.5 c/l version of
it.

I started with genxmul.c (from the loopmixer repo) using these args:
"-n2 -w4 --mul".  I then analysed the critical path and determined that
it is about 24 cycles.  The problem is adc feeding another adc feeding
another adc through a register (remember that pure carry deps are fast
on Sandybridge).  So I mindlessly introduced 4 new registers, using
alternating accumulation registers.  I needed 8 extra insns (in total,
corresponding to 2 per way) to pairwise sum the accumulation registers.

I am sure this was not done optimally, but (assuming my code is sound)
it proves that there is a lot of performance headroom, as expected.

I conjecture that we could create an addmul_N for Sandybridge that runs
at <= 2.5 c/l.  I think this will be possible already for N=2.  Perhaps
we could arrive at 2.25 for N=4, matching the K8-K10 performance.

-- 
Torbjörn


Re: Sandybridge addmul_N challenge

2012-02-23 Thread Niels Möller
Torbjorn Granlund  writes:

> I think we should focus not on addmul_1 but on mul_basecase,
> sqr_basecase, redc_1, perhaps redc_2.  I.e., please focus on addmul_2
> (or addmul_N, N > 2) or vertical multiplication primitives.

Here's a sketch of an addmul_2 iteration using Karatsuba. I assume we
have vl, vh, vd = |vl - vh| and an appropriate sign vmask in registers
before the loop. Carry input in c0, c1, carry out in r2, r3.

mov (up), %rax
mov %rax, ul
mul vl  C low product
mov 8(up), uh
mov %rax, r0
mov %rdx, r1
lea (uh, ul), %rax
sub uh, ul
cmovnc  ul, %rax
sbb r3, r3
mul vd  C Middle product
mov r1, r2  C Add shifted low product
add r0, r1
adc $0, r2
add (rp), r0C Add rp limbs
adc 8(rp), r1
adc $0, r2  
mov %rax, p0
mov %rdx, p1
mov uh, %rax
mul vh  C High product
xor vmask, r3
xor r3, p0  C Conditionally negate, and add, middle product
xor r3, p1
bt  $0, r3
adc p0, r1
adc p1, r2
adc $0, r3
add %rax, r1C Add shifted high product
adc %rdx, r2
adc $0, r3
add c0, r0  C Add input carry limbs
adc c1, r1
mov r0, (rp)
mov r1, 8(rp)
adc %rax, r2C Add high product
adc %rdx, r3

37 instructions, or 12.25 instructions per limb, excluding looping logic
(and it has to be unrolled twice, to use separate registers for input
and output carries).

I think the instruction count can be reduced a bit, at the cost of
higher pressure on the out-of-order execution.

* At least the moves to p0, p1 can be eliminated.

* One could also save some instructions by adding in c0, c1 earlier,
  and doing an in-place add to (rp) at the end, on the theory that the
  recurrency is less tight.

* I'm also not sure if the order of the three multiplications is the
  best one.
  
* I don't try to optimize the add HIGH(ul * vl) + LOW(uh * vh), which
  (if additions are organized in the right way) is done twice.  I
  suspect it's going to be a bit painful since the carry has to be
  applied at two places.

What do you think? If one can get one iteration to run at 12 cycles,
that's 3 c/l and an improvement over addmul_1. If one can get it down to
11 or 11.5, one beats 3 c/l.

For a "vertical" mul_basecase, the quadratic work would be an iteration
of

mov (up), %rax
mul (vp)
add %rax, r0
adc %rdx, r1
adc $0, r3

So there's potential for that to run at 2 cycles per limb product. But
then there's also a significant linear cost for accumulation and carry
propagation, and possibly bad branch prediction due to loops of varying
lengths.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.