Re: div_qr_1 interface

2013-10-21 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  Will try that. I think one could also try to delay the quotient store
  one iteration, keeping Q1 in a register until the next iteration. Then
  one gets rid of the
  
adc Q2,8(QP, UN, 8)
  
  in the loop, using only a single store per iteration in the likely case.
  May need yet another register, though.
  
On Intel chips, op-to-mem is expensive.  Even op-from-memory is often
slower than load+op.  (I understand the register shortage problem.)

   I suspect one or two of the register-to-register copy insns could be
   optimised out.
  
  Maybe. And it would be easier to avoid moves if one unrolls the loop
  twice, switching roles U0-U1 and Q0-Q1. But that makes it a bit more
  bloated, of course.
  
It might be worth it, since this is an importand operation.

   In order to run this through the loopmixer, you need to setup data in
   the prologue which makes the adjustment branch to never be taken.
   Letting the inverse be 0 or else B-1 might work...
  
  I vaguely recall some previous attempt at loopmixing this, but I don't
  remember any success.

Let's take a look at current performance on all amd64 CPUs except nocona
(=pentium4).  I compare the pi variants here.

Conclusions:

* The code is no win for AMD k10/k8 (although close to 10 c/l might well be
  possible)

* The code is a big win for AMD bulldozer and also for piledriver
  
* The code is a big win for Intel core2 (alias conroe)

* The code is a cycle slower for Intel sandybridge

* The code is a cycle faster on Intel nehalem, ivybridge, haswell

* The code is a big win for VIA nano

In ~tege/GMP/newdiv/div_1n_pi2-x86_64.asm I claim 9.75 c/l (and that 7
c/l is possible) for k10/k8, 16 c/l for core2, and 13.25 c/l for
nehalem.  Of course, the precomputation cost there is much higher.

 k10 
overhead 6.00 cycles, precision 100 units of 3.12e-10 secs, CPU freq 
3200.35 MHz
mpn_div_qr_1n_pi1.0xcafebabedeadbeef 
mpn_preinv_divrem_1.0xcafebabedeadbeef
1#12.0018   26.0043
2 19.5030  #19.5024
3 17.6695  #17.3362
5 15.8019  #15.6019
8 14.7518  #14.6267
13   #14.0895   15.0901
22   #14.1463   14.2366
37   #13.6849   13.7393
62   #13.4139   13.4445
105  #13.2498   13.2589
178  #13.1524   13.1632
302  #13.0952   13.1011
513  #13.0607   13.0642
872  #13.0302   13.0325
 bulldozer 
overhead 6.00 cycles, precision 100 units of 2.77e-10 secs, CPU freq 
3612.09 MHz
mpn_div_qr_1n_pi1.0xcafebabedeadbeef 
mpn_preinv_divrem_1.0xcafebabedeadbeef
1#13.9118   30.8628
2#22.5047   24.0040
3 20.6703  #20.3110
5#18.0036   20.0033
8#17.2535   19.2532
13   #16.7725   19.8804
22   #17.0943   20.5489
37   #16.6519   20.4899
62   #16.3905   20.2277
105  #16.2322   20.1748
178  #16.1383   20.0710
302  #16.0895   20.0499
513  #16.0513   20.0218
872  #16.0337   20.0186
 piledriver 
overhead 6.00 cycles, precision 100 units of 7.14e-10 secs, CPU freq 
4000.00 MHz
mpn_div_qr_1n_pi1.0xcafebabedeadbeef 
mpn_preinv_divrem_1.0xcafebabedeadbeef
1#13.4460   27.7072
2#21.2536   22.0034
3#19.1284   20.6698
5#17.6283   19.6027
8#16.8365   19.1819
13   #16.7634   19.3874
22   #16.5480   19.1393
37   #16.5433   18.9761
62   #16.3419   18.8095
105  #16.2121   18.6698
178  #16.0991   18.6101
302  #16.0503   18.5661
513  #16.2965   18.5379
872  #16.3580   19.0568

 core2 
overhead 6.01 cycles, precision 100 units of 4.69e-10 secs, CPU freq 
2132.93 MHz
mpn_div_qr_1n_pi1.0xcafebabedeadbeef 
mpn_preinv_divrem_1.0xcafebabedeadbeef
1#15.7048   28.7024
2#26.5272   26.5408
3#24.7981   25.2783
5#21.9089   24.9270
8#20.9994   24.4683
13   #20.4778   24.1549
22   #20.0956   23.8461
37   #19.7079   23.8088
62   #19.6855   23.8366
105  #19.5935   23.9688
178  #19.3434   23.8856
302  #19.3213   23.8421
513  #19.4093   23.8145
872  #19.3424   23.8016
 nehalem 
overhead 6.00 cycles, precision 100 units of 3.75e-10 secs, CPU freq 
2670.00 MHz
mpn_div_qr_1n_pi1.0xcafebabedeadbeef 
mpn_preinv_divrem_1.0xcafebabedeadbeef
1#12.1014   24.6814
2#21.0024   21.8684
3 20.9170  #20.5440
5#19.6475   20.3452
8

Re: div_qr_1 interface

2013-10-21 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes:

 On Intel chips, op-to-mem is expensive.  Even op-from-memory is often
 slower than load+op.  (I understand the register shortage problem.)

The following (untested) variant needs one register too many.

  UP, QP, UN:   Load, store, loop counter.
  DINV, B2, B2md:   Loop invariant constants.
  U2, U1I, U0, Q1I, Q0: Inputs.
  U1O, Q1O: Outputs.
  Q2, %rax, %rdx:   Locals.

Also U1I - U1O recurrency chain (with opteron cycle counts)

mov U2, Q2
mov U2, Q1O
neg Q2
mov DINV, %rax
and DINV, Q1O
mul U1I
add Q0, Q1O
adc $0, Q2
mov %rax, Q0
add %rdx, Q1O
adc $0, Q2

mov B2, %rax
and B2, U2
mul U1I C 0 6
lea (U0, B2md), U1O
add U0, U2
cmovnc  U2, U1O
adc U1I, Q1O
adc Q1I, Q2
mov Q2, (QP, UN, 8)
jc  L(incr)

L(incr_done):
mov (UP, UN, 8), U0
add %rax, U0C 4
adc %rdx, U1O   C 5
sbb U2, U2  C 6

25 instructions (27 K10 decoder slots) excluding loop overhead.

But one variable must be moved out of the registers. Maybe B2md (used
once) is the best candidate. Then

lea (U0, B2md), U1O

would have to be replaced by

mov (%rsp), U1O C Can be done very early
...
add U0, U1O

We then have 26 instructions + loop overhead, or 54 instructions for 2
iterations. Or possibly DINV, if one thinks the quotient logic is less
critical.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
___
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel


Re: div_qr_1 interface

2013-10-21 Thread Torbjorn Granlund
I looked at the logic following this:

sbb U2, U2  C 7 13

You negate the U2 copy in Q2.  It seems that three adc by sbb
could avoid the neg.

I might also be possible to replace the early loop and stuff by cmov.
Note that the carry flag survives dec, although that causes a pipeline
replay on older Intel chips.  (IIRC, only sandybridge, ivybridge,
haswell runs that well.)

  But one variable must be moved out of the registers. Maybe B2md (used
  once) is the best candidate. Then
  
lea (U0, B2md), U1O
  
  would have to be replaced by
  
mov (%rsp), U1O C Can be done very early
  ...
  add   U0, U1O
  
  We then have 26 instructions + loop overhead, or 54 instructions for 2
  iterations. Or possibly DINV, if one thinks the quotient logic is less
  critical.
  
Reading from a stack slot costs nothing under ideal circumstances.

To optimise register usage, I sometimes annotate the code with live
ranges for each register.  That will help with register coalescing.
(T is rather shot-lived, perhaps its register could serve two usages?)

-- 
Torbjörn
___
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel


Re: div_qr_1 interface

2013-10-21 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes:

 I looked at the logic following this:

 sbb U2, U2  C 7 13

 You negate the U2 copy in Q2.  It seems that three adc by sbb
 could avoid the neg.

The problem is the final use, where Q2 is added, with carry, to a
different register. It's tempting to replace

adc Q1I, Q2

with

sbb Q2, Q1I

and negated Q2, but I'm afraid that will get the sense of the carry
wrong. Do you see any trick to get that right without negating Q2
somewhere along the way?

 I might also be possible to replace the early loop and stuff by
 cmov.

Maybe, but the simple way to do conditional addition with lea + cmov
won't to, since we also need carry out.

Does it matter if we do

mov B2, r
and mask, r

or

mov $0, r
cmovc   B2, r

?

 To optimise register usage, I sometimes annotate the code with live
 ranges for each register.  That will help with register coalescing.

There are lots of possibilities, since the computations for Q and U are
mostly independent. The data flow is something like

  load U limb
   |
  _V_
  U2, U1I, U0 - |___| - U2, U1O, U0 
   \   |__/ cy
   _V__V___V_
Q1I, Q0- |__|  - Q1O, Q0
\
 V
   store Q limb

 (T is rather shot-lived, perhaps its register could serve two usages?)

It could perhaps eliminated.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
___
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel