Re: div_qr_1 interface

2013-10-26 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  A long time ago, we choose an interface for sbpi1_div_qr which does
  *not* store the most significant limb; instead it returns it. I think it
  was the intention that a new top-level mpn_div_qr should follow that
  convention, and not store the top limb. I don't remember if that was
  purely for esthetic reasons, or if it also simplified the divide&conquer
  division code.
  
  Having some of the special division case functions, like mpn_div_qr_1,
  storing the top quotient limb leads to unnecessary complications when
  writing the general mpn_div_qr; it forces mpn_div_qr for n/1 to do the
  top limb itself and call mpn_div_qr_1 with (n-1)/1, which seems like a
  case of bad impedance mismatch.
  
  But then, requirements for div_qr_1 don't necessarily translate to
  requirements for the primitive division loops we're discussing now.
  
  For the recent mpn_div_qr_1 and mpn_div_qr_1n_pi1, it turned out to work
  fine to have mpn_div_qr_1 for n/1 do the top quotient limb up front, and
  let mpn_div_qr_1n_pi1 generate n-1 quotient limbs (but n-1 *full* limbs,
  since we pass in a high u limb, reduced to uh < d).
  
  It's unclear to me if the best design for the unnormalized case is to
  follow mpn_div_qr_1n_pi1 or not. It may get clearer after writing a
  decent mpn_div_qr_1u_pi1.
  
Thanks, this all makes sense to me now.  :-)

I think it will not be too hard to adapt the new pi2 code; moving the
current after-loop code to before the loop will probably give the best
structure, at least for the 1n variant.

The 1u case is a bit trickier.  Here, we will always have a final
quotient limb which depends on a remainder limb and a partial dividend
limb.

-- 
Torbjörn
___
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel


Re: div_qr_1 interface

2013-10-25 Thread Niels Möller
Torbjorn Granlund  writes:

> ni...@lysator.liu.se (Niels Möller) writes:
>
>   The interesting thing is that the next higher function, mpn_div_qr_1,
>   should return the high quotient limb separately.
>
> I am not sure I agree.  Please explain.

A long time ago, we choose an interface for sbpi1_div_qr which does
*not* store the most significant limb; instead it returns it. I think it
was the intention that a new top-level mpn_div_qr should follow that
convention, and not store the top limb. I don't remember if that was
purely for esthetic reasons, or if it also simplified the divide&conquer
division code.

Having some of the special division case functions, like mpn_div_qr_1,
storing the top quotient limb leads to unnecessary complications when
writing the general mpn_div_qr; it forces mpn_div_qr for n/1 to do the
top limb itself and call mpn_div_qr_1 with (n-1)/1, which seems like a
case of bad impedance mismatch.

But then, requirements for div_qr_1 don't necessarily translate to
requirements for the primitive division loops we're discussing now.

For the recent mpn_div_qr_1 and mpn_div_qr_1n_pi1, it turned out to work
fine to have mpn_div_qr_1 for n/1 do the top quotient limb up front, and
let mpn_div_qr_1n_pi1 generate n-1 quotient limbs (but n-1 *full* limbs,
since we pass in a high u limb, reduced to uh < d).
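The up-front step described here can be sketched as a hedged C model (not the GMP source; the name div_qr_1_top and the limb typedef are illustrative only):

```c
#include <stdint.h>

typedef uint64_t limb;

/* Model of the up-front handling: compute the top quotient limb
   first and reduce the high dividend limb to uh < d, so that a
   pi1-style loop afterwards only ever produces full quotient limbs. */
static limb
div_qr_1_top (limb *qh, limb uh, limb d)
{
  *qh = uh / d;   /* top quotient limb, handled before the main loop */
  return uh % d;  /* reduced high limb, now strictly less than d */
}
```

After this step, the remaining n-1 quotient limbs can be generated by the main loop under the invariant uh < d.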

It's unclear to me if the best design for the unnormalized case is to
follow mpn_div_qr_1n_pi1 or not. It may get clearer after writing a
decent mpn_div_qr_1u_pi1.

Regards,
/Niels




-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


Re: div_qr_1 interface

2013-10-25 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  The interesting thing is that the next higher function, mpn_div_qr_1,
  should return the high quotient limb separately.

I am not sure I agree.  Please explain.

You're saying that an n-limb consecutive dividend should yield an
(n-1)-limb consecutive quotient plus a separate high quotient limb,
right?

-- 
Torbjörn


Re: div_qr_1 interface

2013-10-25 Thread Niels Möller
Torbjorn Granlund  writes:

> I now realise that you use a n+1 limb dividend passed as separate
> operands for the low n limbs and the high limb. I had misunderstood
> your intended interface...

It doesn't *have* to be 100% consistent. The interesting thing is that
the next higher function, mpn_div_qr_1, should return the high quotient
limb separately. For div_qr_1n_pi1, it seemed natural to pass a high
limb, uh < d, as a separate argument.
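A hedged C model of that interface shape (not the actual GMP code: the real function also takes a precomputed inverse, and this model assumes a compiler providing unsigned __int128):

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t limb;

/* The high limb uh, already reduced to uh < d, is a separate
   argument; n quotient limbs are stored and the remainder returned.
   Since uh < d holds on entry to every step, each quotient fits in
   a single (full) limb. */
static limb
div_qr_1n_model (limb *qp, const limb *up, size_t n, limb uh, limb d)
{
  for (size_t i = n; i-- > 0; )
    {
      unsigned __int128 u = ((unsigned __int128) uh << 64) | up[i];
      qp[i] = (limb) (u / d);
      uh = (limb) (u % d);
    }
  return uh;
}
```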

> The new pi2 functions follow your suggestion to stuff the dividend
> together with the various precomputations in a single operand.  Here it
> is a structure, perhaps we could use a simple array instead (like cps).

I'll see if I'll get to change qr_1n_pi1 too.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


Re: div_qr_1 interface

2013-10-24 Thread Torbjorn Granlund
I pushed initial C versions of these functions:

mpn_div_qr_1n_pi2
mpn_div_qr_1u_pi2

I have had these for a long time, judging from the file time stamps.

These accept n-limb dividends in a single consecutive operand and
generate n-limb quotients also in a consecutive operand.  I now realise
that you use a n+1 limb dividend passed as separate operands for the low
n limbs and the high limb.  I had misunderstood your intended
interface...

The new pi2 functions follow your suggestion to stuff the dividend
together with the various precomputations in a single operand.  Here it
is a structure, perhaps we could use a simple array instead (like cps).

-- 
Torbjörn


Re: div_qr_1 interface

2013-10-24 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> Torbjorn Granlund  writes:
>
>> Basically qp = up won't work, but qp = up + k for any positive k will?
>> Does the C code share that property?
[...]
>> I think it would be good to fix that, since it is surely a common usage
>> scenario.
>
> I agree.

I've checked in some bugfixes. Now it seems to work according to try,
and without the no-overlap patch which it turned out I never pushed into
the repo.

> If/when the code is reorganized to delay the q stores (to avoid the "adc
> Q2,8(QP, UN, 8)" instruction),

Here's a first crude version, which seems to work. I could eliminate
other uses of T, and then use that register to keep the previous Q1
limb around.

I had to change the logic:

lea (B2md, U0), T
...
and B2, U2
add U2, U0
cmovnc  U0, T
adc U1, Q1

Here, B2md holds B^2 (mod d) - d, and U2 is initially a mask. To use U2
as temporary (it dies at the add), I had to change it to

and B2, U2
add U2, U0
lea (B2md, U0), U2
cmovnc  U0, U2
adc U1, Q1

which gives the same result, provided that B2md is adjusted to just hold
-d. This sequence has less friendly dependencies. Maybe it would be
better to stick some of the loop-invariant constants in memory instead.

It's getting late, so I'll not do any benchmarking right now.

Regards,
/Niels

dnl  x86-64 mpn_div_qr_1n_pi1
dnl  -- Divide an mpn number by a normalized single-limb number,
dnl using a single-limb inverse.

dnl  Contributed to the GNU project by Niels Möller

dnl  Copyright 2013 Free Software Foundation, Inc.

dnl  This file is part of the GNU MP Library.

dnl  The GNU MP Library is free software; you can redistribute it and/or modify
dnl  it under the terms of the GNU Lesser General Public License as published
dnl  by the Free Software Foundation; either version 3 of the License, or (at
dnl  your option) any later version.

dnl  The GNU MP Library is distributed in the hope that it will be useful, but
dnl  WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
dnl  or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
dnl  License for more details.

dnl  You should have received a copy of the GNU Lesser General Public License
dnl  along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.

include(`../config.m4')


C   c/l
C AMD K8,K9 13
C AMD K10   13
C AMD bull  16.5
C AMD pile  15
C AMD steam  ?
C AMD bobcat16
C AMD jaguar ?
C Intel P4  47  poor
C Intel core19.25
C Intel NHM 18
C Intel SBR 15  poor
C Intel IBR 13
C Intel HWL 11.7
C Intel BWL  ?
C Intel atom52  very poor
C VIA nano  19


C INPUT Parameters
define(`QP', `%rdi')
define(`UP', `%rsi')
define(`UN_INPUT', `%rdx')
define(`U1', `%rcx')  C Also in %rax
define(`D', `%r8')
define(`DINV', `%r9')

C Invariants
define(`B2', `%rbp')
define(`B2md', `%rbx')

C Variables
define(`UN', `%r8') C Overlaps D input
define(`T', `%r10')
define(`U0', `%r11')
define(`U2', `%r12')
define(`Q0', `%r13')
define(`Q1', `%r14')
define(`Q2', `%r15')

ABI_SUPPORT(STD64)

ASM_START()
TEXT
ALIGN(16)
PROLOGUE(mpn_div_qr_1n_pi1)
FUNC_ENTRY(6)
IFDOS(` mov 56(%rsp), %r8   ')
IFDOS(` mov 64(%rsp), %r9   ')
dec UN_INPUT
jnz L(first)

C Just a single 2/1 division.
C T, U0 are allocated in scratch registers
lea 1(U1), T
mov U1, %rax
mul DINV
mov (UP), U0
add U0, %rax
adc T, %rdx
mov %rdx, T
imul D, %rdx
sub %rdx, U0
cmp U0, %rax
lea (U0, D), %rax
cmovnc  U0, %rax
sbb $0, T
cmp D, %rax
jc  L(single_div_done)
sub D, %rax
add $1, T
L(single_div_done):
mov T, (QP)
FUNC_EXIT
ret
L(first):
C FIXME: Could delay some of these until we enter the loop.
push %r15
push %r14
push %r13
push %r12
push %rbx
push %rbp

mov D, B2
imul DINV, B2
neg B2
C   mov B2, B2md
C   sub D, B2md
mov D, B2md
neg B2md

C D not needed until final reduction
push D
mov UN_INPUT, UN  C Clobbers D

mov DINV, %rax
mul U1
mov %rax, Q0
add U1, %rdx
mov %rdx, Q1

mov B2, %rax
mul U1
mov -8(UP, UN, 8), U0
mov (UP, UN, 8), U1
C mov   Q1, (QP, UN, 8)
add %rax, U0
adc %rdx, U1
sbb U2, U2
dec UN
mov U1, %rax
jz  L(final)

ALIGN(16)

C Loop is 28 instructions, 30 decoder slots, should run in 10 cycles.

Re: div_qr_1 interface

2013-10-24 Thread Niels Möller
Torbjorn Granlund  writes:

> Basically qp = up won't work, but qp = up + k for any positive k will?
> Does the C code share that property?

I think the C code has the same problem. Hmm, and it looks like in both
the C and asm code, it's only the first iteration, before the main loop,
which has this problem. Then it should be easy to fix; in the asm case,
it's one additional mov, and not in the main loop.

> I think it would be good to fix that, since it is surely a common usage
> scenario.

I agree.

If/when the code is reorganized to delay the q stores (to avoid the "adc
Q2,8(QP, UN, 8)" instruction), I think the overlap issue will also get
fixed in the process.
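The connection between store order and overlap can be shown with a small C model (illustrative only; assumes unsigned __int128): as long as each iteration loads up[i] strictly before storing qp[i], the fully overlapping call qp == up is safe.

```c
#include <stdint.h>
#include <stddef.h>

/* Schoolbook n/1 division.  Reading the dividend limb before writing
   the quotient limb at the same index is what makes in-place
   operation (qp == up) legal; storing a quotient limb ahead of
   schedule breaks exactly this invariant. */
static uint64_t
div1_read_before_write (uint64_t *qp, const uint64_t *up, size_t n,
                        uint64_t d)
{
  uint64_t r = 0;
  for (size_t i = n; i-- > 0; )
    {
      unsigned __int128 u = ((unsigned __int128) r << 64) | up[i]; /* load */
      qp[i] = (uint64_t) (u / d);                                  /* store */
      r = (uint64_t) (u % d);
    }
  return r;
}
```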

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


Re: div_qr_1 interface

2013-10-23 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  Fixed now.
  
  > Then there's another problem with n==2, which needs a bit more
  > debugging.
  
  Problem was that the q limb is written a bit too early, breaking
  operation with overlapping operands. Maybe that ought to be fixed,
  but for now I changed try.c to
  not allow any overlap.
  
Basically qp = up won't work, but qp = up + k for any positive k will?
Does the C code share that property?

I think it would be good to fix that, since it is surely a common usage
scenario.  Furthermore, divrem_1 falling back to the new stuff would
need some complications if we disallow full overlap.

-- 
Torbjörn


Re: div_qr_1 interface

2013-10-23 Thread Niels Möller
Torbjorn Granlund  writes:

> ni...@lysator.liu.se (Niels Möller) writes:
>
>   These sizes aren't exercised by div_qr_1.
>
> The plain test code?  Then I suppose it should be fixed...

I think t-div calls mpn_div_qr_1 for all reasonably small sizes. But
that function doesn't always call mpn_div_qr_1n_pi1, depending on
thresholds, and the up-front handling of the high quotient limb.

Separate tests for mpn_div_qr_1n_pi1 would be needed to improve
coverage.

/Niels
-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


Re: div_qr_1 interface

2013-10-23 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> And sure enough, it detects some bugs in the new assembly code. For size
> n==1, there's a missing mov.

Fixed now.

> Then there's another problem with n==2, which needs a bit more
> debugging.

Problem was that the q limb is written a bit too early, breaking
operation with overlapping operands. Maybe that ought to be fixed,
but for now I changed try.c to
not allow any overlap.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


Re: div_qr_1 interface

2013-10-22 Thread Torbjorn Granlund
I added data for the new code at .

There is a line for div_qr_1u_pi1 as well, since that will also be
needed.  It might actually be more common that the divisor is not
normalised.

I should try to wrap up div_qr_1n_pi2 and div_qr_1u_pi2 as well, and
then add threshold for the non-invariant case.  If my old data for those
are correct, then it is always faster for large enough operands.

I expect div_qr_1u_pi1 to be no slower than div_qr_1n_pi1 on some
machines, just like divrem_1 is often the same speed for normalised and
unnormalised divisors (sometimes using one loop, sometimes using two).

To use just one loop, we probably need an efficient shrd, since then the
normalised case just means a shift count of 0.  (Only Intel's high-end
processors run shrd well.)
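The point about a zero shift count can be illustrated in C (a hedged sketch, not GMP code): combining two limbs across the normalization shift needs a special case for cnt == 0, since shifting by the full limb width is undefined in C, whereas shrd accepts a zero count directly.

```c
#include <stdint.h>

/* Combine two adjacent dividend limbs under a left-normalization
   shift by cnt bits (0 <= cnt < 64).  The cnt == 0 branch covers the
   already-normalized divisor; x86 shrd handles that count in the
   same instruction, which is what would allow one unified loop. */
static uint64_t
combine_limbs (uint64_t hi, uint64_t lo, unsigned cnt)
{
  return cnt ? (hi << cnt) | (lo >> (64 - cnt)) : hi;
}
```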

I suppose we should provide just div_qr_1_pi1 when a general loop is
fast.

-- 
Torbjörn


Re: div_qr_1 interface

2013-10-22 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  And sure enough, it detects some bugs in the new assembly code. For size
  n==1, there's a missing mov. I'll add that shortly. Then there's another
  problem with n==2, which needs a bit more debugging.
  
Good.  So now you have debugged the new try.c code with buggy assembly.
You might actually try reads before and after the areas to make sure they
trigger segfaults.

  These sizes aren't exercised by div_qr_1.

The plain test code?  Then I suppose it should be fixed...

-- 
Torbjörn


Re: div_qr_1 interface

2013-10-22 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> ni...@lysator.liu.se (Niels Möller) writes:
>
>> But sure, support also in try.c would be good.
>
> Added now.

And sure enough, it detects some bugs in the new assembly code. For size
n==1, there's a missing mov. I'll add that shortly. Then there's another
problem with n==2, which needs a bit more debugging. These sizes aren't
exercised by div_qr_1.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


Re: div_qr_1 interface

2013-10-22 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  ni...@lysator.liu.se (Niels Möller) writes:
  
  > But sure, support also in try.c would be good.
  
  Added now. Please have a look to see if the changes are sane. I use the
  second source for the uh input, and I added a DATA_DIV_QR_1 to get it in
  the proper range. I couldn't figure out any way to make size2 == 1
  always, so I just set tr->size2 = 1 and let it be whatever size try
  figures out. Anyway, only the low limb is used.
  
  I also tried setting tr->size_2 == SIZE_2, but then it crashed when
  copying arguments around, I don't understand why.
  
This confuses me too, I'm afraid.

I never really understood try.c, and it is always a pain to add things.

(I didn't write try.c, Kevin did.  It is a very useful tool, and I am
grateful for his contribution.)

-- 
Torbjörn


Re: div_qr_1 interface

2013-10-22 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> But sure, support also in try.c would be good.

Added now. Please have a look to see if the changes are sane. I use the
second source for the uh input, and I added a DATA_DIV_QR_1 to get it in
the proper range. I couldn't figure out any way to make size2 == 1
always, so I just set tr->size2 = 1 and let it be whatever size try
figures out. Anyway, only the low limb is used.

I also tried setting tr->size_2 == SIZE_2, but then it crashed when
copying arguments around, I don't understand why.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


Re: div_qr_1 interface

2013-10-22 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  > (3) Offset UP to avoid the offset in the loop.
  
  Should be straight forward. Or one could offset UN, but I think it's
  nice to be able to use dec / jnz for the loop.

There is plenty of execution bandwidth, so using 2 x lea, dec, jnz for
the loop control is no problem.  Only when one is decode (or dispatch,
retire) bandwidth constrained is indexing a good thing.

  Or is the obscure "loop" decrement and branch instruction useful?
  
I've hardly ever used it.  IIRC, it can only reach -0x80..+0x7f from its
end.

-- 
Torbjörn


Re: div_qr_1 interface

2013-10-22 Thread Niels Möller
Torbjorn Granlund  writes:

> But I lack unit testing code for the function, making hacking quite
> cumbersome.  I don't feel safe hacking *any* GMP assembly code without
> tests/devel/try.c's function and access checks.

tests/mpn/t-div.c includes tests for mpn_div_qr_1, including single
guard limbs before and after the q area. That provides for some
reasonable testing, and I was running "while GMP_CHECK_RANDOMIZE=0
./t-div ; do : ; done" overnight during the weekend.

But sure, support also in try.c would be good.

> (1) Shorten a dep chain, and avoid keeping CF live over an inc
> instruction.  The cmov doesn't really depend on sbb, since the
> latter insn never really changes carry.  (This might btw be useful
> to teach loopmixer!)

Right, using cmov (instead of masking) allows some more reordering.

> (2) Reallocate Q2 to an "old" register (not r8-r15) and then use the
> 32-bit form of "adc $0,reg".  That form is shorter.

Clever!

> (3) Offset UP to avoid the offset in the loop.

Should be straight forward. Or one could offset UN, but I think it's
nice to be able to use dec / jnz for the loop. Or is the obscure "loop"
decrement and branch instruction useful?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


Re: div_qr_1 interface

2013-10-22 Thread Torbjorn Granlund
I played more with the code, now trying to break the add-adc-sbb-cmov
chain, for the benefit of most Intel processors.

But I lack unit testing code for the function, making hacking quite
cumbersome.  I don't feel safe hacking *any* GMP assembly code without
tests/devel/try.c's function and access checks.

The changes I wanted to try was:

(1) Shorten a dep chain, and avoid keeping CF live over an inc
instruction.  The cmov doesn't really depend on sbb, since the
latter insn never really changes carry.  (This might btw be useful
to teach loopmixer!)

(2) Reallocate Q2 to an "old" register (not r8-r15) and then use the
32-bit form of "adc $0,reg".  That form is shorter.

(3) Offset UP to avoid the offset in the loop.  That form has longer load
latency for some Intel CPUs.  Also try non-indexed form for QP and UP.

-- 
Torbjörn


Re: div_qr_1 interface

2013-10-22 Thread Niels Möller
Torbjorn Granlund  writes:

> It turned out the code was a bit slower on k8.
>
> This patch changes that.  With it applied, things takes 11 c/l on both
> pipelines.  This is also a 2 c/l improvement for piledriver.

Cool. 

> I have not tested that this is correct.  If you like the patch, please
> consider putting the result in the k8 subdir.

I'll try to get that done reasonably soon.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


Re: div_qr_1 interface

2013-10-22 Thread Torbjorn Granlund
It turned out the code was a bit slower on k8.

This patch changes that.  With it applied, things takes 11 c/l on both
pipelines.  This is also a 2 c/l improvement for piledriver.

I have not tested that this is correct.  If you like the patch, please
consider putting the result in the k8 subdir.

diff -r e9a5ec7f4003 mpn/x86_64/k10/div_qr_1n_pi1.asm
--- a/mpn/x86_64/k10/div_qr_1n_pi1.asm  Tue Oct 22 10:16:16 2013 +0200
+++ b/mpn/x86_64/k10/div_qr_1n_pi1.asm  Tue Oct 22 14:19:48 2013 +0200
@@ -117,15 +117,16 @@
dec UN
mov U1, %rax
jz  L(final)
+   mov $0, R32(Q1)

ALIGN(16)
 
-   C Loop is 28 instructions, 30 decoder slots, should run in 10 cycles.
-   C At entry, %rax holds an extra copy of U1, and carry holds an extra 
copy of U2.
+   C Loop is 28 instructions, 30 K8/K10 decoder slots, should run in 10
+   C cycles.  At entry, %rax holds an extra copy of U1, and carry holds
+   C an extra copy of U2.
 L(loop):
C {Q2, Q1, Q0} <-- DINV * U1 + B (Q0 + U2 DINV) + B^2 U2
C Remains to add in B (U1 + c)
-   mov $0, Q1
cmovc   DINV, Q1
mov U2, Q2
neg Q2
@@ -147,13 +148,14 @@
C {QP+UN, ...} <-- {QP+UN, ...} + {Q2, Q1} + U1 + c
adc U1, Q1
mov -8(UP, UN, 8), U0
-   adc Q2,8(QP, UN, 8)
+   adc Q2, 8(QP, UN, 8)
jc  L(q_incr)
 L(q_incr_done):
add %rax, U0
mov T, %rax
adc %rdx, %rax
mov Q1, (QP, UN, 8)
+   mov $0, R32(Q1)
sbb U2, U2
dec UN
mov %rax, U1 

-- 
Torbjörn


Re: div_qr_1 interface

2013-10-22 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  Torbjorn Granlund  writes:
  
  > * The code is no win for AMD k10/k8 (although close to 10 c/l might well be
  >   possible)
  
  I tried replacing one masking op by cmov, as you suggested. We then get
  down to 11.25 c/l on K10. I put this modified version in the k10
  subdirectory, since it was a significant slowdown on some other
  processors.
  
Nice speedup!  It is not too far from decoder saturated now, I presume.

I think the right place for the file is the k8 subdir, not the k10
subdir.  Their pipelines are almost identical, so the k10 subdir are
used just for code which uses instructions not available on k10.

  Next thing to try is to delay the Q1 store, but that's a bit more work.
  After that, I guess I should try the loop mixer.
  
I think k8-k10 are losing importance since they haven't been made for
several years.  AMD bulldozer/piledriver are not terribly important GMP
targets either, since they have a hopelessly slow integer multiply unit.

The most important targets are sandybridge/ivybridge (similar pipelines)
and haswell.  Less important are nehalem/westmere (very similar
pipelines).  Conroe and the other core2 processors are not important,
except for your laptop.  :-)

I think haswell code could be made a few cycles faster by using the mulx
instruction.  That will avoid the copying forth and back of rax.

-- 
Torbjörn


Re: div_qr_1 interface

2013-10-22 Thread Niels Möller
Torbjorn Granlund  writes:

> * The code is no win for AMD k10/k8 (although close to 10 c/l might well be
>   possible)

I tried replacing one masking op by cmov, as you suggested. We then get
down to 11.25 c/l on K10. I put this modified version in the k10
subdirectory, since it was a significant slowdown on some other
processors.

Next thing to try is to delay the Q1 store, but that's a bit more work.
After that, I guess I should try the loop mixer.

I benchmarked the code on the k8, k10, core2, sandybridge, nehalem and
nano machines. I couldn't log in to haswell and piledriver.

/Niels


-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


Re: div_qr_1 interface

2013-10-21 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  The problem is the final use, where Q2 is added, with carry, to a
  different register. It's tempting to replace
  
adc Q1I, Q2
  
  with
  
sbb Q2, Q1I
  
  and negated Q2, but I'm afraid that will get the sense of the carry
  wrong. Do you see any trick to get that right without negating Q2
  somewhere along the way?
  
Well, no.

  > It might also be possible to replace the early loop "and" stuff by
  > cmov.
  
  Maybe, but the simple way to do conditional addition with lea + cmov
  won't to, since we also need carry out.
  
  Does it matter if we do
  
mov B2, r
  and   mask, r
  
  or
  
mov $0, r
  cmovc B2, r
  
  ?
  
The latter tends to be faster on AMD CPUs.  Not sure about Intel.

-- 
Torbjörn


Re: div_qr_1 interface

2013-10-21 Thread Niels Möller
Torbjorn Granlund  writes:

> I looked at the logic following this:
>
> sbb U2, U2  C 7 13
>
> You negate the U2 copy in Q2.  It seems that replacing three adc by sbb
> could avoid the neg.

The problem is the final use, where Q2 is added, with carry, to a
different register. It's tempting to replace

adc Q1I, Q2

with

sbb Q2, Q1I

and negated Q2, but I'm afraid that will get the sense of the carry
wrong. Do you see any trick to get that right without negating Q2
somewhere along the way?

> It might also be possible to replace the early loop "and" stuff by
> cmov.

Maybe, but the simple way to do conditional addition with lea + cmov
won't do, since we also need carry out.

Does it matter if we do

mov B2, r
and mask, r

or

mov $0, r
cmovc   B2, r

?
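Functionally the two sequences agree; here is a small C model of both idioms (illustrative only, with mask = 0 - carry as left by sbb):

```c
#include <stdint.h>

/* mov B2, r ; and mask, r  -- mask is 0 or all-ones */
static uint64_t
select_with_and (uint64_t b2, uint64_t mask)
{
  return b2 & mask;
}

/* mov $0, r ; cmovc B2, r  -- carry is 0 or 1 */
static uint64_t
select_with_cmov (uint64_t b2, unsigned carry)
{
  return carry ? b2 : 0;
}
```

Since the results are identical for all inputs, the choice only affects scheduling: the and-form depends on materializing the mask, while the cmov-form consumes the carry flag directly.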

> To optimise register usage, I sometimes annotate the code with live
> ranges for each register.  That will help with register coalescing.

There are lots of possibilities, since the computations for Q and U are
mostly independent. The data flow is something like

  load U limb
   |
  _V_
  U2, U1I, U0 -> |___| -> U2, U1O, U0 
   \   |__/ cy
   _V__V___V_
Q1I, Q0-> |__|  -> Q1O, Q0
\
 V
   store Q limb

> (T is rather short-lived, perhaps its register could serve two usages?)

It could perhaps be eliminated.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


Re: div_qr_1 interface

2013-10-21 Thread Torbjorn Granlund
I looked at the logic following this:

sbb U2, U2  C 7 13

You negate the U2 copy in Q2.  It seems that replacing three adc by sbb
could avoid the neg.

It might also be possible to replace the early loop "and" stuff by cmov.
Note that the carry flag survives dec, although that causes a pipeline
replay on older Intel chips.  (IIRC, only sandybridge, ivybridge,
haswell run that well.)

  But one variable must be moved out of the registers. Maybe B2md (used
  once) is the best candidate. Then
  
lea (U0, B2md), U1O
  
  would have to be replaced by
  
mov (%rsp), U1O C Can be done very early
  ...
  add   U0, U1O
  
  We then have 26 instructions + loop overhead, or 54 instructions for 2
  iterations. Or possibly DINV, if one thinks the quotient logic is less
  critical.
  
Reading from a stack slot costs nothing under ideal circumstances.

To optimise register usage, I sometimes annotate the code with live
ranges for each register.  That will help with register coalescing.
(T is rather short-lived, perhaps its register could serve two usages?)

-- 
Torbjörn


Re: div_qr_1 interface

2013-10-21 Thread Niels Möller
Torbjorn Granlund  writes:

> On Intel chips, op-to-mem is expensive.  Even op-from-memory is often
> slower than load+op.  (I understand the register shortage problem.)

The following (untested) variant needs one register too many.

  UP, QP, UN:   Load, store, loop counter.
  DINV, B2, B2md:   Loop invariant constants.
  U2, U1I, U0, Q1I, Q0: Inputs.
  U1O, Q1O: Outputs.
  Q2, %rax, %rdx:   Locals.

Also U1I -> U1O recurrence chain (with opteron cycle counts)

mov U2, Q2
mov U2, Q1O
neg Q2
mov DINV, %rax
and DINV, Q1O
mul U1I
add Q0, Q1O
adc $0, Q2
mov %rax, Q0
add %rdx, Q1O
adc $0, Q2

mov B2, %rax
and B2, U2
mul U1I C 0 6
lea (U0, B2md), U1O
add U0, U2
cmovnc  U2, U1O
adc U1I, Q1O
adc Q1I, Q2
mov Q2, (QP, UN, 8)
jc  L(incr)

L(incr_done):
mov (UP, UN, 8), U0
add %rax, U0C 4
adc %rdx, U1O   C 5
sbb U2, U2  C 6

25 instructions (27 K10 decoder slots) excluding loop overhead.

But one variable must be moved out of the registers. Maybe B2md (used
once) is the best candidate. Then

lea (U0, B2md), U1O

would have to be replaced by

mov (%rsp), U1O C Can be done very early
...
add U0, U1O

We then have 26 instructions + loop overhead, or 54 instructions for 2
iterations. Or possibly DINV, if one thinks the quotient logic is less
critical.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


Re: div_qr_1 interface

2013-10-21 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  Will try that. I think one could also try to delay the quotient store
  one iteration, keeping "Q1" in a register until the next iteration. Then
  one gets rid of the
  
adc Q2,8(QP, UN, 8)
  
  in the loop, using only a single store per iteration in the likely case.
  May need yet another register, though.
  
On Intel chips, op-to-mem is expensive.  Even op-from-memory is often
slower than load+op.  (I understand the register shortage problem.)

  > I suspect one or two of the register-to-register copy insns could be
  > optimised out.
  
  Maybe. And it would be easier to avoid moves if one unrolls the loop
  twice, switching roles U0<->U1 and Q0<->Q1. But that makes it a bit more
  bloated, of course.
  
It might be worth it, since this is an important operation.

  > In order to run this through the loopmixer, you need to set up data in
  > the prologue which makes the adjustment branch never be taken.
  > Letting the inverse be 0 or else B-1 might work...
  
  I vaguely recall some previous attempt at loopmixing this, but I don't
  remember any success.

Let's take a look at current performance on all amd64 CPUs except nocona
(=pentium4).  I compare the pi variants here.

Conclusions:

* The code is no win for AMD k10/k8 (although close to 10 c/l might well be
  possible)

* The code is a big win for AMD bulldozer and also for piledriver
  
* The code is a big win for Intel core2 (alias conroe)

* The code is a cycle slower for Intel sandybridge

* The code is a cycle faster on Intel nehalem, ivybridge, haswell

* The code is a big win for VIA nano

In ~tege/GMP/newdiv/div_1n_pi2-x86_64.asm I claim 9.75 c/l (and that 7
c/l is possible) for k10/k8, 16 c/l for core2, and 13.25 c/l for
nehalem.  Of course, the precomputation cost there is much higher.

 k10 
overhead 6.00 cycles, precision 100 units of 3.12e-10 secs, CPU freq 3200.35 MHz
       mpn_div_qr_1n_pi1.0xcafebabedeadbeef  mpn_preinv_divrem_1.0xcafebabedeadbeef
1    #12.0018   26.0043
2     19.5030  #19.5024
3     17.6695  #17.3362
5     15.8019  #15.6019
8     14.7518  #14.6267
13   #14.0895   15.0901
22   #14.1463   14.2366
37   #13.6849   13.7393
62   #13.4139   13.4445
105  #13.2498   13.2589
178  #13.1524   13.1632
302  #13.0952   13.1011
513  #13.0607   13.0642
872  #13.0302   13.0325
 bulldozer 
overhead 6.00 cycles, precision 100 units of 2.77e-10 secs, CPU freq 3612.09 MHz
       mpn_div_qr_1n_pi1.0xcafebabedeadbeef  mpn_preinv_divrem_1.0xcafebabedeadbeef
1    #13.9118   30.8628
2    #22.5047   24.0040
3     20.6703  #20.3110
5    #18.0036   20.0033
8    #17.2535   19.2532
13   #16.7725   19.8804
22   #17.0943   20.5489
37   #16.6519   20.4899
62   #16.3905   20.2277
105  #16.2322   20.1748
178  #16.1383   20.0710
302  #16.0895   20.0499
513  #16.0513   20.0218
872  #16.0337   20.0186
 piledriver 
overhead 6.00 cycles, precision 100 units of 7.14e-10 secs, CPU freq 4000.00 MHz
       mpn_div_qr_1n_pi1.0xcafebabedeadbeef  mpn_preinv_divrem_1.0xcafebabedeadbeef
1    #13.4460   27.7072
2    #21.2536   22.0034
3    #19.1284   20.6698
5    #17.6283   19.6027
8    #16.8365   19.1819
13   #16.7634   19.3874
22   #16.5480   19.1393
37   #16.5433   18.9761
62   #16.3419   18.8095
105  #16.2121   18.6698
178  #16.0991   18.6101
302  #16.0503   18.5661
513  #16.2965   18.5379
872  #16.3580   19.0568

 core2 
overhead 6.01 cycles, precision 100 units of 4.69e-10 secs, CPU freq 2132.93 MHz
       mpn_div_qr_1n_pi1.0xcafebabedeadbeef  mpn_preinv_divrem_1.0xcafebabedeadbeef
1    #15.7048   28.7024
2    #26.5272   26.5408
3    #24.7981   25.2783
5    #21.9089   24.9270
8    #20.9994   24.4683
13   #20.4778   24.1549
22   #20.0956   23.8461
37   #19.7079   23.8088
62   #19.6855   23.8366
105  #19.5935   23.9688
178  #19.3434   23.8856
302  #19.3213   23.8421
513  #19.4093   23.8145
872  #19.3424   23.8016
 nehalem 
overhead 6.00 cycles, precision 100 units of 3.75e-10 secs, CPU freq 2670.00 MHz
       mpn_div_qr_1n_pi1.0xcafebabedeadbeef  mpn_preinv_divrem_1.0xcafebabedeadbeef
1    #12.1014   24.6814
2    #21.0024   21.8684
3     20.9170  #20.5440
5    #19.6475   20.3452
8   

Re: div_qr_1 interface

2013-10-20 Thread Niels Möller
Torbjorn Granlund  writes:

> Have you analysed the register needs?  Pushing all callee-saves
> registers is quite expensive.

Per the FIXME-comment, we could avoid saving them for the n == 2 case
(which I think corresponds to n == 3 for the mpn_div_qr_1
caller, so it might help that regression), but we do need a lot of
registers for the actual loop.

> For the mul insn, it is usually better to copy the invariant/noncritical
> operand to rax, and use the critical operand explicitly in the mul insn.

Will try that. I think one could also try to delay the quotient store
one iteration, keeping "Q1" in a register until the next iteration. Then
one gets rid of the

adc Q2,8(QP, UN, 8)

in the loop, using only a single store per iteration in the likely case.
May need yet another register, though.

> I suspect one or two of the register-to-register copy insns could be
> optimised out.

Maybe. And it would be easier to avoid moves if one unrolls the loop
twice, switching roles U0<->U1 and Q0<->Q1. But that makes it a bit more
bloated, of course.

> In order to run this through the loopmixer, you need to setup data in
> the prologue which makes the adjustment branch to never be taken.
> Letting the inverse be 0 or else B-1 might work...

I vaguely recall some previous attempt at loopmixing this, but I don't
remember any success.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


Re: div_qr_1 interface

2013-10-20 Thread Torbjorn Granlund
I took a brief look at the loop of the new assembly code.

Have you analysed the register needs?  Pushing all callee-saves
registers is quite expensive.

For the mul insn, it is usually better to copy the invariant/noncritical
operand to rax, and use the critical operand explicitly in the mul insn.

I suspect one or two of the register-to-register copy insns could be
optimised out.

In order to run this through the loopmixer, you need to setup data in
the prologue which makes the adjustment branch to never be taken.
Letting the inverse be 0 or else B-1 might work...

-- 
Torbjörn


Re: div_qr_1 interface

2013-10-20 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  On my core2 laptop:
  
  $ ./speed -s 2-10,100,500 -C mpn_divrem_1.0x mpn_div_qr_1.0x
  overhead 6.13 cycles, precision 1 units of 8.33e-10 secs, CPU freq 1200.00 MHz
        mpn_divrem_1.0x mpn_div_qr_1.0x
  2     60.6420  #39.9427
  3    #40.9839   55.0469
  4    #43.7667   44.4534
  5     44.6333  #38.9055
  6     39.6259  #34.4167
  7     34.0063  #32.4018
  8     30.1364  #28.5745
  9     29.6472  #27.4599
  10    29.1270  #26.7300
  100   24.7920  #20.6700
  500   24.4400  #19.7600
  
  So here it's a clear win, except an ugly regression for n = 3.
  
You might want to pass -p100 or something, to avoid startup noise.

  On shell, the same command gives:
  
  2    #37.4379   51.1157
  3    #30.0256   61.0904
  4    #25.8058   27.0781
  5    #23.2717   24.2831
  6    #21.7520   22.4346
  7    #20.5219   21.
  8    #19.4783   20.1101
  9    #18.7726   19.3369
  10   #18.3271   18.7228
  100  #13.8063   13.8175
  500  #13.2670   13.2750
  
  So here the new code is epsilon slower for the larger sizes. Maybe the
  loopmixer can help.
  
The old code runs optimally, given its instructions.
What is the latency critical path of the new code?

The performance for n=3 is poor for both processors.  Do you understand
the reason?

-- 
Torbjörn


Re: div_qr_1 interface

2013-10-20 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> I'll try to get the x86_64 assembly for mpn_div_qr_1n_pi1 in soon.

Pushed first working version now, see
http://gmplib.org:8000/gmp/file/tip/mpn/x86_64/div_qr_1n_pi1.asm

On my core2 laptop:

$ ./speed -s 2-10,100,500 -C mpn_divrem_1.0x mpn_div_qr_1.0x
overhead 6.13 cycles, precision 1 units of 8.33e-10 secs, CPU freq 1200.00 MHz
      mpn_divrem_1.0x mpn_div_qr_1.0x
2     60.6420  #39.9427
3    #40.9839   55.0469
4    #43.7667   44.4534
5     44.6333  #38.9055
6     39.6259  #34.4167
7     34.0063  #32.4018
8     30.1364  #28.5745
9     29.6472  #27.4599
10    29.1270  #26.7300
100   24.7920  #20.6700
500   24.4400  #19.7600

So here it's a clear win, except an ugly regression for n = 3.

On shell, the same command gives:

2    #37.4379   51.1157
3    #30.0256   61.0904
4    #25.8058   27.0781
5    #23.2717   24.2831
6    #21.7520   22.4346
7    #20.5219   21.
8    #19.4783   20.1101
9    #18.7726   19.3369
10   #18.3271   18.7228
100  #13.8063   13.8175
500  #13.2670   13.2750

So here the new code is epsilon slower for the larger sizes. Maybe the
loopmixer can help.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


Re: div_qr_1 interface

2013-10-20 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  > I'm about to push the first step, with C implementations of mpn_div_qr_1
  > and mpn_div_qr_1n_pi1.
  
  Done now, including some tuning code. It would be interesting to have
  
DIV_QR_1N_PI1_METHOD
DIV_QR_1_NORM_THRESHOLD
DIV_QR_1_UNNORM_THRESHOLD
  
  added to the tuning reports on the web.
  
Done.  First results are appearing now.  (The order might need cleanup,
patch ~tege/prec/gmpgenttab.sh.)

  I'll try to get the x86_64 assembly for mpn_div_qr_1n_pi1 in soon.
  I think my plan is to do things in this order:
  
  1. Get div_qr_1 working efficiently.
  
  2. Get divf_qr_1 (fraction) working efficiently.
  
  3. Let divrem_1.c use the new code.
  
  4. Check for regressions, and delete divrem_1.asm for one platform at a
 time.
  
  but this will most likely take a long time to complete.
  
The fraction code might be of limited importance.  IIRC, it is only used
for radix conversion basecase, and Paul Z recently showed that our
basecase code there is quite poor.

The division project really needs work, and I really appreciate your
work on it!

I have some code in ~tege/GMP/newdiv which should be adapted and
committed:

-rw-r--r--  1 tege  tege   605 Jun 19  2007 /home/tege/GMP/newdiv/pre_mod_1-alpha.asm
-rw-rw-r--  1 tege  tege  1245 Jun 20  2007 /home/tege/GMP/newdiv/pre_mod_1-k8.asm
-rw-rw-r--  1 tege  tege  2147 Jun 25  2007 /home/tege/GMP/newdiv/divrem_1-p4.asm
-rw-rw-r--  1 tege  tege  6590 Jul 12  2007 /home/tege/GMP/newdiv/divrem_1-alpha.asm
-rw-r--r--  1 tege  tege  2989 Jul 19  2007 /home/tege/GMP/newdiv/div_1_pi2-alpha.asm
-rw-rw-r--  1 tege  tege  6775 Oct 18  2008 /home/tege/GMP/newdiv/divrem_1-k7.asm
-rw-r--r--  1 tege  tege  3397 Jun  9  2009 /home/tege/GMP/newdiv/divrem_1-arm.asm
-rw-rw-r--  1 tege  tege  6195 Mar 18  2010 /home/tege/GMP/newdiv/divrem_1-ppc32.asm
-rw-rw-r--  1 tege  tege  2036 Jun 18  2010 /home/tege/GMP/newdiv/bdiv_q_1_pi2-k8.asm
-rw-rw-r--  1 tege  tege  2632 Jun 24  2010 /home/tege/GMP/newdiv/div_1n_pi2-x86_64.asm
-rw-rw-r--  1 tege  tege  5852 Jun 26  2010 /home/tege/GMP/newdiv/div_1u_pi2-x86_64-shld.asm
-rw-rw-r--  1 tege  tege   187 Jul  1  2010 /home/tege/GMP/newdiv/div_1u_pi2-x86_64.asm
-rw-rw-r--  1 tege  tege  3210 Jul  1  2010 /home/tege/GMP/newdiv/div_2n_pi2-x86_64.asm
-rw-rw-r--  1 tege  tege  9766 Jan 19  2011 /home/tege/GMP/newdiv/divrem_1-ia64.asm
-rw-r--r--  1 tege  tege  2933 Jan 23  2011 /home/tege/GMP/newdiv/div_1_pi2-ppc64.asm

-- 
Torbjörn


Re: div_qr_1 interface

2013-10-20 Thread Niels Möller
Torbjorn Granlund  writes:

> Which tail call?

In the normalized case, the checked in mpn_div_qr_1 does something like

  *qh = ...;
  ...
  return mpn_div_qr_1n_pi1(...);

Which is a nice tail call. With the struct-returning version one
gets instead

  res.qh = ...;
  ...
  res.r = mpn_div_qr_1n_pi1(...);
  return res;

which I don't think the compiler has any chance of turning into a tail
call.

> Should we unconfuse ourselves and users about what type of inverse is to
> be passed?  The "pi1" moniker might be replaced.  How many one-limb
> inverse types do we currently consider?

It would make some sense to me to do like the mod_1_* functions, and
have a corresponding _cps function precomputing anything needed. The ABI
then says nothing about the details, only that a maximum of 4 (say)
precomputed limbs will be used.

For the fancy div_qr_1 algorithm, what's needed is the same old
reciprocal dinv = (B^2-1)/d - B, and B^2 - d*(B+dinv), same as for
mod_1_1. And shiftcount, in the unnormalized case.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


Re: div_qr_1 interface

2013-10-20 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> I'm about to push the first step, with C implementations of mpn_div_qr_1
> and mpn_div_qr_1n_pi1.

Done now, including some tuning code. It would be interesting to have

  DIV_QR_1N_PI1_METHOD
  DIV_QR_1_NORM_THRESHOLD
  DIV_QR_1_UNNORM_THRESHOLD

added to the tuning reports on the web.

I'll try to get the x86_64 assembly for mpn_div_qr_1n_pi1 in soon.
I think my plan is to do things in this order:

1. Get div_qr_1 working efficiently.

2. Get divf_qr_1 (fraction) working efficiently.

3. Let divrem_1.c use the new code.

4. Check for regressions, and delete divrem_1.asm for one platform at a
   time.

but this will most likely take a long time to complete.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


Re: div_qr_1 interface

2013-10-20 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  Torbjorn Granlund  writes:
  
  > I think x86-64, x86-32, arm32, arm64, powerpc-64, sparc-64 matter.
  > Unfortunately, powerpc-64 (and -32) return these types onto the stack
  > via an implicit pointer.
  
  Ok, I think I'll stick to using a separate pointer for qh then.

I did not mean to argue against returning a structure.  Actually, it is
more efficiently returned than I thought...

  Storing qh early may also have the advantage of allowing more tail
  calls.

Which tail call?

  I'm about to push the first step, with C implementations of mpn_div_qr_1
  and mpn_div_qr_1n_pi1.
  
Should we unconfuse ourselves and users about what type of inverse is to
be passed?  The "pi1" moniker might be replaced.  How many one-limb
inverse types do we currently consider?

I don't think fully descriptive names would be viable, since they would
be too long.  For plain 1/b mod \beta^k, perhaps use _bik, where k is a
digit.

For the various Euclidean inverses, _iak, _ibk, _ick is one alternative,
for "type a", "type b", etc.  Or some other more clever letters.

  I was a bit confused here. mpn_div_qr_1 is intended as a public
  function.
  
Indeed.  See http://gmplib.org/devel/ for my thinking about division
functions, from a few years back...

-- 
Torbjörn


Re: div_qr_1 interface

2013-10-19 Thread Niels Möller
Torbjorn Granlund  writes:

> I think x86-64, x86-32, arm32, arm64, powerpc-64, sparc-64 matter.
> Unfortunately, powerpc-64 (and -32) return these types onto the stack
> via an implicit pointer.

Ok, I think I'll stick to using a separate pointer for qh then. Storing qh
early may also have the advantage of allowing more tail calls.

I'm about to push the first step, with C implementations of mpn_div_qr_1
and mpn_div_qr_1n_pi1.

The C code implements two variants of mpn_div_qr_1n_pi1: First is a
plain loop around udiv_qrnnd_preinv, second uses the new method.

We'd need support in tuneup to select the best variant. And also support
for tuning the DIV_QR_1_ thresholds.

I wrote:

>   In principle, since these are internal functions,

I was a bit confused here. mpn_div_qr_1 is intended as a public
function.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


Re: div_qr_1 interface

2013-10-18 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  ni...@lysator.liu.se (Niels Möller) writes:
  
  (about using a small struct as return value)
  
  > If the caller is going to store the returned value directly in memory
  > anyway, there's little difference. And if the caller is going to operate
  > on the return value, and needs it in a register, I think struct return
  > should be epsilon more efficient.
  
  Things are different on 32-bit x86; there the struct seems to be
  returned via a pointer passed by the caller (on the stack, like all
  arguments). So the crude measure of the number of instructions for the
  example divrem function grows from 13 to 18 when using a struct return
  value.
  
I think you might be mistaken.  Perhaps you're using a structure of two
long long fields?

This example,

struct retme  { unsigned long int a, b; };

struct retme
foo (struct retme x)
{
  struct retme y;
  y.a = x.a + 17;
  y.b = x.b + 4711;
  return y;
}

is compiled by gcc 4.7.2 into,

 movl 8(%esp), %edx
 addl $4711, %edx
 movl 4(%esp), %eax
 addl $17, %eax
 ret

indicating that (like any argument) the struct is passed on the stack,
but it is returned in a register pair.

  How much do we care about 32-bit x86 these days? What other ABIs out
  there can't return a small (two word-sized elements) struct in
  registers?
  
I think x86-64, x86-32, arm32, arm64, powerpc-64, sparc-64 matter.
Unfortunately, powerpc-64 (and -32) return these types onto the stack
via an implicit pointer.  I think all the other listed get it right.  (I
was involved in the powerpc-64 ELF ABI; I will slap my fingers...)

  In principle, since these are internal functions, I guess one could use
  
    struct { mp_limb_t r; mp_limb_t qh; } mpn_div_qr_1n_pi1 (...)
  
  on some systems and
  
mp_limb_t mpn_div_qr_1n_pi1 (..., mp_limb_t *qhp)
  
  on others, depending on which style is most efficient. But that seems
  messy.
  
That's conceivable, and perhaps one could reduce the pain by using some
clever macros.

-- 
Torbjörn


Re: div_qr_1 interface

2013-10-18 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

(about using a small struct as return value)

> If the caller is going to store the returned value directly in memory
> anyway, there's little difference. And if the caller is going to operate
> on the return value, and needs it in a register, I think struct return
> should be epsilon more efficient.

Things are different on 32-bit x86; there the struct seems to be
returned via a pointer passed by the caller (on the stack, like all
arguments). So the crude measure of the number of instructions for the
example divrem function grows from 13 to 18 when using a struct return
value.

How much do we care about 32-bit x86 these days? What other ABIs out
there can't return a small (two word-sized elements) struct in
registers?

In principle, since these are internal functions, I guess one could use

  struct { mp_limb_t r; mp_limb_t qh; } mpn_div_qr_1n_pi1 (...)

on some systems and

  mp_limb_t mpn_div_qr_1n_pi1 (..., mp_limb_t *qhp)

on others, depending on which style is most efficient. But that seems
messy.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


Re: div_qr_1 interface

2013-10-17 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> Consider the test compilation unit
>
>   typedef struct {
> unsigned long q; unsigned long r;
>   } qr_t;
>   
>   qr_t divrem (unsigned long u, unsigned long d)
>   {
> qr_t res;
> res.q = u/d;
> res.r = u - res.q*d;
> return res;
>   }
>
> On x86_64 (and gnu/linux), gcc -c -O compiles this to
>
>    0:   48 89 f8       mov    %rdi,%rax
>    3:   31 d2          xor    %edx,%edx
>    5:   48 f7 f6       div    %rsi
>    8:   48 89 fa       mov    %rdi,%rdx
>    b:   48 0f af f0    imul   %rax,%rsi
>    f:   48 29 f2       sub    %rsi,%rdx
>   12:   c3             retq

And I guess it is relevant to compare this to the same function,
reorganized to return the second value via a pointer:

  unsigned long divrem (unsigned long u, unsigned long d,
unsigned long *qp)
  {
unsigned long q;
q = u/d;
*qp = q;
  
return u - q * d;
  }

which is compiled to

   0:   48 89 d1       mov    %rdx,%rcx
   3:   48 89 f8       mov    %rdi,%rax
   6:   31 d2          xor    %edx,%edx
   8:   48 f7 f6       div    %rsi
   b:   48 0f af f0    imul   %rax,%rsi
   f:   48 89 01       mov    %rax,(%rcx)
  12:   48 89 f8       mov    %rdi,%rax
  15:   48 29 f0       sub    %rsi,%rax
  18:   c3             retq

If the caller is going to store the returned value directly in memory
anyway, there's little difference. And if the caller is going to operate
on the return value, and needs it in a register, I think struct return
should be epsilon more efficient.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


Re: div_qr_1 interface

2013-10-17 Thread Niels Möller
Torbjorn Granlund  writes:

> I am not too enthusiastic about struct return types for critical
> functions.  I expect this to be returned via a stack slot everywhere or
> almost everywhere.

As far as I understand, the most common ABIs for x86_64 and ARM (which
is pretty close to "almost everywhere"...) both return structs of this form
in two registers: %rax/%rdx, and %r0/%r1.

Consider the test compilation unit

  typedef struct {
unsigned long q; unsigned long r;
  } qr_t;
  
  qr_t divrem (unsigned long u, unsigned long d)
  {
qr_t res;
res.q = u/d;
res.r = u - res.q*d;
return res;
  }

On x86_64 (and gnu/linux), gcc -c -O compiles this to

   0:   48 89 f8       mov    %rdi,%rax
   3:   31 d2          xor    %edx,%edx
   5:   48 f7 f6       div    %rsi
   8:   48 89 fa       mov    %rdi,%rdx
   b:   48 0f af f0    imul   %rax,%rsi
   f:   48 29 f2       sub    %rsi,%rdx
  12:   c3             retq

Both inputs and outputs are passed in registers. The return value is the
only thing stored on the stack.

> I recall to have seen some code for that.  How fast does it run
> currently on the various CPUs?

Don't know yet.

> Code comment:
>
> I think we cannot afford to do a separate lshift of the dividend operand
> when the divisor is just a few limbs.  We need to do the shifting on-the-
> fly, however irksome that might be.  An mpn_div_qr_1u_pi1 is called for.

I think we'll definitely want mpn_div_qr_1u_pi1 for the most common
platforms. I was thinking, that maybe we could let it be an optional
function, with no C implementation, and resort to a separate mpn_lshift
if the function is missing.

But if needed, it's no big deal to extract a C mpn_div_qr_1u_pi1 from
divrem_1.c, with on-the-fly shifting. 

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.


Re: div_qr_1 interface

2013-10-17 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  ni...@lysator.liu.se (Niels Möller) writes:
  
  > To get going, I've written C implementations of mpn_div_qr_1n_pi1 and
  > mpn_divf_qr1n_pi1, and made divrem_1 call them.
  
  Below, also an mpn_div_qr_1, using these primitives (and with some
  inspiration from divrem_1). For return value, I use the type
  
typedef struct { mp_limb_t r; mp_limb_t qh; } gmp_qr_1_t;
  
  The order here is a micro-optimization. I expect that on most ABIs, for
  the typical code fragment
  
gmp_qr_1_t res;
...
res.r = mpn_foo (...);
return res;
  
  the return value from mpn_foo will already be in the right register, and
  only qh needs to be moved from some callee-save register to the second
  return value register.
  
I am not too enthusiastic about struct return types for critical
functions.  I expect this to be returned via a stack slot everywhere or
almost everywhere.

But I agree that it is nice to avoid putting that extra quotient limb at
the main quotient operand.  Now, performance is more important than
cuteness...

  I also found some old x86_64 assembly code for the new div_qr_1
  algorithm. With perfect scheduling, it could run at 10 cycles on opteron
  (the main cpu of interest at the time). Main loop is 28 instructions,
  two independent mul, and decoder limited.
  
I recall to have seen some code for that.  How fast does it run
currently on the various CPUs?

Code comment:

I think we cannot afford to do a separate lshift of the dividend operand
when the divisor is just a few limbs.  We need to to shifting on-the-
fly, however irksome that might be.  AN mpn_div_qr_1u_pi1 is called-for.

-- 
Torbjörn


Re: div_qr_1 interface

2013-10-17 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> To get going, I've written C implementations of mpn_div_qr_1n_pi1 and
> mpn_divf_qr1n_pi1, and made divrem_1 call them.

Below, also an mpn_div_qr_1, using these primitives (and with some
inspiration from divrem_1). For return value, I use the type

  typedef struct { mp_limb_t r; mp_limb_t qh; } gmp_qr_1_t;

The order here is a micro-optimization. I expect that on most ABIs, for
the typical code fragment

  gmp_qr_1_t res;
  ...
  res.r = mpn_foo (...);
  return res;

the return value from mpn_foo will already be in the right register, and
only qh needs to be moved from some callee-save register to the second
return value register.

Are there any problems with using a struct return values like here, in
terms of performance, portability, style...?

If you agree this makes sense, I'm considering checking in the div_qr_1*
functions fairly soon (maybe divrem_1 changes can wait until
div_qr_1*_pi1 has settled. The interface should allow for additional
precomputed values).

I also found some old x86_64 assembly code for the new div_qr_1
algorithm. With perfect scheduling, it could run at 10 cycles on opteron
(the main cpu of interest at the time). Main loop is 28 instructions,
two independent mul, and decoder limited.

Regards,
/Niels

/* mpn_div_qr_1 -- mpn by limb division.

   Contributed to the GNU project by Niels Möller and Torbjörn Granlund

Copyright 1991, 1993, 1994, 1996, 1998, 1999, 2000, 2002, 2003, 2013 Free Software
Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.  */

#include "gmp.h"
#include "gmp-impl.h"
#include "longlong.h"

#ifndef DIV_QR_1_NORM_THRESHOLD
#define DIV_QR_1_NORM_THRESHOLD 3
#endif
#ifndef DIV_QR_1_UNNORM_THRESHOLD
#define DIV_QR_1_UNNORM_THRESHOLD 3
#endif

#if GMP_NAIL_BITS > 0
#error Nail bits not supported
#endif

/* Divides {up, n} by d. Writes the n-1 low quotient limbs at {qp,
 * n-1}. Returns remainder and high quotient limb. */
gmp_qr_1_t
mpn_div_qr_1 (mp_ptr qp, mp_srcptr up, mp_size_t n, mp_limb_t d)
{
  gmp_qr_1_t res;

  unsigned cnt;
  mp_limb_t uh;

  ASSERT (n > 0);
  ASSERT (d > 0);

  if (d & GMP_NUMB_HIGHBIT)
    {
      /* Normalized case */
      mp_limb_t dinv, q;

      uh = up[--n];

      q = (uh >= d);
      res.qh = q;
      uh -= (-q) & d;

      if (BELOW_THRESHOLD (n, DIV_QR_1_NORM_THRESHOLD))
        {
          cnt = 0;
        plain:
          while (n > 0)
            {
              mp_limb_t ul = up[--n];
              udiv_qrnnd (qp[n], uh, uh, ul, d);
            }
          res.r = uh >> cnt;
          return res;
        }
      invert_limb (dinv, d);
      res.r = mpn_div_qr_1n_pi1 (qp, up, n, uh, d, dinv);
      return res;
    }
  else
    {
      /* Unnormalized case */
      mp_limb_t dinv, ul;

      if (! UDIV_NEEDS_NORMALIZATION
          && BELOW_THRESHOLD (n, DIV_QR_1_UNNORM_THRESHOLD))
        {
          uh = up[--n];
          udiv_qrnnd (res.qh, uh, 0, uh, d);
          cnt = 0;
          goto plain;
        }

      count_leading_zeros (cnt, d);
      d <<= cnt;

#if HAVE_NATIVE_div_qr_1u_pi1
      /* FIXME: Call loop doing on-the-fly normalization */
#endif

      /* Shift up front, use qp area for shifted copy. A bit messy,
         since we have only n-1 limbs available, and shift the high
         limb manually. */
      uh = up[--n];
      ul = (uh << cnt) | mpn_lshift (qp, up, n, cnt);
      uh >>= (GMP_LIMB_BITS - cnt);

      if (UDIV_NEEDS_NORMALIZATION
          && BELOW_THRESHOLD (n, DIV_QR_1_UNNORM_THRESHOLD))
        {
          udiv_qrnnd (res.qh, uh, uh, ul, d);
          up = qp;
          goto plain;
        }
      invert_limb (dinv, d);

      udiv_qrnnd_preinv (res.qh, uh, uh, ul, d, dinv);
      res.r = mpn_div_qr_1n_pi1 (qp, qp, n, uh, d, dinv) >> cnt;
      return res;
    }
}

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.