ni...@lysator.liu.se (Niels Möller) writes:

> I think it's promising, and also a bit simpler than the current div1
> code. Some concerns:
>
> 1. I think the approximate q means that we will often reduce from a > b
>    to a close to but larger than b, so we need two iterations to get to
>    a < b. Might improve progress to add some kind of adjustment step
>    after a - q*b, checking either if the result is >= b or < 0.
>
> 2. There are different options for the UNLIKELY branch. One could count
>    leading zeros of b, shift, and look up a quotient, and apply
>
>      a -= q * b << k;
>
>    for suitable q and k. Likely cheaper than a division, but we'll then
>    do multiple iterations if the size difference is large (which I
>    think is ok).
>
> 3. We get poor accuracy for q when b_hi == 1. One could make that less
>    likely by using a somewhat asymmetric table, say looking up 3 bits
>    of a but 6 bits of b.
>
> 4. One could also go in the other direction and do something closer to
>    left-to-right binary, and in each step always use q = 2^k for
>    suitable k, based on count_leading_zeros on both a and b.
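To make the last quoted point concrete, a single left-to-right binary
step could look roughly like this (untested sketch, not part of the
patch below; the function name and interface are made up):

  /* Untested sketch of the left-to-right binary idea (point 4): one
     reduction step using q = 2^k, with the largest k such that
     (b << k) <= a.  Assumes a > b > 0.  The matrix column update is
     the same as for a general quotient q. */
  static inline mp_limb_t
  binary_step (mp_limb_t a, mp_limb_t b,
               mp_limb_t u00, mp_limb_t *u01,
               mp_limb_t u10, mp_limb_t *u11)
  {
    int ca, cb, k;
    count_leading_zeros (ca, a);
    count_leading_zeros (cb, b);
    k = cb - ca;              /* aligns the top bit of b << k with a's */
    if ((b << k) > a)
      k--;                    /* now b << k <= a < b << (k + 1) */
    *u01 += u00 << k;         /* u01 += q * u00, with q = 2^k */
    *u11 += u10 << k;         /* u11 += q * u10 */
    return a - (b << k);      /* result < b << k, so a at least halves */
  }

Each such step at least halves a and needs no multiplication, at the
cost of more iterations when quotients are large.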
I've tried replacing the single-precision loop in hgcd2 with this loop,
and it does give a speedup of about 5% in the Lehmer range. Patch
below. Cycle numbers on my laptop, from

  speed -p 100000 -c -s 10-100 -f 1.1 mpn_gcd

Before:

        mpn_gcd
10      21505.73
11      23976.73
12      26443.39
13      28978.88
14      31538.00
15      33956.00
16      36556.81
17      39220.50
18      42087.00
19      45242.36
20      47446.70
22      53891.50
24      59186.14
26      64720.50
28      70047.80
30      75426.00
33      83215.33
36      92378.00
39      101442.50
42      110168.50
46      118957.00
50      132620.00
55      147323.00
60      166901.00
66      188441.00
72      208773.00
79      236515.00
86      262802.00
94      295412.00

After:

        mpn_gcd
10      20174.23
11      22459.27
12      24851.14
13      27303.50
14      29583.15
15      31994.11
16      34313.12
17      36977.57
18      39485.92
19      42317.91
20      44573.10
22      50640.62
24      55728.29
26      60987.33
28      66027.00
30      71474.00
33      79510.00
36      87607.00
39      97009.50
42      105033.50
46      116175.00
50      129649.00
55      141719.00
60      161759.00
66      179491.00
72      200928.00
79      226226.00
86      251570.00
94      283439.00

And this while leaving the more expensive double-limb loop unchanged.

But... I get even better numbers if I keep the old code and just
replace the div1 function with plain division, q = a / b:

        mpn_gcd
10      18757.85	# 15% speedup
11      20862.24
12      22963.11
13      25123.29
14      27388.40
15      29541.06
16      31587.19
17      34132.93
18      36209.33
19      39231.64
20      42615.30
22      47155.88
24      51235.71
26      56656.17
28      61496.60
30      65850.00
33      72661.33
36      81102.67
39      89463.00
42      97084.00
46      106561.00
50      117578.00
55      132398.00
60      149708.00
66      168662.00
72      188981.00
79      211562.00
86      236054.00
94      268411.00	# 10% speedup

And this on a laptop with an Intel U4100 (5 years old?), so I'd assume
it doesn't have a particularly fast div instruction.

Should we just delete div1? On which architectures can we expect it to
be beneficial? It should be fairly easy to find out, if we define an
HGCD_DIV1_METHOD known to tuneup, to select between plain division and
the div1 function. The case for div2 may be different.

Regards,
/Niels
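For concreteness, the plain-division variant of div1 mentioned above is
just this (sketch; the interface matches the div1 calls in the patch
below):

  /* Sketch: div1 reduced to plain division, for comparison with the
     table-based code.  Returns the quotient and stores the remainder
     through rp, like the existing div1. */
  static inline mp_limb_t
  div1 (mp_ptr rp, mp_limb_t n0, mp_limb_t d0)
  {
    mp_limb_t q = n0 / d0;
    *rp = n0 - q * d0;
    return q;
  }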
*** /tmp/extdiff.9wboZ5/gmp.c9a8cea1d06d/mpn/generic/hgcd2.c	2019-09-03 09:36:19.965241047 +0200
--- /home/nisse/hack/gmp/mpn/generic/hgcd2.c	2019-09-03 09:09:45.312275554 +0200
***************
*** 36,41 ****
  #include "longlong.h"
  
! #if GMP_NAIL_BITS == 0
  
  /* Copied from the old mpn/generic/gcdext.c, and modified slightly to
     return the remainder. */
--- 36,48 ----
  #include "longlong.h"
  
! #if GMP_NAIL_BITS != 0
! #error GMP_NAIL_BITS > 0 not supported
! #endif
! 
! #ifndef HGCD1_METHOD
! #define HGCD1_METHOD 2
! #endif
  
+ #if HGCD1_METHOD == 1
  /* Copied from the old mpn/generic/gcdext.c, and modified slightly to
     return the remainder. */
***************
*** 95,98 ****
--- 102,296 ----
  }
  
+ /* Single precision reduction, tail called from hgcd2. */
+ static inline int
+ hgcd1 (mp_limb_t a, mp_limb_t b,
+        mp_limb_t u00, mp_limb_t u01,
+        mp_limb_t u10, mp_limb_t u11,
+        struct hgcd_matrix1 *M)
+ {
+   /* Single precision loop */
+   if (a < b)
+     goto subtract_a;
+ 
+   for (;;)
+     {
+       ASSERT (a >= b);
+ 
+       a -= b;
+       if (a < (CNST_LIMB (1) << (GMP_LIMB_BITS / 2 + 1)))
+         break;
+ 
+       if (a <= b)
+         {
+           /* Use q = 1 */
+           u01 += u00;
+           u11 += u10;
+         }
+       else
+         {
+           mp_limb_t r;
+           mp_limb_t q = div1 (&r, a, b);
+           a = r;
+           if (a < (CNST_LIMB(1) << (GMP_LIMB_BITS / 2 + 1)))
+             {
+               /* A is too small, but q is correct. */
+               u01 += q * u00;
+               u11 += q * u10;
+               break;
+             }
+           q++;
+           u01 += q * u00;
+           u11 += q * u10;
+         }
+     subtract_a:
+       ASSERT (b >= a);
+ 
+       b -= a;
+       if (b < (CNST_LIMB (1) << (GMP_LIMB_BITS / 2 + 1)))
+         break;
+ 
+       if (b <= a)
+         {
+           /* Use q = 1 */
+           u00 += u01;
+           u10 += u11;
+         }
+       else
+         {
+           mp_limb_t r;
+           mp_limb_t q = div1 (&r, b, a);
+           b = r;
+           if (b < (CNST_LIMB(1) << (GMP_LIMB_BITS / 2 + 1)))
+             {
+               /* B is too small, but q is correct. */
+               u00 += q * u01;
+               u10 += q * u11;
+               break;
+             }
+           q++;
+           u00 += q * u01;
+           u10 += q * u11;
+         }
+     }
+ 
+   M->u[0][0] = u00; M->u[0][1] = u01;
+   M->u[1][0] = u10; M->u[1][1] = u11;
+ 
+   return 1;
+ }
+ 
+ #elif HGCD1_METHOD == 2
+ #define QTAB_BITS 4
+ #if QTAB_BITS == 3
+ /* 1,a1,a0.xxx / b2,b1,b0.xxx > 1,a1,a0 / b2,b1,b0 + 1
+  * Need to round b up, except when a_hi == b_hi, since we always
+  * have a > b.
+  */
+ unsigned char q_tab[7][4] = {
+   /*  a = 4, 5, 6, 7
+    / b */
+   { /* 1 */ 2, 2, 3, 3 },
+   { /* 2 */ 1, 1, 2, 2 },
+   { /* 3 */ 1, 1, 1, 1 },
+   { /* 4 */ 1, 1, 1, 1 },
+   { /* 5 */ 0, 1, 1, 1 },
+   { /* 6 */ 0, 0, 1, 1 },
+   { /* 7 */ 0, 0, 0, 1 },
+ };
+ #elif QTAB_BITS == 4
+ unsigned char q_tab[15][8] = {
+   /*  a = 8, 9,10,11,12,13,14,15
+    / b */
+   { /* 1 */ 4, 4, 5, 5, 6, 6, 7, 7 },
+   { /* 2 */ 2, 3, 3, 3, 4, 4, 4, 5 },
+   { /* 3 */ 2, 2, 2, 2, 3, 3, 3, 3 },
+   { /* 4 */ 1, 1, 2, 2, 2, 2, 2, 3 },
+   { /* 5 */ 1, 1, 1, 1, 2, 2, 2, 2 },
+   { /* 6 */ 1, 1, 1, 1, 1, 1, 2, 2 },
+   { /* 7 */ 1, 1, 1, 1, 1, 1, 1, 1 },
+   { /* 8 */ 1, 1, 1, 1, 1, 1, 1, 1 },
+   { /* 9 */ 0, 1, 1, 1, 1, 1, 1, 1 },
+   { /*10 */ 0, 0, 1, 1, 1, 1, 1, 1 },
+   { /*11 */ 0, 0, 0, 1, 1, 1, 1, 1 },
+   { /*12 */ 0, 0, 0, 0, 1, 1, 1, 1 },
+   { /*13 */ 0, 0, 0, 0, 0, 1, 1, 1 },
+   { /*14 */ 0, 0, 0, 0, 0, 0, 1, 1 },
+   { /*15 */ 0, 0, 0, 0, 0, 0, 0, 1 },
+ };
+ #else
+ #error unsupported table size
+ #endif
+ 
+ #define CND_SWAP_MASK(mask, x, y) do {		\
+     mp_limb_t __t = (x ^ y) & mask;		\
+     x ^= __t;					\
+     y ^= __t;					\
+   } while (0)
+ 
+ static inline int
+ hgcd1 (mp_limb_t a, mp_limb_t b,
+        mp_limb_t u00, mp_limb_t u01,
+        mp_limb_t u10, mp_limb_t u11,
+        struct hgcd_matrix1 *M)
+ {
+   int swapped = 0;
+   for (;;)
+     {
+       mp_limb_t a_hi;
+       mp_limb_t b_hi;
+       mp_limb_t q;
+       int c;
+ #if 1
+       if (a < b)
+         {
+           MP_LIMB_T_SWAP (a, b);
+           MP_LIMB_T_SWAP (u00, u01);
+           MP_LIMB_T_SWAP (u10, u11);
+           swapped ^= 1;
+         }
+ #else
+       mp_limb_t mask = - (a < b);
+       CND_SWAP_MASK (mask, a, b);
+       CND_SWAP_MASK (mask, u00, u01);
+       CND_SWAP_MASK (mask, u10, u11);
+       swapped ^= mask;
+ #endif
+       count_leading_zeros (c, a);
+       ASSERT (c + QTAB_BITS <= GMP_LIMB_BITS);
+       a_hi = a >> (GMP_LIMB_BITS - QTAB_BITS - c);
+       b_hi = (b >> (GMP_LIMB_BITS - QTAB_BITS - c));
+       if (UNLIKELY (b_hi == 0))
+         {
+           q = a / b;
+         }
+       else
+         q = q_tab[b_hi - 1][a_hi - (1U << (QTAB_BITS - 1))];
+ 
+       ASSERT (q > 0);
+       ASSERT (a >= q*b);
+       a -= q*b;
+       u01 += q*u00;
+       u11 += q*u10;
+       if (a < (CNST_LIMB(1) << (GMP_NUMB_BITS / 2 + 1)))
+         {
+           u01 -= u00;
+           u11 -= u10;
+           break;
+         }
+     }
+   if (swapped)
+     {
+       MP_LIMB_T_SWAP (u00, u01);
+       MP_LIMB_T_SWAP (u10, u11);
+     }
+   M->u[0][0] = u00;
+   M->u[0][1] = u01;
+   M->u[1][0] = u10;
+   M->u[1][1] = u11;
+   return 1;
+ }
+ 
+ #else
+ #error Unknown HGCD1_METHOD
+ #endif
+ 
  /* Two-limb division optimized for small quotients. */
  static inline mp_limb_t
***************
*** 198,210 ****
  #endif
  
- #else /* GMP_NAIL_BITS != 0 */
- /* Check all functions for nail support. */
- /* hgcd2 should be defined to take inputs including nail bits, and
-    produce a matrix with elements also including nail bits. This is
-    necessary, for the matrix elements to be useful with mpn_mul_1,
-    mpn_addmul_1 and friends. */
- #error Not implemented
- #endif /* GMP_NAIL_BITS != 0 */
- 
  /* Reduces a,b until |a-b| (almost) fits in one limb + 1 bit. Constructs
     matrix M. Returns 1 if we make progress, i.e. can perform at least
--- 396,399 ----
***************
*** 260,269 ****
        if (ah < (CNST_LIMB(1) << (GMP_LIMB_BITS / 2)))
!         {
!           ah = (ah << (GMP_LIMB_BITS / 2) ) + (al >> (GMP_LIMB_BITS / 2));
!           bh = (bh << (GMP_LIMB_BITS / 2) ) + (bl >> (GMP_LIMB_BITS / 2));
! 
!           break;
!         }
  
        /* Subtract a -= q b, and multiply M from the right by (1 q ; 0
--- 449,453 ----
        if (ah < (CNST_LIMB(1) << (GMP_LIMB_BITS / 2)))
!         break;
  
        /* Subtract a -= q b, and multiply M from the right by (1 q ; 0
***************
*** 303,312 ****
        if (bh < (CNST_LIMB(1) << (GMP_LIMB_BITS / 2)))
!         {
!           ah = (ah << (GMP_LIMB_BITS / 2) ) + (al >> (GMP_LIMB_BITS / 2));
!           bh = (bh << (GMP_LIMB_BITS / 2) ) + (bl >> (GMP_LIMB_BITS / 2));
! 
!           goto subtract_a1;
!         }
  
        /* Subtract b -= q a, and multiply M from the right by (1 0 ; q
--- 487,491 ----
        if (bh < (CNST_LIMB(1) << (GMP_LIMB_BITS / 2)))
!         break;
  
        /* Subtract b -= q a, and multiply M from the right by (1 0 ; q
***************
*** 344,408 ****
       get a truly maximal M (corresponding to |a - b| <
       2^{GMP_LIMB_BITS +1}). */
  
!   /* Single precision loop */
!   for (;;)
!     {
!       ASSERT (ah >= bh);
! 
!       ah -= bh;
!       if (ah < (CNST_LIMB (1) << (GMP_LIMB_BITS / 2 + 1)))
!         break;
! 
!       if (ah <= bh)
!         {
!           /* Use q = 1 */
!           u01 += u00;
!           u11 += u10;
!         }
!       else
!         {
!           mp_limb_t r;
!           mp_limb_t q = div1 (&r, ah, bh);
!           ah = r;
!           if (ah < (CNST_LIMB(1) << (GMP_LIMB_BITS / 2 + 1)))
!             {
!               /* A is too small, but q is correct. */
!               u01 += q * u00;
!               u11 += q * u10;
!               break;
!             }
!           q++;
!           u01 += q * u00;
!           u11 += q * u10;
!         }
!     subtract_a1:
!       ASSERT (bh >= ah);
! 
!       bh -= ah;
!       if (bh < (CNST_LIMB (1) << (GMP_LIMB_BITS / 2 + 1)))
!         break;
! 
!       if (bh <= ah)
!         {
!           /* Use q = 1 */
!           u00 += u01;
!           u10 += u11;
!         }
!       else
!         {
!           mp_limb_t r;
!           mp_limb_t q = div1 (&r, bh, ah);
!           bh = r;
!           if (bh < (CNST_LIMB(1) << (GMP_LIMB_BITS / 2 + 1)))
!             {
!               /* B is too small, but q is correct. */
!               u00 += q * u01;
!               u10 += q * u11;
!               break;
!             }
!           q++;
!           u00 += q * u01;
!           u10 += q * u11;
!         }
!     }
  
  done:
--- 523,529 ----
       get a truly maximal M (corresponding to |a - b| <
       2^{GMP_LIMB_BITS +1}). */
  
!   ah = (ah << (GMP_LIMB_BITS / 2) ) + (al >> (GMP_LIMB_BITS / 2));
!   bh = (bh << (GMP_LIMB_BITS / 2) ) + (bl >> (GMP_LIMB_BITS / 2));
!   return hgcd1 (ah, bh, u00, u01, u10, u11, M);
  
  done:
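For reference, the q_tab entries in the patch follow a simple rule:
q = a_hi / (b_hi + 1), i.e. b rounded up so the estimate never
overshoots, except q = 1 on the diagonal a_hi == b_hi, where the full
a is known to be larger than b. A throwaway generator along these
lines (hypothetical, not part of the patch):

  /* Hypothetical generator for the q_tab tables above: rows are
     b_hi = 1 .. 2^QTAB_BITS - 1, columns a_hi = 2^(QTAB_BITS - 1)
     .. 2^QTAB_BITS - 1.  Divide by b_hi + 1 to round b up, except
     use q = 1 when a_hi == b_hi. */
  #include <stdio.h>

  #define QTAB_BITS 4

  int
  main (void)
  {
    int a, b;
    for (b = 1; b < (1 << QTAB_BITS); b++)
      {
        printf ("  { /*%2d */", b);
        for (a = 1 << (QTAB_BITS - 1); a < (1 << QTAB_BITS); a++)
          printf (" %d,", a == b ? 1 : a / (b + 1));
        printf (" },\n");
      }
    return 0;
  }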
-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.