One of my previous proposal looks like this, when put into its right form:

mp_limb_t
div1 (mp_limb_t n0, mp_limb_t d0)
{
  if (UNLIKELY ((d0 >> 61) != 0))
    return n0 / d0;

  if (UNLIKELY (n0 >= (d0 << 3)))
    return n0 / d0;
  else {
    mp_limb_t q, mask;

    d0 <<= 2;

    mask = -(mp_limb_t) (n0 >= d0);
    q = 4 & mask;
    n0 -= d0 & mask;

    d0 >>= 1;
    mask = -(mp_limb_t) (n0 >= d0);
    q += 2 & mask;
    n0 -= d0 & mask;

    d0 >>= 1;
    mask = -(mp_limb_t) (n0 >= d0);
    q += 1 & mask;
    n0 -= d0 & mask;

    return q;
  }
}

Some timing tests indicate that it is 6 times faster on 'shell' for
quotients < 8.  For hgcd, that would mean almost all quotients.
I expect an asm version would be 10x faster.

I attached a full test-and-timing program.

Attachment: div1.c
Description: Binary data

-- 
Torbjörn
Please encrypt, key id 0xC8601622
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel

Reply via email to