Re: div_qr_1n_pi1

Torbjörn Granlund Mon, 07 Jun 2021 02:51:08 -0700

ni...@lysator.liu.se (Niels Möller) writes:

  I've tried it out. Works nicely, but no speedup on my machine. I'm
  attaching another patch. There are then 4 methods:


  method 1: Old loop around udiv_qrnnd_preinv.

  method 2: The clever code from 10 years ago, with the microoptimization
    I commited the other day.

  method 3: More or less the same as I posted a few days ago.

  method 4: Postpones the update u1 -= u2 d, off the critical recurrency
    chain. Instead, conditionally adds in the constant B2 (B - d) to the
    lower u limbs.

Let me throw some additional complication into this project:

The fastest way to do a division by a single limb will involve more
precomputation, assuming we have good throughput of the multiply
instruction.

Even in some scenario where one has a 1-limb inverse precomputed, it
will be faster to first do a Newton round to get a 2-limb inverse, and
then run a loop which develops two quotient limbs at a time.  I expect
the dividend limb count cutoff point to be small, surely less than 10.

With full precomputation, the dividend limb count cutoff point would be
2, I presume.

The function div_qr_1n_pi1 is useful mainly for CPUs without a pipelined
multiply instruction.  And of course also useful until we have great
div_qr_1n_piN for N >= 2.

Improvements to div_qr_1n_piN are of course theoretically interesting
too.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel

Re: div_qr_1n_pi1

Reply via email to