Re: udiv_qr_3by2 vs divappr

2018-08-27 Thread Torbjörn Granlund
Some more measurements. The first 3 result columns are for division, for gradually harder cases; B=0xcafebabedeadbeef. The last columns is widening multiplication. cpu type 17/3 B/4711 2^64/3 mul presumed algorithm skylake 25 25 86 1 SRT-8 broadwell

Re: udiv_qr_3by2 vs divappr

2018-08-27 Thread Torbjörn Granlund
t...@gmplib.org (Torbjörn Granlund) writes: Last time I explored this, the division instructions (div/idiv) ran at varying speed as long as the upper dividend word is zero. When it is, the timing is rougly linear in either log2(dividend) or log2(quotient). But the fastst time is still som

Re: udiv_qr_3by2 vs divappr

2018-08-27 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes: paul zimmermann writes: > page 1: the division instruction is now much faster than before on modern > processors According to https://gmplib.org/~tege/x86-timing.pdf, they're still an order of magnitute slower than multiplication. E.g 8

Re: udiv_qr_3by2 vs divappr

2018-08-27 Thread Niels Möller
paul zimmermann writes: > page 1: the division instruction is now much faster than before on modern > processors According to https://gmplib.org/~tege/x86-timing.pdf, they're still an order of magnitute slower than multiplication. E.g 86 vs 3 cycles on Intel skylake. And in addition, multip

Re: udiv_qr_3by2 vs divappr

2018-08-27 Thread paul zimmermann
Dear Niels, > From: ni...@lysator.liu.se (Niels Möller) > Date: Sun, 05 Aug 2018 09:59:46 +0200 > > ni...@lysator.liu.se (Niels Möller) writes: > > > static inline mp_limb_t > > divappr_2 (mp_limb_t n2, mp_limb_t n1, > >mp_limb_t d1, mp_limb_t d0, mp_limb_t dinv) > > { > > mp_li