Hello,
I am tempted to go with something like the attached patch to support long
long in gmpxx.h. (the patch is not quite ready)
Essentially, it adds a way to build mpz_class from long long, and for all
other operations, long long is converted to long if it fits and to
mpz_class otherwise. So
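A minimal sketch of that conversion strategy (not the attached patch; mpz_from_long_long and add_long_long are made-up names), assuming a 64-bit long long, a possibly 32-bit long, and the usual arithmetic right shift of negative values:

#include <gmpxx.h>
#include <climits>

static mpz_class
mpz_from_long_long (long long x)
{
  if (x >= LONG_MIN && x <= LONG_MAX)
    return mpz_class ((long) x);

  /* Reached only when long is narrower than long long: assemble the
     value from the high and low 32-bit halves. */
  mpz_class r ((long) (x >> 32));
  r <<= 32;
  r += (unsigned long) (x & 0xffffffffULL);
  return r;
}

static mpz_class
add_long_long (const mpz_class &a, long long x)
{
  if (x >= LONG_MIN && x <= LONG_MAX)
    return a + (long) x;		  /* fits: use the existing long overloads */
  return a + mpz_from_long_long (x);	  /* otherwise go through a temporary mpz_class */
}

int
main ()
{
  mpz_class a = 1;
  mpz_class b = add_long_long (a, 123456789012345678LL);
  gmp_printf ("%Zd\n", b.get_mpz_t ());
  return 0;
}

In gmpxx.h itself this dispatch would presumably live in the expression-template machinery rather than in free helper functions like these.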
ni...@lysator.liu.se (Niels Möller) writes:
> And I don't quite trust these cycle numbers; they should probably be
> twice as large, on the order of 10 cycles/limb for all variants. Less
> than 5 cycles is too good to be true, right?
Yes.
"Turbo" messes things up. The TSC cycle counter stays it
ni...@lysator.liu.se (Niels Möller) writes:
> Maybe we should have some macrology for that? Or do all relevant
> processors and compilers support efficient cmov these days? I'm sticking
> to masking expressions for now.
Let's not trust results from compiler-generated code for these things.
The mi
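For reference, the kind of masking expression being discussed looks like this; a generic conditional-correction step of the sort that appears in division loops, not code taken from GMP itself:

#include <gmp.h>

/* If the partial remainder r is >= d, subtract d and bump the quotient
   limb.  The comparison result is expanded into an all-zeros or all-ones
   mask, so the source contains no branch and does not depend on the
   compiler choosing cmov. */
static inline void
correct_step_mask (mp_limb_t *q, mp_limb_t *r, mp_limb_t d)
{
  mp_limb_t mask = - (mp_limb_t) (*r >= d);   /* 0 or ~(mp_limb_t) 0 */
  *r -= d & mask;
  *q += mask & 1;
}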
ni...@lysator.liu.se (Niels Möller) writes:
> $ ./speed -p 100 -s 2-20 -C mpn_div_qr_1n_pi1.0x8765432108765432
> mpn_div_qr_1n_pi1_1.0x8765432108765432 mpn_div_qr_1n_pi1_2.0x8765432108765432
> mpn_div_qr_1n_pi1_3.0x8765432108765432 mpn_div_qr_1n_pi1_4.0x8765432108765432
> overhead 2.63 cycle
ni...@lysator.liu.se (Niels Möller) writes:
> Your idea of conditionally adding the invariant d * B2 at the right
> place is also interesting,
I've tried it out. Works nicely, but no speedup on my machine. I'm
attaching another patch. There are then 4 methods:
method 1: Old loop around udiv_qrn
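As a reminder of what that old-loop baseline looks like, here is a sketch in the same spirit (not the patch's code): one full two-limb by one-limb division per iteration, the operation udiv_qrnnd provides in longlong.h, written here with unsigned __int128 and assuming 64-bit limbs and nonzero d.

#include <gmp.h>

static mp_limb_t
div_qr_1_by_loop (mp_limb_t *qp, const mp_limb_t *up, mp_size_t n, mp_limb_t d)
{
  mp_limb_t r = 0;	/* running remainder, always < d */
  for (mp_size_t i = n - 1; i >= 0; i--)
    {
      unsigned __int128 num = ((unsigned __int128) r << 64) | up[i];
      qp[i] = (mp_limb_t) (num / d);	/* quotient limb */
      r = (mp_limb_t) (num % d);	/* remainder carried down */
    }
  return r;
}

The pi1 variants instead replace the hardware division by multiplications with a precomputed inverse of d, which is where the per-limb cycle differences between the methods come from.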
Marco Bodrato writes:
> Using masks does not always give the fastest code. I tried the
> following variation on Niels' code, and, on my laptop with "g++-10 -O2
> -mtune=icelake-client -march=icelake-client", the resulting code is
> comparable to (faster than?) the current asm.
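Marco's actual variation is not included in this excerpt; the general idea of replacing a mask with a plain conditional, which the compiler is then free to turn into cmov, would look something like this:

#include <gmp.h>

/* The same correction step as the mask version earlier in the thread,
   written as a select; the compiler may compile this to cmov rather
   than a branch. */
static inline void
correct_step_select (mp_limb_t *q, mp_limb_t *r, mp_limb_t d)
{
  int c = *r >= d;
  *r = c ? *r - d : *r;
  *q += c;
}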
Maybe we should have