ni...@lysator.liu.se (Niels Möller) writes: ni...@lysator.liu.se (Niels Möller) writes: > To get going, I've written C implementations of mpn_div_qr_1n_pi1 and > mpn_divf_qr1n_pi1, and made divrem_1 call them. Below, also an mpn_div_qr_1, using these primitives (and with some inspiration from divrem_1). For return value, I use the type typedef struct { mp_limb_t r; mp_limb_t qh; } gmp_qr_1_t; The order here is a micro-optimization. I expect that on most ABI:s, for the typical code fragment gmp_qr_1_t res; ... res.r = mpn_foo (...); return res; the return value from mpn_foo will already be in the right register, and only qh needs to be moved from some callee-save register to the second return value register. I am not too enthusiastic about struct return types for critical functions. I expect this to be returned via a stack slot everywhere o almost everywhere.
But I agree that it is nice to avoid putting that extra quotient limb at the main quotient operand. Now, performance is more important than cuteness... I also found some old x86_64 assembly code for the new div_qr_1 algorithm. With perfect scheduling, it could run at 10 cycles on opteron (the main cpu of interest at the time). Main loop is 28 instructions, two independent mul, and decoder limited). I recall to have seen some code for that. How fast does it run currently on the various CPUs? Code comment: I think we cannot afford to do a separate lshift of the dividend operand when the divisor is just a few limbs. We need to to shifting on-the- fly, however irksome that might be. AN mpn_div_qr_1u_pi1 is called-for. -- Torbjörn _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org http://gmplib.org/mailman/listinfo/gmp-devel