ni...@lysator.liu.se (Niels Möller) writes: I see two ways. Either just do an inversion table, which will cost an additional multiplication when used. Than one could handle quotients up to 7 or 8 bits bits with just 256 bytes of table.
Or shift both ni and di, and get quotient with something like q = tabp[(ni << NBITS) + di] >> (NBITS - (dcnt - ncnt)) Then one could handle quotients up to 5 bits with a table of 1024 bytes (indexing by 5 significant bits of each of n and d, excluding the most significant one bit). I went ahead and pushed a hgcd2.c with _METHOD = 4 and _METHOD 5. The former tables quotients and neeeds one multiply per div1 call and the latter tables inverses and needs 2 multiplies per div1 call. The quotient table method isn't pretty. It can surely be improved. It uses a very large table and gets poor throughput. It also has only about 80% branch hit rate, which is not good enough. Niels suggested a variant with better progress (I get quotient with at most NBITS/2 bits in the fast path while Niels expects NBITS) when extracting NBITS numerator and NBITS quotient bits. I would be happy to see a variant with NBITS progress. The inverse table method is more mature, I believe. It has much smaller tables and with the default table size (256 2-byte entries) it gets a branch hit rate of 97%. Doubling the table results in 98.5% rate. Benchmark? The inverse table method seems to be the fastest div1 thus far for some current Intel processors. -- Torbjörn Please encrypt, key id 0xC8601622 _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel