ni...@lysator.liu.se (Niels Möller) writes:

> I had a quick look at the machines that completed tuning this night.
> These three seem to prefer the old code (HGCD2_METHOD == 2):
>
>   armcortexa7neon-unknown-linux-gnueabihf    pi2.gmplib.org-stat
>   armcortexa8neon-unknown-linux-gnueabihf    beagle.gmplib.org-stat
>   armcortexa12neon-unknown-linux-gnueabihf   tinker.gmplib.org-stat
>
> The others want HGCD2_METHOD == 1, i.e., replacing div1 with plain
> division.

The table generator program has now run.  pi1 also wants the old code.

The next question is how badly they want it. :-)

I checked the small-quotient division speed of tinker and pi2.  It is 16
and 36 cycles, respectively.  With 36 cycles, I am not surprised that the
division instruction should be avoided.  But why does tinker prefer
div1 over a 16-cycle division instruction?

I suppose it has a lot to do with pipeline length too.

Of the known slow x86 dividers, only bwl ran tonight.  It prefers its
25-cycle division instruction.  But it has a super long pipeline, and
unpredictable branches will cost (pipeline length)/2 on average.  The
low-end ARM chips like Cortex-A12 (used by tinker) have short pipelines.

Did you get a chance to play with some code which takes care of small
quotients without branches or division?  I feel pretty confident that
with 25 cycles to play with (as on Intel CPUs), some bit operations for
quotients <= 7 would give a great speedup.

Such code is perhaps best written in assembly, but even C code should
help!

  #if HAVE_NATIVE_div_1_lt8
  ...
  #elif HAVE_NATIVE_div_1_lt4
  ...
  #endif

-- 
Torbjörn
Please encrypt, key id 0xC8601622
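
To make the small-quotient idea above concrete, here is a minimal C sketch of how quotients <= 7 might be handled with masks rather than a division instruction.  The name div_small_lt8 and its interface are hypothetical; this is not GMP code and not the div_1_lt8 hinted at above.  It assumes 64-bit limbs and d > 0, and it relies on the compiler turning each comparison into a flag-setting or conditional-move sequence, which plain C cannot guarantee.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of GMP.
   If floor(n/d) < 8, store the quotient in *qp and the remainder in *rp
   and return 1.  Otherwise return 0 and leave *qp and *rp untouched, so
   the caller can fall back to a real division.  Requires d > 0. */
static int
div_small_lt8 (uint64_t n, uint64_t d, uint64_t *qp, uint64_t *rp)
{
  uint64_t q = 0;
  int k;

  assert (d != 0);

  /* floor(n/d) >= 8 exactly when floor(n/8) >= d.  This is the one
     remaining branch in the sketch. */
  if ((n >> 3) >= d)
    return 0;

  /* Three conditional-subtract steps, for the quotient bits 4, 2, 1. */
  for (k = 2; k >= 0; k--)
    {
      /* mask is all ones when d * 2^k <= n, and zero otherwise.
         Comparing (n >> k) with d avoids overflowing d << k in the
         comparison itself. */
      uint64_t mask = -(uint64_t) ((n >> k) >= d);

      /* When mask is all ones, d << k <= n, so the shift cannot
         overflow.  When mask is zero, the shifted value is discarded. */
      n -= (d << k) & mask;
      q |= ((uint64_t) 1 << k) & mask;
    }

  *qp = q;
  *rp = n;
  return 1;
}
```

A caller would try div_small_lt8 first and use an ordinary division only when it returns 0.  Whether this beats a 16- or 25-cycle divider would have to be measured on each machine; an assembly version could also replace the loop with straight-line compare and conditional-move instructions.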