Richard Henderson <r...@twiddle.net> writes:

Indeed I know that the hw registers that allow such recognition are all privileged. For Linux the best one can do is /proc/cpuinfo or (to some extent) the values in AT_HWCAP.

Something portable would be nice...
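For reference, reading AT_HWCAP from user space on Linux might look roughly like the sketch below. It assumes glibc 2.16 or later (for getauxval); the helper names read_hwcap and hwcap_has are made up for illustration, and the meaning of each bit is architecture-specific.

```c
#include <sys/auxv.h>   /* getauxval, AT_HWCAP (glibc 2.16 or later) */

/* Hypothetical helper: fetch the AT_HWCAP word the kernel placed in the
   auxiliary vector. getauxval returns 0 if the entry is not present. */
static unsigned long read_hwcap(void)
{
    return getauxval(AT_HWCAP);
}

/* Hypothetical helper: true if every bit in mask is set in hwcap.
   The masks themselves come from the target's <asm/hwcap.h>. */
static int hwcap_has(unsigned long hwcap, unsigned long mask)
{
    return (hwcap & mask) == mask;
}
```

On 32-bit ARM one would then test something like hwcap_has(read_hwcap(), HWCAP_NEON), with HWCAP_NEON taken from <asm/hwcap.h>. This only tells you which features the kernel reports, not which core you are running on, so it does not replace the privileged ID registers.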
FYI, I dug out the add/mul_2.asm files I was working on in February. IIRC, they're correct, in that they pass the testsuite, but I could not show them to be faster than the add/mul_1 paths.

Do you know the repeat rate of umull, umlal, umaal, assuming no reg dependencies? Usually, it is possible to come close to the mul throughput for some addmul_N, N >= 1.

Forget mul_2, go for addmul_2, since the latter will be used repeatedly from mul_basecase or sqr_basecase.

One will want to do latency scheduling for umaal, handling v0 (the low limb of the 2-limb v operand) and v1 (its high limb) semi-separately; one first multiplies with v0, then, limping along some cycles later, multiply-adds v1. I suppose umaal + umaal or umaal + umlal would both work.

-- 
Torbjörn
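To make the v0/v1 idea concrete, here is a rough C model of an addmul_2 built from two umaal chains, with the v1 chain trailing the v0 chain by one limb. It is only a sketch of the data flow, not the .asm code discussed in the thread. It assumes 32-bit limbs and n >= 1; the names umaal32 and ref_addmul_2 are hypothetical; and the calling convention (add into rp[0..n-1], store limb n in rp[n], return limb n+1) is chosen for illustration and should be checked against GMP's own before reuse.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of ARM UMAAL: hi:lo = a * b + lo + hi.
   This cannot overflow 64 bits, since
   (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
static void umaal32(uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b)
{
    uint64_t t = (uint64_t)a * b + *lo + *hi;
    *lo = (uint32_t)t;
    *hi = (uint32_t)(t >> 32);
}

/* Hypothetical reference: {rp,n} += {up,n} * {vp,2}.
   Limb n of the result is stored in rp[n] (rp[n] is not read),
   and limb n+1 is returned. Requires n >= 1. */
static uint32_t ref_addmul_2(uint32_t *rp, const uint32_t *up,
                             size_t n, const uint32_t *vp)
{
    uint32_t v0 = vp[0], v1 = vp[1];
    uint32_t c0 = 0;   /* carry limb of the v0 chain */
    uint32_t c1 = 0;   /* carry limb of the v1 chain */
    uint32_t lo;
    size_t i;

    for (i = 0; i < n; i++) {
        lo = rp[i];
        /* First chain: up[i] * v0 lands at position i. */
        umaal32(&lo, &c0, up[i], v0);
        /* Second chain, one limb behind: up[i-1] * v1 also lands at i. */
        if (i > 0)
            umaal32(&lo, &c1, up[i - 1], v1);
        rp[i] = lo;
    }

    /* Wind down: position n gets up[n-1] * v1 plus both carries. */
    lo = c0;
    umaal32(&lo, &c1, up[n - 1], v1);
    rp[n] = lo;
    return c1;
}
```

In real assembly the two chains would be software-pipelined further apart than one limb to cover the multiply latency; the model only shows that each result limb needs exactly two multiply-accumulate steps and that the carries never overflow.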