I've just noticed that the x86 implementations (32-bit and 64-bit) of this function are suboptimal: they count the bits in a loop, which is very slow. x86 has a dedicated BSF (bit scan forward) instruction, which executes in 3-4 cycles. The difference is significant and measurable in algorithms that depend on this particular operation (e.g. van Emde Boas trees).
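To illustrate the kind of change I mean, here is a minimal sketch. It is not the actual OpenSolaris source: the names lowbit_loop and lowbit_bsf are made up for this example, the return convention (1-based index of the lowest set bit, 0 if no bit is set) is an assumption, and the inline assembly assumes GCC or Clang. On non-x86 targets the sketch falls back to the compiler builtin so that it still compiles.

```c
/*
 * Hypothetical loop-based version: returns the 1-based index of the
 * lowest set bit in v, or 0 if v has no bits set. The loop runs once
 * per trailing zero bit, so the cost grows with the bit position.
 */
static int lowbit_loop(unsigned long v)
{
    int pos = 1;

    if (v == 0)
        return 0;
    while ((v & 1UL) == 0) {
        v >>= 1;
        pos++;
    }
    return pos;
}

/*
 * Hypothetical BSF-based version with the same return convention.
 * BSF leaves its destination undefined when the source is zero, so
 * the zero case is handled before the instruction is issued.
 */
static int lowbit_bsf(unsigned long v)
{
    unsigned long idx;

    if (v == 0)
        return 0;
#if defined(__i386__) || defined(__x86_64__)
    /* Operand size is inferred from the register operands. */
    __asm__("bsf %1, %0" : "=r"(idx) : "r"(v) : "cc");
#else
    /* Assumption: GCC/Clang builtin as a portable stand-in. */
    idx = (unsigned long)__builtin_ctzl(v);
#endif
    return (int)idx + 1;
}
```

Both functions should return the same value for every input; only the cost differs, since the loop takes one iteration per trailing zero bit while BSF is a single instruction.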
I am willing to contribute an assembly-optimized version for x86, if there is any interest.
