On 03/17/2011 03:23 AM, David Laight wrote:
On Wed, Mar 16, 2011 at 01:26:31PM -0500, Adam Martinson wrote:
__builtin_ctz() compiles to:
mov    0x8(%ebp),%eax
bsf    %eax,%eax

(ffs()-1) compiles to:
mov    $0xffffffff,%edx
bsf    0x8(%ebp),%eax
cmove  %edx,%eax
...
So yes, there is a reason: ctz() is at least 50% faster.
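
[For reference, a minimal sketch (not from the original mails) of two wrappers that should reproduce the comparison above; compiling with something like 'gcc -O2 -S' on 32-bit x86 should emit code along these lines:

#include <strings.h>    /* ffs() */

/* a single bsf; result is undefined for x == 0 */
int lsb_ctz(unsigned int x) { return __builtin_ctz(x); }

/* ffs() is 1-based and returns 0 for x == 0, hence the -1
   and the cmove in the generated code */
int lsb_ffs(int x) { return ffs(x) - 1; }
]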
I'm not sure where you get 50% from!

I've read both the Intel and AMD x86 instruction performance manuals
(but can't claim to remember all of it!).
The 'bsf' will be a slow instruction (with constraints on where it
executes, and what can execute in parallel).
The 'cmove' has even worse constraints since it can't execute until
the 'flags' from the previous instruction are known.
cmove is only slightly better than a mis-predicted branch!
In this case there will be a complete pipeline stall between the 'bsf'
and the 'cmove'.

ffs would probably execute faster with a forwards conditional branch
(predicted not taken) in the 'return -1' path.

        David

For my purposes the 'return -1' path will never be taken, so it's not needed. I want to be able to do:

while (bitmap)
{
    i = __builtin_ctz(bitmap); /* get the LSB index */
    bitmap ^= 1u << i;         /* zero the LSB (unsigned literal so bit 31 is safe) */

    /* for each set bit i... */
}

If there's a faster way to do this I'm all ears.
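
[For completeness, a self-contained sketch of that loop; the demo function name and example bitmap are illustrative, not from the original mails:

#include <stdio.h>

/* print the index of every set bit, LSB first */
static void for_each_set_bit_demo(unsigned int bitmap)
{
    while (bitmap)
    {
        int i = __builtin_ctz(bitmap); /* index of the lowest set bit */
        bitmap ^= 1u << i;             /* clear it before the next iteration */
        printf("bit %d is set\n", i);
    }
}

int main(void)
{
    for_each_set_bit_demo(0x8000a002u); /* prints bits 1, 13, 15, 31 */
    return 0;
}
]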


