On Thu, Feb 12 2015, "George Spelvin" <[email protected]> wrote:

>> Rasmus, your version has ANDing by mask, and resetting the mask at each 
>> iteration
>> of main loop. I think we can avoid it. What do you think on next?
>
> Yes, that's basically what I proposed (modulo checking for zero size and
> my buggy LAST_WORD_MASK).
>
> But two unconditional instructions in the loop are awfully minor; it's
> loads and conditional branches that cost.
>
> The reset of the mask can be done in parallel with other operations; it's
> only the AND that actually takes a cycle.
>
> I can definitely see the argument that, for code that's not used often
> enough to stay resident in the L1 cache, any speedup has to win by at
> least one L2 cache access to be worth taking another cache line.

That, and if the compiler was smart enough, the AND should actually be
free (at least on x86), since it can replace a test instruction. [1]

In any case, my code compiles to fewer bytes (though not an entire cache
line), and I think it is somewhat clearer - I'm obviously biased on the
latter point.

Rasmus


[1] Unfortunately, the compiler doesn't seem to be smart enough :-( When I
compile my code with gcc 5.0, I get

0000000000000000 <find_last_bit_rv>:
   0:   53                      push   %rbx
   1:   89 f1                   mov    %esi,%ecx
   3:   48 8d 5e 3f             lea    0x3f(%rsi),%rbx
   7:   f7 d9                   neg    %ecx
   9:   48 c7 c2 ff ff ff ff    mov    $0xffffffffffffffff,%rdx
  10:   48 83 e3 c0             and    $0xffffffffffffffc0,%rbx
  14:   48 d3 ea                shr    %cl,%rdx
  17:   eb 1a                   jmp    33 <find_last_bit_rv+0x33>
  19:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  20:   48 89 d1                mov    %rdx,%rcx
  23:   48 23 0c df             and    (%rdi,%rbx,8),%rcx
  27:   48 c7 c2 ff ff ff ff    mov    $0xffffffffffffffff,%rdx
  2e:   48 85 c9                test   %rcx,%rcx
  31:   75 15                   jne    48 <find_last_bit_rv+0x48>
  33:   48 83 eb 01             sub    $0x1,%rbx
  37:   48 83 fb ff             cmp    $0xffffffffffffffff,%rbx
  3b:   75 e3                   jne    20 <find_last_bit_rv+0x20>
  3d:   48 89 f0                mov    %rsi,%rax
  40:   5b                      pop    %rbx
  41:   c3                      retq   
  42:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  48:   48 89 cf                mov    %rcx,%rdi
  4b:   48 c1 e3 06             shl    $0x6,%rbx
  4f:   e8 00 00 00 00          callq  54 <find_last_bit_rv+0x54>
  54:   48 98                   cltq   
  56:   48 01 d8                add    %rbx,%rax
  59:   5b                      pop    %rbx
  5a:   c3                      retq   
  5b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)

the main loop is 20--3b. The test instruction at 2e seems to be
redundant. The same at 37: the sub instruction already sets plenty of
flags that could be used, so explicitly comparing %rbx to -1 seems
redundant.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to