https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115500

--- Comment #7 from Jeffrey A. Law <law at gcc dot gnu.org> ---
And to be clearer, look at the two assembly snippets.

The comparison is between
   0:   814d                    srli    a0,a0,0x13
   2:   8905                    andi    a0,a0,1
   4:   e501                    bnez    a0,c <.L3>
vs 
   0:   02c51793                slli    a5,a0,0x2c
   4:   0007c563                bltz    a5,e <.L3>



They're both using the same basic idioms (a logical shift and a simple
conditional branch); one just has an extra andi.  The second one has a shorter
data dependency critical path.  So it's hard to see how the first would ever be
better.
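
As a rough illustration of why the two sequences are interchangeable, here is a
minimal C sketch of both idioms. It assumes a 64-bit register width (RV64), so
that the srli by 0x13 and the slli by 0x2c (44 = 63 - 19) both examine bit 19.
The function names and the BIT constant are made up for this example and do
not come from the bug report or from GCC.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit index examined by both sequences (assumed from the shift counts:
 * srli by 0x13 = 19, slli by 0x2c = 44 = 63 - 19 on a 64-bit register). */
enum { BIT = 19 };

/* First idiom: shift the bit down to position 0, mask it, compare with zero.
 * Roughly corresponds to srli + andi + bnez. */
bool bit_set_shift_mask(uint64_t x)
{
    return ((x >> BIT) & 1u) != 0;
}

/* Second idiom: shift the bit up into the sign position and test the sign.
 * Roughly corresponds to slli + bltz.  The cast to int64_t relies on the
 * conversion wrapping modulo 2^64, which GCC documents as its behavior. */
bool bit_set_sign_test(uint64_t x)
{
    return (int64_t)(x << (63 - BIT)) < 0;
}
```

Both functions return the same answer for every input; the second form simply
needs one fewer dependent operation before the branch.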

More likely than not what's going on here is going to be something highly
specific to the micro-architecture implementation of whatever chip you tested.
So for example, some uarchs are particularly sensitive to code alignment.
That could affect the little loop or the function call.

To put this in perspective, I'm aware of a uarch that would show a double-digit
performance delta due to a 2 instruction, 6 byte sequence moving across a
particular boundary -- in a real world benchmark that executes nearly a
trillion instructions.
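
A quick way to check whether a sequence straddles such a boundary is sketched
below. This is only an illustration: the boundary size is an assumption (the
uarch and its boundary are not named here), and crosses_boundary is a
hypothetical helper, not part of any tool.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the byte range [addr, addr + len) touches more than one
 * block of size `boundary`.  Assumes len > 0 and boundary > 0; the boundary
 * value itself is whatever the uarch in question cares about. */
bool crosses_boundary(uint64_t addr, uint64_t len, uint64_t boundary)
{
    uint64_t first_block = addr / boundary;
    uint64_t last_block = (addr + len - 1) / boundary;
    return first_block != last_block;
}
```

With an assumed 64-byte boundary, a 6-byte sequence starting at offset 58
stays inside one block, while the same sequence starting at offset 60 crosses
into the next one, so a 2-byte shift in code layout is enough to change the
answer.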

Point is you have to be *very* careful analyzing this stuff and sometimes
things can be very surprising.

So probably the next question is: what did you use to test this, what do we
know about its uarch, and can we correlate what is public about that uarch to
the behavior you're seeing?
