https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115500
--- Comment #7 from Jeffrey A. Law <law at gcc dot gnu.org> ---
And to be clearer, if you look at the two assembly snippets, the problem is about

   0:   814d        srli  a0,a0,0x13
   2:   8905        andi  a0,a0,1
   4:   e501        bnez  a0,c <.L3>

vs

   0:   02c51793    slli  a5,a0,0x2c
   4:   0007c563    bltz  a5,e <.L3>

They're both using the same basic idioms (logical shifts and a simple conditional branch); one just has an extra andi. The second one has a shorter data dependency critical path, so it's hard to see how the first would ever be better.

More likely than not, what's going on here is something highly specific to the micro-architecture implementation of whatever chip you tested. For example, some uarchs are particularly sensitive to code alignment. That could affect the little loop or the function call.

To put this in perspective, I'm aware of a uarch that would show a double-digit performance delta due to a 2 instruction, 6 byte sequence moving across a particular boundary -- in a real-world benchmark that executes nearly a trillion instructions.

Point is, you have to be *very* careful analyzing this stuff, and sometimes things can be very surprising.

So probably the next question is: what did you use to test this, what do we know about its uarch, and can we correlate what is public about that uarch with the behavior you're seeing?
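
For reference, the two sequences appear to compute the same predicate: both branch when bit 19 of a0 is set. A rough C sketch of the two idioms follows, assuming a 64-bit register (RV64, where 19 + 0x2c = 63 puts the tested bit in the sign position). The function names are made up for illustration and are not taken from the testcase in this bug.

```c
#include <assert.h>
#include <stdint.h>

/* First idiom: move bit 19 down to bit 0, mask it, branch if nonzero
   (srli 0x13; andi 1; bnez). */
static int bit19_shift_mask(uint64_t x)
{
    return (int)((x >> 19) & 1u);
}

/* Second idiom: move bit 19 up into bit 63 and branch on the sign
   (slli 0x2c; bltz).  The unsigned-to-signed conversion is
   implementation-defined in C; GCC wraps modulo 2^64, which is what
   this sketch assumes. */
static int bit19_sign_test(uint64_t x)
{
    return (int64_t)(x << 44) < 0;
}

/* Hypothetical helper: check that both idioms agree for every
   single-bit pattern and its complement. */
static void check_single_bits(void)
{
    for (int i = 0; i < 64; i++) {
        uint64_t x = UINT64_C(1) << i;
        assert(bit19_shift_mask(x) == bit19_sign_test(x));
        assert(bit19_shift_mask(~x) == bit19_sign_test(~x));
    }
}
```

The first form needs three dependent instructions before the branch resolves and the second needs two, which is the critical-path difference mentioned above.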