https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104773
            Bug ID: 104773
           Summary: compare with 1 not merged with subtract 1
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*, arm-*-*

std::bit_ceil(x) involves if(x == 0 || x == 1) return 1; and
1u << (32-clz(x-1)).  The compare of course compiles to an unsigned <= 1,
which can be done with a sub instead of cmp, producing the value we need as
the input for the leading-zero count.  But GCC does *not* do this.  (Neither
does clang for x86-64.)

I trimmed down the libstdc++ <bit> code into something I could compile even
when Godbolt doesn't have working headers for some ISAs:
https://godbolt.org/z/3EE7W5bna

#include <limits>

// cut down from libstdc++ for the normal integer cases; compiles the same
template<typename _Tp>
constexpr _Tp bit_ceil(_Tp __x) noexcept
{
    constexpr auto _Nd = std::numeric_limits<_Tp>::digits;
    if (__x == 0 || __x == 1)
        return 1;
    auto __shift_exponent = _Nd - __builtin_clz((_Tp)(__x - 1u));
    // using __promoted_type = decltype(__x << 1); ...
    // removed check for x<<n widening the result
    return (_Tp)1u << __shift_exponent;
}

For x86-64 with GCC trunk -O3 -march=ivybridge, we get this inefficient code:

roundup(unsigned int):
        mov     eax, 1
        cmp     edi, 1
        jbe     .L1
        sub     edi, 1        # could have just done a sub in the first place
        bsr     edi, edi      # correctly avoids a false dependency by *not* using ECX as the destination
        lea     ecx, [rdi+1]  # could have shifted 2<<n instead of 1<<(n+1)
        sal     eax, cl       # 3 uops, vs. 1 for bts, a more efficient way to materialize 1<<n
.L1:
        ret

Also, Ivy Bridge has no problem with DEC instead of SUB 1; IDK why GCC avoids
DEC here but not for Haswell, for example.
(Haswell pessimizes by using 32-lzcnt instead of lzcnt^31 or something, or
could still just use BSR because it performs identically on actual Haswell;
lzcnt is only faster on AMD.)

But this bug report is just about sub/cmp combining, not how to materialize
1<<(n+1) or other stuff.  Better would be:

        sub     edi, 1
        jbe     .L1
        bsr     edi, edi
        xor     eax, eax
        inc     edi
        bts     eax, edi      # EAX |= 1<<EDI
        ret
.L1:
        mov     eax, 1
        ret

Intel SnB-family can macro-fuse sub/jbe.  AMD can't, so the change is
break-even for front-end uops when the branch is not taken, and worse when it
is taken.  But it's still smaller code size.

For ARM, clang finds a very clever way to combine it:

roundup(unsigned int):
        subs    r0, r0, #1
        clz     r0, r0
        rsb     r1, r0, #32   @ 32-clz
        mov     r0, #1
        lslhi   r0, r0, r1    @ using flags set by SUBS
        bx      lr            @ 1<<(32-clz), or just 1

GCC, on the other hand, does much worse with
-O3 -std=gnu++20 -mcpu=cortex-a53 -mthumb:

roundup(unsigned int):
        cmp     r0, #1
        itttt   hi
        addhi   r3, r0, #-1
        movhi   r0, #1
        clzhi   r3, r3
        rsbhi   r3, r3, #32
        ite     hi
        lslhi   r0, r0, r3
        movls   r0, #1
        bx      lr

I suspect we could do better by combining the cmp and addhi, and doing
`mov r0, #1` outside of predication.  (I think that's a separate bug; I'm
planning to report it separately.)  Then one `it` block would be enough to
cover things, I think.  That would basically reduce it to clang's strategy,
although predicating the clz and rsb is optional.