https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120952
            Bug ID: 120952
           Summary: GCC fails to optimize division to shift when divisor
                    is non-constant power of two
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: redi at gcc dot gnu.org
  Target Milestone: ---

Given std::bit_ceil (which returns the smallest power of two not less than
its argument), I hoped the following would be optimized to use a shift:

#include <bit>
using std::size_t;

size_t get_block_size(size_t bytes, size_t alignment)
{
  alignment = std::__bit_ceil(alignment);
  if (!std::__has_single_bit(alignment)) // definitely a power of two now!
    __builtin_unreachable();
  return ((bytes + alignment - 1) / alignment) * alignment;
}

On x86_64 GCC optimizes this to:

get_block_size(unsigned long, unsigned long):
        mov     rax, rdi
        mov     edi, 1
        cmp     rsi, 1
        jbe     .L2
        sub     rsi, 1
        bsr     rsi, rsi
        lea     ecx, [rsi+1]
        sal     rdi, cl
.L2:
        lea     rcx, [rdi-1+rax]
        xor     edx, edx
        mov     rax, rcx
        div     rdi
        mov     rax, rcx
        sub     rax, rdx
        ret

Whereas Clang uses a shift:

get_block_size(unsigned long, unsigned long):
        mov     eax, 1
        cmp     rsi, 2
        jb      .LBB0_2
        dec     rsi
        mov     ecx, 127
        bsr     rcx, rsi
        xor     ecx, 63
        neg     cl
        shl     rax, cl
.LBB0_2:
        lea     rcx, [rdi + rax]
        dec     rcx
        neg     rax
        and     rax, rcx
        ret

If I remove the bit_ceil and just tell the compiler it's already a power of
two, we avoid the branch, but GCC still doesn't use a shift:

using size_t = decltype(sizeof(0));

size_t get_block_size(size_t bytes, size_t alignment)
{
  if (alignment & (alignment - 1))
    __builtin_unreachable();
  return ((bytes + alignment - 1) / alignment) * alignment;
}

GCC still uses DIV:

get_block_size(unsigned long, unsigned long):
        lea     rax, [rsi-1+rdi]
        xor     edx, edx
        div     rsi
        lea     rax, [rsi-1+rdi]
        sub     rax, rdx
        ret

And Clang still shifts:

get_block_size(unsigned long, unsigned long):
        lea     rax, [rdi + rsi]
        dec     rax
        rep bsf rcx, rsi
        shr     rax, cl
        imul    rax, rsi
        ret

On IRC, Andrew P. said that we don't track pop count on SSA names.
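
For comparison, the round-up can be written by hand with a mask, which is
essentially what Clang's first listing computes. This is a minimal sketch
(the helper name get_block_size_masked is invented here), assuming alignment
is a nonzero power of two:

// Sketch only: hand-applied form of the transformation Clang performs.
// For a nonzero power-of-two alignment, -alignment == ~(alignment - 1)
// in unsigned arithmetic, so the AND clears the low bits, rounding
// bytes + alignment - 1 down to a multiple of alignment.
using size_t = decltype(sizeof(0));

size_t get_block_size_masked(size_t bytes, size_t alignment)
{
  if (alignment & (alignment - 1)) // promise: power of two
    __builtin_unreachable();
  return (bytes + alignment - 1) & -alignment;
}

For example, get_block_size_masked(13, 8) gives (13 + 7) & ~7 == 16, the
same result as the divide/multiply form.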
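
The shift form from Clang's second listing can likewise be written out with
__builtin_ctzll; again a sketch with an invented name, assuming a nonzero
power-of-two alignment (whose trailing-zero count is its log2):

// Sketch only: shift/multiply form matching Clang's second listing.
using size_t = decltype(sizeof(0));

size_t get_block_size_shift(size_t bytes, size_t alignment)
{
  // __builtin_ctzll is undefined for 0; for a power of two it returns
  // log2(alignment).
  unsigned shift = __builtin_ctzll(alignment);
  // (x >> shift) * alignment == (x >> shift) << shift here, which is
  // the shr/imul pair in the Clang output above.
  return ((bytes + alignment - 1) >> shift) * alignment;
}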