https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80813
Bug ID: 80813
Summary: x86: std::vector<bool>::operator[] could be somewhat faster using BT instead of SHL
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

This actually applies to all cases of testing a single bit in an array, not just vector<bool>.

#include <vector>
bool f(const std::vector<bool>& v, std::size_t x) {
    return v[x];
}

compiles with -O3 to this asm on gcc 5.x, 6.x, 7.1, and 8:

        # https://godbolt.org/g/OHsgcv
        movq    (%rdi), %rdx
        movq    %rsi, %r8
        movl    %esi, %ecx
        shrq    $6, %r8
        movl    $1, %eax
        salq    %cl, %rax
        testq   %rax, (%rdx,%r8,8)
        setne   %al
        ret

(With -march=haswell, it uses SHLX for the 1<<x.)

It would be more efficient to compute 1<<x with

        xor     %eax,%eax
        bts     %rsi,%rax

but gcc doesn't do this, even with -march=haswell.  (On AMD CPUs, BTS reg,reg is 2 m-ops, but it's probably about break-even if it saves a MOV to copy the shift count into ecx.  For AMD with BMI2, mov $1,reg and SHLX is probably optimal despite the larger code size.)

---

It's even better to load the word and then test the bit with BT, like clang does.  There's also no need for a 64-bit load: a 32-bit load avoids a lot of REX prefixes.

        # hand-crafted sequence I think is optimal, modified from clang output
        movq    (%rdi), %rax
        movl    %esi, %ecx
        shrq    $5, %rsi          # does still need to be 64-bit
        movl    (%rax,%rsi,4), %eax
        btl     %ecx, %eax
        setb    %al
        retq

Using BT is fewer instructions, and variable-count SHL is slower than BT on Intel SnB-family CPUs.  According to Agner Fog's tables, SHL %cl,%reg is 3 uops on SnB/HSW/SKL, but BT %reg,%reg is 1 uop on Intel and AMD CPUs.  (SHLX is 1 uop, so BMI2 avoids the x86 legacy baggage of leaving flags untouched when count==0.)  If we were branching on the result, TEST/JZ could macro-fuse on Intel and AMD CPUs, but BT/JNC couldn't.
The BT strategy would still be a speed win without BMI2; with BMI2 it's only a win on code size.

----

Actually, if we're not branching on the result, we could just isolate the bit with a shift and mask instead of a TEST and SETCC:

        movq    (%rdi), %rax
        movl    %esi, %ecx
        shrq    $5, %rsi
        movl    (%rax,%rsi,4), %eax
        shrx    %ecx, %eax, %eax
        andl    $1, %eax
        retq

This is as cheap as the BT/SETB sequence when BMI2 is available, and it zero-extends the bool for free to fill a whole register, instead of leaving it only in the low 8 bits.  (OTOH, BT/SETNC lets us invert the bit for free.)

The latency should be the same as with SHL/TEST/SETNZ or with BT/SETB, because TEST with a memory operand still runs as a separate load and then a 1-cycle test, and the load can start before the mask operand is ready.  So there are always two 1-cycle ALU operations after the load.

Without BMI2, using shr %cl,%eax would be worse on Intel SnB-family, because it has higher latency than BT and is on the critical path (unlike when SHL is used to build the mask in parallel with the load, where load-use latency hides the latency of setting up the mask unless ALU resource conflicts delay it).