https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81602
--- Comment #3 from Peter Cordes <peter at cordes dot ca> ---
Forgot to mention: memory-source popcnt with an indexed addressing mode would also be worse on SnB/IvB: it can't stay micro-fused, so the front-end un-laminates it in the issue stage.

Haswell and later can keep popcnt (%rdi, %rdx), %eax micro-fused throughout the pipeline, so it's always 1 fused-domain uop instead of expanding to 2. It's still 2 unfused-domain uops, though, so it takes more room in the scheduler than the reg-reg form.

When Intel fixes popcnt's output dependency in some future uarch, it might un-laminate again with indexed addressing modes. That's what happens on Skylake for tzcnt/lzcnt, because SKL fixed their output dependency. (And judging from the published errata, they meant to fix popcnt as well.) The underlying rule: indexed addressing modes can only stay micro-fused with an ALU uop in "traditional" x86-style 2-operand instructions where the destination is read/write, not write-only. (Tested on Haswell and Skylake.)

And yes, this makes indexed addressing modes with AVX instructions worse than with the SSE equivalent. :/
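
To make the addressing-mode distinction concrete, here is a minimal C sketch (not from the original comment) of two loops that give the compiler the chance to pick either form for a memory-source popcnt. The function names `popcount_indexed` and `popcount_pointer` are hypothetical, and the asm in the comments is only what a compiler might emit. GCC or Clang may rewrite either loop, use a separate load, or vectorize, so inspect the real output (for example with `-O2 -mpopcnt -S`) before drawing conclusions.

```c
#include <stddef.h>
#include <stdint.h>

/* Indexed form: the load address is base + i*8, so a compiler may fold the
   load into something like  popcnt (%rdi,%rdx,8), %rax  (assumed codegen,
   not guaranteed). */
uint64_t popcount_indexed(const uint64_t *buf, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (uint64_t)__builtin_popcountll(buf[i]);
    return total;
}

/* Pointer-increment form: the load address is a single register, so a
   memory-source popcnt, if emitted, would look like  popcnt (%rdi), %rax
   (again an assumption about codegen). */
uint64_t popcount_pointer(const uint64_t *buf, size_t n)
{
    uint64_t total = 0;
    for (const uint64_t *p = buf, *end = buf + n; p != end; p++)
        total += (uint64_t)__builtin_popcountll(*p);
    return total;
}
```

Both functions return the same result. The only intended difference is the shape of the address calculation, which is what decides whether the load can stay micro-fused on the microarchitectures discussed above.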