https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153
--- Comment #13 from ncm at cantrip dot org ---
This is essentially the entire difference between the versions of
puzzlegen-int.cc without, and with, the added "++count;" line referenced
above (modulo register assignments and branch labels) that sidesteps the
+50% pessimization:

(Asm is from "g++ -fverbose-asm -std=c++14 -O3 -Wall -S $SRC.cc" using
g++ (Debian 5.2.1-15) 5.2.1 20150808, with no instruction-set extensions
specified.  Output with "-mbmi -mbmi2" has different instructions, but
they do not noticeably affect run time on Haswell i7-4770.)

@@ -793,25 +793,26 @@
 .L141:
        movl    (%rdi), %esi    # MEM[base: _244, offset: 0], word
        testl   %r11d, %esi     # D.66634, word
        jne     .L138           #,
        xorl    %eax, %eax      # tmp419
        cmpl    %esi, %r12d     # word, seven
        leaq    208(%rsp), %rcx #, tmp574
        sete    %al             #, tmp419
        movl    %r12d, %edx     # seven, seven
        leal    1(%rax,%rax), %r8d      #, D.66619
        .p2align 4,,10
        .p2align 3
 .L140:
        movl    %edx, %eax      # seven, D.66634
        negl    %eax            # D.66634
        andl    %edx, %eax      # seven, D.66622
        testl   %eax, %esi      # D.66622, word
        je      .L139           #,
        addl    %r8d, 24(%rcx)  # D.66619, MEM[base: _207, offset: 24B]
+       addl    $1, %ebx        #, count
 .L139:
        notl    %eax            # D.66622
        subq    $4, %rcx        #, ivtmp.424
        andl    %eax, %edx      # D.66622, seven
        jne     .L140           #,
        addq    $4, %rdi        #, ivtmp.428
        cmpq    %rdi, %r10      # ivtmp.428, D.66637
        jne     .L141           #,

I tried a version of the program with a fixed-length loop (over 'place'
in [6..0]) so that branches do not depend on results of "rest &= ~-rest".
The compiler unrolled the loop, but the program ran at pessimized speed
with or without the "++count" line.

I am very curious whether this has been reproduced on others' Haswells,
and on Ivybridge and Skylake.
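
For readers who do not want to decode the asm, here is a rough C++ sketch of what the .L140 loop appears to compute, reconstructed only from the instructions and the -fverbose-asm annotations above. It is not the actual puzzlegen-int.cc source: the function names, the `scores` array, and the parameter layout are assumptions made for illustration. Only `word`, `seven`, `count`, `rest`, and `place` are names that appear in the comment or the asm annotations. The negl/andl pair isolates the lowest set bit (`rest & -rest`), notl/andl clears it, and `rest &= ~-rest` is the same operation as `rest &= rest - 1` on unsigned values.

```cpp
#include <cstdint>

// Clears the lowest set bit of rest. For unsigned values ~(-rest) equals
// rest - 1, so this matches the "rest &= ~-rest" idiom quoted in the comment.
inline std::uint32_t clear_lowest_bit(std::uint32_t rest) {
    return rest & ~(0u - rest);
}

// Hypothetical reconstruction of the inner loop (.L140 / .L139).
// 'scores' stands in for the stack array addressed as 24(%rcx); the real
// data layout in puzzlegen-int.cc is not shown in the comment.
inline void score_word(std::uint32_t word, std::uint32_t seven,
                       int (&scores)[7], unsigned& count) {
    // sete %al ; leal 1(%rax,%rax) : 3 if word == seven, otherwise 1.
    const int points = (word == seven) ? 3 : 1;
    int place = 6;  // offset 24 bytes == index 6; subq $4 steps down by one int
    for (std::uint32_t rest = seven; rest != 0; --place) {
        const std::uint32_t bit = rest & (0u - rest);  // negl + andl
        if (word & bit) {                              // testl + je .L139
            scores[place] += points;                   // addl %r8d, 24(%rcx)
            ++count;                                   // the added "++count;"
        }
        rest &= ~bit;                                  // notl + andl
    }
}
```

The sketch assumes `seven` has exactly seven bits set, so `place` stays within [6..0]. The data-dependent exit (`rest != 0`) is the branch that the fixed-length variant described above was meant to remove.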