[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>

ncm at cantrip dot org Sun, 16 Aug 2015 14:44:07 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153


--- Comment #13 from ncm at cantrip dot org ---
The diff below is essentially the entire difference between the
assembly generated for puzzlegen-int.cc without, and with, the added
"++count;" line referenced above (modulo register assignments and
branch labels).  Adding that line sidesteps the +50% pessimization:

(Asm is from "g++ -fverbose-asm -std=c++14 -O3 -Wall -S $SRC.cc" using
g++ (Debian 5.2.1-15) 5.2.1 20150808, with no instruction-set extensions 
specified.  Output with "-mbmi -mbmi2" has different instructions, but
they do not noticeably affect run time on Haswell i7-4770.)

@@ -793,25 +793,26 @@
 .L141:
        movl    (%rdi), %esi    # MEM[base: _244, offset: 0], word
        testl   %r11d, %esi     # D.66634, word
        jne     .L138           #,
        xorl    %eax, %eax      # tmp419
        cmpl    %esi, %r12d     # word, seven
        leaq    208(%rsp), %rcx #, tmp574
        sete    %al             #, tmp419
        movl    %r12d, %edx     # seven, seven
        leal    1(%rax,%rax), %r8d      #, D.66619
        .p2align 4,,10
        .p2align 3
  .L140:
        movl    %edx, %eax      # seven, D.66634
        negl    %eax    # D.66634
        andl    %edx, %eax      # seven, D.66622
        testl   %eax, %esi      # D.66622, word
        je      .L139   #,
        addl    %r8d, 24(%rcx)  # D.66619, MEM[base: _207, offset: 24B]
+       addl    $1, %ebx        #, count
 .L139:
        notl    %eax            # D.66622
        subq    $4, %rcx        #, ivtmp.424
        andl    %eax, %edx      # D.66622, seven
        jne     .L140           #,
        addq    $4, %rdi        #, ivtmp.428
        cmpq    %rdi, %r10      # ivtmp.428, D.66637
        jne     .L141           #,
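
For readers who would rather not decode the listing, here is a rough C++
reading of the loop above.  It is a reconstruction from the asm, not the
actual puzzlegen-int.cc source: the names Tally, tally_words and
forbidden are made up for illustration, and I am assuming that .L138
(not shown) simply moves on to the next word.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical reconstruction of the loop in the listing; all names here
// are illustrative and do not come from puzzlegen-int.cc.
struct Tally {
    int scores[7] = {0, 0, 0, 0, 0, 0, 0};
    int count = 0;
};

// Precondition (assumed): `seven` has at most seven bits set, so `place`
// never goes below zero.
inline Tally tally_words(const std::vector<std::uint32_t>& words,
                         std::uint32_t seven, std::uint32_t forbidden)
{
    Tally t;
    for (std::uint32_t word : words) {
        if (word & forbidden)                 // testl %r11d, %esi ; jne .L138
            continue;                         // assumed: skip this word
        int score = (word == seven) ? 3 : 1;  // sete %al ; leal 1(%rax,%rax)
        std::uint32_t rest = seven;
        for (int place = 6; rest != 0; --place) {    // .L140
            std::uint32_t low = rest & (0u - rest);  // negl ; andl: lowest set bit
            if (word & low) {                        // testl ; je .L139
                t.scores[place] += score;            // addl %r8d, 24(%rcx)
                ++t.count;                           // the added "addl $1, %ebx"
            }
            rest &= ~low;                            // notl ; andl ; jne .L140
        }
    }
    return t;
}
```

Note that the trip count of the inner loop depends on how many bits are
left in rest, which is the data-dependent branch discussed below.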

I tried a version of the program with a fixed-length loop (over 
'place' in [6..0]) so that branches do not depend on results of
"rest &= ~-rest".  The compiler unrolled the loop, but the program
ran at pessimized speed with or without the "++count" line.
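
To make the idea concrete, a fixed-length variant might look roughly
like the sketch below.  This is only one way to write it, under my own
assumptions; it is not the code that was actually tested, and the
function name score_fixed and its parameters are hypothetical.  The bits
of 'seven' are peeled off up front, so the scoring loop always runs
exactly seven times and its loop branch depends only on 'place'.

```cpp
#include <cstdint>

// Hypothetical fixed-trip-count variant; names are illustrative only.
inline void score_fixed(std::uint32_t word, std::uint32_t seven, int score,
                        int (&scores)[7], int& count)
{
    // Peel off the set bits of `seven`, lowest first. If fewer than seven
    // bits are set, the remaining entries stay zero and never match.
    std::uint32_t bits[7] = {};
    std::uint32_t rest = seven;
    for (int place = 6; place >= 0; --place) {
        bits[place] = rest & (0u - rest);  // lowest set bit, or 0 if rest == 0
        rest &= rest - 1;                  // clears it; same as rest &= ~-rest
    }
    // Fixed-length loop over place in [6..0]; a compiler is free to unroll it.
    for (int place = 6; place >= 0; --place) {
        if (word & bits[place]) {
            scores[place] += score;
            ++count;
        }
    }
}
```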

I am very curious whether this has been reproduced on others' Haswells,
and on Ivybridge and Skylake.
