[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>

2016-02-08 Thread nate at verse dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #23 from Nathan Kurz  ---
> 1. As a correction: *without* the count takes twice as long to run as with,
>or when using bitset<>.

Oops, I did say that backwards.  My tests agree with what you say.  

> 2. As a heuristic, favoring a branch to skip a biggish loop body evidently 
>has much less downside than favoring the branch into it.  

A correctly predicted not-taken branch is the same speed as, or one cycle
faster than, a correctly predicted taken branch.  Modern Intel cores can
sustain two fused cmp/jcc µops per cycle, but only if the first is not taken.

> Maybe Gcc already has such a heuristic, and the addition of 7 conditional 
> increments in the loop, or whatever overhead bitset<> adds, was enough
> to push it over?  

I presume this is true, but I don't know the GCC internals.  I did, however,
test with clang++ 3.8.0 and icpc 16.01, and found that both made the same
layout choices as GCC in the absence of __builtin_expect() and ran at
approximately the same speed on Skylake (about 2 cycles per iteration).  With
USE_EXPECT, clang and gcc were equally fast (one cycle), and icc was in
between (about 1.2 cycles).

My reason for thinking this is not a bug is that the fastest choice depends on
the contents of the word list.  For any given input, one layout will be
slightly faster than the other.  It is reasonable to ask, though, whether it is
better by default to save one cycle on an already very fast empty loop, or one
cycle on the more expensive loop body.  But the real gain (if there is one)
comes from matching the layout to the runtime behavior, and for that the
compiler needs outside information.

> Westmere runs both instruction sequences (with and without __builtin_expect)
> the same.  Maybe on Westmere the loop takes two cycles regardless of code 
> placement, and Gcc is (still) tuned for the older timings?


This matches my belief: Haswell and Skylake can execute the inner loop in a
single cycle if the layout matches the runtime pattern, while Nehalem
(Westmere's sibling) takes at least 2 cycles regardless.  The difference comes
from the "Macro-op Fusion" capabilities of the respective generations.  See the
sections with that heading for Nehalem, Sandy Bridge, Haswell, and Skylake
in http://www.agner.org/optimize/microarchitecture.pdf.

Often the limiting factor for very tight loops like this is that only 4 µops
can be retired per cycle.  If both cmp/jcc pairs are fused ("cmp" here covers a
generation-dependent subset of cmp/test/add/sub), and if the first branch is
not taken, Haswell and Skylake can execute both pairs plus the 'mov' and 'add'
in a single cycle.  If, as on previous generations, the two pairs are not both
fused (see Agner's PDF), or if the first branch is taken, the loop has a
two-cycle minimum.

My recommendation to any GCC internals experts reading this would be to read
Agner's suggestions for promoting fusion and change the loop optimizers to
respect them.  While that is not what happens in this case, I often see
situations where GCC puts a 'mov' or other non-flag-changing instruction
between the 'cmp' and the 'jcc', defeating macro-op fusion and producing
slower loops.  Intel's icc is frequently better at avoiding this, and thus
often produces slightly faster loops.

[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>

2016-02-08 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #22 from ncm at cantrip dot org ---
(In reply to Nathan Kurz from comment #21)
> My current belief is
> that everything here is expected behavior, and there is no bug with either
> the compiler or processor.  
> 
> The code spends most of its time in a tight loop that depends on runtime
> input, and the compiler doesn't have any reason to know which branch is more
> likely.   The addition of "count" changes the heuristic in recent compilers,
> and by chance, changes it for the worse.  

I am going to disagree, carefully.  It seems clear, in any case, that Haswell
is off the hook.

1. As a correction: *without* the count takes twice as long to run as with,
   or when using bitset<>.

2. As a heuristic, favoring a branch to skip a biggish loop body evidently 
   has much less downside than favoring the branch into it.  Maybe Gcc
   already has such a heuristic, and the addition of 7 conditional 
   increments in the loop, or whatever overhead bitset<> adds, was enough
   to push it over?  

Westmere runs both instruction sequences (with and without __builtin_expect)
the same.  Maybe on Westmere the loop takes two cycles regardless of code 
placement, and Gcc is (still) tuned for the older timings?

[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>

2016-02-08 Thread nate at verse dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

Nathan Kurz  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |nate at verse dot com

--- Comment #21 from Nathan Kurz  ---
Created attachment 37629
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=37629&action=edit
Test case using __builtin_expect()

I wondered if this might match some other effects I've seen recently, so I
tested this on Skylake and Haswell with g++ 5.3.0.  My current belief is that
everything here is expected behavior, and there is no bug with either the
compiler or processor.  

The code spends most of its time in a tight loop that depends on runtime input,
and the compiler doesn't have any reason to know which branch is more likely.  
The addition of "count" changes the heuristic in recent compilers, and by
chance, changes it for the worse.   

More directly, one can get the same effect with __builtin_expect(); a sketch of
the guard appears at the end of this comment.  Here are the (excerpted) results
I see on Skylake:

nate@skylake:~/src/rust$ g++-5 -std=gnu++14 -O3 -Wall -g puzzlegen-expect.cc -o
puzzlegen
nate@skylake:~/src/rust$ perf stat ./puzzlegen > /dev/null
        210.098864  task-clock (msec)        #    1.001 CPUs utilized
       711,412,710  cycles                   #    3.386 GHz
     1,930,668,439  instructions             #    2.71  insns per cycle
       625,004,194  branches                 # 2974.810 M/sec
         2,920,721  branch-misses            #    0.47% of all branches

       0.209838807 seconds time elapsed

nate@skylake:~/src/rust$ g++-5 -std=gnu++14 -O3 -Wall -g puzzlegen-expect.cc -o
puzzlegen -DUSE_EXPECT
nate@skylake:~/src/rust$ perf stat ./puzzlegen > /dev/null
        122.037566  task-clock (msec)        #    1.001 CPUs utilized
       413,227,074  cycles                   #    3.386 GHz
     1,930,678,854  instructions             #    4.67  insns per cycle
       625,011,748  branches                 # 5121.470 M/sec
         2,952,030  branch-misses            #    0.47% of all branches

       0.121937547 seconds time elapsed

Both versions have the same number of instructions, branches, and
branch-misses.  The version with USE_EXPECT is almost twice as fast because it
executes almost twice as many instructions per cycle.  GCC does a great job of
producing a very tight loop that executes in one cycle per iteration: 

 77.21 │210:┌─→add    $0x4,%rcx
  0.66 │    │  cmp    %rcx,%r8
       │    │↓ je     260
  4.20 │219:│  mov    (%rcx),%edi
  0.88 │    │  test   %r9d,%edi
       │    └──jne    210

The cmp/je and test/jne pairs each fuse into a single µop, so the six
instructions decode to only 4 µops and the processor can execute them at one
iteration per cycle.  If the loop is rearranged so that the middle branch is
taken, each iteration takes a minimum of 2 cycles instead.

My suspicion was that we'd see GCC splitting the instructions in a way that
prevented fusion (which I think it often does), but I think it's doing an
excellent job here.  It's possible there is a different effect in one of the
other test cases, but I don't think the case that involves 'count' needs
fixing.
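
For reference, the USE_EXPECT guard amounts to something like the following
sketch (attachment 37629 is authoritative; the macro name here is
illustrative):

#ifdef USE_EXPECT
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define UNLIKELY(x) (x)
#endif

    // Mark the inner body as rarely entered, so the branch that skips it
    // becomes the hot, fall-through path.
    for (Letters word : words)
        if (UNLIKELY(!(word & ~seven)))
            for_each_in_seven([&](Letters letter, int place) {
                if (word & letter)
                    score[place] += (word == seven) ? 3 : 1;
            });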

[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>

2016-01-04 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #20 from Andrew Pinski  ---
(In reply to Andrew Pinski from comment #19)
> bitset<> version
> real    0m1.073s
> user    0m1.052s
> sys     0m0.021s
> 
> unsigned int
> real    0m0.903s
> user    0m0.883s
> sys     0m0.019s
> 
> bitset with container adapter:
> real    0m1.519s
> user    0m1.499s
> sys     0m0.019s

I should say this is on a ThunderX T88 pass 2 running at 1.9GHz.

[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>

2016-01-04 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #19 from Andrew Pinski  ---
bitset<> version
real    0m1.073s
user    0m1.052s
sys     0m0.021s

unsigned int
real    0m0.903s
user    0m0.883s
sys     0m0.019s

bitset with container adapter:
real    0m1.519s
user    0m1.499s
sys     0m0.019s

[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>

2015-12-04 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #18 from ncm at cantrip dot org ---
It is far from clear to me that gcc-5's choice to put the increment value in a
register, and use just one loop body, is wrong. Rather, it appears that an
incidental choice in the placement order of basic blocks or register assignment
interacts badly with a bug in Haswell branch prediction, value dependency
tracking, micro-op cache, or something.  An actual fix for this would need to
identify and step around Haswell's sensitivity to whichever detail of code
generation this program happens upon.

[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>

2015-12-04 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|5.3                         |5.4

--- Comment #17 from Richard Biener  ---
GCC 5.3 is being released, adjusting target milestone.

[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>

2015-11-18 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2015-11-18
     Ever confirmed|0                           |1

--- Comment #16 from Richard Biener  ---
Confirmed.  Neither GCC 5 nor trunk unswitches the loop in main.

[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>

2015-08-16 Thread miyuki at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

Mikhail Maltsev  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |miyuki at gcc dot gnu.org

--- Comment #15 from Mikhail Maltsev  ---
(In reply to ncm from comment #13)
> I am very curious whether this has been reproduced on others' Haswells,
> and on Ivybridge and Skylake.

I could reproduce this on a Haswell i7-5820K (affected) and also tested on an
Ivybridge i7-3770K (not affected).

Haswell:

GCC 4.9.3 native
real    0m0.150s
user    0m0.149s
sys     0m0.001s

GCC 6.0.0 native
real    0m0.213s
user    0m0.212s
sys     0m0.000s

GCC 6.0.0 native + count
real    0m0.166s
user    0m0.165s
sys     0m0.000s


GCC 4.9.3 generic
real    0m0.128s
user    0m0.127s
sys     0m0.001s

GCC 6.0.0 generic
real    0m0.213s
user    0m0.213s
sys     0m0.000s

GCC 6.0.0 generic + count
real    0m0.125s
user    0m0.124s
sys     0m0.000s


Ivybridge:

GCC 4.9.2 native
real    0m0.218s
user    0m0.216s
sys     0m0.000s

GCC 6.0.0 native
real    0m0.186s
user    0m0.184s
sys     0m0.000s

GCC 6.0.0 native + count
real    0m0.266s
user    0m0.264s
sys     0m0.000s


GCC 4.9.2 generic
real    0m0.224s
user    0m0.220s
sys     0m0.004s

GCC 6.0.0 generic
real    0m0.183s
user    0m0.180s
sys     0m0.000s

GCC 6.0.0 generic + count
real    0m0.181s
user    0m0.176s
sys     0m0.004s

Here "native" means "-march=native -mtune=native", "generic" means no arch
options. "+count" means adding a counter into the loop and writing it's value
to a global volatile variable after the loop.
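
A minimal sketch of that change, applied to the scoring loop quoted in comment
#12 (the exact variant tested may differ; the global's name is hypothetical):

    volatile unsigned long count_sink;   // file-scope volatile sink

    // ...
    unsigned long count = 0;
    for (Letters word : words)
        if (!(word & ~seven))
            for_each_in_seven([&](Letters letter, int place) {
                if (word & letter) {
                    ++count;                         // the added counter
                    score[place] += (word == seven) ? 3 : 1;
                }
            });
    count_sink = count;   // volatile store after the loop keeps 'count' live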


[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>

2015-08-16 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #14 from ncm at cantrip dot org ---
A notable difference between g++-4.9 output and g++-5 output is that,
while both hoist the "(word == seven)" comparison out of the innermost
loop, g++-4.9 splits the inner loop into two versions, one that increments
scores by 3 and another that increments by 1, whereas g++-5 loads 3 or 1
into a register and uses the same inner loop for both cases.
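
In source terms, the two shapes are roughly the following (a sketch, not
literal compiler output):

    // Shape A (g++-4.9): two specialized inner loops.
    if (word == seven)
        for_each_in_seven([&](Letters letter, int place) {
            if (word & letter) score[place] += 3;
        });
    else
        for_each_in_seven([&](Letters letter, int place) {
            if (word & letter) score[place] += 1;
        });

    // Shape B (g++-5): one inner loop, with the increment held in a register.
    int inc = (word == seven) ? 3 : 1;
    for_each_in_seven([&](Letters letter, int place) {
        if (word & letter) score[place] += inc;
    });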

Rewriting the critical loop
  - to run with separate inner loops
 - does not slow down the fast g++-4.9-compiled program, but
 - fails to speed up the slow g++-5-compiled program.
  - to precompute a 1 or 3 increment, with one inner loop for both cases
 - does slow down the previously fast g++-4.9-compiled program, and
 - does not change the speed of the slow g++-5-compiled program


[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>

2015-08-16 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #13 from ncm at cantrip dot org ---
This is essentially the entire difference between the versions of
puzzlegen-int.cc without, and with, the added "++count;" line 
referenced above (modulo register assignments and branch labels)
that sidesteps the +50% pessimization:

(Asm is from "g++ -fverbose-asm -std=c++14 -O3 -Wall -S $SRC.cc" using
g++ (Debian 5.2.1-15) 5.2.1 20150808, with no instruction-set extensions 
specified.  Output with "-mbmi -mbmi2" has different instructions, but
they do not noticeably affect run time on Haswell i7-4770.)

@@ -793,25 +793,26 @@
 .L141:
        movl    (%rdi), %esi    # MEM[base: _244, offset: 0], word
        testl   %r11d, %esi     # D.66634, word
        jne     .L138           #,
        xorl    %eax, %eax      # tmp419
        cmpl    %esi, %r12d     # word, seven
        leaq    208(%rsp), %rcx #, tmp574
        sete    %al             #, tmp419
        movl    %r12d, %edx     # seven, seven
        leal    1(%rax,%rax), %r8d      #, D.66619
        .p2align 4,,10
        .p2align 3
 .L140:
        movl    %edx, %eax      # seven, D.66634
        negl    %eax            # D.66634
        andl    %edx, %eax      # seven, D.66622
        testl   %eax, %esi      # D.66622, word
        je      .L139           #,
        addl    %r8d, 24(%rcx)  # D.66619, MEM[base: _207, offset: 24B]
+       addl    $1, %ebx        #, count
 .L139:
        notl    %eax            # D.66622
        subq    $4, %rcx        #, ivtmp.424
        andl    %eax, %edx      # D.66622, seven
        jne     .L140           #,
        addq    $4, %rdi        #, ivtmp.428
        cmpq    %rdi, %r10      # ivtmp.428, D.66637
        jne     .L141           #,

I tried a version of the program with a fixed-length loop (over 
'place' in [6..0]) so that branches do not depend on results of
"rest &= ~-rest".  The compiler unrolled the loop, but the program
ran at pessimized speed with or without the "++count" line.
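
Schematically, that fixed-length variant looks like this (the letter/place
mapping and the helper are illustrative):

    // Fixed trip count: iterate place = 6..0 so the branches no longer
    // depend on the result of "rest &= ~-rest".
    for (int place = 6; place >= 0; --place) {
        Letters letter = letter_for_place(place);   // hypothetical helper
        if ((seven & letter) && (word & letter))
            score[place] += (word == seven) ? 3 : 1;
    }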

I am very curious whether this has been reproduced on others' Haswells,
and on Ivybridge and Skylake.


[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>

2015-08-13 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #12 from ncm at cantrip dot org ---
As regards hot spots, the program has two:

int score[7] = { 0, };
for (Letters word : words)
/**/if (!(word & ~seven))
for_each_in_seven([&](Letters letter, int place) {
if (word & letter)
/**/score[place] += (word == seven) ? 3 : 1;
});

The first is executed 300M times, the second 3.3M times.
Inserting a counter bump before the second eliminates the slowdown:

if (word & letter) {
++count;
/**/score[place] += (word == seven) ? 3 : 1;
}

This fact seems consequential.  The surrounding for_each_in_seven
loop isn't doing popcounts, but is doing "while (v &= ~-v)".
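
That is, the idiom in question is, schematically:

    // Visit each set bit of v without any popcount: isolate the lowest set
    // bit, then clear it.  (Illustrative sketch of the idiom only.)
    unsigned v = seven;
    do {
        unsigned letter = v & -v;   // lowest set bit
        // ... use (letter, place) ...
    } while (v &= ~-v);             // clear it; stop when no bits remain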

I have repeated tests using -m[no-]bmi[2], with identical results
(i.e. no effect).


[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>

2015-08-12 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #11 from ncm at cantrip dot org ---
Aha, Uroš, I see your name in 

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

Please forgive me for "teaching" you about micro-ops.

The code being generated for all versions does use (e.g.)
"popcntq %rax, %rax" almost everywhere.  Not quite everywhere -- I see 
one "popcntq %rax, %rdx" -- but certainly in all the performance-sensitive 
bits.

[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>

2015-08-12 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #10 from ncm at cantrip dot org ---
I found this, which at first blush seems like it might be relevant.
The hardware complained about here is the same Haswell i7-4770.

http://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-count-variable-with-64-bit-introduces-crazy-performance


[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>

2015-08-12 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #9 from ncm at cantrip dot org ---
I did experiment with -m[no-]bmi[2] a fair bit.  It all made a significant
difference in the instructions emitted, but exactly zero difference in 
runtime.  That's actually not surprising at all: those instructions get
decomposed into micro-ops that exactly match those of the equivalent plain
instructions, those µops are cached, and the loops that dominate runtime
execute out of the micro-op cache.  The only real effect is maybe slightly
shorter object code, which could matter in a program dominated by bus traffic
with loops too big to cache well.  I say "maybe slightly shorter" because
instruction-set-extension instructions are actually huge, mostly prefixes.

I.e. most of the BMI stuff is marketing fluff, added mainly to make the 
competition waste money matching them instead of improving the product.


[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>

2015-08-11 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #8 from Uroš Bizjak  ---
(In reply to Uroš Bizjak from comment #7)

> Can you experiment a bit with -mno-bmi and/or -mno-bmi2 compile options?

Also, perf can record execution profiles, which will help you find the hot
spots.

[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>

2015-08-11 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #7 from Uroš Bizjak  ---
(In reply to ncm from comment #6)
> It seems worth adding that the same failure occurs without "-march=native".

Can you experiment a bit with -mno-bmi and/or -mno-bmi2 compile options?

[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>

2015-08-11 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #6 from ncm at cantrip dot org ---
It seems worth adding that the same failure occurs without "-march=native".


[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>

2015-08-10 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #5 from ncm at cantrip dot org ---
My preliminary conclusion is that a hardware optimization present in Haswell
but not in Westmere fails to recognize, in the unsigned int test case as
compiled by gcc-5, an opportunity that it does find in the original bitset
version.

I have also observed that adding an assertion that the array index is not
negative, before the first array access, slows the program a further 100%, 
on Westmere.

Note that the entire data set fits in L3 cache on all tested targets, so
memory bandwidth does not figure.

To my inexperienced eye the effects look like branch mispredictions.
I do not understand why a 3.4 GHz DDR3 Haswell runs as slowly as a 
2.4 GHz DDR2 Westmere, when branch prediction (or whatever it is) 
fails.


[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>

2015-08-10 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #4 from ncm at cantrip dot org ---
Also fails with 5.2.1 (Debian 5.2.1-15) 5.2.1 20150808.
As noted, the third version of the program, using bitset but not using
lambdas, is as slow as the version using unsigned int -- even when built
with gcc-4.9.  (Recall that the int version and the first bitset version
run fast when built with gcc-4.9.)

Confirmed that on Westmere, compiled with -march=native, all versions
run at about the same speed with all of the compiler versions reported,
and this runtime is about the same as the slow Haswell speed despite
the very different clock rates.


[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>

2015-08-09 Thread ncm at cantrip dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

--- Comment #3 from ncm at cantrip dot org ---
Created attachment 36159
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=36159&action=edit
bitset, but using an inlined container adapter, not lambdas, and slow

This version compiles just as badly as the integer version, even with gcc-4.9.


[Bug tree-optimization/67153] [5/6 Regression] integer optimizations 53% slower than std::bitset<>

2015-08-07 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Target|Linux amd64                 |x86_64-*-*
          Component|c++                         |tree-optimization
               Host|Linux amd64                 |
   Target Milestone|---                         |5.3
            Summary|integer optimizations 53%   |[5/6 Regression] integer
                   |slower than std::bitset<>   |optimizations 53% slower
                   |                            |than std::bitset<>
              Build|Linux amd64                 |