https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87264

            Bug ID: 87264
           Summary: missed optimization of std::find_if (predicate
                    inlining)
           Product: gcc
           Version: 8.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: skvadrik at gmail dot com
  Target Milestone: ---

A recent talk on the C++ Russia conference showed GCC performing ~1.5x worse
than Clang on a simple test that calls std::find_if on a large std::vector:

http://cppconf.ru/talks/mikhail-matrosov    (talk)
https://github.com/mmatrosov/ExtractRight   (benchmark code)

GCC's poor performance is caused by not inlining std::find_if and not
specializing it for the predicates. A number of factors contribute to this, so
I created a simplified benchmark (bench.cpp in attach) that can be compiled
with -D<FACTOR> options to show the impact of different factors on the
compilation. On -O3 level, dropping any of the defines results in ~1.5x
speedup, so that GCC becomes comparable to Clang. On -O2 it gets a bit harder,
but some combinations of the defines still have effect. See full bench results
below.

The benchmark code includes copy-pasted parts of the <algorithm> header (GCC's
headers are used with both GCC and Clang). 

Factors are:

-DUNROLL: use the manually unrolled specialized implementation of std::find_if
for random iterators (also complained about in
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84170), as opposed to the default
implementation for input iterators

-DWRAP: wrap function pointer (predicate) in a small class that stores the
pointer and confuses constant propagation (part of GCC's implementation of
<algorithm>)

-DFNPTR: use functions as predicates (the alternative is to use functor objects
that are much better inlined)

-DARG: one extra argument to std::find_if implementation, even unused, confuses
constant propagation and inlining


Benchmark results are as follows (see run_bench.sh in attach):

$ uname -por
4.17.0-gentoo Intel(R) Core(TM) i3 CPU M 350 @ 2.27GHz GNU/Linux

$ ./run_bench

gcc   -O3 -DUNROLL -DWRAP -DFNPTR -DARG:  0.32  1.78x slower
clang -O3 -DUNROLL -DWRAP -DFNPTR -DARG:  0.18

gcc   -O3          -DWRAP -DFNPTR -DARG:  0.23  1.15x slower
clang -O3          -DWRAP -DFNPTR -DARG:  0.20

gcc   -O3 -DUNROLL        -DFNPTR -DARG:  0.22  1.16x slower
clang -O3 -DUNROLL        -DFNPTR -DARG:  0.19

gcc   -O3 -DUNROLL -DWRAP         -DARG:  0.22  1.16x slower
clang -O3 -DUNROLL -DWRAP         -DARG:  0.19

gcc   -O3 -DUNROLL -DWRAP -DFNPTR      :  0.23  1.28x slower
clang -O3 -DUNROLL -DWRAP -DFNPTR      :  0.18

gcc   -O3                              :  0.24  1.14x slower
clang -O3                              :  0.21

gcc   -O2 -DUNROLL -DWRAP -DFNPTR -DARG:  0.38  2.11x slower
clang -O2 -DUNROLL -DWRAP -DFNPTR -DARG:  0.18

gcc   -O2          -DWRAP -DFNPTR -DARG:  0.29  1.38x slower
clang -O2          -DWRAP -DFNPTR -DARG:  0.21

gcc   -O2 -DUNROLL        -DFNPTR -DARG:  0.38  2.11x slower
clang -O2 -DUNROLL        -DFNPTR -DARG:  0.18

gcc   -O2 -DUNROLL -DWRAP         -DARG:  0.27  1.42x slower
clang -O2 -DUNROLL -DWRAP         -DARG:  0.19

gcc   -O2 -DUNROLL -DWRAP -DFNPTR      :  0.37  2.06x slower
clang -O2 -DUNROLL -DWRAP -DFNPTR      :  0.18

gcc   -O2                              :  0.29  1.45x slower
clang -O2                              :  0.20


I can't say it's a *bug*, but it definitely hurts to see GCC being that much
slower than Clang for no good reason. People do write code like this all the
time (if they use STL). 

I see different possible ways to fix this:

1. don't use "unrolled" std::find_if specialization unless the predicate is
inlined
2. use more aggressive inlining for simple predicates to STL algorithms

Option 2 is desirable anyway.

Reply via email to