https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87264
Bug ID: 87264 Summary: missed optimization of std::find_if (predicate inlining) Product: gcc Version: 8.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: skvadrik at gmail dot com Target Milestone: --- A recent talk on the C++ Russia conference showed GCC performing ~1.5x worse than Clang on a simple test that calls std::find_if on a large std::vector: http://cppconf.ru/talks/mikhail-matrosov (talk) https://github.com/mmatrosov/ExtractRight (benchmark code) GCC's poor performance is caused by not inlining std::find_if and not specializing it for the predicates. A number of factors contribute to this, so I created a simplified benchmark (bench.cpp in attach) that can be compiled with -D<FACTOR> options to show the impact of different factors on the compilation. On -O3 level, dropping any of the defines results in ~1.5x speedup, so that GCC becomes comparable to Clang. On -O2 it gets a bit harder, but some combinations of the defines still have effect. See full bench results below. The benchmark code includes copy-pasted parts of the <algorithm> header (GCC's headers are used with both GCC and Clang). Factors are: -DUNROLL: use the manually unrolled specialized implementation of std::find_if for random iterators (also complained about in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84170), as opposed to the default implementation for input iterators -DWRAP: wrap function pointer (predicate) in a small class that stores the pointer and confuses constant propagation (part of GCC's implementation of <algorithm>) -DFNPTR: use functions as predicates (the alternative is to use functor objects that are much better inlined) -DARG: one extra argument to std::find_if implementation, even unused, confuses constant propagation and inlining Benchmark results are as follows (see run_bench.sh in attach): $ uname -por 4.17.0-gentoo Intel(R) Core(TM) i3 CPU M 350 @ 2.27GHz GNU/Linux $ ./run_bench gcc -O3 -DUNROLL -DWRAP -DFNPTR -DARG: 0.32 1.78x slower clang -O3 -DUNROLL -DWRAP -DFNPTR -DARG: 0.18 gcc -O3 -DWRAP -DFNPTR -DARG: 0.23 1.15x slower clang -O3 -DWRAP -DFNPTR -DARG: 0.20 gcc -O3 -DUNROLL -DFNPTR -DARG: 0.22 1.16x slower clang -O3 -DUNROLL -DFNPTR -DARG: 0.19 gcc -O3 -DUNROLL -DWRAP -DARG: 0.22 1.16x slower clang -O3 -DUNROLL -DWRAP -DARG: 0.19 gcc -O3 -DUNROLL -DWRAP -DFNPTR : 0.23 1.28x slower clang -O3 -DUNROLL -DWRAP -DFNPTR : 0.18 gcc -O3 : 0.24 1.14x slower clang -O3 : 0.21 gcc -O2 -DUNROLL -DWRAP -DFNPTR -DARG: 0.38 2.11x slower clang -O2 -DUNROLL -DWRAP -DFNPTR -DARG: 0.18 gcc -O2 -DWRAP -DFNPTR -DARG: 0.29 1.38x slower clang -O2 -DWRAP -DFNPTR -DARG: 0.21 gcc -O2 -DUNROLL -DFNPTR -DARG: 0.38 2.11x slower clang -O2 -DUNROLL -DFNPTR -DARG: 0.18 gcc -O2 -DUNROLL -DWRAP -DARG: 0.27 1.42x slower clang -O2 -DUNROLL -DWRAP -DARG: 0.19 gcc -O2 -DUNROLL -DWRAP -DFNPTR : 0.37 2.06x slower clang -O2 -DUNROLL -DWRAP -DFNPTR : 0.18 gcc -O2 : 0.29 1.45x slower clang -O2 : 0.20 I can't say it's a *bug*, but it definitely hurts to see GCC being that much slower than Clang for no good reason. People do write code like this all the time (if they use STL). I see different possible ways to fix this: 1. don't use "unrolled" std::find_if specialization unless the predicate is inlined 2. use more aggressive inlining for simple predicates to STL algorithms Option 2 is desirable anyway.