[Bug libstdc++/98226] Slow std::countr_one
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226 --- Comment #14 from CVS Commits --- The releases/gcc-10 branch has been updated by Jonathan Wakely : https://gcc.gnu.org/g:d7216ea6c0cd6c4fef06e9501bd630c3161b14fd commit r10-9576-gd7216ea6c0cd6c4fef06e9501bd630c3161b14fd Author: Jonathan Wakely Date: Thu Dec 10 21:57:42 2020 + libstdc++: Remove redundant branches in countl_one and countr_one [PR 98226] There's no need to explicitly check for the maximum value, because the function we call handles it correctly anyway. libstdc++-v3/ChangeLog: PR libstdc++/98226 * include/std/bit (__countl_one, __countr_one): Remove redundant branches. (cherry picked from commit 2ea62857a3fbdf091ba38cbb62e98dc76b198e2e)
[Bug libstdc++/98226] Slow std::countr_one
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226 Oleg Zaikin changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |INVALID --- Comment #13 from Oleg Zaikin --- countr_one() should be used where it suits better.
[Bug libstdc++/98226] Slow std::countr_one
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226 --- Comment #12 from Oleg Zaikin --- (In reply to Alexander Monakov from comment #10) > But why you are trying to use a more complex branchy expression in C++17 > mode when you already have a more efficient expression as a "fallback"? > > Note that a cheaper way is available: > > return (x+1) & ~x; > > (though gcc can optimize '(y ^ x) & y' you have to the same machine code) Thank you! We will try it.
[Bug libstdc++/98226] Slow std::countr_one
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226 --- Comment #11 from Oleg Zaikin --- (In reply to Jonathan Wakely from comment #8) > That needs to be investigated, but it's a problem with the compiler. It has > nothing to do with countr_one being implemented using countr_zero (as shown > by the same problem being present when calling __builtin_ctz directly). Thank you for the detailed clarification!
[Bug libstdc++/98226] Slow std::countr_one
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #10 from Alexander Monakov --- (In reply to Oleg Zaikin from comment #6) > There > we have a function firstzero(unsigned x) that returns 2^i, with i the > position of the first 0 of x, and 0 iff there is no 0. Its implementation is: > unsigned firstzero(const unsigned x) noexcept { > #if __cplusplus > 201703L > return x == unsigned(-1) ? 0 : unsigned(1) << std::countr_one(x); > #else > const unsigned y = x+1; return (y ^ x) & y; > #endif > } But why you are trying to use a more complex branchy expression in C++17 mode when you already have a more efficient expression as a "fallback"? Note that a cheaper way is available: return (x+1) & ~x; (though gcc can optimize '(y ^ x) & y' you have to the same machine code)
[Bug libstdc++/98226] Slow std::countr_one
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226 --- Comment #9 from Jonathan Wakely --- The generated code hasn't changed between gcc-10 and gcc-11 though, so the difference must be in the code used to run the benchmarks, not the code under test. See https://godbolt.org/z/bWeMen
[Bug libstdc++/98226] Slow std::countr_one
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226 --- Comment #8 from Jonathan Wakely --- (In reply to Oleg Zaikin from comment #6) > When we switched from C++17-based g++ to C++20-based g++, the performance of > the whole program decreased by about 7 %. It turned out that the main reason > is the firstzero function. But if I create a microbenchmark for this function and replace countr_one(x) by __builtin_ctz(~x) I don't see any improvement. The problem is not that countr_one(x) calls countr_zero(~x). The problem is that the two versions of firstzero are completely different. If the (y ^ x) & y version is more efficient, just use that. In other words, this is not a problem with countr_one, it's that you chose to use countr_one here. For example, using google benchmark: #include #include #include #include #include unsigned total = 0; unsigned firstzero(const unsigned x) noexcept { const unsigned y = x+1; return (y ^ x) & y; } unsigned firstzero_cxx20(const unsigned x) noexcept { return x == unsigned(-1) ? 0 : unsigned(1) << std::countr_one(x); } unsigned firstzero_ctz(const unsigned x) noexcept { return x == unsigned(-1) ? 0 : unsigned(1) << __builtin_ctz(~x); } static void BM_orig(benchmark::State& state) { srand(0); std::vector v(100); for (auto& u : v) u = rand(); for (auto _ : state) { total = 0; for (auto& u : v) total += firstzero(u); } } static void BM_cxx20(benchmark::State& state) { unsigned total; srand(0); std::vector v(100); for (auto& u : v) u = rand(); for (auto _ : state) { total = 0; for (auto& u : v) total += firstzero_cxx20(u); } if (total != ::total) // check for correct result throw 1; } static void BM_ctz(benchmark::State& state) { unsigned total; srand(0); std::vector v(100); for (auto& u : v) u = rand(); for (auto _ : state) { total = 0; for (auto& u : v) total += firstzero_ctz(u); } if (total != ::total) // check for correct result throw 1; } BENCHMARK(BM_orig); BENCHMARK(BM_cxx20); BENCHMARK(BM_ctz); BENCHMARK_MAIN(); With GCC 10 I get: - Benchmark Time CPU Iterations - BM_orig179283 ns 178739 ns 4388 BM_cxx20 888982 ns 886885 ns 770 BM_ctz 883696 ns 881765 ns 783 I think you've just chosen a bad algorithm. However, I do see a regression using GCC 11: - Benchmark Time CPU Iterations - BM_orig174122 ns 172492 ns 4466 BM_cxx20 1104081 ns 1097202 ns 608 BM_ctz1119849 ns 1112550 ns 634 That needs to be investigated, but it's a problem with the compiler. It has nothing to do with countr_one being implemented using countr_zero (as shown by the same problem being present when calling __builtin_ctz directly).
[Bug libstdc++/98226] Slow std::countr_one
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226 --- Comment #7 from Oleg Zaikin --- (In reply to Jonathan Wakely from comment #5) > I've removed some redundant code from them, but not changed the indirection > that this PR complains about. I don't plan to change that. Thank you! I've got your point regarding the indirection.
[Bug libstdc++/98226] Slow std::countr_one
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226 --- Comment #6 from Oleg Zaikin --- (In reply to Jonathan Wakely from comment #2) > Oh, but you didn't enable any optimization at all, so who cares about the > performance? Let me give the whole picture. The issue is very close to that from https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759 where a C-style implementation was replaced by C++20 std::has_single_bit. We have a complex project that of course is compiled with all proper optimization flags. There we have a function firstzero(unsigned x) that returns 2^i, with i the position of the first 0 of x, and 0 iff there is no 0. Its implementation is: unsigned firstzero(const unsigned x) noexcept { #if __cplusplus > 201703L return x == unsigned(-1) ? 0 : unsigned(1) << std::countr_one(x); #else const unsigned y = x+1; return (y ^ x) & y; #endif } When we switched from C++17-based g++ to C++20-based g++, the performance of the whole program decreased by about 7 %. It turned out that the main reason is the firstzero function.
[Bug libstdc++/98226] Slow std::countr_one
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226 --- Comment #5 from Jonathan Wakely --- I've removed some redundant code from them, but not changed the indirection that this PR complains about. I don't plan to change that.
[Bug libstdc++/98226] Slow std::countr_one
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226 --- Comment #4 from CVS Commits --- The master branch has been updated by Jonathan Wakely : https://gcc.gnu.org/g:2ea62857a3fbdf091ba38cbb62e98dc76b198e2e commit r11-5922-g2ea62857a3fbdf091ba38cbb62e98dc76b198e2e Author: Jonathan Wakely Date: Thu Dec 10 21:57:42 2020 + libstdc++: Remove redundant branches in countl_one and countr_one [PR 98226] There's no need to explicitly check for the maximum value, because the function we call handles it correctly anyway. libstdc++-v3/ChangeLog: PR libstdc++/98226 * include/std/bit (__countl_one, __countr_one): Remove redundant branches.
[Bug libstdc++/98226] Slow std::countr_one
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226 --- Comment #3 from Andrew Pinski --- (In reply to Jonathan Wakely from comment #2) > Oh, but you didn't enable any optimization at all, so who cares about the > performance? I was thinking the same.
[Bug libstdc++/98226] Slow std::countr_one
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226 --- Comment #2 from Jonathan Wakely --- Oh, but you didn't enable any optimization at all, so who cares about the performance?
[Bug libstdc++/98226] Slow std::countr_one
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226 --- Comment #1 from Jonathan Wakely --- This seems like an optimizer bug. There is no way I'm going to repeat the entire body of countr_zero in countr_one.