https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226
--- Comment #8 from Jonathan Wakely <redi at gcc dot gnu.org> --- (In reply to Oleg Zaikin from comment #6) > When we switched from C++17-based g++ to C++20-based g++, the performance of > the whole program decreased by about 7 %. It turned out that the main reason > is the firstzero function. But if I create a microbenchmark for this function and replace countr_one(x) by __builtin_ctz(~x) I don't see any improvement. The problem is not that countr_one(x) calls countr_zero(~x). The problem is that the two versions of firstzero are completely different. If the (y ^ x) & y version is more efficient, just use that. In other words, this is not a problem with countr_one, it's that you chose to use countr_one here. For example, using google benchmark: #include <benchmark/benchmark.h> #include <bit> #include <vector> #include <iostream> #include <stdlib.h> unsigned total = 0; unsigned firstzero(const unsigned x) noexcept { const unsigned y = x+1; return (y ^ x) & y; } unsigned firstzero_cxx20(const unsigned x) noexcept { return x == unsigned(-1) ? 0 : unsigned(1) << std::countr_one(x); } unsigned firstzero_ctz(const unsigned x) noexcept { return x == unsigned(-1) ? 0 : unsigned(1) << __builtin_ctz(~x); } static void BM_orig(benchmark::State& state) { srand(0); std::vector<unsigned> v(1000000); for (auto& u : v) u = rand(); for (auto _ : state) { total = 0; for (auto& u : v) total += firstzero(u); } } static void BM_cxx20(benchmark::State& state) { unsigned total; srand(0); std::vector<unsigned> v(1000000); for (auto& u : v) u = rand(); for (auto _ : state) { total = 0; for (auto& u : v) total += firstzero_cxx20(u); } if (total != ::total) // check for correct result throw 1; } static void BM_ctz(benchmark::State& state) { unsigned total; srand(0); std::vector<unsigned> v(1000000); for (auto& u : v) u = rand(); for (auto _ : state) { total = 0; for (auto& u : v) total += firstzero_ctz(u); } if (total != ::total) // check for correct result throw 1; } BENCHMARK(BM_orig); BENCHMARK(BM_cxx20); BENCHMARK(BM_ctz); BENCHMARK_MAIN(); With GCC 10 I get: ----------------------------------------------------- Benchmark Time CPU Iterations ----------------------------------------------------- BM_orig 179283 ns 178739 ns 4388 BM_cxx20 888982 ns 886885 ns 770 BM_ctz 883696 ns 881765 ns 783 I think you've just chosen a bad algorithm. However, I do see a regression using GCC 11: ----------------------------------------------------- Benchmark Time CPU Iterations ----------------------------------------------------- BM_orig 174122 ns 172492 ns 4466 BM_cxx20 1104081 ns 1097202 ns 608 BM_ctz 1119849 ns 1112550 ns 634 That needs to be investigated, but it's a problem with the compiler. It has nothing to do with countr_one being implemented using countr_zero (as shown by the same problem being present when calling __builtin_ctz directly).