[Bug libstdc++/98226] Slow std::countr_one

2021-03-29 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226

--- Comment #14 from CVS Commits  ---
The releases/gcc-10 branch has been updated by Jonathan Wakely
:

https://gcc.gnu.org/g:d7216ea6c0cd6c4fef06e9501bd630c3161b14fd

commit r10-9576-gd7216ea6c0cd6c4fef06e9501bd630c3161b14fd
Author: Jonathan Wakely 
Date:   Thu Dec 10 21:57:42 2020 +

libstdc++: Remove redundant branches in countl_one and countr_one [PR
98226]

There's no need to explicitly check for the maximum value, because the
function we call handles it correctly anyway.

libstdc++-v3/ChangeLog:

PR libstdc++/98226
* include/std/bit (__countl_one, __countr_one): Remove redundant
branches.

(cherry picked from commit 2ea62857a3fbdf091ba38cbb62e98dc76b198e2e)

[Bug libstdc++/98226] Slow std::countr_one

2020-12-11 Thread zaikin.icc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226

Oleg Zaikin  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |INVALID

--- Comment #13 from Oleg Zaikin  ---
countr_one() should be used where it suits better.

[Bug libstdc++/98226] Slow std::countr_one

2020-12-11 Thread zaikin.icc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226

--- Comment #12 from Oleg Zaikin  ---
(In reply to Alexander Monakov from comment #10)
> But why you are trying to use a more complex branchy expression in C++17
> mode when you already have a more efficient expression as a "fallback"?
> 
> Note that a cheaper way is available:
> 
> return (x+1) & ~x;
> 
> (though gcc can optimize '(y ^ x) & y' you have to the same machine code)

Thank you! We will try it.

[Bug libstdc++/98226] Slow std::countr_one

2020-12-11 Thread zaikin.icc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226

--- Comment #11 from Oleg Zaikin  ---
(In reply to Jonathan Wakely from comment #8)
> That needs to be investigated, but it's a problem with the compiler. It has
> nothing to do with countr_one being implemented using countr_zero (as shown
> by the same problem being present when calling __builtin_ctz directly).

Thank you for the detailed clarification!

[Bug libstdc++/98226] Slow std::countr_one

2020-12-11 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #10 from Alexander Monakov  ---
(In reply to Oleg Zaikin from comment #6)
> There
> we have a function firstzero(unsigned x) that returns 2^i, with i the
> position of the first 0 of x, and 0 iff there is no 0. Its implementation is:
>   unsigned firstzero(const unsigned x) noexcept {
> #if __cplusplus > 201703L
> return x == unsigned(-1) ? 0 : unsigned(1) << std::countr_one(x);
> #else
> const unsigned y = x+1; return (y ^ x) & y;
> #endif
>   }

But why you are trying to use a more complex branchy expression in C++17 mode
when you already have a more efficient expression as a "fallback"?

Note that a cheaper way is available:

return (x+1) & ~x;

(though gcc can optimize '(y ^ x) & y' you have to the same machine code)

[Bug libstdc++/98226] Slow std::countr_one

2020-12-11 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226

--- Comment #9 from Jonathan Wakely  ---
The generated code hasn't changed between gcc-10 and gcc-11 though, so the
difference must be in the code used to run the benchmarks, not the code under
test.
See https://godbolt.org/z/bWeMen

[Bug libstdc++/98226] Slow std::countr_one

2020-12-11 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226

--- Comment #8 from Jonathan Wakely  ---
(In reply to Oleg Zaikin from comment #6)
> When we switched from C++17-based g++ to C++20-based g++, the performance of
> the whole program decreased by about 7 %. It turned out that the main reason
> is the firstzero function.

But if I create a microbenchmark for this function and replace countr_one(x) by
__builtin_ctz(~x) I don't see any improvement. The problem is not that
countr_one(x) calls countr_zero(~x). The problem is that the two versions of
firstzero are completely different.

If the (y ^ x) & y version is more efficient, just use that.

In other words, this is not a problem with countr_one, it's that you chose to
use countr_one here.

For example, using google benchmark:

#include 
#include 
#include 
#include 
#include 

unsigned total = 0;

unsigned firstzero(const unsigned x) noexcept {
  const unsigned y = x+1;
  return (y ^ x) & y;
}

unsigned firstzero_cxx20(const unsigned x) noexcept {
  return x == unsigned(-1) ? 0 : unsigned(1) << std::countr_one(x);
}

unsigned firstzero_ctz(const unsigned x) noexcept {
  return x == unsigned(-1) ? 0 : unsigned(1) << __builtin_ctz(~x);
}

static void BM_orig(benchmark::State& state) {
  srand(0);
  std::vector v(100);
  for (auto& u : v)
u = rand();
  for (auto _ : state)
  {
total = 0;
for (auto& u : v)
  total += firstzero(u);
  }
}

static void BM_cxx20(benchmark::State& state) {
  unsigned total;
  srand(0);
  std::vector v(100);
  for (auto& u : v)
u = rand();
  for (auto _ : state)
  {
total = 0;
for (auto& u : v)
  total += firstzero_cxx20(u);
  }
  if (total != ::total) // check for correct result
throw 1;
}

static void BM_ctz(benchmark::State& state) {
  unsigned total;
  srand(0);
  std::vector v(100);
  for (auto& u : v)
u = rand();
  for (auto _ : state)
  {
total = 0;
for (auto& u : v)
  total += firstzero_ctz(u);
  }
  if (total != ::total) // check for correct result
throw 1;
}


BENCHMARK(BM_orig);
BENCHMARK(BM_cxx20);
BENCHMARK(BM_ctz);
BENCHMARK_MAIN();


With GCC 10 I get:

-
Benchmark   Time CPU   Iterations
-
BM_orig179283 ns   178739 ns 4388
BM_cxx20   888982 ns   886885 ns  770
BM_ctz 883696 ns   881765 ns  783

I think you've just chosen a bad algorithm.

However, I do see a regression using GCC 11:

-
Benchmark   Time CPU   Iterations
-
BM_orig174122 ns   172492 ns 4466
BM_cxx20  1104081 ns  1097202 ns  608
BM_ctz1119849 ns  1112550 ns  634

That needs to be investigated, but it's a problem with the compiler. It has
nothing to do with countr_one being implemented using countr_zero (as shown by
the same problem being present when calling __builtin_ctz directly).

[Bug libstdc++/98226] Slow std::countr_one

2020-12-11 Thread zaikin.icc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226

--- Comment #7 from Oleg Zaikin  ---
(In reply to Jonathan Wakely from comment #5)
> I've removed some redundant code from them, but not changed the indirection
> that this PR complains about. I don't plan to change that.

Thank you! I've got your point regarding the indirection.

[Bug libstdc++/98226] Slow std::countr_one

2020-12-11 Thread zaikin.icc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226

--- Comment #6 from Oleg Zaikin  ---
(In reply to Jonathan Wakely from comment #2)
> Oh, but you didn't enable any optimization at all, so who cares about the
> performance?

Let me give the whole picture. The issue is very close to that from
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759 where a C-style
implementation was replaced by C++20 std::has_single_bit. We have a complex
project that of course is compiled with all proper optimization flags. There we
have a function firstzero(unsigned x) that returns 2^i, with i the position of
the first 0 of x, and 0 iff there is no 0. Its implementation is:
  unsigned firstzero(const unsigned x) noexcept {
#if __cplusplus > 201703L
return x == unsigned(-1) ? 0 : unsigned(1) << std::countr_one(x);
#else
const unsigned y = x+1; return (y ^ x) & y;
#endif
  }
When we switched from C++17-based g++ to C++20-based g++, the performance of
the whole program decreased by about 7 %. It turned out that the main reason is
the firstzero function.

[Bug libstdc++/98226] Slow std::countr_one

2020-12-10 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226

--- Comment #5 from Jonathan Wakely  ---
I've removed some redundant code from them, but not changed the indirection
that this PR complains about. I don't plan to change that.

[Bug libstdc++/98226] Slow std::countr_one

2020-12-10 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226

--- Comment #4 from CVS Commits  ---
The master branch has been updated by Jonathan Wakely :

https://gcc.gnu.org/g:2ea62857a3fbdf091ba38cbb62e98dc76b198e2e

commit r11-5922-g2ea62857a3fbdf091ba38cbb62e98dc76b198e2e
Author: Jonathan Wakely 
Date:   Thu Dec 10 21:57:42 2020 +

libstdc++: Remove redundant branches in countl_one and countr_one [PR
98226]

There's no need to explicitly check for the maximum value, because the
function we call handles it correctly anyway.

libstdc++-v3/ChangeLog:

PR libstdc++/98226
* include/std/bit (__countl_one, __countr_one): Remove redundant
branches.

[Bug libstdc++/98226] Slow std::countr_one

2020-12-10 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226

--- Comment #3 from Andrew Pinski  ---
(In reply to Jonathan Wakely from comment #2)
> Oh, but you didn't enable any optimization at all, so who cares about the
> performance?

I was thinking the same.

[Bug libstdc++/98226] Slow std::countr_one

2020-12-10 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226

--- Comment #2 from Jonathan Wakely  ---
Oh, but you didn't enable any optimization at all, so who cares about the
performance?

[Bug libstdc++/98226] Slow std::countr_one

2020-12-10 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226

--- Comment #1 from Jonathan Wakely  ---
This seems like an optimizer bug. There is no way I'm going to repeat the
entire body of countr_zero in countr_one.