[Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs

2024-07-03 Thread kim.walisch at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

--- Comment #9 from kim.walisch at gmail dot com ---
Here I am providing some benchmark results to back up my claim that switching
to the integer modulo by a constant algorithm with 2 multiplication
instructions (which is the default in both Clang and MSVC) is faster than GCC's
current default algorithm using only 1 multiplication but more shifts, adds and
subs.

Here is a simple C program that can be used to benchmark both algorithms using
GCC. It computes integer modulo by a constant in a loop (10^10 iterations).
While this benchmark program may not be ideal for simulating real world use
cases, it is better than having no benchmark at all.

---

#include 
#include 
#include 
#include 

__attribute__((noinline))
uint64_t benchmark(uint64_t x, uint64_t inc, uint64_t iters) {
uint64_t sum = 0;

#pragma GCC unroll 1
for (uint64_t i = 0; i < iters; i++, x += inc)
sum += x % 240;

return sum;
}

int main() {
struct timeval start, end;
long seconds, useconds;
double mtime;

// Start timer
gettimeofday(&start, NULL);

uint64_t result = benchmark(0, 11, 100ull);

// End timer
gettimeofday(&end, NULL);

// Calculate the elapsed time in milliseconds
seconds  = end.tv_sec  - start.tv_sec;
useconds = end.tv_usec - start.tv_usec;

mtime = ((seconds) * 1000 + useconds / 1000.0);

printf("Result = %" PRIu64 "\n", result);
printf("The loop took %f milliseconds to execute.\n", mtime);

return 0;
}

---


This GCC command produces the assembly sequence with only 1 multiplication
instruction:

gcc -O3 bench.c -o bench1

This GCC command produces the assembly sequence with 2 multiplication
instructions:

gcc -O3 -mtune=skylake bench.c -o bench2


And here are the benchmark results:


CPU: AMD EPYC 9R14 (AMD Zen4 from 2023), Compiler: GCC 14 on Ubuntu 24.04.

$ ./bench1
Result = 119499360
The loop took 8385.613000 milliseconds to execute.
$ ./bench2
Result = 119499360
The loop took 5898.967000 milliseconds to execute.

The algorithm with 2 multiplications is 30% faster.


CPU: Intel i5-12600K (big.LITTLE), Performance CPU core, Compiler: GCC 13.2
(MinGW-w64)

$ ./bench1.exe
Result = 119499360
The loop took 5633.36 milliseconds to execute.
$ ./bench2.exe
Result = 119499360
The loop took 4369.167000 milliseconds to execute.

The algorithm with 2 multiplications is 23% faster.


CPU: Intel i5-12600K (big.LITTLE), Efficiency CPU core, Compiler: GCC 13.2
(MinGW-w64)

$ ./bench1.exe
Result = 119499360
The loop took 10788.097000 milliseconds to execute.
$ ./bench2.exe
Result = 119499360
The loop took 9453.191000 milliseconds to execute.

The algorithm with 2 multiplications is 12% faster.


One of the comments in PR 115756 was "I'd lean towards shift+add because for
example Intel E-cores have a slow imul.". However, my benchmarks suggest that
even on Intel Efficiency CPU cores the algorithm with 2 multiplication
instructions is faster. (I used the Process Lasso tool on Windows 11 to force
the benchmark to be run on an Efficiency CPU core).

[Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs

2024-07-02 Thread kim.walisch at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

--- Comment #4 from kim.walisch at gmail dot com ---
One possible explanation for why GCC's current integer division by a constant
assembly sequence was chosen back in the day (I guess one or two decades ago)
is that GCC's current assembly sequence uses only 1 mul instruction whereas
Clang uses 2 mul instructions.

Historically, multiplication instructions used to be slower than add, sub and
shift instructions on nearly all CPU architectures and so it made sense to
avoid mul instructions whenever possible. However in the past decade this
performance gap has narrowed and now it is more important to avoid long
instruction dependency chains which GCC's current integer modulo by a constant
assembly sequence suffers from.

[Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs

2024-07-02 Thread kim.walisch at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

--- Comment #3 from kim.walisch at gmail dot com ---
(In reply to Andrew Pinski from comment #2)
> This seems like a tuning issue. In that gcc thinks the shifts and stuff is
> faster than mulx.
> 
> What happens if you do -march=native?
> 
> Does it use mulx?

I tried using g++-14 using both -march=native and -march=x86-64-v4 (on a 12th
Gen Intel Core i5-12600K which supports BMI2 and AVX2) but GCC always produces
that same 11 instruction assembly sequence without mulx for the integer modulo
by a constant.

[Bug c++/115749] Missed BMI2 optimization on x86-64

2024-07-02 Thread kim.walisch at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

--- Comment #1 from kim.walisch at gmail dot com ---
I played a bit more with my C/C++ code snippet and managed to further simplify
it. The GCC performance issue seems to be mostly caused by GCC producing worse
assembly than Clang for the integer modulo by a constant on x86-64 CPUs:

unsigned long func(unsigned long x)
{
return x % 240;
}


GCC trunk produces the following 11 instruction assembly sequence (without
mulx) when compiled using -O3 -mbmi2:

func:
movabs  rax, -8608480567731124087
mul rdi
mov rax, rdx
shr rax, 7
mov rdx, rax
sal rdx, 4
sub rdx, rax
mov rax, rdi
sal rdx, 4
sub rax, rdx
ret

Clang trunk produces the following shorter and faster 8 instruction assembly
sequence (with mulx) when compiled using -O3 -mbmi2:

func:
mov rax, rdi
movabs  rcx, -8608480567731124087
mov rdx, rdi
mulxrcx, rcx, rcx
shr rcx, 7
imulrcx, rcx, 240
sub rax, rcx
ret

In my first post one can see that Clang uses mulx for both the integer division
by a constant and the integer modulo by a constant, while GCC does not use
mulx. However, for the integer division by a constant GCC uses the same number
of instructions as Clang (even without GCC using mulx) but for the integer
modulo by a constant GCC uses up to 30% more instructions and is noticeably
slower.

Please note that Clang's assembly is also shorter (8 asm instructions) than
GCC's assembly for the integer modulo by a constant on x86-64 CPUs when
compiling without -mbmi2 e.g. with just -O3:

func:
movabs  rcx, -8608480567731124087
mov rax, rdi
mul rcx
shr rdx, 7
imulrax, rdx, 240
sub rdi, rax
mov rax, rdi
ret

[Bug c++/115749] New: Missed BMI2 optimization on x86-64

2024-07-02 Thread kim.walisch at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

Bug ID: 115749
   Summary: Missed BMI2 optimization on x86-64
   Product: gcc
   Version: 14.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kim.walisch at gmail dot com
  Target Milestone: ---

Hi,

I have debugged a performance issue in one of my C++ applications on x86-64
CPUs where GCC produces noticeably slower code (using all GCC versions) than
Clang. I was able to find that the performance issue was caused by GCC not
using the mulx instruction from BMI2 even when compiling with -mbmi2. Clang on
the other hand used the mulx instruction producing a shorter and faster
assembly sequence. For this particular code sequence Clang used up to 30% fewer
instructions than GCC.

Here is a minimal C/C++ code snippet that reproduces the issue:


extern const unsigned long array[240];

unsigned long func(unsigned long x)
{
unsigned long index = x / 240;
return array[index % 240];
}



GCC trunk produces the following 15 instruction assembly sequence (without
mulx) when compiled using -O3 -mbmi2:

func(unsigned long):
movabs  rcx, -8608480567731124087
mov rax, rdi
mul rcx
mov rdi, rdx
shr rdi, 7
mov rax, rdi
mul rcx
shr rdx, 7
mov rax, rdx
sal rax, 4
sub rax, rdx
sal rax, 4
sub rdi, rax
mov rax, QWORD PTR array[0+rdi*8]
ret


Clang trunk produces the following shorter and faster 12 instruction assembly
sequence (with mulx) when compiled using -O3 -mbmi2:

func(unsigned long):   # @func(unsigned long)
movabs  rax, -8608480567731124087
mov rdx, rdi
mulxrdx, rdx, rax
shr rdx, 7
movabs  rax, 153722867280912931
mulxrax, rax, rax
shr eax
imuleax, eax, 240
sub edx, eax
mov rax, qword ptr [rip + array@GOTPCREL]
mov rax, qword ptr [rax + 8*rdx]
ret

[Bug c++/107452] New: Failed to catch C++ exception thrown from multiarch-function (x64 CPUs)

2022-10-28 Thread kim.walisch at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107452

Bug ID: 107452
   Summary: Failed to catch C++ exception thrown from
multiarch-function (x64 CPUs)
   Product: gcc
   Version: 11.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kim.walisch at gmail dot com
  Target Milestone: ---

Hi,

Tested using: GCC 11.2.0, Ubuntu 22.10 x64
Tested using: GCC 9.4.0, Ubuntu 18.04 x64

I am using the GCC multiarch feature (also known as function multiversioning:
https://gcc.gnu.org/onlinedocs/gcc/Function-Multiversioning.html) in my
primesieve C++ project to take advantage of the latest supported CPU
instruction set e.g. AVX, AVX2, AVX512 (on x64 CPUs).

Today I found out that if I throw a C++ exception from a multiarch-function and
I try to catch that exception outside of the originating multiarch-function but
within the same translation unit, then catching the exception fails and my
program simply aborts.

My exception is thrown from here:
https://github.com/kimwalisch/primesieve/blob/776c102f92905401613a83508d60744d41df7c73/src/PrimeGenerator.cpp#L332
It should be caught here:
https://github.com/kimwalisch/primesieve/blob/776c102f92905401613a83508d60744d41df7c73/src/iterator-c.cpp#L151



My bug can be reproduced using these steps:

git clone https://github.com/kimwalisch/primesieve.git
cd primesieve && mkdir build && cd build
git checkout 776c102f92905401613a83508d60744d41df7c73
CXXFLAGS="-O2 -Wall -Wextra -pedantic" cmake ..  -DBUILD_TESTS=ON
-DCMAKE_BUILD_TYPE=Debug -DWITH_MULTIARCH=ON  && make -j8
test/next_prime2

The test/next_prime2 will fail with the following error message:

terminate called after throwing an instance of 'primesieve::primesieve_error'
  what():  cannot generate primes > 2^64
Aborted



If I recompile without function multiversioning (-DWITH_MULTIARCH=OFF) the same
exception is caught successfully:

rm -rf *
CXXFLAGS="-O2 -Wall -Wextra -pedantic" cmake ..  -DBUILD_TESTS=ON
-DCMAKE_BUILD_TYPE=Debug -DWITH_MULTIARCH=OFF && make -j8
test/next_prime2

The test/next_prime2 completes successfully:

...
primesieve_iterator: cannot generate primes > 2^64
next_prime(18446744073709551615) = PRIMESIEVE_ERROR:   OK
primesieve_iterator: cannot generate primes > 2^64
next_prime(18446744073709551615) = PRIMESIEVE_ERROR:   OK

All tests passed successfully!



Clang also supports function multiversioning on Linux & x64 CPUs. And with
Clang this issue is not present, with Clang catching C++ exceptions thrown from
a multiarch-function works flawlessly (tested using Clang 14.0.0 on Ubuntu
22.10 x64):

rm -rf *
CXX=clang++ CC=clang CXXFLAGS="-O2 -Wall -Wextra -pedantic" cmake .. 
-DBUILD_TESTS=ON -DCMAKE_BUILD_TYPE=Debug -DWITH_MULTIARCH=ON && make -j8
test/next_prime2

The test/next_prime2 completes successfully:

...
primesieve_iterator: cannot generate primes > 2^64
next_prime(18446744073709551615) = PRIMESIEVE_ERROR:   OK
primesieve_iterator: cannot generate primes > 2^64
next_prime(18446744073709551615) = PRIMESIEVE_ERROR:   OK

All tests passed successfully!



Is this a known GCC issue? If needed I could also try to write a minimal test
that reproduces this issue.

[Bug c++/107287] New: -Wuninitialized false positive (in all GCC versions)

2022-10-17 Thread kim.walisch at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107287

Bug ID: 107287
   Summary: -Wuninitialized false positive (in all GCC versions)
   Product: gcc
   Version: 12.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kim.walisch at gmail dot com
  Target Milestone: ---

Hi,

In my primecount C++ project I have hit a -Wuninitialized false positive with
GCC. It happens basically with every g++ version >= 8 && <= 12 that I have
tested accross Ubuntu 18.04, 20.04 and 22.04 (x64). Unfortunately I have not
been able to create a minimal test case that reliably triggers this issue on
all GCC versions. However, I was able to create a minimal test case on
godbolt.org that triggers the issue on GCC trunk:

#include 
#include 

extern double get_time();

long func(std::vector& vect, bool print)
{
double time;
if (print)
time = get_time();

int sum = 0;
for (int n : vect)
sum += n;

if (print)
std::cout << time;

return sum;
}

When compiled with "x86-64 gcc (trunk)" and "-O3 -Wall -Werror -pedantic" this
produces the following warning (see https://godbolt.org/z/qv3W8j44b):



In file included from
/opt/compiler-explorer/gcc-trunk-20221017/include/c++/13.0.0/iostream:41,
 from :1:
In member function 'std::basic_ostream<_CharT, _Traits>::__ostream_type&
std::basic_ostream<_CharT, _Traits>::operator<<(double) [with _CharT = char;
_Traits = std::char_traits]',
inlined from 'long int func(std::vector&, bool)' at :18:22:
/opt/compiler-explorer/gcc-trunk-20221017/include/c++/13.0.0/ostream:223:25:
error: 'time' may be used uninitialized [-Werror=maybe-uninitialized]
  223 |   { return _M_insert(__f); }
  |~^
: In function 'long int func(std::vector&, bool)':
:9:12: note: 'time' was declared here
9 | double time;
  |^~~~
cc1plus: all warnings being treated as errors
Compiler returned: 1



Using my primecount project this GCC warning can be reproduced reliably using:


git clone https://github.com/kimwalisch/primecount.git
git checkout git checkout 57f18f9207dc00aaadcd3f4c1e76e320e1387061
cmake . -DCMAKE_CXX_FLAGS="-Wall -Wextra -pedantic -Werror"
make -j8

My code is basically identical to the minimal test case above
(https://github.com/kimwalisch/primecount/blob/57f18f9207dc00aaadcd3f4c1e76e320e1387061/src/phi.cpp#L386):


int64_t phi(int64_t x,
int64_t a,
int threads,
bool is_print)
{
  double time;

  if (is_print)
  {
print("");
print("=== phi(x, a) ===");
time = get_time();
  }

  int64_t sum = phi_OpenMP(x, a, threads);

  if (is_print)
print("phi", sum, time);

  return sum;
}


Also note that there is no warning if my code is compiled with Clang and
-Wuninitialized.

Regards,
Kim Walisch

[Bug tree-optimization/101831] Spurious maybe-uninitialized warning on std::array::size

2022-01-21 Thread kim.walisch at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101831

kim.walisch at gmail dot com changed:

   What|Removed |Added

 CC||kim.walisch at gmail dot com

--- Comment #2 from kim.walisch at gmail dot com ---
> I'm not sure that the std::array use case is common enough to justify the  
> potential for the false negatives

I just hit the same GCC warning on completely valid code. Both Clang & MSVC
correctly do not issue any warning.

void Foo::func()
{
  std::array pos;
  assert(pos.size() == static_global_array.size());
  ...
}

In member function ‘void Foo::func()’:
warning: ‘pos’ may be used uninitialized [-Wmaybe-uninitialized]
  312 |   assert(pos.size() == buffers_.size());