[Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

--- Comment #9 from kim.walisch at gmail dot com ---

Here I am providing some benchmark results to back up my claim that switching to the integer modulo by a constant algorithm with 2 multiplication instructions (the default in both Clang and MSVC) is faster than GCC's current default algorithm, which uses only 1 multiplication but more shifts, adds and subs.

Here is a simple C program that can be used to benchmark both algorithms using GCC. It computes integer modulo by a constant in a loop (10^10 iterations). While this benchmark program may not be ideal for simulating real-world use cases, it is better than having no benchmark at all.

---
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <sys/time.h>

__attribute__((noinline))
uint64_t benchmark(uint64_t x, uint64_t inc, uint64_t iters)
{
    uint64_t sum = 0;

    #pragma GCC unroll 1
    for (uint64_t i = 0; i < iters; i++, x += inc)
        sum += x % 240;

    return sum;
}

int main()
{
    struct timeval start, end;
    long seconds, useconds;
    double mtime;

    // Start timer
    gettimeofday(&start, NULL);

    uint64_t result = benchmark(0, 11, 10000000000ull);

    // End timer
    gettimeofday(&end, NULL);

    // Calculate the elapsed time in milliseconds
    seconds = end.tv_sec - start.tv_sec;
    useconds = end.tv_usec - start.tv_usec;
    mtime = ((seconds) * 1000 + useconds / 1000.0);

    printf("Result = %" PRIu64 "\n", result);
    printf("The loop took %f milliseconds to execute.\n", mtime);

    return 0;
}
---

This GCC command produces the assembly sequence with only 1 multiplication instruction:

gcc -O3 bench.c -o bench1

This GCC command produces the assembly sequence with 2 multiplication instructions:

gcc -O3 -mtune=skylake bench.c -o bench2

And here are the benchmark results:

CPU: AMD EPYC 9R14 (AMD Zen 4 from 2023), Compiler: GCC 14 on Ubuntu 24.04.

$ ./bench1
Result = 119499360
The loop took 8385.613000 milliseconds to execute.
$ ./bench2
Result = 119499360
The loop took 5898.967000 milliseconds to execute.
The algorithm with 2 multiplications is 30% faster.

CPU: Intel i5-12600K (big.LITTLE), Performance CPU core, Compiler: GCC 13.2 (MinGW-w64)

$ ./bench1.exe
Result = 119499360
The loop took 5633.36 milliseconds to execute.
$ ./bench2.exe
Result = 119499360
The loop took 4369.167000 milliseconds to execute.

The algorithm with 2 multiplications is 23% faster.

CPU: Intel i5-12600K (big.LITTLE), Efficiency CPU core, Compiler: GCC 13.2 (MinGW-w64)

$ ./bench1.exe
Result = 119499360
The loop took 10788.097000 milliseconds to execute.
$ ./bench2.exe
Result = 119499360
The loop took 9453.191000 milliseconds to execute.

The algorithm with 2 multiplications is 12% faster.

One of the comments in PR 115756 was: "I'd lean towards shift+add because for example Intel E-cores have a slow imul." However, my benchmarks suggest that even on Intel Efficiency CPU cores the algorithm with 2 multiplication instructions is faster. (I used the Process Lasso tool on Windows 11 to force the benchmark to run on an Efficiency CPU core.)
[Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

--- Comment #4 from kim.walisch at gmail dot com ---

One possible explanation for why GCC's current integer division by a constant assembly sequence was chosen back in the day (I guess one or two decades ago) is that it uses only 1 mul instruction, whereas Clang uses 2 mul instructions. Historically, multiplication instructions were slower than add, sub and shift instructions on nearly all CPU architectures, so it made sense to avoid mul instructions whenever possible. Over the past decade this performance gap has narrowed, however, and it is now more important to avoid the long instruction dependency chains that GCC's current integer modulo by a constant assembly sequence suffers from.
[Bug target/115749] Non optimal assembly for integer modulo by a constant on x86-64 CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

--- Comment #3 from kim.walisch at gmail dot com ---

(In reply to Andrew Pinski from comment #2)
> This seems like a tuning issue. In that gcc thinks the shifts and stuff is
> faster than mulx.
>
> What happens if you do -march=native?
>
> Does it use mulx?

I tried g++-14 with both -march=native and -march=x86-64-v4 (on a 12th Gen Intel Core i5-12600K, which supports BMI2 and AVX2), but GCC always produces the same 11 instruction assembly sequence without mulx for the integer modulo by a constant.
[Bug c++/115749] Missed BMI2 optimization on x86-64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

--- Comment #1 from kim.walisch at gmail dot com ---

I played a bit more with my C/C++ code snippet and managed to simplify it further. The GCC performance issue seems to be mostly caused by GCC producing worse assembly than Clang for the integer modulo by a constant on x86-64 CPUs:

unsigned long func(unsigned long x)
{
    return x % 240;
}

GCC trunk produces the following 11 instruction assembly sequence (without mulx) when compiled using -O3 -mbmi2:

func:
        movabs  rax, -8608480567731124087
        mul     rdi
        mov     rax, rdx
        shr     rax, 7
        mov     rdx, rax
        sal     rdx, 4
        sub     rdx, rax
        mov     rax, rdi
        sal     rdx, 4
        sub     rax, rdx
        ret

Clang trunk produces the following shorter and faster 8 instruction assembly sequence (with mulx) when compiled using -O3 -mbmi2:

func:
        mov     rax, rdi
        movabs  rcx, -8608480567731124087
        mov     rdx, rdi
        mulx    rcx, rcx, rcx
        shr     rcx, 7
        imul    rcx, rcx, 240
        sub     rax, rcx
        ret

In my first post one can see that Clang uses mulx for both the integer division by a constant and the integer modulo by a constant, while GCC does not use mulx. However, for the integer division by a constant GCC uses the same number of instructions as Clang (even without using mulx), but for the integer modulo by a constant GCC uses up to 30% more instructions and is noticeably slower.

Please note that Clang's assembly is also shorter (8 asm instructions) than GCC's assembly for the integer modulo by a constant on x86-64 CPUs when compiling without -mbmi2, e.g. with just -O3:

func:
        movabs  rcx, -8608480567731124087
        mov     rax, rdi
        mul     rcx
        shr     rdx, 7
        imul    rax, rdx, 240
        sub     rdi, rax
        mov     rax, rdi
        ret
[Bug c++/115749] New: Missed BMI2 optimization on x86-64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

            Bug ID: 115749
           Summary: Missed BMI2 optimization on x86-64
           Product: gcc
           Version: 14.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: kim.walisch at gmail dot com
  Target Milestone: ---

Hi,

I have debugged a performance issue in one of my C++ applications on x86-64 CPUs where GCC produces noticeably slower code (using all GCC versions) than Clang. I was able to find that the performance issue was caused by GCC not using the mulx instruction from BMI2, even when compiling with -mbmi2. Clang, on the other hand, used the mulx instruction, producing a shorter and faster assembly sequence. For this particular code sequence Clang used up to 30% fewer instructions than GCC.

Here is a minimal C/C++ code snippet that reproduces the issue:

extern const unsigned long array[240];

unsigned long func(unsigned long x)
{
    unsigned long index = x / 240;
    return array[index % 240];
}

GCC trunk produces the following 15 instruction assembly sequence (without mulx) when compiled using -O3 -mbmi2:

func(unsigned long):
        movabs  rcx, -8608480567731124087
        mov     rax, rdi
        mul     rcx
        mov     rdi, rdx
        shr     rdi, 7
        mov     rax, rdi
        mul     rcx
        shr     rdx, 7
        mov     rax, rdx
        sal     rax, 4
        sub     rax, rdx
        sal     rax, 4
        sub     rdi, rax
        mov     rax, QWORD PTR array[0+rdi*8]
        ret

Clang trunk produces the following shorter and faster 12 instruction assembly sequence (with mulx) when compiled using -O3 -mbmi2:

func(unsigned long):                    # @func(unsigned long)
        movabs  rax, -8608480567731124087
        mov     rdx, rdi
        mulx    rdx, rdx, rax
        shr     rdx, 7
        movabs  rax, 153722867280912931
        mulx    rax, rax, rax
        shr     eax
        imul    eax, eax, 240
        sub     edx, eax
        mov     rax, qword ptr [rip + array@GOTPCREL]
        mov     rax, qword ptr [rax + 8*rdx]
        ret
[Bug c++/107452] New: Failed to catch C++ exception thrown from multiarch-function (x64 CPUs)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107452

            Bug ID: 107452
           Summary: Failed to catch C++ exception thrown from
                    multiarch-function (x64 CPUs)
           Product: gcc
           Version: 11.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: kim.walisch at gmail dot com
  Target Milestone: ---

Hi,

Tested using: GCC 11.2.0, Ubuntu 22.10 x64
Tested using: GCC 9.4.0, Ubuntu 18.04 x64

I am using the GCC multiarch feature (also known as function multiversioning: https://gcc.gnu.org/onlinedocs/gcc/Function-Multiversioning.html) in my primesieve C++ project to take advantage of the latest supported CPU instruction set, e.g. AVX, AVX2, AVX512 (on x64 CPUs). Today I found out that if I throw a C++ exception from a multiarch-function and try to catch that exception outside of the originating multiarch-function but within the same translation unit, then catching the exception fails and my program simply aborts.

My exception is thrown from here:
https://github.com/kimwalisch/primesieve/blob/776c102f92905401613a83508d60744d41df7c73/src/PrimeGenerator.cpp#L332

It should be caught here:
https://github.com/kimwalisch/primesieve/blob/776c102f92905401613a83508d60744d41df7c73/src/iterator-c.cpp#L151

My bug can be reproduced using these steps:

git clone https://github.com/kimwalisch/primesieve.git
cd primesieve && mkdir build && cd build
git checkout 776c102f92905401613a83508d60744d41df7c73
CXXFLAGS="-O2 -Wall -Wextra -pedantic" cmake .. -DBUILD_TESTS=ON -DCMAKE_BUILD_TYPE=Debug -DWITH_MULTIARCH=ON && make -j8
test/next_prime2

The test/next_prime2 will fail with the following error message:

terminate called after throwing an instance of 'primesieve::primesieve_error'
  what():  cannot generate primes > 2^64
Aborted

If I recompile without function multiversioning (-DWITH_MULTIARCH=OFF) the same exception is caught successfully:

rm -rf *
CXXFLAGS="-O2 -Wall -Wextra -pedantic" cmake ..
-DBUILD_TESTS=ON -DCMAKE_BUILD_TYPE=Debug -DWITH_MULTIARCH=OFF && make -j8
test/next_prime2

The test/next_prime2 completes successfully:

...
primesieve_iterator: cannot generate primes > 2^64
next_prime(18446744073709551615) = PRIMESIEVE_ERROR: OK
primesieve_iterator: cannot generate primes > 2^64
next_prime(18446744073709551615) = PRIMESIEVE_ERROR: OK
All tests passed successfully!

Clang also supports function multiversioning on Linux & x64 CPUs, and with Clang this issue is not present: catching C++ exceptions thrown from a multiarch-function works flawlessly (tested using Clang 14.0.0 on Ubuntu 22.10 x64):

rm -rf *
CXX=clang++ CC=clang CXXFLAGS="-O2 -Wall -Wextra -pedantic" cmake .. -DBUILD_TESTS=ON -DCMAKE_BUILD_TYPE=Debug -DWITH_MULTIARCH=ON && make -j8
test/next_prime2

The test/next_prime2 completes successfully:

...
primesieve_iterator: cannot generate primes > 2^64
next_prime(18446744073709551615) = PRIMESIEVE_ERROR: OK
primesieve_iterator: cannot generate primes > 2^64
next_prime(18446744073709551615) = PRIMESIEVE_ERROR: OK
All tests passed successfully!

Is this a known GCC issue? If needed I could also try to write a minimal test that reproduces this issue.
[Bug c++/107287] New: -Wuninitialized false positive (in all GCC versions)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107287

            Bug ID: 107287
           Summary: -Wuninitialized false positive (in all GCC versions)
           Product: gcc
           Version: 12.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: kim.walisch at gmail dot com
  Target Milestone: ---

Hi,

In my primecount C++ project I have hit a -Wuninitialized false positive with GCC. It happens with basically every g++ version >= 8 && <= 12 that I have tested across Ubuntu 18.04, 20.04 and 22.04 (x64). Unfortunately I have not been able to create a minimal test case that reliably triggers this issue on all GCC versions. However, I was able to create a minimal test case on godbolt.org that triggers the issue on GCC trunk:

#include <iostream>
#include <vector>

extern double get_time();

long func(std::vector<int>& vect, bool print)
{
    double time;
    if (print)
        time = get_time();

    int sum = 0;
    for (int n : vect)
        sum += n;

    if (print)
        std::cout << time;

    return sum;
}

When compiled with "x86-64 gcc (trunk)" and "-O3 -Wall -Werror -pedantic" this produces the following warning (see https://godbolt.org/z/qv3W8j44b):

In file included from /opt/compiler-explorer/gcc-trunk-20221017/include/c++/13.0.0/iostream:41,
                 from <source>:1:
In member function 'std::basic_ostream<_CharT, _Traits>::__ostream_type& std::basic_ostream<_CharT, _Traits>::operator<<(double) [with _CharT = char; _Traits = std::char_traits<char>]',
    inlined from 'long int func(std::vector<int>&, bool)' at <source>:18:22:
/opt/compiler-explorer/gcc-trunk-20221017/include/c++/13.0.0/ostream:223:25: error: 'time' may be used uninitialized [-Werror=maybe-uninitialized]
  223 |       { return _M_insert(__f); }
      |                ~~~~~~~~~^~~~~
<source>: In function 'long int func(std::vector<int>&, bool)':
<source>:9:12: note: 'time' was declared here
    9 |     double time;
      |            ^~~~
cc1plus: all warnings being treated as errors
Compiler returned: 1

Using my primecount project this GCC warning can be reproduced reliably using:

git clone https://github.com/kimwalisch/primecount.git
git checkout 57f18f9207dc00aaadcd3f4c1e76e320e1387061
cmake . -DCMAKE_CXX_FLAGS="-Wall -Wextra -pedantic -Werror"
make -j8

My code is basically identical to the minimal test case above
(https://github.com/kimwalisch/primecount/blob/57f18f9207dc00aaadcd3f4c1e76e320e1387061/src/phi.cpp#L386):

int64_t phi(int64_t x, int64_t a, int threads, bool is_print)
{
    double time;

    if (is_print)
    {
        print("");
        print("=== phi(x, a) ===");
        time = get_time();
    }

    int64_t sum = phi_OpenMP(x, a, threads);

    if (is_print)
        print("phi", sum, time);

    return sum;
}

Also note that there is no warning if my code is compiled with Clang and -Wuninitialized.

Regards,
Kim Walisch
[Bug tree-optimization/101831] Spurious maybe-uninitialized warning on std::array::size
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101831

kim.walisch at gmail dot com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kim.walisch at gmail dot com

--- Comment #2 from kim.walisch at gmail dot com ---

> I'm not sure that the std::array use case is common enough to justify the
> potential for the false negatives

I just hit the same GCC warning on completely valid code. Both Clang & MSVC correctly do not issue any warning.

void Foo::func()
{
    std::array pos;
    assert(pos.size() == static_global_array.size());
    ...
}

In member function 'void Foo::func()':
warning: 'pos' may be used uninitialized [-Wmaybe-uninitialized]
  312 |     assert(pos.size() == buffers_.size());