https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749
--- Comment #9 from kim.walisch at gmail dot com ---

Here I am providing some benchmark results to back up my claim that switching
to the integer "modulo by a constant" algorithm with 2 multiplication
instructions (which is the default in both Clang and MSVC) is faster than
GCC's current default algorithm, which uses only 1 multiplication but more
shifts, adds and subtracts.

Here is a simple C program that can be used to benchmark both algorithms
using GCC. It computes an integer modulo by a constant in a loop (10^10
iterations). While this benchmark program may not be ideal for simulating
real-world use cases, it is better than having no benchmark at all.

-----------------------------------------------------------------------
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>
#include <sys/time.h>

__attribute__((noinline))
uint64_t benchmark(uint64_t x, uint64_t inc, uint64_t iters)
{
    uint64_t sum = 0;

    /* Prevent unrolling so every iteration contains exactly one modulo. */
    #pragma GCC unroll 1
    for (uint64_t i = 0; i < iters; i++, x += inc)
        sum += x % 240;

    return sum;
}

int main()
{
    struct timeval start, end;
    long seconds, useconds;
    double mtime;

    // Start timer
    gettimeofday(&start, NULL);

    uint64_t result = benchmark(0, 11, 10000000000ull);

    // End timer
    gettimeofday(&end, NULL);

    // Calculate the elapsed time in milliseconds
    seconds = end.tv_sec - start.tv_sec;
    useconds = end.tv_usec - start.tv_usec;
    mtime = seconds * 1000 + useconds / 1000.0;

    printf("Result = %" PRIu64 "\n", result);
    printf("The loop took %f milliseconds to execute.\n", mtime);

    return 0;
}
-----------------------------------------------------------------------

This GCC command produces the assembly sequence with only 1 multiplication
instruction:

gcc -O3 bench.c -o bench1

This GCC command produces the assembly sequence with 2 multiplication
instructions:

gcc -O3 -mtune=skylake bench.c -o bench2

And here are the benchmark results:

CPU: AMD EPYC 9R14 (AMD Zen 4 from 2023), Compiler: GCC 14 on Ubuntu 24.04.

$ ./bench1
Result = 1194999999360
The loop took 8385.613000 milliseconds to execute.
$ ./bench2
Result = 1194999999360
The loop took 5898.967000 milliseconds to execute.

The algorithm with 2 multiplications is 30% faster.

CPU: Intel i5-12600K (big.LITTLE), Performance CPU core, Compiler: GCC 13.2
(MinGW-w64)

$ ./bench1.exe
Result = 1194999999360
The loop took 5633.360000 milliseconds to execute.
$ ./bench2.exe
Result = 1194999999360
The loop took 4369.167000 milliseconds to execute.

The algorithm with 2 multiplications is 23% faster.

CPU: Intel i5-12600K (big.LITTLE), Efficiency CPU core, Compiler: GCC 13.2
(MinGW-w64)

$ ./bench1.exe
Result = 1194999999360
The loop took 10788.097000 milliseconds to execute.
$ ./bench2.exe
Result = 1194999999360
The loop took 9453.191000 milliseconds to execute.

The algorithm with 2 multiplications is 12% faster.

One of the comments in PR 115756 was "I'd lean towards shift+add because for
example Intel E-cores have a slow imul." However, my benchmarks suggest that
even on Intel Efficiency CPU cores the algorithm with 2 multiplication
instructions is faster. (I used the Process Lasso tool on Windows 11 to pin
the benchmark to an Efficiency CPU core.)
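For reference, below is a minimal C sketch of the two remainder strategies
being compared, specialized to the constant 240 used by the benchmark. This
is my own illustration of the standard Granlund-Montgomery reciprocal
technique, not the exact instruction sequences GCC or Clang emit; the magic
multiplier and the particular shift/subtract expansion are assumptions chosen
to make the sketch self-contained. Since 240 = 16 * 15 and 2^64 mod 15 == 1,
shifting out the factor 16 and multiplying by ceil(2^64 / 15) =
0x1111111111111112 yields the exact quotient for all 64-bit inputs.

-----------------------------------------------------------------------
#include <assert.h>
#include <stdint.h>

/* floor(x / 240) with a single "high half" multiplication (uses GCC's
   unsigned __int128 extension). 240 = 16 * 15, so shift out the power
   of two first; the remaining 60-bit value is divided by 15 via the
   magic multiplier ceil(2^64 / 15) = 0x1111111111111112, whose rounding
   error is small enough for all 60-bit inputs because 2^64 mod 15 == 1. */
static uint64_t quotient240(uint64_t x)
{
    unsigned __int128 p = (unsigned __int128)(x >> 4) * 0x1111111111111112ull;
    return (uint64_t)(p >> 64);
}

/* "2 multiplications" variant: multiply the quotient back with a
   second imul, then subtract. */
static uint64_t mod240_two_muls(uint64_t x)
{
    return x - quotient240(x) * 240;
}

/* "1 multiplication" variant: expand q * 240 as (q << 8) - (q << 4),
   i.e. q * (256 - 16), trading the second imul for shifts and an extra
   subtract. The intermediate shift may wrap around, but that is
   harmless in unsigned arithmetic because the final result is the true
   remainder, which fits in [0, 239]. */
static uint64_t mod240_one_mul(uint64_t x)
{
    uint64_t q = quotient240(x);
    return x - ((q << 8) - (q << 4));
}

int main(void)
{
    /* Spot-check both variants against the compiler's own x % 240. */
    for (uint64_t x = 0; x < 1000000; x += 7) {
        assert(mod240_two_muls(x) == x % 240);
        assert(mod240_one_mul(x) == x % 240);
    }
    assert(mod240_two_muls(UINT64_MAX) == UINT64_MAX % 240);
    assert(mod240_one_mul(UINT64_MAX) == UINT64_MAX % 240);
    return 0;
}
-----------------------------------------------------------------------

Both variants share the first multiplication (the reciprocal multiply for
the quotient) and differ only in how q * 240 is materialized: one imul
versus a shift/shift/subtract chain. That is exactly the trade-off the
benchmark above measures.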