https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

--- Comment #9 from kim.walisch at gmail dot com ---
Here I am providing some benchmark results to back up my claim that switching
to the integer-modulo-by-constant algorithm with 2 multiplication instructions
(the default in both Clang and MSVC) is faster than GCC's current default
algorithm, which uses only 1 multiplication but more shifts, adds and subs.

Here is a simple C program that can be used to benchmark both algorithms with
GCC. It computes integer modulo by a constant in a loop (10^10 iterations).
While this benchmark program may not be ideal for simulating real-world use
cases, it is better than having no benchmark at all.

-----------------------------------------------------------------------

#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>
#include <sys/time.h>

__attribute__((noinline))
uint64_t benchmark(uint64_t x, uint64_t inc, uint64_t iters) {
    uint64_t sum = 0;

    #pragma GCC unroll 1
    for (uint64_t i = 0; i < iters; i++, x += inc)
        sum += x % 240;

    return sum;
}

int main() {
    struct timeval start, end;
    long seconds, useconds;
    double mtime;

    // Start timer
    gettimeofday(&start, NULL);

    uint64_t result = benchmark(0, 11, 10000000000ull);

    // End timer
    gettimeofday(&end, NULL);

    // Calculate the elapsed time in milliseconds
    seconds  = end.tv_sec  - start.tv_sec;
    useconds = end.tv_usec - start.tv_usec;

    mtime = seconds * 1000 + useconds / 1000.0;

    printf("Result = %" PRIu64 "\n", result);
    printf("The loop took %f milliseconds to execute.\n", mtime);

    return 0;
}

-----------------------------------------------------------------------


This GCC command produces the assembly sequence with only 1 multiplication
instruction:

gcc -O3 bench.c -o bench1

This GCC command produces the assembly sequence with 2 multiplication
instructions:

gcc -O3 -mtune=skylake bench.c -o bench2


And here are the benchmark results:


CPU: AMD EPYC 9R14 (AMD Zen4 from 2023), Compiler: GCC 14 on Ubuntu 24.04.

$ ./bench1
Result = 1194999999360
The loop took 8385.613000 milliseconds to execute.
$ ./bench2
Result = 1194999999360
The loop took 5898.967000 milliseconds to execute.

The algorithm with 2 multiplications is 30% faster.


CPU: Intel i5-12600K (big.LITTLE), Performance CPU core, Compiler: GCC 13.2
(MinGW-w64)

$ ./bench1.exe
Result = 1194999999360
The loop took 5633.360000 milliseconds to execute.
$ ./bench2.exe
Result = 1194999999360
The loop took 4369.167000 milliseconds to execute.

The algorithm with 2 multiplications is 23% faster.


CPU: Intel i5-12600K (big.LITTLE), Efficiency CPU core, Compiler: GCC 13.2
(MinGW-w64)

$ ./bench1.exe
Result = 1194999999360
The loop took 10788.097000 milliseconds to execute.
$ ./bench2.exe
Result = 1194999999360
The loop took 9453.191000 milliseconds to execute.

The algorithm with 2 multiplications is 12% faster.


One of the comments in PR 115756 was: "I'd lean towards shift+add because for
example Intel E-cores have a slow imul." However, my benchmarks suggest that
even on Intel Efficiency CPU cores the algorithm with 2 multiplication
instructions is faster. (I used the Process Lasso tool on Windows 11 to pin
the benchmark to an Efficiency CPU core.)
