https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

--- Comment #1 from kim.walisch at gmail dot com ---
I played a bit more with my C/C++ code snippet and managed to further simplify
it. The GCC performance issue seems to be mostly caused by GCC producing worse
assembly than Clang for the integer modulo by a constant on x86-64 CPUs:

unsigned long func(unsigned long x)
{
    return x % 240;
}


GCC trunk produces the following 11-instruction assembly sequence (without
mulx) when compiled using -O3 -mbmi2:

func:
        movabs  rax, -8608480567731124087
        mul     rdi
        mov     rax, rdx
        shr     rax, 7
        mov     rdx, rax
        sal     rdx, 4
        sub     rdx, rax
        mov     rax, rdi
        sal     rdx, 4
        sub     rax, rdx
        ret

Clang trunk produces the following shorter and faster 8-instruction assembly
sequence (with mulx) when compiled using -O3 -mbmi2:

func:
        mov     rax, rdi
        movabs  rcx, -8608480567731124087
        mov     rdx, rdi
        mulx    rcx, rcx, rcx
        shr     rcx, 7
        imul    rcx, rcx, 240
        sub     rax, rcx
        ret

In my first post one can see that Clang uses mulx for both the integer division
by a constant and the integer modulo by a constant, while GCC does not use
mulx. However, for the integer division by a constant GCC uses the same number
of instructions as Clang (even without GCC using mulx) but for the integer
modulo by a constant GCC uses up to 30% more instructions and is noticeably
slower.
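
For reference, the GCC and Clang listings for the modulo compute the same
thing: a high 64-bit multiply by the constant, a right shift by 7 to get the
quotient, and then x minus quotient times 240. They differ only in how
quotient times 240 is formed (shifts and subtracts vs. a single imul). A rough
C sketch of the two variants follows. The function and macro names are made up
for illustration, and unsigned __int128 is a GCC/Clang extension used here only
to get the high half of the product:

```c
#include <stdint.h>

/* 0x8888888888888889 is the unsigned value of the movabs immediate
   -8608480567731124087 that appears in the listings. */
#define MOD240_MAGIC 0x8888888888888889ULL

/* Quotient x / 240: high 64 bits of x * magic, shifted right by 7. */
static uint64_t div240(uint64_t x)
{
    uint64_t hi = (uint64_t)(((unsigned __int128)x * MOD240_MAGIC) >> 64);
    return hi >> 7;
}

/* Clang-style: one multiply by 240 (imul), then subtract. */
static uint64_t mod240_imul(uint64_t x)
{
    return x - div240(x) * 240;
}

/* GCC-style: q * 240 built as ((q << 4) - q) << 4, then subtract. */
static uint64_t mod240_shift_sub(uint64_t x)
{
    uint64_t q = div240(x);
    uint64_t t = (q << 4) - q;   /* 15 * q */
    t <<= 4;                     /* 240 * q */
    return x - t;
}
```

Both functions should agree with x % 240; a compiler is of course free to turn
either one back into whichever instruction sequence it prefers.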

Please note that Clang's assembly is also shorter (8 asm instructions) than
GCC's assembly for the integer modulo by a constant on x86-64 CPUs when
compiling without -mbmi2, e.g. with just -O3:

func:
        movabs  rcx, -8608480567731124087
        mov     rax, rdi
        mul     rcx
        shr     rdx, 7
        imul    rax, rdx, 240
        sub     rdi, rax
        mov     rax, rdi
        ret
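
The movabs constant is the same in all three listings. It appears to be
ceil(2^71 / 240), where 71 = 64 (taking the high half of the product) + 7 (the
shr). A small sketch to check that, assuming a compiler with the
unsigned __int128 extension (GCC or Clang); the function name is hypothetical:

```c
#include <stdint.h>

/* ceil(2^71 / 240). 2^71 is not a multiple of 240, so floor + 1 is the
   ceiling, and the result still fits in 64 bits. */
static uint64_t magic_for_mod240(void)
{
    unsigned __int128 two_pow_71 = (unsigned __int128)1 << 71;
    return (uint64_t)(two_pow_71 / 240) + 1;
}
```

Reinterpreted as a signed 64-bit value, the result is the
-8608480567731124087 printed in the assembly.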
