https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749

            Bug ID: 115749
           Summary: Missed BMI2 optimization on x86-64
           Product: gcc
           Version: 14.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: kim.walisch at gmail dot com
  Target Milestone: ---

Hi,

I have debugged a performance issue in one of my C++ applications on x86-64
CPUs where GCC (all versions) produces noticeably slower code than Clang. The
cause is that GCC does not use the mulx instruction from BMI2, even when
compiling with -mbmi2. Clang, on the other hand, uses mulx and produces a
shorter and faster assembly sequence. For this particular code sequence Clang
used up to 30% fewer instructions than GCC.

Here is a minimal C/C++ code snippet that reproduces the issue:


extern const unsigned long array[240];

unsigned long func(unsigned long x)
{
    unsigned long index = x / 240;
    return array[index % 240];
}



GCC trunk produces the following 15-instruction assembly sequence (without
mulx) when compiled with -O3 -mbmi2:

func(unsigned long):
        movabs  rcx, -8608480567731124087
        mov     rax, rdi
        mul     rcx
        mov     rdi, rdx
        shr     rdi, 7
        mov     rax, rdi
        mul     rcx
        shr     rdx, 7
        mov     rax, rdx
        sal     rax, 4
        sub     rax, rdx
        sal     rax, 4
        sub     rdi, rax
        mov     rax, QWORD PTR array[0+rdi*8]
        ret
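
For readers following the listing, here is a rough C++ sketch of the arithmetic
that the GCC sequence appears to perform: a multiply-high by the magic constant
ceil(2^71 / 240) followed by a shift replaces each division by 240, and the
remainder is rebuilt as q - (q / 240) * 240. The names kMagic240, mulhi, div240
and index_of are made up for illustration and do not appear in either
compiler's output; unsigned __int128 is a GCC/Clang extension.

```cpp
#include <cassert>
#include <cstdint>

// Magic multiplier taken from the listing: ceil(2^71 / 240) = 0x8888888888888889.
// The assembler output prints it as the signed value -8608480567731124087.
constexpr std::uint64_t kMagic240 = 0x8888888888888889ULL;

// High 64 bits of a 64x64-bit product, i.e. what mul leaves in rdx
// (and what mulx writes to its first destination operand).
// Hypothetical helper; relies on the unsigned __int128 extension.
constexpr std::uint64_t mulhi(std::uint64_t a, std::uint64_t b)
{
    return static_cast<std::uint64_t>(
        (static_cast<unsigned __int128>(a) * b) >> 64);
}

// x / 240 computed as (x * kMagic240) >> 71: multiply-high, then shr 7.
constexpr std::uint64_t div240(std::uint64_t x)
{
    return mulhi(x, kMagic240) >> 7;
}

// (x / 240) % 240, mirroring the structure of the GCC sequence:
// two magic divisions, then the remainder as q - q2 * 240
// (GCC forms q2 * 240 with two shift/subtract pairs).
constexpr std::uint64_t index_of(std::uint64_t x)
{
    std::uint64_t q = div240(x);
    std::uint64_t q2 = div240(q);
    return q - q2 * 240;
}

// Compile-time sanity checks against the plain C++ operators.
static_assert(kMagic240 == static_cast<std::uint64_t>(-8608480567731124087LL),
              "constant matches the movabs operand");
static_assert(div240(239) == 0 && div240(240) == 1, "boundary of x / 240");
static_assert(index_of(UINT64_MAX) == (UINT64_MAX / 240) % 240,
              "matches the original expression for the largest input");
```

Both compilers use the same first constant, so the difference reported here is
not in the arithmetic itself but in how the multiply-high is issued and how
many moves are needed around it.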


Clang trunk produces the following shorter and faster 12-instruction assembly
sequence (with mulx) when compiled with -O3 -mbmi2:

func(unsigned long):                               # @func(unsigned long)
        movabs  rax, -8608480567731124087
        mov     rdx, rdi
        mulx    rdx, rdx, rax
        shr     rdx, 7
        movabs  rax, 153722867280912931
        mulx    rax, rax, rax
        shr     eax
        imul    eax, eax, 240
        sub     edx, eax
        mov     rax, qword ptr [rip + array@GOTPCREL]
        mov     rax, qword ptr [rax + 8*rdx]
        ret
