https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749
Bug ID: 115749
Summary: Missed BMI2 optimization on x86-64
Product: gcc
Version: 14.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: kim.walisch at gmail dot com
Target Milestone: ---

Hi,

I have debugged a performance issue in one of my C++ applications on x86-64
CPUs where GCC (all versions I tested) produces noticeably slower code than
Clang. The cause is that GCC does not use the mulx instruction from BMI2,
even when compiling with -mbmi2. Clang, on the other hand, uses mulx and
produces a shorter and faster assembly sequence. For this particular code
sequence, Clang uses up to 30% fewer instructions than GCC.

Here is a minimal C/C++ code snippet that reproduces the issue:

extern const unsigned long array[240];

unsigned long func(unsigned long x)
{
    unsigned long index = x / 240;
    return array[index % 240];
}

GCC trunk produces the following 15-instruction assembly sequence (without
mulx) when compiled using -O3 -mbmi2:

func(unsigned long):
        movabs  rcx, -8608480567731124087
        mov     rax, rdi
        mul     rcx
        mov     rdi, rdx
        shr     rdi, 7
        mov     rax, rdi
        mul     rcx
        shr     rdx, 7
        mov     rax, rdx
        sal     rax, 4
        sub     rax, rdx
        sal     rax, 4
        sub     rdi, rax
        mov     rax, QWORD PTR array[0+rdi*8]
        ret

Clang trunk produces the following shorter and faster 12-instruction
assembly sequence (with mulx) when compiled using -O3 -mbmi2:

func(unsigned long):                    # @func(unsigned long)
        movabs  rax, -8608480567731124087
        mov     rdx, rdi
        mulx    rdx, rdx, rax
        shr     rdx, 7
        movabs  rax, 153722867280912931
        mulx    rax, rax, rax
        shr     eax
        imul    eax, eax, 240
        sub     edx, eax
        mov     rax, qword ptr [rip + array@GOTPCREL]
        mov     rax, qword ptr [rax + 8*rdx]
        ret