https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115749
--- Comment #1 from kim.walisch at gmail dot com ---

I played a bit more with my C/C++ code snippet and managed to further simplify it. The GCC performance issue seems to be mostly caused by GCC producing worse assembly than Clang for the integer modulo by a constant on x86-64 CPUs:

unsigned long func(unsigned long x) {
    return x % 240;
}

GCC trunk produces the following 11 instruction assembly sequence (without mulx) when compiled using -O3 -mbmi2:

func:
        movabs  rax, -8608480567731124087
        mul     rdi
        mov     rax, rdx
        shr     rax, 7
        mov     rdx, rax
        sal     rdx, 4
        sub     rdx, rax
        mov     rax, rdi
        sal     rdx, 4
        sub     rax, rdx
        ret

Clang trunk produces the following shorter and faster 8 instruction assembly sequence (with mulx) when compiled using -O3 -mbmi2:

func:
        mov     rax, rdi
        movabs  rcx, -8608480567731124087
        mov     rdx, rdi
        mulx    rcx, rcx, rcx
        shr     rcx, 7
        imul    rcx, rcx, 240
        sub     rax, rcx
        ret

In my first post one can see that Clang uses mulx for both the integer division by a constant and the integer modulo by a constant, while GCC does not use mulx. However, for the integer division by a constant GCC uses the same number of instructions as Clang (even without GCC using mulx) but for the integer modulo by a constant GCC uses up to 30% more instructions and is noticeably slower.

Please note that Clang's assembly is also shorter (8 asm instructions) than GCC's assembly for the integer modulo by a constant on x86-64 CPUs when compiling without -mbmi2 e.g. with just -O3:

func:
        movabs  rcx, -8608480567731124087
        mov     rax, rdi
        mul     rcx
        shr     rdx, 7
        imul    rax, rdx, 240
        sub     rdi, rax
        mov     rax, rdi
        ret
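For readers who want to follow the two sequences without reading assembly, here is a rough C sketch of what both compilers appear to compute. It is only my reading of the listings above, not anything taken from the GCC or Clang sources. The constant -8608480567731124087 is 0x8888888888888889 when viewed as an unsigned 64-bit value, and the quotient x / 240 is the high half of the 64x64-bit product shifted right by 7. The function names (mulhi_u64, mod240_imul, mod240_shift_sub, check_mod240) are made up for illustration, and unsigned __int128 is a GCC/Clang extension rather than standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* High 64 bits of the 128-bit product a * b.
   Assumes the GCC/Clang unsigned __int128 extension is available. */
static uint64_t mulhi_u64(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* The movabs constant from the listings, reinterpreted as unsigned. */
static const uint64_t MAGIC_240 = 0x8888888888888889ULL;

/* Shape of the Clang sequence: quotient, one multiply by 240, subtract. */
uint64_t mod240_imul(uint64_t x)
{
    uint64_t q = mulhi_u64(x, MAGIC_240) >> 7;  /* q = x / 240 */
    return x - q * 240;
}

/* Shape of the GCC sequence: the multiply by 240 is expanded into
   shift/subtract steps, since 240 = (16 - 1) * 16. */
uint64_t mod240_shift_sub(uint64_t x)
{
    uint64_t q = mulhi_u64(x, MAGIC_240) >> 7;  /* q = x / 240 */
    uint64_t t = (q << 4) - q;                  /* t = 15 * q  */
    t <<= 4;                                    /* t = 240 * q */
    return x - t;
}

/* Sanity check of both variants against the built-in % operator
   on a few hand-picked values (hypothetical helper). */
void check_mod240(void)
{
    const uint64_t samples[] = {0, 1, 239, 240, 241, 57600, UINT64_MAX};
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        assert(mod240_imul(samples[i]) == samples[i] % 240);
        assert(mod240_shift_sub(samples[i]) == samples[i] % 240);
    }
}
```

Both functions return the same result. The difference in the listings comes down to how q * 240 is formed: a single imul in Clang's output versus the sal/sub/sal chain plus extra register moves in GCC's output.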