https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123524

            Bug ID: 123524
           Summary: 6% performance regression in gcc-16 compared to gcc-15
                    when compiling an interpreter
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: mikulas at artax dot karlin.mff.cuni.cz
  Target Milestone: ---

Hi

GCC 16 (20251214 from Debian Sid) generates worse code when compiling the
interpreter for the Ajla programming language.

I uploaded the preprocessed source code for the file ipret.c here. Compile it
with gcc -O2.
http://www.jikos.cz/~mikulas/testcases/gcc/ipret-gcc15.e
http://www.jikos.cz/~mikulas/testcases/gcc/ipret-gcc16.e

This is a piece of interpreter code that checks tags, sums two signed 64-bit
numbers, checks for overflow, stores the result and jumps to the next
instruction.
On gcc-15 we can see that the code is almost optimal:
gcc-15:
   36931:       41 0f b6 54 24 02       movzbl 0x2(%r12),%edx        <--- load variable offsets from the bytecode
   36937:       41 0f b6 4c 24 03       movzbl 0x3(%r12),%ecx
   3693d:       41 0f b6 74 24 04       movzbl 0x4(%r12),%esi
   36943:       0f b6 3c 0b             movzbl (%rbx,%rcx,1),%edi    <--- check tags
   36947:       40 0a 3c 13             or     (%rbx,%rdx,1),%dil
   3694b:       75 22                   jne    3696f <u_run+0x31d7f> <--- escape if at least one argument is tagged
   3694d:       48 8b 14 d3             mov    (%rbx,%rdx,8),%rdx    <--- load the first argument
   36951:       48 03 14 cb             add    (%rbx,%rcx,8),%rdx    <--- add the second argument
   36955:       70 18                   jo     3696f <u_run+0x31d7f> <--- escape on overflow
   36957:       48 89 14 f3             mov    %rdx,(%rbx,%rsi,8)    <--- store the result
   3695b:       41 0f b7 54 24 06       movzwl 0x6(%r12),%edx        <--- load the next instruction opcode
   36961:       48 8d 05 00 00 00 00    lea    0x0(%rip),%rax        <--- load the base of the jump table
   36968:       49 83 c4 06             add    $0x6,%r12             <--- increase the opcode pointer
   3696c:       ff 24 d0                jmp    *(%rax,%rdx,8)        <--- jump to the next instruction
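
For readers without the preprocessed source at hand, the steps annotated in the gcc-15 listing correspond roughly to the following C. This is only a sketch and not the Ajla source: the names run_sketch, tags, slots, OP_ADD and OP_HALT are hypothetical, the 6-byte instruction layout is inferred from the disassembly, and separate tag and slot arrays stand in for the single base register (%rbx) that the real code indexes with scale 1 and scale 8. It relies on the GNU C extensions for labels as values and on __builtin_add_overflow.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { OP_ADD, OP_HALT, OP_COUNT };          /* hypothetical opcodes */
enum { RUN_HALTED = 0, RUN_ESCAPED = 1 };    /* hypothetical return codes */

/* Read the native-endian 16-bit opcode at ip (movzwl in the listing). */
static uint16_t load_opcode(const uint8_t *ip)
{
    uint16_t op;
    memcpy(&op, ip, sizeof op);
    return op;
}

/* Runs trusted bytecode; opcodes are not bounds-checked in this sketch. */
int run_sketch(const uint8_t *ip, const uint8_t *tags, int64_t *slots)
{
    /* Jump table of handler addresses (GNU C labels as values). */
    static void *const table[OP_COUNT] = { &&op_add, &&op_halt };

    assert(ip && tags && slots);
    goto *table[load_opcode(ip)];

op_add: {
        /* Load the variable offsets from the bytecode. */
        unsigned a = ip[2], b = ip[3], r = ip[4];
        int64_t sum;

        /* Escape if at least one argument is tagged. */
        if (tags[a] | tags[b])
            return RUN_ESCAPED;
        /* Add the arguments; escape on signed overflow. */
        if (__builtin_add_overflow(slots[a], slots[b], &sum))
            return RUN_ESCAPED;
        /* Store the result, advance, and dispatch the next opcode. */
        slots[r] = sum;
        ip += 6;
        goto *table[load_opcode(ip)];
    }

op_halt:
    return RUN_HALTED;
}
```

With code like this, each handler ends in its own indirect jump, which is what the final jmp *(%rax,%rdx,8) in the gcc-15 output reflects.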

The equivalent code generated by gcc-16 is this:
gcc-16:
   19cca:       41 0f b6 55 02          movzbl 0x2(%r13),%edx
   19ccf:       41 0f b6 4d 03          movzbl 0x3(%r13),%ecx
   19cd4:       41 0f b6 7d 04          movzbl 0x4(%r13),%edi
   19cd9:       0f b6 34 13             movzbl (%rbx,%rdx,1),%esi
   19cdd:       40 0a 34 0b             or     (%rbx,%rcx,1),%sil
   19ce1:       0f 85 c7 81 02 00       jne    41eae <u_run+0x3d26e>
   19ce7:       48 8d 34 d3             lea    (%rbx,%rdx,8),%rsi
   19ceb:       48 8d 14 cb             lea    (%rbx,%rcx,8),%rdx
   19cef:       40 0f b6 cf             movzbl %dil,%ecx
   19cf3:       48 8d 0c cb             lea    (%rbx,%rcx,8),%rcx
   19cf7:       48 8b 06                mov    (%rsi),%rax
   19cfa:       48 03 02                add    (%rdx),%rax
   19cfd:       0f 80 ab 81 02 00       jo     41eae <u_run+0x3d26e>
   19d03:       48 89 01                mov    %rax,(%rcx)
   19d06:       41 0f b7 55 06          movzwl 0x6(%r13),%edx
   19d0b:       48 8d 05 00 00 00 00    lea    0x0(%rip),%rax        # 19d12 <u_run+0x150d2>
   19d12:       49 83 c5 06             add    $0x6,%r13
   19d16:       48 8b 04 d0             mov    (%rax,%rdx,8),%rax
   19d1a:       e9 61 af fe ff          jmp    4c80 <u_run+0x40>
    4c80:       ba 02 00 02 00          mov    $0x20002,%edx
    4c85:       66 0f 6e e2             movd   %edx,%xmm4
    4c89:       66 0f 70 fc 00          pshufd $0x0,%xmm4,%xmm7
    4c8e:       0f 29 3c 24             movaps %xmm7,(%rsp)
    4c92:       ff e0                   jmp    *%rax

We can see that gcc 16 doesn't use the scaled addressing modes when accessing
the variables, and there is nonsensical code at address 4c80 that stores a
pattern to the stack frame (I don't know where this comes from; the source
code doesn't contain any attempt to store the constant 0x20002 at that point).
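
To make the stored pattern concrete: movd places 0x20002 in the low 32-bit lane, pshufd $0x0 copies that lane into all four lanes, and movaps writes the resulting 16 bytes to (%rsp). The small C model below reproduces that effect; the helper names splat32 and as_u16x8 are made up for illustration and are not from the Ajla source. Viewed as 16-bit elements, the 16 bytes are eight copies of the value 2, though that observation alone does not say which source construct the vectorizer turned into this store.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Models movd + pshufd $0x0 + movaps: one 32-bit value broadcast
 * into four consecutive 32-bit lanes (16 bytes in total). */
void splat32(uint32_t value, uint8_t out[16])
{
    assert(out != NULL);
    for (int lane = 0; lane < 4; lane++)
        memcpy(out + 4 * lane, &value, sizeof value);
}

/* Reinterprets the same 16 bytes as eight 16-bit elements. */
void as_u16x8(const uint8_t in[16], uint16_t out[8])
{
    memcpy(out, in, 16);
}
```

Calling splat32(0x20002, buf) followed by as_u16x8(buf, halves) leaves every element of halves equal to 2, independent of byte order, because both 16-bit halves of 0x00020002 are 0x0002.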

Note that if I use the flag -fno-tree-vectorize, the code that stores 0x20002
to the stack frame is not generated (but gcc still doesn't use the scaled
addressing modes).

Due to these regressions, the code generated by gcc-16 is bigger and slower:
"objdump -d ipret.o |wc -l"
gcc-15: 69695 lines
gcc-16: 75414 lines

Benchmark:
1. Download Ajla from https://www.ajla-lang.cz/
2. Compile it with CC='gcc-15 -DDEBUG_ENV -O2' and with CC='gcc-16 -DDEBUG_ENV -O2'
(the DEBUG_ENV macro makes it respond to debugging environment variables)
3. Run "time CG=none ./scripts/update.sh"
(this compiles the language itself, I use it as a benchmark)
(the CG=none variable disables the code generator, so that it uses only the
interpreter)

The results:
Core i7-2640M:
gcc-15: 51 seconds
gcc-16: 54 seconds
Ryzen 7 PRO 7840U:
gcc-15: 9.2 seconds
gcc-16: 9.9 seconds
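
For reference, the relative slowdown implied by these figures can be computed as below; slowdown_percent is an illustrative helper, not part of Ajla or GCC. It gives roughly 5.9% on the Core i7-2640M and roughly 7.6% on the Ryzen 7 PRO 7840U, and the same formula applied to the objdump line counts (69695 versus 75414) gives roughly 8.2% growth.

```c
#include <assert.h>

/* Percentage increase from `before` to `after`,
 * e.g. run time in seconds or lines of disassembly. */
double slowdown_percent(double before, double after)
{
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```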
