https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123063

            Bug ID: 123063
           Summary: End of array of stack variable recalculated in every
                    iteration of for loop
           Product: gcc
           Version: 14.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: cousteaulecommandant at gmail dot com
  Target Milestone: ---

When iterating over all the elements of an array with a simple for loop, GCC's
optimizer replaces the counter with a pointer into the array.  That single
pointer serves both for the load and for the exit test (the loop ends when the
pointer reaches the end of the array), which saves cycles.
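
For illustration, the effect of that transformation corresponds roughly to the
hand-written C below.  This is only a sketch of the resulting loop shape, not
GCC's internal representation, and the name `sum100_ptr` is made up for this
example.

```c
/* Hand-written equivalent of the loop after the counter has been
   replaced by a pointer (illustration only; the name is hypothetical). */
int sum100_ptr(const int v[100]) {
    const int *p = v;
    const int *end = v + 100;  /* one past the last element */
    int res = 0;
    while (p != end) {         /* the pointer doubles as the exit test */
        res += *p;
        p++;
    }
    return res;
}
```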

However, if the array is on the stack (automatic storage), GCC 14 and newer
recalculate the address of the end of the array in every iteration instead of
calculating it once before the loop and reusing that value.  As a result, the
loop is one instruction longer and therefore slower.  This did not happen with
GCC 13.4 and earlier.
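
The difference can be modeled at source level as follows.  This is a rough
sketch that assumes the generated code mirrors these two loop shapes; both
function names are hypothetical, and an optimizing compiler is free to compile
the two functions identically, so they only illustrate where the end address
is computed.

```c
#include <string.h> /* memcpy() */

/* Shape seen with GCC 13.4 and earlier: the end address is computed
   once, before the loop. */
int sum100_stack_hoisted(const int v[100]) {
    int v_copy[100];
    memcpy(v_copy, v, sizeof v_copy);
    const int *end = v_copy + 100;
    int res = 0;
    for (const int *p = v_copy; p != end; p++) {
        res += *p;
    }
    return res;
}

/* Shape seen with GCC 14.1 and newer: the end address is re-derived
   from the array base as part of every exit test. */
int sum100_stack_recomputed(const int v[100]) {
    int v_copy[100];
    memcpy(v_copy, v, sizeof v_copy);
    int res = 0;
    for (const int *p = v_copy; p != v_copy + 100; p++) {
        res += *p;
    }
    return res;
}
```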

GCC version:
  - Good: 13.4
  - Bad:  14.1 onward
Severity:  performance issue
Language:  C  (also observed in C++)
Target:  RISC-V, 32-bit and 64-bit  (also observed on x86-64, possibly others)
Compile options:  -O3
Example code:

    #include <string.h> // memcpy()

    int sum100(const int v[100]) {
        int res=0;
        for (int i=0; i<100; i++) {
            res += v[i];
        }
        return res;
    }

    int sum100_stack(const int v[100]) {
        int v_copy[100];
        memcpy(v_copy, v, sizeof v_copy);
        int res=0;
        for (int i=0; i<100; i++) {
            res += v_copy[i];
        }
        return res;
    }

Result:
  - sum100() generates the same 4-instruction loop on both versions:

                addi    a3,a0,400
                li      a0,0
        .L2:
                lw      a4,0(a5)
                addi    a5,a5,4
                add     a0,a0,a4
                bne     a3,a5,.L2

  - sum100_stack() generates a similar 4-instruction loop in version 13.4 and
earlier:

                addi    a3,sp,400
                li      a0,0
        .L6:
                lw      a4,0(a5)
                addi    a5,a5,4
                add     a0,a0,a4
                bne     a5,a3,.L6

  - But sum100_stack() generates a 5-instruction loop in version 14.1 and newer
(the `addi a3,sp,400` instruction moves inside the loop, here as
`addi a4,sp,400`):

                li      a0,0
        .L6:
                lw      a4,0(a5)
                addi    a5,a5,4
                add     a0,a0,a4
                addi    a4,sp,400
                bne     a5,a4,.L6

I found this issue on 32-bit RISC-V, but it also seems to affect 64-bit RISC-V,
x86-64, and possibly other targets.  Judging from experiments on godbolt, it
affects all versions from 14.1 to 15.2 as well as trunk.  It may not seem like
much, but for the example shown here it means roughly 15% slower code.
