https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123063
Bug ID: 123063
Summary: End of array of stack variable recalculated in every
iteration of for loop
Product: gcc
Version: 14.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: cousteaulecommandant at gmail dot com
Target Milestone: ---
When iterating over all the elements of an array with a simple for loop, GCC
replaces the counter with a pointer into the array, so that a single variable
serves both the memory access and the exit check (the loop ends when the
pointer reaches the end of the array), saving cycles.
However, if the array is on the stack (automatic storage), GCC 14 and newer
recalculate the address of the end of the array in every iteration, instead of
calculating it once before the loop and reusing that value. As a result, the
loop is one instruction longer, and therefore slower. This did not happen with
GCC 13.4 and earlier.
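To make the difference concrete, here is a rough source-level picture of the two loop shapes. This is only an illustrative sketch, not GCC output: the function names `sum_end_hoisted` and `sum_end_recomputed` are hypothetical, and an optimizing compiler is free to transform either function, so the two forms merely mirror where the end address is computed in the generated code.

```c
#include <stddef.h>

/* Hypothetical illustration of the GCC 13.4 loop shape:
   the end pointer is computed once, before the loop starts. */
int sum_end_hoisted(const int *v, size_t n) {
    const int *end = v + n;            /* computed once */
    int res = 0;
    for (const int *p = v; p != end; p++)
        res += *p;                     /* one pointer drives load and exit check */
    return res;
}

/* Hypothetical illustration of the GCC 14.1+ loop shape:
   the end address appears in the loop condition, so it is
   (conceptually) recomputed on every iteration. */
int sum_end_recomputed(const int *v, size_t n) {
    int res = 0;
    for (const int *p = v; p != v + n; p++)
        res += *p;
    return res;
}
```

Both functions return the same sum; the only difference is whether the end address is loop-invariant state or part of the per-iteration work.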
GCC version:
- Good: 13.4
- Bad: 14.1 onward
Severity: performance issue
Language: C (also observed in C++)
Target: RISC-V 32 and 64 bits (also noticed in x86-64, possibly more)
Compile options: -O3
Example code:
#include <string.h> // memcpy()

int sum100(const int v[100]) {
    int res = 0;
    for (int i = 0; i < 100; i++) {
        res += v[i];
    }
    return res;
}

int sum100_stack(const int v[100]) {
    int v_copy[100];
    memcpy(v_copy, v, sizeof v_copy);
    int res = 0;
    for (int i = 0; i < 100; i++) {
        res += v_copy[i];
    }
    return res;
}
Result:
- sum100() generates the same 4-instruction loop on both versions:
        addi a3,a0,400
        li a0,0
    .L2:
        lw a4,0(a5)
        addi a5,a5,4
        add a0,a0,a4
        bne a3,a5,.L2
- sum100_stack() generates a similar 4-instruction loop in version 13.4 and
earlier:
        addi a3,sp,400
        li a0,0
    .L6:
        lw a4,0(a5)
        addi a5,a5,4
        add a0,a0,a4
        bne a5,a3,.L6
- But sum100_stack() generates a 5-instruction loop in version 14.1 and newer
(moving the `addi a3,sp,400` instruction inside the loop):
        li a0,0
    .L6:
        lw a4,0(a5)
        addi a5,a5,4
        add a0,a0,a4
        addi a4,sp,400
        bne a5,a4,.L6
I found this issue on 32-bit RISC-V, but it also seems to affect 64-bit
RISC-V, x86-64, and possibly other targets. Testing on godbolt suggests it
affects all versions from 14.1 to 15.2, as well as trunk. It may not seem like
much, but for the example shown here it means ~15% slower code.