https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122529
Bug ID: 122529
Summary: Optimizing for size --- unnecessary x86 instructions
Product: gcc
Version: 15.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: zero at smallinteger dot com
Target Milestone: ---
Created attachment 62688
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=62688&action=edit
Sample code
Consider the attached code, compiled with -Oz. Per Godbolt, the output for GCC
15.2 is as follows.
test:
        xor eax, eax
        mov ecx, 1024
        mov rdx, rdi
        rep stosd
        xor eax, eax
.L2:
        mov DWORD PTR [rdx+rax*4], eax
        inc rax
        cmp rax, 160
        jne .L2
        ret
Observe that the second xor eax, eax is unnecessary: rep stosd modifies only
rcx and rdi (and memory), so rax is still zero afterwards. Removing the for
loop eliminates the second xor eax, eax. It seems as if GCC assumes that
rep stosd clobbers rax.
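The attachment is not reproduced inline. A hypothetical reconstruction that is consistent with the listing above (a 4096-byte zero fill followed by a 160-iteration store loop) might look like the following. The function name, the parameter type, and the use of memset are assumptions; the actual attachment may differ.

```c
#include <string.h>

/* Hypothetical reconstruction of the test case; not the actual attachment.
   Assumes a 4-byte int, as on x86-64. */
void test_reconstructed(int *p)
{
    /* 1024 dwords = 4096 bytes; would correspond to rep stosd with ecx = 1024. */
    memset(p, 0, 1024 * sizeof *p);

    /* Would correspond to the .L2 loop: store the index into each element. */
    for (int i = 0; i < 160; i++)
        p[i] = i;
}
```

If the attachment has this shape, deleting the for loop leaves only the memset, which matches the observation that the second xor eax, eax then disappears.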
Moreover, per Godbolt the output for GCC trunk is as follows.
"test":
        xor eax, eax
        mov ecx, 1023
        mov rdx, rdi
        and DWORD PTR [rdi+4092], 0
        rep stosd
        xor eax, eax
.L2:
        mov DWORD PTR [rdx+rax*4], eax
        inc rax
        cmp rax, 160
        jne .L2
        ret
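Observe that in this case one of the writes in rep stosd has been peeled off
for some reason (the count drops to 1023 and the last dword, at offset 4092,
is cleared by a separate and instruction), resulting in even larger code.
Again with GCC trunk, -Os adds even more instructions.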
"test":
        xor eax, eax
        mov ecx, 1023
        mov rdx, rdi
        mov DWORD PTR [rdi+4092], eax
        xor eax, eax
        rep stosd
        xor eax, eax
.L2:
        mov DWORD PTR [rdx+rax*4], eax
        inc rax
        cmp rax, 160
        jne .L2
        ret
Now there are three instances of xor eax, eax, and only the first is needed.
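The redundancy can be checked with a small register model. The sketch below replays the -Os prologue above in C, assuming the direction flag is clear (as the x86-64 calling conventions require on function entry). The struct and function names are hypothetical, and the model tracks only the registers that matter for the stores; mov rdx, rdi is omitted because it does not affect eax.

```c
#include <stdint.h>

/* Hypothetical register model; only the registers the stores depend on. */
struct regs {
    uint32_t  eax;  /* value being stored; the stores only read it */
    uint64_t  rcx;  /* repeat count for rep stosd */
    uint32_t *rdi;  /* destination pointer */
};

/* rep stosd with the direction flag clear: store eax at [rdi], advance rdi
   by one dword, decrement rcx, and repeat until rcx reaches zero. */
static void rep_stosd(struct regs *r)
{
    while (r->rcx != 0) {
        *r->rdi++ = r->eax;
        r->rcx--;
    }
}

/* Replays the -Os prologue on buf (at least 1024 dwords) and returns how
   many of the later xor eax, eax instructions found eax already zero. */
static int count_redundant_xors(uint32_t *buf)
{
    struct regs r = { .eax = 0xdeadbeefu, .rcx = 0, .rdi = buf };
    int redundant = 0;

    r.eax = 0;                    /* xor eax, eax (the one that is needed) */
    r.rcx = 1023;                 /* mov ecx, 1023 */
    buf[4092 / 4] = r.eax;        /* mov DWORD PTR [rdi+4092], eax */
    redundant += (r.eax == 0);    /* second xor eax, eax: eax already zero? */
    r.eax = 0;
    rep_stosd(&r);                /* rep stosd */
    redundant += (r.eax == 0);    /* third xor eax, eax: eax already zero? */
    r.eax = 0;
    return redundant;
}
```

In this model the function returns 2, and the buffer ends up fully zeroed: the peeled store covers the last dword and rep stosd covers the first 1023.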
I could not eliminate the additional unnecessary instructions by explicitly
enabling specific optimizations on top of -Oz (I tried this because -Oz enables
most but not all of -O2).
For comparison, per Godbolt clang trunk does this.
test:
        mov rdx, rdi
        xor esi, esi
        mov ecx, 1024
        xor eax, eax
        rep stosd es:[rdi], eax
.LBB0_1:
        cmp rsi, 160
        je .LBB0_3
        mov dword ptr [rdx + 4*rsi], esi
        inc rsi
        jmp .LBB0_1
.LBB0_3:
        ret
Observe that clang also emits unnecessary instructions, here because it zeroes
a separate counter (esi) instead of reusing eax after rep stosd.