https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115204
Bug ID: 115204
Summary: unnecessary stack usage and copies (of temporaries)
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: mkretz at gcc dot gnu.org
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

Test case (https://compiler-explorer.com/z/P7s75EhMr):

struct A { int data[8]; };
struct A gen();
void g(struct A);
void f() { g(gen()); }

This places the returned A object from 'gen()' on the stack, copies it and then calls 'g'. Why? So instead of

f:
  sub rsp, 40
  xor eax, eax
  mov rdi, rsp
  call gen
  sub rsp, 32
  movdqa xmm0, XMMWORD PTR [rsp+32]
  movups XMMWORD PTR [rsp], xmm0
  movdqa xmm0, XMMWORD PTR [rsp+48]
  movups XMMWORD PTR [rsp+16], xmm0
  call g
  add rsp, 72
  ret

can GCC just elide the copy? Like this:

f:
  sub rsp, 40
  xor eax, eax
  mov rdi, rsp
  call gen
  call g
  add rsp, 40
  ret

I understand that this optimization requires the caller to never read from the object anymore. So a second call to 'g' with the same object returned from 'gen' (like in https://compiler-explorer.com/z/6rMYdnb34) requires that the first call to 'g' gets a copy. But the second call does not require the copy. I.e.
int f() {
  struct A a = gen();
  g(a);
  g(a);
  return 1;
}

compiles to

f:
  sub rsp, 40
  xor eax, eax
  mov rdi, rsp
  call gen
  sub rsp, 32
  movdqa xmm0, XMMWORD PTR [rsp+32]
  movups XMMWORD PTR [rsp], xmm0
  movdqa xmm0, XMMWORD PTR [rsp+48]
  movups XMMWORD PTR [rsp+16], xmm0
  call g
  movdqa xmm0, XMMWORD PTR [rsp+32]
  movups XMMWORD PTR [rsp], xmm0
  movdqa xmm0, XMMWORD PTR [rsp+48]
  movups XMMWORD PTR [rsp+16], xmm0
  call g
  mov eax, 1
  add rsp, 72
  ret

but could be

f:
  sub rsp, 40
  xor eax, eax
  mov rdi, rsp
  call gen
  sub rsp, 32
  movdqa xmm0, XMMWORD PTR [rsp+32]
  movups XMMWORD PTR [rsp], xmm0
  movdqa xmm0, XMMWORD PTR [rsp+48]
  movups XMMWORD PTR [rsp+16], xmm0
  call g
  add rsp, 32
  call g
  mov eax, 1
  add rsp, 40
  ret

IIUC, the second change would be significantly harder to implement because it needs to shrink the stack. However, I don't believe this second case is as important. The first one should be sufficiently common because of temporaries passed into function arguments. So the following variation

void f() { g(gen(), gen()); }

is something I see often, leading to many unnecessary stack copies. Instead of

f:
  sub rsp, 72
  xor eax, eax
  mov rdi, rsp
  call gen
  lea rdi, [rsp+32]
  xor eax, eax
  call gen
  sub rsp, 64
  movdqa xmm0, XMMWORD PTR [rsp+64]
  movups XMMWORD PTR [rsp+32], xmm0
  movdqa xmm0, XMMWORD PTR [rsp+80]
  movups XMMWORD PTR [rsp+48], xmm0
  movdqa xmm0, XMMWORD PTR [rsp+96]
  movups XMMWORD PTR [rsp], xmm0
  movdqa xmm0, XMMWORD PTR [rsp+112]
  movups XMMWORD PTR [rsp+16], xmm0
  call g
  add rsp, 136
  ret

I think it should be:

f:
  sub rsp, 72
  xor eax, eax
  mov rdi, rsp
  call gen
  lea rdi, [rsp+32]
  xor eax, eax
  call gen
  call g
  add rsp, 72
  ret

IIUC, this depends on the psABI and I don't know how target-dependent such an optimization is. That's why I