https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115204

            Bug ID: 115204
           Summary: unnecessary stack usage and copies (of temporaries)
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: mkretz at gcc dot gnu.org
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

Test case (https://compiler-explorer.com/z/P7s75EhMr):

struct A {
  int data[8];
};

struct A gen();

void g(struct A);

void f()
{
  g(gen());
}

GCC places the A object returned from 'gen()' on the stack, copies it into the
outgoing argument area, and then calls 'g'. Why? Instead of

f:
        sub     rsp, 40
        xor     eax, eax
        mov     rdi, rsp
        call    gen
        sub     rsp, 32
        movdqa  xmm0, XMMWORD PTR [rsp+32]
        movups  XMMWORD PTR [rsp], xmm0
        movdqa  xmm0, XMMWORD PTR [rsp+48]
        movups  XMMWORD PTR [rsp+16], xmm0
        call    g
        add     rsp, 72
        ret

can GCC just elide the copy? Like this:

f:
        sub     rsp, 40
        xor     eax, eax
        mov     rdi, rsp
        call    gen
        call    g
        add     rsp, 40
        ret


I understand that this optimization requires that the caller never reads from
the object again. So a second call to 'g' with the same object returned from
'gen' (like in https://compiler-explorer.com/z/6rMYdnb34) requires that the
first call to 'g' gets a copy. But the second call, being the last use of the
object, does not require a copy. I.e.

int f()
{
  struct A a = gen();
  g(a);
  g(a);
  return 1;
}

compiles to

f:
        sub     rsp, 40
        xor     eax, eax
        mov     rdi, rsp
        call    gen
        sub     rsp, 32
        movdqa  xmm0, XMMWORD PTR [rsp+32]
        movups  XMMWORD PTR [rsp], xmm0
        movdqa  xmm0, XMMWORD PTR [rsp+48]
        movups  XMMWORD PTR [rsp+16], xmm0
        call    g
        movdqa  xmm0, XMMWORD PTR [rsp+32]
        movups  XMMWORD PTR [rsp], xmm0
        movdqa  xmm0, XMMWORD PTR [rsp+48]
        movups  XMMWORD PTR [rsp+16], xmm0
        call    g
        mov     eax, 1
        add     rsp, 72
        ret

but could be

f:
        sub     rsp, 40
        xor     eax, eax
        mov     rdi, rsp
        call    gen
        sub     rsp, 32
        movdqa  xmm0, XMMWORD PTR [rsp+32]
        movups  XMMWORD PTR [rsp], xmm0
        movdqa  xmm0, XMMWORD PTR [rsp+48]
        movups  XMMWORD PTR [rsp+16], xmm0
        call    g
        add     rsp, 32
        call    g
        mov     eax, 1
        add     rsp, 40
        ret
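
To make the copy requirement concrete, here is a minimal sketch of why only the
last call can take over the caller's storage. The names 'gen_example',
'g_example' and 'both_calls_see_same_values' are hypothetical stand-ins (the
'gen' and 'g' in the test case are external and their bodies are unknown). The
assumption is simply that a callee may overwrite its by-value parameter, which
C permits:

```c
#include <assert.h>

struct A {
  int data[8];
};

/* Hypothetical stand-in for the external 'gen' of the test case. */
static struct A gen_example(void)
{
  struct A a;
  for (int i = 0; i < 8; ++i)
    a.data[i] = i;
  return a;
}

/* Hypothetical stand-in for 'g'. It sums its parameter and then clobbers it,
   which is legal because the parameter is the callee's own object. */
static int g_example(struct A a)
{
  int sum = 0;
  for (int i = 0; i < 8; ++i) {
    sum += a.data[i];
    a.data[i] = -1;
  }
  return sum;
}

/* Returns 1 if both calls observed the values produced by gen_example(). */
int both_calls_see_same_values(void)
{
  struct A a = gen_example();
  int first = g_example(a);  /* 'a' is read again below: needs a copy */
  int second = g_example(a); /* last use of 'a': the copy is redundant */
  assert(first == second);   /* holds only because the first call got a copy */
  return first == 28 && second == 28;
}
```

If the first call were handed the caller's storage directly, the second call
would see the clobbered values. After the last call nobody reads 'a' again, so
passing the original storage would not be observable.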

IIUC, the second change would be significantly harder to implement because it
needs to shrink the stack between the two calls. However, I don't believe this
second case is as important. The first one should be sufficiently common
because temporaries are frequently passed as function arguments. So the
following variation

void f()
{
  g(gen(), gen());
}

is something I see often, leading to many unnecessary stack copies. Instead of

f:
        sub     rsp, 72
        xor     eax, eax
        mov     rdi, rsp
        call    gen
        lea     rdi, [rsp+32]
        xor     eax, eax
        call    gen
        sub     rsp, 64
        movdqa  xmm0, XMMWORD PTR [rsp+64]
        movups  XMMWORD PTR [rsp+32], xmm0
        movdqa  xmm0, XMMWORD PTR [rsp+80]
        movups  XMMWORD PTR [rsp+48], xmm0
        movdqa  xmm0, XMMWORD PTR [rsp+96]
        movups  XMMWORD PTR [rsp], xmm0
        movdqa  xmm0, XMMWORD PTR [rsp+112]
        movups  XMMWORD PTR [rsp+16], xmm0
        call    g
        add     rsp, 136
        ret

I think it should be:

f:
        sub     rsp, 72
        xor     eax, eax
        mov     rdi, rsp
        call    gen
        lea     rdi, [rsp+32]
        xor     eax, eax
        call    gen
        call    g
        add     rsp, 72
        ret

IIUC, this depends on the psABI and I don't know how target-dependent such an
optimization is. That's why I
