https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110035

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |aagarwa at gcc dot gnu.org,
                   |                            |amonakov at gcc dot gnu.org

--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Pontakorn Prasertsuk from comment #12)
> I notice that GCC also does not optimize this case:
> https://godbolt.org/z/7oGqjqqz4

Yes.  To quote:

#include <array>
#include <cstdint>
#include <cstdlib>
#include <iostream>

struct MyClass {
    std::array<uint64_t, 6> arr;
};

MyClass globalA;

// Prevent optimization
void sink(MyClass *m) { std::cout << m->arr[0] << std::endl; }

void __attribute__((noinline)) gg(MyClass &a) {
    MyClass c = a;
    MyClass *b = (MyClass *)malloc(sizeof(MyClass));
    *b = c;
    sink(b);
}

and we do RTL expansion from

  <bb 2> [local count: 1073741824]:
  vect_c_arr__M_elems_0_6.31_25 = MEM <vector(2) long unsigned int> [(long unsigned int *)a_2(D)];
  vect_c_arr__M_elems_0_6.32_27 = MEM <vector(2) long unsigned int> [(long unsigned int *)a_2(D) + 16B];
  vect_c_arr__M_elems_0_6.33_29 = MEM <vector(2) long unsigned int> [(long unsigned int *)a_2(D) + 32B];
  b_4 = malloc (48);
  MEM <vector(2) long unsigned int> [(long unsigned int *)b_4] = vect_c_arr__M_elems_0_6.31_25;
  MEM <vector(2) long unsigned int> [(long unsigned int *)b_4 + 16B] = vect_c_arr__M_elems_0_6.32_27;
  MEM <vector(2) long unsigned int> [(long unsigned int *)b_4 + 32B] = vect_c_arr__M_elems_0_6.33_29;
  sink (b_4); [tail call]

Note that the temporary was elided, but we specifically avoid letting TER
(some magic scheduling of stmts within a basic block) move expressions
across function calls, and there's no optimization phase that would try
to optimize register pressure over calls.  In this case we want to sink
the loads across the call; in other cases we want to avoid doing so.
In the end this would be a job for a late-running pass that factors
in things like register pressure and the set of call-clobbered registers.
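As a source-level illustration of the sinking described above (a hand-written
sketch of the before/after ordering, not anything GCC emits; the function
names copy_then_alloc/alloc_then_copy are made up for this example):

```cpp
#include <array>
#include <cstdint>
#include <cstdlib>

struct MyClass {
    std::array<uint64_t, 6> arr;
};

// Original order: load all of 'a' into a temporary, then call malloc,
// then store.  The six words loaded from 'a' (three xmm vectors after
// vectorization) are live across the call.
MyClass *copy_then_alloc(MyClass &a) {
    MyClass c = a;
    MyClass *b = (MyClass *)malloc(sizeof(MyClass));
    *b = c;
    return b;
}

// Sunk order: call malloc first, then copy directly.  Nothing loaded
// from 'a' is live across the call; only the pointer to 'a' is.  This
// is valid only if the compiler can prove malloc does not modify 'a',
// which is exactly the disambiguation question raised below.
MyClass *alloc_then_copy(MyClass &a) {
    MyClass *b = (MyClass *)malloc(sizeof(MyClass));
    *b = a;
    return b;
}
```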

I'll note that -fschedule-insns doesn't seem to have any effect here,
but I also remember that scheduling around calls was recently fiddled with,
specifically in r13-5154-g733a1b777f16cd, which restricts motion even
with -fsched-pressure (I'm not sure how that honors call-clobbered regs).

In the above case the GPR holding a_2(D) would still be needed after the
call (but there are non-call-clobbered GPRs available for it), while the
three data vectors in xmm registers would no longer be live across the
call (and all vector registers are call-clobbered on x86).

Of course I'm not sure at all whether RTL scheduling can disambiguate
against a 'malloc' call.
