https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106106

            Bug ID: 106106
           Summary: SRA scalarizes structure copies
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64*

The following example

#include <arm_neon.h>

float32x2x2_t f2(const float *p1, const float *p2)
{
    float32x2x2_t v = vld2_f32(p1);
    return vld2_lane_f32(p2, v, 1);
}

uses the type `float32x2x2_t`, a structure wrapping an array of two
`float32x2_t` vectors.  This type fits within the maximum object size for SRA,
so SRA tries to scalarize it.

However, in doing so it introduces some useless copies:

  D.22939 = __builtin_aarch64_ld2v2sf (p1_2(D));
  v = D.22939;
  __b = v;
  D.22937 = __builtin_aarch64_ld2_lanev2sf (p2_3(D), __b, 1); [tail call]

becomes

  D.22939 = __builtin_aarch64_ld2v2sf (p1_2(D));
  v$val$0_3 = D.22939.val[0];
  v$val$1_5 = D.22939.val[1];
  __b.val[0] = v$val$0_3;
  __b.val[1] = v$val$1_5;
  D.22937 = __builtin_aarch64_ld2_lanev2sf (p2_4(D), __b, 1); [tail call]

Having broken the structures up, it causes problems for register allocation:
these types require sequentially numbered registers, and reload is unable to
consolidate all the copies, resulting in superfluous register moves:

f2:
        ld2     {v2.2s - v3.2s}, [x0]
        mov     v0.8b, v2.8b
        mov     v1.8b, v3.8b
        ld2     {v0.s - v1.s}[1], [x1]
        ret
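
For readers without an AArch64 toolchain, the two GIMPLE forms above can be
modelled in portable C.  This is only a sketch: `vec2_t`, `pair2_t` and the
function names are hypothetical stand-ins (not part of arm_neon.h or GCC), and
plain float arrays replace the real vector types.  It shows that the
scalarized form computes the same value as the single aggregate copy, only
through extra per-element temporaries.

```c
#include <string.h>

/* Hypothetical stand-ins for float32x2_t and float32x2x2_t. */
typedef struct { float lane[2]; } vec2_t;
typedef struct { vec2_t val[2]; } pair2_t;

/* Shape before SRA: one aggregate assignment, as in `__b = v;`. */
pair2_t copy_aggregate(pair2_t v)
{
    pair2_t b = v;
    return b;
}

/* Shape after SRA: each element travels through its own temporary,
   mirroring v$val$0_3 and v$val$1_5 in the dump above. */
pair2_t copy_scalarized(pair2_t v)
{
    vec2_t v_val_0 = v.val[0];
    vec2_t v_val_1 = v.val[1];
    pair2_t b;
    b.val[0] = v_val_0;
    b.val[1] = v_val_1;
    return b;
}

/* Bytewise comparison; the stand-in types contain only floats. */
int same_bits(pair2_t a, pair2_t b)
{
    return memcmp(&a, &b, sizeof a) == 0;
}
```

Both functions are semantically identical, so any difference in the generated
code comes purely from how the copy is expressed to the later passes.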

The following snippet from a real library using intrinsics shows the resulting
carnage: https://godbolt.org/z/xnre3Pe34

Perhaps SRA should not scalarize a type if it is only being used in a copy?  Or
should there be a way to prevent scalarization of certain types?
