https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109391

            Bug ID: 109391
           Summary: Inefficient codegen on AArch64 when structure types
                    are returned
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Keywords: missed-optimization, ra
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
                CC: rsandifo at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64*

This example https://godbolt.org/z/Pe3f3ozGf

---

#include <arm_neon.h>

int16x8x3_t bsl(const uint16x8x3_t *check, const int16x8x3_t *in1,
                              const int16x8x3_t *in2) {
  int16x8x3_t out;
  for (uint32_t j = 0; j < 3; j++) {
    out.val[j] = vbslq_s16(check->val[j], in1->val[j], in2->val[j]);
  }
  return out;
}


---

Generates:

bsl:
        ldp     q6, q16, [x1]
        ldp     q0, q4, [x2]
        ldp     q5, q7, [x0]
        bsl     v5.16b, v6.16b, v0.16b
        ldr     q0, [x2, 32]
        bsl     v7.16b, v16.16b, v4.16b
        ldr     q6, [x1, 32]
        mov     v1.16b, v5.16b
        ldr     q5, [x0, 32]
        bsl     v5.16b, v6.16b, v0.16b
        mov     v0.16b, v1.16b
        mov     v1.16b, v7.16b
        mov     v2.16b, v5.16b
        ret

with 3 superfluous moves.  It looks like reload is having trouble dealing
with the new compound types as return values.

So in RTL we have:

(insn 17 20 22 2 (set (subreg:V8HI (reg/v:V3x8HI 105 [ out ]) 16)
        (xor:V8HI (and:V8HI (xor:V8HI (reg:V8HI 115 [ in2_11(D)->val[1] ])
                    (reg:V8HI 114 [ in1_10(D)->val[1] ]))
                (reg:V8HI 113 [ check_9(D)->val[1] ]))
            (reg:V8HI 115 [ in2_11(D)->val[1] ]))) "/app/example.c":7:16 discrim 1 2558 {aarch64_simd_bslv8hi_internal}
     (expr_list:REG_DEAD (reg:V8HI 115 [ in2_11(D)->val[1] ])
        (expr_list:REG_DEAD (reg:V8HI 114 [ in1_10(D)->val[1] ])
            (expr_list:REG_DEAD (reg:V8HI 113 [ check_9(D)->val[1] ])
                (nil)))))
(insn 22 17 29 2 (set (subreg:V8HI (reg/v:V3x8HI 105 [ out ]) 32)
        (xor:V8HI (and:V8HI (xor:V8HI (reg:V8HI 118 [ in2_11(D)->val[2] ])
                    (reg:V8HI 117 [ in1_10(D)->val[2] ]))
                (reg:V8HI 116 [ check_9(D)->val[2] ]))
            (reg:V8HI 118 [ in2_11(D)->val[2] ]))) "/app/example.c":7:16 discrim 1 2558 {aarch64_simd_bslv8hi_internal}
     (expr_list:REG_DEAD (reg:V8HI 118 [ in2_11(D)->val[2] ])
        (expr_list:REG_DEAD (reg:V8HI 117 [ in1_10(D)->val[2] ])
            (expr_list:REG_DEAD (reg:V8HI 116 [ check_9(D)->val[2] ])
                (nil)))))
(insn 29 22 30 2 (set (reg/i:V3x8HI 32 v0)
        (reg/v:V3x8HI 105 [ out ])) "/app/example.c":10:1 3964 {*aarch64_movv3x8hi}
     (expr_list:REG_DEAD (reg/v:V3x8HI 105 [ out ])
        (nil)))
(insn 30 29 37 2 (use (reg/i:V3x8HI 32 v0)) "/app/example.c":10:1 -1
     (nil))
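
For reference, the xor/and/xor expression in insns 17 and 22 is the usual
bitwise-select identity, which is why the RTL itself looks like what one
would expect for vbslq_s16.  Below is a minimal scalar sketch in plain C,
modelling a single 16-bit lane.  The function names are made up for
illustration and are not GCC or ACLE names; mask, a and b stand in for
check, in1 and in2 respectively.

```c
#include <assert.h>
#include <stdint.h>

/* Reference semantics of a bitwise select on one 16-bit lane:
   take bits from a where mask is set, and from b elsewhere.  */
uint16_t select_ref(uint16_t mask, uint16_t a, uint16_t b)
{
  return (uint16_t) ((mask & a) | (~mask & b));
}

/* Same shape as the RTL in insns 17 and 22: ((b ^ a) & mask) ^ b.  */
uint16_t select_xor_form(uint16_t mask, uint16_t a, uint16_t b)
{
  return (uint16_t) (((b ^ a) & mask) ^ b);
}

/* Every operation above is bitwise, so comparing the two forms on the
   eight single-bit input combinations covers all possible lane values.  */
void check_select_forms(void)
{
  for (unsigned bits = 0; bits < 8; bits++)
    {
      uint16_t mask = (bits >> 2) & 1;
      uint16_t a = (bits >> 1) & 1;
      uint16_t b = bits & 1;
      assert(select_ref(mask, a, b) == select_xor_form(mask, a, b));
    }
}
```

Calling check_select_forms() should pass silently if the two forms agree.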

Reload then decides to insert a bunch of reloads:

         Choosing alt 0 in insn 17:  (0) =w  (1) 0  (2) w  (3) w {aarch64_simd_bslv8hi_internal}
      Creating newreg=126 from oldreg=113, assigning class FP_REGS to r126
   17: r126:V8HI=r115:V8HI^r114:V8HI&r126:V8HI^r115:V8HI
      REG_DEAD r115:V8HI
      REG_DEAD r114:V8HI
      REG_DEAD r113:V8HI
    Inserting insn reload before:
   43: r126:V8HI=r113:V8HI
    Inserting insn reload after:
   44: r105:V3x8HI#16=r126:V8HI

which introduces these moves.  The problem existed with the previous structure
types as well (OImode etc.), so it's not new, but it costs us a lot of
performance.

I don't think I can fix this with the same pass as
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106106, can I?  In this case
the RTL looks fine.
