https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106106

--- Comment #2 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #1)
> SRA is eliding 'v' by doing what it does, so it essentially changes
> it looks like providing __builtin_neon_vld2_lanev2sf with float32x2x2
> argument and return type might avoid one copy.
> 

We already do, the UNSPEC is

(insn 11 10 12 2 (set (reg:V2x2SF 95 [ D.22913 ])
        (unspec:V2x2SF [
                (mem:BLK (reg/v/f:DI 100 [ p2 ]) [0  S8 A8])
                (reg/v:V2x2SF 97 [ __b ])
                (const_int 1 [0x1])
            ] UNSPEC_LD2_LANE)) "/opt/compiler-explorer/arm64/gcc-trunk-20220628/aarch64-unknown-linux-gnu/lib/gcc/aarch64-unknown-linux-gnu/13.0.0/include/arm_neon.h":17515:10 -1
     (nil))

> In any case improving register allocation or massaging the RTL before it
> is the way to go here.  How does the RTL IL fed to RA differ with/without
> SRA?

I am not sure this is a reload problem. The underlying type of float32x2x2_t,
which is V2x2SF, always reserves two sequential registers.

without SRA we get

(insn 8 7 9 2 (set (reg/v:V2x2SF 95 [ v ])
        (reg:V2x2SF 92 [ D.22915 ])) -1
     (nil))
(insn 9 8 10 2 (set (reg/v:V2x2SF 96 [ __b ])
        (reg/v:V2x2SF 95 [ v ])) -1
     (nil))

which is simple to eliminate as it's copying the whole structure in one go and
reload eliminates the extra move fine.  With SRA scalarization you end up with
a series of subregs

(insn 8 7 9 2 (set (reg:V2SF 93 [ v$val$1 ])
        (subreg:V2SF (reg:V2x2SF 94 [ D.22915 ]) 8)) -1
     (nil))
(insn 9 8 10 2 (set (subreg:V2SF (reg/v:V2x2SF 97 [ __b ]) 0)
        (subreg:V2SF (reg:V2x2SF 94 [ D.22915 ]) 0)) -1
     (nil))
(insn 10 9 11 2 (set (subreg:V2SF (reg/v:V2x2SF 97 [ __b ]) 8)
        (reg:V2SF 93 [ v$val$1 ])) -1
     (nil))

So we get an explicit extract and a piecewise recreation of the V2x2SF: 94 will
take two registers and 97 two different ones. Reload is just doing as it was
told.
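
As a rough source-level analogy only (this is not the testcase from the PR and
not what GCC generates), the two copy shapes above can be sketched in portable
C. The types v2sf and v2x2sf and the functions copy_whole, copy_piecewise and
same_bits are hypothetical stand-ins for V2SF, V2x2SF and the insn sequences;
the byte offsets 0 and 8 correspond to the subreg offsets in the dump.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins: v2sf models an 8-byte V2SF, v2x2sf a 16-byte V2x2SF. */
typedef struct { float lane[2]; } v2sf;
typedef struct { v2sf val[2]; } v2x2sf;

/* The halves sit at byte offsets 0 and 8, matching the subreg offsets. */
static_assert(sizeof(v2sf) == 8, "v2sf is expected to be 8 bytes");
static_assert(sizeof(v2x2sf) == 16, "v2x2sf is expected to be 16 bytes");

/* Without SRA: the whole structure is copied in one go (one V2x2SF move). */
static v2x2sf copy_whole(v2x2sf d)
{
    v2x2sf b = d;
    return b;
}

/* With SRA: the second half is extracted, then the structure is rebuilt
   piece by piece, mirroring insns 8, 9 and 10 in the dump. */
static v2x2sf copy_piecewise(v2x2sf d)
{
    v2sf val1 = d.val[1];   /* like insn 8: extract the half at offset 8 */
    v2x2sf b;
    b.val[0] = d.val[0];    /* like insn 9: copy the half at offset 0 */
    b.val[1] = val1;        /* like insn 10: store the extracted half */
    return b;
}

/* Returns 1 when both values have identical bytes. */
static int same_bits(const v2x2sf *x, const v2x2sf *y)
{
    return memcmp(x, y, sizeof *x) == 0;
}
```

Both functions produce the same value; the difference is only in the shape of
the copy, which is what leaves the allocator with two separate register pairs
in the scalarized case.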
