https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106106
--- Comment #2 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #1)
> SRA is eliding 'v' by doing what it does, so it essentially changes
> it looks like providing __builtin_neon_vld2_lanev2sf with float32x2x2
> argument and return type might avoid one copy.
>

We already do, the UNSPEC is

(insn 11 10 12 2 (set (reg:V2x2SF 95 [ D.22913 ])
        (unspec:V2x2SF [
                (mem:BLK (reg/v/f:DI 100 [ p2 ]) [0 S8 A8])
                (reg/v:V2x2SF 97 [ __b ])
                (const_int 1 [0x1])
            ] UNSPEC_LD2_LANE)) "/opt/compiler-explorer/arm64/gcc-trunk-20220628/aarch64-unknown-linux-gnu/lib/gcc/aarch64-unknown-linux-gnu/13.0.0/include/arm_neon.h":17515:10 -1
     (nil))

> In any case improving register allocation or massaging the RTL before it
> is the way to go here. How does the RTL IL fed to RA differ with/without
> SRA?

I am not sure this is a reload problem. The underlying type of float32x2x2_t,
which is V2x2SF, always reserves two sequential registers.

Without SRA we get

(insn 8 7 9 2 (set (reg/v:V2x2SF 95 [ v ])
        (reg:V2x2SF 92 [ D.22915 ])) -1
     (nil))
(insn 9 8 10 2 (set (reg/v:V2x2SF 96 [ __b ])
        (reg/v:V2x2SF 95 [ v ])) -1
     (nil))

which is simple to eliminate, as it copies the whole structure in one go, and
reload eliminates the extra move fine.

With SRA scalarization you end up with a series of subregs

(insn 8 7 9 2 (set (reg:V2SF 93 [ v$val$1 ])
        (subreg:V2SF (reg:V2x2SF 94 [ D.22915 ]) 8)) -1
     (nil))
(insn 9 8 10 2 (set (subreg:V2SF (reg/v:V2x2SF 97 [ __b ]) 0)
        (subreg:V2SF (reg:V2x2SF 94 [ D.22915 ]) 0)) -1
     (nil))
(insn 10 9 11 2 (set (subreg:V2SF (reg/v:V2x2SF 97 [ __b ]) 8)
        (reg:V2SF 93 [ v$val$1 ])) -1
     (nil))

So we get an explicit extract and a piecewise recreation of the V2x2SF value.
Pseudo 94 will take two registers and pseudo 97 two different ones. Reload is
just doing as it was told.
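
For readers who do not read RTL daily, the difference between the two
sequences above can be sketched at the source level. This is only a rough
analogy in portable C, not the testcase from this PR and not what SRA
literally emits: `vec2f`, `vec2x2f`, `copy_whole`, `copy_piecewise` and
`check_equivalence` are hypothetical names standing in for float32x2_t,
float32x2x2_t and the two copy shapes.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for float32x2_t (mode V2SF): two float lanes. */
typedef struct { float lane[2]; } vec2f;

/* Hypothetical stand-in for float32x2x2_t (mode V2x2SF): two vectors. */
typedef struct { vec2f val[2]; } vec2x2f;

/* Whole-structure copy, analogous to the single V2x2SF move seen
   without SRA: one assignment covers both halves. */
static vec2x2f copy_whole(vec2x2f src)
{
    vec2x2f dst = src;
    return dst;
}

/* Piecewise copy, analogous to the subreg sequence seen with SRA:
   the upper half is extracted into its own temporary (like v$val$1),
   then the destination is rebuilt one half at a time. */
static vec2x2f copy_piecewise(vec2x2f src)
{
    vec2x2f dst;
    vec2f hi = src.val[1];    /* extract at byte offset 8 */
    dst.val[0] = src.val[0];  /* write at byte offset 0 */
    dst.val[1] = hi;          /* write at byte offset 8 */
    return dst;
}

/* Both shapes produce the same value; only the form of the copy differs. */
static void check_equivalence(void)
{
    vec2x2f v = { { { { 1.0f, 2.0f } }, { { 3.0f, 4.0f } } } };
    vec2x2f a = copy_whole(v);
    vec2x2f b = copy_piecewise(v);
    assert(memcmp(&a, &b, sizeof a) == 0);
}
```

The result is identical either way. The point is only that the second shape
names source and destination as separate objects written in halves, which
corresponds to pseudos 94 and 97 each needing their own register pair.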