https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106106

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jamborm at gcc dot gnu.org,
                   |                            |rsandifo at gcc dot gnu.org

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #2)
> (In reply to Richard Biener from comment #1)
> > SRA is eliding 'v' by doing what it does, so it essentially changes
> > it looks like providing __builtin_neon_vld2_lanev2sf with float32x2x2
> > argument and return type might avoid one copy.
> > 
> 
> We already do, the UNSPEC is
> 
> (insn 11 10 12 2 (set (reg:V2x2SF 95 [ D.22913 ])
>         (unspec:V2x2SF [
>                 (mem:BLK (reg/v/f:DI 100 [ p2 ]) [0  S8 A8])
>                 (reg/v:V2x2SF 97 [ __b ])
>                 (const_int 1 [0x1])
>             ] UNSPEC_LD2_LANE))
> "/opt/compiler-explorer/arm64/gcc-trunk-20220628/aarch64-unknown-linux-gnu/
> lib/gcc/aarch64-unknown-linux-gnu/13.0.0/include/arm_neon.h":17515:10 -1
>      (nil))
> 
> > In any case improving register allocation or massaging the RTL before it
> > is the way to go here.  How does the RTL IL fed to RA differ with/without
> > SRA?
> 
> I am not sure this a reload problem. The underlying type of float32x2x2_t
> which is V2x2SF always reserves two sequential registers.
> 
> without SRA we get
> 
> (insn 8 7 9 2 (set (reg/v:V2x2SF 95 [ v ])
>         (reg:V2x2SF 92 [ D.22915 ])) -1
>      (nil))
> (insn 9 8 10 2 (set (reg/v:V2x2SF 96 [ __b ])
>         (reg/v:V2x2SF 95 [ v ])) -1
>      (nil))

So float32x2x2_t is a register on RTL but an aggregate in GIMPLE :/

> which is simple to eliminate as it's copying the whole structure in one go
> and reload eliminates the extra move fine.  With SRA scalarization you end
> up with a series of subregs
> 
> (insn 8 7 9 2 (set (reg:V2SF 93 [ v$val$1 ])
>         (subreg:V2SF (reg:V2x2SF 94 [ D.22915 ]) 8)) -1
>      (nil))
> (insn 9 8 10 2 (set (subreg:V2SF (reg/v:V2x2SF 97 [ __b ]) 0)
>         (subreg:V2SF (reg:V2x2SF 94 [ D.22915 ]) 0)) -1
>      (nil))
> (insn 10 9 11 2 (set (subreg:V2SF (reg/v:V2x2SF 97 [ __b ]) 8)
>         (reg:V2SF 93 [ v$val$1 ])) -1
>      (nil))

So why do we get the lowpart copy in insn 9 but the highpart requires two
insns?  But yes, it looks like the RA fails to follow copies of multi-reg
pseudos when they are copied component-wise.

> So we get an explicit extract and piecewise recreation of the V2x2SF, 94
> will take 2 registers and 97 two different ones. reload is just doing as it
> was told.

Is the fact that float32x2x2_t is an aggregate with a field named 'val'
part of the neon API?  If so I can write such a SRAed copy manually and
we don't optimize that well which means it is worth trying to optimize this.

In general it looks like re-composing these kind of copies for multi-register
pseudos might be a useful thing, not sure if there's a good pass in the
RTL pipeline to do this job.

Not optimizing it on the SRA side would leave you with extra aggregate
copies.  It might be worth enhancing SRA to be flow-sensitive so it
could see that the component replacement it creates die at the aggregate
rematerialization point so it would avoid creating them in the first place
but I fear that's a quite large project.

We could heuristically avoid to scalarize arrays when the aggregate has
a vector mode.  Alternatively instead of scalarizing to the array
element type we could choose the type of the aggregate mode (but only
when doing total scalarization, that is, when there are no component
uses or defs).

Reply via email to