https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

--- Comment #11 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #9)
> Perhaps the r12-2288-g8695bf78dad1a42636 change wasn't a good idea?

I think it's still a good idea, as it fixes a bigger problem (unneeded SIMD
partial extracts) and makes it easier to write RTL, since you don't have to deal
with both VEC_SELECT and subregs.  So having one canonical form is better.
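
Concretely, the fold in question looks like this (RTL sketched by hand rather
than pasted from a dump, and the pseudo register number is made up):

        ;; before r12-2288: explicit lane 0 extract
        (vec_select:SI (reg:V4SI 92) (parallel [(const_int 0)]))

        ;; after: canonical lowpart subreg (offset 0 is the lowpart on
        ;; little-endian)
        (subreg:SI (reg:V4SI 92) 0)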

> I mean, if we add some hack for the .REDUC_* stuff so that we don't have the
> lowpart vec_select that r12-2288 folds into a subreg, won't we still suffer
> the same problem when doing anything similar?

Yes, but I think the problem is in how we do the transfers to start with.
While looking at this issue I noticed that the SIMD <-> genreg transfers for
sizes that don't have an exact mode on the genreg side (i.e. 8-bit and 16-bit)
are suboptimal in a number of cases already (even before this change), and
dealing with that underlying problem first is better, so I postponed it to
GCC 13.

That is to say, even

typedef int V __attribute__((vector_size (4 * sizeof (int))));

int
test (V a)
{
  int sum = a[0];
  return (unsigned int)sum >> 16;
}

is suboptimal.
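
If I remember the current codegen right (hand-written here, so treat it as a
sketch), we transfer the full 32-bit lane and then shift on the genreg side:

        fmov    w0, s0          // move the whole 32-bit lane 0
        lsr     w0, w0, 16      // then shift in the genreg

when a single

        umov    w0, v0.h[1]     // h[1] is exactly bits [31:16] of s[0]

would do.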

> E.g. with -O2:
> 
> typedef int V __attribute__((vector_size (4 * sizeof (int))));
> 
> int
> test (V a)
> {
>   int sum = a[0];
>   return (((unsigned short)sum) + ((unsigned int)sum >> 16)) >> 1;
> }
> 
> The assembly difference is then:
> -     fmov    w0, s0
> -     lsr     w1, w0, 16
> -     add     w0, w1, w0, uxth
> +     umov    w0, v0.h[0]
> +     fmov    w1, s0
> +     add     w0, w0, w1, lsr 16
>       lsr     w0, w0, 1
>       ret
> Dunno how costly on aarch64 is Neon -> GPR register move.
> Is fmov w0, s0; fmov w1, s0 or fmov w0, s0; mov w1, w0 cheaper?

The answer is quite uarch specific, but in general fmov w0, s0; mov w1, w0 is
cheaper.  That said, for the sequence you pasted above it's really a bit of a
wash.

The old codegen has a longer dependency chain and needed both a shift and a
zero-extend.

The new codegen removes the zero-extend and folds the shift into the add, but
adds a transfer, so it roughly cancels out.
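
Reading the dependency chains off the two sequences above makes the trade-off
explicit:

        // old: fmov -> lsr -> add -> lsr      (4 ops deep, 1 transfer)
        // new: {umov || fmov} -> add -> lsr   (3 ops deep, 2 transfers)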

Ideally we'd want here:

        umov    w0, v0.h[0]
        umov    w1, v0.h[1]
        add     w0, w0, w1
        lsr     w0, w0, 1
        ret

where the shift and the zero extend are gone and the moves could be done in
parallel.
