https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049
--- Comment #11 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to Jakub Jelinek from comment #9)
> Perhaps the r12-2288-g8695bf78dad1a42636 change wasn't a good idea?

I think it's still a good idea, as it fixes a bigger problem (unneeded SIMD
partial extracts) and makes it easier to write RTL, since you don't have to
deal with both VEC_SELECT and subregs.  So having one canonical form is
better.

> I mean, if we add some hack for the .REDUC_* stuff so that we don't have the
> lowpart vec_select that r12-2288 folds into a subreg, won't we still suffer
> the same problem when doing anything similar?

Yes, but I think the problem is in how we do the transfers to start with.

While looking at this issue I noticed that the SIMD <-> genreg transfers for
sizes where we don't have an exact register on the genreg side (i.e. 8-bit
and 16-bit) are suboptimal (even before this change) in a number of cases
already, and dealing with that underlying problem first is better, so I
postponed it to GCC 13.

That is to say, even

typedef int V __attribute__((vector_size (4 * sizeof (int))));

int
test (V a)
{
  int sum = a[0];
  return (unsigned int)sum >> 16;
}

is suboptimal.

> E.g. with -O2:
>
> typedef int V __attribute__((vector_size (4 * sizeof (int))));
>
> int
> test (V a)
> {
>   int sum = a[0];
>   return (((unsigned short)sum) + ((unsigned int)sum >> 16)) >> 1;
> }
>
> The assembly difference is then:
> -        fmov    w0, s0
> -        lsr     w1, w0, 16
> -        add     w0, w1, w0, uxth
> +        umov    w0, v0.h[0]
> +        fmov    w1, s0
> +        add     w0, w0, w1, lsr 16
>          lsr     w0, w0, 1
>          ret
> Dunno how costly on aarch64 is a Neon -> GPR register move.
> Is fmov w0, s0; fmov w1, s0 or fmov w0, s0; mov w1, w0 cheaper?

The answer is quite uarch specific, but in general fmov w0, s0; mov w1, w0 is
cheaper.  That said, for the sequence you pasted above it's really a bit of a
wash.
The old codegen has a longer dependency chain and needed both a shift and a
zero-extend.  The new codegen removes the zero-extend and folds the shift into
the add, but adds a transfer, so it about cancels out.  Ideally we'd want
here:

        umov    w0, v0.h[0]
        umov    w1, v0.h[1]
        add     w0, w0, w1
        lsr     w0, w0, 1
        ret

where the shift and the zero-extend are gone and the moves can be done in
parallel.