[Bug tree-optimization/122277] [16 regression] Costing VF 1 instead of 4 since r16-4411-gb6e802fd55d37e

rguenther at suse dot de via Gcc-bugs Tue, 14 Oct 2025 05:45:14 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122277


--- Comment #3 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 14 Oct 2025, rdapp at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122277
> 
> --- Comment #2 from Robin Dapp <rdapp at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #1)
> > Hmm, I see V4QI being used:
> > 
> > ~/obj-riscv-g/gcc> /home/rguenther/obj-riscv-g/gcc/xgcc
> > -B/home/rguenther/obj-riscv-g/gcc/
> > /home/rguenther/src/gcc/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr118019-
> > 2.c -march=rv64gcv_zvl512b -mno-vector-strict-align
> > -fdiagnostics-plain-output -O3 -ftree-vectorize -O3 -march=rv64gcv_zvl512b
> > -mabi=lp64d -mno-vector-strict-align -ffat-lto-objects -fno-ident -S -o
> > pr118019-2.s -fdump-tree-vect-details -fopt-info-vec
> > /home/rguenther/src/gcc/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr118019-
> > 2.c:42:21: optimized: loop vectorized using 4 byte vectors and unroll factor
> > 4
> > /home/rguenther/src/gcc/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr118019-
> > 2.c:34:21: optimized: loop vectorized using 16 byte vectors and unroll
> > factor 4
> 
> Ugh, sorry.  I have been working on a local version that has 
> 
>  int tmp[4][4];
> 
> instead of
> 
>  uint32_t tmp[4][4];
> 
> (that's what upstream x264 uses).

Ah, so one change is that previously we didn't attempt single-lane SLP
when a reduction chain failed vectorization in the end:

note:   ==> examining statement: _59 = tmp[0][i_147];
missed:   permutation not supported, using elementwise access
missed:   Not using elementwise accesses due to variable vectorization factor.

but now we do, and single-lane RVVM1SI looks better from your
cost model.  I believe that before my patch we never tried this
because there's a conversion around the reduction and

            /* ???  When there's a conversion around the reduction
               chain 'last' isn't the entry of the reduction.  */
            if (STMT_VINFO_DEF_TYPE (last) != vect_reduction_def)
              return opt_result::failure_at (vect_location,
                                             "SLP build failed.\n");
            /* It can be still vectorized as part of an SLP reduction.  */
            loop_vinfo->reductions.safe_push (last);

is what we had previously, so there was no way to "recover" the original
non-chained reduction.  That's one of the "improvements": we can now
handle every reduction chain as its regular original reduction ...
