On Tue, Jan 28, 2025 at 10:05 AM Richard Sandiford
<[email protected]> wrote:
>
> vectorizable_slp_permutation_1 has two ways of generating the
> permutations: one that looks for repeating patterns and one that
> calculates the permutation index for every output element individually.
> The former works for VLA and VLS whereas the latter only works for VLS.
>
> There are two justifications for using the repeating code for VLS:
> it gives more testing coverage, and it should reduce the analysis
> overhead for common cases. This PR kind-of demonstrates both:
> the VLS coverage was showing a bug in the analysis shortcut.
>
> The bug seems to go back to g:ab7e60cec1a6, which added the
> repeating_p path. It generated N copies of the permutation vector
> in the repeating case, but didn't multiply the number of permutation
> instructions for costing purposes by N. So we seem to have been
> undercounting ncopies>1 permutations all this time...
>
> The problem became more visible with g:8157f3f2d211, which extended
> the repeating code to handle more cases.
>
> In the patch, I think noutputs is in practice always a multiple
> of unpack_factor, but it seemed more future-proof to handle the
> general case.
>
> Tested on aarch64-linux-gnu & x86_64-linux-gnu. Also tested by
> making sure that g:8157f3f2d211 with this patch added didn't
> change the imagick code for -O3 -mcpu=generic on aarch64
> compared to g:8157f3f2d211~. OK to install?
OK.
Thanks,
Richard.
> Richard
>
>
> gcc/
> PR tree-optimization/117270
> * tree-vect-slp.cc (vectorizable_slp_permutation_1): Make nperms
> account for the number of times that each permutation will be used
> during transformation.
> ---
> gcc/tree-vect-slp.cc | 14 ++++++++++----
> 1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 115b7b73eee..ac1733004b6 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -10884,9 +10884,15 @@ vectorizable_slp_permutation_1 (vec_info *vinfo,
> gimple_stmt_iterator *gsi,
> vectors to check during analysis, but we need to generate NOUTPUTS
> vectors during transformation. */
> unsigned total_nelts = olanes;
> - if (repeating_p && gsi)
> - total_nelts = (total_nelts / unpack_factor) * noutputs;
> - for (unsigned i = 0; i < total_nelts; ++i)
> + unsigned process_nelts = olanes;
> + if (repeating_p)
> + {
> + total_nelts = (total_nelts / unpack_factor) * noutputs;
> + if (gsi)
> + process_nelts = total_nelts;
> + }
> + unsigned last_ei = (total_nelts - 1) % process_nelts;
> + for (unsigned i = 0; i < process_nelts; ++i)
> {
> /* VI is the input vector index when generating code for REPEATING_P.
> */
> unsigned vi = i / olanes * (pack_p ? 2 : 1);
> @@ -10960,7 +10966,7 @@ vectorizable_slp_permutation_1 (vec_info *vinfo,
> gimple_stmt_iterator *gsi,
> }
>
> if (!identity_p)
> - nperms++;
> + nperms += CEIL (total_nelts, process_nelts) - (ei > last_ei);
> if (gsi)
> {
> if (second_vec.first == -1U)
> --
> 2.25.1
>