Hi Richard,

Thanks for the review.

> On 15 Oct 2025, at 10:39 pm, Richard Biener <[email protected]> 
> wrote:
>
> External email: Use caution opening links or attachments
>
>
> On Wed, Oct 15, 2025 at 12:08 AM Kugan Vivekanandarajah
> <[email protected]> wrote:
>>
>> Hi,
>>
>> This patch eliminates redundant reverse permutations in vectorized reverse
>> loops by detecting and optimizing patterns during store vectorization.
>>
>> The reverse load (b[i]) generates PERM, operations are applied, then the
>> reverse store adds another PERM. This creates redundant permute pairs that
>> we now detect and eliminate.
>>
>> With the patch, for the example loop
>>  for (int i = N - 1; i >= 0; i--)
>>    {
>>      a[i] = b[i] + 1.0f;
>>    }
>> Changes to the following
>> -       ldr     q29, [x0, x2]
>> -       tbl     v29.16b, {v29.16b}, v31.16b
>> -       fadd    v29.4s, v29.4s, v30.4s
>> -       tbl     v29.16b, {v29.16b}, v31.16b
>> -       str     q29, [x3, x2]
>> +       ldr     q30, [x0, x2]
>> +       fadd    v30.4s, v30.4s, v31.4s
>> +       str     q30, [x3, x2]
>
> So this works basically as a post-processing optimization at the time
> we generate the
> vector store.  While that's in principle an OK optimization I'd rather
> have such post-processing
> implemented outside of the vectorizer because then also permutes not
> originating from
> vectorizer permuted store generation would benefit.
>
> As for implementing this in the vectorizer itself the more appropriate
> thing would be
> to expose these permutes to the permute optimization phase, because then it 
> can
> be also taken into account during costing and a reverse load permute
> could be elided
> if it feeds an associatable reduction.
>
> There is, unfortunately, currently no good way to represent how we implement
> negative strided contiguous accesses with load permutations as the peculiarity
> only exposes itself after applying the VF and load/lane permutations are
> represented on the VF == 1 SLP graph.  One of my ideas what that once we
> settle on VF (and possibly vector types) we want to expand the SLP graph
> to cover all lanes of the vector loop so we can expose actual permutes and
> vector granularity.  This is a bit far off though.
>
> So in line with your patch but more appropriate for in-vectorizer
> operation would
> be an analysis on the SLP graph that simply marks reverse permutes that can
> be elided (for the back-to-back case).  This way both costing and code
> generation
> can take this into account and you wouldn't have to adjust any stmts.

I  have now changed it to account for the costing. Bootstrapped and regression 
tested on aarch64-linux-gnu.

Is this OK?

Thanks,
Kugan


>
> Thanks,
> Richard.
>
>>        PR tree-optimization/61338
>>
>> gcc/ChangeLog:
>>        (get_vector_perm_operand): New.
>>        (vect_find_reverse_permute_operand): New  helper function
>>        to find reverse permutations through element-wise operation chains.
>>        Returns true only if ALL operands have reverse permutations.
>>        (vectorizable_store): Use recursive helper to eliminate redundant
>>        reverse permutations with configurable search depth.
>>
>> gcc/testsuite/ChangeLog:
>>
>>        * gcc.dg/vect/slp-permute-reverse-1.c: New test for basic
>>        reverse permute optimization (simple copy).
>>        * gcc.dg/vect/slp-permute-reverse-2.c: New runtime test for
>>        basic pattern.
>> Signed-off-by: Kugan Vivekanandarajah <[email protected]>
>>
>> Bootstrapped and regression tested on aarch64-linux-gcc. Is this OK?
>>
>> Thanks,
>> Kugan


Attachment: 0001-PATCH-tree-optimization-61338-Optimize-redundant-rev.patch
Description: 0001-PATCH-tree-optimization-61338-Optimize-redundant-rev.patch

Reply via email to