On 04/12/15 17:46, Ramana Radhakrishnan wrote:


On 04/12/15 16:04, Richard Biener wrote:
On December 4, 2015 4:32:33 PM GMT+01:00, Alan Lawrence <alan.lawre...@arm.com> wrote:
On 27/11/15 08:30, Richard Biener wrote:

This is part 1 of a fix for PR68533, which shows that some targets
fail can_vec_perm_p on an identity permutation.  I chose to fix
this in the vectorizer by detecting the identity itself, but with
the current structure of vect_transform_slp_perm_load that is
somewhat awkward.  Thus the following no-op patch simplifies it
greatly (the complexity is a leftover from the time it was restricted
to interleaving-kind permutes).  It turned out not to be a 100% no-op,
as we can now handle non-adjacent source operands, so I split it out
from the actual fix.
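
For reference, here is a minimal, made-up sketch (not one of the testcases
in this thread; type and function names are invented) of a loop whose SLP
load permutation is the identity: the grouped loads are consumed in exactly
the order they sit in memory, so the target is effectively asked for a
no-op permute:

/* Hypothetical example only.  */
struct pair { int a, b; };

void
copy_pairs (struct pair *restrict dst, const struct pair *restrict src, int n)
{
  for (int i = 0; i < n; i++)
    {
      dst[i].a = src[i].a;   /* lane 0 feeds lane 0 */
      dst[i].b = src[i].b;   /* lane 1 feeds lane 1 */
    }
}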

The two adjusted testcases no longer fail to vectorize because of
"need three vectors", but unadjusted they would fail because there
are simply not enough scalar iterations in the loop.  I adjusted
that, and now we vectorize them just fine (running into PR68559,
which I filed).

Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.

Richard.

2015-11-27  Richard Biener  <rguent...@suse.de>

        PR tree-optimization/68533
        * tree-vect-slp.c (vect_get_mask_element): Remove.
        (vect_transform_slp_perm_load): Implement in a simpler way.

        * gcc.dg/vect/pr45752.c: Adjust.
        * gcc.dg/vect/slp-perm-4.c: Likewise.

On aarch64 and ARM targets, this causes

PASS->FAIL: gcc.dg/vect/O3-pr36098.c scan-tree-dump-times vect "vectorizing stmts using SLP" 0

That is, we now vectorize using SLP, when previously we did not.
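
For context, the failing check is a DejaGnu scan of the vect dump; in the
testcase it presumably looks roughly like this (string, count and dump name
taken from the FAIL line above; the surrounding syntax is the usual
directive form):

/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" } } */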

On aarch64 (and I expect ARM too), previously we used a VEC_LOAD_LANES
without unrolling, but now we unroll by 4 and vectorize using 3 loads and
permutes:

Happens on x86_64 as well with at least SSE4.1.  Unfortunately we'll have to 
start introducing much more fine-grained target-supports for vect_perm to 
reliably guard all targets.
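
The usual way to express such a guard is a target selector on the dump
scan; a rough sketch (vect_perm is an existing effective-target keyword,
though something finer-grained may be what is actually needed here):

/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { ! vect_perm } } } } */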

I don't know enough about SSE4.1 to know whether it's a problem there or
not.  This is an actual regression on AArch64 and ARM, not just a testism;
you now get:

.L5:
         ldr     q0, [x5, 16]
         add     x4, x4, 48
         ldr     q1, [x5, 32]
         add     w6, w6, 1
         ldr     q4, [x5, 48]
         cmp     w3, w6
         ldr     q2, [x5], 64
         orr     v3.16b, v0.16b, v0.16b
         orr     v5.16b, v4.16b, v4.16b
         orr     v4.16b, v1.16b, v1.16b
         tbl     v0.16b, {v0.16b - v1.16b}, v6.16b
         tbl     v2.16b, {v2.16b - v3.16b}, v7.16b
         tbl     v4.16b, {v4.16b - v5.16b}, v16.16b
         str     q0, [x4, -32]
         str     q2, [x4, -48]
         str     q4, [x4, -16]
         bhi     .L5

instead of

.L5:
         ld4     {v4.4s - v7.4s}, [x7], 64
         add     w4, w4, 1
         cmp     w3, w4
         orr     v1.16b, v4.16b, v4.16b
         orr     v2.16b, v5.16b, v5.16b
         orr     v3.16b, v6.16b, v6.16b
         st3     {v1.4s - v3.4s}, [x6], 48
         bhi     .L5

LD4 and ST3 do all the permutes without needing actual permute
instructions; a strategy that favours generic permutes over the load_lanes
case is likely to be more expensive on most implementations.  I think it's
worth a PR at least.
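
For illustration, a made-up loop of the shape involved here (a group of
four loads per iteration feeding a group of three stores; names and element
type are invented, this is not the code from O3-pr36098.c), where LD4/ST3
do the de-interleaving and re-interleaving implicitly:

/* Hypothetical example only.  */
void
drop_fourth (float *restrict out, const float *restrict in, int n)
{
  for (int i = 0; i < n; i++)
    {
      out[3 * i + 0] = in[4 * i + 0];   /* ld4 splits the four input streams; */
      out[3 * i + 1] = in[4 * i + 1];   /* st3 re-interleaves the three kept  */
      out[3 * i + 2] = in[4 * i + 2];   /* ones on the way out.  */
    }
}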

regards
Ramana


Yes, quite right. PR 68707.

--Alan
