On 04/12/15 17:46, Ramana Radhakrishnan wrote:
On 04/12/15 16:04, Richard Biener wrote:
On December 4, 2015 4:32:33 PM GMT+01:00, Alan Lawrence <alan.lawre...@arm.com> wrote:
On 27/11/15 08:30, Richard Biener wrote:
This is part 1 of a fix for PR68533 which shows that some targets
cannot can_vec_perm_p on an identity permutation. I chose to fix
this in the vectorizer by detecting the identity itself but with
the current structure of vect_transform_slp_perm_load this is
somewhat awkward. Thus the following no-op patch simplifies it
greatly (from the times it was restricted to do interleaving-kind
of permutes). It turned out not to be a 100% no-op, as we can now
handle non-adjacent source operands, so I split it out from the
actual fix.
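As a rough sketch of that detection (not the actual GCC code; the plain
index-array representation of the selector is an assumption for
illustration), an identity permutation is simply one where every element
selects itself:

#include <stdbool.h>

/* Sketch only: return true if SEL describes the identity permutation,
   i.e. element I selects input element I for all NELT elements.  In
   that case no permute is needed and the target need not be asked
   about it via can_vec_perm_p.  */
static bool
perm_mask_is_identity_p (const unsigned int *sel, unsigned int nelt)
{
  for (unsigned int i = 0; i < nelt; ++i)
    if (sel[i] != i)
      return false;
  return true;
}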
The two adjusted testcases no longer fail to vectorize because
of "need three vectors", but unadjusted they would fail because there
are simply not enough scalar iterations in the loop. I adjusted
that and now we vectorize them just fine (running into PR68559,
which I filed).
Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.
Richard.
2015-11-27  Richard Biener  <rguent...@suse.de>

        PR tree-optimization/68553
        * tree-vect-slp.c (vect_get_mask_element): Remove.
        (vect_transform_slp_perm_load): Implement in a simpler way.

        * gcc.dg/vect/pr45752.c: Adjust.
        * gcc.dg/vect/slp-perm-4.c: Likewise.
On aarch64 and ARM targets, this causes
PASS->FAIL: gcc.dg/vect/O3-pr36098.c scan-tree-dump-times vect "vectorizing stmts using SLP" 0
That is, we now vectorize using SLP, when previously we did not.
On aarch64 (and I expect ARM too), previously we used a VEC_LOAD_LANES
without unrolling, but now we unroll by 4 and vectorize using 3 loads
and permutes.
Happens on x86_64 as well with at least SSE4.1. Unfortunately we'll have to
start introducing much more fine-grained target-supports for vect_perm to
reliably guard all targets.
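As a sketch of what such a guard could look like in the testcase (an
assumption, using the existing coarse-grained vect_perm effective target
rather than a new fine-grained keyword):

/* Sketch only: expect no SLP vectorization only on targets without
   general permute support; finer-grained keywords than vect_perm may
   be needed to get this right everywhere.  */
/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { ! vect_perm } } } } */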
I don't know enough about SSE4.1 to know whether it's a problem there or not.
This is an actual regression on AArch64 and ARM and not just a testism; you now
get:
.L5:
ldr q0, [x5, 16]
add x4, x4, 48
ldr q1, [x5, 32]
add w6, w6, 1
ldr q4, [x5, 48]
cmp w3, w6
ldr q2, [x5], 64
orr v3.16b, v0.16b, v0.16b
orr v5.16b, v4.16b, v4.16b
orr v4.16b, v1.16b, v1.16b
tbl v0.16b, {v0.16b - v1.16b}, v6.16b
tbl v2.16b, {v2.16b - v3.16b}, v7.16b
tbl v4.16b, {v4.16b - v5.16b}, v16.16b
str q0, [x4, -32]
str q2, [x4, -48]
str q4, [x4, -16]
bhi .L5
instead of
.L5:
ld4 {v4.4s - v7.4s}, [x7], 64
add w4, w4, 1
cmp w3, w4
orr v1.16b, v4.16b, v4.16b
orr v2.16b, v5.16b, v5.16b
orr v3.16b, v6.16b, v6.16b
st3 {v1.4s - v3.4s}, [x6], 48
bhi .L5
LD4 and ST3 do all the permutes without needing actual permute instructions;
a strategy that favours generic permutes over the load_lanes approach is likely
to be more expensive on most implementations. I think it is worth a PR at least.
regards
Ramana
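For reference, a loop of roughly the following shape (an illustration only,
not the actual O3-pr36098.c source) gives rise to this kind of code: it reads
32-bit elements in groups of four and stores three per group, which maps
naturally onto LD4/ST3 but needs TBL permutes on the SLP path.

/* Illustrative only: a grouped access pattern of the kind discussed
   above: load groups of four 32-bit elements, store three of them.  */
void
copy_three_of_four (unsigned int *restrict dst,
                    const unsigned int *restrict src, int n)
{
  for (int i = 0; i < n; i++)
    {
      dst[3 * i + 0] = src[4 * i + 0];
      dst[3 * i + 1] = src[4 * i + 1];
      dst[3 * i + 2] = src[4 * i + 2];
    }
}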
Yes, quite right. PR 68707.
--Alan