On Fri, 4 Dec 2015, Alan Lawrence wrote:

> On 04/12/15 17:46, Ramana Radhakrishnan wrote:
> >
> > On 04/12/15 16:04, Richard Biener wrote:
> > > On December 4, 2015 4:32:33 PM GMT+01:00, Alan Lawrence
> > > <alan.lawre...@arm.com> wrote:
> > > > On 27/11/15 08:30, Richard Biener wrote:
> > > > >
> > > > > This is part 1 of a fix for PR68553, which shows that some targets
> > > > > fail can_vec_perm_p on an identity permutation.  I chose to fix
> > > > > this in the vectorizer by detecting the identity itself, but with
> > > > > the current structure of vect_transform_slp_perm_load this is
> > > > > somewhat awkward.  Thus the following no-op patch simplifies it
> > > > > greatly (left over from the time it was restricted to
> > > > > interleaving-kind permutes).  It turned out not to be a 100%
> > > > > no-op, as we can now handle non-adjacent source operands, so I
> > > > > split it out from the actual fix.
> > > > >
> > > > > The two adjusted testcases no longer fail to vectorize because
> > > > > of "need three vectors", but unadjusted they would fail because
> > > > > there are simply not enough scalar iterations in the loop.  I
> > > > > adjusted that, and now we vectorize them just fine (running into
> > > > > PR68559, which I filed).
> > > > >
> > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.
> > > > >
> > > > > Richard.
> > > > >
> > > > > 2015-11-27  Richard Biener  <rguent...@suse.de>
> > > > >
> > > > > 	PR tree-optimization/68553
> > > > > 	* tree-vect-slp.c (vect_get_mask_element): Remove.
> > > > > 	(vect_transform_slp_perm_load): Implement in a simpler way.
> > > > >
> > > > > 	* gcc.dg/vect/pr45752.c: Adjust.
> > > > > 	* gcc.dg/vect/slp-perm-4.c: Likewise.
> > > >
> > > > On aarch64 and ARM targets, this causes
> > > >
> > > > PASS->FAIL: gcc.dg/vect/O3-pr36098.c scan-tree-dump-times vect
> > > > "vectorizing stmts using SLP" 0
> > > >
> > > > That is, we now vectorize using SLP, when previously we did not.
> > > >
> > > > On aarch64 (and I expect ARM too), previously we used a
> > > > VEC_LOAD_LANES without unrolling, but now we unroll by 4 and
> > > > vectorize using 3 loads and permutes:
> > >
> > > Happens on x86_64 as well with at least SSE4.1.  Unfortunately we'll
> > > have to start introducing much more fine-grained target-supports for
> > > vect_perm to reliably guard all targets.
> >
> > I don't know enough about SSE4.1 to know whether it's a problem there
> > or not.  This is an actual regression on AArch64 and ARM and not just
> > a testism; you now get:
> >
> > .L5:
> >         ldr     q0, [x5, 16]
> >         add     x4, x4, 48
> >         ldr     q1, [x5, 32]
> >         add     w6, w6, 1
> >         ldr     q4, [x5, 48]
> >         cmp     w3, w6
> >         ldr     q2, [x5], 64
> >         orr     v3.16b, v0.16b, v0.16b
> >         orr     v5.16b, v4.16b, v4.16b
> >         orr     v4.16b, v1.16b, v1.16b
> >         tbl     v0.16b, {v0.16b - v1.16b}, v6.16b
> >         tbl     v2.16b, {v2.16b - v3.16b}, v7.16b
> >         tbl     v4.16b, {v4.16b - v5.16b}, v16.16b
> >         str     q0, [x4, -32]
> >         str     q2, [x4, -48]
> >         str     q4, [x4, -16]
> >         bhi     .L5
> >
> > instead of
> >
> > .L5:
> >         ld4     {v4.4s - v7.4s}, [x7], 64
> >         add     w4, w4, 1
> >         cmp     w3, w4
> >         orr     v1.16b, v4.16b, v4.16b
> >         orr     v2.16b, v5.16b, v5.16b
> >         orr     v3.16b, v6.16b, v6.16b
> >         st3     {v1.4s - v3.4s}, [x6], 48
> >         bhi     .L5
> >
> > LD4 and ST3 do all the permutes without needing actual permute
> > instructions - a strategy that favours generic permutes over the
> > load-lanes case is likely to be more expensive on most
> > implementations.  I think this is worth a PR at least.
> >
> > regards
> > Ramana
>
> Yes, quite right.  PR 68707.
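(For reference, the loop shape under discussion looks roughly like the
sketch below.  This is a hedged reconstruction from the assembly above,
not the actual gcc.dg/vect/O3-pr36098.c source, and the function name is
made up.)

/* Read groups of four ints, store groups of three.  With
   load/store-lanes the loop body becomes an ld4/st3 pair on AArch64
   (64 bytes in, 48 bytes out per iteration, matching the dumps above);
   with SLP it becomes plain q-register loads plus tbl permutes.
   The unused fourth source element is the "gap" referred to below.  */
void
copy4to3 (unsigned int *restrict dst, const unsigned int *restrict src,
          int n)
{
  for (int i = 0; i < n; i++)
    {
      dst[3 * i + 0] = src[4 * i + 0];
      dst[3 * i + 1] = src[4 * i + 1];
      dst[3 * i + 2] = src[4 * i + 2];
    }
}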
Thanks - I will think of something.  Note that it's not at all obvious
that the second variant is cheaper: in the load-lane variant you have a
larger unrolling factor plus extra peeling due to the gap (ld4), which
means loops that do not iterate many times at runtime are likely
pessimized compared to the SLP variant.  I'm not sure where the actual
cut-off would be.

Btw, is an ld3/st3 pair actually cheaper than three plain loads/stores
(without extra permutation)?  It's shorter code, but is it faster?

Thanks,
Richard.
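(To make that last question concrete, one way to read it is as the
contrast sketched below: a lane instruction that loads and deinterleaves
in one go, versus plain contiguous vector accesses that need no
permutation at all.  The functions are illustrative only, and which
variant GCC actually picks depends on the target cost model.)

/* Interleaved x/y/z triples: since each component gets a different
   scale factor, vectorizing this needs either ld3/st3 or explicit
   permutes.  */
void
scale_interleaved (float *restrict p, int n, float sx, float sy, float sz)
{
  for (int i = 0; i < n; i++)
    {
      p[3 * i + 0] *= sx;
      p[3 * i + 1] *= sy;
      p[3 * i + 2] *= sz;
    }
}

/* The same work on planar arrays vectorizes with three plain
   contiguous vector loads/stores and no permutation - the baseline
   the ld3/st3 pair would be compared against.  */
void
scale_planar (float *restrict x, float *restrict y, float *restrict z,
              int n, float sx, float sy, float sz)
{
  for (int i = 0; i < n; i++)
    {
      x[i] *= sx;
      y[i] *= sy;
      z[i] *= sz;
    }
}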