On Fri, 4 Dec 2015, Alan Lawrence wrote:

> On 04/12/15 17:46, Ramana Radhakrishnan wrote:
> > 
> > 
> > On 04/12/15 16:04, Richard Biener wrote:
> > > On December 4, 2015 4:32:33 PM GMT+01:00, Alan Lawrence
> > > <alan.lawre...@arm.com> wrote:
> > > > On 27/11/15 08:30, Richard Biener wrote:
> > > > > 
> > > > > This is part 1 of a fix for PR68533, which shows that some targets
> > > > > cannot handle can_vec_perm_p on an identity permutation.  I chose
> > > > > to fix this in the vectorizer by detecting the identity itself,
> > > > > but with the current structure of vect_transform_slp_perm_load
> > > > > this is somewhat awkward.  Thus the following no-op patch
> > > > > simplifies it greatly (its complexity dates from the time it was
> > > > > restricted to interleaving-kind permutes).  It turned out not to
> > > > > be a 100% no-op as we can now handle non-adjacent source operands,
> > > > > so I split it out from the actual fix.
> > > > > 
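> > > > > For illustration, a minimal kernel of the kind where such an
> > > > > identity load permutation shows up (a hypothetical sketch, not
> > > > > the PR testcase):
> > > > >
> > > > >   void
> > > > >   foo (int *restrict a, const int *restrict b, int n)
> > > > >   {
> > > > >     for (int i = 0; i < n; i += 2)
> > > > >       {
> > > > >         /* The SLP group loads b[i] and b[i+1] in order, so the
> > > > >            load permutation is the identity {0, 1} and no
> > > > >            can_vec_perm_p query should be needed at all.  */
> > > > >         a[i] = b[i] + 1;
> > > > >         a[i + 1] = b[i + 1] + 2;
> > > > >       }
> > > > >   }
> > > > > 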
> > > > > The two adjusted testcases no longer fail to vectorize because
> > > > > of "need three vectors", but unadjusted they would still fail
> > > > > because there are simply not enough scalar iterations in the
> > > > > loop.  I adjusted that, and now we vectorize them just fine
> > > > > (running into PR68559, which I filed).
> > > > > 
> > > > > Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.
> > > > > 
> > > > > Richard.
> > > > > 
> > > > > 2015-11-27  Richard Biener  <rguent...@suse.de>
> > > > > 
> > > > >       PR tree-optimization/68553
> > > > >       * tree-vect-slp.c (vect_get_mask_element): Remove.
> > > > >       (vect_transform_slp_perm_load): Implement in a simpler way.
> > > > > 
> > > > >       * gcc.dg/vect/pr45752.c: Adjust.
> > > > >       * gcc.dg/vect/slp-perm-4.c: Likewise.
> > > > 
> > > > On aarch64 and ARM targets, this causes
> > > > 
> > > > PASS->FAIL: gcc.dg/vect/O3-pr36098.c scan-tree-dump-times vect
> > > >   "vectorizing stmts using SLP" 0
> > > > 
> > > > That is, we now vectorize using SLP, when previously we did not.
> > > > 
> > > > On aarch64 (and I expect ARM too), previously we used a
> > > > VEC_LOAD_LANES, without unrolling, but now we unroll * 4, and
> > > > vectorize using 3 loads and permutes:
> > > 
> > > Happens on x86_64 as well with at least SSE4.1.  Unfortunately we'll have
> > > to start introducing much more fine-grained target-supports for vect_perm
> > > to reliably guard all targets.
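> > >
> > > For example, the scan could be keyed on an effective target like
> > > the existing vect_perm (a sketch; keywords such as vect_perm and
> > > vect_perm_byte already live in lib/target-supports.exp, and
> > > finer-grained variants would follow the same pattern):
> > >
> > >   /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { ! vect_perm } } } } */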
> > 
> > I don't know enough about SSE4.1 to know whether it's a problem there
> > or not.  This is an actual regression on AArch64 and ARM, not just a
> > testsuite artifact; you now get:
> > 
> > .L5:
> >          ldr     q0, [x5, 16]
> >          add     x4, x4, 48
> >          ldr     q1, [x5, 32]
> >          add     w6, w6, 1
> >          ldr     q4, [x5, 48]
> >          cmp     w3, w6
> >          ldr     q2, [x5], 64
> >          orr     v3.16b, v0.16b, v0.16b
> >          orr     v5.16b, v4.16b, v4.16b
> >          orr     v4.16b, v1.16b, v1.16b
> >          tbl     v0.16b, {v0.16b - v1.16b}, v6.16b
> >          tbl     v2.16b, {v2.16b - v3.16b}, v7.16b
> >          tbl     v4.16b, {v4.16b - v5.16b}, v16.16b
> >          str     q0, [x4, -32]
> >          str     q2, [x4, -48]
> >          str     q4, [x4, -16]
> >          bhi     .L5
> > 
> > instead of
> > 
> > .L5:
> >          ld4     {v4.4s - v7.4s}, [x7], 64
> >          add     w4, w4, 1
> >          cmp     w3, w4
> >          orr     v1.16b, v4.16b, v4.16b
> >          orr     v2.16b, v5.16b, v5.16b
> >          orr     v3.16b, v6.16b, v6.16b
> >          st3     {v1.4s - v3.4s}, [x6], 48
> >          bhi     .L5
> > 
> > LD4 and ST3 do all the permutes without needing actual permute
> > instructions, so a strategy that favours generic permutes over the
> > load_lanes case is likely to be more expensive on most implementations.
> > I think this is worth a PR at least.
> > 
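> > For reference, the lane forms are exactly what the NEON structure
> > load/store intrinsics expose; a hypothetical sketch of the 4-to-3
> > repacking above written directly with vld4q/vst3q (assuming n is a
> > multiple of 4):
> >
> >   #include <arm_neon.h>
> >
> >   void
> >   repack (float *restrict dst, const float *restrict src, int n)
> >   {
> >     for (int i = 0; i < n; i += 4)
> >       {
> >         /* One LD4: loads 16 floats, de-interleaved into four
> >            registers; the "permute" happens in the load itself.  */
> >         float32x4x4_t v = vld4q_f32 (src + 4 * i);
> >         /* One ST3: re-interleaves three of the four lanes on the
> >            way out; no separate TBL permutes are required.  */
> >         float32x4x3_t w = { { v.val[0], v.val[1], v.val[2] } };
> >         vst3q_f32 (dst + 3 * i, w);
> >       }
> >   }
> > 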
> > regards
> > Ramana
> > 
> 
> Yes, quite right. PR 68707.

Thanks - I will think of something.  Note that it's not at all obvious
that the 2nd (load-lane) variant is cheaper: it has a larger unrolling
factor plus extra peeling due to the gap (ld4).  This means that loops
not iterating much at runtime are likely pessimized compared to the SLP
variant.  I'm not sure where the actual cut-off would be.
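
For illustration, a hypothetical kernel of that shape (not the actual
testcase): the load group has four elements but only three are used,
so the load-lane variant reads a gap and needs the extra epilogue
peeling:

  /* Sketch: group of 4 loads with one unused element (a gap), group
     of 3 stores.  ld4 consumes four whole structures per vector
     iteration, and the gap means a full final vector iteration could
     read past the end of the input, hence peeling for gaps.  */
  void
  f (float *restrict out, const float *restrict in, int n)
  {
    for (int i = 0; i < n; i++)
      {
        out[3 * i + 0] = in[4 * i + 0];
        out[3 * i + 1] = in[4 * i + 1];
        out[3 * i + 2] = in[4 * i + 2];
      }
  }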

Btw, is an ld3/st3 pair actually cheaper than three plain ld/st (when
no extra permutation is needed)?  It's shorter code, but is it faster?
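
A hypothetical sketch of the two alternatives behind that question, on
a plain 48-byte move where no de-interleaving is needed:

  #include <arm_neon.h>

  /* Variant A: one ld3/st3 pair; de-interleaves on load and
     re-interleaves on store, a round-trip that leaves the memory
     layout unchanged.  */
  void
  copy12_lanes (float *restrict dst, const float *restrict src)
  {
    vst3q_f32 (dst, vld3q_f32 (src));
  }

  /* Variant B: three plain loads and stores, no lane shuffling.  */
  void
  copy12_plain (float *restrict dst, const float *restrict src)
  {
    vst1q_f32 (dst, vld1q_f32 (src));
    vst1q_f32 (dst + 4, vld1q_f32 (src + 4));
    vst1q_f32 (dst + 8, vld1q_f32 (src + 8));
  }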

Thanks,
Richard.
