On Fri, 3 Mar 2017, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
> >
> >> Richard Biener <rguent...@suse.de> writes:
> >> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
> >> >
> >> >> Richard Biener <rguent...@suse.de> writes:
> >> >> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
> >> >> >
> >> >> >> Richard Biener <rguent...@suse.de> writes:
> >> >> >> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
> >> >> >> >
> >> >> >> >> Sorry for the late reply, but:
> >> >> >> >> 
> >> >> >> >> Richard Biener <rguent...@suse.de> writes:
> >> >> >> >> > On Mon, 7 Nov 2016, Richard Biener wrote:
> >> >> >> >> >
> >> >> >> >> >> 
> >> >> >> >> >> Currently we force peeling for gaps whenever element overrun
> >> >> >> >> >> can occur, but for aligned accesses we know that the loads
> >> >> >> >> >> won't trap, so we can avoid it.
> >> >> >> >> >> 
> >> >> >> >> >> Bootstrap and regtest running on x86_64-unknown-linux-gnu
> >> >> >> >> >> (I expect
> >> >> >> >> >> some testsuite fallout here so didn't bother to invent a
> >> >> >> >> >> new testcase).
> >> >> >> >> >> 
> >> >> >> >> >> Posting just in case somebody thinks the overrun is a bad
> >> >> >> >> >> idea in general (even when not trapping), e.g. for ASAN or
> >> >> >> >> >> valgrind.
> >> >> >> >> >
> >> >> >> >> > This is what I applied.
> >> >> >> >> >
> >> >> >> >> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> >> >> >> >> >
> >> >> >> >> > Richard.
> >> >> >> >> [...]
> >> >> >> >> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> >> >> >> >> > index 15aec21..c29e73d 100644
> >> >> >> >> > --- a/gcc/tree-vect-stmts.c
> >> >> >> >> > +++ b/gcc/tree-vect-stmts.c
> >> >> >> >> > @@ -1789,6 +1794,10 @@ get_group_load_store_type (gimple *stmt, 
> >> >> >> >> > tree vectype, bool slp,
> >> >> >> >> >        /* If there is a gap at the end of the group then these 
> >> >> >> >> > optimizations
> >> >> >> >> >          would access excess elements in the last iteration.  */
> >> >> >> >> >        bool would_overrun_p = (gap != 0);
> >> >> >> >> > +      /* If the access is aligned an overrun is fine.  */
> >> >> >> >> > +      if (would_overrun_p
> >> >> >> >> > +         && aligned_access_p (STMT_VINFO_DATA_REF (stmt_info)))
> >> >> >> >> > +       would_overrun_p = false;
> >> >> >> >> >        if (!STMT_VINFO_STRIDED_P (stmt_info)
> >> >> >> >> >           && (can_overrun_p || !would_overrun_p)
> >> >> >> >> >           && compare_step_with_zero (stmt) > 0)
> >> >> >> >> 
> >> >> >> >> ...is this right for all cases?  I think it only looks for
> >> >> >> >> single-vector
> >> >> >> >> alignment, but the gap can in principle be vector-sized or larger,
> >> >> >> >> at least for load-lanes.
> >> >> >> >>
> >> >> >> >> E.g. say we have a 128-bit vector of doubles in a group of size 4
> >> >> >> >> and a gap of 2 or 3.  Even if the access itself is aligned, the 
> >> >> >> >> group
> >> >> >> >> spans two vectors and we have no guarantee that the second one
> >> >> >> >> is mapped.
> >> >> >> >
> >> >> >> > The check assumes that if aligned_access_p () returns true then the
> >> >> >> > whole access is aligned in a way that it can't cross page 
> >> >> >> > boundaries.
> >> >> >> > That's of course not the case when the alignment is 16 bytes but
> >> >> >> > the access size is a multiple of that.
> >> >> >> >  
> >> >> >> >> I haven't been able to come up with a testcase though.  We seem 
> >> >> >> >> to be
> >> >> >> >> overly conservative when computing alignments.
> >> >> >> >
> >> >> >> > Not sure if we can run into this with load-lanes, given that it
> >> >> >> > bumps the vectorization factor.  Also, does load-lane work with
> >> >> >> > gaps?
> >> >> >> >
> >> >> >> > I think the gap can never be larger than nunits-1, so it is by
> >> >> >> > definition in the last "vector", independent of the VF.
> >> >> >> >
> >> >> >> > Classical gap case is
> >> >> >> >
> >> >> >> > for (i=0; i<n; ++i)
> >> >> >> >  {
> >> >> >> >    y[3*i + 0] = x[4*i + 0];
> >> >> >> >    y[3*i + 1] = x[4*i + 1];
> >> >> >> >    y[3*i + 2] = x[4*i + 2];
> >> >> >> >  }
> >> >> >> >
> >> >> >> > where x has a gap of 1.  You'll get a VF of 12 for the above.  Make
> >> >> >> > the y's different streams and you should get the perfect case for
> >> >> >> > load-lane:
> >> >> >> >
> >> >> >> > for (i=0; i<n; ++i)
> >> >> >> >  {
> >> >> >> >    y[i] = x[4*i + 0];
> >> >> >> >    z[i] = x[4*i + 1];
> >> >> >> >    w[i] = x[4*i + 2];
> >> >> >> >  } 
> >> >> >> >
> >> >> >> > previously we'd peel at least 4 iterations into the epilogue for
> >> >> >> > fear of accessing x[4*i + 3].  When x is V4SI aligned that's ok.
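
A self-contained, compilable version of that load-lane example might look like
the sketch below (the signature and `restrict` qualifiers are my additions;
the loads from x form a group of 4 with a gap of 1, since x[4*i + 3] is never
read):

```c
/* Hypothetical standalone version of the load-lane example above.
   The accesses to x form an interleaved group of size 4 with a gap
   of 1: x[4*i + 3] is never loaded by the scalar code.  */
void
foo (int *restrict y, int *restrict z, int *restrict w,
     const int *restrict x, int n)
{
  for (int i = 0; i < n; ++i)
    {
      y[i] = x[4*i + 0];
      z[i] = x[4*i + 1];
      w[i] = x[4*i + 2];
    }
}
```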
> >> >> >> 
> >> >> >> The case I was thinking of was like the second, but with the
> >> >> >> element type being DI or DF and with the + 2 statement removed.
> >> >> >> E.g.:
> >> >> >> 
> >> >> >> double __attribute__((noinline))
> >> >> >> foo (double *a)
> >> >> >> {
> >> >> >>   double res = 0.0;
> >> >> >>   for (int n = 0; n < 256; n += 4)
> >> >> >>     res += a[n] + a[n + 1];
> >> >> >>   return res;
> >> >> >> }
> >> >> >> 
> >> >> >> (with -ffast-math).  We do use LD4 for this, and having "a" aligned
> >> >> >> to V2DF isn't enough to guarantee that we can access a[n + 2]
> >> >> >> and a[n + 3].
> >> >> >
> >> >> > Yes, indeed.  It's safe when peeling for gaps would remove
> >> >> > N < alignof (ref) / sizeof (ref) scalar iterations.
> >> >> >
> >> >> > Peeling for gaps simply subtracts one from the niter of the 
> >> >> > vectorized 
> >> >> > loop.
> >> >> 
> >> >> I think subtracting one is enough in all cases.  It's only the final
> >> >> iteration of the scalar loop that can't access a[n + 2] and a[n + 3].
> >> >> 
> >> >> (Of course, subtracting one happens before peeling for niters, so it
> >> >> only makes a difference if the original niters was a multiple of the VF,
> >> >> in which case we peel a full vector's worth of iterations instead of
> >> >> peeling none.)
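
The arithmetic in that parenthetical can be sketched as follows (a rough
illustration, not GCC's actual code; `split_iters` is a hypothetical helper
showing how subtracting one vector iteration interacts with peeling for
niters):

```c
/* Rough sketch of how peeling for gaps splits the scalar iteration
   count: one vector iteration is moved to the scalar epilogue so the
   gap elements of the last group are never loaded.  When niters is
   already a multiple of VF, this turns an empty epilogue into a full
   vector's worth of scalar iterations.  */
void
split_iters (unsigned niters, unsigned vf, int peel_for_gaps,
             unsigned *vec_iters, unsigned *epilogue_iters)
{
  *vec_iters = niters / vf;
  if (peel_for_gaps && *vec_iters > 0)
    *vec_iters -= 1;   /* peeling for gaps: drop one vector iteration  */
  *epilogue_iters = niters - *vec_iters * vf;
}
```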
> >> >
> >> > I think one could extend the gcc.dg/vect/group-no-gaps-1.c testcase
> >> > to cover the case with a bigger VF, for example by having different
> >> > types for a and b.
> >> >
> >> > I can try playing with this later this week but if you can come up
> >> > with a testcase that exercises load-/store-lanes that would be great.
> >> > See also gcc.dg/vect/pr49038.c for a less convoluted testcase to copy
> >> > from.
> >> 
> >> Yeah, that's what I'd tried originally, but couldn't convince
> >> vect_compute_data_ref_alignment to realise that the base was aligned.
> >
> > Try using __builtin_assume_aligned.
> 
> Thanks, that did the trick.  Filed as PR79824.
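
For illustration, applying `__builtin_assume_aligned` to the earlier testcase
might look like this (a sketch; this is not necessarily the exact testcase
filed in the PR):

```c
/* Tell the vectorizer the base is 16-byte aligned so that
   vect_compute_data_ref_alignment can see the alignment.  */
double
foo (double *a_)
{
  double *a = __builtin_assume_aligned (a_, 16);
  double res = 0.0;
  for (int n = 0; n < 256; n += 4)
    res += a[n] + a[n + 1];
  return res;
}
```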
> 
> >> On that topic, the same patch had:
> >> 
> >>       if (DECL_USER_ALIGN (base))
> >>    {
> >>      if (dump_enabled_p ())
> >>        {
> >>          dump_printf_loc (MSG_NOTE, vect_location,
> >>                           "not forcing alignment of user-aligned "
> >>                           "variable: ");
> >>          dump_generic_expr (MSG_NOTE, TDF_SLIM, base);
> >>          dump_printf (MSG_NOTE, "\n");
> >>        }
> >>      return true;
> >>    }
> >> 
> >> But shouldn't this also be testing whether the decl is packed?  The docs
> >> say that the aligned attribute only specifies a minimum, so increasing
> >> it should be fine.
> >
> > I think I copied this from elsewhere (the ipa-alignment pass I think).
> >
> > What do you mean with checking packed?  That we may not increase
> > alignment of a decl declared as packed?  
> > symtab_node::can_increase_alignment_p doesn't check that either
> > (it also doesn't check DECL_USER_ALIGN).
> >
> > The packed attribute docs suggest you may combine aligned and packed
> > and then get exact alignment, so you suggest
> >
> >   if (DECL_USER_ALIGN (base) && DECL_PACKED (base))
> >
> > or simply
> >
> >   if (DECL_PACKED (base))
> >
> > ?
> 
> Yeah, was thinking of the second one, but I'd forgotten that you can't
> pack a variable, just a field.
> 
> I don't think we can make meaningful guarantees about a "maximum"
> alignment for variables.  E.g. one variable could be placed after
> another variable with a higher alignment and effectively have its
> alignment increased that way.  Plus of course objects could happen
> to fall on nicely-aligned addresses even if the requested alignment
> was smaller.

Yeah, this also makes the idea of preserving "lowering of alignment"
of decls bogus.

> It just seems that one of the use cases of the aligned attribute
> is to encourage optimisation.  If the compiler happens to find a
> reason for increasing it further then I think we should allow that.
> E.g. people who add aligned attributes based on the width of one
> vector architecture probably didn't want to stop the vectoriser
> from handling later architectures that have wider vectors.

So the only reason I can think of for honoring DECL_USER_ALIGN is to
avoid excessively re-aligning data when data segment size is a concern.

Richard.
