On Fri, 3 Mar 2017, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
> >
> >> Richard Biener <rguent...@suse.de> writes:
> >> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
> >> >
> >> >> Richard Biener <rguent...@suse.de> writes:
> >> >> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
> >> >> >
> >> >> >> Richard Biener <rguent...@suse.de> writes:
> >> >> >> > On Wed, 1 Mar 2017, Richard Sandiford wrote:
> >> >> >> >
> >> >> >> >> Sorry for the late reply, but:
> >> >> >> >>
> >> >> >> >> Richard Biener <rguent...@suse.de> writes:
> >> >> >> >> > On Mon, 7 Nov 2016, Richard Biener wrote:
> >> >> >> >> >
> >> >> >> >> >>
> >> >> >> >> >> Currently we force peeling for gaps whenever element overrun
> >> >> >> >> >> can occur but for aligned accesses we know that the loads
> >> >> >> >> >> won't trap and thus we can avoid this.
> >> >> >> >> >>
> >> >> >> >> >> Bootstrap and regtest running on x86_64-unknown-linux-gnu
> >> >> >> >> >> (I expect some testsuite fallout here so didn't bother to
> >> >> >> >> >> invent a new testcase).
> >> >> >> >> >>
> >> >> >> >> >> Just in case somebody thinks the overrun is a bad idea in
> >> >> >> >> >> general (even when not trapping).  Like for ASAN or valgrind.
> >> >> >> >> >
> >> >> >> >> > This is what I applied.
> >> >> >> >> >
> >> >> >> >> > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> >> >> >> >> >
> >> >> >> >> > Richard.
> >> >> >> >> [...]
> >> >> >> >> > diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
> >> >> >> >> > index 15aec21..c29e73d 100644
> >> >> >> >> > --- a/gcc/tree-vect-stmts.c
> >> >> >> >> > +++ b/gcc/tree-vect-stmts.c
> >> >> >> >> > @@ -1789,6 +1794,10 @@ get_group_load_store_type (gimple *stmt, tree vectype, bool slp,
> >> >> >> >> >        /* If there is a gap at the end of the group then these optimizations
> >> >> >> >> >           would access excess elements in the last iteration.  */
> >> >> >> >> >        bool would_overrun_p = (gap != 0);
> >> >> >> >> > +      /* If the access is aligned an overrun is fine.  */
> >> >> >> >> > +      if (would_overrun_p
> >> >> >> >> > +          && aligned_access_p (STMT_VINFO_DATA_REF (stmt_info)))
> >> >> >> >> > +        would_overrun_p = false;
> >> >> >> >> >        if (!STMT_VINFO_STRIDED_P (stmt_info)
> >> >> >> >> >            && (can_overrun_p || !would_overrun_p)
> >> >> >> >> >            && compare_step_with_zero (stmt) > 0)
> >> >> >> >>
> >> >> >> >> ...is this right for all cases?  I think it only looks for
> >> >> >> >> single-vector alignment, but the gap can in principle be
> >> >> >> >> vector-sized or larger, at least for load-lanes.
> >> >> >> >>
> >> >> >> >> E.g. say we have a 128-bit vector of doubles in a group of size 4
> >> >> >> >> and a gap of 2 or 3.  Even if the access itself is aligned, the
> >> >> >> >> group spans two vectors and we have no guarantee that the second
> >> >> >> >> one is mapped.
> >> >> >> >
> >> >> >> > The check assumes that if aligned_access_p () returns true then
> >> >> >> > the whole access is aligned in a way that it can't cross page
> >> >> >> > boundaries.  That's of course not the case if alignment is
> >> >> >> > 16 bytes but the access will be a multiple of that.
> >> >> >> >
> >> >> >> >> I haven't been able to come up with a testcase though.  We seem
> >> >> >> >> to be overly conservative when computing alignments.
> >> >> >> >
> >> >> >> > Not sure if we can run into this with load-lanes given that bumps
> >> >> >> > the vectorization factor.  Also does load-lane work with gaps?
> >> >> >> >
> >> >> >> > I think that gap can never be larger than nunits-1 so it is by
> >> >> >> > definition in the last "vector" independent of the VF.
> >> >> >> >
> >> >> >> > Classical gap case is
> >> >> >> >
> >> >> >> >   for (i=0; i<n; ++i)
> >> >> >> >     {
> >> >> >> >       y[3*i + 0] = x[4*i + 0];
> >> >> >> >       y[3*i + 1] = x[4*i + 1];
> >> >> >> >       y[3*i + 2] = x[4*i + 2];
> >> >> >> >     }
> >> >> >> >
> >> >> >> > where x has a gap of 1.  You'll get VF of 12 for the above.  Make
> >> >> >> > the y's different streams and you should get the perfect case for
> >> >> >> > load-lane:
> >> >> >> >
> >> >> >> >   for (i=0; i<n; ++i)
> >> >> >> >     {
> >> >> >> >       y[i] = x[4*i + 0];
> >> >> >> >       z[i] = x[4*i + 1];
> >> >> >> >       w[i] = x[4*i + 2];
> >> >> >> >     }
> >> >> >> >
> >> >> >> > previously we'd peel at least 4 iterations into the epilogue for
> >> >> >> > the fear of accessing x[4*i + 3].  When x is V4SI aligned that's
> >> >> >> > ok.
> >> >> >>
> >> >> >> The case I was thinking of was like the second, but with the
> >> >> >> element type being DI or DF and with the + 2 statement removed.
> >> >> >> E.g.:
> >> >> >>
> >> >> >> double __attribute__((noinline))
> >> >> >> foo (double *a)
> >> >> >> {
> >> >> >>   double res = 0.0;
> >> >> >>   for (int n = 0; n < 256; n += 4)
> >> >> >>     res += a[n] + a[n + 1];
> >> >> >>   return res;
> >> >> >> }
> >> >> >>
> >> >> >> (with -ffast-math).  We do use LD4 for this, and having "a" aligned
> >> >> >> to V2DF isn't enough to guarantee that we can access a[n + 2]
> >> >> >> and a[n + 3].
> >> >> >
> >> >> > Yes, indeed.  It's safe when peeling for gaps would remove
> >> >> > N < alignof (ref) / sizeof (ref) scalar iterations.
> >> >> >
> >> >> > Peeling for gaps simply subtracts one from the niter of the
> >> >> > vectorized loop.
> >> >>
> >> >> I think subtracting one is enough in all cases.  It's only the final
> >> >> iteration of the scalar loop that can't access a[n + 2] and a[n + 3].
> >> >>
> >> >> (Of course, subtracting one happens before peeling for niters, so it
> >> >> only makes a difference if the original niters was a multiple of the
> >> >> VF, in which case we peel a full vector's worth of iterations instead
> >> >> of peeling none.)
> >> >
> >> > I think one could extend the gcc.dg/vect/group-no-gaps-1.c testcase
> >> > to cover the case with bigger VF, for example by having different
> >> > types for a and b.
> >> >
> >> > I can try playing with this later this week but if you can come up
> >> > with a testcase that exercises load-/store-lanes that would be great.
> >> > See also gcc.dg/vect/pr49038.c for a less convoluted testcase to copy
> >> > from.
> >>
> >> Yeah, that's what I'd tried originally, but couldn't convince
> >> vect_compute_data_ref_alignment to realise that the base was aligned.
> >
> > Try using __builtin_assume_aligned.
>
> Thanks, that did the trick.  Filed as PR79824.
>
> >> On that topic, the same patch had:
> >>
> >>       if (DECL_USER_ALIGN (base))
> >>         {
> >>           if (dump_enabled_p ())
> >>             {
> >>               dump_printf_loc (MSG_NOTE, vect_location,
> >>                                "not forcing alignment of user-aligned "
> >>                                "variable: ");
> >>               dump_generic_expr (MSG_NOTE, TDF_SLIM, base);
> >>               dump_printf (MSG_NOTE, "\n");
> >>             }
> >>           return true;
> >>         }
> >>
> >> But shouldn't this also be testing whether decl is packed?  The docs
> >> say that the aligned attribute only specifies a minimum, so increasing
> >> it should be fine.
> >
> > I think I copied this from elsewhere (the ipa-alignment pass I think).
> >
> > What do you mean with checking packed?  That we may not increase
> > alignment of a decl declared as packed?
> > symtab_node::can_increase_alignment_p doesn't check that either
> > (it also doesn't check DECL_USER_ALIGN).
> >
> > The packed attribute docs suggest you may combine aligned and packed
> > and then get exact alignment, so you suggest
> >
> >   if (DECL_USER_ALIGN (base) && DECL_PACKED (base))
> >
> > or simply
> >
> >   if (DECL_PACKED (base))
> >
> > ?
>
> Yeah, was thinking of the second one, but I'd forgotten that you can't
> pack a variable, just a field.
>
> I don't think we can make meaningful guarantees about a "maximum"
> alignment for variables.  E.g. one variable could be placed after
> another variable with a higher alignment and effectively have its
> alignment increased that way.  Plus of course objects could happen
> to fall on nicely-aligned addresses even if the requested alignment
> was smaller.
Yeah, this also makes the idea of preserving "lowering of alignment"
of decls bogus.

> It just seems that one of the use cases of the aligned attribute
> is to encourage optimisation.  If the compiler happens to find a
> reason for increasing it further then I think we should allow that.
> E.g. people who add aligned attributes based on the width of one
> vector architecture probably didn't want to stop the vectoriser
> from handling later architectures that have wider vectors.

So the only reason I can think of honoring DECL_USER_ALIGN is to avoid
excessively re-aligning data if data segment size is a concern.

Richard.