On Mon, 28 Mar 2022, Richard Sandiford wrote:

> Richard Biener <rguent...@suse.de> writes:
> > On Mon, 28 Mar 2022, Richard Sandiford wrote:
> >
> >> Richard Biener <rguent...@suse.de> writes:
> >> > Since we're now vectorizing by default at -O2, issues like PR101908
> >> > become more important: we apply basic-block vectorization to parts
> >> > of a function covering loads from function parameters passed on
> >> > the stack.  We have no good idea how the stack pushing was
> >> > performed, but we do know it happened recently when we are at the
> >> > start of the function, so a store-to-load forwarding (STLF) failure
> >> > is inevitable unless the argument-passing code used the same or
> >> > larger vector stores, aligned the same way.
> >> >
> >> > Until there's a robust IPA-based solution, the following implements
> >> > a target-independent heuristic in the vectorizer to retain scalar
> >> > loads for loads from parameters likely passed in memory (I use
> >> > a BLKmode DECL_MODE check for this rather than firing up
> >> > CUMULATIVE_ARGS).  I've also restricted this to loads from the
> >> > first "block" (which can be less than the first basic block if
> >> > there's a call, for example), since that covers the testcase.
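
In GCC-internals terms the check is presumably something along these
lines (a sketch of the described BLKmode test, not the actual patch;
the helper name is made up):

  #include "config.h"
  #include "system.h"
  #include "coretypes.h"
  #include "tree.h"

  /* Sketch: return true if BASE is a function parameter likely passed
     in memory, using the BLKmode DECL_MODE check described above.  */
  static bool
  param_load_from_stack_p (tree base)
  {
    return (base != NULL_TREE
            && TREE_CODE (base) == PARM_DECL
            && DECL_MODE (base) == BLKmode);
  }
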
> >> >
> >> > Note that for the testcase (but not c-ray from the bug report)
> >> > there's an x86 peephole2 that vectorizes things back, so the patch
> >> > is not effective there.
> >> >
> >> > Any comments?  I know we're also looking at x86 port-specific
> >> > mitigations, but I think the issue will hit arm and power/z as well.
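
For reference, the kind of code the heuristic targets looks roughly
like this (an illustrative sketch of my own, not the PR101908 or
c-ray testcase; on x86-64 a three-double struct is passed in memory,
so its PARM_DECL has BLKmode):

  struct v3 { double x, y, z; };

  double
  dot2 (struct v3 a, struct v3 b)
  {
    /* The caller typically fills the argument slots with scalar
       stores.  If BB vectorization combines the field loads below
       into a single vector load, that load cannot forward from the
       store buffer.  */
    return a.x * b.x + a.y * b.y;
  }
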
> >> 
> >> I'm not sure this is a target-independent win.  In a loop that:
> >> 
> >>   stores 2 scalars
> >>   loads the stored scalars as a vector
> >>   adds a vector
> >>   stores a vector
> >> 
> >> (no feedback), I see a 20% regression using elementwise accesses for
> >> the load vs. using a normal vector load (measured on a Cortex-A72).
> >> With feedback the elementwise version is still slower, but obviously
> >> not by such a big factor.
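
For concreteness, a kernel of that shape might look like the
following (my reconstruction from the description above, not the
actual benchmark; it uses GCC's generic vector extension and assumes
the compiler keeps the scalar stores):

  typedef double v2df __attribute__ ((vector_size (16)));

  void
  kernel (double *restrict dst, const double *restrict s0,
          const double *restrict s1, v2df add, long n)
  {
    for (long i = 0; i < n; i++)
      {
        dst[2 * i] = s0[i];      /* two scalar stores */
        dst[2 * i + 1] = s1[i];
        v2df v;
        /* Reload the just-stored scalars as one vector
           (or elementwise, with the patch).  */
        __builtin_memcpy (&v, dst + 2 * i, sizeof v);
        v += add;                                      /* vector add */
        __builtin_memcpy (dst + 2 * i, &v, sizeof v);  /* vector store */
      }
  }
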
> >
> > I see, so that's even without a call in between the scalar stores
> > and the vector load, as in the case we're trying to cover.  I would
> > suspect that the two elementwise accesses may execute too close to
> > the two scalar stores to benefit from any forwarding on the A72
> > micro-architecture?  Do you see a speedup when performing a vector
> > store instead of two scalar stores?
> 
> Yeah, it's faster with a vector store than with 2 scalar stores.
> 
> The difference between elementwise loads and vector loads still
> reproduces when the stores and the load are forced further apart.
> 
> Note that (unlike x86?) the elementwise loads are still done on the
> vector side, so this is not a scalar->vector vs. scalar->scalar
> trade-off.

That's the same on x86: scalar loads are still done on the vector
side, but forwarding from the store buffer isn't possible if the
stores were scalar and the load is a vector.  The CPU has to wait for
the data to reach L1-D before the load can complete.  With a matching
scalar/scalar or vector/vector store/load pair, the data can be
forwarded from the store buffers.
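
A minimal sketch of the failing pattern using SSE2 intrinsics
(illustrative only; the function and variable names are mine):

  #include <emmintrin.h>

  double
  stlf_hazard (double x, double y)
  {
    double buf[2];
    /* Two scalar 8-byte stores fill separate store-buffer entries.  */
    _mm_store_sd (buf, _mm_set_sd (x));
    _mm_store_sd (buf + 1, _mm_set_sd (y));
    /* The 16-byte vector load overlaps both entries, so it cannot be
       forwarded and has to wait for the data to reach L1-D.  */
    __m128d v = _mm_loadu_pd (buf);
    /* Two scalar reloads, or a single 16-byte store above, would
       keep forwarding possible.  */
    return _mm_cvtsd_f64 (_mm_add_pd (v, v));
  }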

Richard.
