[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

linkw at gcc dot gnu.org Wed, 16 Sep 2020 19:50:31 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789


--- Comment #11 from Kewen Lin <linkw at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #10)
> (In reply to Kewen Lin from comment #9)
> > (In reply to Richard Biener from comment #8)
> > > (In reply to Kewen Lin from comment #7)
> > > > Two questions in mind, need to dig into it further:
> > > >   1) from the assembly of scalar/vector code, I don't see any stores 
> > > > needed
> > > > into temp array d (array diff in pixel_sub_wxh), but when modeling we
> > > > consider the stores.
> > > 
> > > Because when modeling they are still there.  There's no good way around 
> > > this.
> > > 
> > 
> > I noticed the stores get eliminated during FRE.  Can we consider running FRE
> > once just before SLP? a bad idea due to compilation time?
> 
> Yeah, we already run FRE a lot and it is one of the more expensive passes.
> 
> Note there's one point we could do better which is the embedded SESE FRE
> run from cunroll which is only run before we consider peeling an outer loop
> and thus not for the outermost unrolled/peeled code (but the question would
> be from where / up to what to apply FRE to).  On x86_64 this would apply to
> the unvectorized but then unrolled outer loop from pixel_sub_wxh which feeds
> quite bad IL to the SLP pass (but that shouldn't matter too much, maybe it
> matters for costing though).

Thanks for the explanation! I'll look at it after checking 2). IIUC, the
advantage to eliminate stores here looks able to get those things which is fed
to stores and stores' consumers bundled, then get more things SLP-ed if
available?

> 
> I think I looked at this or a related testcase some time ago and split out
> some PRs (can't find those right now).  For example we are not considering
> to simplify
> 
....
> 
> the load permutations suggest that splitting the group into 4-lane pieces
> would avoid doing permutes but then that would require target support
> for V4QI and V4HI vectors.  At least the loads could be considered
> to be vectorized with strided-SLP, yielding 'int' loads and a vector
> build from 4 ints.  I'd need to analyze why we do not consider this.

Good idea! Curious that is there some port where int load can not work well on
1-byte aligned address like trap?

[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

Reply via email to