https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Kewen Lin from comment #9)
> (In reply to Richard Biener from comment #8)
> > (In reply to Kewen Lin from comment #7)
> > > Two questions in mind, need to dig into it further:
> > > 1) from the assembly of scalar/vector code, I don't see any stores
> > > needed into temp array d (array diff in pixel_sub_wxh), but when
> > > modeling we consider the stores.
> > 
> > Because when modeling they are still there.  There's no good way around
> > this.
> 
> I noticed the stores get eliminated during FRE.  Can we consider running
> FRE once just before SLP?  Or is that a bad idea due to compilation time?

Yeah, we already run FRE a lot and it is one of the more expensive passes.
Note there is one point where we could do better: the embedded SESE FRE run
from cunroll is only performed before we consider peeling an outer loop, and
thus not for the outermost unrolled/peeled code (though the question would be
from where / up to what point to apply FRE).  On x86_64 this applies to the
unvectorized but then unrolled outer loop from pixel_sub_wxh, which feeds
quite bad IL to the SLP pass (that shouldn't matter too much, though it may
matter for costing).

I think I looked at this or a related testcase some time ago and split out
some PRs (can't find those right now).
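For reference, a reduced sketch of the shape being discussed (my
reconstruction from the names in the report, not the actual t2.c):
pixel_sub_wxh stores the differences into a temporary array d which the
caller immediately reads back, so at the point the SLP vectorizer models the
code the stores are still present and get costed; only the later FRE run
forwards the stored values to the loads and deletes the stores.

```c
typedef unsigned char pixel;

/* Hypothetical reduction: store pixel differences into a temp array d,
   as pixel_sub_wxh does in the report (names and sizes are assumptions).  */
static void pixel_sub_4x4 (short d[16], const pixel *pix1, int i_pix1,
                           const pixel *pix2, int i_pix2)
{
  for (int y = 0; y < 4; y++, pix1 += i_pix1, pix2 += i_pix2)
    for (int x = 0; x < 4; x++)
      d[y * 4 + x] = (short) (pix1[x] - pix2[x]);  /* store FRE can delete */
}

int sum_abs_diff (const pixel *pix1, int i_pix1,
                  const pixel *pix2, int i_pix2)
{
  short d[16];                    /* temp array "d" from the report */
  pixel_sub_4x4 (d, pix1, i_pix1, pix2, i_pix2);
  int sum = 0;
  for (int i = 0; i < 16; i++)    /* loads fed directly by the stores */
    sum += d[i] < 0 ? -d[i] : d[i];
  return sum;
}
```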
For example we do not consider simplifying

  _318 = {_4, _14, _293, _30, _49, _251, _225, _248, _52, _70, _260, _284, _100, _117, _134, _151};
  vect__5.47_319 = (vector(16) short unsigned int) _318;
  _154 = MEM[(pixel *)pix2_58(D) + 99B];
  _320 = {_6, _16, _22, _32, _51, _255, _231, _243, _54, _68, _276, _286, _103, _120, _137, _154};
  vect__7.48_321 = (vector(16) short unsigned int) _320;
  vect__12.49_322 = vect__5.47_319 - vect__7.48_321;
  _317 = BIT_FIELD_REF <vect__12.49_322, 64, 0>;
  _315 = BIT_FIELD_REF <vect__12.49_322, 64, 64>;
  _313 = BIT_FIELD_REF <vect__12.49_322, 64, 128>;
  _311 = BIT_FIELD_REF <vect__12.49_322, 64, 192>;
  vect_perm_even_165 = VEC_PERM_EXPR <_317, _315, { 0, 2, 4, 6 }>;
  vect_perm_odd_164 = VEC_PERM_EXPR <_317, _315, { 1, 3, 5, 7 }>;
  vect_perm_even_163 = VEC_PERM_EXPR <_313, _311, { 0, 2, 4, 6 }>;
  vect_perm_odd_156 = VEC_PERM_EXPR <_313, _311, { 1, 3, 5, 7 }>;

down to smaller vectors.  Also, apparently the two vector CTORs are not
re-shuffled to a vector load + shuffle.

In the SLP analysis we end up with

t2.c:12:32: note: Final SLP tree for instance:
t2.c:12:32: note: node 0x436e3c0 (max_nunits=16, refcnt=2)
t2.c:12:32: note:   stmt 0 *_11 = _12;
t2.c:12:32: note:   stmt 1 *_21 = _71;
...
t2.c:12:32: note:   stmt 15 *_160 = _161;
t2.c:12:32: note:   children 0x436de70
t2.c:12:32: note: node 0x436de70 (max_nunits=16, refcnt=2)
t2.c:12:32: note:   stmt 0 _12 = _5 - _7;
t2.c:12:32: note:   stmt 1 _71 = _15 - _17;
...
t2.c:12:32: note:   stmt 15 _161 = _152 - _155;
t2.c:12:32: note:   children 0x436ebb0 0x4360b70
t2.c:12:32: note: node 0x436ebb0 (max_nunits=16, refcnt=2)
t2.c:12:32: note:   stmt 0 _5 = (short unsigned int) _4;
...
t2.c:12:32: note:   stmt 15 _152 = (short unsigned int) _151;
t2.c:12:32: note:   children 0x42f1740
t2.c:12:32: note: node 0x42f1740 (max_nunits=16, refcnt=2)
t2.c:12:32: note:   stmt 0 _4 = *pix1_57(D);
...
t2.c:12:32: note:   stmt 15 _151 = MEM[(pixel *)pix1_295 + 3B];
t2.c:12:32: note:   load permutation { 0 1 2 3 16 17 18 19 32 33 34 35 48 49 50 51 }
t2.c:12:32: note: node 0x4360b70 (max_nunits=16, refcnt=2)
t2.c:12:32: note:   stmt 0 _7 = (short unsigned int) _6;
...
t2.c:12:32: note:   stmt 15 _155 = (short unsigned int) _154;
t2.c:12:32: note:   children 0x4360be0
t2.c:12:32: note: node 0x4360be0 (max_nunits=16, refcnt=2)
t2.c:12:32: note:   stmt 0 _6 = *pix2_58(D);
...
t2.c:12:32: note:   stmt 15 _154 = MEM[(pixel *)pix2_296 + 3B];
t2.c:12:32: note:   load permutation { 0 1 2 3 32 33 34 35 64 65 66 67 96 97 98 99 }

The load permutations suggest that splitting the group into 4-lane pieces
would avoid doing permutes, but that would require target support for V4QI
and V4HI vectors.  At least the loads could be considered for vectorization
with strided-SLP, yielding 'int' loads and a vector build from 4 ints.  I'd
need to analyze why we do not consider this.

t2.c:50:1: note: Detected interleaving load of size 52
t2.c:50:1: note:   _4 = *pix1_57(D);
t2.c:50:1: note:   _14 = MEM[(pixel *)pix1_57(D) + 1B];
t2.c:50:1: note:   _293 = MEM[(pixel *)pix1_57(D) + 2B];
t2.c:50:1: note:   _30 = MEM[(pixel *)pix1_57(D) + 3B];
t2.c:50:1: note:   <gap of 12 elements>
t2.c:50:1: note:   _49 = *pix1_40;
t2.c:50:1: note:   _251 = MEM[(pixel *)pix1_40 + 1B];
t2.c:50:1: note:   _225 = MEM[(pixel *)pix1_40 + 2B];
t2.c:50:1: note:   _248 = MEM[(pixel *)pix1_40 + 3B];
t2.c:50:1: note:   <gap of 12 elements>
t2.c:50:1: note:   _52 = *pix1_264;
t2.c:50:1: note:   _70 = MEM[(pixel *)pix1_264 + 1B];
t2.c:50:1: note:   _260 = MEM[(pixel *)pix1_264 + 2B];
t2.c:50:1: note:   _284 = MEM[(pixel *)pix1_264 + 3B];
t2.c:50:1: note:   <gap of 12 elements>
t2.c:50:1: note:   _100 = *pix1_295;
t2.c:50:1: note:   _117 = MEM[(pixel *)pix1_295 + 1B];
t2.c:50:1: note:   _134 = MEM[(pixel *)pix1_295 + 2B];
t2.c:50:1: note:   _151 = MEM[(pixel *)pix1_295 + 3B];
t2.c:50:1: note: Detected interleaving load of size 100
t2.c:50:1: note:   _6 = *pix2_58(D);
t2.c:50:1: note:   _16 = MEM[(pixel *)pix2_58(D) + 1B];
t2.c:50:1: note:   _22 = MEM[(pixel *)pix2_58(D) + 2B];
t2.c:50:1: note:   _32 = MEM[(pixel *)pix2_58(D) + 3B];
t2.c:50:1: note:   <gap of 28 elements>
t2.c:50:1: note:   _51 = *pix2_41;
t2.c:50:1: note:   _255 = MEM[(pixel *)pix2_41 + 1B];
t2.c:50:1: note:   _231 = MEM[(pixel *)pix2_41 + 2B];
t2.c:50:1: note:   _243 = MEM[(pixel *)pix2_41 + 3B];
t2.c:50:1: note:   <gap of 28 elements>
t2.c:50:1: note:   _54 = *pix2_272;
t2.c:50:1: note:   _68 = MEM[(pixel *)pix2_272 + 1B];
t2.c:50:1: note:   _276 = MEM[(pixel *)pix2_272 + 2B];
t2.c:50:1: note:   _286 = MEM[(pixel *)pix2_272 + 3B];
t2.c:50:1: note:   <gap of 28 elements>
t2.c:50:1: note:   _103 = *pix2_296;
t2.c:50:1: note:   _120 = MEM[(pixel *)pix2_296 + 1B];
t2.c:50:1: note:   _137 = MEM[(pixel *)pix2_296 + 2B];
t2.c:50:1: note:   _154 = MEM[(pixel *)pix2_296 + 3B];
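The strided-SLP shape suggested above could look roughly like this (my own
illustration using GCC's generic vector extension, not actual vectorizer
output): each 4-byte row of the 4x4 block becomes one 'int' load instead of
four scalar byte loads, and a 16-byte vector is then built from the 4 ints.

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t v4si __attribute__ ((vector_size (16)));

/* Sketch of the strided-SLP idea: one 32-bit load per row of the
   4x4 pixel block, then a vector build from the four ints.  */
static v4si load_4x4_strided (const unsigned char *pix, int stride)
{
  uint32_t row[4];
  for (int y = 0; y < 4; y++)
    memcpy (&row[y], pix + y * stride, 4);           /* 'int' load per row */
  return (v4si) { row[0], row[1], row[2], row[3] };  /* build from 4 ints */
}
```

This avoids the 16-lane load permutation entirely; the tradeoff is that the
rows are kept in 'int' lanes, so any widening to V4HI/V8HI still needs
target support, as the comment notes.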