https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789
--- Comment #10 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Kewen Lin from comment #9)
> (In reply to Richard Biener from comment #8)
> > (In reply to Kewen Lin from comment #7)
> > > Two questions in mind, need to dig into it further:
> > > 1) from the assembly of scalar/vector code, I don't see any stores
> > > needed into temp array d (array diff in pixel_sub_wxh), but when
> > > modeling we consider the stores.
> > 
> > Because when modeling they are still there.  There's no good way around
> > this.
> 
> I noticed the stores get eliminated during FRE.  Can we consider running
> FRE once just before SLP?  Or is that a bad idea due to compilation time?

Yeah, we already run FRE a lot and it is one of the more expensive passes.
Note there is one point where we could do better: the embedded SESE FRE run
from cunroll is only performed before we consider peeling an outer loop, and
thus not for the outermost unrolled/peeled code (though the question would be
from where / up to what point to apply FRE).  On x86_64 this applies to the
unvectorized but then unrolled outer loop from pixel_sub_wxh, which feeds
quite bad IL to the SLP pass (that shouldn't matter too much, though it may
matter for costing).

I think I looked at this or a related testcase some time ago and split out
some PRs (can't find those right now).
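For reference, a reduced sketch of the shape being discussed (my
reconstruction from the names in the report, not the actual t2.c):
pixel_sub_wxh stores the differences into a temporary array d which the
caller immediately reads back, so at the point the SLP vectorizer models the
code the stores are still present and get costed; only the later FRE run
forwards the stored values to the loads and deletes the stores.

```c
typedef unsigned char pixel;

/* Hypothetical reduction: store pixel differences into a temp array d,
   as pixel_sub_wxh does in the report (names and sizes are assumptions).  */
static void pixel_sub_4x4 (short d[16], const pixel *pix1, int i_pix1,
                           const pixel *pix2, int i_pix2)
{
  for (int y = 0; y < 4; y++, pix1 += i_pix1, pix2 += i_pix2)
    for (int x = 0; x < 4; x++)
      d[y * 4 + x] = (short) (pix1[x] - pix2[x]);  /* store FRE can delete */
}

int sum_abs_diff (const pixel *pix1, int i_pix1,
                  const pixel *pix2, int i_pix2)
{
  short d[16];                    /* temp array "d" from the report */
  pixel_sub_4x4 (d, pix1, i_pix1, pix2, i_pix2);
  int sum = 0;
  for (int i = 0; i < 16; i++)    /* loads fed directly by the stores */
    sum += d[i] < 0 ? -d[i] : d[i];
  return sum;
}
```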
For example we do not consider simplifying

  _318 = {_4, _14, _293, _30, _49, _251, _225, _248, _52, _70, _260, _284, _100, _117, _134, _151};
  vect__5.47_319 = (vector(16) short unsigned int) _318;
  _154 = MEM[(pixel *)pix2_58(D) + 99B];
  _320 = {_6, _16, _22, _32, _51, _255, _231, _243, _54, _68, _276, _286, _103, _120, _137, _154};
  vect__7.48_321 = (vector(16) short unsigned int) _320;
  vect__12.49_322 = vect__5.47_319 - vect__7.48_321;
  _317 = BIT_FIELD_REF <vect__12.49_322, 64, 0>;
  _315 = BIT_FIELD_REF <vect__12.49_322, 64, 64>;
  _313 = BIT_FIELD_REF <vect__12.49_322, 64, 128>;
  _311 = BIT_FIELD_REF <vect__12.49_322, 64, 192>;
  vect_perm_even_165 = VEC_PERM_EXPR <_317, _315, { 0, 2, 4, 6 }>;
  vect_perm_odd_164 = VEC_PERM_EXPR <_317, _315, { 1, 3, 5, 7 }>;
  vect_perm_even_163 = VEC_PERM_EXPR <_313, _311, { 0, 2, 4, 6 }>;
  vect_perm_odd_156 = VEC_PERM_EXPR <_313, _311, { 1, 3, 5, 7 }>;

down to smaller vectors.  Also, apparently the two vector CTORs are not
re-shuffled to a vector load + shuffle.

In the SLP analysis we end up with

t2.c:12:32: note: Final SLP tree for instance:
t2.c:12:32: note: node 0x436e3c0 (max_nunits=16, refcnt=2)
t2.c:12:32: note:   stmt 0 *_11 = _12;
t2.c:12:32: note:   stmt 1 *_21 = _71;
...
t2.c:12:32: note:   stmt 15 *_160 = _161;
t2.c:12:32: note:   children 0x436de70
t2.c:12:32: note: node 0x436de70 (max_nunits=16, refcnt=2)
t2.c:12:32: note:   stmt 0 _12 = _5 - _7;
t2.c:12:32: note:   stmt 1 _71 = _15 - _17;
...
t2.c:12:32: note:   stmt 15 _161 = _152 - _155;
t2.c:12:32: note:   children 0x436ebb0 0x4360b70
t2.c:12:32: note: node 0x436ebb0 (max_nunits=16, refcnt=2)
t2.c:12:32: note:   stmt 0 _5 = (short unsigned int) _4;
...
t2.c:12:32: note:   stmt 15 _152 = (short unsigned int) _151;
t2.c:12:32: note:   children 0x42f1740
t2.c:12:32: note: node 0x42f1740 (max_nunits=16, refcnt=2)
t2.c:12:32: note:   stmt 0 _4 = *pix1_57(D);
...
t2.c:12:32: note:   stmt 15 _151 = MEM[(pixel *)pix1_295 + 3B];
t2.c:12:32: note:   load permutation { 0 1 2 3 16 17 18 19 32 33 34 35 48 49 50 51 }
t2.c:12:32: note: node 0x4360b70 (max_nunits=16, refcnt=2)
t2.c:12:32: note:   stmt 0 _7 = (short unsigned int) _6;
...
t2.c:12:32: note:   stmt 15 _155 = (short unsigned int) _154;
t2.c:12:32: note:   children 0x4360be0
t2.c:12:32: note: node 0x4360be0 (max_nunits=16, refcnt=2)
t2.c:12:32: note:   stmt 0 _6 = *pix2_58(D);
...
t2.c:12:32: note:   stmt 15 _154 = MEM[(pixel *)pix2_296 + 3B];
t2.c:12:32: note:   load permutation { 0 1 2 3 32 33 34 35 64 65 66 67 96 97 98 99 }

The load permutations suggest that splitting the group into 4-lane pieces
would avoid doing permutes, but that would require target support for V4QI
and V4HI vectors.  At least the loads could be considered for vectorization
with strided-SLP, yielding 'int' loads and a vector build from 4 ints.  I'd
need to analyze why we do not consider this.

t2.c:50:1: note: Detected interleaving load of size 52
t2.c:50:1: note:   _4 = *pix1_57(D);
t2.c:50:1: note:   _14 = MEM[(pixel *)pix1_57(D) + 1B];
t2.c:50:1: note:   _293 = MEM[(pixel *)pix1_57(D) + 2B];
t2.c:50:1: note:   _30 = MEM[(pixel *)pix1_57(D) + 3B];
t2.c:50:1: note:   <gap of 12 elements>
t2.c:50:1: note:   _49 = *pix1_40;
t2.c:50:1: note:   _251 = MEM[(pixel *)pix1_40 + 1B];
t2.c:50:1: note:   _225 = MEM[(pixel *)pix1_40 + 2B];
t2.c:50:1: note:   _248 = MEM[(pixel *)pix1_40 + 3B];
t2.c:50:1: note:   <gap of 12 elements>
t2.c:50:1: note:   _52 = *pix1_264;
t2.c:50:1: note:   _70 = MEM[(pixel *)pix1_264 + 1B];
t2.c:50:1: note:   _260 = MEM[(pixel *)pix1_264 + 2B];
t2.c:50:1: note:   _284 = MEM[(pixel *)pix1_264 + 3B];
t2.c:50:1: note:   <gap of 12 elements>
t2.c:50:1: note:   _100 = *pix1_295;
t2.c:50:1: note:   _117 = MEM[(pixel *)pix1_295 + 1B];
t2.c:50:1: note:   _134 = MEM[(pixel *)pix1_295 + 2B];
t2.c:50:1: note:   _151 = MEM[(pixel *)pix1_295 + 3B];
t2.c:50:1: note: Detected interleaving load of size 100
t2.c:50:1: note:   _6 = *pix2_58(D);
t2.c:50:1: note:   _16 = MEM[(pixel *)pix2_58(D) + 1B];
t2.c:50:1: note:   _22 = MEM[(pixel *)pix2_58(D) + 2B];
t2.c:50:1: note:   _32 = MEM[(pixel *)pix2_58(D) + 3B];
t2.c:50:1: note:   <gap of 28 elements>
t2.c:50:1: note:   _51 = *pix2_41;
t2.c:50:1: note:   _255 = MEM[(pixel *)pix2_41 + 1B];
t2.c:50:1: note:   _231 = MEM[(pixel *)pix2_41 + 2B];
t2.c:50:1: note:   _243 = MEM[(pixel *)pix2_41 + 3B];
t2.c:50:1: note:   <gap of 28 elements>
t2.c:50:1: note:   _54 = *pix2_272;
t2.c:50:1: note:   _68 = MEM[(pixel *)pix2_272 + 1B];
t2.c:50:1: note:   _276 = MEM[(pixel *)pix2_272 + 2B];
t2.c:50:1: note:   _286 = MEM[(pixel *)pix2_272 + 3B];
t2.c:50:1: note:   <gap of 28 elements>
t2.c:50:1: note:   _103 = *pix2_296;
t2.c:50:1: note:   _120 = MEM[(pixel *)pix2_296 + 1B];
t2.c:50:1: note:   _137 = MEM[(pixel *)pix2_296 + 2B];
t2.c:50:1: note:   _154 = MEM[(pixel *)pix2_296 + 3B];
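The strided-SLP shape suggested above could look roughly like this (my own
illustration using GCC's generic vector extension, not actual vectorizer
output): each 4-byte row of the 4x4 block becomes one 'int' load instead of
four scalar byte loads, and a 16-byte vector is then built from the 4 ints.

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t v4si __attribute__ ((vector_size (16)));

/* Sketch of the strided-SLP idea: one 32-bit load per row of the
   4x4 pixel block, then a vector build from the four ints.  */
static v4si load_4x4_strided (const unsigned char *pix, int stride)
{
  uint32_t row[4];
  for (int y = 0; y < 4; y++)
    memcpy (&row[y], pix + y * stride, 4);           /* 'int' load per row */
  return (v4si) { row[0], row[1], row[2], row[3] };  /* build from 4 ints */
}
```

This avoids the 16-lane load permutation entirely; the tradeoff is that the
rows are kept in 'int' lanes, so any widening to V4HI/V8HI still needs
target support, as the comment notes.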