https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111970
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
So I can see we don't recognize a gather IFN during pattern recog here.

t.c:15:1: note: Final SLP tree for instance 0x502e9a0:
t.c:15:1: note: node 0x4f84700 (max_nunits=128, refcnt=2) vector(32) float
t.c:15:1: note: op template: *_10 = _11;
t.c:15:1: note: stmt 0 *_10 = _11;
t.c:15:1: note: stmt 1 *_20 = _21;
t.c:15:1: note: children 0x4f84790
t.c:15:1: note: node 0x4f84790 (max_nunits=128, refcnt=2) vector(32) float
t.c:15:1: note: op template: _11 = _8 + 1.0e+0;
t.c:15:1: note: stmt 0 _11 = _8 + 1.0e+0;
t.c:15:1: note: stmt 1 _21 = _18 + 2.0e+0;
t.c:15:1: note: children 0x4f84820 0x4f84940
t.c:15:1: note: node 0x4f84820 (max_nunits=128, refcnt=2) vector(32) float
t.c:15:1: note: op template: _8 = *_7;
t.c:15:1: note: stmt 0 _8 = *_7;
t.c:15:1: note: stmt 1 _18 = *_17;
t.c:15:1: note: children 0x4f848b0
t.c:15:1: note: node 0x4f848b0 (max_nunits=128, refcnt=2) vector(128) unsigned char
t.c:15:1: note: op template: _4 = *_3;
t.c:15:1: note: stmt 0 _4 = *_3;
t.c:15:1: note: stmt 1 _14 = *_13;
t.c:15:1: note: load permutation { 0 1 }
t.c:15:1: note: node (constant) 0x4f84940 (max_nunits=1, refcnt=1)
t.c:15:1: note: { 1.0e+0, 2.0e+0 }
t.c:15:1: note: === vect_match_slp_patterns ===
t.c:15:1: note: Analyzing SLP tree 0x4f84700 for patterns
t.c:15:1: note: === vect_make_slp_decision ===
t.c:15:1: note: Decided to SLP 1 instances. Unrolling factor 64

It tries a few other modes, one even having .MASK_LEN_GATHER_LOAD, but that
fails to build SLP.
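
For orientation, the SLP tree above would match a loop of roughly the following shape. This is only a sketch inferred from the dump, not the testcase attached to the PR: the function name gather_add and the names x and n are assumptions made for illustration, while y and index borrow the names of the pointers that show up in the vectorized IL.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar loop with the same shape as the SLP tree:
   a store group of two lanes, each lane loading a byte index,
   using it to load a float, and adding a per-lane constant.
   Not the PR's testcase; names other than y and index are assumed.  */
void
gather_add (float *restrict y, const float *restrict x,
            const uint8_t *restrict index, size_t n)
{
  for (size_t i = 0; i < n; ++i)
    {
      /* Lane 0: _4 = *_3; _8 = *_7; _11 = _8 + 1.0; *_10 = _11;  */
      y[2 * i] = x[index[2 * i]] + 1.0f;
      /* Lane 1: _14 = *_13; _18 = *_17; _21 = _18 + 2.0; *_20 = _21;  */
      y[2 * i + 1] = x[index[2 * i + 1]] + 2.0f;
    }
}
```

The index loads are contiguous (load permutation { 0 1 }), while the float loads are indexed, which is where a gather IFN would be expected.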
In the end we choose

t.c:15:1: note: ***** Choosing vector mode RVVM4QI
t.c:15:1: note: ***** Choosing epilogue vector mode RVVMF4QI

The main loop instance is

t.c:15:1: note: Vectorizing SLP tree:
t.c:15:1: note: node 0x4f849d0 (max_nunits=64, refcnt=1) vector(32) float
t.c:15:1: note: op template: *_10 = _11;
t.c:15:1: note: stmt 0 *_10 = _11;
t.c:15:1: note: stmt 1 *_20 = _21;
t.c:15:1: note: children 0x4f84a60
t.c:15:1: note: node 0x4f84a60 (max_nunits=64, refcnt=1) vector(32) float
t.c:15:1: note: op template: _11 = _8 + 1.0e+0;
t.c:15:1: note: stmt 0 _11 = _8 + 1.0e+0;
t.c:15:1: note: stmt 1 _21 = _18 + 2.0e+0;
t.c:15:1: note: children 0x4f84af0 0x4f84c10
t.c:15:1: note: node 0x4f84af0 (max_nunits=64, refcnt=1) vector(32) float
t.c:15:1: note: op template: _8 = *_7;
t.c:15:1: note: stmt 0 _8 = *_7;
t.c:15:1: note: stmt 1 _18 = *_17;
t.c:15:1: note: children 0x4f84b80
t.c:15:1: note: node 0x4f84b80 (max_nunits=64, refcnt=1) vector(64) unsigned char
t.c:15:1: note: op template: _4 = *_3;
t.c:15:1: note: stmt 0 _4 = *_3;
t.c:15:1: note: stmt 1 _14 = *_13;
t.c:15:1: note: node (constant) 0x4f84c10 (max_nunits=1, refcnt=1) vector(32) float
t.c:15:1: note: { 1.0e+0, 2.0e+0 }

So the main loop uses emulated gathers, but the epilogue is non-SLP and uses
gathers here.

  # vectp_index.6_209 = PHI <vectp_index.6_210(5), index_25(D)(2)>
  # vectp_y.12_601 = PHI <vectp_y.12_602(5), y_27(D)(2)>
  vect__4.8_211 = MEM <vector(64) unsigned char> [(uint8_t *)vectp_index.6_209];
  ...
  MEM <vector(32) float> [(float *)vectp_y.12_601] = vect__11.11_599;
  vectp_y.12_604 = vectp_y.12_601 + 128;
  MEM <vector(32) float> [(float *)vectp_y.12_604] = vect__11.11_599;
  ...
  vectp_index.6_210 = vectp_index.6_209 + 64;
  vectp_y.12_602 = vectp_y.12_604 + 128;
  ivtmp_607 = ivtmp_606 + 1;
  if (ivtmp_607 < 3)

Those IV updates look OK to me. So I'm not sure what to do. Does the testcase
execute correctly with --param vect-epilogues-nomask=0?
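
As a sanity check on the IV updates quoted above, the strides can be compared against the vector sizes in the dump. The sketch below is not GCC code; the enum and function names are made up for illustration, and it assumes a 4-byte float. Each iteration loads 64 byte indices, so it should produce 64 floats, i.e. two vector(32) float stores and a total pointer advance of 128 + 128 bytes.

```c
#include <stddef.h>

/* Constants read off the dump; the names are hypothetical.  */
enum
{
  INDEX_LANES = 64, /* MEM <vector(64) unsigned char> load */
  FLOAT_LANES = 32, /* MEM <vector(32) float> store */
  INDEX_STEP = 64,  /* vectp_index.6_210 = vectp_index.6_209 + 64 */
  STORE_STEP = 128, /* vectp_y.12_604 = vectp_y.12_601 + 128 */
  LATCH_STEP = 128  /* vectp_y.12_602 = vectp_y.12_604 + 128 */
};

/* Bytes of index data consumed by one vector iteration.  */
size_t
index_bytes_per_iter (void)
{
  return INDEX_LANES * sizeof (unsigned char);
}

/* Bytes of float results one vector iteration has to store,
   one float per loaded index.  */
size_t
y_bytes_per_iter (void)
{
  return INDEX_LANES * sizeof (float);
}

/* Bytes covered by a single vector(32) float store.  */
size_t
float_vector_bytes (void)
{
  return FLOAT_LANES * sizeof (float);
}

/* Total advance of the y pointer across one iteration.  */
size_t
y_pointer_advance (void)
{
  return (size_t) STORE_STEP + (size_t) LATCH_STEP;
}
```

Under that assumption the index pointer step equals the index bytes consumed, each store covers exactly STORE_STEP bytes, and the y pointer advance equals the bytes stored per iteration, which agrees with the observation that the IV updates are consistent.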