https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114734
--- Comment #3 from Robin Dapp <rdapp at gcc dot gnu.org> ---
> probably -fwhole-program is enough, -flto not needed(?)

Yes, -fwhole-program is sufficient.

>
>   # vectp_g.248_1401 = PHI <vectp_g.248_1402(32), &g(143)>
> ...
>   _1411 = .SELECT_VL (ivtmp_1409, POLY_INT_CST [2, 2]);
> ..
>   vect__193.250_1403 = .MASK_LEN_LOAD (vectp_g.248_1401, 32B, { -1, ... },
> _1411, 0);
>   vect__194.251_1404 = -vect__193.250_1403;
>   vect_iftmp.252_1405 = (vector([2,2]) long int) vect__194.251_1404;
>
>   # vect_iftmp.252_1406 = PHI <vect_iftmp.252_1405(5)>
>   # loop_len_1427 = PHI <_1411(5)>
> ...
>   _1407 = loop_len_1427 + 18446744073709551615;
>   _1408 = .VEC_EXTRACT (vect_iftmp.252_1406, _1407);
>   iftmp.3_1204 = _1408;
>
> is stored to b[15].  Doesn't look too odd to me.

At the assembly equivalent of

>   vect__193.250_1403 = .MASK_LEN_LOAD (vectp_g.248_1401, 32B, { -1, ... },
> _1411, 0);

we load [3 3] (=f) instead of [0 0] (=g).  f is located after g in memory and
register a3 is increased before the loop latch.  We then re-use a3 to load the
last two elements of g but actually read the first two of f.
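
For illustration only, here is a rough C model of the symptom described above. It is not the generated RISC-V code and not the test case from this PR. The array sizes, the flat buffer `mem`, and the helper `last_chunk_first` are all hypothetical; the only things taken from the comment are that f sits directly after g in memory, that g holds 0 and f holds 3, and that the pointer (a3) is bumped before the latch and then re-used.

```c
/* Hypothetical sizes: a chunk of 2 elements, g with 4, f with 2. */
enum { VL = 2, G_LEN = 4, F_LEN = 2 };

/* One flat buffer stands in for the memory layout: g first, f right after. */
static long mem[G_LEN + F_LEN] = { 0, 0, 0, 0, 3, 3 };

/* Walk g in chunks of VL, then return the first element of what is meant
   to be the last chunk of g.  With reuse_bumped != 0 the post-loop load
   re-uses the pointer after its final bump (the suspected wrong behaviour);
   otherwise it uses the address from before the bump.  */
static long last_chunk_first(int reuse_bumped)
{
    long *p = mem;      /* plays the role of a3 */
    long *last = mem;   /* address of the most recent in-loop load */

    for (int i = 0; i < G_LEN; i += VL) {
        last = p;       /* this iteration loads from here */
        p += VL;        /* bump happens before the loop latch */
    }

    /* After the loop, p points one chunk past g, i.e. at f. */
    return reuse_bumped ? p[0] : last[0];
}
```

In this model `last_chunk_first(0)` reads from g and yields 0, while `last_chunk_first(1)` reads from f and yields 3, which matches the [3 3] vs. [0 0] observation.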