https://gcc.gnu.org/bugzilla/show_bug.cgi?id=36127
--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Andrew Pinski from comment #8)
> So what seems to be happening is PRE is pull out the following from the loop:
>
> pretmp_250 = MEM[(float *)_2 + 4294933760B + ivtmp.159_57 * 1];
> _22 = (void *) ivtmp.140_79;
> pretmp_253 = MEM[(float *)_22 + 4294934276B];
> pretmp_257 = MEM[(float *)_22 + 4294900220B];
> pretmp_259 = MEM[(float *)_22 + 4294933244B];
> pretmp_261 = MEM[(float *)_22 + 4294933760B];

I don't see any of that for the original testcase. In fact, the originally
reported issue, that -O2/-O3 -fno-vectorize are slower than -O/-Os
-fno-vectorize, is no longer present. Vectorizing also provides a nice
speedup for me.