https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #13 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 16 Jan 2019, ktkachov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
>
> --- Comment #11 from ktkachov at gcc dot gnu.org ---
> Thank you all for the input.
>
> Just to add a bit of data.
> I've instrumented 510.parest_r to count the number of loop iterations to get a
> feel for how much of the unrolled loop is spent in the actual unrolled part
> rather than the prologue/peeled part. Overall, the hot function itself is
> entered 290M times. The distribution of loop iteration counts is:
>
> Frequency  iter
>  92438870  36
>  87028560  54
>  20404571  24
>  17312960  62
>  14237184  72
>  13403904  108
>   7574437  102
>   7574420  70
>   5564881  40
>   4328249  64
>   4328240  46
>   3142656  48
>   2666496  124
>   1248176  8
>   1236641  16
>   1166592  204
>   1166592  140
>   1134392  4
>    857088  80
>    666624  92
>    666624  128
>    618320  30
>    613056  1
>    234464  2
>    190464  32
>     95232  60
>     84476  20
>     48272  10
>      6896  5
>
> So the two most common iteration counts are 36 and 54. For an 8x unrolled
> loop that's 4 and 6 iterations spent in the prologue, with 4 and 6 trips
> around the 8x unrolled loop respectively.
>
> As an experiment I hacked the AArch64 assembly of the function generated with
> -funroll-loops to replace the peeled prologue version with a simple
> non-unrolled loop. That gave a sizeable speedup on two AArch64 platforms: >7%.
>
> So beyond the vectorisation point Richard S. made above, maybe it's worth
> considering replacing the peeled prologue with a simple loop instead?
> Or at least add that as a distinct unrolling strategy and work to come up
> with an analysis that would allow us to choose one over the other?

Patches welcome ;)  Usually the peeling is done to improve branch
prediction on the prologue/epilogue.
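The strategy ktkachov measured, running the `n % 8` leftover iterations in a plain rolled loop before the 8x-unrolled body instead of a peeled prologue of up to 7 straight-line copies, can be sketched in C. This is a minimal illustration on a hypothetical array-sum kernel (the function and data are assumptions for the sketch, not the actual parest hot loop, which is a sparse matrix-vector product):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: sum n doubles, 8x unrolled.
 * Instead of a peeled prologue (up to 7 straight-line copies of the
 * body guarded by branches), the remainder runs in a simple
 * non-unrolled loop, as in the assembly experiment above. */
static double sum_unrolled8(const double *a, size_t n)
{
    double s = 0.0;
    size_t i = 0;

    /* Simple rolled loop for the n % 8 leftover iterations. */
    for (; i < n % 8; i++)
        s += a[i];

    /* 8x-unrolled main body: always executes whole groups of 8. */
    for (; i < n; i += 8) {
        s += a[i]     + a[i + 1] + a[i + 2] + a[i + 3]
           + a[i + 4] + a[i + 5] + a[i + 6] + a[i + 7];
    }
    return s;
}
```

For the two most common trip counts from the instrumentation, n = 36 gives 4 rolled iterations plus 4 trips of the unrolled body, and n = 54 gives 6 plus 6, so roughly 1/9 of the iterations run outside the unrolled part either way.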