https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #13 from rguenther at suse dot de <rguenther at suse dot de> ---
On Wed, 16 Jan 2019, ktkachov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #11 from ktkachov at gcc dot gnu.org ---
> Thank you all for the input.
> 
> Just to add a bit of data.
> I've instrumented 510.parest_r to count loop iteration counts, to get a feel
> for how many iterations are spent in the actual unrolled body rather than in
> the prologue/peeled part. Overall, the hot function itself is
> entered 290M times. The distribution of loop iteration counts is:
> 
> Frequency  Iteration count:
> 92438870  36
> 87028560  54
> 20404571  24
> 17312960  62
> 14237184  72
> 13403904  108
> 7574437   102
> 7574420   70
> 5564881   40
> 4328249   64
> 4328240   46
> 3142656   48
> 2666496   124
> 1248176   8
> 1236641   16
> 1166592   204
> 1166592   140
> 1134392   4
>  857088   80
>  666624   92
>  666624   128
>  618320   30
>  613056   1
>  234464   2
>  190464   32
>   95232   60
>   84476   20
>   48272   10
>    6896   5
> 
> So the two most common iteration counts are 36 and 54. For an 8x unrolled loop
> that's 4 and 6 iterations spent in the peeled prologue, followed by 4 and 6
> trips around the 8x unrolled loop respectively.
> 
> As an experiment I hacked the AArch64 assembly of the function generated with
> -funroll-loops to replace the peeled prologue version with a simple
> non-unrolled loop. That gave a sizeable speedup on two AArch64 platforms: >7%.
> 
> So beyond the vectorisation point Richard S. made above, maybe it's worth
> considering replacing the peeled prologue with a simple loop instead?
> Or at least adding that as a distinct unrolling strategy and working out an
> analysis that would allow us to choose one over the other?

Patches welcome ;)

Usually the peeling is done to improve branch prediction on the
prologue/epilogue.
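
For illustration only, here is a minimal C sketch of the two remainder-handling
strategies being compared, using an 8x unrolled reduction loop. The function
and variable names are made up for the example and this is not what GCC
actually emits; it just shows where the n == 36 / n == 54 iterations end up in
each scheme:

    /* Illustrative only: roughly what the two strategies look like in C
       for an 8x-unrolled accumulation loop.  */
    double dot(const double *a, const double *b, long n)
    {
        double s = 0.0;
        long i = 0;
        long rem = n % 8;      /* iterations that don't fit the 8x body */

        /* Strategy A: peeled prologue -- the remainder is handled by
           straight-line, individually guarded copies of the body.  The
           branches are easy to predict, but for n == 36 or n == 54 this
           burns 4 or 6 iterations before the unrolled loop is entered.  */
        switch (rem) {
        case 7: s += a[i] * b[i]; i++; /* fall through */
        case 6: s += a[i] * b[i]; i++; /* fall through */
        case 5: s += a[i] * b[i]; i++; /* fall through */
        case 4: s += a[i] * b[i]; i++; /* fall through */
        case 3: s += a[i] * b[i]; i++; /* fall through */
        case 2: s += a[i] * b[i]; i++; /* fall through */
        case 1: s += a[i] * b[i]; i++; /* fall through */
        case 0: break;
        }

        /* Strategy B (the experiment above): a plain rolled loop instead
           of the peel -- smaller code, one backward branch:
             for (; i < rem; i++)
                 s += a[i] * b[i];
        */

        /* 8x unrolled main body, shared by both strategies.  */
        for (; i < n; i += 8) {
            s += a[i+0] * b[i+0];  s += a[i+1] * b[i+1];
            s += a[i+2] * b[i+2];  s += a[i+3] * b[i+3];
            s += a[i+4] * b[i+4];  s += a[i+5] * b[i+5];
            s += a[i+6] * b[i+6];  s += a[i+7] * b[i+7];
        }
        return s;
    }

The peeled form trades code size for branches that predict well; the rolled
remainder loop is smaller but adds a loop-carried branch, which is why some
analysis (or profile data) would be needed to choose between the two.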
