Re: [PATCH, GCC, AArch64] Fix PR88398 for AArch64

Wilco Dijkstra Wed, 27 Nov 2019 05:18:07 -0800

Hi Richard,

>> Yes so it does the insane "fully unrolled trailing loop before the unrolled
>> loop" thing. One always does the trailing loop last (and typically as an
>> actual loop of course) and then the code ends up much faster, close to
>> the ideal version shown in the PR.
>
> Well, you can't do the unrolled loop first unless you keep all exit tests.
> Not keeping them is the whole point of unrolling!


You always need a loop entry test, but rather than testing iterations > 0,
we can just test iterations >= 4 before entering a 4x unrolled loop.
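
To make that concrete, here is a rough hand-written sketch of the shape I mean.
It is not compiler output; the function name and the summation body are made up
purely for illustration, and the unroll factor of 4 is just the one used above.

```c
#include <stddef.h>

/* Hypothetical example: sum n ints using a 4x unrolled main loop. */
long sum_unrolled4(const int *a, size_t n)
{
    long s = 0;
    size_t i = 0;

    /* The entry test is "at least 4 iterations left" rather than
       "iterations > 0", so short trip counts skip the unrolled body. */
    for (; n - i >= 4; i += 4)
        s += (long)a[i] + a[i + 1] + a[i + 2] + a[i + 3];

    /* The trailing loop runs last, as an actual loop, and handles the
       0..3 leftover iterations. */
    for (; i < n; i++)
        s += a[i];

    return s;
}
```

With n = 3 only the trailing loop executes; with n = 7 the unrolled body runs
once and the trailing loop handles the last three elements.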

>> For these kinds of loops, stupid unrolling is clearly better than the
>> default unrolling, both in size and in performance. For the example
>> we only ever execute part of the "trailing" loop, and never enter the
>> unrolled main loop!
>
> Well, then you don't want unrolling you want peeling.  You'd be
> actually happy with four peeled iterations and then the regular,
> not unrolled loop at the tail.

While peeling would work in this case since the average number of
iterations is so small, that's not what you'd want in general. The key is
not to do the trailing loop before the unrolled loop.

> The stupid strategy is what it says - stupid.

Absolutely, it can still be improved significantly. We need to characterize
loops and unroll smartly using different unroll strategies, rather than
bluntly unrolling every loop 8 times.

> Sure, which is why I suggest to change how we emit the
> prologue here.  We can select the variant of the prologue
> with a target hook based on preference for example, between
> doing it peeling-like (which you prefer), using a scheme
> like current (preferably in some optimized form).

Well, what I'm suggesting is to move the prologue to the epilogue,
similar to how the vectorizer executes the trailing loop at the end
(rather than before the vectorized loop).
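
For contrast, the prologue-first arrangement looks roughly like the sketch
below. This is a hand-written approximation of the general shape, not what the
unroller actually emits; the function name and loop body are hypothetical.

```c
#include <stddef.h>

/* Hypothetical example: the leftover iterations are handled first, as a
   fully unrolled sequence, and only then does the 4x unrolled loop run. */
long sum_prologue_first(const int *a, size_t n)
{
    long s = 0;
    size_t i = 0;
    size_t rem = n % 4;   /* iterations that do not fill a 4x block */

    /* Fully unrolled "trailing" iterations, executed before the main loop. */
    if (rem >= 1) s += a[i++];
    if (rem >= 2) s += a[i++];
    if (rem >= 3) s += a[i++];

    /* The remaining count is a multiple of 4, so the unrolled body needs
       only one exit test per 4 iterations. */
    for (; i < n; i += 4)
        s += (long)a[i] + a[i + 1] + a[i + 2] + a[i + 3];

    return s;
}
```

For a small trip count such as n = 3, all the work happens in the unrolled
prologue and the main loop is never entered, which is the situation described
for the example in the PR.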

Cheers,
Wilco
