"Kewen.Lin" <li...@linux.ibm.com> writes:
> Hi Richard,
>
> on 2020/7/27 下午9:10, Richard Sandiford wrote:
>> "Kewen.Lin" <li...@linux.ibm.com> writes:
>>> Hi,
>>>
>>> As Richard S. suggested in the thread:
>>>
>>> https://gcc.gnu.org/pipermail/gcc-patches/2020-July/550633.html
>>>
>>> this patch is separated from the one of that thread, mainly to refactor the
>>> existing peel_iters_{pro,epi}logue cost model handlings.
>>>
>>> I've addressed Richard S.'s review comments there, moreover because of one
>>> failure of aarch64 testing, I updated it a bit more to keep the logic 
>>> unchanged
>>> as before first (refactor_cost.diff).
>> 
>> Heh, nice when a clean-up exposes an existing bug. ;-)  I agree the
>> updates look correct.  E.g. if vf is 1, we should assume that there
>> are no peeled iterations even if the number of iterations is unknown.
>> 
>> Both patches are OK with some very minor nits (sorry):
>
> Thanks for the comments!  I've addressed them and commit refactor_cost.diff
> via r11-2378.
>
> For adjust.diff, it works well on powerpc64le-linux-gnu (bootstrap/regtest),
> but got one failure on aarch64 as below:
>
> PASS->FAIL: gcc.target/aarch64/sve/cost_model_2.c -march=armv8.2-a+sve  
> scan-assembler-times \\tst1w\\tz 1
>
> I spent some time investigating it and thought it's expected.  As you 
> mentioned
> above, the patch can fix the cases with VF 1, the VF of this case is exactly 
> one.  :)
>
> Without the proposed adjustment, the cost for V16QI looks like:
>
>   Vector inside of loop cost: 1
>   Vector prologue cost: 4
>   Vector epilogue cost: 3
>   Scalar iteration cost: 4
>   Scalar outside cost: 6
>   Vector outside cost: 7
>   prologue iterations: 0
>   epilogue iterations: 0
>
> After the change, the cost becomes to:
>
>     Vector inside of loop cost: 1
>     Vector prologue cost: 1
>     Vector epilogue cost: 0
>     Scalar iteration cost: 4
>     Scalar outside cost: 6
>     Vector outside cost: 1
>     prologue iterations: 0
>     epilogue iterations: 0
>
> which outperforms VNx16QI that isn't affected by this patch with partial 
> vectors.

Hmm.  V16QI outperforming VNx16QI is kind-of bogus in this case,
since the store is a full-vector store for both Advanced SIMD and
128-bit SVE.  But we do currently still generate the SVE loop in
a more general form (since doing otherwise introduces other problems)
so I guess the result is self-consistent.

More importantly, even if the costs were equal, V16QI would still
win for 128-bit SVE.

So yeah, while I had my doubts at first, I agree this is the right fix.

> gcc/ChangeLog:
>
>       * tree-vect-loop.c (vect_get_known_peeling_cost): Don't consider branch
>       taken costs for prologue and epilogue if they don't exist.
>       (vect_estimate_min_profitable_iters): Likewise.
>
> gcc/testsuite/ChangeLog:
>
>       * gcc.target/aarch64/sve/cost_model_2.c: Adjust due to cost model
>       change.

OK, thanks.

Richard

Reply via email to