Richard Biener <richard.guent...@gmail.com> writes:
> On Mon, Mar 18, 2019 at 10:59 AM Richard Sandiford
> <richard.sandif...@arm.com> wrote:
>>
>> This patch fixes a case in which we vectorised something with a
>> fully-predicated loop even after the cost model had rejected it.
>> E.g. the loop in the testcase has the costs:
>>
>>   Vector inside of loop cost: 27
>>   Vector prologue cost: 0
>>   Vector epilogue cost: 0
>>   Scalar iteration cost: 7
>>   Scalar outside cost: 6
>>   Vector outside cost: 0
>>   prologue iterations: 0
>>   epilogue iterations: 0
>>
>> and we can see that the loop executes at most three times, but we
>> decided to vectorise it anyway.
>>
>> (The costs here are equal for three iterations, but the same thing
>> happens even when the vector code is strictly more expensive.)
>>
>> The problem is the handling of "/VF" in:
>>
>>   /* Calculate number of iterations required to make the vector version
>>      profitable, relative to the loop bodies only.  The following condition
>>      must hold true:
>>      SIC * niters + SOC > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC
>>      where
>>      SIC = scalar iteration cost, VIC = vector iteration cost,
>>      VOC = vector outside cost, VF = vectorization factor,
>>      PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations
>>      SOC = scalar outside cost for run time cost model check.  */
>>
>> We treat the "/VF" as truncating, but for fully-predicated loops, it's
>> closer to a ceil division, since fractional iterations are handled by a
>> full iteration with some predicate bits set to false.
>>
>> The easiest fix seemed to be to calculate the minimum number of vector
>> iterations first, then use that to calculate the minimum number of scalar
>> iterations.
>>
>> Calculating the minimum number of vector iterations might make sense for
>> unpredicated loops too, since calculating the scalar niters directly
>> doesn't take into account the fact that the VIC multiple has to be an
>> integer.  But the handling of PL_ITERS and EP_ITERS for unpredicated
>> loops is a bit hand-wavy anyway, so maybe vagueness here cancels out
>> vagueness there?
>
> Well, their estimate if we do not know them statically is.  If we statically
> know pl/ep iters the formulat should match, no?

Yeah, true.

I don't see how much we gain by trying to estimate the number of peels
for the runtime threshold.  The NPEEL-dependent component of VOC is
really just SIC * NPEEL, which makes sense given that each peeled
iteration is just a normal scalar iteration.  VOC also includes extra
base overhead on top of that, but once the number of scalar iterations
is big enough for the inner-loop saving to compensate for the base overhead,
each extra peel counts equally against both sides.  So I think we could
estimate the minimum number of vector (rather than scalar) iterations
without taking the number of peels into account for VOC, just the
potential presence of peeling.  We could then add a worst-case amount
of prologue peeling, an estimate (as in the patch) or the actual runtime
amount (which would mean calculating the value even when the vector loop
isn't used).

> Hopefully for GCC 10 we can even fix the case in PR89754.

Yeah, would be nice. :-)  Would also be good to be able to compare inner
and outer loop vectorisation side-by-side and pick whichever's best.
We want that for SVE to avoid e.g. using outer-loop vectorisation for
a 3-iteration outer loop and a many-iteration inner loop.

Thanks,
Richard

Reply via email to