Hi Andrew,

On 15 September 2017 at 13:36, Andrew Pinski <pins...@gmail.com> wrote:
> On Thu, Sep 14, 2017 at 6:33 PM, Kugan Vivekanandarajah
> <kugan.vivekanandara...@linaro.org> wrote:
>> This patch adds aarch64_loop_unroll_adjust to limit partial unrolling
>> in rtl based on strided-loads in loop.
>
> Can you expand on this some more?  Like give an example of where this
> helps?  I am trying to better understand your counting schemes since
> it seems like the count is based on the number of loads and not cache
> lines.
This is a simplified model: I am assuming that the prefetcher tunes
itself based on the memory accesses it observes.  I don't have access to
the internals of how this is implemented in different microarchitectures,
but I am assuming (in a simplified sense) that the hardware logic detects
memory access patterns and uses them to prefetch cache lines.  Accesses
like the ones you show, which fall within the same cache line, may be
combined, but the prefetcher still needs to detect them and tune for
them, and detecting them at compile time is not always easy.  So this is
a simplified model.

> What do you mean by a strided load?
> Doesn't this function overcount when you have:
> for(int i = 1;i<1024;i++)
> {
>   t+= a[i-1]*a[i];
> }
> if it is counting based on cache lines rather than based on load
> addresses?

Sorry for my terminology.  What I mean by a strided access is any memory
access of the form memory[iv].  I am counting memory[iv] and
memory[iv+1] as two different streams; they may or may not fall into the
same cache line.

> It also seems to do some weird counting when you have:
> for(int i = 1;i<1024;i++)
> {
>   t+= a[(i-1)*N+i]*a[(i)*N+i];
> }
>
> That is:
> (PLUS (REG) (REG))
>
> Also seems to overcount when loading from the same pointer twice.

If you prefer counting on a cache-line basis, then yes, I am
intentionally counting it twice.

> In my micro-arch, the number of prefetch slots is based on cache line
> miss so this would be overcounting by a factor of 2.

I am not entirely sure this will be useful for all cores.  It has been
shown to be beneficial for Falkor, based on what is done in LLVM.

Thanks,
Kugan

> Thanks,
> Andrew
>
>>
>> Thanks,
>> Kugan
>>
>> gcc/ChangeLog:
>>
>> 2017-09-12  Kugan Vivekanandarajah  <kug...@linaro.org>
>>
>>     * cfgloop.h (iv_analyze_biv): export.
>>     * loop-iv.c: Likewise.
>>     * config/aarch64/aarch64.c (strided_load_p): New.
>>     (insn_has_strided_load): New.
>>     (count_strided_load_rtl): New.
>>     (aarch64_loop_unroll_adjust): New.