Hi Andrew,

On 15 September 2017 at 13:36, Andrew Pinski <pins...@gmail.com> wrote:
> On Thu, Sep 14, 2017 at 6:33 PM, Kugan Vivekanandarajah
> <kugan.vivekanandara...@linaro.org> wrote:
>> This patch adds aarch64_loop_unroll_adjust to limit partial unrolling
>> in rtl based on strided-loads in loop.
>
> Can you expand on this some more?  Like give an example of where this
> helps?  I am trying to better understand your counting schemes since
> it seems like the count is based on the number of loads and not cache
> lines.
This is a simplified model: I am assuming that the prefetcher tunes
itself based on the memory accesses it observes.  I don't have access to
the internals of how this is implemented in different microarchitectures,
but I am assuming (in a simplified sense) that the hardware logic detects
memory access patterns and uses them to prefetch cache lines.  Accesses
like the ones you show, which fall within the same cache line, may be
combined, but the prefetcher still needs to detect them and tune for
them, and detecting them at compile time is not always easy.  So this is
a simplified model.

> What do you mean by a strided load?
> Doesn't this function overcount when you have:
> for(int i = 1;i<1024;i++)
> {
>   t+= a[i-1]*a[i];
> }
> if it is counting based on cache lines rather than based on load
> addresses?

Sorry for my terminology.  What I mean by a strided access is any memory
access of the form memory[iv].  I am counting memory[iv] and
memory[iv+1] as two different streams; they may or may not fall into the
same cache line.

> It also seems to do some weird counting when you have:
> for(int i = 1;i<1024;i++)
> {
>   t+= a[(i-1)*N+i]*a[(i)*N+i];
> }
>
> That is:
> (PLUS (REG) (REG))
>
> Also seems to overcount when loading from the same pointer twice.

If you prefer counting on a cache-line basis, then yes, I am
intentionally counting it twice.

> In my micro-arch, the number of prefetch slots is based on cache line
> miss so this would be overcounting by a factor of 2.

I am not entirely sure this will be useful for all cores.  It has been
shown to be beneficial for Falkor, based on what is done in LLVM.

Thanks,
Kugan

> Thanks,
> Andrew
>
>>
>> Thanks,
>> Kugan
>>
>> gcc/ChangeLog:
>>
>> 2017-09-12  Kugan Vivekanandarajah  <kug...@linaro.org>
>>
>>     * cfgloop.h (iv_analyze_biv): export.
>>     * loop-iv.c: Likewise.
>>     * config/aarch64/aarch64.c (strided_load_p): New.
>>     (insn_has_strided_load): New.
>>     (count_strided_load_rtl): New.
>>     (aarch64_loop_unroll_adjust): New.