Hi Wilco,

Thanks for your comments.
On 14 February 2018 at 00:05, Wilco Dijkstra <wilco.dijks...@arm.com> wrote:
> Hi Kugan,
>
>> Based on the previous discussions, I tried to implement a tree loop
>> unroller for partial unrolling. I would like to queue these RFC
>> patches for next stage1 review.
>
> This is a great plan - GCC urgently requires a good unroller!
>
>> * Cost-model for selecting the loop uses the same params used
>> elsewhere in related optimizations. I was told that keeping these the
>> same would allow better tuning for all the optimizations.
>
> I'd advise against using the existing params as is. Unrolling by 8x by
> default is way too aggressive and counterproductive. It was perhaps OK
> for in-order cores 20 years ago, but not today. The goal of unrolling
> is to create more ILP in small loops, not to generate huge blocks of
> repeated code which definitely won't fit in micro-op caches and loop
> buffers...

OK, I will create separate params. It is possible that I misunderstood
it in the first place.

> Also we need to enable this by default, at least with -O3, maybe even
> for small (or rather tiny) loops in -O2 like LLVM does.

It is enabled for -O3 and above now.

>> * I have also implemented an option to limit loops based on memory
>> streams, i.e., for some micro-architectures where limiting the
>> resulting memory streams is preferred, this is used to limit the
>> unrolling factor.
>
> I'm not convinced this is needed once you tune the parameters for
> unrolling. If you have say 4 read streams you must have > 10
> instructions already, so you may want to unroll this 2x in -O3, but
> definitely not 8x. So I see the streams issue as a problem caused by
> too aggressive unroll settings. I think if you address that first,
> you're unlikely to have an issue with too many streams.

I will experiment with some microbenchmarks. I still think that it
will be useful for some micro-architectures. That's why it is not
enabled by default.
If a back-end thinks that it is useful, it can enable limiting the
unroll factor based on memory streams.

>> * I expect that some cost-model changes might be needed to handle
>> (or provide the ability to handle) various loop preferences of the
>> micro-architectures. I am sending this patch for review early to get
>> feedback on this.
>
> Yes it should be feasible to have settings based on backend preference
> and optimization level (so O3/Ofast will unroll more than O2).
>
>> * Position of the pass in passes.def can also be changed. For
>> example, unrolling before SLP.
>
> As long as it runs before IVOpt so we get base+immediate addressing
> modes.

That's what I am doing now.

Thanks,
Kugan

> Wilco