Hi Wilko,

Thanks for your comments.

On 14 February 2018 at 00:05, Wilco Dijkstra <wilco.dijks...@arm.com> wrote:
> Hi Kugan,
>
>> Based on the previous discussions, I tried to implement a tree loop
>> unroller for partial unrolling. I would like to queue this RFC patches
>> for next stage1 review.
>
> This is a great plan - GCC urgently requires a good unroller!
>
>> * Cost-model for selecting the loop uses the same params used
>> elsewhere in related optimizations. I was told that keeping this same
>> would allow better tuning for all the optimizations.
>
> I'd advise against using the existing params as is. Unrolling by 8x by default
> is way too aggressive and counterproductive. It was perhaps OK for in-order
> cores 20 years ago, but not today. The goal of unrolling is to create more ILP
> in small loops, not to generate huge blocks of repeated code which definitely
> won't fit in micro-op caches and loop buffers...
>
OK, I will create separate params. It is possible that I misunderstood
this in the first place.
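
For reference, a dedicated param could be added in params.def along these
lines (the name and default here are purely illustrative, not from the
patch; max-unroll-times is the existing RTL unroller param):

```c
/* Illustrative only: a separate limit for the tree unroller, so it can
   be tuned independently of the RTL unroller's max-unroll-times.  */
DEFPARAM (PARAM_MAX_TREE_UNROLL_TIMES,
	  "max-tree-unroll-times",
	  "The maximum number of unrollings of a single loop performed "
	  "by the tree loop unroller.",
	  4, 0, 0)
```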


> Also we need to enable this by default, at least with -O3, maybe even for
> small (or rather tiny) loops in -O2 like LLVM does.
It is enabled for -O3 and above now.

>
>> * I have also implemented an option to limit loops based on memory
>> streams. i.e., some micro-architectures where limiting the resulting
>> memory streams is preferred and used  to limit unrolling factor.
>
> I'm not convinced this is needed once you tune the parameters for unrolling.
> If you have say 4 read streams you must have > 10 instructions already so
> you may want to unroll this 2x in -O3, but definitely not 8x. So I see the
> streams issue as a problem caused by too aggressive unroll settings. I think
> if you address that first, you're unlikely going to have an issue with too
> many streams.
>

I will experiment with some microbenchmarks. I still think that it
will be useful for some micro-architectures. That's why it is not
enabled by default. If a back-end thinks that it is useful, it can
enable limiting the unroll factor based on memory streams.
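
To make the streams point concrete, here is a hypothetical loop with four
read streams (a, b, c, d; all names illustrative), hand-unrolled 2x with a
scalar epilogue. The body is already sizable at 2x, which is why an 8x
factor would mostly add code growth rather than ILP:

```c
/* Illustrative example: four read streams, one write stream,
   unrolled 2x by hand.  */
void sum4(const int *a, const int *b, const int *c, const int *d,
          int *out, int n)
{
    int i = 0;

    /* 2x-unrolled main loop: two copies of the body per iteration.  */
    for (; i + 1 < n; i += 2) {
        out[i]     = a[i]     + b[i]     + c[i]     + d[i];
        out[i + 1] = a[i + 1] + b[i + 1] + c[i + 1] + d[i + 1];
    }

    /* Epilogue for an odd trip count.  */
    for (; i < n; i++)
        out[i] = a[i] + b[i] + c[i] + d[i];
}
```

Note that the unrolled copies access consecutive elements off the same
base pointers, which is also what lets a later IVOpt pass fold the
offsets into base+immediate addressing.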

>> * I expect that there will be some cost-model changes might be needed
>> to handle (or provide ability to handle) various loop preferences of
>> the micro-architectures. I am sending this patch for review early to
>> get feedbacks on this.
>
> Yes it should be feasible to have settings based on backend preference
> and optimization level (so O3/Ofast will unroll more than O2).
>
>> * Position of the pass in passes.def can also be changed. Example,
>> unrolling before SLP.
>
> As long as it runs before IVOpt so we get base+immediate addressing modes.
That's what I am doing now.

Thanks,
Kugan

>
> Wilco
