Hi All,

Based on the previous discussions, I have tried to implement a tree loop unroller for partial unrolling. I would like to queue these RFC patches for the next stage1 review.

In summary:

* The cost model for selecting loops uses the same params that are used elsewhere in related optimizations. I was told that keeping these the same would allow better tuning across all the optimizations.

* I have also implemented an option to limit unrolling based on memory streams, i.e., on micro-architectures where limiting the number of resulting memory streams is preferred, this is used to limit the unrolling factor.

* I have tested this on variants of aarch64 and the results are promising. I am in the process of running benchmarks on x86 and will update the results later.

* I expect that some cost-model changes might be needed to handle (or provide the ability to handle) the loop preferences of various micro-architectures. I am sending this patch for review early to get feedback on this.

* The position of the pass in passes.def can also be changed, for example to unroll before SLP.

* I have bootstrapped and regression tested on aarch64-linux-gnu. There are no execution errors or ICEs. There are some testsuite differences, as expected. A few of them need further evaluation and I am doing that now.

Patches are organized as:

Patch1: Adds a target hook TARGET_HW_MAX_MEM_READ_STREAMS. If it is defined, the loop unroller will try to limit the unrolling factor based on it.

Patch2: Implements the tree loop unroller using the infrastructure provided. The pass itself is very simple.

Patch3: Implements the target hook TARGET_HW_MAX_MEM_READ_STREAMS for aarch64.

Patch4: Implements a machine reorg pass for aarch64/Falkor to handle prefetcher tag collisions. This is not strictly part of the loop unroller, but for Falkor, unrolling can make the h/w prefetcher perform badly if there are too many tag collisions, based on the discussions in https://gcc.gnu.org/ml/gcc/2017-10/msg00178.html.

Thanks,
Kugan
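
P.S. To illustrate the idea behind the memory-stream limit, here is a minimal sketch in plain C. It is not code from the patches: the function name limit_unroll_factor, its parameters, the convention that 0 means "no limit", and the assumption that every unrolled copy of the body adds the same number of read streams are all made up for illustration.

```c
/* Hypothetical sketch, not the code in the patches.

   REQUESTED        unroll factor chosen by the cost model.
   STREAMS_PER_ITER memory read streams in one copy of the loop body.
   MAX_STREAMS      limit reported by the target; 0 is assumed here
                    to mean that the target states no limit.

   Assumes each unrolled copy adds STREAMS_PER_ITER new streams.  */
static unsigned
limit_unroll_factor (unsigned requested, unsigned streams_per_iter,
                     unsigned max_streams)
{
  /* A factor of 1 means the loop is left as it is.  */
  if (requested == 0)
    return 1;

  /* Nothing to limit against.  */
  if (max_streams == 0 || streams_per_iter == 0)
    return requested;

  /* Largest number of body copies that stays within the limit.  */
  unsigned limit = max_streams / streams_per_iter;

  /* The body alone already exceeds the limit: do not unroll.  */
  if (limit < 1)
    limit = 1;

  return requested < limit ? requested : limit;
}
```

With these assumptions, a loop with 2 read streams per iteration and a limit of 4 streams would have a requested factor of 8 reduced to 2, while a loop whose body already needs more streams than the limit would not be unrolled at all.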