Hi All,

Based on the previous discussions, I have implemented a tree loop
unroller for partial unrolling. I would like to queue these RFC patches
for review in the next stage1.

In summary:

* The cost model for selecting loops uses the same params that are
used elsewhere in related optimizations. I was told that keeping these
the same would allow better tuning across all the optimizations.

* I have also implemented an option to limit unrolling based on memory
streams, i.e., for micro-architectures where limiting the resulting
memory streams is preferred, this is used to limit the unrolling factor.

* I have tested this on variants of aarch64 and the results are
promising. I am in the process of running benchmarks on x86. I will
update the results later.

* I expect that some cost-model changes might be needed to handle (or
provide the ability to handle) the various loop preferences of the
micro-architectures. I am sending this patch for review early to get
feedback on this.

* The position of the pass in passes.def can also be changed, for
example to unroll before SLP.

* I have bootstrapped and regression tested on aarch64-linux-gnu.
There are no execution errors or ICEs. There are some testsuite
differences as expected. A few of them need further evaluation, and I
am doing that now.
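
To illustrate the memory-stream limit mentioned above, here is a
minimal sketch. It is not code from the patches: the function name, its
parameters, the power-of-two restriction, and the assumption that each
unrolled copy of the body adds its own set of streams are all
hypothetical and only meant to show the idea.

```c
#include <assert.h>

/* Illustration only; not code from the patch.  All names here are
   hypothetical.  Returns the largest power-of-two unroll factor that
   is <= max_factor and keeps streams_per_iter * factor within
   hw_max_streams.  hw_max_streams == 0 is taken to mean "no limit"
   (an assumption of this sketch).  */
static unsigned
limit_unroll_factor (unsigned max_factor, unsigned streams_per_iter,
                     unsigned hw_max_streams)
{
  unsigned factor = 1;
  /* No limit requested, or the loop has no memory streams: only the
     cost-model bound max_factor applies.  */
  int unlimited = (hw_max_streams == 0 || streams_per_iter == 0);

  assert (max_factor >= 1);

  /* Keep doubling while the next factor fits both bounds.  */
  while (factor * 2 <= max_factor
         && (unlimited
             || streams_per_iter * (factor * 2) <= hw_max_streams))
    factor *= 2;

  return factor;
}
```

For example, with a cost-model bound of 8, two streams per iteration
and a limit of 8 streams, this sketch picks an unrolling factor of 4.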

Patches are organized as:

Patch1: Adds a target hook TARGET_HW_MAX_MEM_READ_STREAMS. If it is
defined, the loop unroller will try to limit the unrolling factor based
on it.

Patch2: Implements the tree loop unroller using the infrastructure
provided. The pass itself is very simple.

Patch3: Implements target hook TARGET_HW_MAX_MEM_READ_STREAMS for aarch64.

Patch4: Implements a machine reorg pass for aarch64/Falkor to handle
prefetcher tag collisions. This is not strictly part of the loop
unroller, but on Falkor unrolling can make the h/w prefetcher perform
badly if there are too many tag collisions, based on the discussions in
https://gcc.gnu.org/ml/gcc/2017-10/msg00178.html.
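
For readers less familiar with the transformation, partial unrolling
by a factor of 4 is roughly equivalent to the source-level rewrite
below. This is only an illustration with made-up function names; the
pass itself works on GIMPLE, and the code it generates may differ.

```c
#include <stddef.h>

/* Original loop: one element per iteration.  */
static long
sum_rolled (const int *a, size_t n)
{
  long s = 0;
  for (size_t i = 0; i < n; i++)
    s += a[i];
  return s;
}

/* Same loop partially unrolled by 4, with an epilogue loop for the
   remaining n % 4 elements.  */
static long
sum_unrolled4 (const int *a, size_t n)
{
  long s = 0;
  size_t i = 0;

  /* Main loop: four copies of the original body per iteration.  */
  for (; i + 4 <= n; i += 4)
    {
      s += a[i];
      s += a[i + 1];
      s += a[i + 2];
      s += a[i + 3];
    }

  /* Epilogue: leftover iterations.  */
  for (; i < n; i++)
    s += a[i];

  return s;
}
```

Both functions compute the same sum; the unrolled version executes
fewer loop branches at the cost of larger code.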

Thanks,
Kugan
