https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to ktkachov from comment #2)
> Created attachment 45386 [details]
> aarch64-llvm output with -Ofast -mcpu=cortex-a57
>
> I'm attaching the full LLVM aarch64 output.
>
> The output you quoted is with -funroll-loops. If that's not given, GCC
> doesn't seem to unroll by default at all (on aarch64 or x86_64 from my
> testing).
>
> Is there anything we can do to make the default unrolling a bit more
> aggressive?

Well, the RTL loop unroller is not enabled by default at any optimization
level (unless you are using FDO). There are also related flags that are not
enabled by default (-fsplit-ivs-in-unroller and
-fvariable-expansion-in-unroller).

The RTL loop unroller is simply not good at estimating the benefit of
unrolling (which is also why you usually see it unrolling --param
max-unroll-times times), and its tunables are not very well tuned across
targets. Micha did quite extensive benchmarking (on x86_64) which shows that
the cases where unrolling is profitable are rare, and the reason is often
hard to understand. That's of course in the context of CPUs having caches of
pre-decoded/fused/etc. instructions optimizing issue, which makes peeled
prologues expensive, as well as even more special caches for small loops
avoiding further frontend costs. Not sure if arm archs have any of this.

I generally don't believe in unrolling as a separately profitable transform.
Rather, unrolling could be done as part of another transform (vectorization
is the best example). For something still done on RTL that would then
include scheduling, which is where the best cost estimates should be
available (and if you do this post-reload then you even have a very good
idea of register pressure). This is also why I think a standalone unrolling
phase belongs on RTL, since I don't see a good way of estimating
cost/benefit on GIMPLE (see how difficult it is to cost vectorization vs.
non-vectorization there).