https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95784
--- Comment #6 from Gabriel Ravier <gabravier at gmail dot com> ---
Created attachment 48761
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48761&action=edit
File for benchmarking this function but everything is aligned properly.

I've changed the source file slightly. It looks like the LLVM version previously came out faster than the "do nothing" version because the loop was misaligned. These are the test results I get with the version with aligned loops (I've also adjusted the number of iterations):

$ gcc test.S -O3 -ggdb3 -DGCC_VERSION && time ./a.out && gcc test.S -O3 -ggdb3 -DLLVM_VERSION && time ./a.out && gcc test.S -O3 -ggdb3 && time ./a.out

real    0m3.130s # GCC version
user    0m3.122s
sys     0m0.001s

real    0m2.599s # LLVM version
user    0m2.593s
sys     0m0.001s

real    0m2.597s # version that does nothing
user    0m2.591s
sys     0m0.000s

With the loops aligned, the LLVM version is almost as fast as literally doing nothing, so it looks much better than the GCC version, at least to me.
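
To put numbers on that comparison, here is a minimal sketch that subtracts the "do nothing" user time from the other two. The figures are copied from the `time` output above. The helper names (`overhead`, `slowdown_percent`) are made up for illustration and are not part of the attachment. This is a single run, so treat the last digit as noise.

```python
# User times in seconds, copied from the `time` output in this comment.
USER_TIMES = {
    "gcc": 3.122,
    "llvm": 2.593,
    "baseline": 2.591,  # the build that does nothing in the loop
}


def overhead(times, name, baseline="baseline"):
    """Return the time `name` spent above the do-nothing baseline, in seconds."""
    return times[name] - times[baseline]


def slowdown_percent(times, slow, fast):
    """Return how much longer `slow` ran than `fast`, as a percentage of `fast`."""
    return 100.0 * (times[slow] - times[fast]) / times[fast]


if __name__ == "__main__":
    for name in ("gcc", "llvm"):
        print(f"{name}: {overhead(USER_TIMES, name):.3f}s over baseline")
    pct = slowdown_percent(USER_TIMES, "gcc", "llvm")
    print(f"gcc vs llvm: {pct:.1f}% slower")
```

On these numbers the GCC version costs roughly half a second over the empty loop, while the LLVM version costs a couple of milliseconds, which is within what one would expect from run-to-run variation.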