https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #23 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to ktkachov from comment #22)
> helps even more. On Cortex-A72 it gives a bit more than 6% (vs 3%)
> improvement on parest, and about 5.3% on a more aggressive CPU.
> I tried unrolling 8x in a similar manner and that was not faster than 4x on
> either target.

The 4x unrolled version has 19 instructions (and micro-ops) vs 7*4 = 28 for
the non-unrolled version, a significant reduction (without LDP it would be
21 vs 28). There is potential to use 2 more LDPs and load+writeback, which
would make it 15 vs 28 instructions, so close to a 2x reduction in
instructions.
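
The instruction counts quoted in the comment can be turned into reduction factors with a little arithmetic. The snippet below is only a sanity check of those numbers, not part of GCC or of the benchmark; the constant and function names are made up for illustration, and the counts are copied from the comment (each covers four iterations of the original loop).

```python
# Instruction counts per 4 original loop iterations, as quoted in the comment.
# All names here are illustrative; nothing below comes from GCC itself.
NON_UNROLLED = 7 * 4        # 7 instructions per iteration, 4 iterations
UNROLLED_WITH_LDP = 19      # 4x unrolled loop as currently generated
UNROLLED_NO_LDP = 21        # 4x unrolled loop if LDP were not used
UNROLLED_PROJECTED = 15     # projection: 2 more LDPs plus load+writeback


def reduction_factor(baseline: int, unrolled: int) -> float:
    """Return how many times fewer instructions the unrolled loop executes."""
    return baseline / unrolled


if __name__ == "__main__":
    cases = [
        ("4x unrolled, with LDP", UNROLLED_WITH_LDP),
        ("4x unrolled, no LDP", UNROLLED_NO_LDP),
        ("4x unrolled, projected", UNROLLED_PROJECTED),
    ]
    for label, count in cases:
        factor = reduction_factor(NON_UNROLLED, count)
        print(f"{label}: {count} vs {NON_UNROLLED} -> {factor:.2f}x fewer")
```

This gives roughly 1.47x for the current unrolled loop, 1.33x without LDP, and 1.87x for the projected 15-instruction version, which is the "close to 2x" figure mentioned above.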