https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #23 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to ktkachov from comment #22)

> helps even more. On Cortex-A72 it gives a bit more than 6% (vs 3%)
> improvement on parest, and about 5.3% on a more aggressive CPU.
> I tried unrolling 8x in a similar manner and that was not faster than 4x on
> either target.

The 4x unrolled version has 19 instructions (and micro-ops) vs 7*4 = 28 for
four iterations of the non-unrolled loop, a significant reduction (without LDP
it would be 21 vs 28). There is potential to use 2 more LDPs and use
load+writeback, which would make it 15 vs 28 instructions, so close to a 2x
reduction in instructions.
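
The ratios behind these counts can be checked with a few lines of Python. This
is only a sketch of the arithmetic: the counts are the ones quoted above, and
the names (BASELINE, variants, reduction) are made up for illustration and do
not come from GCC.

```python
# Instruction counts per four loop iterations, as quoted in the comment.
BASELINE = 7 * 4  # non-unrolled loop: 7 instructions per iteration, 4 iterations

# Hypothetical labels for the three 4x-unrolled variants discussed.
variants = {
    "4x unrolled, no LDP": 21,
    "4x unrolled, with LDP": 19,
    "4x unrolled, 2 more LDPs + writeback (potential)": 15,
}


def reduction(count, baseline=BASELINE):
    """Return (ratio, percent saved) of `count` relative to `baseline`."""
    return baseline / count, 100.0 * (baseline - count) / baseline


for name, count in variants.items():
    ratio, saved = reduction(count)
    print(f"{name}: {count} vs {BASELINE} -> {ratio:.2f}x, {saved:.0f}% fewer")
```

This gives roughly 1.33x, 1.47x and 1.87x fewer instructions respectively, the
last being the "close to 2x" case. It counts instructions only and says
nothing about cycles on a given core.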
