https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79665

wilco at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wilco at gcc dot gnu.org

--- Comment #14 from wilco at gcc dot gnu.org ---
(In reply to PeteVine from comment #13)
> Still, the 5% regression must have happened very recently. The fast gcc was
> built on 20170220 and the slow one yesterday, using the original patch. Once
> again, switching away from Cortex-A53 codegen restores the expected
> performance.

The issue is due to inefficient code generated for unsigned modulo:

        umull   x0, w0, w4
        umull   x1, w1, w4
        lsr     x0, x0, 32
        lsr     x1, x1, 32
        lsr     w0, w0, 6
        lsr     w1, w1, 6

It seems the Cortex-A53 scheduler isn't modelling this correctly. When I
manually remove the redundant shifts I get a 15% speedup. I'll have a look.
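
For readers following along, here is a rough C sketch of what the sequence above computes, under one reading of "redundant shifts": the lsr by 32 followed by the lsr by 6 could be a single 64-bit lsr by 38. The value held in w4 is not shown in the listing, so MAGIC and DIVISOR below are hypothetical (0x51EB851F with a total shift of 38 is one valid reciprocal for dividing a 32-bit unsigned value by 200), and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants: the listing does not show what w4 holds. */
#define MAGIC   0x51EB851Fu
#define DIVISOR 200u

/* Mirrors the listing: widening multiply, then two separate shifts. */
static uint32_t div_two_shifts(uint32_t x)
{
    uint64_t prod = (uint64_t)x * MAGIC;    /* umull x0, w0, w4 */
    uint32_t hi = (uint32_t)(prod >> 32);   /* lsr   x0, x0, 32 */
    return hi >> 6;                         /* lsr   w0, w0, 6  */
}

/* Same quotient with a single 64-bit shift. */
static uint32_t div_one_shift(uint32_t x)
{
    uint64_t prod = (uint64_t)x * MAGIC;
    return (uint32_t)(prod >> 38);          /* lsr   x0, x0, 38 */
}

/* The modulo is then x - q * d (a multiply-subtract in the real code). */
static uint32_t mod_one_shift(uint32_t x)
{
    return x - div_one_shift(x) * DIVISOR;
}

/* Spot-check that both forms agree with plain / and %. */
static void self_check(void)
{
    const uint32_t samples[] = { 0u, 199u, 200u, 1234567u, UINT32_MAX };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        uint32_t x = samples[i];
        assert(div_two_shifts(x) == x / DIVISOR);
        assert(div_one_shift(x) == x / DIVISOR);
        assert(mod_one_shift(x) == x % DIVISOR);
    }
}
```

The two forms always give the same quotient, because the upper 32 bits of the product fit in a w register, so shifting them right by a further 6 is the same as shifting the full 64-bit product right by 38. Whether the merged shift accounts for the speedup measured above is not something this sketch can show.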
