[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

amonakov at gcc dot gnu.org Mon, 21 Sep 2020 03:35:41 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127


Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Richard, though register moves are resolved by renaming, they still occupy a
uop in all stages except execution, and since renaming is one of the narrowest
points in the pipeline (only up to 4 uops/cycle on Intel), reducing the number
of uops generally helps.

In Michael's case the actual memory address has two operands (base and index):

<       vmovapd %ymm1, %ymm10
<       vmovapd %ymm1, %ymm11
<       vfnmadd213pd    (%rdx,%rax), %ymm9, %ymm10
<       vfnmadd213pd    (%rcx,%rax), %ymm7, %ymm11
---
>       vmovupd (%rdx,%rax), %ymm10
>       vmovupd (%rcx,%rax), %ymm11
>       vfnmadd231pd    %ymm1, %ymm9, %ymm10
>       vfnmadd231pd    %ymm1, %ymm7, %ymm11

The "uop" that carries operands of vfnmadd213pd gets "unlaminated" before
renaming (because otherwise there would be too many operands to handle). Hence
the original code has 4 uops after decoding, 6 uops before renaming, and the
transformed code has 4 uops before renaming. Execution handles 4 uops in both
cases.
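
To make the counting explicit, here is a rough sketch of the accounting as a
toy model, not a simulator. The instruction categories and the count_uops
helper are made up for illustration. It assumes that a register-to-register
vmovapd takes a rename slot but no execution uop, and that an FMA with an
indexed memory operand is one fused uop after decoding but two uops at rename
and at execution.

```python
# Toy uop accounting for the two instruction sequences in the diff above.
# Per-instruction costs are assumptions taken from the explanation in this
# comment, not measured values. Each tuple is (decoded, renamed, executed).
COSTS = {
    "reg_move": (1, 1, 0),         # vmovapd %ymm1, %ymm10: eliminated at rename
    "load": (1, 1, 1),             # vmovupd (%rdx,%rax), %ymm10
    "fma_reg": (1, 1, 1),          # vfnmadd231pd %ymm1, %ymm9, %ymm10
    "fma_indexed_mem": (1, 2, 2),  # vfnmadd213pd (%rdx,%rax), ...: unlaminated
}


def count_uops(instructions):
    """Sum the uops seen at each pipeline stage for a list of categories."""
    decoded = renamed = executed = 0
    for kind in instructions:
        d, r, e = COSTS[kind]
        decoded += d
        renamed += r
        executed += e
    return {"decoded": decoded, "renamed": renamed, "executed": executed}


# The "<" lines: two register moves, two FMAs with indexed memory operands.
original = ["reg_move", "reg_move", "fma_indexed_mem", "fma_indexed_mem"]
# The ">" lines: two loads, two register-only FMAs.
transformed = ["load", "load", "fma_reg", "fma_reg"]

print("original:   ", count_uops(original))
print("transformed:", count_uops(transformed))
```

Under these assumptions the two sequences differ only in the rename-stage
count, which is the stage limited to 4 uops/cycle.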

FMA unlamination is mentioned in
https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes

Michael, you can probably measure it for yourself with

   perf stat -e cycles,instructions,uops_retired.all,uops_retired.retire_slots
