https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127
Alexander Monakov <amonakov at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Richard, though register moves are resolved by renaming, they still occupy a uop in all stages except execution, and since renaming is one of the narrowest points in the pipeline (only up to 4 uops/cycle on Intel), reducing the number of uops generally helps.

In Michael's code the actual memory address has two operands (base plus index):

< vmovapd %ymm1, %ymm10
< vmovapd %ymm1, %ymm11
< vfnmadd213pd (%rdx,%rax), %ymm9, %ymm10
< vfnmadd213pd (%rcx,%rax), %ymm7, %ymm11
---
> vmovupd (%rdx,%rax), %ymm10
> vmovupd (%rcx,%rax), %ymm11
> vfnmadd231pd %ymm1, %ymm9, %ymm10
> vfnmadd231pd %ymm1, %ymm7, %ymm11

The "uop" that carries operands of vfnmadd213pd gets "unlaminated" before renaming (because otherwise there would be too many operands to handle). Hence the original code has 4 uops after decoding and 6 uops before renaming, while the transformed code has 4 uops before renaming. Execution handles 4 uops in both cases.

FMA unlamination is mentioned in
https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes

Michael, you can probably measure it for yourself with

    perf stat -e cycles,instructions,uops_retired.all,uops_retired.retire_slots
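
The uop arithmetic in the comment (4 after decoding, 6 before renaming, 4 executed for the original; 4 throughout for the transformed code) can be tallied with a rough sketch like the one below. The per-instruction costs are a simplified model read off the comment, not measured data, and the names `UopCost`, `total`, and the four constants are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UopCost:
    decoded: int   # uops leaving the decoders (micro-fused count)
    renamed: int   # uops occupying rename slots, after any unlamination
    executed: int  # uops that need an execution port


# Assumed costs, following the comment above (a simplification).
REG_MOVE = UopCost(1, 1, 0)         # resolved at rename, never executes
LOAD = UopCost(1, 1, 1)             # vmovupd from memory
FMA_REG = UopCost(1, 1, 1)          # FMA with register operands only
FMA_INDEXED_MEM = UopCost(1, 2, 2)  # unlaminated into load + FMA before rename


def total(sequence):
    """Sum the uop counts of a sequence at each of the three stages."""
    return (
        sum(cost.decoded for cost in sequence),
        sum(cost.renamed for cost in sequence),
        sum(cost.executed for cost in sequence),
    )


# "<" side of the diff: two register moves, two FMAs with (base,index) operands.
original = [REG_MOVE, REG_MOVE, FMA_INDEXED_MEM, FMA_INDEXED_MEM]
# ">" side of the diff: two loads, two register-only FMAs.
transformed = [LOAD, LOAD, FMA_REG, FMA_REG]

print("original    (decoded, renamed, executed):", total(original))
print("transformed (decoded, renamed, executed):", total(transformed))
```

Under these assumptions the only stage where the two sequences differ is renaming (6 versus 4), which is the bottleneck the comment points at.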