https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105513

--- Comment #7 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
The second sequence is 3 uops vs. 1/2 (issued/executed) uops in the first, and
on Haswell and Skylake it ties up port 5 for two cycles.

Unclear if you're microbenchmarking latency or throughput, but in any case on
Haswell and Skylake you should see a close to 2x difference.
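
To illustrate the latency/throughput distinction, here is a minimal sketch of
the two loop shapes. It does not use the sequences from this report: op() is a
hypothetical stand-in (a multiply-add), and the function names are made up for
the example. A latency test feeds each result into the next operation, while a
throughput test runs several independent chains that the core can overlap.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the instruction sequence under test. */
static inline uint64_t op(uint64_t x)
{
    return x * 0x9E3779B97F4A7C15ull + 1;
}

/* Latency: every op depends on the previous result (one serial chain). */
uint64_t latency_chain(uint64_t seed, long n)
{
    uint64_t x = seed;
    for (long i = 0; i < n; i++)
        x = op(x);
    return x;
}

/* Throughput: four independent chains, so ops from different chains
   can execute in parallel. */
uint64_t throughput_chains(uint64_t seed, long n)
{
    uint64_t a = seed, b = seed + 1, c = seed + 2, d = seed + 3;
    for (long i = 0; i < n; i++) {
        a = op(a);
        b = op(b);
        c = op(c);
        d = op(d);
    }
    return a ^ b ^ c ^ d;
}

/* CPU time per op in nanoseconds; the result is stored through sink so
   the compiler cannot discard the loop. */
double ns_per_op(uint64_t (*fn)(uint64_t, long), long n, int ops_per_iter,
                 uint64_t *sink)
{
    clock_t start = clock();
    *sink = fn(1, n);
    clock_t end = clock();
    double ns = (double)(end - start) * 1e9 / CLOCKS_PER_SEC;
    return ns / ((double)n * ops_per_iter);
}
```

Calling ns_per_op(latency_chain, n, 1, &sink) and
ns_per_op(throughput_chains, n, 4, &sink) with a large n gives one number for
each regime. To compare the two sequences from this report, op() would have to
be replaced by each of them in turn (e.g. via intrinsics or inline asm), and
the generated code checked to confirm the compiler kept the intended
instructions.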
