[Bug rtl-optimization/111143] [missed optimization] unlikely code slows down diffutils x86-64 ASCII processing

amonakov at gcc dot gnu.org via Gcc-bugs Sat, 26 Aug 2023 01:09:59 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143


--- Comment #6 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Thanks.

i5-1335U has two "performance cores" (with HT, four logical CPUs) and eight
"efficiency cores". They have different micro-architecture. Are you binding the
benchmark to some core in particular?

On the "performance cores", 'add rbx, 1' can be eliminated ("executed" with
zero latency), this optimization appeared in the Alder Lake generation with the
"Golden Cove" uarch and was found by Andreas Abel. There are limitations (e.g.
it works for 64-bit additions but not 32-bit, the addend must be an immediate
less than 1024).

Of course, it is better to have 'add rbx, 1' instead of 'add rbx, rax' in this
loop on any CPU ('mov eax, 1' competes for ALU ports with other instructions,
so when it's delayed due to contention the dependent 'add rbx, rax; movsx rax,
[rbx]' get delayed too), but ascribing the difference to compiler scheduling on
a CPU that does out-of-order dynamic scheduling is strange.

[Bug rtl-optimization/111143] [missed optimization] unlikely code slows down diffutils x86-64 ASCII processing

Reply via email to