https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143
--- Comment #6 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
Thanks. The i5-1335U has two "performance cores" (with HT, four logical CPUs) and eight "efficiency cores", and the two kinds have different microarchitectures. Are you binding the benchmark to some core in particular?

On the "performance cores", 'add rbx, 1' can be eliminated ("executed" with zero latency). This optimization appeared in the Alder Lake generation with the "Golden Cove" uarch and was found by Andreas Abel. There are limitations (e.g. it works for 64-bit additions but not 32-bit, and the addend must be an immediate less than 1024).

Of course, it is better to have 'add rbx, 1' instead of 'add rbx, rax' in this loop on any CPU: 'mov eax, 1' competes for ALU ports with other instructions, so when it is delayed due to contention, the dependent 'add rbx, rax; movsx rax, [rbx]' get delayed too. But ascribing the difference to compiler scheduling on a CPU that does out-of-order dynamic scheduling is strange.
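
For reference, a minimal sketch of what binding the benchmark to one logical CPU could look like on Linux, so that all timings come from the same kind of core. It assumes glibc's sched_setaffinity() and sched_getcpu(); pin_to_cpu() is a hypothetical helper name, not something from the benchmark under discussion. Which logical CPU numbers belong to the performance cores is not assumed here and has to be checked on the machine itself.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* needed for cpu_set_t, CPU_SET and sched_getcpu in glibc */
#endif
#include <sched.h>

/*
 * Hypothetical helper: restrict the calling thread to a single logical CPU.
 * Returns 0 on success and -1 on failure (out-of-range CPU number, or the
 * kernel rejecting the mask, e.g. because the CPU is offline or not allowed).
 */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return -1;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    /* pid 0 means "the calling thread" */
    return sched_setaffinity(0, sizeof(set), &set);
}
```

Calling pin_to_cpu() once at the start of main(), before the timed loop, should be enough for a single-threaded benchmark; sched_getcpu() can then be used to confirm where the thread ended up. Running the unmodified binary under 'taskset -c <cpu>' achieves the same effect without code changes.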