http://gcc.gnu.org/bugzilla/show_bug.cgi?id=47949
--- Comment #3 from Steven Fuerst <svfuerst at gmail dot com> 2011-03-02 21:51:12 UTC ---
Having a quick look at generated code... it appears that this pattern doesn't
come up all that often. However, there is one case where it does: the epilogue
of a function.

i.e. gcc tends to generate code looking like:

    movl %ebp, %eax
    movq 8(%rsp), %rbx
    movq 16(%rsp), %rbp
    movq 24(%rsp), %r12
    movq 32(%rsp), %r13
    addq $40, %rsp
    ret

Replacing the move to %eax with an exchange with %ebp is a win in this
particular case. The extra cycle or two of latency that xchg takes doesn't
matter as the other moves and ret instruction overlap in execution with it.

Benchmarking on an opteron in 64bit mode confirms this hypothesis even in the
degenerate case where no other moves exist:

    foo1:
        mov %edi, %eax
        retq

    foo2:
        xchg %eax, %edi
        retq

foo1 and foo2 take the same time to execute.
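
For anyone wanting to repeat the comparison, a minimal C harness might look like the following. It is a sketch, not the benchmark used above: the names copy_mov, copy_xchg and time_copy are hypothetical, GCC-style inline asm on x86-64 is assumed (with plain C stand-ins elsewhere), and the compiler picks the registers, so it will not necessarily use %edi and %eax as foo1 and foo2 do.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__GNUC__) && defined(__x86_64__)
/* Copy x into the result with a plain register move (the foo1 variant). */
static unsigned copy_mov(unsigned x)
{
    unsigned r;
    __asm__ volatile ("movl %1, %0" : "=r"(r) : "r"(x));
    return r;
}

/* Copy x into the result by swapping two registers (the foo2 variant).
   The old contents of r end up in x and are simply discarded. */
static unsigned copy_xchg(unsigned x)
{
    unsigned r = 0;
    __asm__ volatile ("xchgl %1, %0" : "+r"(r), "+r"(x));
    return r;
}
#else
/* Assumption: portable stand-ins so the harness still builds elsewhere. */
static unsigned copy_mov(unsigned x)  { return x; }
static unsigned copy_xchg(unsigned x) { return x; }
#endif

/* Call fn `iters` times and return the elapsed CPU time in seconds.
   The running sum is handed back through *sum so the calls stay observable. */
static double time_copy(unsigned (*fn)(unsigned), unsigned iters, uint64_t *sum)
{
    uint64_t acc = 0;
    clock_t start = clock();
    for (unsigned i = 0; i < iters; i++)
        acc += fn(i);
    clock_t end = clock();

    assert(sum != 0);
    *sum = acc;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_copy(copy_mov, n, &sum) and time_copy(copy_xchg, n, &sum) with the same large n gives a rough comparison. The indirect call and loop overhead are much larger than the instruction under test, so this can only show whether a gross difference exists; it cannot resolve a cycle or two of latency.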