https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93721
            Bug ID: 93721
           Summary: swapping adjacent scalars could be more efficient
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: msebor at gcc dot gnu.org
  Target Milestone: ---

For an implementation of a swap function like this:

  template <class T>
  void swap (std::pair<T, T> &x)
  {
    T t = x.first;
    x.first = x.second;
    x.second = t;
  }

GCC for x86 emits the ROL instruction for T=char:

  _Z4swapIcEvRSt4pairIT_S1_E:
  .LFB97:
          .cfi_startproc
          rolw    $8, (%rdi)
          ret
          .cfi_endproc

but a series of MOV instructions for T=short and T=int:

  _Z4swapIiEvRSt4pairIT_S1_E:
  .LFB97:
          .cfi_startproc
          movl    (%rdi), %eax
          movl    4(%rdi), %edx
          movl    %eax, 4(%rdi)
          movl    %edx, (%rdi)
          ret
          .cfi_endproc

A hand-coded (but convoluted) implementation of the function like below lets
GCC for x86_64 emit the ROL instruction for both int and short:

  void swap (std::pair<int, int> &x)
  {
    int y[2], t;
    static_assert (sizeof x == sizeof y);
    __builtin_memcpy (y, &x, sizeof x);
    t = y[0];
    y[0] = y[1];
    y[1] = t;
    __builtin_memcpy (&x, y, sizeof x);
  }

  _ZL4swapRSt4pairIiiE:
  .LFB94:
          .cfi_startproc
          rolq    $32, (%rdi)
          ret
          .cfi_endproc

Benchmarking it shows that the ROL form is measurably faster (at least on my
machine) than the MOV form.
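
For reference, the equivalence that the ROL form relies on can be sketched
directly: swapping two adjacent 32-bit members is the same as rotating the
combined 64-bit object representation by 32 bits.  The helper names below
(swap_members, swap_by_rotate, check_swap_forms) are hypothetical and not part
of the report.  The sketch assumes a 32-bit int and a std::pair<int, int>
without padding, and, like the memcpy version above, it assumes that copying
the pair's bytes is acceptable.  It only shows that the two forms produce the
same result; whether a given GCC version emits a single ROL for either one is
not something the sketch demonstrates.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <utility>

// Member-wise swap, written the same way as the template in the report.
template <class T>
void swap_members (std::pair<T, T> &x)
{
  T t = x.first;
  x.first = x.second;
  x.second = t;
}

// Hypothetical helper: swap the two ints by rotating the 64-bit object
// representation of the pair by 32 bits.  A rotation by half the width
// exchanges the two halves regardless of byte order.
void swap_by_rotate (std::pair<int, int> &x)
{
  static_assert (sizeof (int) == 4, "sketch assumes a 32-bit int");
  static_assert (sizeof x == sizeof (std::uint64_t),
                 "sketch assumes pair<int, int> has no padding");

  std::uint64_t bits;
  std::memcpy (&bits, &x, sizeof x);
  bits = (bits << 32) | (bits >> 32);   // rotate by 32 bits
  // The cast to void * avoids GCC's -Wclass-memaccess warning.
  std::memcpy (static_cast<void *> (&x), &bits, sizeof x);
}

// Check that both forms agree for one pair of values.
void check_swap_forms (int a, int b)
{
  std::pair<int, int> p (a, b), q (a, b);
  swap_members (p);
  swap_by_rotate (q);
  assert (p == q);
  assert (p.first == b && p.second == a);
}
```

Calling check_swap_forms (1, 2) or check_swap_forms (-7, 0x12345678) exercises
both forms on the same input and asserts that they agree.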