https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82668
            Bug ID: 82668
           Summary: could use BMI2 rorx for unpacking struct { int a,b };
                    from a register (SysV ABI)
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*

struct twoint {
        int a, b;
};

int bar(struct twoint s) {
        return s.a + s.b;
}

https://godbolt.org/g/4ygAMm

        movq    %rdi, %rax
        sarq    $32, %rax
        addl    %edi, %eax
        ret

But we could have used

        rorx    $32, %rdi, %rax    # 1 uop 1c latency
        addl    %edi, %eax
        ret

rorxq is only 1 uop, vs. 2 for mov + sar. It also saves a byte: a 3-byte MOV + a 4-byte SAR (7 bytes total) vs. a 6-byte rorx.

Without BMI2, we can shorten the critical path from 3 to 2 cycles if mov isn't zero latency (and save a byte by dropping the REX prefix on the mov):

        movl    %edi, %eax
        sarq    $32, %rdi
        addl    %edi, %eax
        ret

This would be a better choice in general, especially for tune=generic.

Also related (let me know if I should report separately, or if gcc knowing how to use rotate to swap struct members would fix this too):

// only needs one call-preserved reg and a rotate.
long foo(int a /* edi */, int b /* esi */)
{
        struct_arg ( (struct twoint){a,b});
        struct_arg ( (struct twoint){b,a});
        return 0;
}

gcc saves two call-preserved registers so it can save a and b separately, and shift+OR them together each time.

        pushq   %rbp
        movl    %edi, %ebp
        pushq   %rbx
        movl    %esi, %ebx
        movq    %rbx, %rdi
        salq    $32, %rdi
        subq    $8, %rsp
        orq     %rbp, %rdi
        call    struct_arg
        movq    %rbp, %rdi
        salq    $32, %rdi
        orq     %rbx, %rdi
        call    struct_arg
        addq    $8, %rsp
        xorl    %eax, %eax
        popq    %rbx
        popq    %rbp
        ret

This is sub-optimal in two ways. First, on Intel SnB-family (but not Silvermont or any AMD), SHRD is efficient (1 uop, 1c latency, runs on port1 only instead of p06 for other shifts/rotates), so SHL + SHRD may be better than mov + shl + or. Second, instead of redoing the creation of the struct, we can rotate the first one.
Even writing it as a swap of the members of a struct (instead of creation of a new struct) doesn't help.

Anyway, I think this would be better:

        pushq   %rbx
        shl     $32, %rdi
        shrd    $32, %rsi, %rdi    # SnB-family alternative to mov+shl+or
        rorx    $32, %rdi, %rbx    # arg for 2nd call
        call    struct_arg
        movq    %rbx, %rdi
        call    struct_arg
        xorl    %eax, %eax
        popq    %rbx
        ret

I didn't check whether I got the correct arg as the high half, but that's not the point.
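
For reference, the register-level identity this report relies on can be sketched in portable C. This is only an illustration, not what gcc emits: the helper names (pack_twoint, unpack_twoint, rotate32, sum_halves, self_check) are made up for this example, and it assumes the x86-64 SysV layout where a sits in bits 0..31 of the register and b in bits 32..63. It shows that rotating the 64-bit image by 32 swaps the two members, and that the sum needs only the low 32 bits of the original and of the rotated value.

```c
#include <assert.h>
#include <stdint.h>

struct twoint { int a, b; };

/* Hypothetical helper: build the 64-bit register image of a struct twoint
   as assumed for x86-64 SysV (a in bits 0..31, b in bits 32..63). */
static uint64_t pack_twoint(struct twoint s) {
    return (uint64_t)(uint32_t)s.a | ((uint64_t)(uint32_t)s.b << 32);
}

/* Hypothetical helper: split a register image back into its members.
   The unsigned-to-int conversion is implementation-defined in ISO C for
   values above INT_MAX; gcc and clang define it as modular wraparound. */
static struct twoint unpack_twoint(uint64_t r) {
    struct twoint s;
    s.a = (int)(uint32_t)r;
    s.b = (int)(uint32_t)(r >> 32);
    return s;
}

/* Rotate by 32: the operation rorx $32 (or any 64-bit rotate) performs.
   It exchanges the low and high halves, i.e. swaps a and b. */
static uint64_t rotate32(uint64_t r) {
    return (r >> 32) | (r << 32);
}

/* s.a + s.b computed the way the rorx + addl sequence would: add the low
   32 bits of the original image to the low 32 bits of the rotated image. */
static int sum_halves(uint64_t r) {
    uint64_t swapped = rotate32(r);
    return (int)((uint32_t)r + (uint32_t)swapped);
}

/* Small self-check tying the pieces together. */
static void self_check(void) {
    uint64_t r = pack_twoint((struct twoint){1, 2});
    assert(r == 0x0000000200000001ULL);
    assert(rotate32(r) == pack_twoint((struct twoint){2, 1}));
    assert(sum_halves(r) == 3);
}
```

Calling self_check() from a test driver exercises all three properties. Whether gcc recognizes rotate32 as a rotate and whether it then picks rorx depends on the version and on -mbmi2, which is the subject of this report.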